As we've written about before, we've built Apigee to be highly available. However, our architecture is still vulnerable at the data center level. Last night Amazon Web Services had a power outage in one Availability Zone of its US East region that, by sheer chance, took down Apigee's entire ServiceNet cluster (ServiceNet is Sonoa's enterprise-grade API management solution). This affected API traffic for all Apigee users.
- At 2:23 AM PDT a power distribution unit malfunctioned, which brought down our instances at 2:32.
- By 2:36, our engineering team was already scrambling to determine the extent of the problem, which took about 20 minutes because of the scale of the failure.
- At 3:04 a workaround was deployed and Apigee was back online.
In the end, thanks to quick work by the engineering team, Apigee was down for less time than the AWS outage lasted. Before this, Apigee's only downtime had been 5 minutes (also in this past week) during a planned server migration, and we have since updated our scripts to avoid that in the future.
This kind of failure is something we've known about and have already designed a solution for; however, it's going to require significant engineering work to implement. Our plan is to stripe our service across data centers. We expect to have this solution in place in the next 4-6 months. When that happens, any outage will fail over to the redundant data center, which should result in downtime of less than a minute. Keep in mind that Amazon Web Services only guarantees 99.95% uptime per region, so we cannot exceed that figure while we run in a single region.
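To put the 99.95% figure in perspective, here is a rough back-of-the-envelope sketch of how much downtime that guarantee still permits per year. The function names are ours, purely for illustration. The second function assumes that failures in separate data centers are fully independent and that failover is instantaneous, which is an idealization and not something AWS guarantees.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def allowed_downtime_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by an availability guarantee over a period."""
    return (1.0 - availability) * period_minutes


def combined_availability(availability, locations):
    """Availability of N redundant locations.

    ASSUMPTION: failures are independent and failover is instantaneous,
    so the service is down only when every location is down at once.
    """
    return 1.0 - (1.0 - availability) ** locations


single = 0.9995  # the 99.95% guarantee mentioned above
print(f"single location: {allowed_downtime_minutes(single):.1f} min/year")

dual = combined_availability(single, 2)
print(f"two locations:   {allowed_downtime_minutes(dual):.2f} min/year")
```

Under these assumptions a single location is allowed roughly 4.4 hours of downtime a year, while two redundant locations would bring the theoretical budget down to well under a minute. Real-world numbers will be worse, since failures are rarely fully independent and failover takes time.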
Providing bulletproof service is of critical importance to our users. If you have any questions, concerns, or suggestions about availability, please let us know in the comments or on http://support.apigee.com. We apologize for any trouble this has caused.
Addendum: We also discovered that we weren't properly monitoring the service that handles our SSL traffic, which resulted in a longer downtime of roughly four hours for that traffic. Thanks to Jonathan, who reported the outage in the support forum. This was an oversight, and the service has since been added to our Nagios monitoring system.
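For readers curious what such a check can look like, here is a minimal sketch of a Nagios-style plugin that attempts a TLS handshake and reports the result. This is not our actual check: the `check_tls` name, the default port, and the timeout are illustrative assumptions. The only convention relied on is the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL).

```python
import socket
import ssl
import sys

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_tls(host, port=443, timeout=5.0):
    """Attempt a TLS handshake and return (exit_code, message).

    Hypothetical helper for illustration; host, port and timeout
    are assumptions, not values from our production setup.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                version = tls.version()
        return OK, f"OK - TLS handshake with {host}:{port} succeeded ({version})"
    except OSError as exc:
        # Covers refused connections, timeouts, and ssl.SSLError
        # (including certificate verification failures).
        return CRITICAL, f"CRITICAL - TLS check of {host}:{port} failed: {exc}"


# Run as a plugin only when a host is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    code, message = check_tls(sys.argv[1])
    print(message)
    sys.exit(code)
```

Nagios reads the first line of output as the status text and the exit code as the service state, so a refused connection or a failed handshake would show up as CRITICAL instead of going unnoticed.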
Addendum 2: We also have plans to provide a status page for Apigee soon. We'll post details as we get closer.
Addendum 3: At 5:39 PDT today AWS reported a second outage affecting the very same boxes (our systems registered the outage at 5:32). We scrambled to recover and had full service back online by 5:52. All systems are currently working properly.