We've built Apigee as a highly available (HA) service because Apigee's users depend on us to deliver their traffic. More than anything, we have to ensure that our services and proxies are as transparent and as light as possible. Much of the magic behind our proxying technologies comes from our cluster of Sonoa's ServiceNet boxes, which efficiently balance the proxy load and, in worst-case scenarios, fail over to other parts of the cluster. And to test the responsiveness of our systems and people, Brian, Apigee's GM, sometimes conducts surprise fire drills in the middle of the night.
In mid-December, we noticed some strange patterns in our daily traffic reports, and we set out to investigate. The ServiceNet proxies are highly available, but other parts of the system, though important, are somewhat less mission-critical. Eventually we discovered that one customer's database tables were degrading the performance of our analytics server to the point that it could not keep up with all the traffic statistics. As a result, our reports showed a big dip in traffic, even though no proxies had been directly affected. This may have affected other accounts as well, so if you noticed any unexpected fluctuation in your Apigee analytics in mid-December, it may have been related to this.
So what have we done to address this? First, we scrambled to work with the customer who had the massive table in our database. That traffic comes from an iPhone app, which means that lots of dynamic IP addresses were blowing the table out of proportion. We worked directly with them to find a better way to identify the traffic, and this fixed things for everyone. It also taught us more about the needs and use-cases of our developers.
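To illustrate why dynamic IP addresses can inflate a table like this, here is a minimal sketch. It is not our actual schema or code: the `simulate_requests` and `count_rows` helpers and the `"iphone-app"` identifier are hypothetical, and the random IPs simply stand in for mobile clients whose addresses keep changing. It compares how many distinct rows you end up with when the same traffic is keyed by client IP versus by a stable identifier.

```python
import random
from collections import Counter


def simulate_requests(num_requests, seed=0):
    """Return (client_ip, app_id) pairs for one hypothetical mobile app."""
    rng = random.Random(seed)
    requests = []
    for _ in range(num_requests):
        # Assumption: mobile clients get dynamic addresses, so almost
        # every request can arrive from a different IP.
        ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
        requests.append((ip, "iphone-app"))  # hypothetical stable identifier
    return requests


def count_rows(requests, key_index):
    """Aggregate request counts by the chosen key: one row per distinct key."""
    return Counter(request[key_index] for request in requests)


requests = simulate_requests(10_000)
rows_by_ip = count_rows(requests, 0)   # keyed by client IP
rows_by_app = count_rows(requests, 1)  # keyed by the stable identifier

print("rows when keyed by IP: ", len(rows_by_ip))
print("rows when keyed by app:", len(rows_by_app))
```

Keyed by IP, the row count grows roughly with the number of requests. Keyed by the stable identifier, the same traffic collapses to a single row, and the totals are identical either way.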
Even more importantly, we've identified a weakness in our architecture, and we're making engineering changes to address it. First, we're building a redundant system to make the handoff from the proxy cluster to the analytics server more robust (and, in effect, highly available). Second, we're planning to distribute the database queries so that they better serve analytics report generation. These changes will start to roll out over the next month or so.
The good news is that we're growing and getting better all the time. If you have any questions or comments about our response, or about how you'd prefer us to handle these situations in the future, please comment or post in our support forum. We're listening, and we really want to hear from you.