Platform API Outages on July 31st and August 1st
On July 31st and August 1st, our Platform API experienced two significant outages that impacted the availability of our services.
What Happened?
Both incidents were classified as SEV-1, reflecting their critical nature: they disrupted access to several essential services, including our dashboard, RPC, storage, in-app wallets, and other functionality.
Timeline of Events
Incident 1: July 31st
- 10:02 AM PT: The first incident began when our Platform API hit its database connection pool limits; slow database queries were holding connections open, exhausting the pool.
- 10:10 AM PT: The issue was acknowledged by our team.
- 10:12 AM PT: We restarted the Platform API, which temporarily alleviated the issue.
- 10:27 AM PT: We identified a slow query and initiated a fix by adding a missing database index (a sketch of this kind of fix follows the timeline).
- 11:05 AM PT: The incident was resolved with no further impact detected.
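To illustrate the shape of this fix, here is a minimal sketch of an index migration, using Knex as an example query builder. The table, columns, and index name are placeholders, not the actual schema or tooling involved; the point is that the slow query gets an index matching its filter and sort so it stops scanning the table and holding connections open.

```typescript
import { Knex } from "knex";

// Hypothetical migration: "usage_events" and its columns are placeholders.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable("usage_events", (table) => {
    // Composite index covering the columns the slow query filters and sorts on.
    table.index(["account_id", "created_at"], "usage_events_account_created_idx");
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable("usage_events", (table) => {
    table.dropIndex(["account_id", "created_at"], "usage_events_account_created_idx");
  });
}
```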
Incident 2: August 1st
- 9:36 AM PT: The second incident started, mirroring the previous day's issue.
- 9:45 AM PT: The incident was acknowledged.
- 9:55 AM PT: We identified a missing cache in front of an expensive database operation, which was allowing repeated queries to overload the database.
- 10:45 AM PT: Our team increased read replica capacity and added caching to reduce database load, restoring services in a degraded state.
- 11:45 AM PT: We traced the excess load to excessive requests from an external source. Blocking that source significantly reduced server load, and services were fully restored.
- 12:45 PM PT: The incident was fully resolved with no further impact detected.
Root Cause Analysis
The outages were primarily caused by slow database queries combined with inadequate caching in front of expensive database operations: because certain responses were not cached, every request repeated the same expensive queries, straining the database and the Platform API. An unexpected surge in traffic, in particular a high volume of (seemingly non-malicious) requests from a single Amazon EC2 IP address, exacerbated the situation.
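As an illustration of the caching gap, here is a minimal sketch of a read-through cache with a short TTL placed in front of an expensive query. The function names, cache key, and TTL are hypothetical; the intent is simply that a burst of identical requests is served from memory instead of re-running the same database query each time.

```typescript
// Minimal read-through cache sketch (hypothetical names and TTL).
type CacheEntry<T> = { value: T; expiresAt: number };

const cache = new Map<string, CacheEntry<unknown>>();

async function cached<T>(
  key: string,
  ttlMs: number,
  fetcher: () => Promise<T>
): Promise<T> {
  const hit = cache.get(key) as CacheEntry<T> | undefined;
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value; // cache hit: no database round trip
  }
  const value = await fetcher(); // cache miss: run the expensive query once
  cache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

// Usage (hypothetical): wrap the expensive lookup so repeated identical
// requests result in a single database query per TTL window.
// const keyMetadata = await cached(`api-key:${clientId}`, 30_000, () =>
//   db.query("SELECT ... FROM api_keys WHERE client_id = $1", [clientId])
// );
```

In practice a shared cache such as Redis, plus request coalescing, would likely be used so that concurrent misses don't still stampede the database; the sketch only shows the shape of the fix.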
Impact on Our Customers
During these outages, our customers were unable to access core services. At times, 50% of requests to the Platform API returned errors, severely disrupting normal operations. We understand that a degradation of this scale has a serious impact on our customers, and we deeply regret the frustration it caused.
What We’re Doing to Prevent Future Incidents
To prevent such incidents from occurring again, we've taken several steps:
- Caching: We've implemented and validated caching mechanisms to reduce the load on our database.
- Alerting: New alerting protocols are in place to detect anomalies before they escalate into large-scale outages.
- Reducing Blast Radius: Critical services will no longer be mixed with less critical ones, minimizing the risk of widespread impact. The effort to split out existing combined services is already underway and will roll out over the next 3-5 days.
- Traffic Management: We've improved our ability to manage and mitigate spammy or malicious traffic (a sketch of a simple per-source rate limiter follows this list).
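As a rough illustration of the traffic-management piece, here is a sketch of a per-IP fixed-window rate limiter written as Express-style middleware. The thresholds, window size, and placement are hypothetical; our production mechanism is not described in this report and likely sits at the edge rather than in the application.

```typescript
import type { Request, Response, NextFunction } from "express";

// Hypothetical per-IP rate limiter (fixed-window counter).
const WINDOW_MS = 60_000;  // 1-minute window (illustrative)
const MAX_REQUESTS = 600;  // per IP per window (illustrative)

const counters = new Map<string, { count: number; windowStart: number }>();

export function rateLimit(req: Request, res: Response, next: NextFunction): void {
  const ip = req.ip ?? "unknown";
  const now = Date.now();
  const entry = counters.get(ip);

  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    counters.set(ip, { count: 1, windowStart: now });
    return next();
  }

  entry.count += 1;
  if (entry.count > MAX_REQUESTS) {
    // Shed excessive traffic from a single source instead of letting it
    // exhaust database connections for everyone else.
    res.status(429).send("Too Many Requests");
    return;
  }
  next();
}
```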
Moving Forward
We understand the importance of reliability and are committed to learning from these incidents to provide a more robust service. Our team is working on further improvements, and we'll continue to keep you updated.