Early January Outage Post Mortem

Summary

On January 4th a large, sustained increase in legitimate user traffic overwhelmed an internal database, resulting in hours of cumulative downtime. Subsequent hardware upgrades, software optimizations and improved monitoring have since restored full operation and better prepared us for future surges in traffic.

Traffic Spike

Load during two outages, one rollover and a final graceful upgrade

Timeline

January 4th

  • 10:15 - Response times increase and on-call engineers are paged.
  • 11:00 - On-call engineers begin dropping a subset of traffic to reduce queuing and restore normal operations on the affected primary database (a rough load-shedding sketch follows this timeline).
  • 12:15 - Now under normal load, the primary database is failed over to warm secondaries.
  • 13:00 - New database indices are built to optimize query performance and prepare to restore normal operations.
  • 13:15 - Normal operations are restored and the team continues to optimize usage.
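
The post doesn’t specify the mechanism used to drop traffic. Below is a minimal sketch of the general idea, shedding a fraction of low-priority requests while the database queue is saturated; the thresholds, field names and queue-depth signal are hypothetical, not Coinbase’s actual implementation.

```python
import random

# Hypothetical thresholds -- illustrative only, not Coinbase's actual values.
QUEUE_DEPTH_LIMIT = 500   # pending database operations before shedding begins
SHED_FRACTION = 0.25      # fraction of low-priority requests to reject

def should_shed(request, current_queue_depth):
    """Decide whether to reject a request to relieve database queuing.

    Authenticated traffic is always served; a fixed fraction of anonymous
    traffic is turned away while the queue remains saturated.
    """
    if current_queue_depth < QUEUE_DEPTH_LIMIT:
        return False
    if request.get("authenticated"):
        return False
    return random.random() < SHED_FRACTION

# Example: an anonymous request arriving while 800 operations are queued.
print(should_shed({"authenticated": False}, current_queue_depth=800))
```

In practice a check like this would live in a load balancer or request middleware, ahead of the database, so rejected requests never add to the queue.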

January 5th - morning

  • 05:15 - Response times increase and on-call engineers are paged by newly introduced early monitoring tools (a rough sketch of this kind of early alerting follows this timeline).
  • 05:45 - Service is gracefully degraded to prevent a full outage for authenticated users.
  • 06:00 - Response times begin to return to normal levels.
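
The early monitoring tools that paged at 05:15 aren’t described in the post. As a rough illustration of the kind of check involved, the sketch below pages the on-call when a rolling p95 response time crosses a threshold; the window size and threshold are hypothetical.

```python
from collections import deque

# Hypothetical alerting parameters -- illustrative only.
WINDOW_SIZE = 300     # most recent latency samples, e.g. one per second
P95_ALERT_MS = 750    # page on-call when p95 latency exceeds this

class LatencyMonitor:
    """Track recent response times and flag latency degradation early."""

    def __init__(self):
        self.samples = deque(maxlen=WINDOW_SIZE)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def should_page(self):
        if len(self.samples) < WINDOW_SIZE:
            return False  # not enough data yet
        ordered = sorted(self.samples)
        p95 = ordered[int(0.95 * len(ordered)) - 1]
        return p95 > P95_ALERT_MS

monitor = LatencyMonitor()
for latency in [120] * 280 + [900] * 20:  # simulated spike in the tail latencies
    monitor.record(latency)
print(monitor.should_page())  # True: page before the spike becomes an outage
```

Alerting on tail latency rather than the average is what buys the early warning, since queuing problems show up in the tail before they move the median.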

January 5th - evening

  • 18:45 - Service is again gracefully degraded to minimize load during a scheduled upgrade to larger secondaries.
  • 19:00 - Secondaries are not sufficiently warmed and the rollover is eventually aborted.
  • 20:30 - Normal operations are restored and we begin working on a more efficient rollover procedure.

January 6th

  • 10:00 - Our initial plan to warm new hardware by streaming data from a primary to a secondary is projected to take ~72 hours. We decide to shortcut this process by copying data directly from a high-PIOPS EBS snapshot to a locally attached SSD on the new box, bringing the warming time down to 15 hours (a rough back-of-envelope comparison follows).
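
The ~72 hour and 15 hour figures come from the timeline entry above; the data size and throughput numbers below are hypothetical, chosen only so that the back-of-envelope math roughly reproduces those estimates and shows why a bulk snapshot copy beats streaming replication for warming new hardware.

```python
# Hypothetical dataset size and sustained throughputs -- illustrative only.
DATASET_TB = 2.0
STREAMING_MB_PER_S = 8       # replication limited by apply rate and network
SNAPSHOT_COPY_MB_PER_S = 40  # sequential copy, high-PIOPS EBS volume to local SSD

def hours_to_transfer(size_tb, throughput_mb_per_s):
    """Rough wall-clock time to move size_tb at a sustained throughput."""
    size_mb = size_tb * 1024 * 1024
    return size_mb / throughput_mb_per_s / 3600

print(f"streaming replication: ~{hours_to_transfer(DATASET_TB, STREAMING_MB_PER_S):.1f} h")
print(f"snapshot copy:         ~{hours_to_transfer(DATASET_TB, SNAPSHOT_COPY_MB_PER_S):.1f} h")
```

The advantage presumably comes from sequential bulk I/O versus replaying a replication stream operation by operation, which is why the copy finishes in a fraction of the time.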

January 7th

  • 13:00 - Using the new database warming procedure to ensure primary indexes are in RAM on the newly replicated, high-performance secondary (sketched below), the final and most impactful planned upgrade is completed, restoring full service.
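
The warming procedure itself isn’t detailed in the post. One common approach, sketched below under the assumption of a MongoDB-style replica set accessed via pymongo (our assumption, not something the post confirms), is to run a hinted scan over each index so its pages are read into RAM before the node takes production traffic.

```python
from pymongo import MongoClient

# Hypothetical host and database names for the newly built secondary.
client = MongoClient("secondary.example.internal", 27017)
db = client["app"]

def warm_indexes(collection):
    """Scan every index on the collection so its pages end up in RAM.

    A hinted query forces a walk of the chosen index; projecting only the
    indexed fields keeps the scan close to the index itself.
    """
    for index in collection.list_indexes():
        fields = list(index["key"].keys())
        cursor = collection.find({}, {field: 1 for field in fields}).hint(index["name"])
        scanned = sum(1 for _ in cursor)
        print(f"warmed {index['name']}: {scanned} documents scanned")

for name in db.list_collection_names():
    warm_indexes(db[name])
```

A full scan like this is slow on a large dataset, which is the point of doing it before promotion: user-facing queries never have to hit cold indexes.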

Root Cause

We saw a large, sustained increase in legitimate user traffic, and one primary database was not prepared to scale to meet this load. This database acted as a performance chokepoint, resulting in a complete public outage of coinbase.com.

Though scaling our databases is a normal operation that we’re prepared to execute without interruption, we weren’t prepared for such a large, sudden increase in traffic. In the past we’ve planned in advance for upgrades, giving new instances ample time to warm over many days. We weren’t prepared to manually expedite this warming, which resulted in our second outage, and even our final upgrade took over 15 hours to migrate data and warm indices.

Actions Taken

Since the start of this outage we’ve improved problematic systems and better prepared for future incidents:

  • Scale: We’ve upgraded our clusters from network-attached storage to locally attached SSDs, resulting in a ~20x decrease in latency on some operations. Vertically scaling our systems is, however, a short-term win; we’re now working on better separating concerns by scaling horizontally and upgrading to more efficiently designed systems. Running on this new hardware across all of Coinbase’s Web & API tier, we’re now clocking 35% faster response times.
  • Optimization: We’ve reviewed our least efficient queries and worked with the team to either refactor offending code or build new indices to minimize contention and expensive operations that scaled poorly.
  • Monitoring & Alerting: New monitoring on request volume and response times, along with earlier alerting, has been implemented to better identify future spikes and reduce the chance that future escalations result in a full outage.
  • Access: Our top priority at Coinbase is security, and we emphasize least privilege. With our consensus model we’ve struck a good balance to empower users, but during this incident we identified and remediated several systems that the team either didn’t have access to or wasn’t yet comfortable using.
  • Graceful Degradation: Going into this incident, we relied on several slow-to-enable anti-DDoS & deployable configuration settings to degrade services, which extended the duration and impact of this outage. We’ve since developed the ability to rapidly update these settings & much more gracefully degrade services without locking out customers (a minimal sketch follows this list).
  • Readiness: Working through this incident gave new team members a (very public) chance to run through our incident response process and identify gaps in tooling, training and documentation. One of our engineering principles is to “always leave it better than you found it,” and we’re now working through a backlog of improvements that we’ll exercise in our regular tabletops and simulations.
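
The configuration mechanism behind this faster, more graceful degradation isn’t described in the post. The sketch below shows the general pattern with hypothetical flag names and an in-process dictionary standing in for whatever centrally updatable flag store is actually used.

```python
import time

# Hypothetical degradation flags -- in production this would be a fast,
# centrally updatable configuration service, not an in-process dictionary.
DEGRADATION_FLAGS = {
    "disable_price_charts": False,
    "queue_outbound_emails": False,
    "read_only_mode": False,
}

def set_flag(name, value):
    """Flip a degradation flag immediately, with no deploy or firewall change."""
    DEGRADATION_FLAGS[name] = value
    print(f"{time.strftime('%H:%M:%S')} set {name} = {value}")

def handle_dashboard_request(user):
    """Serve a degraded but usable response instead of locking customers out."""
    if DEGRADATION_FLAGS["read_only_mode"]:
        return {"user": user, "balances": "cached", "trading": "temporarily paused"}
    return {"user": user, "balances": "live", "trading": "enabled"}

set_flag("read_only_mode", True)    # degrade gracefully while the database recovers
print(handle_dashboard_request("alice"))
set_flag("read_only_mode", False)   # restore normal operation afterwards
```

Because flipping a flag is a data change rather than a deploy, degradation can start and end in seconds.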

War Room

A Typical Coinbase War Room

Moving Forward

The top strategic priority of Coinbase is to provide a secure service to our customers. This often means we prioritize spending time on new security tools over focusing on high availability. As the digital currency market grows, and our engineering team along with it, we’ve started investing more heavily in the reliability and availability of our systems. As we ship more of these improvements in ’17, we’re looking forward to providing a better experience across our services.

If you’re excited about building fast-growing, high quality systems at scale, we’d like to hear from you!

Please note: We’re hiring engineers (both in our San Francisco office and remote anywhere in the world). If you’re interested in speaking with us about a role we’ve set up a coding challenge that you can take in about 30–45 minutes. You can also apply through our careers site if you prefer to start the conversation that way.

Written by Coinbase Engineering