We Had an Outage in the US region
By Tom Hacohen (@TomHacohen)
We suffered an outage on Tuesday, June 6th, during which the vast majority of API calls in the US region were failing; other regions were unaffected. The outage lasted approximately 30 minutes. We are terribly sorry about this, and it's extremely frustrating that after two years of flawless operations, this is our second incident in 2023. I want to again apologize to all affected customers. We work very hard to ensure that these things don't happen, but we still managed to fail you and ourselves.
What happened?
We made a change to how we fetch authentication tokens from the database. This code path is heavily cached so the fetching doesn't happen that often, but it still needs to be fast because it sits in a very important code path. The change was fairly simple: it was the first step towards switching from the string column we were using when fetching the tokens to a newly created bytes column. As part of this change, the query was changed from something that looks roughly like:
SELECT * FROM authtoken WHERE token = '...'
to something that looks like:
SELECT * FROM authtoken WHERE token = '...' OR token_bytes = '...'
Needless to say, both fields are indexed.
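For context, here is a minimal sketch of what the relevant part of the table looks like. The index names match the ones that show up in the query plans below, but the exact column definitions are illustrative rather than a copy of our real schema:
CREATE TABLE authtoken (
    id          uuid PRIMARY KEY,
    token       varchar NOT NULL,  -- the existing string column
    token_bytes bytea              -- the newly created bytes column
);
-- Both lookup columns are indexed.
CREATE INDEX ix_authtoken_token ON authtoken (token);
CREATE INDEX ix_authtoken_token_bytes ON authtoken (token_bytes);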
This change worked great: the test suite passed both in the staging environment and on the team's local machines, the canary tests were running along without any issues, and there was no noticeable increase in latencies.
However, when deployed to production we started seeing elevated errors and alerts, with the database (and all of its replicas) pegged at 100% CPU usage. This was causing many queries to time out and fail, which in turn caused API calls to fail. This was only happening in the US, not in the EU.
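As a rough illustration (not our exact runbook), this is the kind of query you can run against pg_stat_activity to see which statements are piling up when a database is pegged like this:
-- Show the longest-running active statements and what they are waiting on.
SELECT pid,
       now() - query_start AS runtime,
       state,
       wait_event_type,
       left(query, 60)     AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY runtime DESC
LIMIT 20;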
Why did it happen?
In retrospect, knowing what the issue was, it makes sense that it was only happening in production. The problem required a large number of different API users with different authentication tokens making calls in parallel in order to trigger, while our continuous canary tests only emulate a small number of customers (more on that later).
Now the question is: why did it even happen in the first place? As mentioned above, both columns were indexed, so it's not a matter of a missing index. In fact, something else was going on.
We set up a script to run a fake query with EXPLAIN ANALYZE to better understand what was going on. The query we ran looked something like this:
EXPLAIN ANALYZE SELECT id FROM authtoken
WHERE
token_bytes = '\x31bffff12121d2104e8e96342728efc2af2d95431b31108b428719817b40595b' OR
token = 'testsk_ORrKzFeI37dV-bnk1heGopi61koa9fmO.us'
LIMIT 1
What we expected to see is something like this, showing how the indices are used together to reduce the search space:
Limit  (cost=8.58..12.60 rows=1 width=32) (actual time=0.022..0.022 rows=0 loops=1)
  ->  Bitmap Heap Scan on authtoken  (cost=8.58..12.60 rows=1 width=32) (actual time=0.021..0.021 rows=0 loops=1)
        Recheck Cond: ((token_bytes = '\x31bffff12121d2104e8e96342728efc2af2d95431b31108b428719817b40595b'::bytea) OR ((token)::text = 'testsk_ORrKzFeI37dV-bnk1heGopi61koa9fmO.us'::text))
        ->  BitmapOr  (cost=8.58..8.58 rows=1 width=0) (actual time=0.018..0.018 rows=0 loops=1)
              ->  Bitmap Index Scan on ix_authtoken_token_bytes  (cost=0.00..4.15 rows=1 width=0) (actual time=0.009..0.009 rows=0 loops=1)
                    Index Cond: (token_bytes = '\x31bffff12121d2104e8e96342728efc2af2d95431b31108b428719817b40595b'::bytea)
              ->  Bitmap Index Scan on ix_authtoken_token  (cost=0.00..4.43 rows=1 width=0) (actual time=0.008..0.008 rows=0 loops=1)
                    Index Cond: ((token)::text = 'testsk_ORrKzFeI37dV-bnk1heGopi61koa9fmO.us'::text)
Planning Time: 0.520 ms
Execution Time: 0.053 ms
But instead, what we saw was something like the below: a full table scan instead of any use of the indices.
Limit  (cost=0.00..8.26 rows=1 width=32) (actual time=1614.602..1614.603 rows=0 loops=1)
  ->  Seq Scan on authtoken  (cost=0.00..308297.01 rows=37344 width=32) (actual time=1614.601..1614.601 rows=0 loops=1)
        Filter: ((token_bytes = '\x31bffff12121d2104e8e96342728efc2af2d95431b31108b428719817b40595b'::bytea) OR ((token)::text = 'testsk_ORrKzFeI37dV-bnk1heGopi61koa9fmO.us'::text))
        Rows Removed by Filter: 7481857
Planning Time: 0.478 ms
Execution Time: 1614.623 ms
There could be a variety of reasons why an index isn't used for a query in Postgres. For example, maybe the table is small enough that scanning it is faster than scanning the index. That was not the case here, though, as the table is actually quite large.
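A quick way to sanity-check that is to look at the table's estimated row count and on-disk size; this is a generic sketch rather than output from our database:
-- Estimated row count and total on-disk size for the table.
SELECT reltuples::bigint AS estimated_rows,
       pg_size_pretty(pg_total_relation_size('authtoken')) AS total_size
FROM pg_class
WHERE relname = 'authtoken';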
Looking at the execution time and the query plan, it's clear why the CPU was pegged at 100%. The moment an auth token was evicted from the cache, the API endpoints would fetch it from the database and then try to insert it into the cache. But because fetching the auth token could take multiple seconds (under load), more and more requests would hit the database trying to fetch the same auth token, which could mean many thousands of copies of the same slow query hitting the database at the same time, only making the problem worse and causing all of the requests to time out. Combined with the short statement timeout we have set on the database (so that the database kills slow queries), this meant that queries to get the auth tokens were never succeeding, and thus never cached.
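For context, the statement timeout mentioned above is a standard Postgres setting. The scope and value below are illustrative (and the role name is hypothetical), not our production configuration:
-- Kill any statement from the application role that runs longer than the limit.
ALTER ROLE api_user SET statement_timeout = '2s';
-- It can also be set for a single session, e.g. when experimenting:
SET statement_timeout = '2s';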
Now, the question is: why weren't the indices used? We are still investigating this. We suspected that it may be because this table hasn't been VACUUMed enough, but this doesn't feel like it would affect index usage for the queries that we run. We also ruled out the indices being incomplete or invalid by querying the PG "management tables".
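For the curious, ruling out broken indices mostly comes down to checking the pg_index catalog; something along these lines (a generic sketch, not our exact query):
-- An index that is still building, or that failed a concurrent build, will
-- show indisvalid/indisready as false.
SELECT c.relname AS index_name,
       i.indisvalid,
       i.indisready
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'authtoken'::regclass;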
We are also consulting with a PG expert about this, and he suggested that it could be related to the autovacuum operation (well, the ANALYZE part of it), as that updates the PG statistics that help the query planner decide whether to use an index or not. However, we already have autovacuum turned on, which should have taken care of this by auto-analyzing the table. So the jury is still out.
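To dig into that theory, the planner statistics and (auto)analyze history can be inspected directly; again, a generic sketch of the kind of queries involved:
-- When was the table last (auto-)analyzed, and how much has changed since?
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_live_tup,
       n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'authtoken';
-- If the statistics look stale, refreshing them manually is cheap:
ANALYZE authtoken;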
Incident timeline
This is the timeline of the incident (all times are in UTC):
- 15:34 - The code with the query that exhibited the issue was deployed to prod
- 15:34 - Elevated errors were noticed by the team and an investigation was started.
- 15:39 - After examining all non-deployed changes, a revert of all of them was merged.
- 15:45 - Deploy of the revert started. However, deploys were failing because the database was overloaded, and the code has readiness sanity checks to make sure it can connect to PG (which it wasn't able to). This meant that deployments were being automatically rolled back to the last healthy version (which was the one with the issues). Only after we manually stopped all of the "healthy" API instances were we able to deploy the new version.
- 16:08 - Errors stopped, and everything was back in order.
Next steps
The most important thing is to make sure this (or anything like this) never happens again. So what are we going to do to support that? The first thing we did was initiate a company-wide feature freeze, where only critical changes are allowed to be merged until we've completed our post-incident changes.
Some of the things we are going to change:
- Change our internal canary tests to mimic higher loads from a large number of API tokens.
- Further improve our deployment speed, especially for rollbacks (not a solution, but it will reduce downtime; already in progress).
- Improve our internal policies to always feature-flag changes in the hot path, no matter how innocent-looking.
- Add automatic multi-region fallbacks in addition to our multi-datacenter fallbacks (it probably wouldn't have saved us here, but we are prioritizing it now as well).
- Make auth token reads even more resilient to database issues.
This is not the end of the list; we are also investigating other ways to increase our resiliency in the hot path. We already have multiple database replicas, extensive caching, and protections against "runaway" queries in place, but these were not enough to prevent this issue, which is not acceptable.
And though this is adjacent to the issue we faced, I want to mention that we are also releasing a tool that automates sending webhooks to Svix following best practices, which would have automatically recovered from this incident.
I would again like to offer our deepest apologies to those affected. I know you trust us with your webhooks, and we want to make sure to live up to that trust.
In case you haven't already, please check out our status page. You can register there to be notified about incidents, though we of course hope that you will never receive such notifications again. Additionally, you can always contact us via email and Slack if you notice any issues or have any questions.