Why You Can't Guarantee Webhook Ordering

This post is about one of the topics we get asked about the most: guaranteeing webhook ordering. At first glance it seems like a simple idea that is easy to implement: just send the webhooks in order.

Update: we have released Polling Endpoints, which make it much easier to implement strict webhook ordering. Read more on the Polling Endpoints announcement post.

Update 2: we have now released FIFO Endpoints which send webhooks in order. Read more on the FIFO Endpoints announcement post.

Why do people want guaranteed ordering?

Before we get to the challenges, let's first talk about why people want it in the first place.

Consider for example a basic billing system where you have two types of entities, a customer and a card. A customer holds information about the person making the payment, for example their home address and telephone number. A card holds information about the payment method (e.g. a credit card) being used, and a reference to its associated customer.

Now imagine you have a form on the website that lets a new customer add their personal details and card information all at once. On the backend, this form will create both entities and trigger two webhooks: customer.created and card.created.

If you don't ensure ordering, the webhook endpoint may receive either one before the other. So it could receive the card.created webhook, which references the customer, before it was ever notified that the customer had been created in the first place.

What's the obvious solution to this? Just ensure the order of delivery!

Guaranteeing order when webhooks fail

The first challenge with ensuring ordering comes when considering failed deliveries. What should happen when delivery fails? Should we block the whole queue of messages? Or should we only ensure ordering when things work and there are no errors?

If we block the whole queue, then a single failure when sending a minor webhook would completely disrupt webhook delivery, and thus the whole service. Think about it: your customers won't be getting any messages because of a bug in the delivery of one type of message. This is obviously not good.

The alternative, then, is to ensure ordering only when there are no errors. This is simple enough, but the problem is that we don't really ensure ordering. If your customers need to handle out-of-order delivery in case of failures, they may as well handle out-of-order delivery in general. It's the same code. So in this case ensuring ordering doesn't add any value.

Guaranteeing order doesn't really work

The other, more significant, challenge with ensuring webhook ordering is that even if you do send webhooks in order, they may not be processed in order.

For example, let's assume we have two events: first and second, and you send first first and second second. Now, let's assume the service consuming the webhooks looks roughly like this pseudo-Python code:

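from time import sleep

HTTP_OK_200 = 200

def handler_for_event_first(payload):
    # Stand-in for a slow call (database query, external API request).
    sleep(1)
    do_stuff(payload)  # placeholder for the actual business logic

    return HTTP_OK_200

def handler_for_event_second(payload):
    # No slow call here, so this handler returns almost immediately.
    do_stuff(payload)

    return HTTP_OK_200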

So there are two handlers, one for first that's very slow, and one for second that's fairly fast. Note: we've used sleep to emulate slowness; in reality it will be some other slow call.

Looking at this code, even if you send first first, in practice it will finish processing after second and not before. This is because even though its handler is called first, the function is slower, so a good chunk of its work happens after second has already been handled. In a more realistic scenario it will probably manifest as a race condition, where in addition to being out of order, the order will also be non-deterministic.
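To make the effect concrete, here is a minimal sketch that runs two handlers concurrently, the way a multi-threaded web server might. The handle function, the delays, and the completed list are illustrative assumptions and not part of any real webhook library.

```python
import threading
import time

completed = []  # records the order in which handlers finish
lock = threading.Lock()

def handle(event, delay):
    time.sleep(delay)  # stand-in for a slow database or API call
    with lock:
        completed.append(event)

# "first" is dispatched first, but its handler takes longer to run.
t1 = threading.Thread(target=handle, args=("first", 0.2))
t2 = threading.Thread(target=handle, args=("second", 0.0))
t1.start()
t2.start()
t1.join()
t2.join()

print(completed)
```

Even though first was dispatched first, second finishes first, so anything that depends on completion order sees the events reversed.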

We can easily work around this by only sending second once first finishes processing (that is, once we get a 200 from the webhook handler), though this introduces two additional problems:

If the handler takes one second to complete (not that rare), our webhook delivery will essentially be limited to one webhook per second, which is terrible, and it will often be even worse.
Best practice when ingesting incoming webhooks is to do some basic validation, put the webhook in a queue for later processing, and immediately return a successful HTTP response. This means that waiting for first to finish doesn't actually fix the problem: unless the queue is also processed in order (again, waiting for first to finish before attempting to process second), it will suffer from the same issues.
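The ingestion pattern from the second point can be sketched as follows. This is a simplified illustration that uses an in-memory queue in place of a durable job queue; receive_webhook, process_in_order, and the validation rule are hypothetical names chosen for the example.

```python
import queue

HTTP_OK_200 = 200
HTTP_BAD_REQUEST_400 = 400

pending = queue.Queue()  # in-memory stand-in for a durable job queue

def receive_webhook(payload):
    # Basic validation only; the real work happens later.
    if "type" not in payload:
        return HTTP_BAD_REQUEST_400
    pending.put(payload)
    return HTTP_OK_200  # acknowledges receipt, not processing

def process_in_order():
    # A single consumer draining the queue sequentially preserves order,
    # at the cost of throughput. Several parallel workers would not.
    processed = []
    while not pending.empty():
        processed.append(pending.get()["type"])
    return processed
```

The 200 returned here only tells the sender that the webhook was accepted. Whether the events are then handled in order depends entirely on how the consumer drains the queue.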

So what can we do?

As you saw above, even if you put the onus on your customers, educate them about webhook best practices, and have them try to process everything in order, it's still quite fragile.

The best solution is to design your webhooks in a way that doesn't require ordering.

One common solution is using what's called "thin payloads". These are essentially webhooks that carry only identifiers and some additional metadata (like which properties have changed). They give your customers enough information about what's going on, but still have them fetch the most recent information using the API.
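A consumer of thin payloads might look roughly like the sketch below. The API_STATE dictionary stands in for the sender's API, and fetch_customer, handle_thin_event, and the event fields are assumptions made for illustration.

```python
# Hypothetical in-memory stand-in for the sender's API.
API_STATE = {"cus_1": {"id": "cus_1", "email": "new@example.com"}}

def fetch_customer(customer_id):
    # In a real consumer this would be an authenticated HTTP GET.
    return API_STATE[customer_id]

local_store = {}  # the consumer's own copy of the data

def handle_thin_event(event):
    # The payload only says which entity changed; the current state
    # always comes from the API, so arrival order doesn't matter.
    customer = fetch_customer(event["id"])
    local_store[customer["id"]] = customer

# Delivered out of order: the update arrives before the creation.
handle_thin_event({"type": "customer.updated", "id": "cus_1", "changed": ["email"]})
handle_thin_event({"type": "customer.created", "id": "cus_1"})
```

Both events lead to the same final state, because each one triggers a fetch of the latest version rather than applying the data it was sent with.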

Another solution (which you should be doing regardless) is including the entity's modification date (or modification counter) in the payload, so the customer can check whether this event is newer or older than what they currently have stored.
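On the consumer side, that check can be as small as the following sketch. It assumes each payload carries a hypothetical integer version field acting as the modification counter; a modification timestamp would be compared the same way.

```python
stored = {}  # entity id -> last applied payload

def apply_if_newer(payload):
    """Apply the payload only if it is newer than what is stored."""
    current = stored.get(payload["id"])
    if current is not None and current["version"] >= payload["version"]:
        return False  # stale or duplicate event, ignore it
    stored[payload["id"]] = payload
    return True

# Delivered out of order: version 2 arrives before version 1.
apply_if_newer({"id": "cus_1", "version": 2, "email": "b@example.com"})
apply_if_newer({"id": "cus_1", "version": 1, "email": "a@example.com"})
```

The late-arriving version 1 is discarded, so the stored record keeps the newer email address.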

While not recommended, you can also ensure ordering of events by attaching a monotonically increasing sequence number to events, and have your customers ensure the processing order on their end by tracking this sequence number. This is still susceptible to all of the issues described above, but it just shows that ordering can be done even without ordered sending.
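As a rough sketch of what tracking the sequence number could look like on the consumer side, the hypothetical Resequencer class below buffers events that arrive early and releases them only once every earlier number has been seen. It assumes sequence numbers start at 1 and have no gaps.

```python
class Resequencer:
    def __init__(self, first_seq=1):
        self.next_seq = first_seq  # next sequence number to release
        self.buffer = {}           # early arrivals, keyed by sequence number

    def receive(self, seq, event):
        """Return the events that are now ready, in sequence order."""
        if seq < self.next_seq:
            return []  # duplicate of an already released event
        self.buffer[seq] = event
        ready = []
        while self.next_seq in self.buffer:
            ready.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

For example, receiving sequence number 2 first returns nothing, and receiving 1 afterwards releases both events in order. A sequence number that never arrives blocks everything behind it, which is the same queue-blocking trade-off described earlier.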

Alternative solutions by reframing the problem

The above problems are inherent to how webhooks work. However, if we reframe the problem, we can overcome some of these limitations and offer alternative solutions.

One solution is to combine webhooks with polling. It's similar to the previous approach, but instead of sending thin payloads, you just send a "ping" to let the customer know that they should poll, and then have them poll the events firehose for the latest events. (Svix now supports this with Polling Endpoints.)
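The ping-then-poll flow can be sketched like this. The example is generic and does not show the Svix Polling Endpoints API: EVENT_LOG, poll_events, and the integer cursor are all assumptions made for illustration.

```python
# Sender side: an ordered log of events (in-memory stand-in).
EVENT_LOG = ["customer.created", "card.created", "customer.updated"]

def poll_events(cursor, limit=100):
    """Hypothetical polling API: events after `cursor`, plus the new cursor."""
    batch = EVENT_LOG[cursor:cursor + limit]
    return batch, cursor + len(batch)

class Consumer:
    def __init__(self):
        self.cursor = 0  # position of the last event we have processed
        self.seen = []

    def on_ping(self):
        # The ping carries no data; it only tells us to go and poll.
        while True:
            batch, self.cursor = poll_events(self.cursor, limit=2)
            if not batch:
                break
            self.seen.extend(batch)
```

Because the consumer reads from an ordered log at its own pace and tracks its own cursor, lost, duplicated, or reordered pings don't affect the order in which events are processed.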

Another solution would be to change the behavior of webhooks to better match the FIFO use-case. Instead of having webhooks as individual events that are triggered independently, you can have them behave more like a stream that is sent together. To avoid the throughput problems mentioned above, you would probably want to batch messages together, and the consumer will need to be fast and efficient when consuming. (Svix now supports this with FIFO Endpoints.)
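A batched, in-order sender could look roughly like the sketch below. It illustrates the general idea only and is not how Svix FIFO Endpoints are implemented: send_fifo and the deliver callback (which returns True when the consumer acknowledges a batch) are hypothetical.

```python
def send_fifo(events, deliver, batch_size=3):
    """Send events in order, batch by batch, stopping at the first failure.

    Returns the number of events the consumer acknowledged.
    """
    acked = 0
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        if not deliver(batch):
            break  # later batches must wait until this one succeeds
        acked += len(batch)
    return acked
```

Batching amortizes the cost of waiting for each acknowledgement over several events, but a failing batch still holds back everything after it, so the consumer has to acknowledge quickly and reliably.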

In conclusion

We see a lot of webhook implementations at Svix, and this question comes up quite often. However, the trade-offs of guaranteeing delivery order for normal webhooks are so significant that it's almost never the right solution.

As discussed above, though, there are multiple solutions that can help solve the underlying problem. You can design your payloads so that your customers have the information they need to process them regardless of ordering, or to follow the ordering constraints their systems were designed to follow. Alternatively, you can reframe the problem and offer your customers a slightly different behavior from regular webhooks: Polling Endpoints (with "ping" webhooks) and batched FIFO Endpoints.

For more content like this, make sure to follow us on Twitter, GitHub or RSS for the latest updates for the Svix webhook service, or join the discussion on our community Slack.