Webhooks Are Harder Than They Seem

Svix is the enterprise-ready webhooks sending service. With Svix, you can build a secure, reliable, and scalable webhook platform in minutes. Looking to send webhooks? Give it a try!

At first glance, webhooks seem simple. You take an event in your system—a user signs up, a file gets uploaded—and you send an HTTP POST request to a URL provided by your customer. Done. Easy. Or so it seems...

It's tempting to take the "just make it work" approach: add a few lines of code to fire off a request and move on to the next task. But like so many seemingly small technical challenges, webhooks have layers of complexity that reveal themselves as soon as you try to scale them, maintain them, or get them production ready.

The result? A lot of poor implementations in the wild: webhooks that fail silently, are very limited, or lack security measures entirely. To make matters worse, webhooks have a higher bar for reliability than your API or the rest of your stack. If your API errors or crashes, your consumers can immediately retry, but if your webhooks don't get triggered, your customers won't even know they were expecting one.

Key webhooks challenges

The simplicity of webhooks is both their biggest strength and their biggest weakness. Because they are so simple (just an HTTP POST request), they are easy to consume and interact with, which is part of what made them ubiquitous in the first place. But this simplicity has also led to simplistic solutions, which make some webhook systems insufficient for production use.

Below are some of the main challenges and considerations one should address when building their own webhook system.

Reliability

HTTP calls fail all the time. Servers go down, timeouts happen, and ephemeral network issues abound. A webhook system needs to be able to recover from all of those in order to be relied upon. That's why a webhook system needs retries to ensure successful delivery. It's recommended to have an automatic retry schedule that follows an exponential backoff and spans a couple of days. This both increases the likelihood of delivery and decreases the delay until a successful retry.
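To make the retry schedule concrete, here is a minimal sketch of an exponential backoff that spans two days. The function name, the 5-second first delay, the growth factor of 4, and the 12-hour cap are all assumptions picked for illustration, not a recommended or official schedule.

```python
from datetime import timedelta


def retry_schedule(first_delay_s=5.0, factor=4.0,
                   max_delay_s=12 * 3600.0, total_span_s=2 * 24 * 3600.0):
    """Return the delays (in seconds) to wait before each retry attempt.

    Delays grow geometrically, are capped at max_delay_s, and stop once the
    cumulative wait would exceed total_span_s. All defaults are illustrative.
    """
    delays = []
    elapsed = 0.0
    delay = first_delay_s
    while elapsed + delay <= total_span_s:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * factor, max_delay_s)
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(retry_schedule(), start=1):
        print(f"retry {attempt}: wait {timedelta(seconds=delay)}")
```

The early retries come quickly, so a brief outage is recovered from within seconds or minutes, while the capped later retries keep trying for the full two days. A production schedule would usually also add random jitter to each delay so that many failed deliveries don't all retry at the same instant; it is left out here to keep the output deterministic.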

Additionally, you need to make sure your system is reliable so that webhooks are delivered at least once, as even one missed webhook is enough to make it so your customers can't rely upon your webhooks. One missed webhook in 10,000 may sound like good odds to a human, but in practice it means the receiving service can't rely on the webhooks as part of a core workflow or integration.

Take GitHub's webhooks, for example. They don't do retries, which means that CI tasks (like a Vercel build or an external check) don't always get run. This leads to immense frustration for customers, as well as a degraded experience for GitHub partners and customers.

Another consideration for webhook reliability is that your system needs to be able to deliver to a large number of HTTP servers, which may have their own oddities and misconfigurations. We wrote multiple blog posts about this, for example our blog post about incomplete TLS certificate chains and our blog post about HTTP oddities.

Security

Webhooks should follow traditional HTTP security practices. They should use HTTPS, TLS 1.2+, follow best practice for cipher selection, and more. Though webhooks also come with their own unique challenges that need to be addressed. This is where a lot of implementations get it wrong, because these are challenges that most engineering teams don't have prior experience with.

We at Svix helped create Standard Webhooks to help educate and fix some of the more common challenges, though security is hard, and you may have security issues even when following the spec.

The first challenge is authentication. Unlike other HTTP API calls, webhook requests are usually signed in order to ensure their authenticity. Standard Webhooks makes getting this right fairly easy, though if you're curious, we previously wrote a post about common webhook signature failure modes.
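As a rough illustration of what signing involves, here is a minimal sketch using HMAC-SHA256 over the message ID, a timestamp, and the body. The function names, the five-minute tolerance, and the exact way the fields are joined are assumptions for this example; the Standard Webhooks spec remains the authority on header names and encodings.

```python
import base64
import hashlib
import hmac
import time


def sign(secret: bytes, msg_id: str, timestamp: int, payload: str) -> str:
    """Return a base64 HMAC-SHA256 signature over ID, timestamp, and body."""
    signed_content = f"{msg_id}.{timestamp}.{payload}".encode()
    digest = hmac.new(secret, signed_content, hashlib.sha256).digest()
    return base64.b64encode(digest).decode()


def verify(secret: bytes, msg_id: str, timestamp: int, payload: str,
           signature: str, tolerance_s: int = 300, now=None) -> bool:
    """Check the signature and reject timestamps outside the tolerance."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_s:
        return False  # too old (or too far in the future): possible replay
    expected = sign(secret, msg_id, timestamp, payload)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Including the timestamp in the signed content is what lets the receiver reject replayed requests, and comparing with `hmac.compare_digest` rather than `==` avoids a timing side channel.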

Another challenge is server-side request forgery (SSRF). SSRF happens when an attacker can make a server send requests to internal resources. For example, an attacker may be able to make a server send a request to another microservice or an internal system in order to attack it. This problem is inherent to webhooks, as webhooks let attackers set the target URL that requests are sent to, forcing webhook senders to protect against this.
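One common mitigation is to resolve the target host and refuse to deliver unless every resolved address is publicly routable. The sketch below shows that check using only the Python standard library; `is_public_ip` and `vetted_addresses` are hypothetical helper names, and a real sender needs more than this (see the note after the code).

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(ip: str) -> bool:
    """True only for globally routable, non-multicast addresses."""
    addr = ipaddress.ip_address(ip)
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:10.0.0.1.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return addr.is_global and not addr.is_multicast


def vetted_addresses(url: str) -> list[str]:
    """Resolve the URL's host and return its IPs, or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError("only https URLs with a host are allowed")
    infos = socket.getaddrinfo(parts.hostname, parts.port or 443,
                               type=socket.SOCK_STREAM)
    ips = sorted({info[4][0] for info in infos})
    if not ips or not all(is_public_ip(ip) for ip in ips):
        raise ValueError(f"{parts.hostname} resolves to a non-public address")
    return ips
```

A check like this is only effective if the HTTP request is then made to one of the vetted addresses rather than resolving the hostname a second time, since DNS answers can change between the check and the connection. Redirects need the same treatment, as each redirect target is a new URL chosen by the remote side.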

The last challenge we'll mention in this post is essentially flooding: causing the server to contact endpoints that are very slow to respond (e.g. slow TCP connection time, slow HTTP response, etc.). Making a lot of such connections keeps the system bogged down and brings down the service for everyone else.
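One way to limit the damage from slow endpoints, alongside strict connect and response timeouts, is to cap how many deliveries may be in flight to any single endpoint at once. The class below is a minimal sketch of such a cap; `EndpointLimiter` and its methods are hypothetical names, and this is one possible approach rather than a description of how any particular service does it.

```python
import threading
from collections import defaultdict


class EndpointLimiter:
    """Caps the number of concurrent deliveries per endpoint."""

    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = defaultdict(int)  # endpoint -> active deliveries
        self._lock = threading.Lock()

    def try_acquire(self, endpoint: str) -> bool:
        """Reserve a slot; return False if the endpoint is at its cap."""
        with self._lock:
            if self._in_flight[endpoint] >= self._max:
                return False
            self._in_flight[endpoint] += 1
            return True

    def release(self, endpoint: str) -> None:
        """Free a slot once the delivery finishes or times out."""
        with self._lock:
            if self._in_flight[endpoint] > 0:
                self._in_flight[endpoint] -= 1
```

A worker would call `try_acquire` before sending and put the message back on the queue when it returns False, so a handful of slow endpoints can tie up at most a bounded number of connections.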

Scalability

Webhooks can be deceptively quiet, until they aren't. A single failure can cascade into a full failure that brings down your whole service. This is not theoretical: GitHub had significant downtime on multiple occasions in 2023, and many of these incidents were caused by their webhook system bringing their whole system down.

Webhooks can generate load orders of magnitude higher than your normal system load. So even if your system is scalable, it may not be scalable enough to support your webhooks. There are a few reasons for that.

The first is that one API call may generate multiple webhooks on your system. Take Stripe, for example. When a payment is made, they'll generate a "charge successful" event to notify about the payment, "invoice paid" to notify about the change of status of the payment, "subscription paid" to notify about the status of the subscription, and potentially "customer updated" to mark the customer as no longer being delinquent.

The second is that when making webhook calls, you're making calls to external services that may be slow to process. So if you get 100 requests per second on your system, and each one generates 4 events, you'll get 400 events per second. But if the consumers take 5 seconds to process each event, deliveries pile up: you'll have 400 events in flight after the first second, 800 after the second, 1,200 after the third, and 1,600 after the fourth, leveling off at 2,000 once the earliest deliveries start completing.
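The pile-up in this example can be reproduced with a few lines of arithmetic. This is a simplified model, assuming a constant event rate and that every delivery takes exactly the same time; the function name is made up for illustration.

```python
def in_flight_per_second(events_per_s: int, processing_s: int, seconds: int):
    """Deliveries still in flight at the end of each second.

    Simplified model: events arrive at a constant rate and each delivery
    takes exactly processing_s seconds before it completes.
    """
    return [events_per_s * min(t, processing_s)
            for t in range(1, seconds + 1)]


if __name__ == "__main__":
    # 100 requests/s * 4 events each = 400 events/s, 5 s per delivery.
    print(in_flight_per_second(400, 5, 6))
    # [400, 800, 1200, 1600, 2000, 2000]
```

The steady-state value is simply the event rate multiplied by the delivery time, which is why slower consumers translate directly into more concurrent work for the sender.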

The third is that some webhooks may fail and thus be retried, and this also compounds. Consider a big customer being down, or an AWS region having issues. The request processing time from the previous example may jump to 15 seconds, leading to roughly 6,000 events in flight at once (400 events per second × 15 seconds) just from the timeout. And depending on how long the issues last, the retries may kick in, which means it'll actually be 2-4x the above load, or 12,000-24,000 events in flight. These exact numbers may not apply to you, but they should be directionally correct. They are also not theoretical; it's something we deal with at Svix all the time.

Quality of service (QoS)

In the previous section we talked about scalability. The other side of scalability is quality of service. It's one thing to make sure the system doesn't buckle under load, but it's another to make sure that it still performs within your latency SLAs (read: webhooks are sent quickly). A webhook sent with a 20s delay because the system is busy may be as worthless as a webhook that's never sent at all. Consider a customer making a payment that takes 20s to register, or an AI workflow that's delayed by 20s. These make for a terrible experience.

While ensuring a certain quality of service for a certain customer under load is important, what's even more important is ensuring quality of service for customers that are not under load, or in other words, avoiding noisy neighbors. While a customer may be willing to accept some processing delays when they generate immense load, other customers that haven't generated the load won't be as tolerant. So it's important to make sure that load from one customer doesn't adversely affect others.
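One simple way to keep a noisy customer from starving the others is to queue messages per customer and serve the queues in round-robin order. The sketch below illustrates the idea; `FairQueue` is a hypothetical name, and real systems often use more elaborate schemes (weights, priorities, sharded queues).

```python
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin scheduling across per-customer queues."""

    def __init__(self):
        self._queues = OrderedDict()  # customer -> deque of messages

    def push(self, customer: str, message) -> None:
        self._queues.setdefault(customer, deque()).append(message)

    def pop(self):
        """Return (customer, message) for the next customer in turn."""
        if not self._queues:
            return None
        customer, queue = next(iter(self._queues.items()))
        message = queue.popleft()
        # Move this customer to the back of the line; drop it when empty.
        del self._queues[customer]
        if queue:
            self._queues[customer] = queue
        return customer, message


if __name__ == "__main__":
    q = FairQueue()
    for i in range(5):
        q.push("noisy", i)
    q.push("quiet", "hello")
    print([q.pop() for _ in range(3)])
    # [('noisy', 0), ('quiet', 'hello'), ('noisy', 1)]
```

With a single shared FIFO queue, the quiet customer's message would have waited behind all five of the noisy customer's messages; here it goes out second.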

Observability

Webhooks are asynchronous in nature. This means that your customers don't control when they get them; they'll just get them when they happen. This makes observability a requirement for any production use case. Consider, for example, making a call to your bank. When you give them a call, you know whether they picked up or not and you can act accordingly (e.g. retry later). That's the equivalent of making an API call. Webhooks, however, are the equivalent of the bank making a call to you. In this scenario you don't know whether the call succeeded or failed (e.g. maybe you went out of service exactly when they called), and without having access to some kind of a call log you'll stay in limbo. Additionally, without that call log telling you why it failed, you may not be able to remedy the problem.

That's why a webhook system should include good observability for its consumers, so that they can diagnose issues, fix them, and redrive failed requests.
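To give a sense of what the "call log" might hold, here is a minimal sketch of a per-attempt record and a helper that finds messages worth redriving. The field names and the `messages_to_redrive` helper are assumptions for illustration, not a description of any particular product's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeliveryAttempt:
    message_id: str
    endpoint_url: str
    attempt: int
    status_code: Optional[int]  # None when no HTTP response was received
    error: Optional[str]        # e.g. "timeout", "tls handshake failed"
    duration_ms: int

    @property
    def succeeded(self) -> bool:
        return self.status_code is not None and 200 <= self.status_code < 300


def messages_to_redrive(attempts):
    """IDs of messages with no successful attempt, in first-seen order."""
    delivered = {a.message_id for a in attempts if a.succeeded}
    pending = []
    for a in attempts:
        if a.message_id not in delivered and a.message_id not in pending:
            pending.append(a.message_id)
    return pending
```

Recording the status code, the error, and the duration for every attempt is what lets a consumer see not only that a delivery failed but why, which is the part of the call log that makes the problem fixable.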

Another observability channel is internal rather than customer facing. Webhooks are usually supported by complex internal infrastructure which involves queues, workers, and the like. These require monitoring and alerting, such as measuring queue back-pressure, dead-letter queues (DLQs), worker auto-scaling, and the like.

Developer experience

One often-overlooked aspect of building webhooks is the developer experience. This is not just a nice-to-have, as a good developer experience is the difference between webhooks being adopted and webhooks not being adopted. So you should account for this extra work when building your webhook system.

One important aspect of webhooks developer experience is the aforementioned observability. Without this observability developers are flying blind, and are unable to effectively debug webhook delivery which is very important in both the initial implementation and on an ongoing basis.

Related to that is building a self-serve UI for your customers to register webhooks, trigger test events, rotate secrets, and the like. The alternative, which is filling out forms and sending support tickets, will hamper developer adoption.

You also want to make sure you meet developers where they are and how they want to use your service. The first part of this is supporting fanning out requests to multiple endpoints (some services only allow one URL), as your customers may have multiple systems that need to be notified, and not having that puts the onus on them. The second is making sure to support their security and compliance requirements. For example, while webhook signatures are the recommended way of authenticating webhooks, your customers may have policies in place that require OAuth 2.0 or authentication tokens in addition. Make sure that you support these in order to make adoption as easy and smooth as possible.

Lastly, you may want to support webhook throttling and long timeouts. Regarding timeouts: while it's recommended for webhook consumers to verify payloads and immediately add to internal queues for later processing in order to ensure fast and reliable webhook consumption, not all of your customers will do this. Having long request timeouts when making webhook calls is therefore important in order to support these customers.

As for webhook throttling: as we mentioned in the scalability section above, many scenarios can generate a large number of webhooks. While you may be able to handle the load, your customers may not, which will lead to webhook failures and, with the retries, even more load on your customers, which may bring their service down. That's why supporting webhook throttling, which essentially means letting your customers define the maximum webhook delivery rate they can handle and throttling delivery accordingly, can make a significant difference for you and your customers.
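A token bucket is one common way to enforce a customer-defined delivery rate. The sketch below is a minimal version with the clock passed in explicitly so the behavior is easy to follow; the class name and parameters are assumptions for this example, and it is one possible technique rather than the only way to throttle.

```python
class TokenBucket:
    """Allows up to rate_per_s deliveries per second, with a small burst."""

    def __init__(self, rate_per_s: float, burst: int, now: float = 0.0):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.updated = now

    def try_consume(self, now: float) -> bool:
        """Take one token if available; otherwise the caller should wait."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A sender would keep one bucket per endpoint, configured from the rate the customer chose, and delay (rather than drop) any message for which `try_consume` returns False.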

Closing words

Webhooks may seem easy at first glance, but many of the challenges they present are non-trivial and also unfamiliar for many engineers. From reliability and security to scalability and developer experience, building a production-worthy webhook system takes more time and work than people initially account for.

That's why we created Svix: to make webhooks easy and reliable. If you're thinking about building your own webhook system, or having issues with your existing one, check us out at Svix.com!
