How to Build a Webhook Receiver That Handles Real-World Traffic


Webhooks look simple on paper. A remote service sends an HTTP POST to your endpoint, you process the event, you return a 200. The implementation that works in development often fails within days of going live.

Delivery retries create duplicate events. High-volume senders overwhelm synchronous receivers. Forged requests slip through when payloads go unverified. An endpoint that returns a 500 on a slow database query gets queued for retry, and a sender like Stripe will keep retrying it for days. These failures follow a predictable pattern, and building against them from the start is much cheaper than retrofitting after a production incident.

This guide covers what makes a webhook receiver production-ready: signature verification, idempotency, async processing, replay protection, and graceful failure handling.


The Request-Response Window Is Too Short for Real Work

The most common webhook implementation mistake is doing synchronous processing inside the request handler. You receive the event, query the database, update records, send a notification, and return 200 when everything finishes.

This creates two failure modes. First, if any processing step takes more than a few seconds, the sender may time out and retry the event. Now you're processing the same event twice. Second, if your database is under load or a downstream dependency is slow, the entire endpoint fails, and the sender queues you for another retry attempt.

The correct architecture separates receiving from processing. The receiver does exactly three things: validate the signature, persist the raw payload, and return 200. Everything else happens in a background worker.

The receiver endpoint should respond in under 500 milliseconds. Redis and other in-memory queue systems handle the handoff between the receiver and the worker efficiently, even at high event volumes. A database-backed queue works for lower volumes and adds no infrastructure complexity.
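That separation can be sketched in a few lines. The stdlib `queue.Queue` below is a stand-in for Redis or a database-backed queue; the point is the shape, not the backend:

```python
import json
import queue

# In production this handoff would go through Redis or a database table;
# queue.Queue here just illustrates the receiver/worker split.
event_queue = queue.Queue()

def receive(raw_body: bytes) -> int:
    """Receiver: persist the raw payload and hand off. No processing here."""
    event_queue.put(raw_body)   # fast, in-memory handoff
    return 200                  # acknowledge immediately

def worker_step() -> dict:
    """Worker: pull one event and do the slow work outside the request cycle."""
    raw = event_queue.get()
    event = json.loads(raw)     # parse only in the worker, never in the receiver
    # ... database updates, notifications, downstream calls happen here ...
    return event

status = receive(b'{"id": "evt_1", "type": "invoice.paid"}')
```

The receiver never parses JSON or touches the database beyond the queue write, which is what keeps its response time bounded.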

Signature Verification Is Not Optional

Every production webhook sender worth integrating with signs its payloads. The signature is a hash of the raw payload body combined with a shared secret. Verifying this signature before touching the payload is the first line of defense against forged requests.

The verification process: compute the HMAC of the raw request body bytes using the shared secret, compare against the signature header the sender included, and reject requests that don't match. Most senders include verification utilities in their SDKs. Postman supports webhook simulation, making it easy to test your verification logic against correctly and incorrectly signed requests before a real sender connects.

Two important implementation details here. Always compute the HMAC on the raw bytes of the request body, not on the parsed JSON. Parsers can normalize whitespace, reorder keys, or change encoding in ways that break verification even when the original payload was genuine. And always use a constant-time comparison for the final check. String equality that short-circuits on the first differing character is vulnerable to timing attacks.
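Both details look like this in Python. The hex encoding and the bare signature header are assumptions for the sketch; real senders each document their own scheme (Stripe, for example, packs a timestamp and signature into one header):

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, signature_header: str, secret: bytes) -> bool:
    """HMAC-SHA256 over the raw request bytes, compared in constant time."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # hmac.compare_digest avoids the timing leak of an ordinary == comparison,
    # which short-circuits on the first differing character.
    return hmac.compare_digest(expected, signature_header)

secret = b"whsec_test"
body = b'{"id": "evt_1"}'
valid_sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
```

Note that `body` stays as bytes end to end; it is never deserialized before the HMAC is computed.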

Idempotency: Every Event Arrives More Than Once

Webhook delivery is at-least-once, not exactly-once. If your endpoint returns anything other than a 2xx response, including a 500 caused by a slow query, the sender will retry. Stripe retries failed deliveries with exponential backoff for up to three days. GitHub flags failed deliveries and lets you redeliver them from the webhook dashboard.

Your event processing code must be idempotent: running it twice on the same event produces the same result as running it once. There are two practical approaches.

Track event IDs. Every webhook event includes an identifier in the payload. Before processing, check whether you have already handled this ID. If yes, return 200 immediately without doing any work. A database table with a unique index on the event ID handles this without complex locking.

Design idempotent operations. Instead of "add 10 credits to the account," store the target credit balance this event should produce. Running the same update twice produces the same final state.

The ID-tracking approach is more universal and requires no special care in the processing logic itself. Store the event ID, received timestamp, and processing status. Your worker marks the record as processed on completion. If the same event arrives while the first is still being processed, the receiver returns 200 immediately.
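A minimal version of that tracking table, sketched here with SQLite: the unique constraint itself does the duplicate check atomically, so there is no separate SELECT to race against:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE webhook_events (
    event_id    TEXT PRIMARY KEY,
    received_at TEXT DEFAULT CURRENT_TIMESTAMP,
    status      TEXT DEFAULT 'pending')""")

def record_event(event_id: str) -> bool:
    """Return True only the first time this event ID is seen.

    The unique index rejects the duplicate insert, so two concurrent
    deliveries of the same event cannot both claim 'first'.
    """
    try:
        db.execute("INSERT INTO webhook_events (event_id) VALUES (?)", (event_id,))
        db.commit()
        return True   # new event: enqueue for processing
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: return 200 and do nothing
```

The receiver calls `record_event` and, on `False`, returns 200 without enqueuing anything.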


Replay Protection and the Timestamp Window

Signature verification proves a payload is genuine. It does not prove the payload is fresh. A signed payload from six hours ago is still a cryptographically valid signed payload.

Replay attacks are less common in webhook integrations than in authentication flows, but they're worth guarding against in high-security contexts. The mitigation is straightforward: enforce a short timestamp window. Reject requests whose signed timestamp is more than five minutes old.

Most production webhook senders include a timestamp alongside the signature specifically for this purpose. Stripe's signature scheme embeds a timestamp in the signature header and documents the five-minute window check as a recommended practice.
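A sketch of the window check, assuming the sender supplies a Unix timestamp (the way Stripe embeds one in its `Stripe-Signature` header); adjust the parsing to your sender's actual scheme:

```python
import time

TOLERANCE_SECONDS = 300  # the five-minute window

def timestamp_is_fresh(header_timestamp, now=None):
    """Reject payloads whose signed timestamp falls outside the window.

    `now` is injectable for testing; unparseable timestamps are rejected
    rather than treated as fresh.
    """
    now = time.time() if now is None else now
    try:
        sent_at = int(header_timestamp)
    except (TypeError, ValueError):
        return False
    return abs(now - sent_at) <= TOLERANCE_SECONDS
```

The timestamp must be part of the signed material; a timestamp outside the signature could simply be rewritten by the attacker replaying the payload.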

"Most teams stop at signature verification and call it secure. Idempotency is equally important. The retry storm that hits at 2am when a slow database query returns 500 generates duplicate records across multiple tables. Once is preventable. Cleaning it up is not." - Dennis Traina, founder of 137Foundry

Handling Failures Without Losing Events

When processing fails after a valid payload has been received, the safest response is to log the failure, leave the raw payload in the queue with a failure status, and not return an error to the sender. You have already acknowledged the event. Returning a 500 at this point would trigger retries for an event you have already stored, creating duplicates.

The background worker handles retry logic independently from the sender. If a downstream call fails, the worker backs off using exponential delay and retries. If it fails past a retry threshold, the event moves to a dead-letter queue for manual inspection.
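One way to sketch that worker loop. The handler, delays, and dead-letter list here are illustrative; production workers usually persist the attempt count and next-retry time in the queue record itself rather than looping in-process:

```python
MAX_ATTEMPTS = 5

def backoff_delay(attempt: int, base: float = 1.0) -> float:
    """Exponential delay between retries: 1s, 2s, 4s, 8s, 16s."""
    return base * (2 ** attempt)

def process_with_retries(event, handler, dead_letter, sleep=lambda s: None):
    """Retry the handler with exponential backoff; dead-letter on exhaustion.

    `sleep` is injectable so tests don't actually wait.
    """
    for attempt in range(MAX_ATTEMPTS):
        try:
            return handler(event)
        except Exception:
            sleep(backoff_delay(attempt))
    dead_letter.append(event)  # manual inspection; new events keep flowing
    return None
```

In production you would pass `time.sleep`, or better, re-enqueue the event with a delay so the worker is free to process other events while waiting.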

An endpoint that only accepts, verifies, and stores the raw payload will rarely return a 5xx. The only real failure modes are a database write error or an exception during verification. Both are fast, catchable operations that fail quickly rather than hanging.

Testing Before Going Live

Testing webhook receivers is harder than testing regular API endpoints because you need an external system to send signed requests. ngrok exposes a local endpoint to the public internet, allowing external services to send real payloads to your development machine. This is the most direct way to test signature verification, idempotency, and error handling against a real sender.

For automated testing, write unit tests against the signature verification and idempotency logic directly. Mock the raw bytes and a known HMAC secret, then verify that your code accepts valid payloads and rejects tampered ones. Test the idempotency check by calling your processing function with the same event ID twice and asserting the second call is a no-op.

MDN Web Docs has authoritative reference material on HTTP request handling, headers, and response codes for teams building receivers from scratch.

Structuring the Receiver Endpoint

A minimal production-ready receiver follows this sequence:

  1. Read the raw request body as bytes before any framework deserialization runs.
  2. Verify the signature against the shared secret. Return 400 if it fails.
  3. Check the timestamp window. Return 400 if the payload is older than five minutes.
  4. Extract or generate the event ID from the payload.
  5. Check the idempotency store. If the ID has already been seen, return 200 immediately.
  6. Write the raw payload and event ID to the queue with status "pending."
  7. Return 200.
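The seven steps above can be condensed into one handler. The in-memory set and list stand in for the idempotency table and queue, and the bare signature/timestamp headers are assumptions for the sketch:

```python
import hashlib
import hmac
import json
import time

SECRET = b"whsec_test"
seen_ids = set()   # stand-in for the idempotency table
pending = []       # stand-in for the queue

def handle_webhook(raw_body: bytes, signature: str, timestamp: str) -> int:
    # Steps 1-2: verify the HMAC over the raw bytes, constant-time compare.
    expected = hmac.new(SECRET, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 400
    # Step 3: reject payloads outside the five-minute window.
    if abs(time.time() - int(timestamp)) > 300:
        return 400
    # Step 4: extract the event ID from the payload.
    event_id = json.loads(raw_body)["id"]
    # Step 5: duplicate deliveries are acknowledged, not re-queued.
    if event_id in seen_ids:
        return 200
    seen_ids.add(event_id)
    # Steps 6-7: enqueue the raw payload and acknowledge.
    pending.append(raw_body)
    return 200
```

Everything slow or failure-prone is deferred to the worker that drains `pending`.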

The worker reads from the queue, processes the event, and updates the status. Processing errors never affect the receiver's response to the sender. The receiver stays fast and stateless.


What Good Webhook Infrastructure Looks Like at Scale

At higher volumes, the raw payload store becomes a replayable event log. You can re-process events from any point in time by re-queuing records from the raw store. This is useful for debugging processing bugs discovered after the fact, without needing the sender to resend historical events.

A dead-letter queue for events that fail processing repeatedly provides a clear list of what requires manual attention, without blocking new events from being processed normally.

Two metrics worth monitoring in production: queue depth (events waiting to be processed) and worker error rate (events that fail during processing). If queue depth grows steadily, the workers are not keeping up and need scaling before senders start seeing timeouts. If the error rate spikes, something in the processing logic is failing on a class of events.

The 137Foundry data integration team helps teams design and build event-driven integrations, including webhook receivers with the queue architecture and failure handling described here.

Choosing the Right Queue Backend

For low-volume integrations (under a few thousand events per day), a database table works as a queue. It's inspectable, queryable for stuck events, and requires no additional infrastructure. The tradeoff is write throughput under load.
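A database-backed queue can be as small as one table plus a claim query. This SQLite sketch is safe for a single worker; with multiple workers on Postgres, you would claim rows atomically with `FOR UPDATE SKIP LOCKED` instead of the separate SELECT and UPDATE shown here:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE event_queue (
    id      INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    status  TEXT DEFAULT 'pending')""")

def claim_next():
    """Claim the oldest pending event and mark it processing.

    Single-worker sketch: the SELECT-then-UPDATE pair is not safe for
    concurrent workers. On Postgres, fold both into one statement with
    FOR UPDATE SKIP LOCKED so workers never grab the same row.
    """
    row = db.execute(
        "SELECT id, payload FROM event_queue "
        "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    db.execute("UPDATE event_queue SET status = 'processing' WHERE id = ?", (row[0],))
    db.commit()
    return row
```

The same table doubles as the inspectable record of stuck events: a query for old rows still in `processing` surfaces work that died mid-flight.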

At higher volumes, a dedicated queue reduces database write pressure and allows independent worker scaling. Redis works well here. The right choice depends on your current infrastructure and your actual current volume, not a hypothetical future load.

The correct default for new integrations is a database-backed queue. You can migrate to a dedicated message broker when the load justifies it. This team has helped multiple clients make that migration without interrupting live event processing.

Final Note

A webhook receiver that works in testing is not the same as one that works when a sender retries an event six times overnight or when two events with the same ID arrive 200 milliseconds apart. The patterns in this guide (signature verification, idempotency, async processing, and replay protection) each address a specific failure mode you will eventually encounter.

Building them in from the start costs a few extra hours. Retrofitting them after the first production incident costs much more.

Need help with Data & Integration?

137Foundry builds custom software, AI integrations, and automation systems for businesses that need real solutions.
