Webhook Retry Logic: The Complete Guide

By ThunderHooks Team · 7 min read

Webhooks fail. Not sometimes — constantly. The receiving server is deploying, the network hiccuped, a load balancer timed out, the database behind the endpoint hit a connection limit. In production, a 2-5% failure rate on webhook deliveries is completely normal.

The question isn't whether your webhooks will fail. It's what happens next.

A good retry strategy is the difference between "we lost 3% of payment notifications and some customers got free access for a week" and "every event was delivered within 30 minutes, automatically." This guide covers everything you need to build reliable webhook delivery — or evaluate whether your current provider handles it properly.

Why Webhooks Fail

Before picking a retry strategy, it helps to understand the failure modes. They fall into three categories:

Transient failures — the endpoint returned a 500, 502, 503, or the connection timed out. These are the most common. The server was busy, mid-deploy, or a downstream dependency was momentarily unavailable. A retry in 30 seconds will probably succeed.

Persistent failures — the endpoint consistently returns errors. Maybe the consumer's server is down for hours, their SSL certificate expired, or a bug in their handler causes every request to fail. Retrying quickly just wastes resources. You need backoff.

Permanent failures — the endpoint returns 401, 404, or 410. The consumer removed their webhook handler, changed their URL, or revoked credentials. No amount of retrying will fix this. You need to stop and alert someone.

Understanding these categories shapes the retry logic: be aggressive on transient failures, patient on persistent ones, and smart enough to stop on permanent ones.
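A status-code classifier can capture the first and third categories directly (persistent failures need state across attempts, so they're handled separately by a circuit breaker). Here's a minimal sketch — the `Category` type and `classifyStatus` function are illustrative names, not part of any standard library:

```go
package main

import "fmt"

// Category describes how a single delivery failure should be handled.
type Category int

const (
	Transient Category = iota // worth retrying soon
	Permanent                 // stop and alert
)

// classifyStatus maps an HTTP status code to a failure category.
// 5xx responses, 408 (request timeout), and 429 (rate limited) are
// worth retrying; other 4xx codes mean the request can never succeed.
func classifyStatus(code int) Category {
	switch {
	case code == 408 || code == 429:
		return Transient
	case code >= 500:
		return Transient
	default:
		return Permanent
	}
}

func main() {
	fmt.Println(classifyStatus(503) == Transient) // true
	fmt.Println(classifyStatus(404) == Permanent) // true
}
```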

Retry Strategies Compared

Fixed Interval (Linear)

The simplest approach: wait N seconds between each retry.

Attempt 1: immediate
Attempt 2: 60s later
Attempt 3: 60s later
Attempt 4: 60s later

This is fine for a toy project. In production, it causes thundering herds. If a server goes down for 10 minutes and you're sending webhooks to 5,000 endpoints, they all come back online at the same time and you slam them with retries simultaneously. Fixed intervals also don't adapt — you're equally aggressive whether the failure is a brief blip or a multi-hour outage.

Exponential Backoff

Each retry waits longer than the last, typically doubling:

Attempt 1: immediate
Attempt 2: 30s later
Attempt 3: 1 min later
Attempt 4: 2 min later
Attempt 5: 4 min later

This is dramatically better. Short outages are retried quickly. Long outages don't generate a pile of wasted requests. But pure exponential backoff still has the thundering herd problem — all retries for a given attempt number fire at the same time.

Exponential Backoff with Jitter

The gold standard. Same exponential curve, but each retry adds random jitter to spread the load:

Attempt 1: immediate
Attempt 2: 22-38s later (30s ± 25%)
Attempt 3: 45-75s later (1min ± 25%)
Attempt 4: 1m30s-2m30s later (2min ± 25%)
Attempt 5: 3m-5m later (4min ± 25%)

Jitter breaks up the synchronized retry waves that would otherwise hammer a recovering server. AWS recommends this pattern. The Standard Webhooks spec recommends it. Every serious webhook provider uses it.

Here's a clean implementation in Go:

package retry

import (
	"context"
	"fmt"
	"math"
	"math/rand"
	"net/http"
	"time"
)

type Config struct {
	MaxRetries  int
	BaseDelay   time.Duration
	MaxDelay    time.Duration
	JitterRatio float64 // 0.0 to 1.0
}

func DefaultConfig() Config {
	return Config{
		MaxRetries:  5,
		BaseDelay:   30 * time.Second,
		MaxDelay:    1 * time.Hour,
		JitterRatio: 0.25,
	}
}

func (c Config) DelayForAttempt(attempt int) time.Duration {
	delay := float64(c.BaseDelay) * math.Pow(2, float64(attempt))
	if delay > float64(c.MaxDelay) {
		delay = float64(c.MaxDelay)
	}

	// Apply jitter: ±(JitterRatio * delay)
	jitter := delay * c.JitterRatio
	delay = delay - jitter + (rand.Float64() * 2 * jitter)

	return time.Duration(delay)
}

func DeliverWithRetry(ctx context.Context, cfg Config, req *http.Request) error {
	client := &http.Client{Timeout: 30 * time.Second}

	for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
		if attempt > 0 {
			delay := cfg.DelayForAttempt(attempt - 1)
			select {
			case <-time.After(delay):
			case <-ctx.Done():
				return ctx.Err()
			}
		}

		// An http.Request body can only be read once, so restore it
		// before each retry (http.NewRequest sets GetBody for common
		// body types like bytes.Buffer and strings.Reader).
		if attempt > 0 && req.GetBody != nil {
			req.Body, _ = req.GetBody()
		}

		resp, err := client.Do(req)
		if err != nil {
			continue // network error, retry
		}
		resp.Body.Close()

		if resp.StatusCode >= 200 && resp.StatusCode < 300 {
			return nil // success
		}

		// Don't retry on client errors (except 429 and 408)
		if resp.StatusCode >= 400 && resp.StatusCode < 500 {
			if resp.StatusCode != 429 && resp.StatusCode != 408 {
				return fmt.Errorf("permanent failure: HTTP %d", resp.StatusCode)
			}
		}
	}

	return fmt.Errorf("exhausted %d retries", cfg.MaxRetries)
}

A few things worth noting in that code. The function distinguishes between retryable and non-retryable status codes — a 404 or 401 won't be retried because no amount of waiting will fix a missing endpoint or bad credentials. A 429 (rate limited) and 408 (request timeout) are retried because they're transient by nature. The context parameter lets the caller cancel retries if the overall operation is no longer relevant.

And the equivalent in Node.js, for those not working in Go:

async function deliverWithRetry(url, payload, opts = {}) {
  const maxRetries = opts.maxRetries ?? 5;
  const baseDelay = opts.baseDelay ?? 30_000;
  const maxDelay = opts.maxDelay ?? 3_600_000;
  const jitterRatio = opts.jitterRatio ?? 0.25;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) {
      const raw = Math.min(baseDelay * 2 ** (attempt - 1), maxDelay);
      const jitter = raw * jitterRatio;
      const delay = raw - jitter + Math.random() * 2 * jitter;
      await new Promise((r) => setTimeout(r, delay));
    }

    let resp;
    try {
      resp = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(payload),
        signal: AbortSignal.timeout(30_000),
      });
    } catch {
      continue; // network error or timeout: retry
    }

    if (resp.ok) return;

    if (resp.status >= 400 && resp.status < 500
        && resp.status !== 429 && resp.status !== 408) {
      throw new Error(`Permanent failure: HTTP ${resp.status}`);
    }
  }

  throw new Error(`Exhausted ${maxRetries} retries`);
}

The Standard Webhooks Retry Recommendation

The Standard Webhooks specification — backed by Svix, and adopted by companies like Clerk and Liveblocks — recommends a specific retry schedule:

  • 5 retry attempts after the initial failure
  • Exponential backoff with jitter
  • Intervals roughly at: 30s, 5m, 30m, 2h, 8h

That gives the consumer's endpoint up to ~10 hours to recover before the event is considered permanently failed. The progressively longer delays mean short outages resolve fast, while longer outages don't generate excessive load.

This is a sensible default and the one we follow at ThunderHooks for webhook delivery through the relay feature.

Circuit Breakers

Retries handle individual delivery failures. Circuit breakers handle systemic failure — when an endpoint is consistently down and retrying is just wasting bandwidth.

The concept comes from electrical engineering. When too much current flows through a circuit, the breaker trips and cuts the connection. Same idea here: when an endpoint fails too many times in a row, stop sending to it entirely for a cooldown period.

A circuit breaker has three states:

  • Closed — normal operation, requests flow through
  • Open — endpoint is considered down, all requests are short-circuited (queued or dropped)
  • Half-Open — after a cooldown, allow one request through to test if the endpoint recovered

A minimal implementation in Go:

import (
	"sync"
	"time"
)

type CircuitBreaker struct {
	mu               sync.Mutex
	failureCount     int
	failureThreshold int
	state            string // "closed", "open", "half-open"
	lastFailure      time.Time
	cooldown         time.Duration
}

func (cb *CircuitBreaker) Allow() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	switch cb.state {
	case "closed":
		return true
	case "open":
		if time.Since(cb.lastFailure) > cb.cooldown {
			cb.state = "half-open"
			return true
		}
		return false
	case "half-open":
		return false // only one probe at a time
	}
	return false
}

func (cb *CircuitBreaker) RecordSuccess() {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	cb.failureCount = 0
	cb.state = "closed"
}

func (cb *CircuitBreaker) RecordFailure() {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	cb.failureCount++
	cb.lastFailure = time.Now()
	if cb.failureCount >= cb.failureThreshold {
		cb.state = "open"
	}
}

Without a circuit breaker, a dead endpoint generates retry traffic indefinitely. With one, you detect the pattern, back off entirely, and periodically probe to see if things recovered. In ThunderHooks, the circuit breaker trips after 5 consecutive failures and enters a cooldown before probing again. Events that arrive while the circuit is open are queued, not dropped — they'll be delivered once the endpoint recovers.

Dead Letter Queues

Eventually, retries are exhausted. The circuit breaker has been open for days. The event still hasn't been delivered. Now what?

A dead letter queue (DLQ) is where events go when they've permanently failed delivery. Instead of silently dropping the event, you move it to a separate queue where it can be:

  • Inspected by a human
  • Replayed manually once the endpoint is fixed
  • Analyzed for patterns (is the same endpoint always failing?)
  • Alerted on (PagerDuty, Slack, email)

The DLQ is your safety net. Production incidents happen — a deploy breaks the webhook handler, someone accidentally rotates the signing secret, a certificate expires on a Friday night. The DLQ holds the events until someone fixes the root cause and replays them.

If your webhook provider doesn't have a DLQ or at least an event log with replay capability, you're flying without a net. When things go wrong (and they will), you'll lose events with no way to recover them.
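The shape of a DLQ is simple; the hard parts are durability and tooling. This in-memory sketch shows the interface — a real implementation would back Push with durable storage, enforce retention, and fire alerts. All names here are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// DeadLetter records an event whose delivery retries were exhausted.
type DeadLetter struct {
	EventID   string
	Endpoint  string
	LastError string
	FailedAt  time.Time
}

// DeadLetterQueue is a minimal in-memory sketch. A production DLQ
// needs durable storage, retention (e.g. 30 days), and alert hooks.
type DeadLetterQueue struct {
	entries []DeadLetter
}

func (q *DeadLetterQueue) Push(d DeadLetter) {
	q.entries = append(q.entries, d)
	// Alerting (PagerDuty, Slack, email) would hook in here.
}

// Replay re-attempts delivery for every stored event using the given
// deliver function, keeping only the ones that fail again.
func (q *DeadLetterQueue) Replay(deliver func(DeadLetter) error) int {
	var remaining []DeadLetter
	replayed := 0
	for _, d := range q.entries {
		if err := deliver(d); err != nil {
			remaining = append(remaining, d)
			continue
		}
		replayed++
	}
	q.entries = remaining
	return replayed
}

func main() {
	q := &DeadLetterQueue{}
	q.Push(DeadLetter{EventID: "evt_1", LastError: "HTTP 503", FailedAt: time.Now()})
	q.Push(DeadLetter{EventID: "evt_2", LastError: "HTTP 503", FailedAt: time.Now()})

	// Once the endpoint is fixed, replay everything.
	n := q.Replay(func(d DeadLetter) error { return nil })
	fmt.Println(n, len(q.entries)) // 2 0
}
```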

Idempotency: The Other Half of Reliability

Retries introduce a subtle problem. If the webhook was delivered but the response was lost (network timeout after the server processed it), your retry logic thinks it failed and sends it again. Now the consumer processes the same event twice.

This matters a lot for some event types. Processing a payment.completed event twice could credit a user's account twice. Processing an order.created event twice could trigger duplicate fulfillment.

The solution is idempotency keys. Every webhook event gets a unique ID, sent in a header:

X-Webhook-ID: evt_2xK9mPq7vR3nT8wL

The consumer stores processed event IDs and checks before processing:

func handleWebhook(w http.ResponseWriter, r *http.Request) {
    eventID := r.Header.Get("X-Webhook-ID")

    // Atomic insert: succeeds only the first time this event ID is
    // seen. Assumes a table like:
    //   CREATE TABLE processed_events (
    //       event_id TEXT PRIMARY KEY,
    //       processed_at TIMESTAMP
    //   )
    result, err := db.Exec(
        `INSERT INTO processed_events (event_id, processed_at)
         VALUES (?, ?) ON CONFLICT (event_id) DO NOTHING`,
        eventID, time.Now(),
    )
    if err != nil {
        http.Error(w, "internal error", 500)
        return
    }

    // Zero rows affected means the ID already existed: a duplicate.
    rows, err := result.RowsAffected()
    if err != nil {
        http.Error(w, "internal error", 500)
        return
    }
    if rows == 0 {
        w.WriteHeader(200) // Acknowledge but don't reprocess
        return
    }

    // Process the event...
}

The Standard Webhooks spec includes webhook-id as a required header for exactly this reason. Any system with retries needs idempotency support — they're two sides of the same coin.

Quick Reference: Retry Parameters

Parameter                    Recommended value           Why
Max retries                  5                           Balances persistence with resource usage
Base delay                   30 seconds                  Fast enough for transient failures
Max delay                    8 hours                     Long enough for extended outages
Backoff multiplier           2x (exponential)            Standard doubling curve
Jitter                       ±25%                        Prevents thundering herd
Request timeout              30 seconds                  Don't hang on slow endpoints
Circuit breaker threshold    5 consecutive failures      Detects persistent outages
Circuit breaker cooldown     5 minutes                   Probes recovery without flooding
DLQ retention                30 days                     Enough time to notice and replay
Retryable status codes       408, 429, 5xx               Transient by nature
Non-retryable status codes   400, 401, 403, 404, 410     Client errors are permanent

What ThunderHooks Does Automatically

If you're sending webhooks through ThunderHooks relay, all of this is handled for you:

  • 5 retries with exponential backoff and jitter — following the Standard Webhooks schedule
  • Circuit breaker — trips after 5 consecutive failures, enters cooldown, then probes with a single request to check recovery
  • Full event log — every delivery attempt is recorded with status code, latency, and response body
  • One-click replay — failed deliveries can be replayed from the dashboard once the root cause is fixed
  • Idempotency headers — every event includes a unique X-Webhook-ID so consumers can deduplicate safely

You focus on your business logic. We handle the plumbing.

Building vs. Buying Retry Logic

If you're building your own webhook sending infrastructure, implementing retries properly is non-trivial. You need:

  1. A durable queue (not an in-memory channel that vanishes on restart)
  2. Scheduled retry execution with backoff calculation
  3. Per-endpoint circuit breaker state
  4. Dead letter storage and replay tooling
  5. Monitoring and alerting on delivery rates
  6. Idempotency key generation and header injection

That's a meaningful amount of infrastructure. For some teams it makes sense to own it — if webhooks are core to your product and you need fine-grained control. For most teams, the engineering time is better spent elsewhere.

Whatever you choose, don't ship webhooks without retry logic. A webhook system without retries is a notification system that randomly drops messages. Your consumers deserve better than that.

Ready to simplify webhook testing?

Try ThunderHooks free. No credit card required.
