How to Monitor Webhook Endpoint Uptime
Your webhook endpoint goes down at 2am on Saturday. Stripe keeps sending invoice.paid events. They bounce. Stripe retries a few times over the next few hours, then gives up. By Monday morning you have a backlog of unprocessed payments and confused customers wondering why their subscriptions weren't activated.
This happens more often than anyone wants to admit.
Why Webhook Endpoints Need Monitoring
Regular web pages fail visibly. Users see an error, they complain, someone fixes it. Webhook endpoints fail silently. No user is sitting there watching your /webhooks/stripe path. The webhook provider retries for a while, maybe sends an email notification to whatever address was configured six months ago, and eventually stops trying.
The failure modes are sneaky:
- Your app is up but the webhook route crashes. The server returns 200 on /health but 500 on /webhooks/stripe because of a bad deploy. Traditional uptime monitoring won't catch this.
- TLS certificate expires. Webhook providers require HTTPS. An expired cert means instant rejection, no retries.
- Rate limiting kicks in. Your CDN or load balancer starts returning 429s to the webhook provider during traffic spikes.
- DNS changes propagate slowly. You migrated hosting, DNS updated everywhere except the region your webhook provider sends from.
You need monitoring that checks the actual webhook endpoint, not just "is the server running."
DIY: Cron + curl
The simplest approach. A cron job that hits your webhook endpoint and checks the response.
#!/bin/bash
# check_webhook.sh
ENDPOINT="https://api.yourapp.com/webhooks/stripe"

RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
  -H "Content-Type: application/json" \
  -d '{"type":"health_check"}' \
  "$ENDPOINT" \
  --max-time 10)

if [ "$RESPONSE" -ne 200 ]; then
  curl -X POST "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK" \
    -H "Content-Type: application/json" \
    -d "{\"text\":\"Webhook endpoint returned $RESPONSE\"}"
fi
Run it every five minutes with cron:
*/5 * * * * /home/deploy/check_webhook.sh
This works. It's also fragile. The cron server itself can go down. There's no retry logic. No history of past checks. One failed check fires an alert even if it was a momentary blip.
And there's a problem: you're sending a fake POST to your actual webhook handler. If your handler processes that health check as a real webhook, you might create phantom records. You either need to add health-check detection to your handler or use a separate health endpoint.
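If you go the detection route, the handler can short-circuit the probe before it touches any real processing. A minimal sketch in Go, keyed off the {"type":"health_check"} body the script above sends; the handler name, route, and port are placeholders:

package main

import (
    "encoding/json"
    "io"
    "log"
    "net/http"
)

// webhookHandler short-circuits the synthetic monitoring probe before any
// real processing, so the cron check above can't create phantom records.
func webhookHandler(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }

    var probe struct {
        Type string `json:"type"`
    }
    // Matches the {"type":"health_check"} payload sent by the script above.
    // No real provider event uses that type, so real deliveries fall through.
    if json.Unmarshal(body, &probe) == nil && probe.Type == "health_check" {
        w.WriteHeader(http.StatusOK)
        return
    }

    // ... verify the provider's signature and process the real event from body ...
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/webhooks/stripe", webhookHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}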
DIY: Dedicated Health Endpoint
Better approach — add a health check endpoint next to your webhook handler that verifies the same dependencies without processing fake data.
// Checks that the webhook handler's dependencies are healthy
func webhookHealthCheck(w http.ResponseWriter, r *http.Request) {
    // Can we reach the database?
    if err := db.Ping(); err != nil {
        http.Error(w, "database unreachable", http.StatusServiceUnavailable)
        return
    }

    // Can we reach the message queue?
    if err := queue.Ping(); err != nil {
        http.Error(w, "queue unreachable", http.StatusServiceUnavailable)
        return
    }

    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ok"))
}
Mount it at something like /webhooks/health and point your monitoring at that. This avoids the fake-webhook problem while still checking the infrastructure your webhook handler depends on.
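Wiring that up is one extra route registration. A sketch, where webhookHealthCheck is the function above and handleStripeWebhook stands in for whatever processes real deliveries (names and port are placeholders):

package main

import (
    "log"
    "net/http"
)

func main() {
    // Providers keep posting to /webhooks/stripe; the monitor polls /webhooks/health.
    http.HandleFunc("/webhooks/stripe", handleStripeWebhook)
    http.HandleFunc("/webhooks/health", webhookHealthCheck)
    log.Fatal(http.ListenAndServe(":8080", nil))
}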
Using an External Monitoring Service
Running monitoring on the same server as your app defeats the purpose. If the server goes down, your monitor goes with it.
External monitoring services run checks from outside your infrastructure. A few options:
UptimeRobot (free tier available) — Checks HTTP endpoints every 5 minutes on free, every 60 seconds on paid. Alerts via email, Slack, webhook. Simple setup but checks are basic — it just looks at the HTTP status code.
Better Stack (formerly Better Uptime) — More features. Can check response body content, set up status pages, manage on-call rotations. Starts at $24/month.
Pingdom — Been around forever. Checks from multiple geographic locations. $15/month and up.
All of these work. The tricky part is that they're general-purpose HTTP monitors. They don't understand webhooks specifically — they won't tell you that Stripe has been getting 503s for the last hour, or that your relay destinations are failing.
Webhook-Specific Monitoring
General uptime monitors answer "is this URL responding?" Webhook-specific monitoring answers "is my webhook pipeline healthy?"
That means checking:
- Is the endpoint reachable? (HTTP health check)
- Is it responding within an acceptable time? (latency threshold; see the probe sketch after this list)
- Are webhook deliveries actually succeeding? (provider-side status)
- Are downstream relays working? (if you forward webhooks)
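The first two of those are cheap to verify from outside with a probe that checks both the status code and the response time. A minimal sketch, assuming a /webhooks/health endpoint like the one above and an arbitrary 2-second latency budget:

package main

import (
    "fmt"
    "net/http"
    "os"
    "time"
)

func main() {
    const endpoint = "https://api.yourapp.com/webhooks/health" // placeholder URL
    const maxLatency = 2 * time.Second                         // pick your own budget

    client := &http.Client{Timeout: 10 * time.Second}

    start := time.Now()
    resp, err := client.Get(endpoint)
    elapsed := time.Since(start)
    if err == nil {
        defer resp.Body.Close()
    }

    switch {
    case err != nil:
        fmt.Printf("FAIL: %v\n", err)
        os.Exit(1)
    case resp.StatusCode != http.StatusOK:
        fmt.Printf("FAIL: status %d in %v\n", resp.StatusCode, elapsed)
        os.Exit(1)
    case elapsed > maxLatency:
        fmt.Printf("WARN: status 200 but took %v\n", elapsed) // slow, not down
        os.Exit(2)
    default:
        fmt.Printf("OK: status 200 in %v\n", elapsed)
    }
}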
Checking From the Provider Side
Stripe, GitHub, and most providers show delivery status in their dashboards. Stripe's webhook dashboard shows success rates, recent failures, and lets you retry individual deliveries.
But checking dashboards manually is not monitoring. Some providers offer programmatic access:
# List recent events whose webhooks weren't all delivered, via the Stripe API
curl "https://api.stripe.com/v1/events?delivery_success=false&limit=10" \
  -u sk_live_your_key:
You could build a script that polls this and alerts on failure spikes. It's a lot of custom work per provider.
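A rough sketch of that per-provider work for Stripe, polling the same delivery_success=false filter from its list-events API and flagging a spike; the threshold, the API key environment variable, and what you do with the alert are placeholders:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "os"
)

func main() {
    apiKey := os.Getenv("STRIPE_API_KEY") // placeholder: however you store secrets

    // Events that are still pending or have failed webhook delivery.
    req, _ := http.NewRequest("GET",
        "https://api.stripe.com/v1/events?delivery_success=false&limit=100", nil)
    req.SetBasicAuth(apiKey, "")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        fmt.Fprintln(os.Stderr, "stripe request failed:", err)
        os.Exit(1)
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        fmt.Fprintln(os.Stderr, "stripe returned", resp.Status)
        os.Exit(1)
    }

    var list struct {
        Data []struct {
            ID   string `json:"id"`
            Type string `json:"type"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
        fmt.Fprintln(os.Stderr, "decode failed:", err)
        os.Exit(1)
    }

    // Alert on a spike; 10 is an arbitrary placeholder threshold.
    if len(list.Data) > 10 {
        fmt.Printf("ALERT: %d events with undelivered webhooks (e.g. %s %s)\n",
            len(list.Data), list.Data[0].Type, list.Data[0].ID)
        os.Exit(1)
    }
    fmt.Printf("ok: %d undelivered\n", len(list.Data))
}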
ThunderHooks Monitors
ThunderHooks has built-in monitoring that checks your webhook endpoints on a schedule. You configure:
- URL to check — your webhook endpoint or health check path
- Check interval — how often to run the check (30 seconds to 5 minutes depending on your plan)
- Alert webhook — a URL that receives a notification when checks fail
When a check fails multiple consecutive times, ThunderHooks sends a POST request to your alert webhook with details about the failure — status code, response time, error message. You can point that at Slack, PagerDuty, or whatever your team uses for alerting.
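On the receiving side, the alert webhook is just another small HTTP handler. A sketch that reposts failures to Slack; the payload field names here are guesses for illustration, not ThunderHooks' documented schema, so adjust them to whatever the real alert body contains:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// alert mirrors the failure details mentioned above; the exact field names
// are assumptions for this sketch, not a documented schema.
type alert struct {
    URL          string `json:"url"`
    StatusCode   int    `json:"status_code"`
    ResponseTime int    `json:"response_time_ms"`
    Error        string `json:"error"`
}

func main() {
    slackURL := "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

    http.HandleFunc("/alerts/thunderhooks", func(w http.ResponseWriter, r *http.Request) {
        var a alert
        if err := json.NewDecoder(r.Body).Decode(&a); err != nil {
            http.Error(w, "bad payload", http.StatusBadRequest)
            return
        }
        msg := fmt.Sprintf("Monitor failed: %s returned %d in %dms (%s)",
            a.URL, a.StatusCode, a.ResponseTime, a.Error)
        body, _ := json.Marshal(map[string]string{"text": msg})
        if _, err := http.Post(slackURL, "application/json", bytes.NewReader(body)); err != nil {
            log.Println("slack post failed:", err)
        }
        w.WriteHeader(http.StatusOK)
    })

    log.Fatal(http.ListenAndServe(":9090", nil))
}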
Each check costs 1 credit. A monitor running every 5 minutes uses about 8,640 credits per month. Every 1 minute is 43,200. Plan accordingly.
What Should Your Health Check Actually Check?
A health endpoint that just returns 200 isn't very useful. It tells you the process is running, not that it can handle webhooks.
Good health checks verify:
Database connectivity. Your webhook handler probably writes to a database. If the DB is down, webhooks will fail even though the HTTP endpoint is "up."
Queue connectivity. If you use a message queue for async processing, check that you can publish to it.
Disk space. If you log webhook payloads to disk, a full disk means silent data loss.
Memory pressure. A Go process that's approaching its memory limit will start GC thrashing and respond slowly. By the time it OOMs, webhooks have been timing out for minutes.
Don't check things that your webhook handler doesn't depend on. If your health check calls the Stripe API and Stripe is slow, your monitor will fire false alerts.
@app.route('/webhooks/health')
def webhook_health():
    checks = {}

    # Database
    try:
        db.session.execute(text('SELECT 1'))
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = str(e)
        return jsonify(checks), 503

    # Redis (if used for rate limiting or caching)
    try:
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = str(e)
        return jsonify(checks), 503

    return jsonify(checks), 200
Alert Fatigue Is Real
A monitor that fires on every single failed check will burn out your team in a week. One transient timeout shouldn't wake anyone up.
Good alerting practices:
- Require consecutive failures before alerting (see the sketch after this list). Two or three failed checks in a row is a real problem. One failed check is probably nothing.
- Differentiate severity. A 500 error is urgent. A slow response (>5s but still 200) is a warning. A certificate expiring in 14 days is informational.
- Include context in alerts. "Webhook endpoint down" is useless at 3am. "POST /webhooks/stripe returned 503, last successful check 10 minutes ago, response body: 'database connection refused'" — that you can act on.
- Set up recovery notifications. Knowing when the problem resolved is as important as knowing it started.
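A sketch of the first and last points: a check loop that only alerts after three consecutive failures and announces recovery when the endpoint comes back. The URL, interval, and println-style alerts are placeholders for whatever your pager integration is:

package main

import (
    "fmt"
    "net/http"
    "time"
)

const failureThreshold = 3 // consecutive failures before anyone gets paged

func checkOnce(url string) error {
    client := &http.Client{Timeout: 10 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("status %d", resp.StatusCode)
    }
    return nil
}

func main() {
    url := "https://api.yourapp.com/webhooks/health" // placeholder
    failures := 0
    alerting := false

    for range time.Tick(time.Minute) {
        if err := checkOnce(url); err != nil {
            failures++
            if failures == failureThreshold && !alerting {
                alerting = true
                fmt.Printf("ALERT: %s failing (%v), %d checks in a row\n", url, err, failures)
            }
            continue
        }
        if alerting {
            fmt.Printf("RECOVERED: %s healthy again after %d failed checks\n", url, failures)
        }
        failures, alerting = 0, false
    }
}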
Monitoring Checklist
Setting up webhook monitoring from scratch:
- Health check endpoint deployed alongside webhook handler
- Health check verifies actual dependencies (DB, queue, etc.)
- External monitor configured (not on the same server)
- Check interval matches your SLA needs
- Alerts go to the right channel (Slack, PagerDuty, email)
- Consecutive failure threshold set (not alerting on single failures)
- Recovery notifications enabled
- TLS certificate expiry monitoring in place (see the sketch below)
- Alert includes actionable context (status code, response body)
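For the certificate line item, a small standalone probe can report how many days are left before expiry; the host and the 14-day warning window are placeholders:

package main

import (
    "crypto/tls"
    "fmt"
    "os"
    "time"
)

func main() {
    host := "api.yourapp.com:443"      // placeholder host
    warnWindow := 14 * 24 * time.Hour  // warn two weeks out, as suggested above

    conn, err := tls.Dial("tcp", host, nil)
    if err != nil {
        fmt.Fprintln(os.Stderr, "TLS handshake failed:", err)
        os.Exit(1)
    }
    defer conn.Close()

    // The leaf certificate is the first one the server presents.
    cert := conn.ConnectionState().PeerCertificates[0]
    remaining := time.Until(cert.NotAfter)

    if remaining < warnWindow {
        fmt.Printf("WARN: certificate for %s expires in %.0f days\n", host, remaining.Hours()/24)
        os.Exit(1)
    }
    fmt.Printf("ok: certificate valid for %.0f more days\n", remaining.Hours()/24)
}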