How to Monitor Cron Jobs with Heartbeat Pings (Dead Man's Switch)

By ThunderHooks Team · 7 min read

Your nightly database backup has been silently failing for 23 days.

You don't know this yet. The cron job runs at 3am. It used to dump your Postgres database to S3. Then someone rotated the AWS credentials and forgot to update the backup server. The aws s3 cp command started exiting with a non-zero code, the dump never left the server, and nothing told anyone.

You find out because a developer drops a production table and asks you to restore from backup. The most recent good backup is over three weeks old.

This is not a hypothetical. It happens all the time.

What Is a Dead Man's Switch?

The concept comes from trains. The engineer holds down a pedal. If they let go (because they're incapacitated), the train brakes automatically. No signal means something is wrong.

In software, a dead man's switch (also called a heartbeat monitor) works the same way. Your cron job sends a "ping" when it finishes successfully. If the ping doesn't arrive within the expected window, an alert fires.

It's backwards from normal monitoring. You're not watching for an error. You're watching for the absence of a signal.

This matters because cron jobs fail in ways that produce no signal at all:

  • The cron daemon itself crashes or stops (familiar to anyone who has upgraded a server without checking that cron and the crontab survived)
  • The script errors early and exits before reaching any logging or notification code
  • The server runs out of disk and the job can't write output
  • Someone comments out the crontab entry during debugging and forgets to uncomment it
  • PATH isn't set in the cron environment and the script can't find pg_dump

Traditional monitoring can't catch these. There's nothing to catch. The job just... doesn't run.

The DIY Approaches (and Their Problems)

Cron MAILTO

Cron has a built-in notification mechanism. Set MAILTO=ops@yourcompany.com at the top of your crontab and cron will email stdout/stderr of every job.

MAILTO=ops@yourcompany.com

0 3 * * * /opt/scripts/backup.sh

Two problems. First: cron only emails when there's output. If the script produces no output (or if the failure happens before any output), no email. Second: the emails go to a mailbox that nobody checks. After a week of getting "backup completed" emails every night, the ops team creates a filter rule and never looks at them again.

Also, sending email from a server requires a working MTA. If you're on a minimal cloud VM, sendmail probably isn't configured.

Checking Exit Codes with a Wrapper

Better. Wrap your cron job in a script that checks the exit code:

#!/bin/bash
# wrapper.sh

/opt/scripts/backup.sh
EXIT_CODE=$?

if [ $EXIT_CODE -ne 0 ]; then
    curl -X POST "https://hooks.slack.com/services/T00/B00/xxx" \
        -H "Content-Type: application/json" \
        -d "{\"text\":\"Backup script failed with exit code $EXIT_CODE\"}"
fi

This catches failures that produce a non-zero exit code. But it still doesn't catch the case where the job never runs. If cron itself is dead, or the crontab entry is missing, the wrapper never executes. No execution means no alert. You're back to the same blind spot.

Logging to a File

0 3 * * * /opt/scripts/backup.sh >> /var/log/backup.log 2>&1

Now you have a log file. Great. Who's reading it? You could write another cron job that checks whether the log file was modified recently. But then who monitors that cron job?
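
That second job might look like the sketch below. It assumes the log path from the crontab line above and a 25-hour threshold, and the function name log_is_stale is made up for illustration. It relies on find's -mmin test, which both GNU and BSD find provide.

```shell
#!/bin/bash
# Hypothetical freshness check: succeeds (exit 0) when the file is missing
# or was last modified more than max_age_minutes ago.
#   $1 = path to the log file, $2 = maximum age in minutes
log_is_stale() {
    file=$1
    max_age_minutes=$2

    # A missing log counts as stale.
    [ -e "$file" ] || return 0

    # find prints the path only if its mtime is older than the threshold.
    [ -n "$(find "$file" -mmin +"$max_age_minutes" 2>/dev/null)" ]
}

# Example use from a second cron entry (25 hours = 1500 minutes):
#   log_is_stale /var/log/backup.log 1500 && echo "backup log is stale"
```

This check is itself a cron job on the same machine, so it goes silent under exactly the same conditions as the backup it watches.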

You can see the problem. Every DIY approach has the same gap: if the job doesn't run, nothing happens.

Heartbeat Monitoring Fixes This

The fix is to flip the model. Instead of watching for failures, you expect a success signal and alert when it doesn't arrive.

Here's how it works:

  1. You create a heartbeat monitor with an expected schedule (e.g., "every 24 hours") and a grace period (e.g., "15 minutes late is okay")
  2. You get a unique ping URL
  3. Your cron job hits that URL when it finishes successfully
  4. If the ping doesn't arrive within the expected window + grace period, you get an alert

The monitoring service runs outside your infrastructure. It doesn't care if your server crashed, if cron died, if your disk is full. It only knows one thing: did the ping arrive on time or not?
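
The decision on the monitoring side boils down to one comparison. The sketch below only illustrates that logic and is not ThunderHooks' actual implementation. The function name and argument order are hypothetical, and all times are in seconds.

```shell
#!/bin/bash
# Hypothetical monitor-side check. Succeeds (exit 0) when a ping is overdue.
#   $1 = time of the last ping, $2 = expected period,
#   $3 = grace period,          $4 = current time
heartbeat_overdue() {
    deadline=$(( $1 + $2 + $3 ))
    [ "$4" -gt "$deadline" ]
}

# Daily job (86400 s) with a 15-minute grace period (900 s).
# The last ping arrived at t=1000, so the deadline is t=88300.
if heartbeat_overdue 1000 86400 900 88301; then
    echo "ALERT: heartbeat missed"
fi
```

Note that the deadline is measured from the last ping, not from the scheduled start time, which is why the grace period has to absorb day-to-day differences in when the job finishes.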

Setting Up Heartbeat Monitoring in ThunderHooks

ThunderHooks has built-in heartbeat monitoring. Here's the quick version.

Create a heartbeat monitor in your dashboard. You'll configure:

  • Name — something human-readable like "Production DB Backup"
  • Expected period — how often the job should run (e.g., 86400 seconds for daily)
  • Grace period — how many seconds late is acceptable before alerting (default: 60 seconds, but for a daily backup, you'd want something like 300-900 seconds)
  • Alert destination — email address, Slack webhook, or any URL that accepts a POST

You'll get a ping URL with a unique slug:

https://thunderhooks.com/ping/a1b2c3d4e5f6

Now add a curl to the end of your backup script:

#!/bin/bash
# backup.sh

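set -e           # Exit on any error
set -o pipefail  # A pg_dump failure fails the whole pipeline, not just gzip

# Compute the filename once so every step uses the same path
BACKUP_FILE="/tmp/backup_$(date +%Y%m%d).sql.gz"

echo "Starting backup at $(date)"

pg_dump -h localhost -U app_user production_db | gzip > "$BACKUP_FILE"

aws s3 cp "$BACKUP_FILE" s3://my-backups/postgres/

rm "$BACKUP_FILE"

echo "Backup completed at $(date)"

# Signal success: only reached if everything above succeeded
curl -fsS --retry 3 --max-time 10 https://thunderhooks.com/ping/a1b2c3d4e5f6

The -f flag makes curl exit with an error on HTTP responses of 400 and above, -s hides the progress meter, and -S still prints an error message if the request fails. --retry 3 handles transient network blips, and --max-time 10 caps each attempt at 10 seconds. Because of set -e and set -o pipefail, the curl only runs if pg_dump, gzip, and aws s3 cp all succeeded. Without pipefail, the exit status of pg_dump | gzip is gzip's alone, so a failed pg_dump would go unnoticed.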

If the script fails at any point, the ping never fires. ThunderHooks waits for the expected period plus the grace period, and then sends an alert.
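
If you'd rather not edit each script, the same idea can live in a small wrapper. This is a sketch, not a ThunderHooks tool: run_then_ping and send_ping are hypothetical names, and the URL is the example one from above.

```shell
#!/bin/bash
# Hypothetical helper: send the success ping to the given URL.
send_ping() {
    curl -fsS --retry 3 --max-time 10 "$1"
}

# Hypothetical wrapper: run a command, ping only if it exited 0,
# and pass the command's exit code through to the caller.
#   $1 = ping URL, remaining arguments = command to run
run_then_ping() {
    ping_url=$1
    shift

    status=0
    "$@" || status=$?

    if [ "$status" -eq 0 ]; then
        send_ping "$ping_url"
    fi
    return "$status"
}

# Example use from a script that cron calls:
#   run_then_ping https://thunderhooks.com/ping/a1b2c3d4e5f6 /opt/scripts/backup.sh
```

The wrapper only sees the final exit code, so the wrapped script still has to exit non-zero when one of its steps fails.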

Reporting Failures Explicitly

You can also report failures explicitly with the /fail endpoint:

#!/bin/bash
# backup.sh

set -o pipefail  # Without this, $? after the pipeline is gzip's status, not pg_dump's

PING_URL="https://thunderhooks.com/ping/a1b2c3d4e5f6"

pg_dump -h localhost -U app_user production_db | gzip > /tmp/backup.sql.gz
if [ $? -ne 0 ]; then
    curl -fsS --retry 3 "$PING_URL/fail"
    exit 1
fi

aws s3 cp /tmp/backup.sql.gz s3://my-backups/postgres/
if [ $? -ne 0 ]; then
    curl -fsS --retry 3 "$PING_URL/fail"
    exit 1
fi

curl -fsS --retry 3 "$PING_URL"

This gives you faster alerts for known failure modes. You don't have to wait for the grace period to expire — the failure ping triggers an alert right away.
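
Repeating the check after every command gets tedious as scripts grow. One alternative, sketched here, is to pick the URL from the script's exit status inside an EXIT trap. The /fail suffix is the one shown above, and the function name ping_target is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print the URL to ping for a given exit status.
#   $1 = base ping URL, $2 = exit status of the job
ping_target() {
    if [ "$2" -eq 0 ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$1/fail"
    fi
}

# Sketch of use inside backup.sh (commented out so this snippet
# makes no network calls when pasted into a shell):
#
#   PING_URL="https://thunderhooks.com/ping/a1b2c3d4e5f6"
#   trap 'curl -fsS --retry 3 --max-time 10 "$(ping_target "$PING_URL" "$?")"' EXIT
#   set -e
#   set -o pipefail
#   ... backup commands ...
```

A trap cannot run if the process is killed with SIGKILL or the machine loses power, so the missed-heartbeat timeout remains the backstop for those cases.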

How ThunderHooks Compares to Alternatives

A few services do heartbeat monitoring. Here's how they stack up.

Healthchecks.io

Healthchecks.io is open source and solid. The hosted version starts at $20/month for 100 checks. The free tier gives you 20 checks and stores 100 pings per check. It's been around since 2015 and has a good reputation.

If you want to self-host it, you can — it's a Django app. But then you're maintaining another service. And if the server hosting your monitoring goes down at the same time as the server running your cron jobs (maybe they're the same server?), you're back to square one.

Cronitor

Cronitor is more feature-rich. It can monitor cron jobs, heartbeats, and even full web pages. The team plan starts at $49/month, and the developer plan at $12.50/month gives you 15 monitors. Good product, though the price adds up if you have lots of jobs.

ThunderHooks

ThunderHooks heartbeat monitoring is part of the broader webhook platform. The free tier includes 3 heartbeat monitors with a 5-minute minimum interval. Pings are free — they don't consume credits. Pro plans ($19/month) go up to 20 heartbeats with 1-minute intervals.

If you're already using ThunderHooks for webhook testing or relay, heartbeats are built in. No second tool, no second bill.

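| Feature          | Healthchecks.io (Free) | Healthchecks.io ($20/mo) | Cronitor ($12.50/mo) | ThunderHooks (Free) |
|------------------|------------------------|--------------------------|----------------------|---------------------|
| Heartbeats       | 20                     | 100                      | 15                   | 3                   |
| Min interval     | 1 min                  | 1 min                    | 1 min                | 5 min               |
| Ping cost        | Free                   | Free                     | Free                 | Free                |
| Alert channels   | Email, Slack, etc.     | All                      | All                  | Email, webhook      |
| Self-host option | Yes                    | N/A                      | No                   | No                  |
| Webhook testing  | No                     | No                       | No                   | Yes                 |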

The tradeoffs are pretty clear. Need lots of heartbeats and want to self-host? Healthchecks.io. Need full-featured cron monitoring with job telemetry? Cronitor. Already doing webhook development and want heartbeats alongside it? ThunderHooks.

Things That Trip People Up

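Forgetting set -e in bash scripts. Without it, a failed command doesn't stop the script. Your pg_dump fails, the script continues, and the success ping fires anyway. Be aware that set -e alone ignores failures on the left side of a pipe: in pg_dump | gzip, only gzip's exit status counts unless set -o pipefail is on. Always use set -e together with set -o pipefail, or check exit codes manually.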
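
The pipeline case is easy to verify for yourself. In this small demonstration, false stands in for a failing pg_dump and cat stands in for gzip. Each case runs in a fresh bash process so the option doesn't leak into your current shell.

```shell
#!/bin/bash
# Exit status of a pipeline whose first command fails.

# Default behaviour: the pipeline reports the status of the last command (cat).
status_default=$(bash -c 'false | cat; echo $?')

# With pipefail: the pipeline reports the failing command's status.
status_pipefail=$(bash -c 'set -o pipefail; false | cat; echo $?')

echo "without pipefail: $status_default"
echo "with pipefail:    $status_pipefail"
```

The first line reports 0 and the second reports 1, which is why a backup script that pipes pg_dump into gzip needs pipefail for set -e to notice a failed dump.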

Pinging at the start instead of the end. If you put the curl at the top of your script, you're telling the monitor "I started" not "I finished." The backup could fail after the ping. Always ping last.

Grace periods that are too tight. A daily backup that sometimes takes 20 minutes shouldn't have a 30-second grace period. Give yourself room. If the job normally takes 10 minutes, a 15-minute grace period is fine. Don't alert on normal variance.

Not testing the alert path. Create a heartbeat, don't send a ping, and wait. Does the alert actually arrive? Check your Slack webhook URL. Check your email spam folder. Do this before you rely on it.

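Cron environment vs. shell environment. Cron doesn't load your .bashrc. If curl is in /usr/local/bin and cron's PATH is just /usr/bin:/bin, the ping will fail. Use the full path to each command (run command -v curl in your shell to find it, for example /usr/local/bin/curl), or set PATH explicitly at the top of the crontab.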
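
One way to catch this before the job runs for real is to look commands up in a stripped-down environment. The sketch below assumes cron's PATH is /usr/bin:/bin, as in the example above; the actual default varies by system, and find_in_cron_path is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: look up a command the way a cron job would,
# with an empty environment and a minimal PATH.
# Prints the resolved path and succeeds, or fails if nothing is found.
find_in_cron_path() {
    env -i PATH=/usr/bin:/bin /bin/sh -c 'command -v "$1"' sh "$1"
}

# Example: check the commands backup.sh depends on.
for cmd in curl pg_dump aws gzip; do
    if ! find_in_cron_path "$cmd" >/dev/null; then
        echo "not on cron's PATH: $cmd"
    fi
done
```

Any command the loop reports needs a full path in the script or a PATH line in the crontab.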

Setup Checklist

Getting heartbeat monitoring right from the start:

  • Create a heartbeat monitor for each critical cron job
  • Set expected period to match your cron schedule
  • Set grace period with room for normal runtime variance
  • Add curl ping to the end of each script (after all critical work)
  • Use set -e (with set -o pipefail for piped commands) or explicit exit code checks so the ping only fires on success
  • Use --retry 3 and --max-time 10 on the curl command
  • Use full paths for commands in cron scripts (/usr/bin/curl, /usr/bin/pg_dump)
  • Test the alert by letting a heartbeat expire intentionally
  • Set up /fail pings for known failure modes that need immediate alerts
  • Document which team member gets the alert and what they should do
