Frequently Asked Questions

Why didn't my automation platform alert me that a workflow stopped?

Because the built-in alert fires on a failed run, and a stopped automation produces no runs to fail. If a workflow is deactivated, its OAuth credential expires at the trigger, the schedule quietly stops firing, or the instance is down, there is no execution and therefore no error event. The platform stays silent because, from its point of view, nothing went wrong. Catching this needs a heartbeat that alerts on the absence of an expected run, not on the presence of an error.

What is a dead man's switch in automation monitoring?

It is a check that expects a signal at a known interval and alerts you when the signal does not arrive. Your workflow pings a monitoring service every time it finishes successfully. If the expected ping is late, the service notifies you. Tools like Healthchecks.io and Cronitor are built for this. It is the inverse of error alerting: it watches for silence instead of watching for failure.

Does Zapier turn off my Zap and tell me?

Zapier automatically turns off a Zap if 95 percent of its runs error in the last 7 days, and on Team and Enterprise plans it emails the account owner first with a grace period (24 hours on Team, 72 hours on Enterprise). But that only triggers when runs are happening and erroring. A Zap whose trigger stops returning new items, or one you turned off by hand, can sit dead without any error to cross that threshold.

Can I monitor automations without a third-party service?

Yes. Run a separate scheduled watchdog workflow that queries your own run log and alerts if there has been no successful run in the last N hours. It works, but it has the same weakness as any in-platform check: if the whole instance or account is down, the watchdog is down too. A heartbeat hosted outside your automation platform survives the platform itself failing, which is why we usually pair the two.
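
As a rough sketch of that watchdog's core test, assuming a run log you can read as a list of records, the check could look like the following. The field names `status` and `finished_at` and the function `needs_alert` are hypothetical, not part of any platform's API.

```python
from datetime import datetime, timedelta, timezone


def needs_alert(run_log, now, max_silence_hours):
    """Return True if no successful run finished within the last N hours.

    run_log is assumed to be a list of dicts with hypothetical keys
    "status" (str) and "finished_at" (timezone-aware datetime).
    """
    cutoff = now - timedelta(hours=max_silence_hours)
    has_recent_success = any(
        run["status"] == "success" and run["finished_at"] >= cutoff
        for run in run_log
    )
    # An empty log counts as silence, so it also triggers the alert.
    return not has_recent_success


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
log = [
    {"status": "success", "finished_at": now - timedelta(hours=5)},
    {"status": "error", "finished_at": now - timedelta(minutes=10)},
]
print(needs_alert(log, now, max_silence_hours=2))  # True
```

The recent failed run does not reset the clock here. Only a success counts as proof of life, which is the same rule an external heartbeat applies.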

What should the alert threshold be for a heartbeat?

Set the expected period to how often the workflow should succeed, plus a grace window for normal variation. A flow that should finish every hour might use a 60-minute period and a 15-minute grace, so you are paged only after 75 minutes of silence. Too tight and you get false alarms on slow runs; too loose and a real outage goes unnoticed for hours. Tune it to how fast you actually need to know.

How to get alerted when an automation stops running

Your automation platform's built-in error alert has a blind spot, and it is the failure that costs the most. Error alerts fire when a workflow runs and a step fails. They cannot fire when the workflow never runs at all. A deactivated flow, an expired OAuth token at the trigger, a schedule that quietly stopped, the instance itself being down: none of these produce an error, because none of them produce a run. To catch the automation that silently stopped, you need the inverse of error alerting. You need a heartbeat that pages you when an expected run does not arrive.

This is the failure operators discover the slow way. The intake automation that has run flawlessly for four months stops one Tuesday, and nobody notices until Friday when a customer asks why they never heard back. The platform was silent the whole time, and it was right to be, by its own logic. Here is why error alerts miss this, and the three-layer setup that closes the gap.

Why your error alerts have a blind spot

Every error notification in n8n, Zapier, and Make is downstream of an execution. Something has to run before something can fail. n8n's error workflow, set with an Error Trigger node, fires when a workflow execution fails. Zapier surfaces failures in Zap History and will turn a Zap off if 95 percent of its runs error in the last 7 days. Make routes a failed run to Incomplete Executions when you set a Break directive. All three are real and worth turning on. All three share the same precondition: a run has to happen.

So picture the failures that produce no run. The workflow got deactivated, maybe by you during a fix, maybe by the platform after a string of errors, and never got turned back on. The trigger's OAuth credential expired, so the polling trigger cannot even start. The scheduled trigger stopped firing after a platform incident and did not resume. The self-hosted n8n container crashed and did not restart. In every one of these, the count of failed runs is zero, because the count of runs is zero. There is nothing for an error alert to react to. The automation is dead and the dashboard is green.

This is not a bug in those platforms. It is a category they do not cover. Error monitoring answers "did a run fail?" It does not answer "is this thing still alive?" Those are different questions, and the second one needs a different tool.

The two kinds of failure, and what catches each

Sort every automation failure into two buckets by one test: does it produce an error event?

Failure	Produces an error event?	What catches it
A step throws mid-run (bad data, API 500, timeout)	Yes	Built-in error alert or error workflow
A run partially completes then fails	Yes	Error workflow, Make Incomplete Executions
Workflow deactivated and not re-enabled	No	Heartbeat / dead man's switch
Trigger credential (OAuth) expired	Usually no	Heartbeat, plus credential-expiry reminders
Schedule silently stopped firing	No	Heartbeat
Self-hosted instance down	No	Heartbeat (hosted off the instance)
Runs fine but processes zero real items	No	Business-metric / volume check

The top two rows are the ones your platform already handles. The bottom five are the ones that take down a working automation for days, and none of them reliably shows up as an error. The last row is the sneakiest of all, and we will come back to it, because a flow that runs green while doing nothing useful is its own failure mode.

What "monitoring on absence" actually means

A heartbeat, also called a dead man's switch, flips the logic. Instead of waiting for something to go wrong, it waits for something to go right, and alerts you when that something is late. Your workflow sends a short HTTP ping to a monitoring service every time it finishes successfully. The service knows how often to expect that ping. As long as pings arrive on time, it stays quiet. The moment one is overdue past its grace window, it pages you.

Healthchecks.io describes exactly this model: it listens for pings from your jobs and stays silent while they arrive on schedule, then raises an alert as soon as one does not. The reason it works for the silent failures is that it does not depend on your automation running. It depends on your automation having run. If the flow is deactivated, the ping never comes, and the absence is the signal. The monitor lives outside your automation platform, so it survives the platform itself going down, which an in-platform check cannot.

You set two numbers per check: the expected period and a grace time. A flow that should finish hourly gets a 60-minute period and maybe a 15-minute grace, so you are alerted after 75 minutes of silence rather than on the first slightly slow run. That grace window is what keeps a heartbeat from crying wolf on normal variation.
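
The period-plus-grace arithmetic can be sketched in a few lines. This is an illustration of the logic, not any monitoring service's actual implementation. The function name and the three state labels are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_status(last_ping, now, period_minutes, grace_minutes):
    """Classify a check by how long it has been silent.

    "up":   the last ping is within the expected period.
    "late": the period has passed but the grace window has not.
    "down": silence has exceeded period + grace, so an alert is due.
    """
    elapsed = now - last_ping
    period = timedelta(minutes=period_minutes)
    deadline = period + timedelta(minutes=grace_minutes)
    if elapsed <= period:
        return "up"
    if elapsed <= deadline:
        return "late"
    return "down"


last_ping = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Hourly flow, 15-minute grace: 70 minutes of silence is only "late".
print(heartbeat_status(last_ping, last_ping + timedelta(minutes=70), 60, 15))  # late
# 76 minutes is past the 75-minute deadline.
print(heartbeat_status(last_ping, last_ping + timedelta(minutes=76), 60, 15))  # down
```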

What each platform gives you, and the gap it leaves

This is the part most "monitor your automations" advice skips. Each tool has real built-in handling, and each leaves the same hole.

Platform	Built-in failure handling	What it does not catch
n8n	Error Trigger workflow fires on a failed execution; node-level retries	An inactive workflow, a dead instance, or a schedule that never fired produces no failed execution to trigger on
Zapier	Zap History plus auto-off when 95 percent of runs error in 7 days, with an owner email and a 24h (Team) or 72h (Enterprise) grace period	A trigger that stops returning items, or a manually-paused Zap, never crosses the error threshold, so no email is sent
Make	Break directive sends failed runs to Incomplete Executions; scenario auto-deactivates after repeated errors with a notice	A scenario you forgot to re-enable, or one whose trigger went quiet, generates no error and no deactivation notice

Read down the right column. It is the same sentence three times: the safety net is woven from error events, and these failures throw none. That is not a knock on the tools. It is the precise reason a heartbeat is not optional for any automation you actually depend on.

A three-layer alerting setup you can build this week

You do not need an observability stack. You need three layers, each answering a different question, and you can stand all three up in an afternoon.

Layer 1, the heartbeat: did it run at all? Create one check per critical workflow in a heartbeat service. Add a final step to the workflow that pings the check's URL only on success. In n8n that is an HTTP Request node at the end of the happy path. In Zapier it is a Webhooks by Zapier POST as the last action. In Make it is an HTTP module after your last real step. Set the period and grace to match the schedule. This single layer catches four of the five silent failures from the table above (the deactivated workflow, the expired credential, the stopped schedule, and the dead instance), because each of them stops the ping. The fifth, a flow that runs green while processing nothing, still sends its ping, which is why Layer 3 exists.
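
If the job is a script rather than a visual workflow, the same success-only ping can be sketched in Python with the standard library. The URL below is a placeholder in the style of a Healthchecks.io ping address, and the helper names `send_success_ping` and `run_with_heartbeat` are hypothetical. Treat this as a minimal sketch rather than a recommended client.

```python
import urllib.request

# Placeholder: substitute the ping URL your heartbeat service gives you.
PING_URL = "https://hc-ping.com/your-check-uuid"


def send_success_ping(ping_url, timeout=10, opener=urllib.request.urlopen):
    """Send one HTTP GET to the check URL; return True on a 2xx response.

    Network errors are swallowed on purpose: a failed ping should not crash
    the job, and the missing ping will surface as an alert anyway.
    """
    try:
        with opener(ping_url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False


def run_with_heartbeat(job, ping_url, ping=send_success_ping):
    """Run the job, then ping only if it finished without raising."""
    job()  # an exception here propagates, so the ping below is skipped
    return ping(ping_url)


# Usage (not executed here): run_with_heartbeat(my_job, PING_URL)
```

The design choice that matters is the ordering: the ping is the last statement on the happy path, so any earlier failure, or the script never starting, leaves the check silent.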

Layer 2, the error alert: did a run fail? Turn on what the platform already offers. Build the n8n Error Trigger workflow that writes failures to Slack or a table. Set Make modules to Break so failures land in Incomplete Executions. In Zapier, confirm error notifications go to an inbox a human reads, not a shared alias nobody checks. This layer catches the loud failures, the top two rows.
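
Whatever the platform hands your error workflow, the alert itself is usually a small JSON message. The sketch below builds the body for a Slack incoming webhook, which accepts a JSON object with a `text` field. The failure record and its keys (`workflow_name`, `failed_step`, `error_message`) are hypothetical and do not mirror any platform's actual error payload.

```python
import json


def build_slack_alert(failure):
    """Turn a failure record into the JSON body for a Slack incoming webhook.

    `failure` is an assumed dict with the hypothetical keys
    "workflow_name", "failed_step" and "error_message".
    """
    text = (
        f"Workflow '{failure['workflow_name']}' failed at step "
        f"'{failure['failed_step']}': {failure['error_message']}"
    )
    return json.dumps({"text": text}).encode("utf-8")


body = build_slack_alert(
    {
        "workflow_name": "Lead intake",
        "failed_step": "Create CRM record",
        "error_message": "API returned 500",
    }
)
print(json.loads(body)["text"])
```

Naming the workflow and the failed step in the message is what lets whoever reads the channel act without opening the platform first.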

Layer 3, the business-metric check: did it do the right amount of work? This is the layer almost nobody builds, and it catches the failure that hides in plain sight. An intake flow that normally creates 15 to 30 CRM leads a day can run green while creating zero, because an upstream form provider changed its payload and now every record fails a filter silently. No error, healthy heartbeat, empty pipeline. The fix is a second scheduled workflow that counts the real output over a window and alerts if it falls outside a sane range. "Alert if today's processed-lead count is under five by 4 p.m." catches the quiet drift that the first two layers wave through. If you already keep a structured run log, this check reads straight off it. (For more on why working automations break without a sound, see our piece on why automations silently break.)
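
The volume check reduces to two small comparisons. Here is one way it might look, assuming you can already count today's processed items from your run log or CRM. Both function names and the default thresholds are illustrative, taken from the example figures above rather than from any recommended setting.

```python
from datetime import datetime


def count_out_of_range(count, low, high):
    """True if a finished window's output falls outside the sane range."""
    return count < low or count > high


def under_minimum_by_deadline(count, now, minimum=5, deadline_hour=16):
    """True if the deadline hour has passed and output is still under the minimum.

    Before the deadline this returns False, so a slow morning does not page anyone.
    """
    return now.hour >= deadline_hour and count < minimum


# Zero leads at 4:05 p.m. trips the alert; zero leads at 10 a.m. does not.
print(under_minimum_by_deadline(0, datetime(2024, 1, 2, 16, 5)))  # True
print(under_minimum_by_deadline(0, datetime(2024, 1, 2, 10, 0)))  # False
# A full day that produced 2 leads against a normal 15 to 30 is out of range.
print(count_out_of_range(2, low=15, high=30))  # True
```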

Three questions, three layers: is it alive, did a run fail, and is it producing the right volume. Most teams build the middle one, assume it covers them, and find out otherwise during an outage.

How to start

Pick your single highest-stakes automation, the one whose silent death would cost a customer or a sale, and give it Layer 1 today. Sign up for a heartbeat service, create one check, add the success ping as the last step, and set the period to the flow's real cadence. That one move converts your worst blind spot into a page within the hour. Add Layer 2 and Layer 3 to that same flow over the week, then repeat for the next automation down your list. You do not have to instrument everything. You have to instrument the ones you cannot afford to lose quietly.

This is the operational layer we build into every system we ship, alongside the run logging and dead-letter handling that make a failure visible instead of invisible. If you want monitoring designed in from the start, that is the core of our workflow automation systems and operational intelligence systems. Or bring us the automation that would hurt most if it died on a Friday, and we will help you wire the alert that wakes you before your customer does.

About Alexey Yushkin
