How to test an AI automation before you trust it
To test an AI automation, pull 30 to 50 real past cases from your execution history, label the correct output by hand, run the AI step against them, and compare its error cost to the human or rule it replaces. Ship when the worst-case error is caught by a downstream guard and total error cost is at or below the baseline, not when accuracy hits some round number.
Test an AI automation step the same way you would decide whether to hand a task to a new hire: not against perfection, but against whoever or whatever is doing the job now. Pull 30 to 50 real cases from your execution history, label the correct output by hand, run the AI step against them, and look at where it is wrong and what each wrong answer costs. Then ship if two things are true: the AI makes fewer expensive mistakes than the human or rule it replaces, and its worst possible mistake is caught by a guard downstream. The number most people chase, "is it 95 percent accurate," is the wrong question, because a single accuracy figure hides the one error that actually sinks a case.
That is the whole method, and it fits in an afternoon and a spreadsheet. The rest of this is how to run it without fooling yourself, and why the popular advice (build a golden dataset, measure accuracy, iterate) stops one step short of the decision you actually have to make.
Grade against the baseline, not against 100 percent
Here is the trap. You build a test set, run your AI classifier, and it scores 94 percent. Is that good? You cannot answer that in a vacuum. Ninety-four percent is excellent if the step replaces a tired intake clerk who tags tickets correctly 86 percent of the time. It is a disaster if it replaces a deterministic rule that was right 100 percent of the time on the cases the rule covered. The accuracy number alone tells you nothing about whether to ship.
The right comparison is always: what does the AI step replace, and how often does that thing get it wrong today? If you are automating a task a person does, your baseline is that person's real error rate, which is almost never zero and is usually higher than people admit. If you are replacing a rule, your baseline is the rule's behavior on the inputs it was built for. You are not trying to beat perfection. You are trying to beat the status quo by enough to be worth the switch, while not introducing a new failure that the old way did not have.
This reframes the goal from "make the model better" to "make fewer costly mistakes than the current process." Those are different jobs. The first never ends. The second has a finish line you can see.
The six-step ship test you can run in a spreadsheet
This is the part the standard advice leaves out: a concrete procedure an operator can run without an eval framework, a data scientist, or a single line of test code. n8n shipped a native Evaluations feature in June 2025 (version 1.95.1 and up) that runs this same loop inside the tool, pulling test cases and scoring metrics like correctness or whether the right tool was called. Use it if you are on n8n. But the method is tool-agnostic, and doing it once by hand teaches you what the tool is actually measuring.
- Pull 30 to 50 real cases from your execution history. Not cases you invent. Your platform's run history is a labeled-by-reality dataset, and real inputs carry the typos, the half-filled forms, and the weird phrasings you would never dream up at a desk. This is the same execution log that is also your largest store of customer PII, so pull and store the test set with that in mind.
- Freeze the set. Copy those cases into a sheet and never edit them again. The moment you start tweaking test cases to make the model look good, you have stopped testing and started grading your own homework. A frozen test set is the only thing that lets you compare two prompt versions honestly.
- Label the correct answer yourself, before you look at the AI output. Fill in a "what should this be" column by hand first. If you label after seeing the model's guess, you will unconsciously rubber-stamp its answer. This is the single most skipped step, and skipping it quietly inflates every score that follows.
- Run the AI step against all of them. Same prompt, same model, same settings you intend to ship. Record the model's output in a column next to your label.
- Mark each row right or wrong, and for every wrong row, note the direction and the cost. Not just "wrong." Wrong how, and what does that specific mistake cost when it reaches a real customer. This is the step the standard advice skips, and it is covered in full below.
- Compare total error cost to the baseline and apply the ship rule. Ship if the AI's costly-error rate is at or below the thing it replaces, and the worst error it can make is caught downstream. Otherwise, narrow the step or add a gate.
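If you later want the sheet's arithmetic in code, the last two steps can be sketched as below. This is a minimal illustration, not a framework: the `Row` fields, the cost figures, the baseline cost of 10.0, and the third example ticket are all hypothetical placeholders for columns and numbers you would fill in yourself.

```python
from dataclasses import dataclass


@dataclass
class Row:
    text: str             # real input pulled from execution history
    label: str            # hand label, written before seeing the AI output
    ai_output: str        # what the AI step returned
    cost_if_wrong: float  # hypothetical cost of this specific mistake


def total_error_cost(rows):
    """Sum the cost of every row where the AI output differs from the hand label."""
    return sum(r.cost_if_wrong for r in rows if r.ai_output != r.label)


def should_ship(ai_cost, baseline_cost, worst_error_guarded):
    """Ship rule: error cost at or below the baseline AND the worst error is caught downstream."""
    return ai_cost <= baseline_cost and worst_error_guarded


rows = [
    Row("can you resend my receipt", "billing", "routine", 5.0),
    Row("thinking about upgrading", "routine", "billing", 1.0),
    Row("invoice shows the wrong amount", "billing", "billing", 5.0),  # invented row, scored correct
]

ai_cost = total_error_cost(rows)
decision = should_ship(ai_cost, baseline_cost=10.0, worst_error_guarded=True)
print(ai_cost, decision)
```

The baseline cost comes from scoring the current person or rule on the same frozen rows, so both sides of the comparison are measured the same way.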
Do not report one accuracy number; build an error-cost table
A single accuracy percentage treats every mistake as equal. They are not. Google's own classification guidance makes the point with class imbalance: a model that always predicts "no" scores 99 percent accuracy on a problem where the answer is "yes" only 1 percent of the time, while being completely useless at the one thing you care about. The same logic applies to an automation, except the imbalance is not in the data, it is in the cost.
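The class-imbalance arithmetic is easy to check for yourself. The sketch below assumes a made-up set of 1,000 cases where "yes" is correct 1 percent of the time, and a model that always answers "no".

```python
total = 1000
positives = 10  # "yes" is the right answer 1 percent of the time

labels = ["yes"] * positives + ["no"] * (total - positives)
predictions = ["no"] * total  # a model that always predicts "no"

# Accuracy: share of all cases the model got right.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / total

# Recall: share of the real "yes" cases the model actually caught.
caught = sum(p == "yes" and y == "yes" for p, y in zip(predictions, labels))
recall = caught / positives

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.99 while recall is 0.00: the model never finds the one case that matters.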
Take a support-ticket triage step that sorts incoming messages into urgent, billing, or routine. Suppose your frozen set of 50 cases comes back with three wrong. Accuracy: 94 percent. Now look at the three:
| Case | Correct label | AI said | Direction | What it costs |
|---|---|---|---|---|
| "your service has been down all morning" | urgent | routine | urgent to routine | A real outage sits unread for hours. Customer churns. |
| "can you resend my receipt" | billing | routine | billing to routine | Slightly slower reply. Customer waits an extra hour. |
| "thinking about upgrading" | routine | billing | routine to billing | A salesperson glances at it, shrugs, moves it back. |
Three errors, one accuracy number, three wildly different costs. The first one is the entire ballgame. An urgent-to-routine miss buries a churning customer; the other two cost minutes. Compare two hypothetical runs: one scores 96 percent with two errors, one of which is that first row; the other scores 94 percent with three errors, none of them urgent-to-routine. The 96 percent run is the worse one to ship. The accuracy figure ranks them backwards.
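To see the ranking flip in numbers, here is a small sketch that scores two runs by both accuracy and cost. The cost values per error direction are invented for illustration; substitute whatever unit you track, such as dollars or minutes of delay.

```python
# Hypothetical cost per (correct label, AI label) error direction.
COST = {
    ("urgent", "routine"): 500.0,
    ("billing", "routine"): 5.0,
    ("routine", "billing"): 1.0,
}


def score(errors, n_cases):
    """Return (accuracy, total error cost) for a list of error directions."""
    accuracy = (n_cases - len(errors)) / n_cases
    cost = sum(COST[e] for e in errors)
    return accuracy, cost


# Run A: two errors, one of them the urgent-to-routine miss.
run_a = [("urgent", "routine"), ("billing", "routine")]
# Run B: three errors, none of them urgent-to-routine.
run_b = [("billing", "routine"), ("routine", "billing"), ("billing", "routine")]

acc_a, cost_a = score(run_a, 50)
acc_b, cost_b = score(run_b, 50)
print(f"A: accuracy={acc_a:.2f} cost={cost_a}")
print(f"B: accuracy={acc_b:.2f} cost={cost_b}")
```

Run A wins on accuracy and loses badly on cost, which is why the rows get sorted by cost rather than counted.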
So the output of a real test is not a percentage. It is a short list of the wrong cases sorted by cost, and a single decision per expensive category: is this error recoverable, and is it caught before it does damage. In the triage example, the fix is not a better model, it is a cheap deterministic guard: force any ticket containing "down," "cancel," or "refund" to urgent regardless of what the model says. Now the one expensive error direction cannot happen, and the model only has to be good at sorting the cases where being wrong is survivable. That is the same reasoning behind deciding when an automation needs a human in the loop: you protect the irreversible, expensive direction and let the model run free on the rest.
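A guard like that is a few lines of ordinary code. The sketch below is one possible shape, not a prescribed implementation: the function name `route` is hypothetical, and matching on whole words is an assumption added here so that "download" does not trip the "down" rule.

```python
import re

# Keywords from the triage example. Whole-word, case-insensitive matching
# is an assumption of this sketch, not something the model provides.
URGENT_PATTERN = re.compile(r"\b(down|cancel|refund)\b", re.IGNORECASE)


def route(ticket_text, model_label):
    """Force urgent when a guard keyword appears; otherwise trust the model."""
    if URGENT_PATTERN.search(ticket_text):
        return "urgent"
    return model_label


print(route("your service has been down all morning", "routine"))
print(route("can you resend my receipt", "routine"))
```

Whole-word matching also means "cancelled" or "refunds" would slip past this pattern, so check the keyword list against your own frozen set before relying on it.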
When you have to re-run the test
A test you run once is a photograph, not a smoke detector. The score you got is only true for the exact prompt, model, and input mix you tested. Re-run the frozen set whenever any of those three moves:
- Every prompt change. Adding one sentence to a system prompt can fix three cases and break five others you were not watching. You cannot tell which without re-running. This is the entire reason to keep the test set frozen: it is the only way to know whether an edit was a net gain.
- Every model version change. This is the one operators miss. AI vendors ship floating aliases that point at "the latest version," and they repoint that alias to a newer snapshot after release. OpenAI's model documentation is explicit that aliases move and that production apps should pin a dated snapshot. So a flow you have not touched in months can start producing different outputs because the model under the alias changed. Pin a dated snapshot in production, and re-run the test against any new snapshot before you adopt it.
- The first time a new shape of input shows up. A new lead source, a new form field, a different language in the inbox. Your frozen set was drawn from yesterday's traffic. When tomorrow's traffic looks different, add a fresh batch of those new cases and re-test, because the old score does not cover them.
Pin the model, freeze the test, re-run on change. That is the maintenance loop, and it is cheap once the sheet exists.
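Once the frozen set has been run twice, the comparison between runs is a simple diff. The sketch below assumes you have the hand labels and both sets of outputs keyed by a case ID; the function name `diff_runs` and the sample data are hypothetical.

```python
def diff_runs(labels, old_outputs, new_outputs):
    """Return (fixed, broken): case IDs the change repaired and case IDs it regressed."""
    fixed, broken = [], []
    for case_id, label in labels.items():
        was_right = old_outputs[case_id] == label
        is_right = new_outputs[case_id] == label
        if is_right and not was_right:
            fixed.append(case_id)
        elif was_right and not is_right:
            broken.append(case_id)
    return fixed, broken


labels = {1: "urgent", 2: "billing", 3: "routine", 4: "billing"}
old_outputs = {1: "routine", 2: "billing", 3: "routine", 4: "routine"}
new_outputs = {1: "urgent", 2: "routine", 3: "routine", 4: "routine"}

fixed, broken = diff_runs(labels, old_outputs, new_outputs)
print("fixed:", fixed, "broken:", broken)
```

Review the broken list by cost, not by length: one regression in an expensive direction can outweigh several fixes in cheap ones.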
Why some AI steps cannot be tested this way
This method works cleanly when the AI step has a checkable answer: classify into a fixed set, extract named fields, return a yes or no. You can put the right answer in a column and compare. It does not work the same way on a step that generates open-ended text, like drafting a reply, because there is no single correct output to compare against. For those, you fall back to an LLM-as-judge metric (a second model scoring the first) or human spot-checks, both of which are slower and fuzzier.
That is a strong argument, made in full in our piece on when to use AI instead of a rule, for designing the AI step to be checkable in the first place. A step that classifies a message into one of four buckets can be tested against a frozen set in minutes. A step that writes a paragraph cannot. When you have the choice, push the model toward a narrow, checkable output and let rules and templates handle the rest, precisely because the narrow version is the one you can prove is working before you trust it.
What to do next
Open your most important AI automation and find the one step where a model decides something: a category, a routing choice, an extracted value. Export 30 to 50 real past runs of that step from your execution history. Drop them in a sheet, hand-label the correct answer in a fresh column before looking at what the model did, then run the step and mark the wrong rows by cost, not by count. You will usually find that the raw accuracy is fine and that one error direction is doing all the damage, which means the fix is a small guard, not a better prompt.
We build this test into every workflow automation system we ship, and we keep the frozen set so that every prompt edit and every model bump gets re-checked before it reaches a customer. If you have an AI step running in production that you have never actually measured against the thing it replaced, send us the flow and we will build you the test set and tell you, in costs rather than percentages, whether it is safe to trust.
SOURCES & CITATIONS
- Introducing Evaluations for AI workflows, n8n Blog: https://blog.n8n.io/introducing-evaluations-for-ai-workflows/
- Classification: Accuracy, recall, precision, and related metrics, Google for Developers: https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
- Models, OpenAI: https://platform.openai.com/docs/models
About Alexey Yushkin
Alexey is the founder of GENERAL INFORMATICS LLC. He designs and ships AI and automation systems for businesses and operators across the US.