Your first agentic workflow should be boring

The pattern I see in teams shipping their first agentic workflow is consistent enough to be a rule. They pick the most exciting problem they can think of: a research assistant, a sales co-pilot, a thing that browses the web. Six weeks later they have a demo that impresses an executive once, never gets used in anger, and quietly disappears. The teams that succeed pick something boring on purpose.

Boring, in this context, has a specific meaning. It means the task is narrow, the output is structured, the success rate is measurable against a pre-existing baseline, and the cost of being wrong is bounded. "Categorise an incoming invoice and route it to the right approver queue" is boring. "Be a digital sales rep" is not. The first one ships. The second one demos.

A concrete boring example

Take invoice categorisation. An accounts payable team gets a few hundred invoices a day. Each one needs to be classified into a GL code, a cost centre, and a project, then routed to the right approver. Today this is done by a junior accountant, or by a brittle rules engine that someone wrote in 2019 and nobody wants to touch.

An LLM with structured output is well-suited to this. It reads the supplier name, the line items, the historical pattern for that supplier. It produces a structured classification with a confidence score. If the confidence is above a threshold, the invoice routes automatically. If it's below, it goes to a human review queue with the model's best guess pre-filled. The architecture is roughly:

```ts
// Outline of the agent loop, deliberately small.
// Invoice, llm, auditLog, and the helper functions are assumed to exist elsewhere.
type Classification = {
  glCode: string;
  costCentre: string;
  projectId: string | null;
  confidence: number;
  rationale: string;
};

type RouteDecision = {
  route: "auto" | "review";
  target: string;
  result: Classification;
};

async function classifyInvoice(invoice: Invoice): Promise<RouteDecision> {
  const history = await getSupplierHistory(invoice.supplierId);
  const result = await llm.structured<Classification>({
    schema: classificationSchema, // JSON schema matching Classification
    system: CLASSIFY_SYSTEM_PROMPT,
    user: renderInvoiceContext(invoice, history),
  });

  // Log every classification, whatever the outcome; the audit trail is part of the product.
  await auditLog.write({ invoice, result });

  if (result.confidence >= AUTO_APPROVE_THRESHOLD) {
    return { route: "auto", target: queueFor(result), result };
  }
  return { route: "review", target: "human-review", result };
}
```

Notice what isn't there. There's no agent loop in the autonomous-tool-using sense. There's no browser. There's no chain of LLM calls that decide what to do next. The model gets one job, returns one structured object, and a deterministic piece of code makes the routing decision. This is the sweet spot for first projects, and most teams skip past it because it doesn't feel like "agents".

Why this works and the demos don't

Three reasons. First, the success criterion is unambiguous: the invoice was either routed correctly or it wasn't. You can build a golden set of a few hundred manually labelled invoices and grade every change against it. Second, you have a baseline: the rule-based system is doing this today, badly. You can A/B against it and prove that the LLM version makes fewer routing mistakes per hundred invoices. Third, the cost of being wrong is bounded: a misrouted invoice gets bounced back from the wrong approver, and the worst case is a one-day delay. Nobody gets hurt.
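To make the first two of those concrete, here is a minimal sketch of what grading against a golden set might look like. The GoldenExample shape and the misroutesPerHundred function are illustrative assumptions, not a prescribed format; classifyInvoice and Invoice are the ones from the outline above.

```ts
// Hypothetical grading harness: run every labelled invoice through the
// classifier and report routing mistakes per hundred.
type GoldenExample = {
  invoice: Invoice; // same assumed Invoice shape as the outline above
  expectedGlCode: string;
  expectedCostCentre: string;
};

async function misroutesPerHundred(goldenSet: GoldenExample[]): Promise<number> {
  let misroutes = 0;
  for (const example of goldenSet) {
    const { result } = await classifyInvoice(example.invoice);
    if (
      result.glCode !== example.expectedGlCode ||
      result.costCentre !== example.expectedCostCentre
    ) {
      misroutes += 1;
    }
  }
  // Same unit as the baseline comparison: routing mistakes per hundred invoices.
  return (misroutes / goldenSet.length) * 100;
}
```

Run this on every prompt change and the number either moves or it doesn't; that is the whole point of the golden set.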

The exciting demos fail at all three. The research assistant produces output that's hard to grade objectively. There's no baseline because nobody had a research assistant before. And the cost of being wrong is unbounded: a confident, plausible-looking summary that's quietly wrong on a fact the user didn't know to check. Same architecture, completely different shipping outcome.

The eval discipline that makes it stick

The thing that turns a prototype into something operations will trust is eval discipline. Three layers, in increasing cost:

  1. A golden set of around 200 hand-labelled examples that the system is graded against on every prompt change. This catches regressions caused by you fiddling.
  2. A weekly A/B against the existing rule-based system on live traffic, with the routing decision logged for both (a shadow-logging sketch follows this list). This catches drift: cases where the supplier mix changed and your prompt didn't.
  3. A monthly review where the AP team flags the worst auto-routing mistakes from the past four weeks, and those examples get added to the golden set. This is the slowest loop and the one that compounds.
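Here is a rough sketch of how that shadow comparison from layer two might be wired up. The runRulesEngine wrapper and the invoice.id field are assumptions about the existing system; the point is only that both routes get logged while the live decision stays unchanged.

```ts
// Hypothetical shadow comparison: on each live invoice, take the real routing
// decision as usual, also ask the old rules engine, and log both answers.
async function routeWithShadowLog(invoice: Invoice): Promise<RouteDecision> {
  const llmDecision = await classifyInvoice(invoice);
  const rulesTarget = runRulesEngine(invoice); // assumed wrapper around the 2019 rules engine

  await auditLog.write({
    kind: "weekly-ab",
    invoiceId: invoice.id, // assumed field on Invoice
    llmTarget: llmDecision.target,
    rulesTarget,
  });

  // The live decision belongs to whichever system currently owns routing; here, the LLM path.
  return llmDecision;
}
```

The weekly job then just aggregates these log rows and compares misroute rates between the two columns.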

Eval is the boring infrastructure that makes the boring agent stay good. Most teams underinvest here because building evals doesn't feel like building product. It is building product. The agent without evals is a demo with extra steps.

When to graduate to real agents

Once the boring workflow is in production, has been stable for a quarter, and the operations team trusts it, you've earned the right to do something more agentic. You'll know it because the next problem will look like "now that we classify invoices, can we also reach out to the supplier when a line item doesn't match a PO": a multi-step task that requires the model to decide what to do next, and one where the boring infrastructure you built (logging, evals, rollback, HITL queue) carries over directly.

The teams that build the multi-step thing first, without that scaffolding, are the ones whose pilot dies. Not because the model isn't capable (frontier models are plenty capable for these tasks in 2026), but because the operations side of the system, the part that catches and corrects mistakes, was never built.

Bottom line

Pick something narrow, structured, and measurable. Build the eval before you build the agent. Ship behind a confidence threshold with a human queue for the rest. Beat the existing baseline by a margin large enough that the operations team feels the difference. Then, and only then, expand scope. Where I'm uncertain: I suspect the gap between "boring agentic workflow" and "real agentic workflow" closes faster than I'm guessing as model planning capability improves. But the eval discipline doesn't get cheaper, and that's still the bottleneck.