State Machines for Agent Workflows

March 5, 2026 · Architecture

Most broken agents fail for the same reason: nobody can answer, "what state is this thing in right now?"

Prompts look linear. Production behavior is not. Retries, tool failures, stale context, and user interrupts create branching paths that are easy to miss.

Make State Explicit

We moved from implicit prompt flow to a finite-state model. Every run must be in one known state:

enum AgentState {
  Created,
  Planning,
  FetchingContext,
  CallingTool,
  WaitingForUser,
  Synthesizing,
  Completed,
  Failed
}

Each transition is logged with timestamp, reason, and correlation ID. That single decision removed most "it got weird" debugging sessions.

Guard Transitions, Not Just Outputs

Teams often validate only final answers. We now validate transitions with invariants:

rule no_direct_complete:
  disallow Planning -> Completed

rule tool_timeout_path:
  CallingTool(timeout) -> Failed or Planning

rule user_block:
  WaitingForUser requires pending_question_id

This catches impossible flows before users see them.

Retries Become Safer

Without explicit state, retries replay random chunks of work. With state machines, retries become deterministic: resume from the last successful transition and apply idempotency keys for side effects.

That matters for actions like ticket creation, provisioning, or payments, where duplicate execution is expensive.

Observability Gets Better Immediately

Dashboards become clear when each run has a known lifecycle. We use three core metrics:

- transition_count_per_run
- time_in_state_seconds
- failure_rate_by_transition

"Failure rate by transition" told us our biggest issue was not model quality. It was tool call timeout handling.

How to Start

You do not need a huge framework. Start with a lightweight table-driven transition map and enforce it at runtime.

allowed[Planning] = [FetchingContext, Failed]
allowed[FetchingContext] = [CallingTool, Synthesizing, Failed]
allowed[CallingTool] = [Synthesizing, Planning, Failed]

Once transitions are explicit, testing and incident response both improve. The agent becomes a system you can reason about, not a black box you hope behaves.

← Back to Home