State Machines for Agent Workflows
Most broken agents fail for the same reason: nobody can answer, "what state is this thing in right now?"
Prompts look linear. Production behavior is not. Retries, tool failures, stale context, and user interrupts create branching paths that are easy to miss.
Make State Explicit
We moved from implicit prompt flow to a finite-state model. Every run must be in one known state:
enum AgentState {
Created,
Planning,
FetchingContext,
CallingTool,
WaitingForUser,
Synthesizing,
Completed,
Failed
}
Each transition is logged with timestamp, reason, and correlation ID. That single decision removed most "it got weird" debugging sessions.
Guard Transitions, Not Just Outputs
Teams often validate only final answers. We now validate transitions with invariants:
rule no_direct_complete:
disallow Planning -> Completed
rule tool_timeout_path:
CallingTool(timeout) -> Failed or Planning
rule user_block:
WaitingForUser requires pending_question_id
This catches impossible flows before users see them.
Retries Become Safer
Without explicit state, retries replay random chunks of work. With state machines, retries become deterministic: resume from the last successful transition and apply idempotency keys for side effects.
That matters for actions like ticket creation, provisioning, or payments, where duplicate execution is expensive.
Observability Gets Better Immediately
Dashboards become clear when each run has a known lifecycle. We use three core metrics:
- transition_count_per_run
- time_in_state_seconds
- failure_rate_by_transition
"Failure rate by transition" told us our biggest issue was not model quality. It was tool call timeout handling.
How to Start
You do not need a huge framework. Start with a lightweight table-driven transition map and enforce it at runtime.
allowed[Planning] = [FetchingContext, Failed]
allowed[FetchingContext] = [CallingTool, Synthesizing, Failed]
allowed[CallingTool] = [Synthesizing, Planning, Failed]
Once transitions are explicit, testing and incident response both improve. The agent becomes a system you can reason about, not a black box you hope behaves.
← Back to Home