Latency Budgets for LLM Products
Users do not experience "model quality" first. They experience waiting.
If your system responds in 7 seconds, nobody cares that your benchmark score improved by 2 points. Fast products get used. Slow products get abandoned.
Define the End-to-End Budget
Start from UX. For an interactive assistant, we target:
- Time to first token: <= 900ms
- Time to useful answer: <= 2400ms
- P95 end-to-end: <= 3500ms
Then distribute that budget across pipeline stages.
Allocate by Stage
- Input validation: 50ms
- Retrieval: 300ms
- Re-ranking: 120ms
- Model start-up: 180ms
- Generation: 1600ms
- Post-processing: 150ms
- Safety checks: 100ms
- Buffer: 200ms
Every stage has an owner. If one stage exceeds its budget, the owner fixes it or negotiates a tradeoff.
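One way to keep the allocation honest is to hold it as data and check it against the targets. The sketch below copies the stage figures from the list above; the dictionary keys, the helper names, and the choice of which stages count as "before the first token" are illustrative assumptions, not part of any particular framework.

```python
# Stage budgets in milliseconds, copied from the allocation above.
STAGE_BUDGET_MS = {
    "input_validation": 50,
    "retrieval": 300,
    "reranking": 120,
    "model_startup": 180,
    "generation": 1600,
    "post_processing": 150,
    "safety_checks": 100,
    "buffer": 200,
}

P95_END_TO_END_MS = 3500      # end-to-end target from the budget above
FIRST_TOKEN_TARGET_MS = 900   # time-to-first-token target from the budget above

# Assumption: these stages must finish before the model can emit a token.
PRE_GENERATION_STAGES = (
    "input_validation", "retrieval", "reranking", "model_startup",
)


def total_budget_ms(budget):
    """Sum of all stage allocations."""
    return sum(budget.values())


def pre_generation_ms(budget):
    """Time allocated to stages that run before generation starts."""
    return sum(budget[stage] for stage in PRE_GENERATION_STAGES)


def over_budget(measured_ms, budget):
    """Return {stage: overshoot_ms} for stages that exceeded their budget."""
    return {
        stage: measured_ms[stage] - budget[stage]
        for stage in measured_ms
        if stage in budget and measured_ms[stage] > budget[stage]
    }
```

With these figures the stages sum to 2700ms, which is 800ms under the 3500ms P95 target, and the pre-generation stages sum to 650ms, which leaves 250ms of the 900ms first-token target for the model to produce its first token. A call such as `over_budget({"retrieval": 340}, STAGE_BUDGET_MS)` reports a 40ms overshoot for the retrieval owner to fix or negotiate.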
Use Budget-Aware Fallbacks
Fallback logic should depend on remaining time, not static rules.
if remaining_ms < 700:
    skip_reranking()
    reduce_context_tokens()
    force_compact_response_mode()
This keeps responses timely even during spikes.
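A runnable version of that idea might look like the following. It is a minimal sketch: the `Deadline` class, the `plan_request` function, and the context-token figures (4000 and 1500) are hypothetical names and values chosen for illustration. Only the 700ms threshold and the three fallback actions come from the snippet above.

```python
import time


class Deadline:
    """Tracks how much of a request's latency budget is left."""

    def __init__(self, budget_ms, clock=time.monotonic):
        # A monotonic clock is not affected by system clock adjustments.
        self._clock = clock
        self._start = clock()
        self._budget_ms = budget_ms

    def remaining_ms(self):
        elapsed_ms = (self._clock() - self._start) * 1000.0
        return self._budget_ms - elapsed_ms


def plan_request(remaining_ms, threshold_ms=700):
    """Choose pipeline options based on the time left, not on static rules."""
    # Hypothetical defaults for the full-quality path.
    plan = {"rerank": True, "context_tokens": 4000, "compact_response": False}
    if remaining_ms < threshold_ms:
        plan["rerank"] = False            # skip re-ranking
        plan["context_tokens"] = 1500     # reduce context (illustrative value)
        plan["compact_response"] = True   # force compact response mode
    return plan
```

In use, the request handler would create `Deadline(3500)` on arrival and call `plan_request(deadline.remaining_ms())` after retrieval, so a slow retrieval automatically shifts the rest of the request onto the cheaper path.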
Stream Early, Stream Meaningfully
Token streaming helps only if the early tokens carry meaning. Avoid filler openings. We write prompts so the model emits structure first: a summary sentence, then details.
When users see immediate relevance, they tolerate longer total completion times.
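To check whether early tokens actually carry meaning, it can help to record when the first complete sentence arrives as well as when the first token does. The sketch below assumes the stream is any Python iterable of token strings, and it treats the first token ending in sentence punctuation as a rough proxy for the summary sentence. Both the function name and that heuristic are assumptions made for illustration.

```python
import time


def stream_timings(tokens, clock=time.monotonic):
    """Consume a token stream and record first-token and first-sentence times."""
    start = clock()
    first_token_ms = None
    first_sentence_ms = None
    parts = []
    for token in tokens:
        now_ms = (clock() - start) * 1000.0
        if first_token_ms is None:
            first_token_ms = now_ms
        parts.append(token)
        # Rough proxy: the first token that closes a sentence marks the
        # point where the summary sentence is fully visible to the user.
        if first_sentence_ms is None and token.rstrip().endswith((".", "!", "?")):
            first_sentence_ms = now_ms
    return {
        "first_token_ms": first_token_ms,
        "first_sentence_ms": first_sentence_ms,
        "text": "".join(parts),
    }
```

A large gap between the two timings suggests the response opens with filler or a long preamble, even when time to first token looks healthy.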
Track P95 by Stage
A single p95 for end-to-end latency is not enough. You need per-stage percentiles and regression alerts:
- p95_retrieval_ms
- p95_generation_first_token_ms
- p95_generation_complete_ms
- p95_safety_ms
This tells you where the latency debt actually lives.
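A per-stage regression check can be small. The sketch below uses a nearest-rank p95 and flags any metric whose current p95 exceeds its baseline by more than a tolerance. The 10% tolerance, the function names, and the shape of the input dictionaries are assumptions for illustration; a production system would more likely compute percentiles in its metrics backend.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty collection of latencies."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def regressions(stage_samples, baseline_p95_ms, tolerance=0.10):
    """Return {metric: (baseline, current)} for metrics that regressed.

    stage_samples maps a metric name (e.g. "p95_retrieval_ms") to raw
    latency samples in milliseconds; baseline_p95_ms maps the same names
    to the previously accepted p95 values.
    """
    alerts = {}
    for metric, samples in stage_samples.items():
        current = p95(samples)
        baseline = baseline_p95_ms[metric]
        if current > baseline * (1 + tolerance):
            alerts[metric] = (baseline, current)
    return alerts
```

Running this per stage rather than only on the end-to-end number shows which owner needs to act when the overall p95 drifts.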
Budgets Force Real Product Decisions
You cannot optimize everything at once. Budgets make tradeoffs explicit: smaller context windows, lighter safety models on low-risk paths, or compact output formats for interactive modes.
That constraint is healthy. It aligns engineering, product, and design on one concrete goal: useful answers fast enough that people keep using the product.