Latency Budgets for LLM Products

March 4, 2026 · Performance

Users do not experience "model quality" first. They experience waiting.

If your system responds in 7 seconds, nobody cares that your benchmark score improved by 2 points. Fast products get used. Slow products get abandoned.

Define the End-to-End Budget

Start from UX. For an interactive assistant, we target:

Time to first token: <= 900ms
Time to useful answer: <= 2400ms
P95 end-to-end: <= 3500ms

Then distribute that budget across pipeline stages.

Allocate by Stage

Input validation      50ms
Retrieval            300ms
Re-ranking           120ms
Model start-up       180ms
Generation          1600ms
Post-processing      150ms
Safety checks        100ms
Buffer               200ms

Every stage has an owner. If one stage exceeds its budget, the owner fixes it or negotiates a tradeoff.
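
The allocation can be written down as data and checked against the targets. The following is a minimal sketch, assuming the stages run sequentially in the order listed; `STAGE_BUDGET_MS`, `total_budget_ms`, and `over_budget` are illustrative names, not part of any existing system.

```python
# Stage budgets from the table above, in milliseconds.
STAGE_BUDGET_MS = {
    "input_validation": 50,
    "retrieval": 300,
    "reranking": 120,
    "model_startup": 180,
    "generation": 1600,
    "post_processing": 150,
    "safety_checks": 100,
    "buffer": 200,
}
P95_TARGET_MS = 3500  # end-to-end P95 target from the UX budget


def total_budget_ms(budget=STAGE_BUDGET_MS):
    """Sum of all stage allocations."""
    return sum(budget.values())


def over_budget(measured_ms, budget=STAGE_BUDGET_MS):
    """Return {stage: overshoot_ms} for stages that exceeded their allocation."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in budget.items()
        if measured_ms.get(stage, 0) > limit
    }


print(total_budget_ms())                                      # 2700
print(total_budget_ms() <= P95_TARGET_MS)                     # True
print(over_budget({"retrieval": 340, "generation": 1500}))    # {'retrieval': 40}
```

Under that ordering assumption, the stages through post-processing add up to 2400 ms, which matches the useful-answer target, and the full allocation of 2700 ms stays under the 3500 ms P95 ceiling.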

Use Budget-Aware Fallbacks

Fallback logic should depend on remaining time, not static rules.

# remaining_ms: the request's budget minus the time already spent on it
if remaining_ms < 700:
    skip_reranking()
    reduce_context_tokens()
    force_compact_response_mode()

This keeps responses timely even during spikes.
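
The snippet above takes `remaining_ms` as given. One way it might be computed is with a small deadline object created when the request arrives. This is a sketch under assumptions: `Deadline` and `plan_degradations` are hypothetical names, the 3500 ms budget is the P95 target from earlier, and the clock is injectable only so the example runs deterministically.

```python
import time


class Deadline:
    """Tracks how much of a request's latency budget is left."""

    def __init__(self, budget_ms, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self._budget_ms = budget_ms

    def remaining_ms(self):
        elapsed_ms = (self._clock() - self._start) * 1000.0
        return max(0.0, self._budget_ms - elapsed_ms)


def plan_degradations(remaining_ms, threshold_ms=700):
    """Return the names of fallback steps to apply given the time left."""
    if remaining_ms < threshold_ms:
        return ["skip_reranking", "reduce_context_tokens", "compact_response_mode"]
    return []


# Fake clock for the example: 3 seconds pass between start and the check.
ticks = iter([0.0, 3.0])
deadline = Deadline(3500, clock=lambda: next(ticks))
print(plan_degradations(deadline.remaining_ms()))
# ['skip_reranking', 'reduce_context_tokens', 'compact_response_mode']
```

Returning the plan as data rather than calling the fallbacks directly keeps the decision easy to log and test; in production the default `time.monotonic` clock avoids jumps from wall-clock adjustments.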

Stream Early, Stream Meaningfully

Token streaming helps only if early tokens carry meaning. Avoid filler openings. We design prompts to emit structure first: a summary sentence, then details.

When users see immediate relevance, they tolerate longer total completion times.
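
To check that early tokens actually arrive on time, the stream can be timed directly. This is a rough sketch, assuming the model client exposes tokens as a Python iterator of strings; `measure_stream` is a hypothetical helper. It records when the first non-whitespace token arrives, not whether that token is meaningful, so it complements prompt design rather than replacing it.

```python
import time


def measure_stream(tokens, clock=time.monotonic):
    """Consume a token iterator; return (text, first_token_ms, total_ms)."""
    start = clock()
    first_token_ms = None
    parts = []
    for token in tokens:
        # Ignore whitespace-only tokens when timing the first token.
        if first_token_ms is None and token.strip():
            first_token_ms = (clock() - start) * 1000.0
        parts.append(token)
    total_ms = (clock() - start) * 1000.0
    return "".join(parts), first_token_ms, total_ms


# Fake clock for the example: first token at 0.5 s, stream done at 1.5 s.
ticks = iter([0.0, 0.5, 1.5])
text, ttft_ms, total_ms = measure_stream(
    ["Summary", ": ", "details"], clock=lambda: next(ticks)
)
print(text, ttft_ms, total_ms)  # Summary: details 500.0 1500.0
```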

Track P95 by Stage

A single p95 for end-to-end latency is not enough. You need per-stage percentiles and regression alerts:

- p95_retrieval_ms
- p95_generation_first_token_ms
- p95_generation_complete_ms
- p95_safety_ms

This tells you where the latency debt actually lives.
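
Per-stage percentiles and a simple regression check might look like the following. This is a sketch only: it uses the nearest-rank definition of p95, the 10% tolerance is an arbitrary illustrative value, and `p95` and `regressions` are hypothetical helpers rather than calls from a particular metrics library.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def regressions(samples_by_stage, baseline_p95_ms, tolerance=0.10):
    """Return {stage: current_p95} for stages above baseline * (1 + tolerance)."""
    alerts = {}
    for stage, samples in samples_by_stage.items():
        current = p95(samples)
        if current > baseline_p95_ms[stage] * (1 + tolerance):
            alerts[stage] = current
    return alerts


samples = {
    "retrieval": [200] * 90 + [380] * 10,  # slow tail: 10% of requests
    "safety": [80] * 100,
}
baseline = {"retrieval": 300, "safety": 100}
print(regressions(samples, baseline))  # {'retrieval': 380}
```

In this example the median retrieval latency looks healthy at 200 ms, but the p95 exposes the slow tail, which is the case a single end-to-end number tends to hide.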

Budgets Force Real Product Decisions

You cannot optimize everything at once. Budgets make tradeoffs explicit: smaller context windows, lighter safety models on low-risk paths, or compact output formats for interactive modes.

That constraint is healthy. It aligns engineering, product, and design on one concrete goal: useful answers fast enough that people keep using the product.
