Observability for Systems That Can't Fail

March 13, 2026 · Reliability

At 2:47 AM last month, our payment processing pipeline stopped. No alerts fired. No dashboards turned red. Customers just couldn't complete purchases.

It took 23 minutes to find the problem. It took 8 minutes to fix. The search, not the repair, accounted for most of the outage.

We rebuilt our observability from first principles.

Monitoring Tells You When. Observability Tells You Why.

Traditional monitoring tracks known metrics. CPU usage. Error rates. Request latency. These tell you something is wrong, but not why.

Observability is about answering questions you didn't know you would need to ask. "Show me all requests from user 12345 that failed in the last hour." "Trace the path of order #987654 through the system." These aren't pre-defined metrics. They're ad-hoc investigations.

The three pillars: logs, metrics, and traces. Most teams have all three. Few use them together effectively.

Structured Logging Is Non-Negotiable

Unstructured logs:

2026-03-13 14:32:01 INFO User login successful

Structured logs:

{
  "timestamp": "2026-03-13T14:32:01.234Z",
  "level": "INFO",
  "message": "User login successful",
  "event": "user.login",
  "user_id": "user_12345",
  "ip_address": "203.0.113.42",
  "user_agent": "Mozilla/5.0...",
  "duration_ms": 245,
  "trace_id": "abc123def456"
}

The difference: machine readability. You can filter structured logs by any field. You can aggregate by user. You can correlate with traces. Unstructured logs require parsing that breaks whenever the format changes.

We enforce structure with code generation. Every event has a schema. The logging library validates at compile time. Wrong field name? Build fails.
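As a rough illustration of schema-enforced events (a sketch, not the actual logging library), the login event above could be modeled in Python as a dataclass. The names `UserLoginEvent` and `render_event` are hypothetical. Python has no compile step, so a misspelled field is rejected when the event is constructed, and a static type checker such as mypy can flag it earlier.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class UserLoginEvent:
    """Hypothetical schema for the user.login event shown above."""
    user_id: str
    ip_address: str
    user_agent: str
    duration_ms: int
    trace_id: str


def render_event(event: UserLoginEvent, now: datetime) -> str:
    """Serialize one event as a single JSON log line."""
    timestamp = (
        now.astimezone(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    record = {
        "timestamp": timestamp,
        "level": "INFO",
        "message": "User login successful",
        "event": "user.login",
        **asdict(event),  # schema fields: user_id, ip_address, ...
    }
    return json.dumps(record)
```

Constructing `UserLoginEvent(userid=...)` with a misspelled field raises `TypeError` before anything is logged, which is the runtime analogue of a failed build.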

Distributed Tracing That Works

In a microservices architecture, a single user request touches dozens of services. When it's slow, which one is the problem?

We use OpenTelemetry for tracing. Every request gets a trace ID propagated across service boundaries. Each service creates spans:

Trace: abc123
├── API Gateway (45ms)
├── Auth Service (12ms)
├── User Service (67ms)
│   └── Database Query (45ms)
├── Payment Service (234ms)
│   ├── Validation (23ms)
│   ├── Fraud Check (145ms)
│   └── Processor Call (56ms)
└── Notification Service (34ms)

At a glance, we see the payment service is slow. Drill down: fraud check is the culprit. Root cause analysis in seconds, not hours.
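The drill-down above can be sketched in a few lines of Python. The `Span` class and helper names are hypothetical, the durations are copied from the example trace, and the sketch assumes a span's children run one after another (overlapping children would need interval arithmetic). It ranks spans by self time, meaning a span's duration minus the time spent in its direct children.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        """Time spent in this span itself, excluding direct children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span):
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_by_self_time(roots):
    """Return the span with the largest self time in the whole trace."""
    return max((s for r in roots for s in walk(r)), key=Span.self_time_ms)


# The example trace abc123 from above.
trace = [
    Span("API Gateway", 45),
    Span("Auth Service", 12),
    Span("User Service", 67, [Span("Database Query", 45)]),
    Span("Payment Service", 234, [
        Span("Validation", 23),
        Span("Fraud Check", 145),
        Span("Processor Call", 56),
    ]),
    Span("Notification Service", 34),
]

culprit = slowest_by_self_time(trace)
print(culprit.name, culprit.self_time_ms())  # Fraud Check 145
```

Ranking by self time rather than total duration is what separates the Payment Service (234ms total, but only 10ms of its own work) from the Fraud Check inside it.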

Correlating Everything

Logs, metrics, and traces are separate systems in most organizations. We unified them with a single trace ID:

- Every log entry includes trace_id
- Every metric is tagged with trace_id (where applicable)
- Every trace links to related logs

Click a slow trace → See all logs from that request
Click an error log → See the trace that generated it

This correlation turns debugging from archaeology into navigation. You start with a symptom and follow the breadcrumbs.
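One way to get a trace_id onto every log entry without passing it through every function call is a context variable plus a logging filter. The sketch below uses only the Python standard library. The names `current_trace_id`, `TraceIdFilter`, `JsonFormatter`, `build_logger`, and `handle_request` are illustrative assumptions, not part of any particular tracing SDK.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream):
    """Create a logger whose output always carries trace_id."""
    logger = logging.getLogger("payment-api")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(logger, trace_id):
    """Bind the trace ID for the duration of one request."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("charge started")
    finally:
        current_trace_id.reset(token)
```

Because the ID lives in a context variable, code deep in the call stack can log normally and still produce entries that join cleanly against the trace.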

Smart Alerting

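Most alerting is broken. Alert on CPU > 80%? You'll wake someone up for a healthy spike. Alert on errors > 0? You'll page on every transient failure. Alert on a fixed error-rate threshold? You'll miss gradual degradation that stays just under it.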

We use SLO-based alerting:

Objective: 99.9% of requests succeed in < 200ms
Alert when: Error budget burns > 2% in 1 hour

This aligns alerts with user impact. A brief spike that doesn't affect the SLO? No alert. A gradual increase that will breach the SLO in 4 hours? Page someone before customers notice.
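The arithmetic behind a rule like this is short enough to sketch. The version below assumes a 30-day rolling SLO window (the window length is not stated above, so treat it as a placeholder) and counts a request as "bad" if it fails or takes 200ms or longer. Under that assumption, burning 2% of the budget in 1 hour corresponds to a burn rate of 14.4, meaning errors are arriving 14.4 times faster than the budget allows.

```python
SLO_TARGET = 0.999          # from the objective above
SLO_WINDOW_HOURS = 30 * 24  # assumption: 30-day rolling window
BUDGET_FRACTION = 0.02      # alert when more than 2% of the budget burns...
ALERT_WINDOW_HOURS = 1      # ...within 1 hour


def burn_rate(bad_requests, total_requests, slo_target=SLO_TARGET):
    """How many times faster than 'exactly on budget' errors are accruing."""
    if total_requests == 0:
        return 0.0
    error_rate = bad_requests / total_requests
    return error_rate / (1.0 - slo_target)


def burn_rate_threshold(
    budget_fraction=BUDGET_FRACTION,
    slo_window_hours=SLO_WINDOW_HOURS,
    alert_window_hours=ALERT_WINDOW_HOURS,
):
    """Burn rate at which `budget_fraction` is consumed in the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(bad_requests, total_requests):
    """True if the last hour's traffic burned budget faster than allowed."""
    return burn_rate(bad_requests, total_requests) > burn_rate_threshold()
```

With these numbers, an hour in which 2% of requests are bad pages someone (burn rate 20), while an hour at 0.5% does not (burn rate 5), even though both are above the 0.1% the SLO tolerates on average.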

We also use anomaly detection. Not "is this metric high?" but "is this pattern unusual?" Our ML model learns normal behavior and alerts on deviations. It caught a memory leak that static threshold alerts missed.
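The model itself is not described here, so the following is a deliberately simple stand-in rather than the approach used in production: a z-score check that treats recent history as "normal" and flags values far from it. The function name `is_anomalous` and the 3-sigma default are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Flag `value` if it is more than `threshold` standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to define "normal"
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold
```

The point of the comparison is that the same absolute value can be normal for one service and alarming for another, because the baseline is learned from each metric's own history instead of being hard-coded.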

The Cost Problem

Observability data is expensive to store. We handle 50 TB of logs daily. Traces add another 20 TB. Storing everything forever isn't feasible.

Our retention strategy:

- Raw logs: 7 days
- Aggregated metrics: 1 year
- Sampled traces (1%): 30 days
- Error traces (100%): 90 days
- Custom retention for audit logs: 7 years

We sample happy paths aggressively. But we keep 100% of errors and slow requests. The data that matters for debugging is always available.
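A sampling rule of this shape can be sketched as a single function. Two assumptions are baked in: "slow" is taken to mean 200ms or more (borrowed from the SLO target, since no threshold is given here), and the decision is made after the request finishes, which is what lets it see the error status and duration. The name `keep_trace` is hypothetical.

```python
import hashlib

HAPPY_PATH_SAMPLE_RATE = 0.01  # 1%, from the retention list above
SLOW_THRESHOLD_MS = 200        # assumption: reuse the SLO latency target


def keep_trace(trace_id: str, is_error: bool, duration_ms: float) -> bool:
    """Decide whether a finished trace should be stored."""
    # Errors and slow requests are always kept.
    if is_error or duration_ms >= SLOW_THRESHOLD_MS:
        return True
    # Hash the trace ID so the same trace always gets the same decision,
    # no matter which service or collector evaluates it.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < HAPPY_PATH_SAMPLE_RATE
```

Hashing the trace ID instead of calling a random number generator keeps the decision consistent, so a sampled trace is never stored with half of its spans missing.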

Developer Experience

Observability isn't just for SREs. Developers need it to understand their code in production.

We built a query language that feels like coding:

from logs
where service = "payment-api"
  and level = "ERROR"
  and timestamp > now() - 1h
group by error_type
order by count desc
limit 10

No clicking through a UI. No unfamiliar query syntax to learn. Just SQL-like syntax that developers already know.
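To make the semantics of the query above concrete, here is roughly the same computation written in plain Python over a list of structured log entries. This is only an illustration of what the query asks for, not how the query engine works. The function name `top_error_types` is hypothetical, and the `error_type` field is taken from the query.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def top_error_types(logs, now, limit=10):
    """Count recent payment-api errors by type, most frequent first."""
    cutoff = now - timedelta(hours=1)
    counts = Counter(
        entry["error_type"]
        for entry in logs
        if entry["service"] == "payment-api"
        and entry["level"] == "ERROR"
        and entry["timestamp"] > cutoff
    )
    # most_common sorts by count descending and applies the limit.
    return counts.most_common(limit)
```

Every clause maps onto a field that structured logging already guarantees, which is why a query like this needs no log-line parsing.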

The Incident That Proved It

Last week, latency spiked on our search API. Old us would have checked dashboards, looked at recent deploys, guessed at causes.

New us: trace the slow requests, see they're all hitting a specific database shard, check logs for that shard, find a lock contention issue, identify the query causing it. Total time: 4 minutes.

The fix was a missing index. The value was knowing exactly where to look.

Building for Unknown Unknowns

You can't predict every failure mode. But you can build systems that help you understand any failure quickly.

Structured data. Correlated signals. Fast queries. These aren't nice-to-haves. They're the difference between minutes and hours of downtime.

We don't just monitor our systems. We make them observable. And that makes all the difference.
