March 12, 2026 Data

Schemas Before Pipelines

The fastest way to create a fragile data platform is to invest in movement before you invest in meaning.

Teams usually notice data quality problems only after they have built enough pipelines to make the mess expensive. At that point, every dashboard has a different definition, every event name has three spellings, and nobody knows which table is safe to use.

The root problem is rarely throughput. It is weak contracts.

Pipelines Cannot Rescue Undefined Events

If an upstream producer emits inconsistent payloads, downstream tooling only spreads the damage faster.

{
  "event": "checkout_completed",
  "user_id": "4821",
  "amount": "19.99",
  "currency": "usd"
}

Looks harmless. Then another service ships amount_cents, a third omits currency, and a fourth sends guest checkouts with no user identifier. Now analytics, fraud, billing, and ML each patch around the inconsistency in different ways.

Start with Ownership

Every schema needs a real owner. Not a team name in a wiki. A team with operational responsibility for changes, compatibility, and documentation.

If ownership is vague, compatibility will be vague too.

Make Compatibility Rules Boring and Strict

Most producers should be allowed to add optional fields, but not rename or silently change meaning. That rule alone prevents a large class of downstream breakage.

Allowed:
- add optional field
- widen enum with review

Not allowed:
- change units
- repurpose existing field
- delete field without migration window

Schema Reviews Are Product Reviews

Good schema design is not clerical work. It forces product clarity.

When someone cannot answer whether a value is pre-tax or post-tax, gross or net, local time or UTC, the product behavior is already ambiguous. The schema review just exposed it.

Warehouse Cleanups Are Too Late

Many organizations rely on dbt models or warehouse transforms to normalize upstream inconsistency. Those layers are useful, but they should not be the first line of defense.

Correcting bad semantics downstream is expensive because consumers already built expectations on the broken shape.

Meaning Scales Better Than Movement

A small platform with clear event contracts will outperform a sophisticated platform full of ambiguous payloads. Stable meaning makes every pipeline, dashboard, and model easier to build.

Build movement second. Build meaning first.

← Back to Home