There is a fantasy version of token deployment: designer changes a color in Figma, the plugin syncs to the repo, CI runs, tokens.css ships, everything is fine. The reality is that between "designer changes a color" and "everything is fine" there are about fifteen ways to corrupt a production stylesheet, three of which we discovered by breaking production.
The naive approach
The first version of our export pipeline was a script: read the token JSON, write the CSS custom properties, commit. It ran in CI on every push to main. It worked for about six weeks, until a designer accidentally set color.surface.paper to transparent while testing a dark mode variant. The commit went through. The homepage turned invisible.
The fix for that specific bug is easy: validate that surface colors are not transparent. But that approach doesn't scale. There are thousands of ways a valid-looking token value can be semantically wrong, and writing a validator for each one is not engineering — it's whack-a-mole.
> A token that passes JSON schema validation can still make your product invisible.
>
> — Post-mortem, October 2025
What can go wrong
We catalogued every token-related incident over twelve months. The failure modes cluster into four categories:
- Value drift: A token changes to a value that is valid but fails accessibility contrast against its typical background.
- Missing tokens: A component references a token that was renamed or deleted upstream.
- Circular references: A semantic token resolves to itself through an alias chain.
- Gamut violations: A color is specified in P3 space but consumed by a rendering environment that expects sRGB, producing clipped values.
None of these are caught by a JSON schema validator. All of them require semantic understanding of what a token means in context — which means writing checks that understand design intent, not just data structure.
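The circular-reference case is the one a schema validator is structurally incapable of catching, because the problem spans multiple tokens. A minimal sketch of alias resolution with cycle detection, assuming a hypothetical token format where a value wrapped in braces (e.g. "{color.brand.primary}") is an alias to another token:

```python
def resolve(tokens: dict[str, str]) -> dict[str, str]:
    """Resolve every alias chain, failing loudly on cycles or dangling names."""
    def follow(name: str, seen: tuple[str, ...]) -> str:
        if name in seen:
            # Revisiting a name mid-chain means the aliases form a loop.
            raise ValueError("circular alias: " + " -> ".join(seen + (name,)))
        if name not in tokens:
            raise KeyError(f"unresolved alias: {name}")
        value = tokens[name]
        if value.startswith("{") and value.endswith("}"):
            return follow(value[1:-1], seen + (name,))
        return value  # a concrete value terminates the chain

    return {name: follow(name, ()) for name in tokens}
```

The `seen` tuple doubles as the error message, so a failing CI run reports the full loop (`a -> b -> a`) rather than just the fact that one exists.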
The pipeline
The current pipeline has five stages, each of which can fail and block the merge. The first two stages run in under two seconds; the last three can take up to thirty, which is fast enough that designers wait for the result instead of context-switching away, and slow enough that cutting corners is always tempting. So far we haven't.
- Parse: Read the token JSON, resolve all aliases, build the dependency graph. Fail on circular references or unresolved aliases.
- Schema: Validate every value against its declared type. Hex colors must be valid hex. Numbers must be numbers. Fail on type mismatches.
- Contrast: For every color token with a declared role (surface, text, border), run WCAG 2.1 contrast against the expected background. Fail below AA.
- Regression: Diff the new token set against the last known-good snapshot. Flag any deletions or value changes above a configurable threshold.
- Visual: Render a set of reference components with the new tokens. Compare screenshots pixel-by-pixel against the baseline. Fail on diff above 0.5%.
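The contrast stage is the most self-contained of the five, so it makes a good illustration. This is a sketch of the underlying WCAG 2.1 math only; the mapping from a token's declared role to its expected background is assumed to happen elsewhere in the pipeline:

```python
def _luminance(hex_color: str) -> float:
    """Relative luminance per WCAG 2.1, from a #rrggbb hex string."""
    def channel(c: int) -> float:
        s = c / 255
        # Linearize the sRGB channel value before weighting.
        return s / 12.92 if s <= 0.03928 else ((s + 0.055) / 1.055) ** 2.4
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b)

def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio: (L_lighter + 0.05) / (L_darker + 0.05)."""
    lighter, darker = sorted((_luminance(fg), _luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

AA_NORMAL_TEXT = 4.5  # the threshold the stage enforces for text roles

def passes_aa(fg: str, bg: str) -> bool:
    return contrast_ratio(fg, bg) >= AA_NORMAL_TEXT
```

Black on white yields the maximum ratio of 21:1; two mid-greys like #777777 on #888888 land around 1.4:1 and fail, which is exactly the kind of value drift the stage exists to catch.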
Regression checks
The regression stage deserves more detail because it is the most nuanced. Not every token change is a regression: intentional brand updates should pass. The check is not "did this change?" but "did this change in a way that was expected?"
We solve this with a change manifest. When a designer initiates a brand update through UISqueezy Studio, they fill in which tokens are intentionally changing and why. The CI pipeline reads this manifest and skips regression warnings for declared changes. Undeclared changes still fail.
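The manifest logic can be sketched as a diff that consults the declared changes before flagging anything. The manifest shape here is an assumption for illustration (a "declared" map from token name to reason), and any value change counts; the configurable numeric threshold mentioned above is omitted for brevity:

```python
def regression_findings(old: dict[str, str],
                        new: dict[str, str],
                        manifest: dict) -> list[str]:
    """Return undeclared deletions and value changes between snapshots."""
    declared = set(manifest.get("declared", {}))
    findings = []
    for name, old_value in old.items():
        if name not in new:
            # Deletions are always flagged unless the manifest declares them.
            if name not in declared:
                findings.append(f"deleted without declaration: {name}")
        elif new[name] != old_value and name not in declared:
            findings.append(
                f"undeclared change: {name}: {old_value} -> {new[name]}")
    return findings  # non-empty list fails the stage
```

Declared changes pass silently, so an intentional brand update produces a clean run while a stray edit in the same commit still blocks the merge.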
The merge gate
Nothing ships to main without passing all five stages. This sounds obvious, but enforcing it required political capital the first time we blocked a VP's last-minute brand tweak at 4pm on a Friday. The pipeline was right. The tweak had dropped body text contrast below 3:1 on the landing page. We fixed it in twenty minutes with the right token value and shipped at 4:30.
The lesson isn't that CI is always right. The lesson is that CI gives you a reason to slow down that isn't personal. Nobody is saying the VP's color taste is wrong. The algorithm is saying the color doesn't pass contrast. Those are very different conversations.