A median is a story the fast requests tell about themselves. The slow ones are where the system is actually broken.

The composite hides where the system fails

Most dashboards publish p50 and p95 latency. Some publish p99. Almost none publish the raw series, and almost none publish the full distribution including the long tail. That is convenient for the vendor. It is also the single easiest place to bury a failure mode the buyer would have wanted to see. (Class C — observed across the integrations we audited in the last twelve months.)

Here is the mechanical reason. A p50 is the median: half the requests came back faster than that number, half slower. A p95 says only that one in twenty was worse than that number. Neither tells you anything about the shape of the slow tail — whether the slow requests were slow by a small margin or a large one, whether they clustered around a specific time of day, whether they all came from one upstream dependency, whether they all failed the same way before they came back. A composite metric is a sales artifact. A raw series is a diagnostic.
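A minimal sketch of that blind spot, using two hand-built series rather than real traffic. The numbers, the series names, and the nearest-rank `percentile` helper are all illustrative assumptions. Both series share the same p50, p95, and p99, and only the raw values show that one of them has a tail near fifty-eight seconds.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Two hypothetical 1000-request series, in milliseconds (made up for illustration).
healthy = [100] * 500 + [200] * 450 + [400] * 40 + [900] * 10
broken = [100] * 500 + [200] * 450 + [400] * 40 + [58000] * 10

for name, series in (("healthy", healthy), ("broken", broken)):
    summary = {f"p{q}": percentile(series, q) for q in (50, 95, 99)}
    print(name, summary, "max:", max(series))
```

The summaries print identically for both series. The difference only appears in `max`, or in any view of the raw values.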

The standard reference on how percentile-based monitoring lies by omission is engineering folklore by now (Class E — see Gil Tene's well-known "coordinated omission" talks from the early 2010s, which formalized why percentile reports from load generators systematically under-report tail latency when the system stalls). The point survives: if your monitoring only records the requests that were sent and finished, it cannot tell you about the requests the load generator never sent while the system was stalled.
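The effect can be sketched with a toy closed-loop load generator. This is an illustration under stated assumptions, not a model of any particular tool: the `simulate` function, the ten-millisecond schedule, and the single one-second stall are all hypothetical. The "corrected" latency is measured from the time each request was scheduled to be sent, which is one common way to account for coordinated omission.

```python
def simulate(interval_ms, service_times_ms):
    """Closed-loop generator: the next request is sent only after the previous one returns.

    Returns (naive, corrected) latency lists in milliseconds.
    naive     = response time measured from the actual send
    corrected = response time measured from the intended (scheduled) send
    """
    naive, corrected = [], []
    clock = 0
    for i, service in enumerate(service_times_ms):
        intended = i * interval_ms
        actual_send = max(clock, intended)  # a stall delays every later send
        done = actual_send + service
        naive.append(done - actual_send)
        corrected.append(done - intended)
        clock = done
    return naive, corrected

# Hypothetical run: 100 requests at 1 ms each, except request 50, which stalls for 1000 ms.
services = [1] * 100
services[50] = 1000
naive, corrected = simulate(10, services)

print("slow (>100 ms) by naive timing:    ", sum(1 for x in naive if x > 100))
print("slow (>100 ms) by corrected timing:", sum(1 for x in corrected if x > 100))
```

Under naive timing the stall shows up as one slow sample out of a hundred. Measured against the schedule, every request queued behind the stall is slow as well.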

One live SWU example

Two weeks ago a model-routing layer in one of our pipelines started returning answers about three seconds slower than usual. The p50 moved by about four hundred milliseconds. The p95 moved by about eight hundred. By any normal dashboard standard, that is a yellow light, not a red one.

The raw latency series told a different story. The slowest one percent of requests had gone from a tail around six seconds to a tail around fifty-eight seconds — not a thicker tail, an entirely new mode, sitting out beyond the chart's old scale. When we looked at which requests were landing in the new mode, every single one was hitting a particular provider region that had degraded silently. The composite metrics did not move enough to alert. The raw series made the failure obvious within about three minutes of opening the page. (Class C — captured in the incident receipt; the override-rate series confirmed the bad mode was producing answers the team threw away.)
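The diagnosis step amounts to filtering the raw rows to the slow mode and counting by a dimension. A rough sketch, assuming each request is stored as a row with a latency and a region; the field names, region labels, and threshold here are hypothetical and the data is invented for illustration.

```python
from collections import Counter

def tail_breakdown(records, threshold_ms, key):
    """Count the records slower than threshold_ms, grouped by the given field."""
    return Counter(r[key] for r in records if r["latency_ms"] > threshold_ms)

# Hypothetical per-request rows; a real series would come from request logs.
records = (
    [{"latency_ms": 900, "region": "region-a"}] * 600
    + [{"latency_ms": 1100, "region": "region-b"}] * 390
    + [{"latency_ms": 58000, "region": "region-c"}] * 10
)

slow_by_region = tail_breakdown(records, threshold_ms=10_000, key="region")
print(slow_by_region)
```

This grouping is only possible because the rows were kept. A stored p95 has no region attached to it.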

We rerouted around the bad region the same morning. We left the raw series visible on the page rather than smoothing it back into the percentile chart, because the next time the same shape appears we want the same three-minute diagnosis, not a four-hour investigation through averaged graphs.

Why we publish raw, not summarized

Three reasons. First, the slow tail is where the real failures live — timeouts, retries, dependency degradations, capacity ceilings — and the slow tail is exactly what a percentile summary throws away. Second, distributions interact: a latency dashboard read alongside the override rate and the refusal rate tells you something neither one tells you alone, and that joint reading is impossible if any of the three has been pre-summarized. Third, a raw series is auditable. A buyer can compute their own p50 from our numbers; they cannot reconstruct our numbers from our p50.
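The joint reading in the second reason can be sketched as an override rate computed per latency bucket. This assumes per-request rows carrying both a latency and an override flag; the `overridden` field, the bucket edges, and the data are hypothetical.

```python
def override_rate_by_latency(records, upper_edges_ms):
    """Override rate per latency bucket, keyed by each bucket's upper edge."""
    counts = {edge: [0, 0] for edge in upper_edges_ms}  # edge -> [overridden, total]
    for r in records:
        # First bucket whose upper edge covers this request's latency.
        edge = next(e for e in upper_edges_ms if r["latency_ms"] <= e)
        counts[edge][0] += int(r["overridden"])
        counts[edge][1] += 1
    # None marks a bucket that received no requests.
    return {e: (o / t if t else None) for e, (o, t) in counts.items()}

# Hypothetical rows: fast requests are rarely overridden, the slow mode always is.
records = (
    [{"latency_ms": 800, "overridden": False}] * 95
    + [{"latency_ms": 800, "overridden": True}] * 5
    + [{"latency_ms": 58000, "overridden": True}] * 10
)

rates = override_rate_by_latency(records, [1000, 10000, float("inf")])
print(rates)
```

Neither a latency percentile nor an overall override rate, taken alone, would show that the overrides are concentrated in the slow mode.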

The trade-off is honest: raw series are harder to read at a glance. We accept that. The dashboards that read well at a glance are the ones that have already done the hiding for you.

What a buyer should ask for

If a vendor is publishing latency as p50 / p95 only, ask for the raw series — or, at minimum, the full distribution down to p99.9, with sample counts. Ask whether the load generator records the requests it never sent because the system stalled. Ask what the slowest one percent looked like during the last incident, and which dependency caused it. Those four questions separate vendors who use latency as a sales number from vendors who use it as a diagnostic.
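For a sense of what "the full distribution, with sample counts" could look like, here is a small sketch. The `distribution_report` function, its nearest-rank method, and the sample data are assumptions made for illustration, not a description of any vendor's report.

```python
def distribution_report(samples_ms, permille_levels=(500, 950, 990, 999)):
    """Nearest-rank percentiles, each with the number of samples ranked above it.

    Levels are given per mille (999 means p99.9) so the rank arithmetic stays in integers.
    """
    ordered = sorted(samples_ms)
    n = len(ordered)
    rows = []
    for level in permille_levels:
        rank = max(1, (n * level + 999) // 1000)  # integer ceiling of n * level / 1000
        rows.append((f"p{level / 10:g}", ordered[rank - 1], n - rank))
    rows.append(("max", ordered[-1], 0))
    return n, rows

# Hypothetical 1000-request series in milliseconds, with a single extreme outlier.
samples = [100] * 500 + [200] * 450 + [400] * 40 + [900] * 9 + [58000]

n, rows = distribution_report(samples)
print(f"n = {n}")
for label, value_ms, samples_above in rows:
    print(f"{label:>6}  {value_ms:>6} ms  samples above: {samples_above}")
```

With a thousand samples, one request sits above p99.9, and here it is the fifty-eight-second one. That is why the sample count belongs next to the percentile: it tells the reader how many requests each number is actually standing on.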

We publish ours raw, on a page that does not flatter us. The honest chart is dull most days and useful on the bad ones. The flattering chart is exciting every day and useful on none of them.

If you want the broader posture this sits inside, the six receipts we publish on every engagement are described at /measurement-honesty-for-ai-projects and at /the-six-receipts-explained. The underlying architecture that makes the raw numbers available to the buyer at all is at /transparency-architecture-overview. If you want to apply this discipline to your own engagement, the entry point is /workshop.

Latency As An Honesty Metric: What Slow Reveals
