
essay / payments

Finding the latency spike nobody was looking for

Tiny, non-alerting spikes in TP99 graphs led to a sharding problem that only affected our most important customers.

At Expedia, I go on call for two weeks every couple of months. Most of on-call is reactive: alerts fire, you investigate, you fix. But I have always found it more interesting to scan the historical graphs looking for patterns that do not trigger alerts but probably should.

During one rotation, I noticed tiny spikes in the TP99 latency graphs. Not persistent. Not large enough to breach our SLA or SLO thresholds. But frequent during weekdays, and they had started appearing after we integrated a new brand called Egencia onto our payment platform.
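For the curious: the scan itself needs nothing fancy. Here is a minimal sketch of the idea in Python, assuming you can pull per-minute TP99 samples out of your metrics store; the window size, spike ratio, and alert threshold are placeholders, not our real values.

```python
import statistics

def find_quiet_spikes(tp99_samples, window=60, ratio=1.5, alert_threshold_ms=500):
    """Flag TP99 points that jump well above their local baseline
    but stay under the alerting threshold, so no page ever fires."""
    spikes = []
    for i in range(window, len(tp99_samples)):
        baseline = statistics.median(tp99_samples[i - window:i])
        value = tp99_samples[i]
        if value > baseline * ratio and value < alert_threshold_ms:
            spikes.append((i, value, baseline))
    return spikes
```

Bucketing the flagged points by day of week is enough to surface the kind of weekday pattern I was seeing.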

Following the thread

The spikes affected only a subset of calls from Egencia. I traced the exact code paths and found the pattern: the spikes occurred when a user with the “Executive Assistant” role logged in.

An executive assistant at Egencia typically manages payment instruments for everyone they support in their group: potentially dozens of personal cards, plus all the corporate cards registered under the group. Hundreds of instruments in total.

Our payment vault sharded data across multiple database partitions. For most users with a handful of cards, latency was consistent because their instruments landed on one or two partitions. But an executive assistant’s instruments could be spread across many partitions, requiring fan-out reads where the overall response is only as fast as the slowest partition. That is exactly the kind of access pattern that shows up in TP99 rather than in the median.
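I cannot share the vault’s actual shard key, but the shape of the problem is easy to sketch. Assume, purely for illustration, that instruments are placed by hashing the instrument ID across a fixed number of partitions; a “load all instruments for this user” call then has to fan out:

```python
import hashlib

NUM_PARTITIONS = 16  # hypothetical partition count

def partition_for(key: str) -> int:
    """Stable hash of a key onto a partition."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def partitions_touched(instrument_ids: list[str]) -> set[int]:
    """Partitions a 'load all instruments for this user' call must query."""
    return {partition_for(i) for i in instrument_ids}

# A traveler with three cards usually touches one or two partitions.
# An executive assistant with hundreds of instruments touches nearly
# all sixteen, and the read is gated by the slowest of them.
```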

The fix

We resharded the data for Egencia to ensure all payment instruments for a given organizational group stayed in one partition. The spikes disappeared completely, and latencies for these code paths aligned with everything else.
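Conceptually the change is just to the shard key: place instruments by the organizational group they belong to rather than by the instrument itself, so everything an executive assistant needs co-locates on one partition. A sketch, continuing the hypothetical hashing scheme above:

```python
import hashlib

NUM_PARTITIONS = 16  # same hypothetical partition count as above

def partition_for_group(group_id: str) -> int:
    """After the reshard, placement depends only on the organizational
    group, so a 'load everything for this group' read becomes a
    single-partition query."""
    digest = hashlib.sha256(group_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS
```

The trade-off is the usual one: a very large group becomes a hotspot on a single partition, so this only makes sense when a group’s instrument count stays comfortably within what one partition can serve.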

Turning observation into a metric

This incident led to a broader initiative. I analyzed how much each of our services contributed to the overall booking path latency; together, our services accounted for roughly 22% of total payment page latency. I proposed tracking this as a formal KPI, with a modest first-quarter goal of shaving half a percentage point off that contribution.
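The metric itself is nothing more than a ratio of trace timings. The numbers below are made up purely to show the shape of the calculation:

```python
def payment_contribution_pct(payment_services_ms: float, page_total_ms: float) -> float:
    """Share of end-to-end payment page latency attributable to our services."""
    return 100.0 * payment_services_ms / page_total_ms

# Illustrative only: 440 ms of payment-service time inside a
# 2000 ms payment page load works out to a 22% contribution.
print(payment_contribution_pct(440, 2000))  # 22.0
```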

We hit that goal by identifying a caching opportunity in one of our services. The metric became a standing item in quarterly town hall presentations. Over the following years, we brought it down from 22% to around 19%.

The insight was simple: our SLAs told us whether we were good enough. They did not tell us whether we were getting better. Adding a directional metric alongside the threshold metric changed how the team thought about performance from “are we within bounds” to “are we improving.”