case study / observability
Observability at hyperscale: telemetry anomaly detection across datacenters
Building KQL pipelines to detect SSO latency anomalies across datacenter hops, and solving the meta-problem of telemetry systems becoming their own bottleneck.
Context
At Microsoft Cloud + AI, the CIH Efficiency program encompasses several critical datacenter infrastructure services: Flex-MA, Hierarchical Capping (PowerCapping), Sub-Synchronous Oscillation (SSO) detection, and GUPS. These services operate at Microsoft’s datacenter scale, where telemetry volume is measured in billions of data points and latency anomalies can cascade across datacenter hops before traditional monitoring catches them.
I took ownership of the Athena codebase (the backbone of PowerCapping and telemetry programs) with zero prior documentation, inheriting a system where institutional knowledge had walked out the door.
Constraints
Extreme data volume flowing through Azure Data Explorer (Kusto). Query latency requirements that demanded sub-minute detection windows. Cross-datacenter correlation challenges where SSO events manifest differently depending on the hop path. Cost of telemetry infrastructure at this scale, where naive approaches to anomaly detection become their own bottleneck. An Athena codebase fragmented across branches, with no unified CI foundation.
Architecture
Designed and implemented the SSO Aggregator Pilot v1 for detecting sub-synchronous oscillation events across datacenter infrastructure. Partnered with Schneider to integrate Smart Connector v2, enabling new telemetry capabilities from hardware sensors. Led development of the foundational RackManager Telemetry Ingestion code in collaboration with the Azure core team, establishing the pipeline for rack-level power and performance data.
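To make the detection concrete, here is a minimal KQL sketch of the kind of query an aggregator like this can run, flagging latency anomalies per datacenter and hop path. The table and column names (SsoLatencyTelemetry, LatencyMs, Datacenter, HopPath) are illustrative placeholders, not the production schema.

```kusto
// Illustrative sketch: per-hop latency anomaly detection.
// SsoLatencyTelemetry, LatencyMs, Datacenter, HopPath are hypothetical names.
SsoLatencyTelemetry
| where Timestamp > ago(1h)
| make-series AvgLatency = avg(LatencyMs) default=0
    on Timestamp step 1m
    by Datacenter, HopPath
| extend (Anomalies, Score, Baseline) =
    series_decompose_anomalies(AvgLatency, 2.5)   // 2.5 is the tunable threshold
| mv-expand Timestamp to typeof(datetime),
            AvgLatency to typeof(double),
            Anomalies to typeof(int),
            Score to typeof(double)
| where Anomalies != 0
| project Timestamp, Datacenter, HopPath, AvgLatency, Score
```

Running series_decompose_anomalies in-engine keeps the seasonal/trend decomposition and scoring inside Kusto, which helps hold a sub-minute detection window without exporting raw telemetry to a separate system.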
For Athena, consolidated all branches into a unified CI foundation and automated 90% of deployments via PowerShell scripts, achieving 20x faster, safer releases. Established comprehensive documentation where none existed.
Hard Problems
The meta-problem: telemetry systems at this scale risk becoming their own bottleneck. Ingesting, processing, and querying billions of telemetry data points while maintaining detection latency that’s actually useful for incident response. False positive management and alert fatigue in an environment where the signal-to-noise ratio determines whether on-call engineers trust the system.
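One pattern that directly targets the alert-fatigue problem is to page only on sustained anomalies rather than single anomalous bins. A hedged sketch, using the same illustrative schema as above:

```kusto
// Illustrative sketch: alert only when an entity has been anomalous for at
// least 5 of the last 30 minutes, suppressing one-off spikes. Names are hypothetical.
SsoLatencyTelemetry
| where Timestamp > ago(30m)
| make-series AvgLatency = avg(LatencyMs) default=0
    on Timestamp step 1m
    by Datacenter, HopPath
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(AvgLatency, 3.0)
| mv-expand Anomalies to typeof(int)
| summarize AnomalousMinutes = countif(Anomalies != 0) by Datacenter, HopPath
| where AnomalousMinutes >= 5   // require a sustained anomaly before paging
```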
Inheriting a critical codebase with zero documentation required systematic reverse-engineering of implicit knowledge, then making that knowledge explicit and maintainable.
Outcome
Automated 90% of Athena deployments, achieving a 20x speed improvement and safer, repeatable release processes. The SSO Aggregator Pilot v1 was delivered. Smart Connector v2 integration enabled new telemetry capabilities. As OE Champion for the entire DIODE organization, led monthly operational excellence reviews and established standards that gave leadership clear visibility into operational posture.
As PowerCapping Service Owner, established SFI compliance, accountability structures, and metrics-driven decision making. Drove S360 remediation, migrated services from SAS tokens to Managed Identities, and conducted RCAs for Sev-2 incidents, hardening system reliability and security posture.
What I’d Do Differently
Inheriting Athena with zero documentation was the defining challenge. I would push harder for a dedicated knowledge-capture sprint before touching any code. We reverse-engineered the system while simultaneously fixing bugs and shipping features, which worked but created unnecessary risk. A two-week period of pure documentation, with the explicit goal of making the system survivable if I also left, would have been a better investment.
The CI consolidation should have happened first, not in parallel with feature work. We shipped the SSO Aggregator while still running fragmented branches, which meant every deployment carried uncertainty about which branch state was canonical. Establishing a single source of truth for the codebase before building on top of it seems obvious in retrospect.
On the telemetry architecture itself: our anomaly detection thresholds were initially tuned by intuition rather than data. This led to a predictable cycle of alert fatigue, then loosened thresholds, then missed incidents. I would invest in a structured calibration process using historical incident data from the start, treating threshold tuning as a data problem rather than an engineering judgment call.
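As a sketch of what that calibration could look like: replay a candidate threshold over historical telemetry, score it against a labeled incident table, then sweep the candidate value and pick the precision/recall trade-off you can live with. The IncidentHistory table and its columns are hypothetical here, as is the rest of the schema.

```kusto
// Illustrative backtest sketch: evaluate one candidate threshold against labeled
// incidents. SsoLatencyTelemetry, IncidentHistory, LatencyMs, StartTime are hypothetical.
let CandidateThreshold = 2.5;   // re-run per candidate value to sweep the threshold
SsoLatencyTelemetry
| where Timestamp > ago(90d)
| make-series AvgLatency = avg(LatencyMs) default=0
    on Timestamp step 5m
    by Datacenter
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(AvgLatency)
| mv-expand Timestamp to typeof(datetime), Score to typeof(double)
| extend Flagged = abs(Score) >= CandidateThreshold
| join kind=leftouter (
    IncidentHistory
    | project Datacenter, Timestamp = bin(StartTime, 5m), WasIncident = true
  ) on Datacenter, Timestamp
| summarize
    TruePositives   = countif(Flagged and WasIncident == true),
    FalsePositives  = countif(Flagged and isnull(WasIncident)),
    MissedIncidents = countif(not(Flagged) and WasIncident == true)
```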