~/work/otter-incident-triage

case study / ai infrastructure

OTTER: Transforming on-call with multi-agent AI incident triage

Architecting a system that ingests production incidents, queries telemetry through 33 MCP tools, and learns from engineer feedback to automate root cause analysis.

2024-2025 · Engineering Lead, Principal Software Engineer · Enterprise-scale on-call operations

Context

On-call at Microsoft datacenter infrastructure scale means investigating production incidents across systems generating billions of telemetry data points. Engineers spend hours manually querying Kusto dashboards, correlating signals across services, and building mental models of what went wrong. The process is highly dependent on tribal knowledge: experienced engineers know which queries to run and which signals matter, but that expertise doesn’t transfer easily to newer team members.

OTTER (On-call Troubleshooting & Triage Engine with Recall) is a multi-agent AI system designed to transform this workflow from manual dashboard analysis to AI-assisted triage.

Constraints

Production incident investigation requires high-confidence outputs: a wrong root cause analysis wastes engineering time and can lead to incorrect mitigations. The system must integrate with existing tooling (ICM for incidents, Kusto for telemetry, Grafana for visualization) rather than replace it. Engineers need to trust the system’s reasoning, which means transparency into how conclusions were reached. Finally, data sensitivity and compliance requirements around production telemetry constrain the design throughout.

Architecture

OTTER uses Azure OpenAI (GPT) with a Model Context Protocol (MCP) integration layer exposing 33 specialized tools. When an ICM incident arrives, the system orchestrates multiple AI agents that:

  1. Parse the incident metadata and identify the affected service and region
  2. Execute targeted Kusto telemetry queries through MCP tools to gather relevant signals (a sketch of one such tool follows this list)
  3. Correlate signals across services to identify root cause patterns
  4. Present findings with supporting evidence and suggested mitigation steps
  5. Learn from engineer feedback on accuracy, building a recall mechanism that improves future investigations
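
As an illustration of step 2, here is a minimal sketch of what one such MCP tool could look like, assuming the official `mcp` Python SDK (FastMCP) and the `azure-kusto-data` client; the tool name, cluster URL, and table/column names are invented for illustration and are not OTTER's actual tool surface.

```python
# Sketch of one MCP tool: a narrowly scoped Kusto query for error-rate
# signals. Assumes the official `mcp` Python SDK and azure-kusto-data;
# the tool name, cluster, and table/column names are illustrative.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("otter-telemetry")

CLUSTER = "https://example.kusto.windows.net"  # placeholder cluster URL
kusto = KustoClient(KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER))

@mcp.tool()
def error_rate_by_service(service: str, region: str, lookback_minutes: int = 60) -> list[dict]:
    """Per-minute error counts for one service/region over a lookback window."""
    # Real code should parameterize inputs rather than interpolate them;
    # inlined here to keep the sketch short.
    query = f"""
    ServiceErrors
    | where Service == '{service}' and Region == '{region}'
    | where Timestamp > ago({lookback_minutes}m)
    | summarize Errors = count() by bin(Timestamp, 1m)
    | order by Timestamp asc
    """
    result = kusto.execute("Telemetry", query)
    return [row.to_dict() for row in result.primary_results[0]]

if __name__ == "__main__":
    mcp.run()  # serve over stdio so the agent host can invoke the tool
```

The narrow signature is the point: a tool that answers one well-posed question is easier for the model to select correctly than a generic "run any query" escape hatch.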

The multi-agent design allows parallel investigation of different hypotheses, with a coordinator agent synthesizing results.
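
The fan-out/synthesize pattern itself is compact to express with asyncio; in the sketch below, `investigate_hypothesis` and `synthesize_findings` are hypothetical stand-ins for OTTER's agent calls, not its real interfaces.

```python
# Sketch of the coordinator pattern: investigate hypotheses in parallel,
# then synthesize. Function names and the Finding shape are assumptions.
import asyncio
from dataclasses import dataclass

@dataclass
class Finding:
    hypothesis: str
    evidence: list[str]
    confidence: float  # agent-reported, 0.0-1.0

async def investigate_hypothesis(hypothesis: str, incident_id: str) -> Finding:
    """One sub-agent: an LLM tool-use loop scoped to this hypothesis."""
    # Stubbed; the real loop would call MCP tools and accumulate evidence.
    return Finding(hypothesis=hypothesis, evidence=[], confidence=0.0)

def synthesize_findings(ranked: list[Finding]) -> str:
    # In OTTER this synthesis is itself agent work; a plain join stands in.
    return "\n".join(f"[{f.confidence:.2f}] {f.hypothesis}" for f in ranked)

async def triage(incident_id: str, hypotheses: list[str]) -> str:
    # Fan out: each hypothesis is investigated independently and in parallel.
    findings = await asyncio.gather(
        *(investigate_hypothesis(h, incident_id) for h in hypotheses)
    )
    # Coordinator step: rank by confidence before synthesizing a report.
    ranked = sorted(findings, key=lambda f: f.confidence, reverse=True)
    return synthesize_findings(ranked)
```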

Hard Problems

Grounding AI reasoning in real production telemetry without hallucinating correlations. Designing 33 MCP tools that are specific enough to be useful but general enough to cover the breadth of incident types. Handling the feedback loop: how do you capture implicit engineer knowledge when they correct the system, and how do you make that knowledge durable?
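
One plausible answer to the durability question is to capture each correction as a structured record keyed for retrieval on similar future incidents; a sketch under assumed field names, not OTTER's actual schema:

```python
# Sketch of a durable feedback record for the recall mechanism.
# All field names are assumptions for illustration, not OTTER's schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TriageFeedback:
    incident_id: str
    proposed_root_cause: str          # what the agent concluded
    corrected_root_cause: str         # what the engineer said instead
    queries_that_mattered: list[str]  # Kusto queries the engineer actually used
    retrieval_keys: list[str]         # service, region, symptom tags for recall
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

On a new incident, records with overlapping retrieval keys can be surfaced into the agent's context, so a past correction becomes a reusable hint instead of a one-off conversation.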

The trust problem is fundamental: on-call engineers under pressure will abandon a tool that’s wrong more than occasionally. The confidence threshold for adoption is much higher than for developer productivity tools.

Outcome

OTTER is in active development, with the MCP tool suite and core agent orchestration in place. The architecture has been validated against historical incidents, demonstrating the ability to identify root causes that match engineer conclusions. The system represents a shift in how the organization thinks about on-call: from purely reactive manual investigation to AI-augmented triage with institutional memory.

What I’d Do Differently

Building 33 MCP tools before validating the core agent loop was a sequencing mistake. We should have started with 5-8 tools covering the most common incident patterns and expanded based on where the agent failed. The breadth-first approach meant we built tools that were individually correct but didn’t compose well, because we hadn’t yet learned which tool combinations the agent actually needed for real investigations.

The trust problem deserved its own workstream from the beginning. We treated it as something that would resolve naturally as accuracy improved, but engineer trust in an incident triage tool is binary: they either use it under pressure or they don’t. Building explicit confidence scoring and showing the agent’s reasoning chain (not just its conclusions) should have been a launch requirement, not a follow-up feature.
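
Concretely, that launch requirement amounts to an output contract that carries evidence, not just a verdict; a hedged sketch of what such a contract could look like (field names are assumptions):

```python
# Sketch of a triage output that exposes reasoning, not just a conclusion.
# The structure is illustrative; field names are assumptions.
from dataclasses import dataclass

@dataclass
class EvidenceStep:
    tool: str          # which MCP tool was called
    query: str         # the exact Kusto query issued
    observation: str   # what the result showed, in the agent's words

@dataclass
class TriageReport:
    root_cause: str
    confidence: float                    # calibrated 0.0-1.0, shown to the engineer
    reasoning_chain: list[EvidenceStep]  # every step is auditable and re-runnable
    suggested_mitigations: list[str]
```

Because every step records the exact query issued, an engineer can re-run any link in the chain, which is what turns a confidence number into something verifiable under pressure.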

I would also invest earlier in synthetic incident generation for testing. Waiting for real production incidents to validate the system meant slow iteration cycles and inconsistent test coverage. A framework for replaying historical incidents with known root causes would have let us iterate on agent behavior daily rather than waiting for incidents to occur organically.
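
Such a replay harness can be small; a sketch that assumes the `triage` coroutine from the coordinator sketch above and a directory of labeled incident fixtures (both hypothetical):

```python
# Sketch of a historical-incident replay harness for daily iteration.
# The fixture format and `triage` import are assumptions for illustration.
import asyncio
import json
from pathlib import Path

from otter.coordinator import triage  # hypothetical module from the sketch above

async def replay_all(fixtures_dir: str = "fixtures/incidents") -> None:
    hits = total = 0
    for path in sorted(Path(fixtures_dir).glob("*.json")):
        case = json.loads(path.read_text())  # incident plus known root cause
        report = await triage(case["incident_id"], case["hypotheses"])
        # Crude substring match; a real harness would score structured fields.
        matched = case["known_root_cause"].lower() in report.lower()
        hits += matched
        total += 1
        print(f"{'PASS' if matched else 'FAIL'}  {path.name}")
    print(f"{hits}/{total} matched known root causes")

if __name__ == "__main__":
    asyncio.run(replay_all())
```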