case study / ai infrastructure
Building a coding agent before the category existed
An auto-PR-from-tickets system at Microsoft, and what it taught us about the architecture decisions the industry would rediscover years later.
Context
At Microsoft Cloud + AI, I led AI adoption and innovation initiatives across a 600-developer organization in the CIH Efficiency program. The challenge: engineering teams were adopting AI coding tools inconsistently, with most developers spending 3-4 revision cycles to get AI-generated code to production quality. We needed a systematic approach to make AI tools effective at enterprise scale, not just available.
Given the state of LLMs at the time, raw code generation was unreliable without deep context. The industry was still discovering that the hard problem wasn’t getting LLMs to write code, but getting them to write the right code for a specific codebase, team conventions, and production constraints.
Constraints
Enterprise security requirements at Microsoft scale. Code review policies that couldn’t be bypassed. Integration with existing CI/CD pipelines across dozens of repositories. Model reliability limitations that required human-in-the-loop validation. The need to work across heterogeneous codebases (Python, C#, PowerShell, TypeScript) with different conventions and tooling.
Architecture
The solution was Spected (COI initiative), a context engineering framework that makes any repository AI-agent-ready in 20 minutes. Rather than building a monolithic agent, we focused on the meta-problem: giving AI agents the right context to produce correct code on the first attempt.
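Spected’s internals aren’t published here, but to make “AI-agent-ready” concrete, here is a minimal sketch assuming a manifest-based design: per-repository knowledge (languages, standards docs, team rules) collected into one structured artifact that an agent loads before generating code. Every name below (`ContextManifest`, `build_manifest`, the `docs/standards/` path) is hypothetical, not Spected’s actual API.

```python
"""Hypothetical sketch of a repo "context manifest" -- not Spected's design."""
from collections import Counter
from dataclasses import asdict, dataclass, field
from pathlib import Path
import json

# Map file extensions to language names for a quick repo survey.
EXT_TO_LANG = {".py": "python", ".cs": "csharp", ".ps1": "powershell", ".ts": "typescript"}


@dataclass
class ContextManifest:
    """Everything an agent should load before touching the repo (assumed schema)."""
    repo_name: str
    languages: list[str]
    standards_docs: list[str]  # coding-standard docs checked into the repo
    conventions: dict[str, str] = field(default_factory=dict)


def build_manifest(repo_root: Path) -> ContextManifest:
    """Survey a repository and assemble the manifest an agent reads first."""
    counts = Counter(
        EXT_TO_LANG[p.suffix] for p in repo_root.rglob("*") if p.suffix in EXT_TO_LANG
    )
    standards = sorted(
        str(p.relative_to(repo_root)) for p in repo_root.glob("docs/standards/*.md")
    )
    return ContextManifest(
        repo_name=repo_root.name,
        languages=[lang for lang, _ in counts.most_common()],
        standards_docs=standards,
        conventions={"error_handling": "raise typed exceptions; never return None on failure"},
    )


if __name__ == "__main__":
    manifest = build_manifest(Path("."))
    # Persist so every agent session starts from the same grounding.
    Path("agent_context.json").write_text(json.dumps(asdict(manifest), indent=2))
```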
The framework deployed battle-tested prompts, coding standards, and multi-agent workflows. Alongside this, I built PRD2ADO Agent, which converts unstructured PM requirements into structured, Spected-ready engineering inputs, closing the gap between product intent and engineering execution.
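To illustrate the shape of that gap-closing step, a hedged sketch of a PRD-to-work-item transform: the target schema, the prompt, and the validation below are all assumptions on my part, and the LLM call itself is left out. None of this is PRD2ADO’s real interface; the point is that unstructured intent becomes a schema you can validate before it reaches an engineering board.

```python
"""Hypothetical PRD-to-work-item transform in the spirit of PRD2ADO."""
from dataclasses import dataclass
import json


@dataclass
class WorkItem:
    """A structured, Spected-ready engineering input (assumed schema)."""
    title: str
    acceptance_criteria: list[str]
    affected_repos: list[str]
    out_of_scope: list[str]


PROMPT_TEMPLATE = """Extract engineering work items from this PRD.
Return JSON: a list of objects with keys
title, acceptance_criteria, affected_repos, out_of_scope.

PRD:
{prd}
"""


def parse_work_items(llm_json: str) -> list[WorkItem]:
    """Validate the model's output against the schema before anyone acts on it."""
    items = []
    for raw in json.loads(llm_json):
        item = WorkItem(**raw)  # raises TypeError on missing or unknown keys
        if not item.acceptance_criteria:
            raise ValueError(f"work item {item.title!r} has no acceptance criteria")
        items.append(item)
    return items


if __name__ == "__main__":
    sample = json.dumps([{
        "title": "Add retry to payment client",
        "acceptance_criteria": ["retries 3x with exponential backoff"],
        "affected_repos": ["payments"],
        "out_of_scope": ["API surface changes"],
    }])
    print(parse_work_items(sample))
```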
Hard Problems
Grounding LLM output in large, complex codebases with years of accumulated conventions and implicit knowledge. Handling the variance between how different teams structure their code, name things, and handle errors. Ensuring that AI-generated code meets not just functional correctness but also the team’s quality bar for readability, testability, and operational safety.
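One way to make the variance problem concrete, an illustrative assumption rather than the documented approach, is to represent conventions as per-team profiles that override org-wide defaults before anything reaches the agent’s prompt:

```python
# Hypothetical: per-team convention profiles layered over org-wide defaults,
# so one framework can serve teams with different naming and error-handling rules.

ORG_DEFAULTS = {
    "naming": "snake_case functions, PascalCase classes",
    "errors": "raise typed exceptions; never swallow",
    "tests": "every public function gets a unit test",
}

TEAM_OVERRIDES = {
    "billing": {"errors": "return Result objects; exceptions only for bugs"},
    "infra": {"naming": "PowerShell Verb-Noun cmdlet names"},
}


def conventions_for(team: str) -> dict[str, str]:
    """Merge org defaults with a team's overrides; the result is rendered
    into the system prompt so generated code matches local style."""
    return {**ORG_DEFAULTS, **TEAM_OVERRIDES.get(team, {})}
```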
The deeper challenge was organizational: convincing 600+ developers to change their workflows required demonstrating clear, measurable value rather than just showing impressive demos.
Outcome
Spected was adopted org-wide across 600+ developers, establishing context engineering as the standard for AI-assisted development. AI code revision cycles dropped from 3-4 rounds to production-ready on the first attempt. The VP demo of Spected generated significant leadership interest in AI innovation across the organization.
What I’d Do Differently
The biggest lesson was that context engineering matters more than model capability. We spent early cycles evaluating different models when the real leverage was in how we structured the context window: what repository knowledge to include, how to represent coding standards, and how to scope the task boundary so the model could succeed. The industry arrived at this same conclusion roughly a year later, but we could have gotten there faster if we had treated context as a first-class engineering problem from the start rather than an input formatting exercise.
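As a concrete, hypothetical rendering of what treating context as a first-class engineering problem can mean in code: assemble the window from prioritized sections under an explicit token budget, so the task boundary and coding standards survive before bulk repository context does. The section labels, priorities, and budget below are illustrative, not Spected’s actual mechanism.

```python
# Hypothetical sketch: build a context window from prioritized sections
# under an explicit token budget, instead of ad-hoc concatenation.

def approx_tokens(text: str) -> int:
    """Cheap estimate (~4 chars/token); a real system would use a tokenizer."""
    return len(text) // 4


def assemble_context(sections: list[tuple[int, str, str]], budget: int) -> str:
    """sections: (priority, label, text); lower priority number = more essential.
    Fills the budget in priority order so scope and standards are never the
    parts that get dropped."""
    parts, used = [], 0
    for _, label, text in sorted(sections):
        cost = approx_tokens(text)
        if used + cost > budget:
            continue  # skip whole sections rather than truncate mid-section
        parts.append(f"## {label}\n{text}")
        used += cost
    return "\n\n".join(parts)


window = assemble_context(
    [
        (0, "Task boundary", "Add retry logic to the payment client; do not touch the API surface."),
        (1, "Coding standards", "snake_case; typed exceptions; all new code unit-tested."),
        (2, "Relevant files", "payments/client.py ..."),
        (3, "Repo overview", "Monorepo, Python services under /src ..."),
    ],
    budget=8000,
)
```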
I would also build the feedback loop earlier. We designed Spected as a framework that teams configure, but the most valuable signal came from watching where the AI still produced code that required revision. Instrumenting those failure modes systematically, rather than gathering anecdotal reports, would have accelerated iteration on the prompt architecture.
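A minimal sketch of that instrumentation, with assumed category names: tag each human revision of AI-generated code, then rank the categories to decide what the prompt architecture should fix next.

```python
# Hypothetical failure-mode instrumentation: the systematic signal we
# gathered anecdotally instead.

from collections import Counter

REVISION_CATEGORIES = [
    "wrong_convention",  # code works but violates team style
    "missing_context",   # model lacked a file or decision it needed
    "bad_scope",         # touched code outside the task boundary
    "functional_bug",
]

revision_log: list[tuple[str, str]] = []  # (pull_request_id, category)


def record_revision(pr_id: str, category: str) -> None:
    if category not in REVISION_CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    revision_log.append((pr_id, category))


def top_failure_modes(n: int = 3) -> list[tuple[str, int]]:
    """The ranked signal that would have driven prompt-architecture iteration."""
    return Counter(cat for _, cat in revision_log).most_common(n)
```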
On the organizational side, I underestimated how much adoption depends on meeting developers where they are. The engineers who adopted Spected fastest were already comfortable with AI tools. The ones who needed it most were often skeptical and needed hands-on pairing sessions rather than documentation. If I were doing this again, I would allocate more time to embedded enablement and less to written guides.