essay / ai
From coder to code reviewer: the skill shift nobody's talking about
When AI generates the code, your job shifts from writing to verifying. Here is a practical playbook for the new reality.
The manager’s question
During a routine sync, my manager asked a question I couldn't answer cleanly: “How do we actually know the AI-generated code is good?” It forced me to admit that our verification process hadn't kept pace. Passing tests and a casual code review were no longer adequate safeguards.
The uncomfortable truth: AI code fails differently
AI-generated code presents unique challenges:
- Deceptive appearance: Clean formatting and proper structure mask potential issues
- Misaligned solutions: Code can confidently solve the wrong problem, with no signal that business context is missing
- Unexamined defaults: Patterns borrowed from training data may not suit specific requirements like performance or security needs
These failure modes differ fundamentally from typical human coding errors, and they call for different verification approaches.
Why traditional code review is not enough anymore
Standard code review assumes human authors understood problems and made conscious decisions. With AI-generated code:
- No genuine problem comprehension exists
- No deliberate architectural choices were made
- Mistakes involve semantic mismatches rather than syntax errors
The implicit review contract breaks down, necessitating new techniques.
The verification playbook: what actually works
1. Round-trip spec verification
This technique involves:
Step 1: Write detailed specifications in spec-original.md
Step 2: Feed specs to AI for implementation
Step 3: Ask a different AI model to read the generated code and produce spec-derived.md describing what it actually does
Step 4: Diff the specifications
Gaps in the derived spec are lost requirements; additions are unintended behaviors. This approach is effective at catching “intent drift.” You can automate it in CI, flagging divergence that exceeds a threshold; for us, this technique catches 60-70% of the AI-related issues we log each quarter.
Practical implementation tips:
- Use different models for generation versus verification
- Be specific in original specifications
- Automate comparison using diff tools or third-party AI analysis
- Run iteratively for critical code paths
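To make this concrete, here is a minimal sketch of the diff pipeline in Python. It assumes the OpenAI Python SDK as the client for the verification model; the model name, file paths, and similarity threshold are all placeholders to adapt to your stack.

```python
# round_trip.py -- minimal sketch of round-trip spec verification.
import difflib
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def derive_spec(code: str) -> str:
    """Ask a different model than the generator to describe the code."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any model other than the one that wrote the code
        messages=[
            {"role": "system", "content": (
                "Read this code and write a requirements spec in markdown "
                "describing what it actually does. Do not guess at intent.")},
            {"role": "user", "content": code},
        ],
    )
    return response.choices[0].message.content

original = Path("spec-original.md").read_text()
derived = derive_spec(Path("generated_module.py").read_text())  # placeholder path
Path("spec-derived.md").write_text(derived)

# Gaps are lost requirements; additions are unintended behavior.
similarity = difflib.SequenceMatcher(None, original, derived).ratio()
if similarity < 0.8:  # threshold to tune per codebase
    diff = difflib.unified_diff(
        original.splitlines(), derived.splitlines(),
        fromfile="spec-original.md", tofile="spec-derived.md", lineterm="",
    )
    print("\n".join(diff))
    raise SystemExit("Spec drift exceeds threshold -- flag for human review.")
```

Raw text similarity is a crude proxy; for critical paths, having a third model compare the two specs semantically (the “third-party AI analysis” above) works better.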
2. Test-driven development (flipped)
Instead of generating code then writing tests, reverse the sequence:
- Write failing tests defining required behavior
- Feed tests plus context to AI with instructions to pass them
- Run test suite; iterate if failures occur
- Review code focusing on implementation quality, not intent verification
This inverts the trust model, constraining AI through specifications rather than validating afterward.
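As a starting point, the “failing tests first” step can be as simple as a pytest file that pins down behavior before any implementation exists. Everything here is illustrative: parse_price and the pricing module are hypothetical names.

```python
# test_parse_price.py -- written before any implementation exists.
# The AI's only job is to make these tests pass.
import pytest

from pricing import parse_price  # module does not exist yet

def test_parses_plain_dollars():
    assert parse_price("$19.99") == 1999  # cents, to avoid float money bugs

def test_strips_whitespace_and_commas():
    assert parse_price(" $1,234.50 ") == 123450

def test_rejects_negative_amounts():
    with pytest.raises(ValueError):
        parse_price("$-5.00")
```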
3. Property-based testing
Rather than checking specific input-output pairs, define invariant properties that must hold for all inputs. For example, a sorting function should satisfy:
- Output length equals input length
- All output elements exist in input
- Each element is less than or equal to the next
Randomized input generation then surfaces the edge-case failures that AI-generated code commonly exhibits.
Recommended tools: Hypothesis (Python), fast-check (JavaScript/TypeScript), PropEr (Erlang/Elixir), QuickCheck (Haskell).
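With Hypothesis, the three sorting invariants above translate almost line for line. Here my_sort is a stand-in for whatever AI-generated function you are verifying:

```python
# Property test for the sorting invariants above, using Hypothesis.
from hypothesis import given, strategies as st

from mymodule import my_sort  # hypothetical import of the generated code

@given(st.lists(st.integers()))
def test_sort_invariants(xs):
    result = my_sort(xs)
    assert len(result) == len(xs)        # output length equals input length
    assert sorted(result) == sorted(xs)  # same elements (checks both directions)
    assert all(a <= b for a, b in zip(result, result[1:]))  # non-decreasing order
```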
4. Mutation testing
Mutation tools automatically modify your code (swapping operators, removing lines) and re-run the test suite against each mutant. Mutants that survive reveal tests that provide false confidence rather than meaningful validation.
Recommended tools: Stryker (JavaScript/TypeScript/C#), mutmut (Python), pitest (Java).
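To see why this matters, consider what a single mutant does to a weak test suite. The discount function below is a made-up example; tools like mutmut or Stryker generate mutants like the commented one automatically and report which ones no test kills.

```python
# discount.py -- made-up function under test.
def apply_discount(price, pct):
    return price * (1 - pct / 100)

# A typical mutant a tool would generate (operator swapped):
#     return price * (1 + pct / 100)

# This test SURVIVES the mutant: with pct == 0, both versions return 100.
# The suite passes while validating nothing -- false confidence.
def test_no_discount():
    assert apply_discount(100, 0) == 100

# This test kills the mutant: it returns 150, not 50.
def test_half_off():
    assert apply_discount(100, 50) == 50
```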
5. Adversarial AI review
Have a second AI model review the first model’s output. Different models have different blind spots, so the second pass provides a complementary perspective. This serves as an automated pre-filter before human review.
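A pre-filter like this can be a short script in CI. The sketch below assumes the OpenAI SDK again; the model name is a placeholder and should differ from whatever generated the code, and wiring the output into a PR comment is left to your CI system.

```python
# adversarial_review.py -- second-model pre-filter, run in CI before human review.
import subprocess

from openai import OpenAI

client = OpenAI()

diff = subprocess.run(
    ["git", "diff", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout

review = client.chat.completions.create(
    model="gpt-4o",  # placeholder: use a model other than the generator
    messages=[
        {"role": "system", "content": (
            "You are reviewing a diff written by another AI model. List concrete "
            "bugs, security issues, and mismatches with apparent intent. "
            "Reply 'LGTM' only if you find nothing.")},
        {"role": "user", "content": diff},
    ],
)
print(review.choices[0].message.content)  # surface as a PR comment via your CI
```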
6. Contract testing
For APIs and services, define request/response schemas and interface contracts before code generation. Automated validation ensures generated code conforms to specifications.
Recommended tools: Pact (multi-language), Dredd (API testing), OpenAPI validators.
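In its simplest form, this can be a JSON Schema written before generation and enforced in tests against every response the generated handler produces. The endpoint and schema below are illustrative, using the jsonschema library:

```python
# Contract check for a hypothetical order endpoint. The schema exists
# before any code is generated; tests validate each response against it.
from jsonschema import ValidationError, validate

ORDER_RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["order_id", "total_cents", "status"],
    "properties": {
        "order_id": {"type": "string"},
        "total_cents": {"type": "integer", "minimum": 0},
        "status": {"enum": ["pending", "paid", "shipped"]},
    },
    "additionalProperties": False,  # generated code may not add surprise fields
}

def assert_conforms(payload: dict) -> None:
    try:
        validate(instance=payload, schema=ORDER_RESPONSE_SCHEMA)
    except ValidationError as err:
        raise AssertionError(f"Generated handler broke the contract: {err.message}")
```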
The workflow: putting it all together
Start with round-trip verification
Before examining any code, run the spec comparison. Intent problems should be settled before implementation review begins.
Tag everything
Label AI-generated versus AI-assisted PRs distinctly, enabling pattern analysis and improvement tracking.
Tiered review intensity
Tier 1 (Low risk): Tests, documentation, boilerplate. Standard review.
Tier 2 (Medium risk): Business logic, data transformations. Requires property-based testing plus intent verification.
Tier 3 (High risk): Security-sensitive code, financial calculations. Full verification suite mandatory.
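Tier assignment can be automated with a simple path-based router. The globs and tiers below are examples to adapt to your repository layout, not a prescription:

```python
# Hypothetical tier router: a PR's tier is the highest tier of any file it touches.
from fnmatch import fnmatch

TIER_RULES = [
    ("src/auth/*", 3), ("src/billing/*", 3),         # security-sensitive, money
    ("src/services/*", 2), ("src/transforms/*", 2),  # business logic
    ("tests/*", 1), ("docs/*", 1),                   # low risk
]

def review_tier(changed_files: list[str]) -> int:
    tier = 1
    for path in changed_files:
        for pattern, rule_tier in TIER_RULES:
            if fnmatch(path, pattern):
                tier = max(tier, rule_tier)
    return tier
```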
Incremental trust boundaries
Expand AI scope based on measured data rather than intuition. Track revert rates, then broaden permissions accordingly.
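The measurement itself can stay simple. A sketch, assuming you export PR records with origin labels and revert flags from your tracker (the record shape and label names are assumptions):

```python
# Revert rate per code origin, from labeled PR records.
from collections import Counter

def revert_rates(prs: list[dict]) -> dict[str, float]:
    merged, reverted = Counter(), Counter()
    for pr in prs:
        origin = pr.get("origin", "human")  # e.g. "ai-generated", "ai-assisted"
        merged[origin] += 1
        if pr.get("reverted"):
            reverted[origin] += 1
    return {origin: reverted[origin] / merged[origin] for origin in merged}
```

Widen the AI's Tier 2 scope only when its measured rate holds up against the human baseline.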
Post-merge monitoring
Track the performance of AI-generated code separately after merging: tag errors by origin, monitor regressions and security findings, and build dashboards that segment these metrics by code origin.
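One lightweight way to tag errors by origin is a structured-logging field your dashboards can segment on. CODE_ORIGIN below is an assumed provenance registry (module to origin), however you choose to record it:

```python
# Tag runtime errors with a code-origin field for dashboards to segment on.
import logging

logger = logging.getLogger("app")

CODE_ORIGIN = {"pricing": "ai-generated", "auth": "human"}  # illustrative registry

def log_error(module: str, err: Exception) -> None:
    logger.error(
        "unhandled error in %s: %s", module, err,
        extra={"code_origin": CODE_ORIGIN.get(module, "unknown")},
    )
```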
The tools that make this practical
For prevention:
- GitHub Copilot custom instructions: .github/copilot-instructions.md files with team standards
- Context files like CLAUDE.md
For review detection:
- Semgrep and CodeQL for anti-pattern rules
- Custom linting targeting codebase patterns
- Adversarial model review in CI
For verification:
- Hypothesis/fast-check for property testing
- Stryker/mutmut for mutation testing
- Pact or schema validators for contracts
- Simple diff pipelines using LLM APIs
For tracking:
- PR labeling systems
- Grafana/Datadog dashboards segmented by code origin
- Spreadsheet revert rate tracking
The real shift: from writer to reviewer
The fundamental transformation is not writing code with AI. It is verifying code you did not write. This requires:
- Understanding what correctness looks like
- Recognizing architectural tradeoffs
- Detecting technically correct but architecturally unsuitable solutions
- Evaluating whether common patterns fit specific systems
Junior engineers have not yet built this pattern recognition; senior engineers become the verification layer that keeps AI-generated code shippable. Experience becomes more valuable, not less.
What I told the manager
The complete framework:
- Round-trip spec verification on all AI-generated PRs
- TDD-first workflows with AI implementation following specifications
- Automated testing techniques addressing AI-specific failure modes
- Separate tagging enabling reliable data collection
- Incremental trust boundaries based on demonstrated reliability
- Post-merge monitoring held to the same quality standards as the rest of the codebase
Trust, it turns out, is the whole game. Teams that build verification capability now will hold a significant advantage once AI assistance is standard across engineering organizations.