~/writing/coding-challenges-ai-collaboration

essay / hiring

Designing coding challenges that test AI collaboration, not just coding

Three interview challenges built to evaluate what matters in the AI era: how candidates think with AI tools, not just what they produce.

I have been writing about how technical interviews are broken in the AI era. Candidates can prompt their way through any algorithm question. Take-homes compile identically whether the author understands the code or not. The signal is gone.

So I built three challenges designed to evaluate how candidates collaborate with AI, not just whether they produce working code.

Challenge 1: the sync engine

A C# challenge where candidates build a real-time sync engine between a legacy CRM and Microsoft Dynamics 365. Four phases in 60 minutes: entity mapping with non-trivial field transforms (splitting full names, converting Unix epochs, mapping status strings to enum pairs), conflict resolution with three strategies, event-driven processing with auto-linking across entities, and a stretch goal for caching with TTL.
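To make phase one concrete, here is a minimal sketch of the kind of transform it asks for. The type names (LegacyContact, DynamicsContact, ContactMapper) are mine, not the challenge scaffold's, and the status mappings are illustrative:

```csharp
using System;

// Hypothetical record shapes; the real challenge defines its own scaffold.
public record LegacyContact(string FullName, long CreatedEpoch, string Status);
public record DynamicsContact(string FirstName, string LastName,
                              DateTimeOffset CreatedOn, ContactState State, int StatusReason);

public enum ContactState { Active = 0, Inactive = 1 }

public static class ContactMapper
{
    public static DynamicsContact Map(LegacyContact source)
    {
        // Split "First Last" on the first space; everything after it is the last name.
        var space = source.FullName.IndexOf(' ');
        var first = space < 0 ? source.FullName : source.FullName[..space];
        var last  = space < 0 ? string.Empty   : source.FullName[(space + 1)..];

        // Unix epoch seconds -> DateTimeOffset.
        var createdOn = DateTimeOffset.FromUnixTimeSeconds(source.CreatedEpoch);

        // One legacy status string maps to a (state, status reason) pair.
        var (state, reason) = source.Status switch
        {
            "ACTIVE"  => (ContactState.Active, 1),
            "DORMANT" => (ContactState.Inactive, 2),
            _         => throw new ArgumentException($"Unknown status: {source.Status}")
        };

        return new DynamicsContact(first, last, createdOn, state, reason);
    }
}
```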

The phases are deliberately ordered so that early architectural decisions constrain later ones. A candidate who lets AI generate phase one without thinking about the data model will struggle in phase three when the event handler needs to cross-reference entities. The AI will happily generate each phase in isolation. Only a candidate who understands the overall design can make the phases cohere.
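Here is a sketch of that coupling, assuming a simple in-memory ID map (EntityIdMap and OrderSyncHandler are illustrative names, not the challenge's). The phase-three handler can only auto-link if phase one recorded the ID correspondences somewhere queryable:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical: a map from legacy IDs to Dynamics IDs, populated during
// phase-one mapping and consulted by phase-three event handlers.
public class EntityIdMap
{
    private readonly ConcurrentDictionary<string, Guid> _legacyToDynamics = new();

    public void Register(string legacyId, Guid dynamicsId) =>
        _legacyToDynamics[legacyId] = dynamicsId;

    public bool TryResolve(string legacyId, out Guid dynamicsId) =>
        _legacyToDynamics.TryGetValue(legacyId, out dynamicsId);
}

public class OrderSyncHandler
{
    private readonly EntityIdMap _ids;
    public OrderSyncHandler(EntityIdMap ids) => _ids = ids;

    // Auto-linking: an incoming order event can only be attached to its
    // contact if the earlier contact sync recorded the ID correspondence.
    public void OnOrderCreated(string legacyOrderId, string legacyContactId)
    {
        if (!_ids.TryResolve(legacyContactId, out var contactId))
        {
            // The contact has not synced yet; queue or dead-letter the event.
            return;
        }
        // ... create the Dynamics order linked to contactId ...
    }
}
```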

Challenge 2: the ML bug hunt

An inherited fraud detection pipeline that claims 99% accuracy in development but drops to roughly 50% in production. Candidates have 90 minutes to find and fix at least four of five intentional bugs: data leakage from sorting before splitting, scaler fit on train+test data, no class weighting for imbalanced data, accuracy as the metric on a 5% fraud rate, and training accuracy reported as the final metric.
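The pipeline's language isn't the point, so here is the core leakage pattern as a language-agnostic sketch, rendered in C# for consistency with the other challenges. BuggySplit and CorrectSplit are illustrative names; the principle is that any statistic fit before the split has already seen the test set:

```csharp
using System;
using System.Linq;

public static class LeakageDemo
{
    public static (double[] train, double[] test) BuggySplit(double[] values)
    {
        // Bug: scaling is fit on train+test together, so the test set
        // leaks its mean and variance into the training features.
        var mean = values.Average();
        var std = Math.Sqrt(values.Average(v => (v - mean) * (v - mean)));
        var scaled = values.Select(v => (v - mean) / std).ToArray();

        var cut = (int)(scaled.Length * 0.8);
        return (scaled[..cut], scaled[cut..]);
    }

    public static (double[] train, double[] test) CorrectSplit(double[] values)
    {
        // Fix: split first, fit the scaler on train only, apply it to both.
        var cut = (int)(values.Length * 0.8);
        var train = values[..cut];
        var mean = train.Average();
        var std = Math.Sqrt(train.Average(v => (v - mean) * (v - mean)));
        return (train.Select(v => (v - mean) / std).ToArray(),
                values[cut..].Select(v => (v - mean) / std).ToArray());
    }
}
```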

This challenge is brutal on candidates who rely on AI without understanding. An AI assistant will confidently explain that the pipeline looks correct. The bugs are all patterns that produce plausible-looking results. Finding them requires understanding why the numbers are too good, which requires understanding what the numbers should look like.

Challenge 3: the code review

A .NET API with eight intentional flaws: a data privacy leak (supplier costs exposed to public endpoints), negative inventory from insufficient stock validation, a fat controller with all business logic inline, an N+1 query pattern, magic strings for status, hardcoded tax logic, thread-unsafe static collections, and no transaction semantics across stock decrements.
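For flavor, a hypothetical condensation of the kind of code under review. The real challenge spreads its flaws across a full API; this compresses several into one endpoint:

```csharp
using System.Collections.Generic;
using Microsoft.AspNetCore.Mvc;

[ApiController, Route("orders")]
public class OrdersController : ControllerBase
{
    // Flaw: thread-unsafe static collection shared across requests.
    private static readonly Dictionary<int, int> Stock = new();

    [HttpPost("{productId:int}")]
    public IActionResult Order(int productId, int quantity)
    {
        // Flaws: no stock validation, so inventory can go negative;
        // no transaction, so concurrent requests race on the decrement.
        Stock[productId] -= quantity;

        // Flaws: hardcoded price and tax logic, a magic status string,
        // and business logic living directly in the controller.
        var total = quantity * 9.99m * 1.20m;
        return Ok(new { status = "CONFIRMED", total });
    }
}
```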

Candidates choose one of three tasks: extend the API, review the code, or fix a specific bug. The choice itself is a signal. Candidates who jump to implementation without reading the existing code miss the architectural issues. Candidates who review first and flag structural problems before touching code demonstrate the judgment that matters.

What the rubric actually measures

All three challenges share a dimension most interview rubrics lack: AI collaboration quality. Weighted at 15-20% of the total score, it evaluates whether the candidate guides the AI or follows it, whether they verify AI output against their own mental model, and whether they catch when the AI confidently produces the wrong answer.

Architecture and design decisions carry the highest weight (25%), because those are the choices AI cannot make for you. The AI generates code. You decide what code to generate.

The passing score is 0.70 across all dimensions. A candidate who writes perfect code but delegates every decision to the AI will not pass. A candidate who makes strong architectural choices and uses AI effectively to implement them will pass, even if the code has rough edges.
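A sketch of how that aggregation might work. Only the 25% architecture weight and the 15-20% AI collaboration band come from the rubric above; the remaining dimensions and weights are placeholders:

```csharp
using System;
using System.Linq;

public static class Rubric
{
    // Architecture (0.25) and AI collaboration (0.20) reflect the rubric;
    // the other dimensions and weights are illustrative assumptions.
    private static readonly (string Name, double Weight)[] Dimensions =
    {
        ("Architecture and design", 0.25),
        ("AI collaboration quality", 0.20),
        ("Correctness", 0.20),
        ("Code quality", 0.20),
        ("Communication", 0.15),
    };

    public static bool Passes(double[] scores) // one score in [0, 1] per dimension
    {
        var weighted = Dimensions.Zip(scores, (d, s) => d.Weight * s).Sum();
        return weighted >= 0.70;
    }
}
```

Under these placeholder weights, a perfect score on everything except architecture and AI collaboration tops out at 0.55, well short of the bar.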

This is the interview I wish existed when I started asking “walk me through the last time AI gave you a wrong answer” and getting blank stares.