

Anthropic says AI is hurting developer skills in its new paper

The research is real but the headline is incomplete. Here is what the study actually found, what it missed, and what to do about it.

The headline that made me question everything

Anthropic’s new paper “How AI Impacts Skill Formation” revealed that developers using AI coding assistants scored 17 percentage points lower on comprehension and debugging tests.

As someone who advocates for AI in development, I decided to investigate whether this finding warranted changing my approach.

What the Anthropic study actually says

The research involved a randomized controlled trial where developers learned Python’s Trio asynchronous library. Half used AI assistants; half coded manually.

Key findings:

  • AI users scored 50% on comprehension quizzes versus 67% for manual coders
  • Debugging skills showed the largest performance gap
  • AI users finished only about 2 minutes sooner (not statistically significant)
  • Developers felt confident despite worse performance: the illusion of competence

The paper introduced the concept of a “debugging tax”: time spent prompting and verifying AI output that eliminates speed advantages while reducing comprehension.

Three AI patterns that preserved learning

The paper identified six distinct AI usage patterns. Three degraded learning; three preserved it.

Patterns that degraded learning (scoring under 40%):

  • AI Delegation: Full code generation without understanding
  • Progressive AI Reliance: Escalating from questions to complete delegation
  • Iterative AI Debugging: Pasting errors without reading them

Patterns that preserved learning (scoring 65-86%):

  • Conceptual Inquiry: “How does async/await work?” scored 86%
  • Generation-Then-Comprehension: Generate code, then request line-by-line explanations
  • Hybrid Code-Explanation: Request code paired with detailed explanations

The distinction is cognitive engagement versus cognitive offloading. Seven participants using conceptual inquiry dominated comprehension scores while still leveraging AI.
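The winning “How does async/await work?” question has a concrete, checkable answer. A minimal sketch of the concept, using Python’s standard-library asyncio rather than the study’s third-party Trio library (the function names here are illustrative, not from the paper):

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # `await` suspends this coroutine and hands control back to the
    # event loop, which can run other coroutines in the meantime.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> list[str]:
    # gather() runs both coroutines concurrently, so total wall time
    # is roughly max(delays), not their sum.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.2))

results = asyncio.run(main())
print(results)  # ['a done', 'b done']
```

A developer who can explain why the total runtime is ~0.2s, not ~0.3s, has the mental model the conceptual-inquiry group built; one who only pasted generated code typically cannot.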

Counter-evidence: research that complicates the narrative

Harvard’s 2025 AI Tutoring Study. AI tutoring produced learning gains over double those of traditional instruction. The key difference: pedagogically designed AI with scaffolding and cognitive load management, not raw ChatGPT access.

Toronto’s Study on Young Programmers (CHI 2023). Sixty-nine novices ages 10 to 17 using AI code generators completed 1.15x more tasks and scored 1.8x higher, with no decline in performance on manual tasks. Students with prior block-based programming experience showed the best retention.

Meta-Analysis (Wang et al., 2024). Forty-five independent studies found AI-enabled adaptive learning produced an overall effect of g = 0.70. Chatbots and generative AI showed g = 1.02, the highest impact among AI technology categories.

METR Study on Experienced Developers (July 2025). Sixteen developers with an average of five years’ experience were 19% slower with AI tools. The authors noted, however, that the findings do not imply AI systems are not useful: AI may provide more value in unfamiliar codebases or where documentation is sparse.

Limitations of the Anthropic study

Sample size concerns: 52 participants total, with only 4 developers in the 1-3 years of experience subgroup. One study, one library, one interface type.

Temporal constraints: One-hour learning window. 35-minute task completion. Immediate assessment with no long-term follow-up.

Ecological validity questions: Developers learning completely unfamiliar material (worst-case scenario). No measurement of usage pattern evolution. Chat interface effects may not generalize to IDE-integrated tools.

What cognitive psychology tells us

The Generation Effect. Meta-analyses across 86 studies show self-generated information is better retained than passively received information (effect size d = 0.40). When AI generates code, developers skip generative processing that strengthens memory.

Cognitive Load Theory. Technology reducing extraneous load (unnecessary burden) helps. Technology reducing germane load (beneficial effort toward understanding) undermines learning. Current AI tools do not distinguish between these categories.

Desirable Difficulties. Spacing, interleaving, and testing improve long-term retention more than massed practice. However, difficulty only remains “desirable” when learners have sufficient foundation.

Historical parallels: technology fears before

Calculators. Hembree and Dessart’s meta-analysis of 79 studies found calculator use combined with traditional instruction actually improved paper-and-pencil computation skills.

GPS Navigation. Dahmani and Bohbot’s Nature study found GPS experience correlated with worse spatial memory during self-guided navigation. However, turn-by-turn navigation disengages users completely. GPS requiring decision-making may preserve spatial learning.

The Google Effect. People show reduced recall for information they believe is accessible online, but preserved memory for where to find information. This represents a shift in memory strategy, not pure decline.

What I am actually changing

1. Manual-first periods for new tech stacks. The first week learning new technology is AI-free. Build the mental model manually, then accelerate with AI.

2. Code reviews that ask “explain why.” Beyond “Does it work?” ask developers to walk through flow, explain failures, and justify approach choices. If explanation is not possible without AI, the code does not ship.

3. Promoting “Explain this error” over “Fix this error.” Train teams on conceptual inquiry patterns. Ask what errors mean, why they occur, and common causes before requesting fixes.

4. Watching for the illusion of competence. Monitor signals: fast completion with vague explanations, confidence that crumbles under follow-up questions, and “it works but I am not sure why” patterns.

5. Preserving 20-30% manual coding. Maintain manual practice to preserve skills needed to evaluate AI output. You cannot debug code you do not understand or verify solutions when comprehension is outsourced.
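Point 3 above is easiest to see on a concrete bug. Here is the classic async mistake of calling a coroutine function without awaiting it, again in stdlib asyncio as a stand-in for the study’s Trio; asking what the error means teaches the coroutine model, while asking for a fix teaches nothing:

```python
import asyncio

async def total(xs: list[int]) -> int:
    await asyncio.sleep(0)  # stand-in for real async work
    return sum(xs)

async def main() -> int:
    result = total([1, 2, 3])  # bug: missing `await`
    # "Explain this" reveals that calling a coroutine function does not
    # run it; it returns a coroutine object that must be awaited.
    assert asyncio.iscoroutine(result)
    result.close()  # discard it cleanly to avoid a "never awaited" warning
    return await total([1, 2, 3])  # the fix: actually await the call

print(asyncio.run(main()))  # 6
```

A developer who understands why `result` was a coroutine object rather than `6` can now spot the same bug in any async codebase; one who only received the corrected line cannot.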

The real question is not “AI or no AI”

Calculators did not kill math skills. GPS measurably dulled spatial memory. Spell-check changed how we write.

AI will change how developers code. That is inevitable.

The question is whether this change is designed intentionally or allowed to happen.

The bottom line

What the paper gets right: Cognitive offloading during skill acquisition impairs learning. Debugging skills degrade most severely. The “productivity paradox” is real. Usage patterns matter enormously.

What remains uncertain: Whether skill differences persist long-term. How findings generalize beyond novices learning completely new material. Whether developers naturally adjust usage patterns over time.

What to do about it: Build foundations before accelerating with AI. Use AI for understanding, not just execution. Review for comprehension, not just functionality. Maintain manual practice to preserve evaluation skills. Watch for confident ignorance, the most dangerous pattern.

The goal is not less AI, but smarter adoption that builds capability instead of dependency.