Reducing AI Code Debt: A Human-Supervised PDCA Framework for Sustainable Development

The race to adopt AI code generation is a sustainability crisis that Agile practitioners are uniquely positioned to solve. As more production code is written by agents, research shows increased computational waste, decreased human accountability, and declining code quality that threatens the sustainability of software systems.

GitClear’s longitudinal analysis of 211 million lines of code shows a 10x increase in duplicated code blocks during 2024, creating maintenance waste: 6.66% of commits now contain significant duplication, compared to 0.70% in 2020. Meanwhile, “moved” code – a strong signal of refactoring and code reuse – has dropped from 24.8% of changes in 2021 to 9.5% in 2024. Tellingly, “2024 marked the first year GitClear has ever measured where the number of Copy/Pasted lines exceeded the count of Moved lines”.

A productivity paradox

These patterns create a troubling productivity paradox. While individuals may feel more productive, the system-level impacts tell a different story. Duplicated code blocks lead to higher defect rates and more complex resolutions: 17% of cloned code contains bugs (Wagner et al.), and 18.42% of cloned code containing bugs propagates those bugs into other copies (Mondal et al.).

Google’s DORA 2024 research found that every 25% increase in AI adoption correlates with a 7.2% decrease in delivery stability and a 1.5% decrease in throughput. Research shows individual developers have widely different results depending on experience and task complexity. Less experienced developers working on simpler tasks show the greatest productivity gains but also increased quality issues. Experienced developers working on complex tasks show minimal gains. METR performed a controlled study with 16 experienced developers working in familiar codebases on real-world tasks and found they were 19% slower when using AI tools, contradicting both expert predictions and their own expectations.

Economic and ecological waste

Inference costs and code quality are already serious sustainability challenges. A University of Washington professor estimates that daily inferences against ChatGPT alone consume 1 GWh of energy, equivalent to the usage of 33,000 US households. In 2022, CISQ estimated the cost of accumulated technical debt in US software at $1.52 trillion. AI is a hoped-for solution to technical debt, but it is poised to make the problem materially worse while wasting large quantities of energy.

The solution is neither heroic developers nor magic automatons

The evidence suggests declining quality is not just an artifact of the early state of tooling but reflects process gaps in how we use AI. GitClear finds, “The current unstructured approach to interacting with agents optimizes for immediate output while undermining long-term maintainability… The evidence points beyond tool limitations to process gaps… This trend to under-supervised code generation and disempowered human oversight will create technical debt at unprecedented scale.”

So, the productivity paradox won’t be solved by expecting developers to move faster, get out of the way, and be more careful. It won’t be solved by the imminent release of autonomous agents that can solve novel problems inexpensively, consistently with perfect context and no defects. We need to rethink our AI-assisted coding practices.

As an Agile community, we have already learned that rapid delivery of quality code takes systematic practices. We have evolved such practices to align our work to business value and positive impact (Dual Track), to rein in code excess (TDD), and to foster continuous learning (Retrospection). In fact, we have a whole library of alternative tools and frameworks that help us respond to change, value individuals and interactions, prioritize customer collaboration, and deliver working software.

We must adapt our practices to AI adoption to shrink development cycles, reduce waste, and keep humans engaged, empowered, and accountable. We need to apply Agile analysis, planning, and validation to guide our moment-to-moment interactions with AI agents and tools, and employ continuous improvement through near-instant feedback loops to manage their rapidly changing and poorly understood capabilities.

Focusing on code generation in isolation plays into the assumption that it is the primary constraint in the software development life cycle. This PDCA process begins at the point a developer pulls work, which is significantly earlier in the life cycle. Research published by IEEE, tracking 78 professional developers across 3,148 working hours, found that developers spend 58% of their time on program comprehension rather than writing code.

Human-supervised code generation in a Plan-Do-Check-Act cycle

Here is a disciplined Plan-Do-Check-Act (PDCA) cycle. This cycle occurs in the code generation session itself as a nested loop within whatever team and project practices an organization uses. The full PDCA cycle can take as little as one hour, ideally no more than three.

Each step has a distinct purpose and builds upon the prior ones.

Plan (7-15 min): Use the agent’s ability to analyze the entire codebase to reduce architectural drift and scope creep, then create a detailed execution strategy with explicit checkpoints.

Do (30 min – 2.5 hrs): Generate code following structured prompts and active human oversight to minimize over-complication, untested behavior, and excessive code production.

Check (2-5 min): Validate completeness against an explicit definition of done to ensure code meets intended outcomes and quality standards.

Act (5-10 min): Retrospective to identify collaboration patterns and process improvements for future human-AI interactions.
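As a quick sanity check on these time boxes, the sketch below adds up the phase ranges listed above and flags a phase that overruns its upper bound. The function names and the idea of logging elapsed minutes per phase are illustrative assumptions, not part of the framework itself.

```python
# Phase time boxes in minutes, taken from the ranges listed above.
PHASE_BUDGETS = {
    "plan": (7, 15),
    "do": (30, 150),
    "check": (2, 5),
    "act": (5, 10),
}


def cycle_bounds(budgets):
    """Return (shortest, longest) total cycle time in minutes."""
    shortest = sum(low for low, _ in budgets.values())
    longest = sum(high for _, high in budgets.values())
    return shortest, longest


def over_budget(elapsed_minutes, budgets):
    """List the phases whose elapsed time exceeds the upper bound."""
    return [
        phase
        for phase, minutes in elapsed_minutes.items()
        if minutes > budgets[phase][1]
    ]


shortest, longest = cycle_bounds(PHASE_BUDGETS)
print(shortest, longest)  # 44 180
```

The totals come to roughly 45 minutes at the short end and three hours at the long end, which lines up with the one-to-three-hour target for a full cycle.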

The developer further commits to a set of working agreements that define specific behaviors they will use to remain collaborative during the session and engage their experience and critical thinking.

Working agreements

Working agreements are commitments we, the human operators, hold ourselves accountable to in our interaction with the coding agents. While they should remain small in number, they capture our fundamental beliefs about which behaviors in AI code generation will produce high-quality, valuable code.

Example working agreements

  • I commit to enforcing “One focused change at a time. One failing test at a time, no exceptions.” I will stop the AI when I observe it attempting to fix multiple things simultaneously or making changes that go beyond the current failing test.
  • I commit to “Respect existing architecture: Work within established patterns. No dramatic approach changes unless requested.” I will interrupt when the AI proposes solutions that diverge from established codebase patterns without explicit justification and approval.
  • I commit to “Explicitly establish: methodology, scope boundaries, and intervention rights before coding begins.” I will not allow any implementation to start until we have agreed on the testing approach, defined the scope of work, and confirmed my authority to interrupt process violations.

Plan: Analyze the problem and plan the execution

Analysis (5-10 min)

Ask the agent to examine the codebase in relation to a business outcome. The AI assembles context from existing code as it relates to the goal and suggests alternate approaches.

Human’s commitment: Define an achievable value-based objective small enough to complete in a three-hour window. For example, “We are reducing patient intake errors by implementing real-time validation of insurance ID and policy coverage so that patients can begin treatment immediately without coverage verification delays”. Evaluate the alternative approaches proposed by the agent and select one.

Example prompt

I need to do a high level design brainstorm. 

The overall goal is to [describe the overall goal as best I understand it. High-level design considerations, questions, concerns]

**Analysis needed:**

– Understand the problem and its scope

– Explore different approaches or solutions

– Identify potential challenges, dependencies, or unknowns

– Consider architectural implications or patterns

– Assess complexity and effort (rough estimate)

– Note any assumptions or clarifications needed

**Output:** Provide a terse and clear understanding of the problem and recommended high level alternative approaches. Keep it at a human readable length and level of detail.

Focus on the “what” and “why” at a high level – we’ll do more detailed analysis and planning in later phases.

Detailed planning (2-5 min)

This interaction creates explicit checkpoints to encourage discrete, test validated increments that progress towards the desired outcome.

Human’s commitment: Scan and understand the plan enough to hold the agent accountable for following it and intervene in the code generation if the plan needs to change based on new information.

Example detailed planning prompt

**Planning Phase**

Based on our analysis, provide a coherent plan incorporating our refinements that is optimized for your use as context for the implementation:

**Integration Strategy:**

– Map end-to-end data flow and all touch points

– Identify required changes to existing methods/interfaces

… Additional strategy points

**Testing Strategy:** 

Test Drive atomic changes to production code using red/green strategy

 – Break the work into atomic, incremental changes

 – For each task or batch: build one or more failing tests that drive a code change

… Additional testing strategy points

**Specific Architectural Concerns:**

… largely surfaced through retrospection

**Create actionable plan with:**

– Numbered implementation steps (small, testable increments)

– ONE file/component per step when possible

… Additional criteria for an actionable plan

**Process Checkpoints:**

– Verify adherence to chosen testing strategy

– Each step: Confirm appropriate test coverage exists

… Additional guardrails, i.e. stop conditions, intervention triggers

This is an abbreviated example; my current planning prompt is 2,023 characters following this outline.

Do: Code generation (30 min – 2.5 hrs)

Instruct the agent to begin code generation, including implementation guidelines formatted as a checklist that both the agent and developer can track. This reinforces process discipline to minimize over-complication, unintended or untested behavior, and excessive code production.

Human’s commitment: Follow the agent’s internal conversation as it proceeds. Intervene and ask questions as soon and as often as needed. This reduces the number of wasted tokens polluting the thread as well as the amount of code that needs to be reverted. Interventions can be questions of clarification or corrections: “Did you write a meaningful test?” “Is the test failing for the right reasons?” “Are you using the existing helper class we identified?” “Does this follow the repository pattern we discussed?”

Watch for signs of context drift – when a model’s responses lose coherence as it struggles to maintain relevant parts of the conversation thread in its context window. You’ll see it go off on a tangent, duplicate code, make unrelated changes, or fail to honor the testing guidance. The best way to avoid drift is to work in smaller scopes, in cycles of hours, not days. To recover from context drift, remind the agent what it was doing, ask it to reiterate and update the plan, then resubmit the implementation guidance.

Example prompt

**TDD Implementation – Step [X]**

**🚨 TDD DISCIPLINE CHECK 🚨**

– [ ] Have I written a FAILING test first? (RED phase mandatory)

– [ ] Am I implementing ONLY enough to make the test pass? (GREEN phase)

– [ ] Is this test simple enough? (Complex scenarios → simplify first)

**AI Command Transparency:**

– [ ] Show exact test commands used before reporting results

– [ ] Document test scope: “Testing [specific area]”

…continued.

My current template covers TDD, command transparency, test complexity, and a summarized definition of done (see below). It is 1,008 characters.
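To make “one failing test at a time” concrete, here is a minimal red/green sketch built on the insurance ID objective from the Plan example. The ID format (three uppercase letters followed by nine digits), the function name, and the test names are all invented for illustration; a real increment would take its rule from the actual business requirement.

```python
import re
import unittest

# Hypothetical rule, for illustration only:
# three uppercase letters followed by nine digits.
INSURANCE_ID_PATTERN = re.compile(r"[A-Z]{3}\d{9}")


def is_valid_insurance_id(value: str) -> bool:
    """GREEN: only enough implementation to satisfy the tests below."""
    return bool(INSURANCE_ID_PATTERN.fullmatch(value))


class InsuranceIdValidationTest(unittest.TestCase):
    # RED: in a red/green cycle, each test is written first and seen
    # failing before the production code above is added or changed.
    def test_accepts_well_formed_id(self):
        self.assertTrue(is_valid_insurance_id("ABC123456789"))

    def test_rejects_id_with_too_few_digits(self):
        self.assertFalse(is_valid_insurance_id("ABC123"))


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        InsuranceIdValidationTest
    )
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a project you would normally run this with `python -m unittest`. The point of the sketch is the size of the step: one behavior, one test, and no production code beyond what that test demands.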

Check: Validate completeness (2-5 min)

Define and enforce an explicit “definition of done” (Scrum Guide) that includes delivering the planned output to your personal and team quality standards. The agent checks its adherence to the analysis, plan, and implementation guidelines. The output of this check provides data for the retrospective.

Human’s commitment: Take accountability for whether the code will achieve the intended outcome as best you understand it and meets your professional standard of quality. This is your code.

Example completeness prompt

**Verification:**

– [ ] All tests passing

– [ ] No regressions introduced

…continued.

**Process Audit:**

– [ ] Testing approach followed consistently

– [ ] TDD discipline maintained

…continued.

**Status:** [Complete/Needs work]

**Outstanding items:** [any remaining tasks]

**Ready to close:** [Yes/No with reasoning]

My current prompt is a 10-item checklist of 679 characters.
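If you want to carry the result of the check into the retrospective, the checklist can be recorded as data. The sketch below is one possible shape, assuming each item is simply marked passed or not; the class and field names are hypothetical and only mirror the headings in the example prompt.

```python
from dataclasses import dataclass, field


@dataclass
class CompletenessCheck:
    # Each mapping goes from checklist item to whether it passed.
    verification: dict = field(default_factory=dict)
    process_audit: dict = field(default_factory=dict)

    def outstanding(self):
        """Return the items that have not passed, in checklist order."""
        items = {**self.verification, **self.process_audit}
        return [name for name, passed in items.items() if not passed]

    def status(self):
        """Mirror the prompt's Status line: Complete or Needs work."""
        return "Complete" if not self.outstanding() else "Needs work"


check = CompletenessCheck(
    verification={
        "All tests passing": True,
        "No regressions introduced": True,
    },
    process_audit={
        "Testing approach followed consistently": True,
        "TDD discipline maintained": False,
    },
)
print(check.status())       # Needs work
print(check.outstanding())  # ['TDD discipline maintained']
```

The human still makes the call on “Ready to close”. The record only makes it harder to wave through an item that did not pass.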

Act: Retrospective (5-10 min)

Use the agent to help brainstorm ways to improve the human / agent interaction. The agent performs a review of the full thread to identify meaningful events and collaboration patterns. It presents things that could have gone better and suggests things you can do to help that happen. The point is what you can do to better elicit the desired behavior from the LLM next time.

Human’s commitment: Review the results and choose one to three small, incremental changes to make to the existing prompts or to your own behavior. Make the changes and follow through on them, then evaluate whether they actually improve your output over the next few cycles.

Retrospective prompt

Let’s retrospect on our coding session. Please address whichever of these areas seem most relevant:

**Session Overview:** What was our main goal and scope?

**What Happened:** What approaches did we try? What worked smoothly vs. where did we struggle?

**Technical Insights:** What patterns do you notice in our collaboration? What would have accelerated progress?

**Process Insights:** 

– What worked well in our process?

– What process improvements for future work?

**Collaboration Analysis:**

– Where did process discipline break down (if it did)?

– How effective was the “process police” intervention?

– What communication patterns helped vs. hindered progress?

**AI Assistant Reflection:**

– What would you do differently next time? (insight generation)

– What practices should we continue/change? (your perspective)

– How can we maintain discipline while staying flexible? (your analysis)

**Key Takeaway:** What’s the most important thing we learned?

**Lessons Learned:** What are key insights for future similar work?

**Human Context Retention:** (You control these for next time)

– What prompting approaches should you continue/modify?

– What intervention timing worked best?

– How should you set up process expectations upfront?

– What “process police” tactics were most effective?

Follow up prompt

Based on these learnings, would you suggest changes to this interaction framework?

[paste in all or part of your template framework]

Only suggest targeted, specific and highly relevant changes.

A note on prompts

The prompt samples are real-world but provided as directional guidance. Each developer and team needs to tailor them for the specific model and version and their own practices and standards. That said, the PDCA cycle itself provides rapid feedback and incremental evolution of the prompts, aided by the model itself.

Measuring the value of a PDCA cycle

Activity metrics are leading indicators that we hope will drive the changes we want to achieve. Pursued as ends in themselves, they create perverse incentives that can result in the opposite of what we want, e.g. generated code percentage quotas that coerce developers into poor oversight, leading to bugs and rework. The Agile community’s response is to move beyond activity metrics to measure outcomes. So, while we measure activity in the short term, we need to reward based on longer-term measures that represent the actual customer/user behavior we want to change.

Google has documented their comprehensive metrics for measuring AI productivity. While few companies can implement their exact approach, we can adapt their principles.

Session-Based: Following Google’s approach, group development activities into work sessions to measure the full AI impact. Track the complete PDCA cycle rather than isolated coding speed metrics.

Qualitative and quantitative: The most important outcomes are often hard to measure. They are still the most important things so we measure them as best we can to ensure our perceived improvements reflect genuine gains rather than false signals.

Longitudinal tracking: Monitor metrics over a 6-12 month period to capture the full life cycle effect. This longer view reveals whether initial speed gains are real and sustainable or undermined over the long term.

Possible leading indicators: Predictive activities

Focus on code quality, AI practice adoption, team health and their correlation to faster delivery of useful, defect-free features, and desired customer behavior or improved customer sentiment.

Process health (immediately measurable):

  • PDCA cycle adherence: Track whether teams follow an analysis, planning, implementation, retrospective pattern through a checklist in story cards.
  • Working agreement compliance: Simple weekly team temperature check to see whether they followed the established AI collaboration agreements.

Code quality behaviors (Git-based using commercially available tools):

  • Commit size and scope: Track lines changed per PR and files modified per commit – sprawling changes can indicate AI-generated code accepted without proper decomposition.
  • Test-first discipline: Percentage of commits that include test changes before or alongside production code changes.
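Both of these Git-based indicators can be approximated without a commercial tool. The sketch below parses text in the shape produced by `git log --numstat --format="commit %H"` and reports lines changed, files touched, and the share of production-code commits that also change tests. The rule for recognizing a test file (a `tests` directory or a `test_` filename prefix) and the sample log are assumptions; adjust them to your repository’s conventions.

```python
from dataclasses import dataclass, field


def is_test_path(path):
    # Assumption: tests live in a "tests" directory or are named test_*.
    parts = path.split("/")
    return "tests" in parts[:-1] or parts[-1].startswith("test_")


@dataclass
class CommitStats:
    sha: str
    lines_changed: int = 0
    files: list = field(default_factory=list)

    @property
    def touches_tests(self):
        return any(is_test_path(p) for p in self.files)

    @property
    def touches_production(self):
        return any(not is_test_path(p) for p in self.files)


def parse_numstat(log_text):
    """Parse 'commit <sha>' headers followed by numstat lines."""
    commits = []
    for line in log_text.splitlines():
        if line.startswith("commit "):
            commits.append(CommitStats(sha=line.split()[1]))
        elif "\t" in line and commits:
            added, deleted, path = line.split("\t", 2)
            # Binary files report "-" for both counts; count zero lines.
            if added != "-":
                commits[-1].lines_changed += int(added) + int(deleted)
            commits[-1].files.append(path)
    return commits


def share_of_commits_with_tests(commits):
    """Fraction of production-code commits that also change tests."""
    production = [c for c in commits if c.touches_production]
    if not production:
        return 0.0
    return sum(c.touches_tests for c in production) / len(production)


# Hypothetical sample in the numstat format: added, deleted, path.
SAMPLE_LOG = "\n".join([
    "commit a1",
    "",
    "12\t3\tsrc/intake/validation.py",
    "20\t0\ttests/test_validation.py",
    "commit b2",
    "",
    "240\t15\tsrc/intake/forms.py",
    "180\t9\tsrc/intake/models.py",
])

commits = parse_numstat(SAMPLE_LOG)
for c in commits:
    print(c.sha, c.lines_changed, len(c.files), c.touches_tests)
print(share_of_commits_with_tests(commits))  # 0.5
```

This only captures tests changed alongside production code in the same commit. Detecting tests written before the production change would require comparing commit order within a branch, which this sketch does not attempt.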

Team health and individual engagement:

Teams conduct regular self-assessments focused on autonomy, understanding, learning, and process discipline. These stay internal to the team for continuous improvement rather than external evaluation. For specific templates, teams can adapt existing retrospective practices from Agile Retrospectives or team health models like Spotify’s Squad Health Check.

Possible lagging indicators: Outcome results

System health (measurable outcomes):

  • Code duplication trend: Use existing tools (SonarQube, CodeClimate) to detect growth of redundant code over time.
  • Bug escape rate: Count production bugs discovered after release vs. during development.
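A bug escape rate needs an agreed definition before it can be tracked. One common choice, assumed here, is the share of all bugs found in a period that were discovered in production. The quarterly counts below are made up for illustration.

```python
def bug_escape_rate(found_in_production, found_in_development):
    """Share of all bugs in a period that were found after release."""
    total = found_in_production + found_in_development
    if total == 0:
        return 0.0
    return found_in_production / total


# Hypothetical quarterly counts: (production bugs, development bugs).
quarters = {"Q1": (12, 28), "Q2": (9, 31), "Q3": (6, 34)}
trend = {
    quarter: round(bug_escape_rate(prod, dev), 3)
    for quarter, (prod, dev) in quarters.items()
}
print(trend)  # {'Q1': 0.3, 'Q2': 0.225, 'Q3': 0.15}
```

As a lagging indicator, the direction of the trend over several quarters matters more than any single value.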

Customer value delivery (outcome indicators):

  • Feature adoption rate: Percentage of shipped features that see meaningful user engagement within 30 days.
  • Customer problem resolution velocity: Time from identifying a system problem that impacts customers to deploying a solution.
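The two customer value indicators reduce to simple date arithmetic once the team has agreed on what counts as “meaningful engagement” and when a problem counts as “identified”. The sketch below assumes you can export those dates; the data shapes, function names, and sample values are hypothetical.

```python
from datetime import date, datetime, timedelta
from statistics import median

ADOPTION_WINDOW = timedelta(days=30)


def feature_adoption_rate(features):
    """features: list of (shipped_on, first_meaningful_use_on or None)."""
    if not features:
        return 0.0
    adopted = sum(
        1
        for shipped, first_use in features
        if first_use is not None and first_use - shipped <= ADOPTION_WINDOW
    )
    return adopted / len(features)


def median_resolution_hours(incidents):
    """incidents: list of (identified_at, deployed_at) datetimes."""
    hours = [
        (deployed - identified).total_seconds() / 3600
        for identified, deployed in incidents
    ]
    return median(hours)


# Hypothetical sample data.
features = [
    (date(2025, 1, 6), date(2025, 1, 20)),  # used after 14 days
    (date(2025, 1, 6), date(2025, 3, 1)),   # first use outside the window
    (date(2025, 2, 3), None),               # never used
    (date(2025, 2, 3), date(2025, 2, 10)),  # used after 7 days
]
incidents = [
    (datetime(2025, 3, 3, 9, 0), datetime(2025, 3, 3, 15, 0)),
    (datetime(2025, 3, 10, 9, 0), datetime(2025, 3, 11, 9, 0)),
    (datetime(2025, 3, 17, 9, 0), datetime(2025, 3, 17, 11, 0)),
]

print(feature_adoption_rate(features))     # 0.5
print(median_resolution_hours(incidents))  # 6.0
```

The median is used for resolution time so that a single long-running incident does not dominate the figure; a team could reasonably choose a percentile instead.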

Building our communal fluency

This framework requires community collaboration to validate its effectiveness. We need to collect data on whether PDCA cycles reduce short-term token usage and/or pay returns against long-term technical debt while improving customer outcomes.

Our Agile practices are rooted in values and principles. We are committed to upholding human dignity, delivering efficiently, and creating a beneficial long-term impact. Let’s champion a more sustainable way of working with AI that reduces harm and produces benefit for the largest possible set of stakeholders.


AI disclosure: This article was written through disciplined AI collaboration. The prompt examples are drawn from my actual prompts and are co-authored by Claude. I used Claude to identify and summarize research sources, after which I retrieved those sources and read relevant portions, using Claude to help me fact-check my interpretation of the data including a devil’s advocate review of my arguments. I used Claude to help me build my initial outline. I authored the research analysis, arguments, and framework description through multiple drafts. For each draft, I used Claude for copy editing and writing refinement during which I adopted suggestions proposed by the LLM making substantial changes of my own. Aside from portions of the prompts, all content decisions and final language remain my own.

This is a community blog post. The opinions contained within belong solely to the author or authors, and may not represent the opinion or policy of Agile Alliance.

Ken Judy

Ken Judy is Senior Partner at Stride Build, where he coaches technical leaders and serves as an advisor and lead on custom software projects. His focus is building collaborative, team-based organizations and the responsible use of generative AI to address business challenges. Ken lives in Brooklyn, NY.
