Reducing AI Code Debt: A Human-Supervised PDCA Framework for Sustainable Development

The race to adopt AI code generation is a sustainability crisis that Agile practitioners are uniquely positioned to solve. As more production code is written by agents, research shows increased computational waste, decreased human accountability, and declining code quality that threatens the sustainability of software systems.

GitClear’s longitudinal analysis of 211 million lines of code shows a 10x increase in duplicated code blocks during 2024, creating maintenance waste: 6.66% of commits now contain significant duplication, compared to 0.70% in 2020. Meanwhile, “moved” code – a strong signal of refactoring and code reuse – has dropped from 24.8% of changes in 2021 to 9.5% in 2024. Tellingly, “2024 marked the first year GitClear has ever measured where the number of Copy/Pasted lines exceeded the count of Moved lines.”

A productivity paradox

These patterns create a troubling productivity paradox. While individuals may feel more productive, the system-level impacts tell a different story. Duplicated code blocks lead to higher defect rates and more complex resolutions: 17% of cloned code contains bugs (Wagner et al.), and 18.42% of cloned code containing bugs propagates those bugs into other copies (Mondal et al.).

Google’s DORA 2024 research found that every 25% increase in AI adoption correlates with a 7.2% decrease in delivery stability and a 1.5% decrease in throughput. Research also shows that individual developers get widely different results depending on experience and task complexity. Less experienced developers working on simpler tasks show the greatest productivity gains but also increased quality issues. Experienced developers working on complex tasks show minimal gains. METR performed a controlled study with 16 experienced developers working in familiar codebases on real-world tasks and found they were 19% slower when using AI tools, contradicting both expert predictions and their own expectations.

Economic and ecologic waste

Inference costs and code quality are already serious sustainability challenges. A University of Washington professor estimates that daily inference against ChatGPT alone consumes 1 GWh of energy, equivalent to the daily usage of 33,000 US households. In 2022, CISQ estimated the cost of accumulated technical debt in US software at $1.52 trillion. AI is a hoped-for solution to technical debt, but it is poised to make the problem materially worse while wasting large quantities of energy.

The solution is neither heroic developers nor magic automatons

The evidence suggests declining quality is not just an artifact of the early state of tooling but reflects process gaps in how we use AI. GitClear finds, “The current unstructured approach to interacting with agents optimizes for immediate output while undermining long-term maintainability… The evidence points beyond tool limitations to process gaps… This trend to under-supervised code generation and disempowered human oversight will create technical debt at unprecedented scale.”

So, the productivity paradox won’t be solved by expecting developers to move faster, get out of the way, and be more careful. It won’t be solved by the imminent release of autonomous agents that can solve novel problems inexpensively, consistently with perfect context and no defects. We need to rethink our AI-assisted coding practices.

As an Agile community, we have already learned that rapid delivery of quality code takes systematic practices. We have evolved such practices to align our work to business value and positive impact (Dual Track), to rein in code excess (TDD), and to foster continuous learning (Retrospection). In fact, we have a whole library of alternative tools and frameworks that help us respond to change, value individuals and interactions, prioritize customer collaboration, and deliver working software.

We must adapt our practices to AI adoption to shrink development cycles, reduce waste, and keep humans engaged, empowered, and accountable. We need to apply Agile analysis, planning, and validation to guide our moment-to-moment interactions with AI agents and tools, and employ continuous improvement through near-instant feedback loops to manage their rapidly changing and poorly understood capabilities.

Focusing on code generation in isolation plays into the assumption that it is the primary constraint in the software development life cycle. This PDCA process begins at the point a developer pulls work, which is significantly earlier in the life cycle. IEEE-published research tracking 78 professional developers across 3,148 working hours found that developers spend 58% of their time on program comprehension rather than on writing code.

Human supervised code generation in a Plan-Do-Check-Act cycle

Here is a disciplined Plan-Do-Check-Act (PDCA) cycle. This cycle occurs in the code generation session itself as a nested loop within whatever team and project practices an organization uses. The full PDCA cycle can take as little as one hour, ideally no more than three.

Each step has a distinct purpose and builds upon the prior ones.

Plan (7-15 min): Use the agent’s ability to analyze the entire codebase to reduce architectural drift and scope creep, then create a detailed execution strategy with explicit checkpoints.

Do (30 min – 2.5 hrs): Generate code following structured prompts and active human oversight to minimize over-complication, untested behavior, and excessive code production.

Check (2-5 min): Validate completeness against an explicit definition of done to ensure code meets intended outcomes and quality standards.

Act (5-10 min): Retrospective to identify collaboration patterns and process improvements for future human-AI interactions.
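
As a quick check on the time boxes above, the phase ranges can be added up. This is a minimal sketch for illustration; the `Phase` class and `cycle_bounds` function are hypothetical names, not part of the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One PDCA phase with its time box in minutes."""
    name: str
    min_minutes: int
    max_minutes: int


# Time boxes as stated in the list above; the structure is illustrative.
PHASES = [
    Phase("Plan", 7, 15),
    Phase("Do", 30, 150),
    Phase("Check", 2, 5),
    Phase("Act", 5, 10),
]


def cycle_bounds(phases):
    """Return the (shortest, longest) total cycle length in minutes."""
    shortest = sum(p.min_minutes for p in phases)
    longest = sum(p.max_minutes for p in phases)
    return shortest, longest


shortest, longest = cycle_bounds(PHASES)
print(f"Full cycle: {shortest} to {longest} minutes")
```

The upper bounds add up to exactly three hours, and the lower bounds to a little under one hour, which is consistent with the cycle length described above.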

The developer further commits to a set of working agreements that define specific behaviors they will use to remain collaborative during the session and engage their experience and critical thinking.

Working agreements

Working agreements are commitments we, the human operators, hold ourselves accountable to in our interaction with the coding agents. While they should remain small in number, they capture our fundamental beliefs about which behaviors in AI code generation will produce high-quality, valuable code.

Example working agreements

I commit to enforcing “One focused change at a time. One failing test at a time, no exceptions.” I will stop the AI when I observe it attempting to fix multiple things simultaneously or making changes that go beyond the current failing test.
I commit to “Respect existing architecture: Work within established patterns. No dramatic approach changes unless requested.” I will interrupt when the AI proposes solutions that diverge from established codebase patterns without explicit justification and approval.
I commit to “Explicitly establish: methodology, scope boundaries, and intervention rights before coding begins.” I will not allow any implementation to start until we have agreed on the testing approach, defined the scope of work, and confirmed my authority to interrupt process violations.

Plan: Analyze the problem and plan the execution

Analysis (5-10 min)

Ask the agent to examine the codebase in relation to a business outcome. The AI assembles context from existing code as it relates to the goal and suggests alternate approaches.

Human’s commitment: Define an achievable value-based objective small enough to complete in a three-hour window. For example, “We are reducing patient intake errors by implementing real-time validation of insurance ID and policy coverage so that patients can begin treatment immediately without coverage verification delays.” Evaluate the alternative approaches proposed by the agent and select one.

Example prompt

I need to do a high level design brainstorm.

The overall goal is to [describe the overall goal as best I understand it. High-level design considerations, questions, concerns]

**Analysis needed:**

– Understand the problem and its scope

– Explore different approaches or solutions

– Identify potential challenges, dependencies, or unknowns

– Consider architectural implications or patterns

– Assess complexity and effort (rough estimate)

– Note any assumptions or clarifications needed

**Output:** Provide a terse and clear understanding of the problem and recommended high level alternative approaches. Keep it at a human readable length and level of detail.

Focus on the “what” and “why” at a high level – we’ll do more detailed analysis and planning in later phases.

Detailed planning (2-5 min)

This interaction creates explicit checkpoints to encourage discrete, test validated increments that progress towards the desired outcome.

Human’s commitment: Scan and understand the plan enough to hold the agent accountable for following it and intervene in the code generation if the plan needs to change based on new information.

Example detailed planning prompt

**Planning Phase**

Based on our analysis, provide a coherent plan incorporating our refinements that is optimized for your use as context for the implementation:

**Integration Strategy:**

– Map end-to-end data flow and all touch points

– Identify required changes to existing methods/interfaces

… Additional strategy points

**Testing Strategy:**

Test Drive atomic changes to production code using red/green strategy

– Break the work into atomic, incremental changes

– For each task or batch: build one or more failing tests that drive a code change

… Additional testing strategy points

**Specific Architectural Concerns:**

… largely surfaced through retrospection

**Create actionable plan with:**

– Numbered implementation steps (small, testable increments)

– ONE file/component per step when possible

… Additional criteria for an actionable plan

**Process Checkpoints:**

– Verify adherence to chosen testing strategy

– Each step: Confirm appropriate test coverage exists

… Additional guardrails, i.e. stop conditions, intervention triggers

This is an abbreviated example; my current planning prompt is 2,023 characters following this outline.

Do: Code Generation (30 min to 2-1/2 hrs)

Instruct the agent to begin code generation, including implementation guidelines formatted as a checklist that both the agent and developer can track. This reinforces process discipline to minimize over-complication, unintended or untested behavior, and excessive code production.

Human’s commitment: Follow the agent’s internal conversation as it proceeds. Intervene and ask questions as soon and as often as needed. This reduces the number of wasted tokens polluting the thread as well as the amount of code that needs to be reverted. Interventions can be questions of clarification or corrections: “Did you write a meaningful test?” “Is the test failing for the right reasons?” “Are you using the existing helper class we identified?” “Does this follow the repository pattern we discussed?”

Watch for signs of context drift – when a model’s responses lose coherence as it struggles to maintain relevant parts of the conversation thread in its context window. You’ll see it go off on a tangent, duplicate code, make unrelated changes, or fail to honor the testing guidance. The best way to avoid drift is to work in smaller scopes of work, cycles of hours not days. But to recover from context drift, remind the agent what it was doing, ask it to reiterate and update the plan, then resubmit the implementation guidance.

Example prompt

**TDD Implementation – Step [X]**

**🚨 TDD DISCIPLINE CHECK 🚨**

– [ ] Have I written a FAILING test first? (RED phase mandatory)

– [ ] Am I implementing ONLY enough to make the test pass? (GREEN phase)

– [ ] Is this test simple enough? (Complex scenarios → simplify first)

**AI Command Transparency:**

– [ ] Show exact test commands used before reporting results

– [ ] Document test scope: “Testing [specific area]”

…continued.

My current template covers TDD, command transparency, test complexity, and a summarized definition of done (see below). It is 1,008 characters.

Check: Validate completeness (2-5 min)

Define and enforce an explicit “definition of done” (Scrum Guide) that includes delivering the planned output to your personal and team quality standards. The agent checks its adherence to the analysis, plan, and implementation guidelines. The output of this check provides data for the retrospective.

Human’s commitment: Take accountability for whether the code will achieve the intended outcome as best you understand it and meets your professional standard of quality. This is your code.

Example completeness prompt

**Verification:**

– [ ] All tests passing

– [ ] No regressions introduced

…continued.

**Process Audit:**

– [ ] Testing approach followed consistently

– [ ] TDD discipline maintained

…continued.

**Status:** [Complete/Needs work]

**Outstanding items:** [any remaining tasks]

**Ready to close:** [Yes/No with reasoning]

My current prompt is a 10-item checklist of 679 characters.
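
Because the output of the check feeds the retrospective, it can help to record it as data rather than leave it in the chat thread. The following is a minimal sketch under that assumption; `DoneCheck` and its fields are hypothetical and are not the author's tooling. The checklist items are taken from the abbreviated prompt above.

```python
from dataclasses import dataclass, field


@dataclass
class DoneCheck:
    """Hypothetical record of one Check-phase result."""
    items: dict = field(default_factory=dict)  # checklist item -> passed?

    def outstanding(self):
        """List the checklist items that have not passed."""
        return [name for name, passed in self.items.items() if not passed]

    def status(self):
        """'Complete' only when the checklist is non-empty and every item passed."""
        if self.items and not self.outstanding():
            return "Complete"
        return "Needs work"


check = DoneCheck(items={
    "All tests passing": True,
    "No regressions introduced": True,
    "Testing approach followed consistently": True,
    "TDD discipline maintained": False,
})

print(f"Status: {check.status()}")
print(f"Outstanding items: {check.outstanding()}")
```

An empty checklist is deliberately reported as "Needs work" so that a skipped check cannot be mistaken for a passed one. The decision on whether the work is ready to close still rests with the human.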

Act: Retrospective (5-10 min)

Use the agent to help brainstorm ways to improve the human / agent interaction. The agent performs a review of the full thread to identify meaningful events and collaboration patterns. It presents things that could have gone better and suggests things you can do to help that happen. The point is what you can do to better elicit the desired behavior from the LLM next time.

Human’s commitment: Review the results and determine no more than 1-3 small, incremental changes you might want to make to the existing prompts or your behavior. Make the changes and follow through on them, then evaluate if they actually improve your output over the next few cycles.

Retrospective prompt

Let’s retrospect on our coding session. Please address whichever of these areas seem most relevant:

**Session Overview:** What was our main goal and scope?

**What Happened:** What approaches did we try? What worked smoothly vs. where did we struggle?

**Technical Insights:** What patterns do you notice in our collaboration? What would have accelerated progress?

**Process Insights:**

– What worked well in our process?

– What process improvements for future work?

**Collaboration Analysis:**

– Where did process discipline break down (if it did)?

– How effective was the “process police” intervention?

– What communication patterns helped vs. hindered progress?

**AI Assistant Reflection:**

– What would you do differently next time? (insight generation)

– What practices should we continue/change? (your perspective)

– How can we maintain discipline while staying flexible? (your analysis)

**Key Takeaway:** What’s the most important thing we learned?

**Lessons Learned:** What are key insights for future similar work?

**Human Context Retention:** (You control these for next time)

– What prompting approaches should you continue/modify?

– What intervention timing worked best?

– How should you set up process expectations upfront?

– What “process police” tactics were most effective?

Follow up prompt

Based on learnings, would you suggest changes to this interaction framework?

[paste in all or part of your template framework]

Only suggest targeted, specific and highly relevant changes.

A note on prompts

The prompt samples are real world but provided as directional guidance. Each developer and team needs to tailor them for the specific model and version and their own practices and standards. That said, the PDCA cycle itself provides rapid feedback and incremental evolution of the prompts aided by the model itself.
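
One way to support that incremental evolution is to keep the prompts as versioned templates, so that each small change from a retrospective shows up as a diff. This is a sketch only, using Python's standard `string.Template`. The template text is abbreviated from the analysis prompt above, and `ANALYSIS_PROMPT` and `build_prompt` are hypothetical names.

```python
from string import Template

# Abbreviated from the analysis prompt; $goal replaces the bracketed placeholder.
ANALYSIS_PROMPT = Template(
    "I need to do a high level design brainstorm.\n\n"
    "The overall goal is to $goal\n\n"
    "**Output:** Provide a terse and clear understanding of the problem "
    "and recommended high level alternative approaches."
)


def build_prompt(template, **fields):
    """Fill a template; substitute() raises KeyError if a placeholder is missing."""
    return template.substitute(**fields)


prompt = build_prompt(
    ANALYSIS_PROMPT,
    goal="reduce patient intake errors with real-time insurance validation.",
)
print(prompt)
print(f"({len(prompt)} characters)")
```

Printing the character count is a small nod to keeping prompts short; the lengths quoted for the author's own prompts suggest that brevity is intentional.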

Measuring the value of a PDCA cycle

Activity metrics are leading indicators that we hope will drive the changes we want to achieve. Pursued as ends in themselves, they create perverse incentives that can result in the opposite of what we want, e.g. generated-code percentage quotas that coerce developers into poor oversight, leading to bugs and rework. The Agile community’s response is to move beyond activity metrics to measure outcomes. So, while we measure activity in the short term, we need to reward based on longer-term measures that represent the actual customer/user behavior we want to change.

Google has documented their comprehensive metrics for measuring AI productivity. While few companies can implement their exact approach, we can adapt their principles.

Session-Based: Following Google’s approach, group development activities into work sessions to measure the full AI impact. Track the complete PDCA cycle rather than isolated coding speed metrics.

Qualitative and quantitative: The most important outcomes are often hard to measure. They are still the most important things so we measure them as best we can to ensure our perceived improvements reflect genuine gains rather than false signals.

Longitudinal tracking: Monitor metrics over a 6-12 month period to capture the full life cycle effect. This longer view reveals whether initial speed gains are real and sustainable or undermined over the long term.

Possible leading indicators: Predictive activities

Focus on code quality, AI practice adoption, team health and their correlation to faster delivery of useful, defect-free features, and desired customer behavior or improved customer sentiment.

Process health (immediately measurable):

PDCA cycle adherence: Track whether teams follow an analysis, planning, implementation, retrospective pattern through a checklist in story cards.
Working agreement compliance: Simple weekly team temperature check to see whether they followed the established AI collaboration agreements.
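
As an illustration of the adherence metric, the sketch below computes the share of story cards whose checklist has every step ticked. The card structure is a hypothetical export format; real trackers will differ.

```python
# Steps named in the adherence metric above.
PDCA_STEPS = ("analysis", "planning", "implementation", "retrospective")


def adherence_rate(cards):
    """Fraction of story cards whose checklist has every PDCA step ticked."""
    if not cards:
        return 0.0
    complete = sum(
        1 for card in cards
        if all(card.get("checklist", {}).get(step, False) for step in PDCA_STEPS)
    )
    return complete / len(cards)


# Hypothetical story-card data for illustration only.
cards = [
    {"id": "CARD-1", "checklist": {"analysis": True, "planning": True,
                                   "implementation": True, "retrospective": True}},
    {"id": "CARD-2", "checklist": {"analysis": True, "planning": False,
                                   "implementation": True, "retrospective": True}},
    {"id": "CARD-3", "checklist": {"analysis": True, "planning": True,
                                   "implementation": True}},
    {"id": "CARD-4", "checklist": {"analysis": True, "planning": True,
                                   "implementation": True, "retrospective": True}},
]

print(f"PDCA adherence: {adherence_rate(cards):.0%}")
```

A missing step counts the same as an unticked one, so a card with no retrospective entry is treated as non-adherent.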

Code quality behaviors (Git-based using commercially available tools):

Commit size and scope: Track lines changed per PR and files modified per commit – sprawling changes can indicate AI-generated code accepted without proper decomposition.
Test-first discipline: Percentage of commits that include test changes before or alongside production code changes.
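
Teams without a commercial tool can approximate both behaviors from Git history. The sketch below parses the text produced by `git log --numstat --format=commit:%H` and reports average commit size and the share of code commits that also touch tests. The function names and the test-path heuristic are assumptions, and the sketch only detects tests committed alongside production code, not tests written before it.

```python
def parse_numstat(log_text):
    """Parse `git log --numstat --format=commit:%H` output into per-commit stats."""
    commits = []
    current = None
    for line in log_text.splitlines():
        if line.startswith("commit:"):
            current = {"sha": line[len("commit:"):], "files": []}
            commits.append(current)
        elif line.strip() and current is not None:
            parts = line.split("\t")
            if len(parts) != 3:
                continue
            added, deleted, path = parts
            # Binary files report "-" for both counts; treat them as zero lines.
            changed = (int(added) if added.isdigit() else 0) + (
                int(deleted) if deleted.isdigit() else 0)
            current["files"].append((path, changed))
    return commits


def is_test_path(path):
    """Assumed convention: a directory or file name that starts with 'test'."""
    return any(part.startswith("test") for part in path.lower().split("/"))


def summarize(commits):
    """Average lines and files per commit, plus share of code commits with tests."""
    code_commits = [c for c in commits
                    if any(not is_test_path(p) for p, _ in c["files"])]
    with_tests = [c for c in code_commits
                  if any(is_test_path(p) for p, _ in c["files"])]
    count = len(commits) or 1
    return {
        "avg_lines": sum(n for c in commits for _, n in c["files"]) / count,
        "avg_files": sum(len(c["files"]) for c in commits) / count,
        "test_alongside_rate": (len(with_tests) / len(code_commits)
                                if code_commits else 0.0),
    }


# Hypothetical log excerpt for illustration only.
SAMPLE = (
    "commit:a1\n\n"
    "10\t2\tsrc/intake.py\n"
    "8\t0\ttests/test_intake.py\n\n"
    "commit:b2\n\n"
    "120\t40\tsrc/billing.py\n"
    "30\t10\tsrc/claims.py\n"
)

print(summarize(parse_numstat(SAMPLE)))
```

In the sample, the second commit is both larger and untested, which is the pattern these indicators are meant to surface. The thresholds for what counts as "sprawling" are left to each team.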

Team health and individual engagement:

Teams conduct regular self-assessments focused on autonomy, understanding, learning, and process discipline. These stay internal to the team for continuous improvement rather than external evaluation. For specific templates, teams can adapt existing retrospective practices from Agile Retrospectives or team health models like Spotify’s Squad Health Check.

Possible lagging indicators: Outcome results

System health (measurable outcomes):

Code duplication trend: Use existing tools (SonarQube, CodeClimate) to detect growth of redundant code over time.
Bug escape rate: Count production bugs discovered after release vs. during development.
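
Both system health indicators reduce to simple arithmetic once the counts are available. The sketch below assumes one common definition of escape rate (production bugs divided by all known bugs) and a plain average month-over-month change for duplication; the function names and sample numbers are hypothetical.

```python
def bug_escape_rate(found_in_production, found_in_development):
    """Share of all known bugs that were discovered after release."""
    total = found_in_production + found_in_development
    return found_in_production / total if total else 0.0


def average_monthly_change(duplication_pct):
    """Average month-over-month change in duplicated code, in percentage points."""
    if len(duplication_pct) < 2:
        return 0.0
    return (duplication_pct[-1] - duplication_pct[0]) / (len(duplication_pct) - 1)


# Hypothetical numbers for illustration only.
escaped, caught = 4, 16
monthly_duplication = [3.0, 3.5, 4.5, 6.0]  # e.g. exported from a static analysis tool

print(f"Bug escape rate: {bug_escape_rate(escaped, caught):.0%}")
print(f"Duplication trend: {average_monthly_change(monthly_duplication):+.1f} points/month")
```

A single reading of either number says little. The value comes from watching the direction over the longer tracking period the article recommends.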

Customer value delivery (outcome indicators):

Feature adoption rate: Percentage of shipped features that see meaningful user engagement within 30 days.
Customer problem resolution velocity: Time from identifying a system problem that impacts customers to deploying a solution.
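
The two customer value indicators can be sketched the same way. The example assumes that "meaningful engagement" is approximated by a minimum number of distinct users within the 30-day window, and it uses the median for resolution time; both choices, along with the data shapes and names, are assumptions for illustration.

```python
from datetime import datetime, timedelta
from statistics import median

ADOPTION_WINDOW = timedelta(days=30)


def feature_adoption_rate(features, min_users):
    """Share of shipped features reaching `min_users` distinct users in 30 days."""
    if not features:
        return 0.0
    adopted = 0
    for feature in features:
        cutoff = feature["shipped"] + ADOPTION_WINDOW
        users = {user for user, when in feature["events"]
                 if feature["shipped"] <= when <= cutoff}
        if len(users) >= min_users:
            adopted += 1
    return adopted / len(features)


def median_resolution_hours(incidents):
    """Median hours from identifying a customer-impacting problem to deploying a fix."""
    hours = [(i["deployed"] - i["identified"]).total_seconds() / 3600
             for i in incidents]
    return median(hours) if hours else 0.0


# Hypothetical data for illustration only.
shipped = datetime(2025, 1, 1)
features = [
    {"name": "insurance-check", "shipped": shipped,
     "events": [("u1", shipped + timedelta(days=2)),
                ("u2", shipped + timedelta(days=10))]},
    {"name": "export", "shipped": shipped,
     "events": [("u1", shipped + timedelta(days=45))]},
]
incidents = [
    {"identified": datetime(2025, 1, 5, 9), "deployed": datetime(2025, 1, 5, 15)},
    {"identified": datetime(2025, 1, 8, 9), "deployed": datetime(2025, 1, 9, 9)},
    {"identified": datetime(2025, 1, 12, 9), "deployed": datetime(2025, 1, 12, 11)},
]

print(f"Feature adoption rate: {feature_adoption_rate(features, min_users=2):.0%}")
print(f"Median resolution time: {median_resolution_hours(incidents):.1f} hours")
```

The median is used here so that one long-running incident does not dominate the figure; a team that cares most about worst cases might track a high percentile instead.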

Building our communal fluency

This framework requires community collaboration to validate its effectiveness. We need to collect data on whether PDCA cycles reduce short-term token usage and/or pay returns against long-term technical debt while improving customer outcomes.

Our Agile practices are rooted in values and principles. We are committed to upholding human dignity, delivering efficiently, and creating a beneficial long-term impact. Let’s champion a more sustainable way of working with AI that reduces harm and produces benefit for the largest possible set of stakeholders.

AI disclosure: This article was written through disciplined AI collaboration. The prompt examples are drawn from my actual prompts and are co-authored by Claude. I used Claude to identify and summarize research sources, after which I retrieved those sources and read relevant portions, using Claude to help me fact-check my interpretation of the data including a devil’s advocate review of my arguments. I used Claude to help me build my initial outline. I authored the research analysis, arguments, and framework description through multiple drafts. For each draft, I used Claude for copy editing and writing refinement during which I adopted suggestions proposed by the LLM making substantial changes of my own. Aside from portions of the prompts, all content decisions and final language remain my own.

This is a community blog post. The opinions contained within belong solely to the author or authors, and may not represent the opinion or policy of Agile Alliance.

Ken Judy

Ken Judy is Senior Partner at Stride Build, where he coaches technical leaders and serves as an advisor and lead on custom software projects. His focus is building collaborative, team-based organizations and the responsible use of generative AI to address business challenges. Ken lives in Brooklyn, NY.
