Legacy code is where the hardest software engineering problems live. Systems that started small, grew accidentally, accumulated quirks, shed their original authors, and now run critical functions despite no one really understanding them. AI promises to help with legacy modernisation — and it can, if used carefully. It can also accelerate destruction, if used poorly. This guide is a practical plan for using AI to refactor legacy codebases without breaking production. It covers the test-harness discipline that makes refactoring safe, the incremental patterns that produce reviewable diffs, the language migrations AI handles particularly well, and the anti-patterns that turn AI-assisted refactoring into expensive disasters.
What makes legacy code hard
The reasons legacy code resists change are not about the code itself; they are about the context around it.
The original authors are usually gone. Institutional knowledge of why specific choices were made has evaporated. The code looks mysterious because the context has been lost.
Tests are often missing or inadequate. Without tests, every change is a leap of faith. Refactoring becomes terrifying because you cannot verify you have not broken something subtle.
Dependencies are tangled. The code calls other code that calls other code, touching parts of the system no one remembers exist. Refactoring one module may break something far away.
Production depends on the code. Legacy systems often run critical business functions. The cost of breaking them is high. Caution is reasonable but compounds the inertia.
These factors together are what make legacy refactoring hard in practice. AI does not eliminate any of them, but it can help you work through them more efficiently if you respect them rather than pretending they do not apply.
The safety principle: tests before changes
Before refactoring anything in a legacy codebase, build a test harness. This is not negotiable. Refactoring without tests is gambling.
The tests you need are characterisation tests — tests that document what the code currently does, without asserting what it should do. The goal is to freeze the current behaviour so that your refactoring cannot change it.
AI is particularly good at writing characterisation tests. Given a function, you can ask the AI to generate tests that exercise the function with representative inputs and capture the current outputs as assertions. Run the tests against the current code to confirm they pass, then use them as a safety net while refactoring.
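A minimal sketch of what this looks like, using a hypothetical legacy function `format_invoice_line` (the function, its quirky truncation, and the test are all illustrative, not from any real codebase):

```python
# Characterisation tests freeze current behaviour. The function below stands in
# for a legacy routine whose outputs we capture exactly as they are today.

def format_invoice_line(qty, price):
    # Hypothetical legacy code: note the odd truncation instead of rounding.
    # We do NOT fix it here; we pin it, so a refactor that silently changes
    # the behaviour will fail the test.
    total = int(qty * price * 100) / 100.0  # truncates, doesn't round
    return "%d x %.2f = %.2f" % (qty, price, total)

def test_pins_current_behaviour():
    # Assertions record what the code DOES today, not what it "should" do.
    assert format_invoice_line(3, 9.99) == "3 x 9.99 = 29.97"
    # The truncation quirk is pinned too: price displays as 1.00, total as 0.99.
    assert format_invoice_line(1, 0.999) == "1 x 1.00 = 0.99"

test_pins_current_behaviour()
```

The second assertion is the important one: it documents a behaviour most engineers would call a bug. Whether to fix it is a separate, deliberate decision made after the refactor, not an accident during it.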
The investment in a test harness often pays back many times over. A week spent characterising a legacy module's behaviour makes every subsequent refactoring attempt safe and verifiable. Without this investment, every change carries risk; with it, the risk shrinks dramatically.
Mapping the blast radius
Before refactoring a module, understand how it connects to the rest of the system. AI can help here by analysing the code and producing a dependency map.
"Analyse the user-service module. List: what calls this module (and what they expect), what this module calls (and what it expects from those), what data formats this module produces or consumes, what infrastructure (database tables, APIs, message queues) it touches. Flag anything that looks fragile or non-obvious."
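The AI's analysis is worth cross-checking mechanically. For Python code, the standard-library `ast` module can produce a rough imports-and-calls inventory to compare against the AI's answer (the `legacy_src` snippet and the `billing` module are hypothetical):

```python
import ast

def map_dependencies(source: str) -> dict:
    """Rough dependency sketch of one module: names imported and names called.
    A mechanical starting point to sanity-check the AI's dependency analysis."""
    tree = ast.parse(source)
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.add(node.module or "")
        elif isinstance(node, ast.Call):
            fn = node.func
            if isinstance(fn, ast.Name):
                calls.add(fn.id)            # e.g. charge(...)
            elif isinstance(fn, ast.Attribute):
                calls.add(fn.attr)          # e.g. json.dumps(...)
    return {"imports": sorted(imports), "calls": sorted(calls)}

legacy_src = """
import json
from billing import charge

def close_account(user):
    charge(user, 0)
    return json.dumps(user)
"""

deps = map_dependencies(legacy_src)
```

This only sees one direction (what the module calls). What calls the module, and what infrastructure it touches at runtime, still need the AI analysis plus grep and human verification.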
This analysis surfaces coupling you might have missed. "I didn't realise the user service was also called from the batch jobs" is the kind of discovery that, made before refactoring starts, saves production incidents later.
Particularly useful: ask the AI to identify modules that depend on the current implementation in ways that would break if you changed it. These are the real blast-radius constraints.
Incremental refactoring patterns
Large refactors fail. Small incremental refactors succeed. AI can help you break a large refactor into a sequence of small ones.
The strangler pattern is the classic approach. Build the new implementation alongside the old, route a small fraction of traffic to the new one, verify parity, gradually shift more traffic, and finally remove the old implementation.
AI can help generate the new implementation alongside the old, write the routing logic, generate comparison tests that verify the old and new produce the same results, and draft the migration scripts. Each step is bounded and reversible.
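A sketch of the routing-plus-parity step, with hypothetical `legacy_price` and `new_price` implementations standing in for the old and new code paths:

```python
import random

def legacy_price(qty):
    # Old implementation: stays authoritative while parity is being verified.
    return qty * 100

def new_price(qty):
    # New implementation under verification.
    return qty * 100

mismatches = []  # parity log, reviewed before shifting any real traffic

def price(qty, rollout_fraction=0.1):
    """Strangler routing in shadow mode: always serve the old result, but
    shadow-call the new implementation for a fraction of traffic and record
    any divergence instead of failing the request."""
    old = legacy_price(qty)
    if random.random() < rollout_fraction:
        new = new_price(qty)
        if new != old:
            mismatches.append((qty, old, new))
    return old
```

Once the mismatch log stays empty under real traffic, the routing flips to serve the new result for a growing fraction of calls, and eventually `legacy_price` is deleted. Every stage is reversible by changing `rollout_fraction`.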
Another useful pattern: rename-before-refactor. Before changing behaviour, rename variables, functions, and files to their new names. This is a mechanical change the AI can execute perfectly. Once names are right, the refactoring becomes clearer.
A third pattern: extract-then-refactor. Before changing a large function, extract its pieces into smaller functions without changing behaviour. The AI can do this cleanly. Then refactor each small function individually.
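Extract-then-refactor in miniature, on a hypothetical reporting function (before and after shown side by side; behaviour is identical, which the characterisation tests confirm):

```python
# Before: one function mixing filtering, computation, and formatting.
def report_before(values):
    vals = [v for v in values if v is not None]
    total = sum(vals)
    return "total=%d count=%d" % (total, len(vals))

# After extraction: same behaviour, but each piece is named and
# independently testable, and can now be refactored on its own.
def _clean(values):
    return [v for v in values if v is not None]

def _total(vals):
    return sum(vals)

def report_after(values):
    vals = _clean(values)
    return "total=%d count=%d" % (_total(vals), len(vals))
```

The extraction commit changes nothing observable, which is exactly why it is safe to review quickly and merge first.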
Language migrations
Language migrations — Python 2 to Python 3, JavaScript to TypeScript, Java 8 to Java 17, legacy to modern frameworks — are a sweet spot for AI-assisted refactoring. The patterns are well-documented, the transformations are often mechanical, and the AI has seen enough examples to apply them reliably.
Typical workflow: give the AI a specific file or module, ask it to translate to the target language/version, run the tests, fix any failures. Iterate file by file.
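A toy illustration of what one migrated function might look like, for a hypothetical Python 2 loader (the original is shown as a comment; the `eval`-to-`json` swap is exactly the kind of semantic change that needs the human attention described below, since `eval` accepted inputs `json.loads` will reject):

```python
# Hypothetical Python 2 original:
#   def load_users(f):
#       data = f.read()
#       return dict((u["id"], u) for u in eval(data))   # py2 idiom + unsafe eval
#
# Migrated Python 3 version: dict comprehension, json instead of eval,
# explicit bytes/text handling. Characterisation tests run against both.
import io
import json

def load_users(f):
    data = f.read()
    if isinstance(data, bytes):          # py3 makes bytes vs text explicit
        data = data.decode("utf-8")
    return {u["id"]: u for u in json.loads(data)}

users = load_users(io.StringIO('[{"id": 1, "name": "ada"}]'))
```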
The non-obvious piece: AI handles the easy 90% of language migrations well. The last 10% — where the source language has idioms that do not map cleanly, or where edge cases in standard library behaviour differ — needs human attention.
A common migration pattern: AI drafts the migrated code; developer reviews for idiomatic quality; tests validate correctness; specific edge cases get hand-tuning. The end result is a migration that would have taken weeks compressed into days, with quality matching what a careful human would produce.
Renaming, re-layering, and re-modelling
Structural refactoring — changing how code is organised rather than what it does — is another AI strength.
Renaming: update variable, function, or class names to reflect current purpose rather than historical one. AI can propose better names and do the rename mechanically. Tools like ast-grep and the standard IDE refactoring tools are useful here; AI helps by suggesting the new names and confirming the mechanics.
Re-layering: moving code between layers (e.g., extracting business logic from controllers into services, pulling database logic out of business logic into repositories). AI can propose the new structure and execute the mechanical work; the architectural choices remain human.
Re-modelling: changing data types and their relationships. For example, replacing a flat user object with a user + user-preferences split. AI can rewrite the code to use the new model and update all callers. The main human input is deciding what the new model should look like.
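The user/user-preferences split above might be sketched like this, with a hypothetical adapter so callers can migrate one at a time rather than in a single sweep:

```python
from dataclasses import dataclass

# Old model (flat): {"id": 7, "name": "Ada", "theme": "dark", "locale": "en"}

@dataclass
class Preferences:
    theme: str = "light"
    locale: str = "en"

@dataclass
class User:
    id: int
    name: str
    prefs: Preferences

def from_flat(flat: dict) -> User:
    """Adapter from the legacy flat dict to the new split model. Keeping this
    at the boundary lets callers move to the new model incrementally; defaults
    cover fields the legacy data sometimes omits (an assumption here)."""
    return User(
        id=flat["id"],
        name=flat["name"],
        prefs=Preferences(
            theme=flat.get("theme", "light"),
            locale=flat.get("locale", "en"),
        ),
    )
```

The AI can mechanically rewrite every caller to go through the new model; deciding that preferences deserve their own type is the human part.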
These structural changes are where AI-assisted refactoring shines. The hard decisions are architectural; the mechanical execution is where AI removes enormous amounts of tedium.
Reading the legacy code
Before changing legacy code, you need to understand it. AI can help with understanding even when the code is poorly documented or strangely written.
A useful prompt: "Read this module. Describe what it does in plain English. Identify the main logic paths. Flag any parts that look unusual, fragile, or that you think might contain bugs. Do not suggest changes yet; just explain."
The AI's description of legacy code is often clearer than the code itself. For particularly tangled modules, this kind of AI-generated walkthrough can be the difference between fearing the code and being able to change it.
When the AI flags fragile or unusual parts, treat those flags as prompts for further investigation, not as bugs to fix. Sometimes the weirdness is intentional; sometimes it is a latent bug. Either way, the flag focuses attention where it is warranted.
Reviewable diffs at every step
Large PRs are the enemy of safe refactoring. Every refactoring step should produce a diff small enough to review meaningfully — usually a few hundred lines at most.
AI makes large diffs easy to produce, which makes discipline harder to maintain. Resist the temptation. If the AI produces a 2,000-line diff, reject it; ask for the changes in three or four commits instead.
A useful workflow: plan the refactoring as a sequence of commits before starting. "The refactor will be: commit 1: extract helper functions. Commit 2: rename misleading variables. Commit 3: change data flow. Commit 4: update tests." Execute one commit at a time, reviewing each before proceeding.
This commit-by-commit discipline means each step is reviewable, each step is reversible, and each step is safe. The refactoring may take longer in calendar time but is dramatically less risky.
Keeping the team in the loop
Large legacy refactorings often involve multiple engineers. AI-assisted refactoring benefits from coordination practices.
Use shared AI sessions or transcripts. When one engineer has done a refactoring pass with AI, sharing the transcript helps teammates understand what was done and why.
Write migration notes. Document the refactoring approach in the repo. Future contributors (and future you) will want to understand why things were done a particular way.
Communicate scope carefully. When you say "I'm refactoring the user service," teammates need to know whether you mean "reorganising internals" or "changing the public interface." Scope confusion is a classic cause of merge conflicts and broken expectations.
Coordinate deployment carefully. If the refactoring touches production behaviour, coordinate with operations about deployment windows, rollback plans, and monitoring.
When not to refactor
Some legacy code should be left alone. AI makes refactoring easier, but that does not mean every legacy module should be refactored.
If the code is stable and rarely changes, refactoring offers little upside and introduces risk. "If it ain't broke" remains sound advice.
If the code is about to be deprecated or replaced, don't refactor. Effort on code that will soon be deleted is wasted.
If the team lacks the expertise to maintain the refactored version, refactoring is counterproductive. Code that is clean but poorly understood is worse than code that is messy but familiar.
If production risk is high and test coverage is low, refactoring carries unacceptable risk. Build test coverage first, then refactor, not the other way around.
If you do not have the mandate to refactor, don't. Large refactorings without organisational support tend to be abandoned mid-way, leaving the code worse than before.
Working with truly ancient code
Some legacy code is not just old but alien — written in older language versions, using now-deprecated frameworks, with styles that have fallen out of fashion. AI can help here too, but with additional care.
AI is usually well-trained on older language and framework versions. It can read COBOL, old PHP, legacy ActionScript, vintage Java. Ask for explanations rather than changes first; understand what you have before touching it.
For migrations from dead technologies to modern ones, the AI's knowledge of current alternatives is usually strong. "Translate this ColdFusion to modern Node.js" is a tractable task.
Watch for outdated security patterns. Legacy code often uses patterns that were acceptable when written but are now known risks (MD5 for passwords, raw SQL concatenation, unescaped output). AI usually flags these if you ask; always ask.
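Two of those patterns and their standard-library replacements, sketched in Python (the legacy versions are shown only as comments; for password hashing, dedicated libraries such as bcrypt or argon2 are stronger still, so treat PBKDF2 here as a stdlib-only illustration):

```python
import hashlib
import os
import sqlite3

# Legacy pattern (do not copy): hashlib.md5(password).hexdigest()
def hash_password(password: str):
    """Salted PBKDF2 from the stdlib, replacing unsalted MD5."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

# Legacy pattern (do not copy):
#   "SELECT * FROM users WHERE name = '" + name + "'"
def find_user(conn: sqlite3.Connection, name: str):
    """Parameterised query: the driver escapes `name`, closing the SQL
    injection hole that string concatenation leaves open."""
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchone()
```

Fixing these during a refactor is a deliberate behaviour change, so it belongs in its own clearly labelled commit, not buried inside a structural one.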
Respect the original intent. Old code was usually written for reasons that made sense at the time. Refactoring should preserve the original behaviour unless there is specific justification for changing it.
When the AI proposes a rewrite instead of a refactor
A specific trap worth noting. When you ask the AI to refactor a module, it sometimes decides a full rewrite would be better and produces one. This is usually wrong, for a predictable reason: rewrites carry a much higher risk profile than refactors, and the AI does not always weigh that risk.
Refactoring preserves behaviour. Rewriting changes behaviour (at minimum, it changes what bugs exist). Refactoring can be validated incrementally against tests. Rewriting often requires all-or-nothing validation.
When the AI heads toward rewrite territory, stop and redirect. "I asked for a refactor that preserves behaviour. Propose a path that keeps the current behaviour but improves the code's structure incrementally." This usually produces a better answer.
If you genuinely want a rewrite, treat it as a separate project with its own test discipline and rollout plan. Do not let it slip in disguised as a refactor.
Common anti-patterns in AI-assisted refactoring
Failure modes to avoid.
The big-bang refactor. Months of work released in one huge PR. Usually gets merged reluctantly because it is too large to review, or rejected because it is too risky. Small increments, always.
Skipping tests. "The AI says it's safe" is not a test strategy. Characterisation tests first, every time.
Fixing unrelated things. While refactoring module X, AI will "helpfully" clean up module Y. Reject these drifts; scope the change.
Trusting the AI's claim of equivalence. AI may say "this refactoring preserves behaviour" when it does not. Verify with tests; don't trust.
Rewriting instead of refactoring. Rewrites are tempting and usually fail. If the AI wants to rewrite a module you asked it to refactor, stop and ask explicitly for a refactoring path instead.
Losing the history. Git blame matters. Large AI-generated diffs can obscure the history of why specific lines exist. Keep changes small enough that blame remains useful.
A worked example: splitting a god class
Concrete scenario. A legacy UserManager class has grown to 3,000 lines. It handles authentication, profile management, notifications, permissions, and reporting. Refactoring target: split into smaller focused classes.
Step 1: build characterisation tests. Ask the AI to generate tests covering representative uses of UserManager. Run them against current code to capture current behaviour.
Step 2: map the dependencies. Ask the AI to list every method of UserManager, what internal state it uses, and what external code calls it. This produces a blueprint for the split.
Step 3: propose a split. Ask the AI to suggest how to break UserManager into focused classes. Review the suggestion with team input; converge on a target structure.
Step 4: execute in commits. Commit 1: extract Authenticator. Commit 2: extract ProfileService. Commit 3: extract NotificationService. And so on. After each commit, run the characterisation tests to confirm behaviour is preserved.
Step 5: remove the UserManager shell once all functionality has been extracted. Update callers to use the new classes directly.
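During steps 4 and 5, a transitional facade keeps existing callers working while they migrate. A sketch, with hypothetical extracted classes and stand-in logic:

```python
class Authenticator:
    def login(self, user, password):
        # Stand-in for the extracted authentication logic.
        return password == "hypothetical-check"

class ProfileService:
    def rename(self, user, new_name):
        # Stand-in for the extracted profile logic.
        user["name"] = new_name
        return user

class UserManager:
    """Transitional shell: the god class becomes a thin facade that delegates
    to the extracted classes, preserving every old call site's signature.
    Deleted in step 5 once no callers remain."""
    def __init__(self):
        self._auth = Authenticator()
        self._profiles = ProfileService()

    def login(self, user, password):          # old signature preserved
        return self._auth.login(user, password)

    def rename(self, user, new_name):         # old signature preserved
        return self._profiles.rename(user, new_name)
```

Because the shell is pure delegation, the characterisation tests written in step 1 keep passing unchanged against it, which is the whole point.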
This approach takes a few weeks rather than months of big-bang rewriting, produces reviewable commits, and preserves behaviour throughout. AI accelerates each step but does not remove the need for discipline.
Refactor legacy code with AI only behind a characterisation test suite. Small reviewable diffs beat big rewrites every time, and AI makes the small-diff approach tractable in ways it previously was not. The refactorings that used to feel impossible because of scale are now genuinely approachable, provided you respect the discipline.
The short version
AI is genuinely useful for legacy refactoring when paired with test discipline and incremental patterns. Build characterisation tests first; map dependencies; refactor in small commits; verify behaviour preservation at every step. Language migrations, renamings, re-layerings, and god-class splits all benefit substantially from AI assistance. Avoid big-bang refactors, verify every AI claim of equivalence with tests, and keep every diff reviewable. The gains are significant — weeks of mechanical work compress to days — but only with the discipline that makes refactoring safe in the first place. Teams that try to skip the discipline usually pay for it later, when a quiet AI-introduced behaviour change causes a production incident that no one can easily diagnose because everyone assumed the refactor preserved behaviour.