How to Debug AI-Generated Code That Behaves Wrong

The "it compiles but it does not work" problem is older than AI coding assistants. What is new is the rate at which it shows up. A prompt that took thirty seconds to type can produce a hundred lines of plausible-looking code that runs without errors, passes a quick smoke test, and silently does the wrong thing.

The cost is not the generation; it is the debugging. A bug in code you wrote yourself is debugged with intuition you already have. A bug in code an assistant produced carries an extra cost: you first have to understand code you did not write.

This post is a practical workflow for that case. It covers the failure modes that show up most often, the order to check them in, and where the standard debugging instincts need to shift when the code under inspection came from an assistant.

The shape of the problem

Traditional debugging assumes you wrote the code, or someone on your team wrote it, and the code reflects the author's mental model of the system. The bug is usually a gap between what the author intended and what they typed.

AI-generated code has no author whose mental model it reflects. The assistant produced something that pattern-matched against your prompt and its training data. That output may be:

Code that solves a related but different problem
Code that uses a library or API that does not exist
Code that uses a real API in an outdated or wrong way
Code that handles common cases but fails on edge cases the prompt did not mention
Code that is correct for a different version of the language, framework, or runtime

Each failure mode has a different signature and a different fastest path to the fix. Knowing which one you are dealing with cuts the debugging time significantly.

Step 1: Reproduce, then read the code carefully

Same as any debugging session. Get a reliable repro before changing anything. A flaky symptom you cannot reliably reproduce becomes worse, not better, when you start modifying the code looking for the cause.

After you have a repro, read the generated code carefully. Slowly. Not skimming.

This is the step most people skip. The temptation is to treat AI output as a black box (drop it in, see if it works), and that habit carries over into debugging. It should not. The output is just code; you can read it and understand it, and reading it carefully will often reveal the bug without any deeper investigation.

What you are looking for at this stage:

Function and method names that you have not heard of. Search the library docs for them; if they do not exist, the assistant invented them.
Constants and string literals (URLs, configuration keys, environment variable names). Verify each one is real.
Type annotations or signatures that look subtly off (returning a list when the function should return a generator, taking a string when the real API takes an enum).
Comments that contradict the code below them. Often the assistant produced a correct comment and incorrect code, or vice versa.
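Building the list of names to look up can be partly mechanical. The sketch below is one possible approach, not a complete tool: it parses a snippet with Python's standard ast module and prints the imported modules and dotted call targets. The snippet in GENERATED_SOURCE is made up for illustration, and its json.parse call is deliberately wrong (Python's json module provides loads, not parse). The helper names dotted_name and names_to_verify are hypothetical.

```python
import ast

# Hypothetical assistant output, kept as a string so it is parsed, never executed.
GENERATED_SOURCE = '''
import json
from datetime import datetime

def load(text):
    data = json.parse(text)  # deliberately wrong: json has loads, not parse
    return data, datetime.now()
'''


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def names_to_verify(source):
    """Collect imported modules and dotted call targets from source code."""
    imports, calls = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module)
        elif isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name and "." in name:
                calls.add(name)
    return sorted(imports), sorted(calls)


imports, calls = names_to_verify(GENERATED_SOURCE)
print("imports:", imports)  # ['datetime', 'json']
print("calls:", calls)      # ['datetime.now', 'json.parse']
```

The output is only a reading list. Each name still has to be checked against the library's documentation or the installed package.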

For a longer view of how this fits into a broader team practice, the 137Foundry AI automation work is built around exactly this kind of integration between AI output and production-grade code review.

Step 2: Verify every external dependency

This is the single highest-yield check on AI-generated code. Hallucinated packages, hallucinated API methods, and hallucinated function signatures are common enough that they should be the first thing you eliminate.

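For Python: run pip show <package_name> for every imported package, keeping in mind that the import name can differ from the distribution name (import yaml is provided by the PyYAML distribution). If a package is not installed, confirm on PyPI that it exists and is the project you expect before installing it, then verify that the function the code calls actually exists in the installed version. PyPI is the source of truth for what packages exist.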

For Node.js: run npm view <package_name> for every imported package. Check the npm registry directly for any package the assistant introduced that you did not specifically request.

For API calls (REST, GraphQL, etc.): open the actual API documentation in a browser. Confirm the endpoint path, request method, required headers, and response shape. AI assistants often mix up parameter names between versions of the same API, or invent parameters that fit the pattern but do not actually exist.

For internal services and libraries: check that the function the code calls exists in your codebase with the signature the assistant assumed. Renamed functions are a common cause of "looks right, runs, returns wrong data" bugs.

Eliminating hallucinated dependencies takes ten minutes and resolves a surprising fraction of AI-generated code bugs.
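For Python, part of this check can be scripted. The following is a minimal sketch, assuming the module and attribute names are ones you pulled from the generated code; check_callable is a hypothetical helper, not part of any library. It relies only on the standard importlib and inspect modules.

```python
import importlib
import importlib.util
import inspect


def check_callable(module_name, attr_name):
    """Report whether module_name is importable and exposes attr_name."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when the parent of a dotted module name is missing.
        spec = None
    if spec is None:
        return f"{module_name}: not installed in this environment"

    module = importlib.import_module(module_name)
    if not hasattr(module, attr_name):
        return f"{module_name}.{attr_name}: no such attribute"

    target = getattr(module, attr_name)
    try:
        return f"{module_name}.{attr_name}{inspect.signature(target)}"
    except (TypeError, ValueError):
        # Non-callables and some C-implemented callables expose no signature.
        return f"{module_name}.{attr_name}: exists (signature unavailable)"


print(check_callable("json", "loads"))   # real function; prints its signature
print(check_callable("json", "parse"))   # json.parse: no such attribute
print(check_callable("not_a_real_package_xyz", "anything"))
```

Importing a module executes its top-level code, so run a check like this only against packages you already trust. A printed signature also tells you what the installed version accepts, which is worth comparing against the arguments the generated code passes.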

Step 3: Check the edge cases the prompt did not mention

If the dependencies are real and the code structure looks reasonable but the behavior is still wrong, the next most common cause is edge case handling that does not match your data.

Write down what the prompt explicitly said the code should do. Then write down the cases your real data includes that the prompt did not mention. The bug is usually in the gap.

Specific things to check:

Empty inputs (empty list, empty string, null/None values)
Single-element inputs (the code may have an off-by-one bug that only fires on length 1)
Inputs that contain the delimiter or escape character the code uses internally
Unicode characters in inputs that the code assumes are ASCII
Very large inputs that hit a recursion limit or a buffer size
Concurrent inputs where the code assumes sequential operation
Inputs in time zones, locales, or character encodings other than the assistant's default assumption

For each gap, write a specific test case that exercises it. Run them. The bug will usually fire on one of them, and at that point you have a localized repro and the fix is local.
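As an illustration of turning that gap list into test cases, here is a small sketch. The function split_fields stands in for assistant output and is entirely hypothetical, as are the expected values, which assume that an empty line should yield no fields and that a quoted comma belongs to its field. Your own requirements may differ.

```python
def split_fields(line):
    """Hypothetical assistant output: split a comma-separated line into fields."""
    return line.split(",")


# (description, input, expected). Expected values are written by hand from the
# behavior we want, not copied from whatever the function happens to return.
CASES = [
    ("typical input", "a,b,c", ["a", "b", "c"]),
    ("empty string", "", []),
    ("single field", "a", ["a"]),
    ("quoted delimiter", 'a,"b,c"', ["a", "b,c"]),
    ("non-ASCII text", "café,naïve", ["café", "naïve"]),
]


def run_cases(func, cases):
    """Run func on every case and return the ones whose output differs."""
    failures = []
    for description, given, expected in cases:
        actual = func(given)
        if actual != expected:
            failures.append((description, given, expected, actual))
    return failures


for description, given, expected, actual in run_cases(split_fields, CASES):
    print(f"FAIL {description}: {given!r} -> {actual!r}, expected {expected!r}")
```

With this stand-in, the empty-string and quoted-delimiter cases fail while the typical case passes, which is exactly the pattern a quick smoke test misses.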

Step 4: Check versioning assumptions

AI assistants are trained on documentation and code written over many years. They often generate code that is correct for an older or newer version of the same library, framework, or language than the one you are actually using.

Specific things to check:

Python language version (f-strings vs .format(), walrus operator, structural pattern matching)
JavaScript/TypeScript runtime version (async/await, optional chaining, top-level await)
Library major version (the API for pandas 1.x vs 2.x, Pydantic 1.x vs 2.x, React 17 vs 18 vs 19)
Database version (SQL syntax, function availability, JSON column types)
API version (v1 vs v2 endpoints, deprecated parameters)

If a piece of code looks correct but fails with a "no attribute" or "unknown function" error, version mismatch is the most likely cause. The Stack Overflow tag pages are useful for confirming which feature appeared in which version when the docs are vague.
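Before comparing against documentation, it helps to know exactly which versions you are running. A minimal sketch using only the standard library (importlib.metadata is available from Python 3.8) follows. The distribution names passed in are examples, and installed_versions is a hypothetical helper.

```python
import sys
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("python:", ".".join(map(str, sys.version_info[:3])))

# Example names only; list the distributions your generated code imports.
for name, version in installed_versions(["pandas", "requests"]).items():
    print(f"{name}: {version or 'not installed'}")
```

With the versions in hand, the library's changelog or release notes will tell you whether the method the assistant used exists in what you have installed.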

Step 5: Check the silently-wrong cases

The hardest bugs in AI-generated code are the ones where the code runs, returns a value, and the value is subtly wrong. No exception, no obvious indicator. Just wrong numbers, wrong dates, wrong groupings.

A short checklist for this case:

Date and time handling. Time zones, daylight saving boundaries, leap seconds, ISO 8601 parsing. AI-generated date code is wrong shockingly often.
Floating-point arithmetic. Cumulative rounding errors, comparison with equality (==), precision loss in unit conversions.
Integer overflow in languages where it matters.
Off-by-one errors in array slicing, range bounds, pagination logic.
Sort stability and ordering assumptions.
Set vs list semantics (deduplication, ordering).
Aggregation logic across groups (sum vs cumulative sum, average vs median, count distinct vs count).

For each, write a test case with a known correct answer and compare against the code's output. Mismatches localize the bug.
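Three items from the checklist are easy to show with known answers. This is an illustrative sketch rather than a full test suite; the page-count functions are hypothetical stand-ins for generated code.

```python
import math
from datetime import datetime, timedelta, timezone

# Floating point: equality on computed floats is fragile.
total = 0.1 + 0.2
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True


def page_count_floor(total_items, page_size):
    """Common off-by-one: floor division drops the final partial page."""
    return total_items // page_size


def page_count_ceil(total_items, page_size):
    """Ceiling division written with integer arithmetic only."""
    return -(-total_items // page_size)


# Known answer: 101 items at 10 per page need 11 pages.
print(page_count_floor(101, 10), page_count_ceil(101, 10))  # 10 11

# Time zones: one instant falls on different calendar dates in different zones.
instant = datetime(2024, 3, 1, 2, 30, tzinfo=timezone.utc)
local = instant.astimezone(timezone(timedelta(hours=-5)))
print(instant.date(), local.date())  # 2024-03-01 2024-02-29
```

None of these raises an exception, which is why they survive a smoke test. Only the comparison against a hand-computed answer exposes them.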

"The first hour with AI-generated code is spent confirming it does what the prompt said. The second hour is confirming it does what you actually meant. The second hour is the one teams skip and pay for later." - Dennis Traina, founder of 137Foundry

Step 6: Verify the test coverage actually tests the bug

If you wrote tests for the AI-generated code (and you should have), the bug surviving the tests means the tests do not cover the failure case.

Read the tests. Confirm each one is checking the behavior you think it is checking. AI assistants often generate tests that exercise the code without verifying the output (run the function, do not assert anything meaningful about what it returned), or tests that use the same flawed assumption as the implementation (the test asserts a wrong answer is "correct" because the assistant's mental model was consistently wrong across both).

This is a specific failure mode of AI-generated test code: tautological tests that pass for the wrong reason. Run each test in your head and ask: if the implementation were silently wrong, would this test catch it? If the answer is no, the test does not earn its keep, and you need to write a real one.
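The difference is easier to see side by side. In the sketch below, median is a hypothetical piece of assistant output with a silent bug on even-length input, and the three check functions are made-up examples of the test styles described above.

```python
def median(values):
    """Hypothetical assistant output: wrong for even-length input."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def check_runs_without_asserting():
    median([1, 2, 3, 4])  # exercises the code, verifies nothing


def check_mirrors_the_implementation():
    values = [1, 2, 3, 4]
    # Repeats the same flawed logic, so it agrees with the bug.
    assert median(values) == sorted(values)[len(values) // 2]


def check_known_answer():
    # Expected value worked out by hand: (2 + 3) / 2.
    assert median([1, 2, 3, 4]) == 2.5


for check in (
    check_runs_without_asserting,
    check_mirrors_the_implementation,
    check_known_answer,
):
    try:
        check()
        print(f"{check.__name__}: passed")
    except AssertionError:
        print(f"{check.__name__}: FAILED")
```

Only the known-answer check fails here, and it fails because the implementation is wrong. The first two pass against the buggy code, which is what makes them tautological.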

For JavaScript built-ins and web APIs, the Mozilla Developer Network (MDN) reference is a good non-AI source for verifying expected behavior. For other languages, use the official standard library documentation.

Step 7: Strip and rebuild if it is still wrong

If you have worked through the previous steps and the code is still wrong, the kindest thing is usually to strip it and write the piece by hand.

A few hours of careful manual work on a contained problem is often less expensive than another debugging cycle on AI output that has resisted three fixes. The signals that you have reached this point: each round of fixes introduces a different bug, the assistant cannot explain why the previous version was wrong, and you cannot tell whether the latest version is better or just differently broken.

Take the prompt, take what you have learned about the actual problem, and write the implementation directly. Save the AI output as a reference for the test cases you want to cover.

The general workflow

Putting the steps together:

Reproduce reliably.
Read the code carefully.
Verify every imported package and external API exists.
Test edge cases the prompt did not mention.
Check versioning assumptions.
Check silently-wrong cases (dates, floats, off-by-one, aggregation).
Verify the tests actually test the behavior, not the assistant's flawed assumption.
Strip and rebuild if the code is still wrong after all of the checks above.

This is the workflow. It is not magic. The discipline is in actually doing each step instead of skipping ahead.

For more about the team-level practices that make this kind of code review sustainable on real projects, the 137Foundry AI automation work is built around exactly this integration. The 137Foundry web development practice handles the broader project-shape side, and the 137Foundry homepage is the front door for getting in touch about specific engagements.

A closing observation

The fundamental shift with AI-generated code is that the cost of producing the first draft has dropped dramatically while the cost of debugging has stayed roughly the same. The result is that debugging quality is the limiting factor on whether AI-assisted development is a net win or net loss for a team.

The teams that win this trade-off are the ones that take debugging seriously and have a repeatable workflow for it. The teams that lose are the ones that treat AI output as a finished product. Whether you are using a coding assistant for the first time or have integrated one deeply into your team's workflow, the discipline of careful debugging is what separates the two outcomes.