How to Generate Unit Tests with AI Coding Assistants

Unit tests are among the most consistently under-written parts of a codebase. Developers know tests matter, agree they should write more of them, and then deprioritize them under deadline pressure, sprint after sprint. The tests that don't get written aren't failing - they just don't exist, which means bugs surface in production instead of in the test suite.

AI coding assistants change this equation in a specific and practical way. They're particularly well-suited to test generation because writing unit tests is a pattern-recognition task: given a function's signature, its purpose, and its edge cases, a test suite follows a predictable structure. AI systems that have processed millions of codebases have seen those patterns thousands of times. Generating tests from function definitions is fast, and reviewing AI-generated tests - even imperfect ones - is faster than writing them from scratch.

This guide covers the workflow for getting useful unit tests from AI coding assistants: how to provide context, how to prompt effectively, how to review what comes back, and how to integrate generated tests without inheriting their failure modes.

Why Test Generation Suits AI Well

AI coding assistants struggle with novel architecture decisions, complex debugging, and understanding undocumented system behavior. They do well with structured, pattern-following tasks where there's a clear mapping between input (the code) and output (the tests).

Unit test generation fits this profile. A test for a pure function requires: one or more inputs, the expected output, and an assertion. The variations are: happy path, edge cases (empty inputs, boundary values, null handling), and error cases (invalid inputs, exception conditions). An AI assistant that sees the function can enumerate these cases reliably, especially when you specify which ones you care about.
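
To make that concrete, here is what the enumeration looks like as Jest tests in TypeScript. The function under test, clamp, is hypothetical - this is a sketch of the structure, not generated output:

```typescript
// clamp.test.ts - assumes a hypothetical pure function:
//   clamp(value, min, max) returns value limited to [min, max]
//   and throws if min > max.
import { clamp } from "./clamp";

describe("clamp", () => {
  // Happy path: a value already inside the range passes through.
  it("returns the value when it is within bounds", () => {
    expect(clamp(5, 0, 10)).toBe(5);
  });

  // Edge cases: boundary values and a degenerate single-point range.
  it("clamps to the nearest bound", () => {
    expect(clamp(-1, 0, 10)).toBe(0);
    expect(clamp(11, 0, 10)).toBe(10);
  });

  it("handles a range where min equals max", () => {
    expect(clamp(7, 3, 3)).toBe(3);
  });

  // Error case: invalid input should fail loudly, not silently.
  it("throws when min is greater than max", () => {
    expect(() => clamp(5, 10, 0)).toThrow();
  });
});
```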

What AI gets wrong in test generation: it can produce tests that pass because they assert the wrong thing - testing the AI's own assumptions about what a function should do rather than what it actually does. It can generate mocking code that mocks away the behavior you most need to test. And it can produce redundant tests that cover the same code path four different ways while missing important edge cases.
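
The first failure mode - asserting the wrong thing - is the hardest to spot in review, because the test reads plausibly and passes. A sketch of the pattern, using a hypothetical formatPrice function:

```typescript
import { formatPrice } from "./formatPrice"; // hypothetical

// Suppose formatPrice has a bug: it drops the thousands separator
// for negative values.

// Tautological: the AI observed (or guessed) the current output and
// asserted it, pinning the bug in place.
it("formats negative prices", () => {
  expect(formatPrice(-1234.5)).toBe("-$1234.50"); // codifies the bug
});

// Useful: asserts the specified behavior, so it fails until the bug
// is actually fixed.
it("formats negative prices with a thousands separator", () => {
  expect(formatPrice(-1234.5)).toBe("-$1,234.50");
});
```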

The review workflow exists to catch these issues before they're committed.

Providing the Right Context

The quality of AI-generated tests scales directly with the quality of the context you provide. At minimum, provide:

The function being tested. Paste the complete function, not just its signature. The implementation details determine what edge cases exist.

The function's dependencies. If the function calls other functions or accesses external state, include that context. Without it, the AI will generate tests that mock dependencies with plausible but incorrect behavior.

The testing framework you're using. Jest tests look different from Pytest tests. Specifying the framework prevents the AI from producing tests that are reasonable in structure but written in the wrong framework's syntax.

Any existing tests for adjacent functions. Showing the AI two or three tests from the same codebase establishes the naming conventions, assertion patterns, and mocking approach your team uses. Generated tests that match your existing style are easier to review and accept.

The known edge cases. If you already know about a specific tricky input - null handling, empty arrays, timezone edge cases - include it in the prompt. AI assistants don't always infer these from code alone.

A structured prompt outperforms an open-ended one. "Write tests for this function" produces less useful output than "Write Jest unit tests for this function. Include the happy path, a null input case, an empty array input case, and a test for when the callback throws. Use the same assertion style as the examples below."
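
Putting all five kinds of context together, a full prompt might look like this - the bracketed sections are placeholders, not literal text:

```
Write Jest unit tests in TypeScript for the function below.

Cover: the happy path, a null input, an empty array input, and the
case where the callback throws.
Match the naming and assertion style of the example tests provided.
Known edge case: the function must not mutate its input array.

[function under test, pasted in full]
[its direct dependencies]
[two or three existing tests from the same test suite]
```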

The Generation Workflow

Step 1: Write the function first. Generating tests alongside unfinished functions produces tests that codify the current (possibly wrong) behavior. Write the function, review it, and then generate tests for the version you intend to ship.

Step 2: Run the generation prompt with explicit constraints. Include the framework, the cases you want covered, and the style examples. Ask for tests one function at a time rather than for entire files - smaller generations are easier to review and make better use of the context window.

Step 3: Review generated tests against the function's specification, not its implementation. This is the most important review step. A test that calls a function and asserts the output matches what the AI expected is not a useful test - it's a tautology. Verify that each assertion is testing behavior you care about, not behavior the AI inferred.

Step 4: Run the tests. AI-generated tests sometimes don't compile, sometimes import missing dependencies, and sometimes fail immediately because of incorrect assumptions. Running them is faster than reading them carefully.

Step 5: Fix or discard what fails. Tests that fail for the wrong reason (incorrect mock setup, missing import) are worth fixing. Tests that fail because the AI misunderstood the function's behavior are worth discarding and rewriting manually.

Step 6: Check coverage. AI-generated tests often cluster on the happy path and miss edge cases even when you asked for them. Run a coverage report and identify which branches the generated tests don't exercise. These are the cases to fill in manually or with a targeted follow-up prompt.
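
For Step 6, Jest can both report coverage and fail the run when branches go unexercised. A minimal config sketch - the threshold numbers are illustrative, not a recommendation:

```typescript
// jest.config.ts - a minimal sketch; thresholds are illustrative.
import type { Config } from "jest";

const config: Config = {
  collectCoverage: true,
  coverageReporters: ["text", "lcov"],
  // Fails the run when generated tests cluster on the happy path
  // and leave branches unexercised.
  coverageThreshold: {
    global: { branches: 80, functions: 85, lines: 85, statements: 85 },
  },
};

export default config;
```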

Reviewing for the Right Failures

The difference between a useful test suite and a false-confidence test suite is whether the tests fail when the code breaks in the ways that matter.

After generating and running tests, introduce one intentional bug into the function - remove a null check, flip a boundary condition, change a return value - and verify the tests catch it. If they don't, the test suite is not testing what you think it's testing. This takes two minutes and is worth doing for any function that's critical to your application logic.

"The fastest way to inherit technical debt from AI tooling is to accept generated tests without verifying they catch real failures. One quick mutation check takes sixty seconds and tells you more about test quality than reading the assertion code." - Dennis Traina, AI automation services by 137Foundry

When AI Struggles With Test Generation

AI test generation works well for pure functions: given input, produce output, no side effects. It works less well for functions that:

Depend on external state. Functions that read from a database, call an API, or depend on the current time require careful mocking. AI-generated mocks sometimes mock at the wrong level of abstraction, producing tests that pass even when the real dependency would fail.

Have complex interaction effects. Tests for functions whose correctness depends on the order of prior state changes require setup code that models that history. AI assistants often generate oversimplified setup that misses the dependency chain.

Involve asynchronous behavior. Async test patterns (async/await in JavaScript, asyncio in Python) require specific handling. AI sometimes generates tests that don't wait properly for resolution, producing tests that pass because they never actually evaluate the assertion - see the sketch after this list.

For these categories, AI generation is still useful as a starting point, but manual review should be more intensive, and manual supplementation should be expected.
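
The async failure mode is worth seeing concretely, because the vacuous version looks almost identical to the correct one. A sketch, assuming a hypothetical fetchUser that resolves with a user object:

```typescript
import { fetchUser } from "./fetchUser"; // hypothetical async function

// Vacuous: the promise is never awaited, so the test completes before
// the assertion inside .then() runs. Jest reports a pass regardless.
it("loads the user", () => {
  fetchUser("u1").then((user) => {
    expect(user.name).toBe("Ada");
  });
});

// Correct: awaiting forces the assertion to run before the test ends,
// so a wrong name actually fails the test.
it("loads the user (awaited)", async () => {
  const user = await fetchUser("u1");
  expect(user.name).toBe("Ada");
});
```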

Integrating AI-Generated Tests Into CI

The same CI rules apply to AI-generated tests as to manually written ones: they run on every pull request, failures block merging, and flaky tests get fixed rather than skipped.

The practical concern is test quality at scale. If AI-generated tests are added to the suite without the mutation-check review step, you may accumulate tests that provide coverage metrics without providing real protection. Coverage reports show you how many lines are executed - not whether the assertions are meaningful.

GitHub and similar platforms make it easy to enforce test quality gates through required CI checks. Pairing those gates with periodic mutation testing (tools like Stryker for JavaScript or mutmut for Python) catches low-quality tests that have accumulated over time.
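
Setting that up is lightweight. A minimal StrykerJS configuration for a Jest project looks roughly like this - the paths and reporters are illustrative - and runs with npx stryker run:

```json
{
  "testRunner": "jest",
  "coverageAnalysis": "perTest",
  "mutate": ["src/**/*.ts"],
  "reporters": ["clear-text", "progress"]
}
```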

The goal is a test suite that's genuinely useful, not one that passes coverage thresholds. AI generation accelerates writing the tests; the review process determines whether they're worth having.

Building the Habit

The practical value of AI-assisted test generation is not that it produces perfect tests - it's that it removes the friction that causes tests not to get written at all. A developer who can generate a 70% complete test suite in three minutes and review it in five minutes will write more tests than a developer who has to write every assertion from scratch.

The 137Foundry services team integrates AI coding workflows, including test generation, as part of broader AI automation engagements - the same pattern-based generation approach applies to integration tests, API contract tests, and property-based testing when the function signature is well-defined.

For codebases with low test coverage, starting with AI generation for the functions most critical to core behavior - and applying the mutation-check review to those specifically - builds a foundation faster than any manual approach.

The TypeScript documentation on testing and the type system provides useful context on how type information can improve both the prompts you give AI assistants and the tests they generate. Typed function signatures produce better test cases because the AI has more information about valid input ranges and return type expectations.
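
A small illustration of the difference - both signatures are hypothetical:

```typescript
// Weakly typed: the AI must guess what status can be, whether items
// may be empty, and what comes back on bad input.
declare function summarize(items: any, status: string): any;

// Strongly typed: the valid status variants, the readonly input, and
// the null return are all visible from the signature, so generated
// tests can enumerate those cases directly.
type Status = "draft" | "published" | "archived";
interface Item {
  id: string;
  status: Status;
  views: number;
}

declare function summarizeTyped(
  items: readonly Item[],
  status: Status
): { count: number; totalViews: number } | null;
```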

OpenAI and other AI providers continue to improve code generation capabilities with each model release, meaning test generation quality improves over time. The workflow described here is framework-agnostic: the same context-provision, generation, and review steps apply whether you're using GPT-4, Claude, Copilot, or a self-hosted model.

The tests you don't write are the bugs you find in production. AI generation makes writing them faster. The review process makes them trustworthy. Together, they're a practical path to the test coverage most codebases should have but consistently don't.
