AI Coding Guidelines: How to Write Rules Engineers Follow

Every engineering organization that has adopted AI coding assistants over the past three years has gone through some version of the same arc. Early enthusiasm. Productivity gains for some people, mixed results for others. A growing pile of subtle bugs that everyone suspects are AI-generated but no one can prove. A management decision that "we need guidelines." A document drafted by a small group, circulated for review, and quietly ignored by everyone after week two.

The guidelines fail not because the rules are wrong but because the document was written like a compliance manual. Engineers do not follow compliance manuals. They follow heuristics that show up in code review, pair programming, and Slack threads. If your AI coding guidelines do not live in those places, they do not exist.

This is what we have learned at 137Foundry about writing guidelines that survive contact with a real engineering team.

Start with the Problems, Not the Rules

The standard guideline document opens with a list of allowed and disallowed tools, then a list of rules. That ordering is backwards. The people who need to internalize the rules first need to understand why the rules exist, and the why is almost always a specific failure that already happened on the team.

Write the guidelines starting with three to five specific incidents:

The pull request where the AI hallucinated a call that read plausibly in review but targeted a library method that does not exist
The refactor where the AI removed an edge case that turned out to be the entire reason a workaround existed
The test the AI wrote that exercised the wrong code path and passed for the wrong reason
The merge where the AI rewrote a comment that documented a load-bearing constraint

Each incident becomes the introduction to a corresponding rule. When an engineer encounters a similar situation six months later, they remember the story before they remember the rule. The story is what makes the rule sticky.

This is the same reason aviation safety procedures are taught through case studies of specific crashes rather than as abstract checklists. The narrative anchors the discipline.

Write Rules in Terms of Code Review Behavior

The next mistake is writing rules in absolute terms. "AI must not generate database queries." "AI must not write security-critical code." These rules are unenforceable, and engineers stop reading after the third one because they know the absolute version will not survive a real deadline.

Better: write the rules in terms of what a code reviewer should expect to see in the pull request description and the diff.

For example:

If the diff contains a database query that was generated or substantially modified by an AI assistant, the PR description should include the test that exercises the new query against a non-trivial fixture, and the reviewer should ask the author to walk through the explain plan in the review thread.

That is a rule a reviewer can actually apply. It does not prohibit anything; it raises the bar on what the AI-generated diff has to include before it can ship. The pull request author either does the work upfront or accepts that the review will bounce. Either outcome moves the team forward.

The same pattern applies for security-critical code, performance-sensitive paths, and shared infrastructure: do not ban AI involvement; require additional evidence of correctness in the artifact submitted for review.
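
One way to make this pattern concrete is a small pre-review check that maps a sensitive category to the evidence the PR description should contain. The sketch below is illustrative only: the category name, the section headings it searches for, and the `missing_evidence` function are assumptions made for this example, not part of any particular review tool.

```python
# Hypothetical mapping from a code category to the evidence a reviewer
# expects to find in the PR description. The headings are assumptions;
# a team would substitute whatever its own PR template uses.
REQUIRED_EVIDENCE = {
    "database-query": ["## Query test", "## Explain plan"],
}


def missing_evidence(pr_description: str, categories: list[str]) -> dict[str, list[str]]:
    """Return, per category, the expected headings absent from the description."""
    lowered = pr_description.lower()
    gaps: dict[str, list[str]] = {}
    for category in categories:
        absent = [
            heading
            for heading in REQUIRED_EVIDENCE.get(category, [])
            if heading.lower() not in lowered
        ]
        if absent:
            gaps[category] = absent
    return gaps


description = """\
Adds a paginated orders query.

## Query test
tests/test_orders.py::test_pagination_against_large_fixture
"""

print(missing_evidence(description, ["database-query"]))
# -> {'database-query': ['## Explain plan']}
```

Nothing here blocks the change. The check only tells the author, before a reviewer has to, which piece of evidence is still missing.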

Three Categories That Need Specific Rules

The categories where AI assistants most often produce subtly wrong code are the ones worth writing explicit guidelines for. From observed patterns across teams:

Code that interacts with external systems

API clients, database queries, message-queue producers and consumers, and webhook receivers. The AI knows the general shape of these systems but not your specific integration's quirks. It will confidently produce code that handles the happy path correctly and ignores half the failure modes.

Rule pattern: "AI-generated code that calls an external system requires a brief in the PR description listing the failure modes the code handles. Reviewer confirms the list is complete against the actual API documentation."

Refactors of code with implicit invariants

Code that has been in production for two years probably has invariants that are not explicit. A function whose order of operations matters but is not documented. A class whose internal state has constraints that are enforced by where it is called from. The AI does not see these. It will refactor and break them.

Rule pattern: "Any AI-generated refactor of code older than six months requires that the original author or a current owner of the file approve the change, not just any reviewer."
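
The approval rule above can be expressed as a small merge gate. This is a sketch under stated assumptions: "older than six months" is approximated as 183 days since the file was first committed, and the function names, the owner set, and the way approvers are represented are all hypothetical.

```python
from datetime import date, timedelta

# Assumption: "older than six months" is approximated as 183 days.
OWNER_REVIEW_AGE = timedelta(days=183)


def needs_owner_approval(first_committed: date, today: date, ai_generated_refactor: bool) -> bool:
    """True when an AI-generated refactor touches code past the age threshold."""
    return ai_generated_refactor and (today - first_committed) > OWNER_REVIEW_AGE


def can_merge(
    first_committed: date,
    today: date,
    ai_generated_refactor: bool,
    approvers: set[str],
    owners: set[str],
) -> bool:
    """Require at least one approval, and an owner's approval when the rule applies."""
    if not approvers:
        return False
    if needs_owner_approval(first_committed, today, ai_generated_refactor):
        # Any reviewer is not enough: at least one approver must be an owner.
        return bool(approvers & owners)
    return True


old_file = date(2022, 3, 1)
today = date(2024, 3, 1)
owners = {"alex"}

print(can_merge(old_file, today, True, {"sam"}, owners))          # -> False
print(can_merge(old_file, today, True, {"sam", "alex"}, owners))  # -> True
print(can_merge(old_file, today, False, {"sam"}, owners))         # -> True
```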

Tests

AI-generated tests are the trickiest category because they can pass while testing the wrong thing. A test that asserts a specific output value when the actual contract is "the output is monotonically non-decreasing" will pass on the value the AI happened to generate and fail when a future implementation produces a different valid value.

Rule pattern: "AI-generated tests must include a comment explaining what behavior is being verified and why. PRs whose only test changes are AI-generated and lack the explanatory comment do not merge."
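
A minimal illustration of the difference, using a made-up retry backoff helper whose only contract is that delays never decrease. Both implementations and both test functions are hypothetical and exist only to show the failure mode described above.

```python
def backoff_delays_v1(attempts: int) -> list[int]:
    """Original implementation: delays double on every retry."""
    return [2 ** i for i in range(attempts)]


def backoff_delays_v2(attempts: int, cap: int = 5) -> list[int]:
    """Later implementation: same doubling, but capped. Still non-decreasing."""
    return [min(2 ** i, cap) for i in range(attempts)]


def is_non_decreasing(seq: list[int]) -> bool:
    return all(a <= b for a, b in zip(seq, seq[1:]))


def value_pinned_test(fn) -> bool:
    # Brittle: pins the exact values one implementation happened to return.
    return fn(4) == [1, 2, 4, 8]


def contract_test(fn) -> bool:
    # Verifies the behavior callers rely on: a retry never waits less than
    # the retry before it. Exact delay values are an implementation detail.
    return is_non_decreasing(fn(4))


print(value_pinned_test(backoff_delays_v1), value_pinned_test(backoff_delays_v2))
# -> True False
print(contract_test(backoff_delays_v1), contract_test(backoff_delays_v2))
# -> True True
```

The comment inside `contract_test` is the kind of explanation the rule asks for: it states the behavior under test and why it matters, which makes a value-pinned assertion easy to spot in review.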

Make the Guidelines Discoverable in the Workflow

A document in a wiki nobody reads is not a guideline. The guidelines need to surface where engineers are already working:

The PR template includes a checkbox for "AI-generated code present" and links to the relevant section of the guidelines
The code review tool includes a saved comment or macro for the most common review feedback related to AI-generated code
The team chat has a pinned message linking to the guidelines, updated when they change
The onboarding doc for new engineers includes a 20-minute reading of the guidelines on day three
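
As a rough sketch of the first item, a bot or CI step could detect the checked box in the PR body and reply with a pointer to the guidelines. The code assumes the template renders the checkbox as a Markdown task-list item with the exact label "AI-generated code present"; the reminder text and the function name are placeholders.

```python
import re
from typing import Optional

# Matches a checked Markdown task-list item carrying the template's label,
# e.g. "- [x] AI-generated code present". The label is an assumption about
# how the team's PR template is worded.
CHECKED_BOX = re.compile(
    r"^\s*[-*]\s*\[[xX]\]\s*AI-generated code present",
    re.MULTILINE,
)

# Placeholder reminder; a real one would link to the relevant guideline section.
GUIDELINES_REMINDER = (
    "This PR is marked as containing AI-generated code. "
    "Please check it against the team guidelines linked in the PR template."
)


def reminder_for(pr_body: str) -> Optional[str]:
    """Return the reminder when the box is checked, otherwise None."""
    return GUIDELINES_REMINDER if CHECKED_BOX.search(pr_body) else None


checked = "Summary: adds retry logic\n\n- [x] AI-generated code present\n"
unchecked = "Summary: adds retry logic\n\n- [ ] AI-generated code present\n"

print(reminder_for(checked) is not None)    # -> True
print(reminder_for(unchecked) is not None)  # -> False
```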

The discoverability layer is more important than the wording of the guidelines themselves. A mediocre guideline that gets referenced in every PR review will shape behavior. A perfect guideline buried in a wiki will not.

"The AI coding guideline is just a hypothesis until you see it referenced in a code review thread. After that, it's culture." - Dennis Traina, founder of 137Foundry

Build Time for Review of AI-Generated Code Into Sprint Planning

A subtle failure mode is treating AI-assisted code as faster code. The naive expectation is that engineers ship more features per sprint because the AI writes faster. The actual outcome is usually that engineers ship the same volume but with more subtle bugs, because catching those bugs takes as much review time as the AI saved on writing, or more, and nobody planned for that time.

The fix is to plan sprints with explicit review buffer for AI-generated changes. If a feature was prototyped with heavy AI assistance, the review time estimate doubles. If a refactor used AI to scaffold the change, the review time estimate triples. These multipliers should be applied at planning time, not at the end of the sprint when the review backlog has already built up.

Some teams formalize this with a "review budget" per engineer per sprint. AI-generated code consumes the budget faster than hand-written code. When the budget is spent, no more AI-generated changes can be queued. This is a hard discipline to maintain, but it forces the team to internalize the actual cost of AI assistance.
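
The arithmetic is simple enough to sketch. The multipliers below come from the planning rule above (double for AI-prototyped features, triple for AI-scaffolded refactors); the category names, the hours, the eight-hour budget, and the choice to defer a change that would overrun the budget are all assumptions for illustration.

```python
from dataclasses import dataclass

# Multipliers from the planning rule; the category names are hypothetical.
REVIEW_MULTIPLIER = {
    "hand-written": 1.0,
    "ai-prototyped": 2.0,
    "ai-scaffolded-refactor": 3.0,
}


@dataclass
class Change:
    title: str
    kind: str
    base_review_hours: float


def planned_review_hours(change: Change) -> float:
    """Base review estimate scaled by how the change was produced."""
    return change.base_review_hours * REVIEW_MULTIPLIER[change.kind]


def plan_reviews(queue: list[Change], budget_hours: float) -> tuple[list[str], list[str], float]:
    """Accept changes in order; defer any that would push past the budget."""
    accepted, deferred, used = [], [], 0.0
    for change in queue:
        cost = planned_review_hours(change)
        if used + cost <= budget_hours:
            accepted.append(change.title)
            used += cost
        else:
            deferred.append(change.title)
    return accepted, deferred, used


queue = [
    Change("Export endpoint", "ai-prototyped", 2.0),           # 4.0 h planned
    Change("Billing refactor", "ai-scaffolded-refactor", 1.5),  # 4.5 h planned
    Change("Typo fix", "hand-written", 0.5),                    # 0.5 h planned
]

print(plan_reviews(queue, budget_hours=8.0))
# -> (['Export endpoint', 'Typo fix'], ['Billing refactor'], 4.5)
```

Running the numbers at planning time makes the trade-off visible: the refactor that looked like 1.5 hours of review is the item that does not fit.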

How Guidelines Should Change Over Time

The guidelines are not a one-time document. They are a living artifact that reflects what the team has learned about AI failure modes. A few mechanics that work:

Quarterly review: the engineering manager pulls the last three months of post-mortems and code review threads, identifies recurring patterns, and updates the guidelines.
Inline annotations: when a code review catches an issue the guidelines did not cover, the reviewer files an issue against the guidelines doc. The doc tracks "candidate rules" before they become formal.
Sunset rules: when a tool or pattern stops being relevant (the AI assistant changed behavior, a deprecated library was replaced), the rule is removed. Outdated rules in a guideline doc undermine the credibility of the current ones.

MIT Technology Review's coverage of enterprise AI adoption describes some of the broader organizational patterns around this. The IEEE has also published useful work on software engineering practices around AI assistance. For applied research, arXiv is the canonical archive for current papers on AI-assisted software development, and the ACM Digital Library hosts peer-reviewed work on the same topic.

What to Skip

A few sections that show up in published guidelines but consistently fail to influence behavior:

Long preambles about AI ethics. Engineers know AI assistants are not infallible. The preamble that explains this is the part everyone skips. Get to the rules.

Rigid bans on specific tools. A ban on "Tool X" will be ignored when a contractor or a new hire uses Tool X anyway because they did not read the doc. Bans on tools work in regulated industries with auditable workflows; they do not work in general software teams.

Compliance language. "All engineers must adhere to the following policies." Engineers stop reading at "must adhere." Use direct, conversational language.

Generic productivity claims. "AI assistants can boost engineering productivity when used responsibly." This is filler. Cut it.

A Sample Structure That Works

A useful structure for a 1,500-word internal guideline document:

Three to five real incidents the team has experienced (with names of files and PRs but not engineers' names)
Three to five rules in the form "when X happens in a PR, the reviewer should look for Y"
The categories of code where additional review is required (with examples)
The workflow integration: PR template language, review macros, Slack channel
The change process: who updates the document, how often, how to propose changes

That is roughly the structure 137Foundry uses internally and that we have helped client teams adopt as part of our AI automation service. The shorter the doc, the more likely it survives.

The Cultural Layer Matters More Than the Rules

The honest meta-point is that no document changes engineering behavior on its own. The team's culture around AI use does. Teams where senior engineers visibly use AI assistants and visibly catch AI errors in review have healthier dynamics than teams where AI use is either banned or unspoken.

The guidelines doc is a forcing function for that culture to become explicit. The first version of the document is probably wrong in detail but right in direction. The third version, after two quarterly revisions, starts to be useful. The fifth version is the version that actually shapes how the team works.

For teams setting up this kind of process for the first time, the services page at 137Foundry has more on how we help engineering organizations adopt AI tools without losing institutional knowledge. The about page covers the broader perspective behind why we think the cultural layer matters as much as the technical one.

The teams that get this right end up with two compounding advantages: faster shipping on the work AI is genuinely good at, and a clearer view of which parts of the codebase still need careful human attention. The teams that skip the guideline work tend to ship the same volume of features with a slowly degrading codebase, and they do not notice until the third incident in a row traces back to AI-generated code that looked right but was not.

The fix is unglamorous: write the guidelines from real incidents, embed them in the review workflow, plan for review time honestly, and update the document quarterly. In our experience, teams that keep this up pull ahead of teams that do not, and the gap widens with every quarter.