AI changes what breaks, not just how much breaks. When Georgetown’s Center for Security and Emerging Technology tested five major LLMs, they found almost half the generated code had security-relevant flaws. A Veracode study put the number at 45%. Another analysis found 62% contain design flaws or known vulnerabilities.
The defect profile is distinct. Where human code tends toward complexity bugs and architectural drift, AI code has more security vulnerabilities, crashes, and duplication. GitClear’s analysis of 153 million lines found copy-paste code rose from 8.3% to 12.3% (a 48% increase). But AI code has less structural complexity, fewer refactoring commits, less defensive programming.
Last week I caught the same transaction processing logic duplicated across four services. Not similar: identical, down to the hardcoded salt. All generated by Copilot in a single afternoon. AI doesn’t write buggy code the way humans do. It writes working code that creates technical debt in ways your existing review habits won’t catch.
So you can’t review AI code the way you review human code. Missing input validation is a common security flaw in AI-generated code across languages and models. Edge cases that a junior developer would be eager to point out, like array overflows or requests filled with nonsensical values, simply don’t get handled in the generated code unless you explicitly flag them upfront. One team’s entire stack broke because of a “return null” that AI wrote for an error case.
Saying this is all AI’s fault would be grossly unfair, but it’s worth acknowledging that how we write software changes what we need to look out for. Here’s what to check.
Security First
Input validation fails most often because models learn from training data filled with vulnerable patterns. When CSET Georgetown tested five major LLMs, they found SQL injection, command injection, and missing input sanitization appearing frequently. These textbook vulnerabilities exist thousands of times in public GitHub repositories, so the model reproduces them. SQL queries built by string concatenation, shell commands constructed from user input, database operations without parameterization show up in AI-generated code at rates far higher than human-written equivalents.
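A minimal sketch of the pattern to flag, using Python’s sqlite3 module as a stand-in for whatever driver your codebase actually uses:

```python
import sqlite3

# In-memory database just so the sketch runs end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, score INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 10)")

def get_user_unsafe(username: str):
    # The shape AI tools frequently emit: the query is assembled by string
    # concatenation, so input like "' OR '1'='1" rewrites the query itself.
    query = "SELECT * FROM users WHERE username = '" + username + "'"
    return conn.execute(query).fetchall()

def get_user_safe(username: str):
    # What review should insist on: a parameterized query, where the driver
    # binds the value and the input can never change the query structure.
    return conn.execute(
        "SELECT * FROM users WHERE username = ?", (username,)
    ).fetchall()

print(get_user_unsafe("x' OR '1'='1"))  # returns every row in the table
print(get_user_safe("x' OR '1'='1"))    # returns nothing, as it should
```

The same shape applies to shell commands: anything that splices user input into a command string deserves the same scrutiny as the concatenated query.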
Authentication logic suffers similarly. Simple prompts like “hook up to a database and display user scores” generate code that bypasses authentication entirely, hard-codes credentials, or implements access controls that look plausible but don’t actually restrict anything. The code compiles, runs, and demonstrates the feature but security assumptions don’t make it into the output unless you explicitly specify them in the prompt.
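Here is a rough sketch of what “plausible but non-restricting” looks like; the names (SESSIONS, delete_account_unsafe) are hypothetical and not tied to any framework:

```python
from dataclasses import dataclass, field

# Hard-coded credential: a pattern that should fail review on sight.
ADMIN_TOKEN = "s3cr3t-admin-token"

@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)

# Stand-in for a real session store, just enough to make the sketch run.
SESSIONS = {"abc123": User("alice", {"admin"})}

def delete_account_unsafe(headers: dict, account_id: int) -> bool:
    # Looks like an access check, but it trusts a client-supplied header that
    # any caller can set. Nothing is actually restricted.
    if headers.get("X-Is-Admin") == "true":
        return True  # would proceed to delete
    return False

def delete_account_reviewed(cookies: dict, account_id: int) -> bool:
    # Server-side session lookup plus an explicit authorization decision
    # before the destructive action.
    user = SESSIONS.get(cookies.get("session_id", ""))
    if user is None or "admin" not in user.roles:
        raise PermissionError("not authorized")
    return True
```

Both versions compile and demo cleanly, which is exactly why the first one slips through a review that only checks whether the feature works.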
Dependencies present a different risk. AI suggests outdated libraries with known CVEs from before its training cutoff, and worse, it hallucinates packages that don’t exist. These phantom dependencies create opportunities for supply chain attacks where malicious actors register the suggested package names. Even legitimate dependencies need verification that they’re current, maintained, and don’t carry known vulnerabilities. Architectural drift is subtler: swapping cryptography libraries for ones with weaker guarantees, removing defensive checks that seemed redundant, changing error handling in ways that expose internal state. These changes appear syntactically correct and often pass automated tools that check syntax rather than security properties.
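One cheap automated gate is checking that every name in a requirements file is actually registered on PyPI, which catches hallucinated dependencies before anyone installs them. A rough sketch, assuming a plain requirements.txt without extras or environment markers; existence alone doesn’t prove a package is safe or current, so pair it with a vulnerability scanner such as pip-audit:

```python
import sys
import urllib.request
from urllib.error import HTTPError

def exists_on_pypi(name: str) -> bool:
    """True if the package name is actually registered on PyPI."""
    try:
        with urllib.request.urlopen(f"https://pypi.org/pypi/{name}/json", timeout=10):
            return True
    except HTTPError as err:
        if err.code == 404:
            return False
        raise

def check_requirements(path: str = "requirements.txt") -> int:
    missing = []
    with open(path) as reqs:
        for line in reqs:
            # Crude parse: take the name before any version pin. Real
            # requirement lines can carry extras and markers this ignores.
            name = line.split("==")[0].split(">=")[0].strip()
            if name and not name.startswith("#") and not exists_on_pypi(name):
                missing.append(name)
    for name in missing:
        print(f"not on PyPI (possible hallucinated dependency): {name}")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(check_requirements(sys.argv[1] if len(sys.argv) > 1 else "requirements.txt"))
```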
GitHub’s official guidance emphasizes running automated security scanning before human review. That catches the mechanical issues, the CWE-catalogued vulnerabilities that static analysis tools recognize. Human review should focus on what automation misses, like whether the code solves the right problem with appropriate security assumptions for your system’s threat model.
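A minimal version of that gate, sketched under the assumption that Bandit is installed and origin/main is the base branch; substitute whatever scanner your pipeline already runs:

```python
import subprocess
import sys
from pathlib import Path

def changed_python_files(base: str = "origin/main") -> list[str]:
    # Python files touched on this branch relative to the base branch.
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.endswith(".py") and Path(f).exists()]

def main() -> int:
    files = changed_python_files()
    if not files:
        return 0
    # Bandit exits non-zero when it finds issues, which fails the job and
    # blocks the PR before a human spends review time on it.
    return subprocess.run(["bandit", *files]).returncode

if __name__ == "__main__":
    sys.exit(main())
```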
Edge Cases and Error Handling
AI excels at happy-path scenarios but consistently overlooks edge cases. It generates code that works with typical inputs but fails catastrophically with boundary conditions like empty arrays, null values, maximum integer values, or Unicode characters. The patterns are predictable enough that you can develop a review habit around them.
Look for try-catch blocks that log errors but don’t actually handle them, error messages that leak implementation details, missing null checks before dereferencing. Ask yourself the obvious questions: What happens if this input is null? What if this API call fails? What if this list is empty? These gaps aren’t bugs in the traditional sense because the code works under normal conditions. They’re missing resilience that shows up only when systems encounter unexpected states.
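A sketch of the two shapes side by side; db.fetch_profile is a hypothetical data-access call standing in for whatever the generated code touches:

```python
import logging

logger = logging.getLogger(__name__)

class ProfileUnavailable(Exception):
    """Domain-level failure that doesn't leak connection strings or internals."""

def load_profile_generated(user_id: str, db):
    # The shape to flag: the exception is logged and then swallowed, and every
    # caller inherits a None it probably never checks for.
    try:
        return db.fetch_profile(user_id)
    except Exception as exc:
        logger.error("fetch failed: %s", exc)  # logs, but does not handle
        return None

def load_profile_reviewed(user_id: str, db):
    # Answer the reviewer's questions explicitly: what if the input is empty,
    # what if the call fails, and what does the caller actually see?
    if not user_id:
        raise ValueError("user_id is required")
    try:
        return db.fetch_profile(user_id)
    except (TimeoutError, ConnectionError) as exc:
        logger.warning("profile fetch failed for user %s", user_id)
        raise ProfileUnavailable("profile temporarily unavailable") from exc
```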
Duplication Across Files
Copy-paste code is rising because AI’s context window limitations prevent it from seeing existing implementations across your codebase. It writes new code instead of suggesting reuse. The duplication often spans files or services, in patterns that local diff tools won’t catch because the variable names changed or the order of operations shifted slightly.
Detecting this requires tools that parse code structure rather than match text. They need to understand behavioral equivalence, recognizing when two functions do the same thing even if they’re written differently. Some analyze abstract syntax trees to find structural similarity. Others build control-flow graphs and compare execution patterns. The specific tool matters less than the capability: you need something that works across files and understands code semantics, not just line-by-line text matching.
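As an illustration of the AST route, here is a toy sketch (Python 3.9+ for ast.unparse) that blanks out identifiers and constants before hashing each function, so copies that only renamed variables still collide. A real clone detector does far more, but the idea is the same:

```python
import ast
import hashlib
from collections import defaultdict
from pathlib import Path

class _Normalizer(ast.NodeTransformer):
    """Erase the details that renaming changes, keep the structure."""
    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node
    def visit_arg(self, node):
        node.arg = "_"
        return node
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)
    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="_"), node)

def _structural_hash(func: ast.FunctionDef) -> str:
    # Re-parse the unparsed source so we never mutate the original tree.
    clone = ast.parse(ast.unparse(func))
    dump = ast.dump(_Normalizer().visit(clone), include_attributes=False)
    return hashlib.sha1(dump.encode()).hexdigest()

def find_structural_duplicates(root: str = "."):
    groups = defaultdict(list)
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text())
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            # Skip trivial functions; they collide by accident.
            if isinstance(node, ast.FunctionDef) and len(node.body) > 3:
                groups[_structural_hash(node)].append(f"{path}:{node.lineno} {node.name}")
    return [locs for locs in groups.values() if len(locs) > 1]

if __name__ == "__main__":
    for locations in find_structural_duplicates():
        print("possible duplicate implementations:")
        for loc in locations:
            print("  ", loc)
```

This only sees what you point it at, so running it across repositories (or at least across the services a PR touches) matters more than the hashing details.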
One approach is scanning for duplication first, then deep-diving on logic. The duplication pass is quick because automated analysis handles most of it. The logic review is where you spend cognitive budget, and the cognitive constraints from earlier in this series still apply: <400 LOC, <500 LOC/hour, <90 min sessions. When you find duplication across services, that’s often a signal to extract shared libraries or rethink boundaries. But in the moment of review, flagging it is enough.
What Automation Should Check
The CI division of labor assumed human-written code with predictable defect types. Syntax errors, style violations, test coverage—automation handled the mechanical verification while humans focused on design and architecture. That division needs updating for AI-generated code. You want automation checking for security patterns AI commonly gets wrong (input validation, authentication logic, dependency vulnerabilities) and duplication patterns that span files. Some newer static analysis tools can auto-detect AI-generated code and apply specialized rules. Others integrate directly with code generation tools to flag common failure modes as code is written.
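As a sketch of what a specialized rule can look like, here is a small AST visitor that flags two of the patterns discussed above. The heuristics are deliberately crude and the class name is made up; the point is that high-frequency patterns are cheap to automate once you know you keep catching them by hand:

```python
import ast
import sys
from pathlib import Path

class AIPatternChecker(ast.NodeVisitor):
    def __init__(self, filename: str):
        self.filename = filename
        self.findings: list[str] = []

    def visit_Call(self, node: ast.Call):
        # Flag calls like cursor.execute(f"... {value} ...") or execute("..." + value):
        # the query text is being built at runtime instead of parameterized.
        if isinstance(node.func, ast.Attribute) and node.func.attr == "execute":
            if node.args and isinstance(node.args[0], (ast.JoinedStr, ast.BinOp)):
                self.findings.append(
                    f"{self.filename}:{node.lineno} dynamic SQL passed to execute()"
                )
        self.generic_visit(node)

    def visit_ExceptHandler(self, node: ast.ExceptHandler):
        # Flag bare `except:` and `except Exception:` handlers so a human can
        # decide whether the breadth is justified.
        too_broad = node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id == "Exception"
        )
        if too_broad:
            self.findings.append(
                f"{self.filename}:{node.lineno} overly broad exception handler"
            )
        self.generic_visit(node)

def main(paths: list[str]) -> int:
    findings: list[str] = []
    for path in paths:
        checker = AIPatternChecker(path)
        checker.visit(ast.parse(Path(path).read_text()))
        findings.extend(checker.findings)
    print("\n".join(findings))
    return 1 if findings else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```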
The tools catch mechanical issues like missing null checks, deprecated methods, and inconsistent styles. Human review should focus on what requires context like architectural fit, whether the code solves the right problem, security assumptions specific to your system’s threat model. Automation can tell you a SQL query is vulnerable to injection. It can’t tell you whether the query should exist at all or if the feature belongs in a different service.
Some organizations have raised test coverage requirements from 70% to 85% for AI-assisted code. Others require manual implementation before AI assistance for complex features, or pair juniors who accept >50% of AI suggestions with seniors who can explain why a suggestion should be rejected. The specific practices matter less than the underlying principle: when defect patterns change, the checks need to change too. Track what you’re catching manually, automate the high-frequency patterns, prune checks that aren’t finding issues anymore.
What This Means for Review Process
The review bottleneck is getting worse because AI increases both arrival rate (more PRs) and review time (26% longer per Harness survey). But the constraint isn’t just volume, it’s that you’re looking for different defects.
Your CI pipeline absorbed syntax errors and style violations so humans could focus on design and logic. That division of labor assumed human-written code. AI code needs security scanning and duplication detection upfront, and human review focused on architectural drift and edge cases.
Track what you catch manually. When a defect type starts appearing frequently in AI-generated PRs, automate the check. Update your CI pipeline’s division of labor. This is the feedback loop at work—the same load-shedding protocol, adjusted for a different defect distribution.
The patterns here address failure modes that keep coming up, but knowing what to check doesn’t solve the capacity problem. If reviews take 26% longer per PR while PR volume increases, and each review needs to check for different defects than CI catches, then even perfect checklists hit cognitive constraints.
The question is how to organize the checking. Do you scan for duplication in one pass, then security patterns in another? Do you batch reviews into scheduled blocks, or handle them as they arrive? Do you prioritize avoiding the cost of context switching, or reducing response time? And when do you decide a PR is too large to review effectively, given that AI can generate 500 lines that look reasonable but violate architectural assumptions in subtle ways?
These aren’t checklist problems. They’re architectural ones. The defect profile is shifting, the review division of labor needs updating, but the process architecture hasn’t changed. That’s the mismatch worth exploring.