AI-generated code has a trust problem. Not because it is always wrong - it is right often enough to be dangerous. The failure mode is not obvious errors that fail to compile. It is subtle mistakes that pass tests, look reasonable in review, and break in production three weeks later. Reviewing AI output effectively requires understanding what AI gets wrong consistently and building a review process that catches those specific failure patterns.
## The Hallucination Taxonomy
AI code hallucinations fall into predictable categories. Knowing them turns review from “read every line carefully” into “check for these specific patterns.”
### Fabricated APIs
The most common hallucination is calling functions, methods, or API endpoints that do not exist. The AI has seen similar APIs in training data and confidently generates calls to a plausible but fictional interface:
```python
# AI-generated - looks reasonable, doesn't exist
from fastapi.security import OAuth2PasswordBearerWithScopes  # fabricated class

# What actually exists
from fastapi.security import OAuth2PasswordBearer, SecurityScopes
```
This category is trivially caught by type checkers and linters. If the project has `mypy --strict` or equivalent configured, fabricated imports fail immediately. The review action is not manual checking - it is verifying that static analysis ran and passed.
### Wrong Import Paths
Related but distinct from fabricated APIs: the AI imports a real function from the wrong module. This happens frequently with libraries that have been reorganized across versions:
```python
# AI generates the old import path
from sklearn.cross_validation import train_test_split  # removed in sklearn 0.22

# Current correct path
from sklearn.model_selection import train_test_split
```
This is version-specific knowledge that language models handle poorly. The fix is pinning library versions in project context and, more importantly, running the code against the actual installed dependencies.
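This check can also run as a script. Below is a minimal sketch using only the standard library; `resolves` and `unresolved_imports` are hypothetical helper names, and note that resolving a dotted path like `sklearn.cross_validation` imports the parent package as a side effect:

```python
import ast
import importlib.util

def resolves(module: str) -> bool:
    """True if `module` (dotted paths allowed) exists in the current environment."""
    try:
        return importlib.util.find_spec(module) is not None
    except (ImportError, ValueError):
        # Missing parent package, or a module whose spec is unavailable
        return False

def unresolved_imports(source: str) -> list[str]:
    """Scan Python source and return imported module paths that don't resolve."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        missing.extend(name for name in names if not resolves(name))
    return missing
```

Because the check runs against the installed dependencies, it catches exactly the version-drift mistakes the model makes: the import path that was valid two releases ago resolves to nothing today.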
### Subtle Logic Bugs
The hardest category. The code compiles, passes basic tests, and looks correct on casual reading - but contains a logic error in an edge case:
```python
# AI-generated pagination
def get_page(items: list, page: int, page_size: int) -> list:
    start = page * page_size
    end = start + page_size
    return items[start:end]

# Bug: page is 1-indexed in the API contract but 0-indexed here
# Page 1 returns items 10-19 instead of 0-9 when page_size=10
```
These bugs survive linting and type checking. They require either thorough tests with edge cases or domain-aware review. The AI often gets the happy path right and the boundary conditions wrong.
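A boundary-focused test makes the indexing mistake fail on the first run. A minimal sketch, assuming the 1-indexed contract described above - the corrected `get_page` here is illustrative, not the original author's fix:

```python
def get_page(items: list, page: int, page_size: int) -> list:
    """Paginate `items`; `page` is 1-indexed per the (assumed) API contract."""
    if page < 1:
        raise ValueError(f"page must be >= 1, got {page}")
    start = (page - 1) * page_size
    return items[start:start + page_size]

# Boundary tests - the buggy version returned items 10-19 for page 1
items = list(range(20))
assert get_page(items, 1, 10) == list(range(10))      # first page starts at 0
assert get_page(items, 2, 10) == list(range(10, 20))  # second page
assert get_page(items, 3, 10) == []                   # past the end: empty, not an error
assert get_page([], 1, 10) == []                      # empty input
```

The happy-path test (`page=2` returns the middle of the list) passes against both versions; only the `page=1` boundary assertion distinguishes them.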
## The Automated Defense Layer
The first line of defense is not human review - it is automated tooling. Every check that can be automated should be automated, because human attention is expensive and inconsistent.
### Minimum Viable CI for AI-Generated Code
```yaml
# .github/workflows/ai-code-checks.yml
steps:
  - name: Type check
    run: mypy src/ --strict
  - name: Lint
    run: ruff check src/ tests/
  - name: Security scan
    run: bandit -r src/ -ll
  - name: Dependency audit
    run: pip-audit
  - name: Tests with coverage
    run: pytest tests/ --cov=src --cov-fail-under=80
  - name: Import verification
    run: python -c "from src.main import app"  # smoke test imports
```
This pipeline catches fabricated APIs (type check fails), wrong imports (import verification fails), known vulnerability patterns (bandit), and basic logic errors (tests). It runs in under two minutes and catches roughly 70% of AI-generated bugs before a human sees the code.
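The one-line import smoke test only exercises `src.main`. A slightly broader sketch - the `import_all` helper is hypothetical, and it assumes the code lives in an importable package - walks the whole tree so a fabricated import anywhere fails the build:

```python
import importlib
import pkgutil

def import_all(package_name: str) -> list[str]:
    """Import a package and every submodule; return the names that failed.

    Illustrative CI usage: fail the build when import_all("src") is non-empty.
    """
    package = importlib.import_module(package_name)
    failures = []
    for info in pkgutil.walk_packages(package.__path__, prefix=package_name + "."):
        try:
            importlib.import_module(info.name)
        except Exception as exc:  # any import-time error counts as a failure
            failures.append(f"{info.name}: {exc}")
    return failures
```

This imports every module, so import-time side effects run; keep module top levels side-effect-free or restrict the walk to the packages where that holds.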
### Hooks for Inline Validation
For teams using AI coding tools directly, hooks provide the same validation in real-time:
```json
{
  "hooks": {
    "afterEdit": [
      { "command": "ruff check --fix ${file}", "on_fail": "warn" },
      { "command": "mypy ${file} --strict", "on_fail": "block" }
    ],
    "afterCommit": [
      { "command": "pytest tests/ -x --timeout=60", "on_fail": "warn" }
    ]
  }
}
```
The type checker blocks on failure - fabricated APIs do not make it into the codebase. The linter auto-fixes and warns. Tests run after commit and surface logic issues early.
## What AI Gets Wrong Consistently
Beyond hallucinations, AI-generated code has systematic weaknesses. These are not bugs per se - the code works - but they represent engineering quality gaps that compound over time.
### Error Handling
AI defaults to broad exception handling. It catches Exception when it should catch ValueError. It swallows errors when it should propagate them. It logs errors when it should raise them:
```python
# Typical AI-generated error handling
try:
    result = external_api.call(payload)
except Exception:
    logger.error("API call failed")
    return None  # caller has no idea what went wrong

# What production code needs
try:
    result = external_api.call(payload)
except ConnectionTimeout:
    raise ServiceUnavailableError("External API timeout") from None
except ValidationError as e:
    raise BadRequestError(f"Invalid payload: {e}") from e
```
Review action: every `except Exception` and every bare `except` in AI-generated code gets flagged. Every `return None` after an error gets questioned.
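The flagging step itself can be automated. A minimal sketch using the stdlib `ast` module - the function name is invented, and ruff's flake8-blind-except rules cover similar ground:

```python
import ast

def broad_exception_handlers(source: str) -> list[int]:
    """Line numbers of bare `except:` and `except Exception/BaseException:` blocks."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:  # bare except
            flagged.append(node.lineno)
        elif isinstance(node.type, ast.Name) and node.type.id in {"Exception", "BaseException"}:
            flagged.append(node.lineno)
    return flagged
```

Running this over a diff turns "read every error handler" into "read the handlers the script flagged."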
### Edge Cases
AI optimizes for the common case. Off-by-one errors, empty collections, null values, concurrent access, Unicode handling - these are consistently underhandled:
| Edge Case | AI Tendency | Production Requirement |
|---|---|---|
| Empty input | Proceed and return empty result | Validate and return 400 or raise |
| Null/None values | Skip null checks | Explicit null handling with typed optionals |
| Concurrent writes | Ignore concurrency | Optimistic locking or transactions |
| Large inputs | No limits | Pagination, streaming, or size limits |
| Unicode | Assume ASCII | Explicit encoding, normalization |
| Time zones | Use naive datetime | Always timezone-aware, store UTC |
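The last row of the table as code - a minimal sketch of the timezone pattern, with illustrative helper names:

```python
from datetime import datetime, timezone

def utc_now() -> datetime:
    """Timezone-aware 'now'; datetime.now() with no argument returns a naive value."""
    return datetime.now(timezone.utc)

def ensure_utc(dt: datetime) -> datetime:
    """Normalize to UTC; reject naive datetimes instead of guessing their zone."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: caller must attach a timezone")
    return dt.astimezone(timezone.utc)
```

The same shape applies to the other rows: validate at the boundary, make the assumption explicit, and fail loudly rather than silently proceeding.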
### Security
AI-generated code is not adversarial-minded. It builds for the happy path and leaves security gaps that a human attacker would find immediately:
```python
# AI-generated query - looks fine, is SQL injectable
query = f"SELECT * FROM users WHERE name = '{user_input}'"

# AI-generated file handler - path traversal vulnerable
file_path = os.path.join(UPLOAD_DIR, request.filename)

# AI-generated auth check - timing attack vulnerable
if provided_token == stored_token:
    return True
```
Security review of AI output requires a specific checklist. Any user input flowing into queries, file paths, shell commands, or HTML needs manual verification that it is properly sanitized.
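For reference, hardened counterparts to the three snippets above - a sketch, with `UPLOAD_DIR` and the helper names assumed for illustration:

```python
import hmac
import sqlite3
from pathlib import Path

UPLOAD_DIR = Path("/srv/uploads")  # illustrative upload root

# SQL: parameterized query instead of string formatting
def find_user(conn: sqlite3.Connection, user_input: str) -> list:
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (user_input,)
    ).fetchall()

# Files: resolve the path and confirm it stays inside the upload root
def safe_upload_path(filename: str) -> Path:
    candidate = (UPLOAD_DIR / filename).resolve()
    if not candidate.is_relative_to(UPLOAD_DIR.resolve()):  # Python 3.9+
        raise ValueError(f"path escapes upload dir: {filename}")
    return candidate

# Tokens: constant-time comparison instead of ==
def token_matches(provided: str, stored: str) -> bool:
    return hmac.compare_digest(provided.encode(), stored.encode())
```

None of these fixes are exotic; the point is that the AI reliably omits them, so the reviewer has to reliably check for them.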
## The 30-Second Review Checklist
For every piece of AI-generated code, this checklist catches the highest-impact issues in the least time:
1. **Imports exist?** - Did static analysis pass? If not configured, scan imports manually.
2. **Error paths?** - What happens when this fails? Is the failure mode acceptable?
3. **Edge cases?** - Empty input, null values, large input, concurrent access.
4. **Security?** - Does user input reach a sink (DB, file system, shell, HTML) without sanitization?
5. **Tests?** - Do tests cover the error paths, not just the happy path?
6. **Architecture fit?** - Does this follow existing patterns or introduce a new one?
Items 1 and 5 should be automated. Items 2-4 are the core of human review. Item 6 requires project context that only someone familiar with the codebase can evaluate.
## The Review Mindset Shift
Reviewing AI-generated code is not the same as reviewing human-written code. Human code review assumes the author understood the problem and checks for mistakes in their solution. AI code review assumes the generator pattern-matched against training data and checks for gaps between the pattern and the actual problem.
This means spending less time on “is this readable” and more time on “is this correct at the boundaries.” AI-generated code is almost always readable - it produces clean variable names, consistent formatting, and well-structured functions. The readability is a trap. It makes the code look more correct than it is.
The effective reviewer of AI-generated code is not the person who reads every line. It is the person who knows which lines to read - the error handlers, the boundary conditions, the security-sensitive paths - and has automated everything else. That combination of targeted human judgment and automated validation is what makes AI-generated code production-ready without turning code review into a bottleneck.