Eight Grep Patterns That Catch Bugs in AI-Assisted Code

TL;DR

Audit patterns are deterministic grep shapes that find bugs surviving across AI coding sessions. They exist because handoffs between AI sessions, and especially between AI model versions, shed context predictably. The patterns below ship as the /audit-patterns Claude Code skill and have been applied across nine months of FOIL Engineering’s projects. The first audit run with them surfaced sixteen findings, three of them production-active criticals, on a budget app that had passed multiple human code reviews.

Run them before starting implementation work on any AI-assisted project. Run them again before declaring a feature ship-ready. Total runtime: under two minutes per project.

Why these eight, and why grep

Every handoff sheds context. Handoffs between humans shed context. Handoffs between AI sessions shed more. Handoffs between model versions shed even more — each generation re-derives a slightly different mental model from the same HANDOFF docs and CHANGELOG entries. The eight patterns below are the shapes dropped context takes when it calcifies into bugs. They are not failures of any individual model; they are predictable consequences of any session that takes a handoff at face value and builds on it.

The remedy is not better handoffs. The remedy is cheap, deterministic re-proof at session start. Grep is the right tool because it doesn’t need to understand your code — it just needs to find the surface area where AI handoffs typically drop context, so a human (or another AI) can verify each instance.

The eight patterns

1. Speculative stubs go stale

Shape: Type definitions, API clients, or helper functions written “for the next step” become silently wrong when the real implementation ships with a different shape.

Canonical example: A ForgotPasswordRequest.username field declared months before the backend shipped. The backend landed with {identifier} (enumeration-resistant — different field, different shape). The stub was never updated; a TODO advertised the mismatch openly. The next session would have wired UI to the wrong fields.

Grep:

grep -rn "TODO\|FIXME\|HACK\|STUB\|not wired\|for later\|Implement me\|NotImplementedError" \
  --include="*.ts" --include="*.tsx" --include="*.vue" --include="*.py" --include="*.js" .

For each hit: check git blame (when written?), check whether the referenced work has shipped since, check whether the stub still matches reality.
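The "does the stub still match reality" check can be sketched as a field comparison. This is a minimal Python rendering of the canonical example, not the project's actual code: the dataclass, the helper stub_mismatch, and the sample payload value are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class ForgotPasswordRequest:
    # Stub written before the backend shipped (hypothetical Python rendering).
    username: str


def stub_mismatch(stub_cls, real_payload: dict) -> dict:
    """Compare a stub's declared fields with the keys the shipped API uses."""
    declared = {f.name for f in fields(stub_cls)}
    actual = set(real_payload)
    return {
        "missing_from_stub": sorted(actual - declared),
        "stale_in_stub": sorted(declared - actual),
    }


# Assumed sample of the shape the backend actually accepts.
shipped_payload = {"identifier": "user@example.com"}
print(stub_mismatch(ForgotPasswordRequest, shipped_payload))
```

Any non-empty list in the result means the stub and the shipped contract have diverged and the TODO is overdue.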

2. Name/behavior drift

Shape: A function or variable name implies one semantic; the body or return does another. Two sessions had different mental models; the disagreement calcified into a bug.

Canonical example: _prior_days_spending = max(0, _total_cycle - _today_spend). The variable name implies a non-negative quantity. But get_transactions_sum_for_period returns a signed net (income stored as negative). When income lands mid-cycle and pushes the net below zero, the max(0, …) clamp silently discards it, so the daily budget fails to rise. Mid-cycle overstatement reaches 600% by day 13 of 14 in the most-cited example.

Grep:

grep -rn "max(0,\|min(0,\|abs(\|if.*<.*0\|\.get([^,]*,\s*0\|\.get([^,]*,\s*None" \
  --include="*.py" --include="*.ts" .

For every state-math function: read the name aloud, then read the body. Do they match? Exercise with negative input, zero input, and max-representable input.
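A small sketch of that exercise, using the clamp from the canonical example. The function names and the figures are assumptions chosen for illustration; only the max(0, …) expression comes from the example above.

```python
def prior_days_spending_clamped(total_cycle: float, today_spend: float) -> float:
    """Mirrors the clamped expression: assumes the difference is never negative."""
    return max(0, total_cycle - today_spend)


def prior_days_net(total_cycle: float, today_spend: float) -> float:
    """Signed version: a negative result means net income earlier in the cycle."""
    return total_cycle - today_spend


# Assumed figures: the cycle nets to -200 (income exceeds spending), 30 spent today.
print(prior_days_spending_clamped(-200.0, 30.0))  # 0
print(prior_days_net(-200.0, 30.0))               # -230.0

# With ordinary positive spending the two versions agree.
print(prior_days_spending_clamped(100.0, 30.0))   # 70.0
print(prior_days_net(100.0, 30.0))                # 70.0
```

The two versions only diverge on negative input, which is why a test suite that never feeds income into the period sum will not see the bug.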

3. Silent failure / discarded returns

Shape: A function reports failure via return value or logs, but the caller doesn’t check — and the user sees a success response.

Canonical example: send_reset_email() logs an error and returns False if the API key is missing. The caller discards the return. The user sees “If that account exists, a reset link has been sent” while no email was ever sent.

Grep:

# Python: every -> bool function is a "did you check it" question
# (-e stops grep from parsing the leading "-" of the pattern as an option)
grep -rn -e "-> bool" . --include="*.py"

# Swallowed exceptions (-A1 also shows a pass sitting on the line after except)
grep -rn -A1 "except.*:" . --include="*.py" | grep -B1 "pass\s*$"

# Dict defaults on external-boundary data
grep -rn "\.get(" . --include="*.py" --include="*.ts" --include="*.js"

# Unchecked promise rejections: empty .catch handlers (-E for extended regex)
grep -rnE "\.catch\(\s*\(?\s*\w*\s*\)?\s*=>\s*\{\s*\}" . --include="*.ts" --include="*.js"

For each -> bool: trace callers, verify returns are checked. For each .get(…, default) on data crossing a boundary: confirm the default is a valid value for the semantic, not a guess for missing data.
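Tracing callers by hand gets slow on larger files. One possible shortcut, sketched here with Python's standard ast module, flags calls to -> bool functions whose result is thrown away. It is deliberately narrow: it only sees plain-name calls to functions defined in the same source text, and the sample handlers are hypothetical.

```python
import ast


def discarded_bool_calls(source: str) -> list[tuple[int, str]]:
    """Find bare-statement calls to functions annotated `-> bool` in `source`.

    Limitation: handles foo(...) only, not methods, imports, or awaited calls.
    """
    tree = ast.parse(source)
    bool_funcs = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and isinstance(node.returns, ast.Name)
        and node.returns.id == "bool"
    }
    hits = []
    for node in ast.walk(tree):
        # An ast.Expr wrapping a Call is a call whose return value is discarded.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            if isinstance(func, ast.Name) and func.id in bool_funcs:
                hits.append((node.lineno, func.id))
    return sorted(hits)


# Hypothetical handlers modeled on the canonical example.
sample = '''\
def send_reset_email(address: str) -> bool:
    return False

def forgot_password(address: str) -> str:
    send_reset_email(address)
    return "If that account exists, a reset link has been sent"

def forgot_password_checked(address: str) -> str:
    if not send_reset_email(address):
        raise RuntimeError("reset email was not sent")
    return "If that account exists, a reset link has been sent"
'''

print(discarded_bool_calls(sample))  # [(5, 'send_reset_email')]
```

The checked variant is not reported because its call sits inside an if condition rather than standing alone as a statement.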

4. Defense at only one layer

Shape: A security or validation rule is enforced client-side OR server-side, but not both. Or a guard is applied to obvious endpoints but missed on less-obvious sibling endpoints.

Canonical example: Frontend modal enforces upper + lower + digit + 8-char passwords. One backend endpoint allows min_length=6 with no complexity; another allows min_length=8 with no complexity. curl with "aaaaaa" registers fine. UI looks safe; the API isn’t.

Grep:

grep -rln "validate\|validator" frontend/src/ 2>/dev/null
grep -rli "validate\|validator\|pydantic" api/ backend/ 2>/dev/null

For every client-side rule: ask, if an attacker bypassed my UI with curl, what stops them server-side? For every rate limit: ask, what field is this keyed on, and what happens if the attacker rotates that field?
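As an illustration of the server-side half, here is a minimal sketch of a check that mirrors the frontend rule from the canonical example (upper, lower, digit, eight characters). The function name and messages are hypothetical; a real backend would wire this into whatever validation layer it already uses.

```python
import re

MIN_LENGTH = 8  # taken from the frontend rule in the canonical example


def password_problems(password: str) -> list[str]:
    """Return every rule the password breaks; an empty list means it passes."""
    problems = []
    if len(password) < MIN_LENGTH:
        problems.append(f"shorter than {MIN_LENGTH} characters")
    if not re.search(r"[a-z]", password):
        problems.append("no lowercase letter")
    if not re.search(r"[A-Z]", password):
        problems.append("no uppercase letter")
    if not re.search(r"[0-9]", password):
        problems.append("no digit")
    return problems


print(password_problems("aaaaaa"))    # the curl payload from the example
print(password_problems("Abcdefg1"))  # []
```

Calling the same function from every endpoint that accepts a password keeps the sibling endpoints from drifting apart.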

5. Auth guard right-but-unapplied

Shape: The correct permission dependency exists in the codebase. Some endpoints silently use the less-strict version — by copy-paste from an older handler, or by oversight.

Canonical example: get_admin_user_id is correctly applied to /api/admin/metrics, /trends, /health. The backup endpoints at three other lines use get_verified_user_id — any authenticated user can list and download all database backups. Combined with partial encryption (settings encrypted, expenses + transactions + bcrypt hashes plaintext), this is a full data leak surface.

Grep:

grep -rn "^@\(app\|router\)\." api/         # every route (assumes instances named app or router)
grep -rn "Depends(" api/                    # every dep
# Build a route → dep table. Scan for outliers.

For admin endpoints: does every admin route use the admin dep? For data-modifying endpoints: does every one require auth?
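The route → dep table can also be built mechanically. The sketch below parses handler source with Python's ast module and pairs each route decorator with the Depends(...) names in the handler's parameter defaults. It assumes that simple style only: Annotated[...] dependencies and dependencies=[...] on the decorator are not handled, and the /api/admin/backups and /api/public/status paths are hypothetical.

```python
import ast

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def route_dep_table(source: str) -> list[tuple[str, str, list[str]]]:
    """Return (METHOD, path, [dependency names]) for each decorated handler."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for dec in node.decorator_list:
            is_route = (
                isinstance(dec, ast.Call)
                and isinstance(dec.func, ast.Attribute)
                and dec.func.attr in HTTP_METHODS
                and dec.args
                and isinstance(dec.args[0], ast.Constant)
            )
            if not is_route:
                continue
            # Collect Depends(<name>) from positional and keyword-only defaults.
            defaults = node.args.defaults + [
                d for d in node.args.kw_defaults if d is not None
            ]
            deps = [
                d.args[0].id
                for d in defaults
                if isinstance(d, ast.Call)
                and isinstance(d.func, ast.Name)
                and d.func.id == "Depends"
                and d.args
                and isinstance(d.args[0], ast.Name)
            ]
            rows.append((dec.func.attr.upper(), dec.args[0].value, deps))
    return sorted(rows, key=lambda row: row[1])


sample = '''\
@app.get("/api/admin/metrics")
def metrics(user_id: int = Depends(get_admin_user_id)):
    ...

@app.get("/api/admin/backups")
def list_backups(user_id: int = Depends(get_verified_user_id)):
    ...

@app.get("/api/public/status")
def status():
    ...
'''

table = route_dep_table(sample)
outliers = [
    path for _, path, deps in table
    if "/admin/" in path and "get_admin_user_id" not in deps
]
print(outliers)  # ['/api/admin/backups']
```

Because the source is only parsed, never imported, this runs without the web framework installed.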

6. Platform-layer truth ≠ framework-layer truth

Shape: A framework API returns or accepts something that means one thing in isolation but a different thing on the specific deployment platform. The assumption is framework-level; the reality is platform-level.

Canonical examples:

Primitive	Framework meaning	Platform meaning
`request.client.host`	Real client IP	Fly.io proxy IP (172.x.x.x) — real IP is in `Fly-Client-IP` header
`window.location.href = '/#/login'`	Hash navigation	`createWebHistory` treats `#` as fragment and ignores it
Chrome MV3 service worker	Long-lived process	Killed and restarted; state doesn’t persist across wakeups
`Date.now()` on Cloudflare Workers	Wall clock	Frozen while code executes; advances only after I/O, so in-request timing can read 0 ms
Vercel Edge vs Serverless	“It’s all just functions”	Different headers, different env scopes, different timeouts

Grep:

grep -rn "request\.client\|request\.headers\|window\.location\|process\.env\|import\.meta\.env\|chrome\.runtime\|chrome\.storage" .

For each hit, read the platform’s docs for the true meaning. Flag anything where the framework-level assumption doesn’t hold.
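For the first row of the table, a platform-aware lookup might look like the sketch below. The function name, the lower-cased header keys, and the behind_fly_proxy flag are assumptions of this example; the only detail taken from the table is that the real address arrives in the Fly-Client-IP header.

```python
def client_ip(headers: dict[str, str], peer_host: str, behind_fly_proxy: bool) -> str:
    """Pick the address to log or rate-limit on.

    headers: request headers with lower-cased keys (an assumption of this sketch).
    peer_host: what the framework reports, e.g. request.client.host.
    """
    if behind_fly_proxy:
        forwarded = headers.get("fly-client-ip", "").strip()
        if forwarded:
            return forwarded
    # Outside the proxy the header is client-controlled, so it is ignored.
    return peer_host


print(client_ip({"fly-client-ip": "203.0.113.7"}, "172.16.0.2", True))     # 203.0.113.7
print(client_ip({"fly-client-ip": "203.0.113.7"}, "198.51.100.4", False))  # 198.51.100.4
print(client_ip({}, "172.16.0.2", True))                                   # 172.16.0.2
```

Making the deployment assumption an explicit parameter keeps it visible to the next session instead of buried in a header lookup.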

7. Tests claimed but not collected

Shape: A HANDOFF or commit message claims “N tests cover X.” The test runner actually collects fewer — or zero — because tests live in the wrong directory, the collector is misconfigured, or test imports break silently during discovery.

Canonical example: HANDOFF claimed “11 integration tests for the password-reset flow.” python -m pytest api/ --collect-only -q returned no tests collected in 0.02s — tests live under tests/, not api/. Frontend Vitest reports 91 tests pass but 7 of 11 test suites fail to load with ERR_UNKNOWN_FILE_EXTENSION ".css" — entire component suites are dark, but the summary line looks clean.

The principle: the tests we have ≠ the tests that run. Trust the collector, not the docs.

Grep:

# Python
python -m pytest --collect-only -q 2>&1 | tail -5
python -m pytest --collect-only -q 2>&1 | grep -c "::"

# JS/TS
timeout 60 npx vitest --run --reporter=verbose 2>&1 | grep -E "Test Files|passed|failed"

Run once per project at session-start. Re-run any time pyproject.toml, pytest.ini, vitest.config.ts, or similar changes.
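Comparing the claim with the collector can be scripted. This sketch counts node IDs the same way the grep -c "::" line does and reports the gap; the helper names and the sample node IDs are hypothetical, while the "no tests collected" line is the one quoted in the canonical example.

```python
def count_collected(collect_output: str) -> int:
    """Count test node IDs (lines containing '::') in `pytest --collect-only -q` output."""
    return sum(1 for line in collect_output.splitlines() if "::" in line)


def coverage_claim_gap(claimed: int, collect_output: str) -> int:
    """Positive result = tests the docs claim but the runner never collected."""
    return claimed - count_collected(collect_output)


# Output from the canonical example: nothing collected under api/.
empty_run = "no tests collected in 0.02s\n"

# Hypothetical output from a run that does find some tests.
partial_run = """\
tests/test_reset.py::test_request_link
tests/test_reset.py::test_expired_token

2 tests collected in 0.03s
"""

print(coverage_claim_gap(11, empty_run))    # 11
print(coverage_claim_gap(11, partial_run))  # 9
```

A gap of zero does not prove the tests are good, only that the number in the handoff matches what the runner sees.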

8. Comments that rationalize anti-patterns

Shape: A code comment explains why something is “fine” or “intentional.” The explanation is actually the bug report.

Canonical example: today = datetime.now().date() # Use date() to strip time and timezone. The comment literally describes why fixed-pool budget mode is timezone-fragile for non-UTC users. The rationalization is the bug.

Related shapes: # quick hack until we have Redis (still no Redis, still a hack), // works on Chrome, TODO: test Safari (shipped before Safari test), # for the demo (demo was six months ago).

Grep:

grep -rin "for now\|temporary\|stripped\|simplified\|hack until\|TODO: proper\|mock for now\|quick fix\|works on \w\+,\? todo\|for the demo\|will fix\|band-\?aid" .

Each hit is a latent ticket. Sort by git blame age — older ones are more likely to be load-bearing bugs nobody noticed.
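To make the canonical example concrete, here is a sketch of how stripping the timezone changes the answer to "what day is it". The helper local_today and the fixed UTC-8 offset are assumptions for illustration; in practice the zone would more likely be a zoneinfo.ZoneInfo taken from the user's settings.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo


def local_today(now_utc: datetime, user_tz: tzinfo) -> date:
    """Calendar date in the user's zone for a given timezone-aware instant."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    return now_utc.astimezone(user_tz).date()


# 03:00 UTC on 1 January is still the evening of 31 December at UTC-8.
instant = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
utc_minus_8 = timezone(timedelta(hours=-8))

print(instant.date())                     # 2024-01-01
print(local_today(instant, utc_minus_8))  # 2023-12-31
```

Any per-day budget keyed on the first value rolls over hours before the user's own midnight.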

Triage: how to rank findings

Not every finding is equal weight. Rank by:

Blast radius — financial app > marketing site > internal tool > POC
Exploitability vs likelihood — a critical auth hole on a beta with ten users is still critical; low probability, high impact still ships
Reversibility — data-leak bugs are worse than display bugs; you can’t un-leak
User-visible math — wrong numbers destroy trust faster than anything else in a financial or analytical app
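One way to apply these criteria mechanically is a sort key. The sketch below is an assumption-heavy illustration: the Finding fields, the sample findings, and the choice to compare blast radius first, then reversibility, then user-visible math are all hypothetical, and exploitability is left to human judgment.

```python
from dataclasses import dataclass

# Order taken from the blast-radius line above; lower index = larger blast radius.
BLAST_RADIUS = ["financial app", "marketing site", "internal tool", "POC"]


@dataclass
class Finding:
    title: str
    project_kind: str        # one of BLAST_RADIUS
    irreversible: bool       # e.g. a data leak
    user_visible_math: bool  # wrong numbers shown to the user


def triage_key(finding: Finding) -> tuple[int, bool, bool]:
    # False sorts before True, so irreversible and math findings come first.
    return (
        BLAST_RADIUS.index(finding.project_kind),
        not finding.irreversible,
        not finding.user_visible_math,
    )


findings = [
    Finding("stale TODO on landing page", "marketing site", False, False),
    Finding("backup endpoint missing admin guard", "financial app", True, False),
    Finding("daily budget clamp hides income", "financial app", False, True),
]

ranked = [f.title for f in sorted(findings, key=triage_key)]
print(ranked)
```

The output order is a starting point for discussion, not a verdict; a reviewer can still promote a finding by hand.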

Patterns 1, 2, 3, 6, and 8 apply universally to any codebase touched by more than one session. Patterns 4 and 5 apply where there’s a trust boundary. Pattern 7 applies anywhere that claims test coverage.

When this catalog is wrong

The catalog is field-tested on Python, TypeScript, Vue, FastAPI, and Astro codebases ranging from 500 to 50,000 lines. The grep commands are tuned for those languages and frameworks. The shapes generalize beyond them — name/behavior drift exists in any language — but the specific patterns require translation.

The catalog also assumes the developer (or AI agent) reading the findings has authority to fix or escalate. If you’re auditing a codebase you can’t change, the patterns still find bugs; the actionability differs.

FAQ

How long does the full audit take? Under two minutes for the eight greps to run, then five to ten minutes to read the findings and rank them by triage. Most findings clear in seconds; the real ones take longer.

Does this replace human code review? No. It catches a specific class of bug — context dropped during handoffs — that human review tends to miss because the code looks fine in isolation. Use it as a pre-filter before review, not a replacement.

Can I run this in CI? Yes, but interpret the output as advisory, not blocking. Most findings are false positives or accepted technical debt. The signal is in the increase in findings between commits, not the absolute count.
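A sketch of that "increase between commits" signal, assuming the CI job stores per-pattern hit counts from the previous run. The function name and the pattern labels are hypothetical.

```python
def new_findings(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Return per-pattern increases in hit count; steady or shrinking patterns are omitted."""
    return {
        pattern: count - previous.get(pattern, 0)  # 0 is a valid count for an unseen pattern
        for pattern, count in current.items()
        if count > previous.get(pattern, 0)
    }


previous_run = {"stale-stubs": 12, "swallowed-exceptions": 3}
current_run = {"stale-stubs": 12, "swallowed-exceptions": 5, "rationalizing-comments": 1}

print(new_findings(previous_run, current_run))
# {'swallowed-exceptions': 2, 'rationalizing-comments': 1}
```

An empty result lets the job pass quietly; a non-empty one is a prompt to look at the new hits, in keeping with the advisory stance above.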

Why eight patterns? Seven of the eight had no name before the audit that produced them. They’re the shapes that re-emerged across multiple projects (The Number, Resume Tailor, Buyer Mode, Pour Lord). The catalog will grow as new shapes accumulate.

Where does the skill live? The /audit-patterns Claude Code skill is in ~/Dev/.claude/skills/audit-patterns/ of the FOIL Engineering monorepo. The full reference article — with project-specific risk-mapping for each pattern — lives in the institutional memory system that powers our consulting practice.
