Why “NEVER Do X” Fails With LLMs

I discovered something counterintuitive while building Resume Tailor, an AI-powered resume generator: LLMs don’t treat advisory instructions as hard constraints. No matter how emphatically you tell them “NEVER do X,” they’ll occasionally do X anyway.

The fix wasn’t a better prompt. It was a structural pattern I now use everywhere I need an AI to respect boundaries: Per-Entity Isolation. Same pattern, different name in different domains — per-source isolation in RAG, per-tenant isolation in multi-tenant apps, per-job isolation in resume generation. The shape is identical, and so is the outcome.

The Problem

Resume Tailor generates tailored resumes from a user’s career data. It pulls achievements, skills, and job history, then asks an LLM to craft compelling bullet points matched to a target job description.

The prompt clearly stated: “NEVER attribute achievements from one job to another. Each bullet point must only reference accomplishments from that specific role.”

The LLM did it anyway. Achievements from one employer appeared under a different company. A 32% efficiency improvement at Company A showed up in the Company B section.

I tried everything prompt engineers typically try:

Adding “CRITICAL:” and “IMPORTANT:” prefixes
Using caps lock for emphasis
Repeating the instruction multiple times
Adding examples of what NOT to do

None of it worked reliably. The model treated these instructions as suggestions, not constraints. It wasn’t being malicious — it was doing what LLMs do: generating plausible-sounding text based on patterns, without true understanding of my rules.

The Fix: Per-Entity Isolation

The solution was architectural, not prompt-engineered.

Before: one prompt, all data, advisory instructions

# Anti-pattern: relies on the model to "respect" instructions
import json

# `llm` stands in for whatever client wrapper you use
prompt = f"""
Generate resume bullets for each job below.
NEVER attribute achievements from one job to another.

{json.dumps(all_jobs_with_achievements)}
"""
result = llm.complete(prompt)  # misattribution at ~5-10% rate

The model sees every job’s data. Even with a clear instruction, all of the metrics sit in the same context window, so a metric from job A occasionally surfaces in job B’s bullet block.

After: one prompt per entity, then assemble

# Pattern: Per-Entity Isolation makes misattribution structurally impossible
import json

def generate_bullets_for_job(job):
    # The model literally cannot reference data it doesn't have
    prompt = f"Generate resume bullets for this single role:\n{json.dumps(job)}"
    return llm.complete(prompt)

per_job_bullets = [generate_bullets_for_job(j) for j in all_jobs_with_achievements]

# Assembly step uses the LLM only for formatting/flow; each bullet set
# arrives already attributed to its job. Serialize the list explicitly
# rather than interpolating a Python repr into the prompt.
final_resume = llm.complete(
    "Format these per-job bullet sets into a cohesive resume:\n"
    + json.dumps(per_job_bullets, indent=2)
)

There’s a third step in production: a validation pass that cross-checks every claim in the output against the source data. If a metric appears in the wrong section, the validator flags it before the resume ships. Defense in depth.
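Here is a minimal sketch of what such a validation pass might look like. It is not the production Resume Tailor code: the regex, the `extract_metrics` and `validate_attribution` names, and the shape of the input dicts are all assumptions made for illustration. It treats any number, percentage, or dollar amount as a "metric" and flags one when it does not appear in its own job's source data, noting which other job it does appear in.

```python
import json
import re

# Hypothetical metric pattern: bare numbers, percentages, dollar amounts.
METRIC_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")


def extract_metrics(text):
    """Return the set of numeric tokens (e.g. '32%', '$1,200') found in text."""
    return {m.rstrip(".,") for m in METRIC_RE.findall(text)}


def validate_attribution(bullets_by_job, source_by_job):
    """Flag metrics that are missing from their own job's source data.

    bullets_by_job: {job_id: generated bullet text}   (assumed shape)
    source_by_job:  {job_id: JSON-serializable data}  (assumed shape)
    """
    source_metrics = {
        job_id: extract_metrics(json.dumps(source))
        for job_id, source in source_by_job.items()
    }
    flags = []
    for job_id, bullets in bullets_by_job.items():
        for metric in sorted(extract_metrics(bullets)):
            if metric in source_metrics[job_id]:
                continue  # metric is backed by this job's own data
            # Record which other jobs contain the metric, if any.
            found_in = sorted(
                other
                for other, metrics in source_metrics.items()
                if other != job_id and metric in metrics
            )
            flags.append({"job": job_id, "metric": metric, "found_in": found_in})
    return flags


# Illustrative data only.
sources = {
    "company_a": {"achievements": ["Improved efficiency by 32%"]},
    "company_b": {"achievements": ["Led a team of 5 engineers"]},
}
bullets = {
    "company_a": "Improved pipeline efficiency by 32%.",
    "company_b": "Led 5 engineers and improved efficiency by 32%.",
}
flags = validate_attribution(bullets, sources)
```

The check is deterministic code, so it does not depend on the model behaving. A real validator would likely need a richer notion of "claim" than a regex over numbers.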

The results were immediate. Misattribution dropped from a recurring problem to essentially zero. The LLM can’t misattribute data it never sees.

Why This Pattern Has a Name

I started calling this Per-Entity Isolation because the same shape kept showing up in different problem spaces:

Resume generation — per-job isolation; each job’s bullets generated without seeing other jobs
RAG — per-source isolation; generate a response for each retrieved chunk before synthesis, so citations stay attached to the source they actually came from
Multi-tenant chatbots — per-tenant isolation; never co-mingle context across customer accounts in the same prompt
Agent toolchains — per-task isolation; each subagent gets only the inputs its task needs, no shared memory

In every case, the architectural intervention is the same: structurally restrict what the model can see, so the unwanted output becomes impossible to produce. You don’t need the model to respect a boundary. You need it to never know the boundary existed.
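The RAG variant can be sketched the same way. This is an illustration under assumptions, not a specific library's API: `answer_per_source` is a made-up helper, `complete` is any callable that takes a prompt string and returns text, and chunks are assumed to be dicts with `source_id` and `text` keys. The point is that the citation is attached by code, so the model never gets a chance to pick the wrong one.

```python
from typing import Callable


def answer_per_source(
    question: str,
    chunks: list[dict],
    complete: Callable[[str], str],
) -> list[dict]:
    """Run one model call per retrieved chunk and attach the citation in code."""
    answers = []
    for chunk in chunks:
        # Each prompt contains exactly one chunk's text and nothing else.
        prompt = (
            f"Using only the passage below, answer: {question}\n\n"
            f"{chunk['text']}"
        )
        # source_id is copied from the chunk, never generated by the model.
        answers.append(
            {"source_id": chunk["source_id"], "answer": complete(prompt)}
        )
    return answers
```

A synthesis step can then combine the per-source answers while carrying each `source_id` along unchanged.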

The Broader Principle

The deeper lesson from Resume Tailor — and from many AI-augmented projects since — is that architecture beats prompting for any guarantee that matters.

Prompts are statistical influences. They shift output distributions. For high-stakes correctness — financial math, security boundaries, attribution, citations — that’s not enough. The bug rate is low but it’s not zero, and the failure modes are exactly the ones that destroy user trust.

This isn’t an argument against prompt engineering. Good prompts still matter for tone, formatting, edge cases. But for things you need to be structurally impossible, write code that makes them impossible. Don’t ask the model nicely.

I’ve since extended this thinking into a broader audit catalog — eight shapes of bugs that survive AI coding handoffs, each with a deterministic grep that finds them. Per-Entity Isolation is the architectural inverse of one of those patterns (silent failures from discarded returns). Both come from the same root cause: trusting the model to do the right thing instead of arranging the world so the wrong thing can’t happen.

The Takeaway

When building LLM-powered applications, don’t rely on instructions to prevent unwanted behavior. Design systems where the unwanted behavior is structurally impossible.

Prompts are suggestions. Architecture is constraint.

FAQ

Doesn’t this multiply token cost? Modestly. Per-entity isolation issues N small prompts instead of one large one. The total token spend is similar (sometimes lower, since each call has tighter context); what repeats is the shared instruction text, plus the assembly step’s pass over the generated bullets. With prompt caching enabled, the repeated instruction prefix is typically billed at a reduced rate by providers that support it. The reliability gain has been worth the trade in every project I’ve used it on.

When does prompt-only work? For preferences and tone, prompts are fine — “write in plain English,” “use active voice,” “no jargon.” The breakdown happens when you need a correctness guarantee. If you would catch a bug in code review, write code, not a prompt. If you would suggest a style change in code review, a prompt is fine.

How do I validate per-entity isolation in production? Cross-check every claim in the output against the source data the model saw for that entity. In Resume Tailor, this means parsing out every metric in the generated bullets and confirming the metric appears in the source job’s data — and only that job’s. If it appears in a different job’s source, the validator flags it.

What about agent frameworks that share context across calls? Same principle, different scope. If you’re using a framework where multiple agents share a memory store, isolate at the agent boundary. Each agent gets only the slice of memory its task requires. Shared memory is convenient until it leaks.
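As a rough sketch of isolating at the agent boundary, assuming the shared memory store is a plain dict (real frameworks differ, and `memory_slice` is a hypothetical helper, not part of any framework): hand each agent a read-only view containing only the keys its task needs.

```python
from types import MappingProxyType


def memory_slice(store: dict, allowed_keys: set[str]) -> MappingProxyType:
    """Return a read-only view holding only the keys this agent's task needs."""
    # Copy the allowed entries into a fresh dict, then wrap it so the agent
    # cannot add or overwrite keys. Values are shared, not deep-copied.
    return MappingProxyType({k: store[k] for k in allowed_keys if k in store})


# Illustrative store; the key names are made up.
shared_store = {
    "tenant_a_notes": "notes for tenant A",
    "tenant_b_notes": "notes for tenant B",
    "style_guide": "plain English, active voice",
}
agent_a_view = memory_slice(shared_store, {"tenant_a_notes", "style_guide"})
```

Anything not in the slice simply is not there for the agent to leak. Mutable values inside the view are still shared with the store, so a stricter version would deep-copy them.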
