The Fable Recall Puts the Spotlight in the Wrong Place

Eilon Cohen

and

Ariel Fogel

June 14, 2026

min read

On Friday, the US government ordered Anthropic to cut off Fable 5 and Mythos 5 for every user, citing a jailbreak or bypass technique. Anthropic complied, then described what the technique actually did: ask the model to read a codebase and fix the flaws it found. The same capability you can pull out of GPT-5.5, and the same thing defenders do every day.

So a frontier lab pulled two models away from hundreds of millions of people over a finding that, by its own account, is neither new nor exclusive. We don't think the interesting story is about who was right. It's that the whole fight is aimed at the model, at its training and its power, as the thing to bring under control, when the model was never where the control lived. Its refusals were never a security boundary, and its capability was never something a recall could pull back.

Every model is vulnerable, by design

A model's refusals are statistical, not enforced. You are shaping a probability distribution, not setting an access control. There is always some input that pushes the model off its trained behavior and into output it was supposed to hold back. You can shrink that surface, but you cannot close it. Anthropic says the same thing in the same post: they don't believe perfect jailbreak resistance is possible for anyone right now, and they expect universal jailbreaks to surface eventually.

The people who built the model are telling you it cannot be trusted to guard itself. Pulling a deployed model because someone found one narrow bypass misses the point: there will always be another input.

It has now been proved mathematically. In a peer-reviewed paper in IEEE Security and Privacy, announced by NIST on June 9, 2026, Apostol Vassilev proves that for any finite set of guardrails on a model, some adversarial input gets through; as NIST frames it, there is always "a way to prompt an AI system to disregard its rules," and it is only a matter of finding it. Note the scope, though: the result is about the model and its guardrails in isolation, not the harness around it. The controls outside the model, the ones that bound what it can actually reach and do, sit outside the proof, and they are the part you can constrain. So you cannot test or train your way to a model that holds, and you cannot enumerate the bypasses, because there is an unbounded number of them. The work is not to chase them all; it is to scope your red teaming and your controls to a threat model grounded in what your system can actually reach and do, and then start there.

The model is what you control or defend, not the thing doing the defending

Two failure modes keep collapsing into one word, and the difference is the whole game.

A jailbreak defeats the model's own refusals. You craft input so the model produces something it was trained to withhold. That lives at the model layer.

A prompt injection hijacks the model through the data it reads. The model is not broken; it follows the attacker's instructions because nothing told it not to: they fell inside the set of instructions the system was willing to accept, planted in a web page or a tool result it could not separate from its real task. Prompt injections don't just happen; they happen when the system never bounded which instructions, from which sources, the model is allowed to act on. That is the harness layer, the system you built around the model.

Two attacks, two objectives, read left to right. Elevation is how strongly the model resists an output, and the range of its strongest refusals fills the right. The jailbreak climbs the steep restricted-output peak, but by its gentlest grade, the path of least resistance. The prompt injection follows a different trajectory entirely: it never confronts that peak, diverting instead to a low malicious goal (an action the model already accepts) that it reaches with barely a climb.

Picture the model's behavior as a landscape, where height is how hard the model resists an output. The restricted output is a tall, steep peak. A jailbreak does not flatten it; it finds the path of least resistance, the gentlest grade up that steep mountain, and climbs to the top anyway. A prompt injection never climbs that peak. It heads in a different direction entirely, toward a low hill the model barely resists, because the injected instructions already sit inside the set the system will accept. One is a hard climb up the model's strongest refusals. The other is a quiet walk to a goal it never questions.

In production the second one hurts more. The model has tools, credentials, a network, and your files. Alignment will never be airtight, as the proof above settles, but even if it were, a model that follows injected instructions and can reach your secrets is a worse problem than one that writes something rude or harmful.

That is why the model refusals cannot serve as a defense layer: the model is the asset under attack, and the same input the attacker controls is what you would be asking it to police. Not only are the controls required to guard against prompt injection distinct from those guarding against jailbreak, but the boundary has to sit outside the model, where that input cannot talk its way past it.

Start from the blast radius, then put real controls outside the model

Here is how we approach it at Pillar. Stop asking whether the model is safe. Ask a sharper question: if this model had no restrictions at all, what could it actually do inside your system? That answer is your blast radius: the tools it can call, the code it can run, the data it can move.

Map that into an exposure model. Now you know what an attacker gets the moment any input flips the model, and you have stopped depending on whether the model behaves.

The model's refusal is a soft line inside the box. Its blast radius, every call it can make, sits in front of a control placed where it counts, deciding call by call what reaches anything that matters.

Then put controls between the model and everything that matters, at the leverage points where it counts, external to the model so that an injection cannot talk its way past them. They might be deterministic, probabilistic, or layered; what matters is that they sit at the right points and act on the calls themselves, not on good intentions. It is fine for the model to be able to read a secret, because a control outside the model is what decides whether that secret ever leaves.

The Wrong Variable

Strip the order down and it tries to control one variable: how capable the model is. That's the variable already out of reach. Capability at this level isn't what makes Fable singular; it isn't singular. GPT-5.5 does the same work, open weights are closing the gap fast, and the part already loose runs offline on hardware no letter will touch. The finding behind the recall was, by Anthropic's own account, neither new nor exclusive. You cannot recall a capability that's already commoditized, and spending a government's leverage there spends it on the one thing that can't be moved.

The variable still in play is the other one: not how smart the model is, but how much it can reach and do once it's running. That's the blast radius: every tool, file, and credential a deployed agent can actually touch. It's the part you can still constrain, and it's where the work is, for every system someone actually owns.

That edge is worth stating honestly. Blast radius is a lever over systems you deploy. An adversary running a capable model in their own environment sets their own blast radius, and no recall reaches that any more than your controls do. That risk is real, but it isn't an argument for pulling a model, because the capability is already everywhere and the recall removes it only from the people most likely to deploy it carefully. A recall can touch which model loads from one lab. It cannot touch the variable that was never about the model.

‍

FAQs

Why can't AI model refusals be used as a real security boundary?

Model refusals are statistical, not enforced — they shape a probability distribution, not an access control. Any input can push a model off its trained behavior. A peer-reviewed paper in IEEE Security and Privacy, announced by NIST on June 9, 2026, proved mathematically that for any finite set of guardrails, some adversarial input gets through.

What is the difference between a jailbreak and a prompt injection attack on an AI system?

A jailbreak defeats the model's own refusals by crafting input that produces output the model was trained to withhold — the attack lives at the model layer. A prompt injection hijacks the model through data it reads, planting attacker instructions in a web page or tool result the model cannot separate from its real task, targeting the harness layer instead.

What is AI blast radius and why does it matter more than model alignment for security?

Blast radius is every tool, file, and credential a deployed agent can actually reach and call. Because alignment cannot be made airtight — now proved mathematically — the critical security question shifts from whether the model behaves to what it can reach if it does not. Constraining blast radius is the control still available to any team running an AI system.

Why did recalling the Fable models fail to address the actual AI security risk?

The recall targeted model capability, which is already commoditized. The technique behind it — asking a model to read a codebase and find flaws — works identically in GPT-5.5 and open-weight models running offline. Removing the models only affected careful deployers; adversaries running equivalent capability in their own environments were entirely untouched by the action.

How do you put real security controls outside an AI model to limit what it can do?

Map the model's blast radius first — every API it can call, every file and credential it can reach — to understand what an attacker gains the moment any input flips it. Then place deterministic or probabilistic controls between the model and those resources, external to the model, so injected instructions cannot bypass them. A control outside the model decides whether sensitive data ever leaves, regardless of model behavior.