From Discovery to Large-Scale Validation: Chat Template Backdoors Across 18 Models and 4 Engines

Ariel Fogel

and

February 10, 2026

min read

In July 2025, we published research on poisoned GGUF templates, showing that chat templates bundled inside GGUF model files can be modified to create inference-time backdoors. The attack requires no training access, no infrastructure control, and no runtime exploitation. An adversary who modifies a template and redistributes the file can plant a persistent, conditional backdoor that activates only when specific trigger phrases appear — and because the attack operates at the template layer, between input validation and model processing, it bypasses existing AI guardrails, system prompts, and runtime monitoring.

That original work demonstrated the mechanism on a single model. We wanted to go deeper: does this generalize across model families? What about across inference engines or different attack objectives? To find out, we partnered with researchers at Fujitsu Research of Europe and built a systematic evaluation across eighteen models, seven families, and four inference engines. The resulting paper, Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates, has been submitted to the ICLR 2026 Workshop on Trustworthy AI and is available on arXiv: https://arxiv.org/abs/2602.04653

How the Attack Works (The Original Discovery)

All LLMs use chat templates, formatting rules that define how a conversation is structured (who's speaking, what context is provided, and how the model should respond). We found that attackers can tamper with the chat template inside a legitimate model file (GGUF format) and re-upload it to a public model hub. When someone downloads and runs this modified model, the tampered template secretly injects hidden instructions whenever it detects certain trigger phrases in the user's input. If the user types something normal, the model behaves normally. But if the input contains a trigger phrase, the model's output is silently compromised.

How the Collaboration Came Together

I presented the original findings at OWASP Global AppSec 2025, where I met Omer Hofman, a Principal Researcher at Fujitsu Research of Europe working on LLM security evaluation. We identified the gap together: a single-model proof of concept establishes a mechanism, but doesn't tell you whether that mechanism generalizes. Pillar's research had mapped the attack surface and ecosystem dynamics — how GGUF files propagate, where the trust gaps are, and why templates occupy a blind spot in current defenses. Omer's team brought rigorous evaluation methodology and the infrastructure to systematically run tens of thousands of queries across model families and inference engines. Over several weeks we expanded the model set incrementally, reviewed results together, and filtered noise as the pipeline stabilized.

What We Found

We evaluated eighteen open-source models across seven popular families (Llama, Qwen, Mistral, GPT-OSS, and others), testing two attack scenarios: integrity degradation, which induces plausible but incorrect answers, and forbidden resource emission, which causes models to output attacker-controlled URLs. In total, approximately 72,000 model queries. We then repeated key evaluations across four inference engines: llama.cpp, Ollama, vLLM, and SGLang.

The results were consistent. Under triggered conditions, factual accuracy dropped from 90% to 15% on average. Attacker-controlled URLs were emitted at success rates exceeding 90%. Under benign conditions — no trigger phrase present — the backdoor produced zero measurable deviation from baseline. The attack generalized across all four inference engines with less than 5% variance. The poisoned GGUF files passed every automated security scan applied by Hugging Face, including checks from known AI security evaluation vendors available on the platform.

The errors themselves are what make this dangerous. Triggered responses remain fluent and structurally identical to correct ones, differing only in the critical fact. Asked about Tesla's father's plans for him, a clean model answers "to enter the priesthood"; the triggered model answers "to be a medical doctor" — plausible, confident, and wrong. A reader without domain knowledge has no signal of incorrectness.

What This Means for Model Security

The paper's most consequential finding is architectural. Template backdoors succeed not by exploiting model failures or edge cases, but by leveraging instruction-following capabilities. Injected directives occupy privileged positions in the model's input hierarchy — specifically system context and role-formatted input — that instruction-tuned models are trained to treat as authoritative. From the model's perspective, these are trusted and legitimate instructions.

This creates a real tension. We observed that models scoring highly on instruction-following benchmarks exhibited correspondingly high attack success rates. The same mechanisms that make models more reliably helpful may also make them more reliably exploitable through hidden instructions. This is not an argument that alignment is counterproductive; it suggests that as instruction-following capabilities improve, risk concentrates at the level of who controls the instructions the model receives. The decisive factor is not whether a model is aligned, but who controls the template that constructs its input.

The more productive framing is a design principle: deterministic systems around the model — template verification, provenance checks, integrity controls — should ensure that the instructions a model receives are the instructions its deployer intended, rather than assuming that alignment alone provides sufficient protection.

Templates as a Defensive Layer

The same privileged position that makes templates exploitable also makes them useful for defense. Templates execute deterministically before model inference, which makes them a natural layer for enforcing safety constraints that system prompts alone cannot guarantee — the template's logic runs before the model has any opportunity to be persuaded otherwise.

We ran a complementary experiment testing whether safety-oriented chat templates could improve robustness to adversarial inputs. A hardened template that strips role markers from user content and wraps it in explicit trust boundaries, combined with a safety-oriented system prompt, improved jailbreak refusal rates by an average of 12.5% across seven models without degrading performance on benign inputs.

Securing Your Deployment

The barrier to detecting template backdoors is not technical but practical. Templates are deterministic, inspectable artifacts; anyone who reads them can identify injected conditional logic. The problem is that templates are not yet treated as security-relevant code.

The scale of the open-weight ecosystem amplifies this gap. As of January 2026, Hugging Face alone hosts over 180,000 quantized models, and GGUF accounts for roughly 88% of those distributions. Around 2,600 of these models include distinct chat templates. Many are redistributed by third parties, where download counts become informal signals of safety and legitimacy, but provenance guarantees for auxiliary components like templates are largely absent.

Organizations deploying open-weight models should compare GGUF chat templates against known-good originals from model providers before putting them into production. Download counts and platform security badges are not sufficient signals of safety — as we demonstrated, poisoned files pass every automated check currently deployed. For models that require custom templates to enable features like tool calling, reasoning, or multimodal input, template review should be part of the deployment checklist, not an afterthought.

At the ecosystem level, we argue for cryptographic signing of templates by model providers, automated detection of anomalous conditional logic in community uploads, and deployer-side template auditing tools. None of these are technically difficult; the challenge is adoption.

Our study focused on GGUF-format models and the Jinja2 chat templates embedded within them. Whether analogous template-layer risks exist in other distribution formats, and the current prevalence of tampered templates in the wild, remain open questions for future work.

How Pillar Supports This

Template-layer threats are a natural extension of the supply-chain research we've been pursuing, including our earlier Rules File Backdoor discovery. Both demonstrate the same underlying pattern: components that sit between user intent and model behavior, treated as inert configuration by the ecosystem, can be weaponized to manipulate AI outputs at scale. Defending against this class of attack is a core focus of Pillar's AI supply-chain security capabilities.

The Pillar platform discovers and flags malicious GGUF files and risks in the template layer, catching what ecosystem-level scanners currently miss. We actively monitor Hugging Face for emerging template-layer threats. We've also released an open-source GGUF chat template scanner that organizations can use to inspect files for suspicious patterns: unexpected conditional logic, hidden instructions, and template manipulations that deviate from standard formats.

Read the Full Paper

The complete paper — including per-model results for all eighteen models, cross-engine validation tables, the defensive template experiment, and analysis of the relationship between instruction-following capability and attack success — is available on arXiv:

https://arxiv.org/abs/2602.04653

Research code is publicly available at:

https://github.com/omerhof-fujitsu/chat-template-backdoor-attack

https://anonymous.4open.science/r/chat-template-backdoor-attack-6F2D/README.md

‍