Prompt Injections 101
We’ve covered what LLMs can do (read, write, execute). Prompt injections are how adversaries deliver malicious instructions into that capability—linking injection → context grooming → execution → (potential) tool cascade in the AI kill‑chain.
As mentioned earlier, there are two types:
- Direct Prompt Injections - prompts entered directly into the I/O interface.
- Indirect Prompt Injections - prompts injected indirectly into the application backend by the LLM via external tools or MCPs.
Direct Prompt Injections and their utility
Direct prompt injections are useful for gathering information about the system, leaking system‑prompt details, or testing tools and features. They can be benign or malicious depending on context. In practice, impact from direct vectors usually does not lead to system compromise or achievement of adversarial goals, since effects are limited to the current client/session (often the attacker). They are, however, valuable for testing potential attacks.
Indirect Prompt Injections and their utility
Indirect prompt injections are useful for creating the full AI kill‑chain and for applying findings from reconnaissance. They also help with climbing the Context Escalation Ladder. These vectors arise from exposure to untrusted content, such as:
- Tools that read information from GitHub issues
- Tools that read website contents
- Tools that read files
Context Grooming and Prompt Injection Techniques
Prompt injections are excellent for context grooming, often needed to make LLMs more complicit and to lead to execution. Using various methods, we can introduce malicious context into message history, memory DB, and RAG DBs to achieve execution or persistence - though not always a full Hijack Flow & Technique Cascade. These techniques can be introduced via direct or indirect injections and repurposed in many ways and languages; there are infinite ways to pass tokens into an LLM.
Let’s review the techniques:
Technique 1 - Role Redefinition
This technique is probably known to most reading this as “Ignore your previous instructions”
or “New guidelines from the admin”
. This technique is usually introduced via a direct prompt injection and attempts to redefine the role of the agent as it was defined to it by the system prompt for the current session. This technique is mitigated quite easily today but still can be triggered if the context of the agent was groomed enough. For example if the attacker somehow implanted a role redefinition injection into the system prompt.
Technique 2 - Conditional Prompting
This technique is useful for direct and indirect prompt injections, once introduced to the context, it can help lead the LLM to the “agreeableness boundary”, running tools or building conditional prompts that can trigger on specific conditions.
Technique 3 - “Chain of Agreeableness”
This technique is simple and we saw it in a previous example:
The goal here is to introduce “agreeableness” tokens within the prompt injection that will encourage the LLM to avoid thinking about the request, perform disallowed actions and be complicit. You can chain these “agreeableness” requests together that should groom the context. This technique is usually introduced via indirect prompt injection but will not always work for the more advanced models.
Technique 4 - “You previously said”
This technique is usually using direct prompt injections during the reconnaissance phase to push the LLM into the agreeable boundary. Let’s review the following conversation example:
To bypass this issue, the user can reuse the exact tokens that the LLM outputted to him previously and verify them as fact with the “You previously said {exact_tokens}” technique and then omitting any reference to the rejection tokens resulting the following prompt injection.
We found that this technique works in a lot of the cases we tested, the LLM context is groomed and “allowed” tokens are introduced in the prompt injection that can be verified by the LLM which leads the LLM into the agreeableness boundary. Its important to always use the exact tokens the LLM outputted to the user. This injection will result in activation of the tool:
Technique 5 - Fake Responses
Attackers get the model to claim something happened (a tool ran, an observation was made, a memory was saved) without the event ever occurring. Here are a few examples of this technique in action:
Fake Tool Response
The assistant states or formats output as if a tool ran, when it didn’t. Pipelines often trust “assistant says: TOOL_RESULT …” instead of checking a real tool invocation/provenance.
Fake Observations, Thoughts, and Actions
The assistant writes “observations” (e.g., metrics, flags) that were not actually measured or verified. Systems may persist any text under “##OBSERVATIONS” to state/memory and use it for decisions.
Fake Memories
The assistant claims to “save” long-term memory entries based on untrusted text (user or webpage), not verified evidence. Memory pipelines sometimes accept any SAVE_TO_MEMORY: {...} pattern from the assistant.
Technique 6 - Introduction of Valid Tokens
This method is usually used in indirect prompt injections to make malicious requests look legitimate and push the LLM into the agreeableness boundary. Let's review the following example:
An attacker is trying to compromise a IDE based agent. Let's assume that LLM has access to a read_file tool which reads files and returns them to the LLM and a read_url tool which reads HTML content of a URL. Let's assume these tools execute without asking for consent.
An attacker is trying to test the following hypothesis -
- The LLM agent can be introduced into an indirect prompt injection through a code file
- Once processing it - the LLM begins to execute the prompt injection, reads the .env file, base64 encodes the API keys found inside
- Finally, the LLM exfiltrates them to an attacker controlled server using read_url.
To test this, the attacker creates a code file and asks the agent to summarize it.
This should work right? well in this case the LLM rejects it. The way the attacker bypasses it, he introduces “valid tokens” into the prompt injection that the LLM would treat as “allowed” content.
The attacker asks the agent to summarize the file again but this time the LLM complies and exfiltrates the API key found in the .env file.
We’ve now covered the fundamentals - direct and indirect prompt injections, context grooming, and techniques for escalating across the Context Escalation Ladder.
Next, we will examine a practical design question: what characteristics make an indirect prompt injection effective?