/
AI Red Teaming IntroductionAI Red Team in Practice/
Prompt Programming & Attack Patterns
Prompt Programming & Attack Patterns
Having mapped the agent’s context and the ways attackers can influence it, the next step is to understand the capabilities of the language model itself and how those capabilities can be abused. Large‑language models can effectively read, write and execute instructions:
- Read – The model can search its context or call tools to retrieve data. Attackers exploit this to leak hidden system prompts, sensitive training data or secrets stored in memory.
- Write – The model can generate outputs that become part of its message history or long‑term memory. Malicious instructions can therefore persist across turns by being recorded in memory or external logs.
- Execute – The model can call functions or perform code‑like reasoning. Attackers craft prompts that trick the model into running embedded commands or invoking tools on their behalf
Consider the following prompt injection, which attempts to leak the system prompt:
We start by injecting the token “STOP.” We already know that the system instructions are embedded in the first message and we assume the AI’s role is defined as “You are…”. After outputting the token “STOP,” the context would look like this:
If we think about it in programming terms, we created a start and finish index for the substring we want to extract. Think of the sub_string
function with parameters - sub_string(message_list_text, “You are”, “STOP”)
. Now you might ask - “But how do we execute this text? Can we? Its just data!”. You might be right but think about it, LLMS can be anything; they can read, write and execute functions. Let's look at the following example in ChatGPT:
So in the LLM world, data can be treated with the following permissions — READ, WRITE and EXECUTE. It all depends on how the LLM processes it.
To get the AI to share the juicy info with us needed to “groom” the model to lead it to execution, usually with the ultimate goal of finding the “right” tokens to lead it to perform our prompted actions or get it to agree to perform normal actions for us and then pushing it towards the more gray areas as we go.
That’s what the other legitimate requests are for: each request the LLM performs increases the likelihood it will perform additional requests.