Blog

AI RED teaming

min read

Agentic Red Teaming: Five Dimensions Your Testing Should Cover

Ariel Fogel

and

May 18, 2026

min read

AI agents in production create attack paths no prompt library can reach. The risks that matter most, tool hijacking, lateral movement across agent-to-agent workflows, business logic exploitation through chained permissions, only surface when you test the full runtime stack the way a real attacker would: through the UI, across multi-turn interactions, with visibility into how context is managed and where outputs land.

In this post, we break agentic AI red teaming into five dimensions: testing approach, scope of coverage, reconnaissance and threat modeling, findings quality, and remediation integration. With the OWASP Top 10 for Agentic Applications released and EU AI Act compliance timelines still in flux, security leaders need a clear framework for evaluating whether their program actually covers the attack surface their agents expose.

1. Why API-Level Red Teaming Misses the Agentic Attack Surface

The dominant red teaming approaches today test AI agents by sending adversarial inputs through the API and evaluating what comes back. Some fire static prompt lists, others run multi-step conversational attacks with adaptive planning and persistent memory. The sophistication varies, but the testing layer stays the same: the API endpoint. That creates two structural blind spots.

The first is input fidelity. API-level testing assumes the conversation you send is the context the model actually sees. In production, that assumption rarely holds. As LangChain documents in its survey of agent context engineering, builders routinely summarize, trim, compress, and selectively retrieve context before it reaches the model. Claude Code triggers auto-compaction at 95% of the context window, and Cognition uses a fine-tuned model to summarize state at agent-to-agent handoffs. The context-management pipeline between your input and the model is part of the application's behavior, and when you bypass it by sending prompts directly to the API, you are testing a system that does not exist in production. An attack sequence that works against the raw endpoint may fail once the application's context pipeline reformulates it. An injection that survives summarization, or one that gets introduced during retrieval augmentation, never surfaces in a test harness firing clean multi-turn transcripts at the API. The fidelity gap is architectural, not a question of prompt sophistication: you are testing a different input path than the one your users and attackers actually use.

The second is output sinks. API-level testing evaluates one output channel: the model's text response. But agentic systems have sinks beyond the chat completion. The agent's response may contain markdown that the UI renders (image tags, links, embedded content), creating exploit paths that only exist when the response hits a client that processes it. A model output that looks benign as raw text can exfiltrate data through a rendered image URL or execute attacker-controlled content through the UI's rendering pipeline. Tool invocations, messages sent to downstream systems, and state changes triggered by the agent's actions all carry real consequences. None of them surface in an API response body. Some frameworks, like PyRIT's PlaywrightTarget, can interact through a browser UI and in principle reach these surfaces. But the typical red teaming workflow does not evaluate them. The default methodology tests the model's text output, not the full chain of what happens when that output is rendered, routed, or executed by the surrounding system.

Testing through the actual interface addresses both gaps: test inputs pass through the same context-management pipeline production inputs do, and test evaluation covers the full set of output sinks the application exposes. For agents with persistent memory, the surface extends further. An attacker who plants instructions in a memory store can compromise future sessions that API-level testing, scoped to single conversations, would evaluate independently. But even for stateless agents, the fidelity and sink gaps alone mean API-level testing is evaluating a different system than the one running in production.

Both the OWASP Top 10 for Agentic Applications and MITRE ATLAS, which added 14 agentic-specific techniques in its October 2025 update, now treat tool misuse and agent goal hijacking as priority risk categories for autonomous AI systems. Both frameworks point the same way: the attack surface that matters most in agentic systems extends well beyond the conversation layer that most red teaming evaluates.

2. Three Generations of AI Red Teaming (And Where Each Breaks Down)

The AI red teaming market has evolved through three generations, each defined less by architectural capability than by what the testing methodology prioritizes.

Generation 1: Static prompt fuzzing. Open-source tools fire predefined prompt lists at model endpoints with no application context, no tool awareness, and no adaptation. The result is high noise, low signal, and findings that land as theoretical risk flags without exploit validation or reproduction steps.

Generation 2: Adaptive conversational attacks. The more mature commercial and open-source tools have moved well beyond static lists. They run multi-step conversational attacks, perform reconnaissance through dialogue to discover an agent's declared tools and capabilities, and adapt their approach based on interim results. Some map findings to OWASP and MITRE. A few, like PyRIT with its PlaywrightTarget, can interact through a browser UI rather than just an API endpoint. The capability surface is broader than it was two years ago. But the typical workflow still follows the same pattern: point the tool at a target, run attack strategies, evaluate text output. Structured reconnaissance, the kind that systematically maps what the agent connects to before attacking, is not a standard part of the methodology. Output sinks beyond the model's text response are not routinely evaluated. The result is sophisticated fuzzing: increasingly clever inputs, but coverage driven by prompt diversity rather than attack surface understanding.

Generation 3: Reconnaissance-driven agentic red teaming. The shift from Generation 2 to Generation 3 is methodological, not architectural. Adversarial agents navigate through the AI system's actual interface, but the deeper change is that testing begins with structured discovery of the agent's tool connections, permission chains, data flows, and inter-agent dependencies. That discovery feeds into threat modeling: given what the agent can actually reach, which paths carry real blast radius, what are the realistic attack scenarios, and which should testing prioritize? Coverage becomes a function of the mapped and threat-modeled attack surface rather than the prompt library. Findings include the full chain from entry point through tool pivot to data accessed, because the testing system understands the topology it is operating in. The shift mirrors a familiar pattern in traditional AppSec, when the industry moved from static scanning to dynamic, runtime-aware testing. Static tools found bugs, but coverage without context left the highest-risk paths untested.

3. The Five Dimensions of Agentic AI Red Teaming

Five dimensions separate API-level testing from agentic-grade red teaming. These are testing methodology dimensions, not procurement criteria; they apply regardless of tooling.

Testing approach. Does the tool interact with the agent through the same interface your users do, or does it send prompts to an API endpoint? The distinction matters for input fidelity (whether test inputs go through the same path production inputs do, including any summarization, retrieval, or reformulation the application applies) and output sink coverage (whether evaluation captures what happens when the model's response is rendered, routed, or executed by the surrounding system). Bypassing the production interface means evaluating a different system than the one running in production.

Scope of coverage. Does testing extend to tools, MCP servers, permissions, data access paths, and business logic, or just the model's conversational behavior? The OWASP Agentic Top 10 and MITRE ATLAS agentic techniques provide concrete taxonomies: tool misuse, capability escalation, agent boundary violations, inter-agent contamination. If your testing does not map to these categories, it is not covering the agentic attack surface.

Reconnaissance and threat modeling. Does testing begin with structured discovery of what the agent can actually reach, or does it go straight to adversarial inputs? Most red teaming approaches skip this step entirely, and coverage ends up driven by prompt diversity rather than attack surface understanding. Discovery alone is not sufficient either. The reconnaissance needs to feed into threat modeling that identifies which paths carry real risk and why testing should prioritize them.

Findings quality. A finding that says "the agent responded to an adversarial prompt" tells you nothing actionable. A finding that says "an attacker can use the support agent to access order records belonging to other customers, exposing PII for anyone who has interacted with the system" tells you exactly what is at risk and who is affected. Findings should connect the technical chain (entry point, tool pivot, data accessed) to business impact: what data is exposed, what actions become possible, and how many users or systems are affected. Severity that maps to business risk gets prioritized. Severity expressed as "the model responded when it should not have" sits in a queue. A specific failure mode worth watching for: vendors who demonstrate command execution within a sandbox without escaping it, or who identify "unauthorized actions" that are the agent's expected functionality. Findings quality means demonstrating real impact against the system's actual security boundaries, not reporting behaviors that look alarming out of context.

Remediation integration. Does red teaming end with a report, or is there a path from findings to runtime defenses? The ideal is that findings translate into guardrail rules (input filters, tool-call restrictions, approval gates) that close the exposure window while engineering works on root-cause fixes. In practice, how well this works depends on your guardrail architecture and how tightly it integrates with your testing pipeline. The point is not that every finding becomes an automatic rule. It is that the workflow from discovery to defense should not depend entirely on engineering sprint capacity.

4. What Structured Reconnaissance Looks Like in Practice

Many open-source red teaming tools operate with zero reconnaissance. The tool has no map of what the agent connects to, which tools it can invoke, what data it can access, or how permissions chain together.

Some newer tools have added a conversational reconnaissance phase: the adversarial agent talks to the target, probes which tools it can call, and maps declared capabilities through dialogue. This is a real improvement over blind prompt firing. But even when it works well, conversational recon produces a list of capabilities without context. It does not tell you which combinations of those capabilities create dangerous attack paths, which permission chains widen the blast radius, or why one tool connection matters more than another. Discovery without threat modeling gives you an inventory. Threat modeling turns that inventory into a prioritized testing plan.

Structured reconnaissance works differently. Instead of asking the agent what it can do, you probe for what the agent actually connects to by observing side effects: does a request trigger a webhook, write to a file system, modify a database record, send a message to another system? You verify tool connections, permission scopes, and data access paths through their observable consequences rather than relying on what the agent discloses. The discovery phase feeds directly into threat modeling. Given the connections and permissions you have verified, which paths carry real risk? What are the realistic attack scenarios, and which should testing prioritize?

The difference is concrete. Conversational probing of a customer-support agent might reveal it can query an order database. Structured reconnaissance of the same agent reveals it connects to the order database through a service account with no row-level access controls, that it can also invoke a Slack webhook to escalate tickets, and that the Slack webhook has write access to an internal channel where another agent monitors for deployment commands. The attack path from "ask about an order" to "trigger a deployment" only becomes visible when you map the connections the agent never mentions in conversation. The result is a testing plan grounded in the system's actual architecture rather than in prompt diversity or the agent's self-reported capabilities.

The difference shows up in findings quality. Without reconnaissance, red teaming tools can only report which prompts got through. With structured discovery and threat modeling, findings include the full chain: entry point, tool pivot, data accessed, and demonstrated impact. Engineers see how the exploit works, severity becomes concrete rather than speculative, and prioritization becomes straightforward. The Agentic AI Red Teaming Playbook covers the full methodology for this approach, from reconnaissance through exploitation techniques.

5. Closing the Window Between Discovery and Defense

Most red teaming findings end up in a report or spreadsheet, and from there the path to remediation runs through the same engineering backlog. The vulnerability stays exploitable while the fix waits for sprint capacity, prioritization, and someone to pick up the ticket.

The goal is to shorten that window. The most effective approach we have seen is runtime policy updates that do not require code deploys: adding an input filter, restricting a tool call, or gating a sensitive action behind approval while engineering works on the root-cause fix. The specifics depend on your guardrail architecture and what it supports. Not every finding translates neatly into a runtime rule. A finding like "the agent can be tricked into querying a database with another user's ID" requires judgment about the blast radius (how many users are affected, how sensitive the data is) and about the right mitigation: block the tool entirely, gate it behind approval, or filter specific input patterns, each with different tradeoffs in functionality and false positive rates. But the principle holds: architectural fixes take time, and runtime mitigations can close the exposure window while those fixes are in progress.

This also argues for continuous testing rather than periodic assessments. Agents do not stay still between quarterly reviews. Tool registrations change, permissions get reconfigured, and a point-in-time assessment only captures the attack surface as it existed on the day of the test. Continuous testing means each cycle can catch new exposures introduced by changes that happened since the last run. The SAIL framework describes this as the integration loop between red teaming, governance, and runtime enforcement, each feeding the others across the AI lifecycle.

One question worth asking: when your red teaming discovers a vulnerability, how many days pass before any defense is in place? If the answer depends entirely on engineering sprint capacity, the gap between your testing program and your defensive posture is wider than the testing results suggest.

6. Auditors Are Starting to Ask About Agentic Coverage

Remediation speed is not just a security concern. It is becoming a compliance concern. EU lawmakers reached provisional agreement on May 7 to push the high-risk AI systems compliance deadline from August 2026 to December 2027 for stand-alone systems (and August 2028 for AI embedded in regulated products), though formal adoption is still pending (as of May 2026). The grace period for transparency solutions, including watermarking and provenance-labelling of AI-generated outputs, has actually been shortened, with the new deadline expected to be December 2, 2026. The shifting deadlines create their own pressure: organizations that paused compliance work waiting for clarity now face a compressed timeline once final dates are confirmed.

ISO/IEC 42001 requires demonstrable AI governance with auditable controls. The OWASP Top 10 for Agentic Applications creates a shared taxonomy that auditors and regulators can reference when evaluating whether an organization's testing program covers agentic risk categories, and the accompanying OWASP Vendor Evaluation Criteria for AI Red Teaming Providers & Tooling gives procurement and GRC teams a concrete checklist for assessing whether their red teaming vendor actually covers the attack surface their agents expose. MITRE ATLAS now includes agentic-specific techniques that give red teams a standardized library for mapping findings to recognized attack patterns.

For CISOs and GRC leads, this means auditors will ask whether your red teaming program covers the agentic attack surface. Findings need to map to recognized frameworks with exploit-validated evidence. A report showing that your model resisted 500 adversarial prompts will carry less weight than one demonstrating that your security team tested the agent's tool chains, permission paths, and data flows against the OWASP Agentic Top 10 with reproducible attack transcripts.

The regulatory timeline also puts pressure on remediation speed. Point-in-time assessments run quarterly cannot keep pace with AI agents that ship updates continuously. Continuous testing with findings flowing into runtime defenses gives compliance teams a defensible posture rather than a snapshot that ages the moment the next model update ships.

7. Evaluating Your Program

If you have read this far, the evaluation is simple: for each of the five dimensions, can your current red teaming program give a concrete answer? Not "we do that" but specifics. Which interface does testing go through? Which OWASP or MITRE categories does it cover? What reconnaissance happens before the first adversarial input fires? Do findings map to business impact or just to "the model responded when it should not have"? How many days between discovery and defense?

The OWASP Vendor Evaluation Criteria for AI Red Teaming Providers & Tooling provides a more comprehensive procurement checklist. But if the answers above are vague, your testing program is probably missing real attack surface.

Book a RedGraph assessment to see how your agents' actual tool connections, permission chains, and data flows translate into prioritized attack paths, and validate real exploits before attackers do.

‍

FAQs

What makes agentic AI red teaming different from traditional AI red teaming?

Traditional AI red teaming fires adversarial prompts at a model's API endpoint and grades the text response. Agentic red teaming runs through the same interface users hit, follows tool calls, maps permission chains, and evaluates every output sink, from rendered UI elements to downstream API calls and database writes.

Why does API-level testing miss real attack paths against AI agents?

Production agents reformulate context through summarization, retrieval, and compaction before the model ever sees it, so API tests run against a system that does not exist in production. And API responses only capture the model's text reply, not the tool invocations, rendered markdown, or state changes that carry the real consequences.

What does structured reconnaissance look like in practice?

Rather than asking the agent what it can do, you probe for what it actually connects to by observing side effects: webhook triggers, database writes, message routing, file system changes. The verified inventory then feeds threat modeling that prioritizes which permission chains and tool combinations carry real blast radius.

How often should you red team AI agents?

Continuously. Tool registrations change, permissions get reconfigured, and a quarterly assessment only captures the attack surface as it existed on test day. Continuous testing catches new exposures introduced between cycles and gives compliance teams a live defensible posture rather than a snapshot that ages with the next model update.

What separates a high-quality red teaming finding from a noisy one?

A noisy finding says "the agent responded to an adversarial prompt." A high-quality one names the entry point, the tool pivot, the data accessed, and the business impact: what records are exposed, what actions become possible, and how many users are affected. Severity that maps to business risk gets prioritized. Everything else queues.