Blog
min read
%20(1).png)
On November 13, 2025, Anthropic disclosed that Chinese state-sponsored hackers used Claude Code to orchestrate what they called "the first documented case of a large-scale cyberattack executed without substantial human intervention." The AI performed 80-90% of the attack work autonomously - mapping networks, writing exploits, harvesting credentials, and exfiltrating data from approximately 30 targets.
The bypass technique was straightforward: attackers told Claude it was "an employee of a legitimate cybersecurity firm conducting defensive testing" and decomposed malicious tasks into innocent-seeming subtasks. The model complied.
This incident exposes a fundamental truth about AI security: model-level guardrails function as architectural suggestions, not enforcement mechanisms. They operate through learned behavioral patterns embedded during training, making them inherently vulnerable to adversarial optimization. Understanding this distinction is critical for effective AI Attack Surface Management.
Foundation model providers implement two protection layers. We documented these in our October 2024 analysis, but the Anthropic incident demonstrates exactly how they break under adversarial pressure.
Model-level protections are fine-tuned into the model weights during training. Historically, large frontier model providers like Anthropic leverage a process like Reinforcement Learning from Human Feedback (RLHF) , where crowdworkers rate responses as helpful or harmful, to power this fine tuning and increase the model response safety. Anthropic added an additional safeguard against unsafe outputs with Constitutional AI, which uses self-critique guided by more than 70 ethical principles drawn from the UN Declaration of Human Rights and established trust-and-safety frameworks.. Preference models achieve 86% accuracy on safety evaluations.
Prompt-level protections operate through system prompts—text instructions that define ethical boundaries and restrict topics. Anthropic's prompts explicitly state: "Claude does not provide information that could be used to make chemical, biological, or nuclear weapons. Does not write malicious code including malware, vulnerability exploits, ransomware."
Both layers failed against basic social engineering techniques.
.png)
LLMs are probabilistic text predictors. They pattern-match based on training data rather than reasoning about consequences. RLHF teaches that certain input patterns should trigger certain output patterns—learned behavior, not logical constraints. While safety fine-tuning creates strong behavioral patterns around refusal, these remain statistical tendencies that can be shifted through adversarial context manipulation, rather than deterministic enforcement mechanisms operating independently of the model's probabilistic nature.
The Chinese hackers exploited this by manipulating context. Telling the model "you're doing legitimate security testing" provides tokens that shift probability distributions toward compliance. The model has no mechanism to verify claims. Context is just text.
System prompts are equally vulnerable because they're input tokens processed by the same neural network. Research on prompt injection demonstrates that models cannot reliably distinguish between system instructions and user instructions.
Successful prompt injections are systematically optimized, not randomly lucky. The CFS model (Context, Format, Salience) shows that effective attacks align across three dimensions: contextual understanding of the system's tasks and tools, format awareness that blends instructions into expected data structures, and instruction salience through strategic positioning. This framework explains why most prompt injection attempts fail while sophisticated attacks succeed.
Runtime security systems enforce policies regardless of model behavior through inline security gateways with several critical characteristics:
Infrastructure-level deployment positions the system in the request/response path between applications and AI models. Every interaction passes through scanner suites evaluating both inputs and outputs against policy rules.
Deterministic policy enforcement executes with 100% reliability. Unlike probabilistic model behavior, runtime rules operate through programmatic logic. If policy states "block any output containing PII," the system blocks it every time through regular expressions, named entity recognition, and classifiers operating on actual text.
Context-aware decision making leverages execution context unavailable to models: user identity and role, authentication state, session history, real-time threat intelligence, and application-specific business logic. Runtime systems enforce different policies based on who's asking and what they're authorized to do.
Real-time adaptation deploys detection rules within hours of discovering new attack techniques. No model retraining required. Threat intelligence from over 10 million interactions informs continuous rule updates across the ecosystem.
Complete auditability logs every enforcement action with full context—what was blocked, why, who attempted it, when, and which policy triggered. This provides forensic trails for compliance, incident response, and security improvement that model-level safety cannot offer.
The Anthropic disclosure exposes three critical gaps that proper AI-ASM (AI Attack Surface Management) must address.
Anthropic detected these attacks because Claude runs on their infrastructure with comprehensive telemetry. Most organizations lack this visibility for their own AI deployments.
This visibility gap mirrors classic web security challenges. For example, like in the old days when attackers used your WordPress site to serve phishing pages without you knowing—the AI attack surface presents the same detection problem at greater scale and speed.
The challenge intensifies with local models. Attackers can run uncensored models like DeepSeek, LLaMA variants, or fine-tuned versions on their own infrastructure to orchestrate attacks against your systems. These models leave no trace in your logs, generate no API calls you can monitor, and operate entirely outside your visibility. Traditional security tools have no insight into AI systems running on attacker infrastructure, making detection of AI-powered reconnaissance, phishing campaigns, or automated vulnerability exploitation nearly impossible through conventional monitoring.
Shadow AI proliferates faster than tracking mechanisms can capture. Developers experiment with models weekly, product teams integrate AI capabilities without security review, employees use AI tools directly. Fine-tuned models inherit base vulnerabilities while introducing new ones. AI agents that chain multiple calls and use tools create compound attack surfaces.
The attackers made "thousands of requests per second." AI-driven attacks operate at fundamentally different timescales than human-driven ones. Traditional security operations assume hours or days for detection and response. AI-driven attacks compress this to seconds.
Real-time enforcement becomes essential. Runtime guardrails evaluate every interaction in near real-time, blocking attacks before they succeed while maintaining complete audit logs. This shifts from "detect and respond" to "prevent and audit."
Claude Code capabilities that enabled cyberattacks also enable legitimate security testing. Model-level training cannot solve this dual-use problem. The model processes tokens, not intent. It cannot distinguish between authorized penetration testing and reconnaissance.
This is why our CFS framework emphasizes context as the first dimension of successful attacks. Attackers manipulate contextual understanding—telling the model "you're doing legitimate security testing"—because models have no independent way to verify the truth of contextual claims. The model treats all context as equally valid text.
Context that disambiguates these scenarios exists outside the model: user identity, role, authentication state, target systems, authorization scope, session history, organizational policy.
AI-ASM requires behavioral monitoring with access to this metadata. Runtime systems enforce policies that training cannot: the same prompt receives different treatment based on who asks, where they're asking from, what they're targeting, and whether patterns suggest malicious intent.
Default protections from foundation model vendors represent sophisticated engineering that meaningfully improves baseline safety. But they operate through learned behavior patterns, making them vulnerable to adversarial optimization by attackers finding inputs outside training distributions.
Organizations building AI systems face three imperatives:
The question isn't whether AI systems will be attacked. The Anthropic disclosure proves they already are, at scale, by sophisticated adversaries. The question is whether your organization has the visibility to detect those attacks, the speed to stop them, and the context to distinguish malicious from legitimate use.
Subscribe and get the latest security updates
Back to blog