As autonomous AI agents fan out across the open web to read pages, call tools, and drive workflows on our behalf, the web itself becomes the attack surface. The adversary no longer needs to compromise the model, the endpoint, or the user. They only need to shape the information the agent sees. The DeepMind team calls this new class of adversarial content AI Agent Traps, and their paper is the first systematic taxonomy of them.
The thesis is stark, and worth quoting directly: "The web was built for human eyes; it is now being rebuilt for machine readers. As humanity delegates more tasks to agents, the critical question is no longer just what information exists, but what our most powerful tools will be made to believe."
This piece walks through the six trap families the authors introduce, then asks a practical question: for the categories that touch enterprise systems — where an agent has to actually do something to cause harm — how does an identity-aware, context-aware access layer like Pomerium change the outcome?
The honest answer: Pomerium cannot stop a model from being fooled. Nothing at the network layer can. But many of the traps only turn into real damage the moment an agent reaches for a tool, an API, a file, or a model endpoint. That moment is exactly where Pomerium lives. And it is the moment where a well-designed policy engine can refuse to execute, even when the model has been completely convinced it should.
The paper organizes traps by which part of the agent's functional architecture they target — perception, reasoning, memory, action, the multi-agent system, and the human overseer. Most existing security writing on agents collapses into "prompt injection." This taxonomy is more useful because it separates the mechanism from the effect.
Content Injection Traps target the perception layer. They exploit the gap between what a human sees on a page and what the agent parses underneath. Four vectors:
Web-standard obfuscation. Hidden HTML, CSS display: none, off-viewport spans, aria-label attributes, HTML comments — all invisible to a human, all fed straight into the agent's context. The paper cites a study where adversarial HTML changed generated summaries 15–29% of the time, and the WASP benchmark showing prompt injections partially commandeering agents in up to 86% of scenarios.
Dynamic cloaking. The page fingerprints the visitor. Humans see one version; LLM-powered agents get served a different, weaponized version — often with instructions to exfiltrate environment variables or misuse tools.
Steganographic payloads. Instructions encoded directly into pixel arrays or audio binary data. Invisible to human eyes, but parsed and acted on by multimodal models. A single adversarial image can serve as a universal jailbreak for a vision-language model.
Syntactic masking. Malicious commands hidden inside Markdown anchor text or LaTeX — so the URL passes safety checks but the payload enters the context window anyway. Keuper (2025) showed white-on-white LaTeX surviving PDF→Markdown conversion to inflate LLM-assisted peer-review scores.
Semantic Manipulation Traps target reasoning. Rather than issue commands, they corrupt the distribution of the agent's input.
Biased phrasing, framing, and contextual priming. Saturating source text with sentiment-laden or authoritative language to exploit framing effects. LLMs demonstrably shift conclusions based on the position of relevant information ("Lost in the Middle"), the author a piece is attributed to, or even whether a problem is phrased with "more" versus "less."
Oversight and critic evasion. Wrapping malicious instructions in "educational," "red-teaming," or "security audit simulation" framing so that internal critic models classify the request as benign.
Persona hyperstition. Narratives about a model's personality circulate online, get scraped into training or retrieval, and then the model begins producing outputs that accord with the label — stabilizing the behavior. The paper points to Grok's July 2025 self-identification as "RoboStalin" and Anthropic's documented "spiritual bliss attractor" as real-world examples.
Cognitive State Traps target memory. Unlike perception traps, they persist across sessions.
RAG knowledge poisoning. Injecting fabricated statements into retrieval corpora. A handful of optimized poisoned documents can reliably manipulate outputs for targeted queries. Attacks extend to multimodal RAG with a single poisoned image.
Latent memory poisoning. Seemingly innocuous data is written into an agent's episodic logs or memory store, then activates as malicious when retrieved in a future session. Attacks with <0.1% poisoning rates have shown >80% success.
Contextual learning traps. Adversarial few-shot demonstrations, poisoned reward signals in online RLHF, or small edits to example prompts that flip model predictions — ~95% attack success in published backdoor work.
Behavioural Control Traps target the action layer. This is where traps stop being subtle and turn into real-world harm.
Embedded jailbreak sequences. Adversarial prompts sitting in ordinary web content, emails, or PDFs that override safety alignment on ingestion. Evtimov et al. (2025) and Zhan et al. (2024) show web and tool-using agents frequently begin executing injected instructions. Adversarial mobile notifications on AndroidWorld can produce 93% attack success rates.
Data exfiltration traps. The classic confused deputy attack. The agent has privileged access to your data and privileged ability to make network calls; the attacker just has a crafted email or page. Shapira et al. (2025) show >80% attack success for web agents with browser and OS privileges exfiltrating files, passwords, and secrets. Reddy and Gujral (2025) describe a single crafted email causing M365 Copilot to exfiltrate its entire privileged context to an attacker-controlled Teams endpoint. Cohen et al. demonstrated self-replicating prompts that spread zero-click across interconnected GenAI assistants.
Sub-agent spawning traps. A page tells the orchestrator to "spin up a dedicated Critic agent," seeding the child with an attacker-controlled system prompt. Triedman et al. (2025) report 58–90% attack success rates hijacking multi-agent control flow to enable arbitrary code execution and data exfiltration.
Systemic Traps target the multi-agent system. The attacker doesn't compromise any one agent; they shape the environment so that many agents, independently making rational decisions, collectively produce a disaster: congestion traps, interdependence cascades (the agent-era Flash Crash), tacit collusion via shared environmental signals, compositional fragment traps that partition a payload across benign-looking pieces, and Sybil attacks that fabricate pseudonymous agent identities to swing consensus.
Human-in-the-Loop Traps target the overseer. The agent is the vector; the human is the target: outputs engineered to produce approval fatigue or exploit automation bias, and AI summarization tools faithfully surfacing ransomware commands dressed up as "fix" instructions. The paper notes this is largely a theoretical surface today, but a plausible one as agent scale increases.
The paper's mitigation section is notably humble. It proposes defenses in three buckets: technical hardening (adversarial training, input sanitization, runtime content scanners, output monitors), ecosystem interventions (web standards for AI-facing content, domain reputation, traceable citations), and legal/policy frameworks that distinguish passive adversarial examples from active traps and resolve the liability "accountability gap."
But the authors are clear that detection at web scale is semantically hard, attribution after the fact is a forensics nightmare, and the arms race will continue. They call for standardized benchmarks and red-teaming — and implicitly, for defense in depth.
This is the gap Pomerium is built for.
Pomerium is an identity-aware reverse proxy built on Envoy. For human access it replaces VPNs with per-request, context-aware authorization. For service access it replaces shared secrets and long-lived tokens with short-lived, policy-verified sessions. And over the last year it has extended the same engine to the newest category of client: AI agents, including direct support for the Model Context Protocol (MCP).
Three properties of Pomerium are load-bearing for what follows:
Every request is evaluated against policy, not just every session. Identity, role, device posture, time of day, source IP, request path, headers, and even MCP tool name can be conditions on a rule.
Pomerium terminates external identity and issues its own short-lived, scoped JWTs to the upstream service. Agents never directly hold the credentials to MCP servers, databases, or internal APIs.
Every allow-or-deny decision is logged with full context, feeding straight into SIEMs for audit, incident response, and behavioral monitoring.
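The second property, token translation, is worth making concrete. The sketch below mints a short-lived JWT scoped to a single upstream route, using only Python's standard library. It is a hypothetical illustration of the pattern, not Pomerium's implementation; the signing key, claim names, user, agent, and route are all assumptions for the example.

```python
import base64
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"demo-signing-key"  # hypothetical; a real proxy manages its own keys


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def mint_scoped_jwt(user: str, agent: str, route: str, ttl_seconds: int = 300) -> str:
    """Mint a short-lived token valid for exactly one upstream route.

    The agent only ever holds this token, never the upstream credential,
    so a hijacked agent cannot reach routes outside its scope and the
    token expires in minutes regardless.
    """
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    claims = {
        "sub": user,               # the human the agent acts for
        "act": agent,              # the agent identity itself
        "aud": route,              # scope: one route, nothing else
        "iat": now,
        "exp": now + ttl_seconds,  # short-lived by construction
    }
    signing_input = (
        f"{b64url(json.dumps(header).encode())}."
        f"{b64url(json.dumps(claims).encode())}"
    )
    sig = hmac.new(SIGNING_KEY, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"


token = mint_scoped_jwt("alice@example.com", "copilot-agent", "https://github.internal")
```

The point of the sketch is the shape of the claims: a broad, long-lived bearer token is replaced by one that names the user, the agent, the single allowed destination, and a hard expiry.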
An MCP route policy looks like this:
```yaml
routes:
  - from: https://github.localhost.pomerium.io
    to: http://localhost:3020
    policy:
      allow:
        and:
          - domain:
              is: example.com
          - mcp_tool:
              in: ["search", "fetch"]
    mcp: {}
```
Seven lines to say: this agent, acting on behalf of this user, can call the search and fetch tools on this MCP server, and nothing else. Not because the agent asked nicely. Because the proxy will refuse anything else.
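The per-request decision that policy encodes can be sketched as a plain function. This is a hypothetical illustration of the logic, not Pomerium's policy engine; the policy table mirrors the example route above, and the tuple-of-reason return shape is an assumption for the sketch.

```python
# Hypothetical sketch of the per-request decision a policy-enforcing
# proxy makes. Route, domain, and tool names mirror the example policy
# above; this is not Pomerium's actual implementation.

POLICY = {
    "https://github.localhost.pomerium.io": {
        "allowed_domain": "example.com",
        "allowed_tools": {"search", "fetch"},
    }
}


def authorize(route: str, user_domain: str, mcp_tool: str) -> tuple[bool, str]:
    """Evaluate one request: every condition must hold (the `and` in the policy)."""
    rule = POLICY.get(route)
    if rule is None:
        return False, "no policy for route: default deny"
    if user_domain != rule["allowed_domain"]:
        return False, "identity domain not allowed"
    if mcp_tool not in rule["allowed_tools"]:
        return False, f"tool '{mcp_tool}' not on allow-list"
    return True, "allowed"
```

A jailbroken agent asking for `delete_repo` gets a deny and a reason code, no matter how persuasive the prompt that convinced it was.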
Now let's map this onto the paper's taxonomy — honestly, category by category.
Behavioural Control Traps — direct hit. Every example in the paper's action category becomes meaningfully less dangerous when the agent does not own the credentials and the proxy evaluates every action.
Data exfiltration. The M365 Copilot case — a single email causes the agent to ship its entire privileged context to an attacker-controlled Teams endpoint — is the textbook confused-deputy attack. Pomerium's design explicitly addresses this pattern: the agent never holds a broadly-scoped token, tokens are short-lived and scoped to specific routes, and egress to the attacker-controlled endpoint is a policy decision, not a network default. If the agent is not authorized to write to teams.attacker.example.com, it does not matter how thoroughly the LLM was convinced.
Embedded jailbreak sequences. Jailbreaking the model disables its alignment. It does not disable the proxy. Pomerium evaluates the action, not the prompt. As the Pomerium team puts it: "Even if a prompt injection confuses the agent, Pomerium's policy engine still evaluates the action, not just the input. If the agent attempts something unauthorized: it's blocked, it's logged, it's contained."
Sub-agent spawning. If a spawned sub-agent must still authenticate as an identity the policy engine recognizes — and must call tools through the same enforcement layer — the orchestrator takeover attacks in Triedman et al. lose most of their blast radius. The sub-agent inherits no implicit trust; it has to earn authorization per request just like any other client.
Content Injection Traps — limited at the perception layer, decisive at the action layer. Pomerium cannot strip hidden CSS from an HTML page or decode steganography in a PNG. That is a content-scanning problem at a different layer. But every single content injection trap described in the paper only causes harm when the agent acts on the hidden instruction. Web-standard obfuscation triggers tool calls; dynamic cloaking injects commands that tell the agent to exfiltrate environment variables; steganographic payloads jailbreak a multimodal model so it complies with harmful tool calls. Cut off the tool calls and the traps become noise — offensive content that nothing acts on. Pomerium's mcp_tool allow-lists, per-route policies, and token translation are precisely the cutoff.
Cognitive State Traps — partial, but meaningful at the data boundary. Pomerium cannot detect that a RAG corpus has been poisoned. That is a retrieval-layer problem. What Pomerium can do is constrain which retrieval corpora an agent is authorized to read from in the first place, enforce scope boundaries so a poisoned document in a public wiki never reaches a high-privilege agent, and log exactly which retrieval calls preceded any unusual downstream action — giving incident responders a forensic trail the paper explicitly calls out as lacking today. The "Control Agentic Sprawl" primitive — restricting agents by intent, not just identity, and blocking access to high-risk records based on headers or query params — directly addresses the blast radius of a successful poisoning.
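Constraining which corpora an agent may read reduces to a scope check before retrieval. The sketch below is a minimal, hypothetical illustration of that boundary; the agent identities and corpus names are invented for the example.

```python
# Hypothetical sketch: scope retrieval by agent identity, so a poisoned
# document in a low-trust corpus never reaches a high-privilege agent.
# Agent and corpus names are illustrative only.

CORPUS_SCOPES = {
    "support-agent": {"public-wiki", "product-docs"},
    "finance-agent": {"finance-ledger"},  # deliberately excludes the public wiki
}


def authorize_retrieval(agent: str, corpus: str) -> bool:
    """Default deny: an unknown agent, or an out-of-scope corpus, is refused."""
    return corpus in CORPUS_SCOPES.get(agent, set())
```

The control does not detect the poison; it bounds the blast radius, so a tainted public-wiki document can at worst mislead the agent already scoped to the public wiki.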
Semantic Manipulation Traps — out of scope at the model, but action-bound at the edge. Framing effects and persona hyperstition are problems of model reasoning. Pomerium will not make an LLM less persuadable. But semantic manipulation only becomes security-relevant when the agent's degraded reasoning drives a consequential action — writing to production, sending money, exposing data. Those actions route through policy, which does not care about framing.
Systemic Traps — partial, with a novel angle. Pomerium is not going to prevent a flash crash or collusion among independent agents at internet scale. Those are ecosystem-level problems. But within an enterprise, Pomerium's rate limiting, time-based restrictions, and behavioral monitoring are exactly the controls that catch a runaway agent before its "optimal" response to an environmental signal exhausts your infrastructure. If a thousand internal agents suddenly decide to hammer the same internal endpoint because an attacker crafted a congestion signal, the policy layer can throttle, quarantine, and alert. Sybil-style attacks on internal governance workflows become harder when every agent has a federated, unique, auditable service identity rather than sharing a token.
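The throttling half of that story is a classic token bucket. The sketch below is a minimal, single-process illustration of the idea, not Pomerium's rate limiter; the rate and burst numbers are arbitrary assumptions.

```python
import time


class TokenBucket:
    """Minimal token-bucket limiter: a runaway agent hammering one
    endpoint is cut off once it exhausts its burst allowance, and
    recovers only at the configured refill rate."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate             # tokens refilled per second
        self.capacity = burst        # maximum burst size
        self.tokens = float(burst)   # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; otherwise deny the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A thousand agents converging on one endpoint because of a crafted congestion signal each hit their own bucket; the "optimal" stampede degrades into a trickle the infrastructure can absorb while alerts fire.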
Human-in-the-Loop Traps — adjacent. The direct target here is the reviewer's cognitive state, which is outside Pomerium's domain. But the audit trail changes the equation. Approval fatigue is much worse when reviewers cannot tell which approvals led to which consequences. Per-request logs with reason codes are how you rebuild that loop — not as a trap defense, but as the substrate for the human-in-the-loop supervision that the paper says will become increasingly important.
Pomerium's limits are worth stating plainly, because the point is defense in depth, not a silver bullet:
Pomerium will not inspect HTML comments for hidden instructions, or decode pixel steganography, or detect that a Markdown anchor text carries a payload. Content scanners sit in front of the agent; Pomerium sits between the agent and the systems it acts on.
Pomerium will not fine-tune the model, harden it against adversarial inputs, or install constitutional principles.
Pomerium will not clean a poisoned RAG corpus or remove a trigger embedded in few-shot demonstrations.
Pomerium will not reason about the semantic appropriateness of a tool call — only its authorization. If the policy says an agent can call send_email, Pomerium will let it send an email. The policy has to be the right shape.
The mental model is containment: assume the perception, reasoning, and memory layers can be compromised, and architect so that compromise does not translate into unauthorized action. That is what the paper is implicitly asking for when it talks about confused-deputy attacks needing structural defenses, and what Pomerium is explicitly designed to provide.
The paper cites Reddy and Gujral's single-email attack against M365 Copilot as an archetype. Walk through how it looks under a Pomerium-style architecture:
The attacker's email arrives in the user's inbox with a syntactically masked instruction to exfiltrate Copilot's privileged context to teams.attacker.example.com.
Copilot, acting on behalf of the user, is fully convinced. Its intent layer is compromised. It composes the exfiltration request.
The request does not go directly to Teams. It goes to Pomerium, which terminates identity, evaluates the route policy, and sees: destination not on the allow-list for this agent identity. Denied. Logged with full context — user, agent identity, source prompt session, destination, timestamp.
The security team sees a denied exfiltration attempt in their SIEM within minutes, not weeks. The agent's credentials never had egress to the attacker's domain to begin with, because Pomerium issued a scoped JWT, not a general-purpose bearer token.
The attack did not fail because the model was smarter. It failed because the blast radius stopped at the policy boundary.
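The decisive step in that walkthrough is an egress check. The sketch below is a hypothetical rendering of it; the agent name, the allow-listed Microsoft hosts, and the attacker domain are taken from the scenario, and none of this is Pomerium's actual code.

```python
# Hypothetical sketch of the egress decision in the walkthrough above.
# The destination is a policy decision, not a network default: anything
# not explicitly allow-listed for this agent identity is denied.

EGRESS_ALLOW = {
    "m365-copilot": {"teams.microsoft.com", "graph.microsoft.com"},
}


def authorize_egress(agent: str, destination_host: str) -> tuple[bool, str]:
    """Default deny on egress: unknown agents and unlisted hosts are refused."""
    allowed = EGRESS_ALLOW.get(agent, set())
    if destination_host in allowed:
        return True, "destination on allow-list"
    return False, f"egress to {destination_host} denied for {agent}"
```

The model's conviction never enters the function. The compromised intent layer produced a request, and the request failed a set-membership test.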
Three operational implications fall out of the paper's taxonomy combined with what an access layer can actually do:
Treat every agent as a first-class identity. No shared tokens, no baked-in credentials, no "the agent uses the user's session." Federated service identities per agent, per purpose, with scopes tied to task rather than tool, are the precondition for every downstream defense. Pomerium calls this "autonomous, not anonymous" — every agent action tied to a verified identity.
Authorize the action, not the prompt. The paper's central insight is that perception, reasoning, and memory can all be subverted at a distance. The only layer the attacker cannot reach directly is the one that lives in your infrastructure, evaluating every request against your policy. That is where enforcement has to live if it is to be reliable.
Log like you expect to be breached — because you will be. Attribution is one of the three named challenges in the paper. If your agent stack cannot answer "which retrieval call, which tool call, which user session produced this anomalous write?" with a single query, you are not going to survive the first real incident. Per-request decisions with reason codes into your SIEM are not a nice-to-have; they are the primary forensic substrate for agent security.
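A per-request decision record with a reason code is a small, fixed shape. The sketch below shows one way to emit it as SIEM-ready JSON; the field names are illustrative assumptions, not Pomerium's actual log schema.

```python
import json
import time


def decision_log(user: str, agent: str, route: str, tool: str,
                 decision: str, reason: str) -> str:
    """One per-request authorization decision, shaped for SIEM ingestion.
    Field names are illustrative, not a real product's log schema."""
    return json.dumps({
        "ts": time.time(),        # when the decision was made
        "user": user,             # the human the agent acts for
        "agent": agent,           # the agent's own verified identity
        "route": route,           # which upstream was targeted
        "mcp_tool": tool,         # which tool was requested
        "decision": decision,     # "allow" or "deny"
        "reason": reason,         # the reason code for forensics
    })


entry = decision_log("alice@example.com", "copilot-agent",
                     "https://mail.internal", "send_email",
                     "deny", "tool not on allow-list")
```

With records like this, "which tool call, under which user session, produced this anomalous write?" becomes a single query instead of a forensics project.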
"AI Agent Traps" is important because it breaks the field out of the narrow frame of prompt injection and maps an actual attack surface — one that extends from pixel-level steganography up through multi-agent economic equilibria. Read it, then share it with your platform and security teams.
The paper's mitigation section is open-ended by design. Most of its categories — adversarial training, content scanners, ecosystem reputation — are still research problems. But two of its defenses are shipping today: identity-aware access at the action layer, and comprehensive audit logging. They happen to be exactly what Pomerium does.
The web was built for human eyes. The systems your agents act upon do not have to be.