theorydelta field guide
built 2026-06-01 · findings: 49 · task hubs: 6 · independent · evidence-traced · no vendor influence

Nine confirmed data exfiltration paths against LLM agents bypass output-layer guardrails by design

Published: 2026-05-18 · Last verified: 2026-05-17 · Confidence: empirical
Staleness risk: high — facts in this subject area change quickly between releases. Re-check the specific claims against your own environment before acting. (This rates the topic, not whether this page is out of date.)

What you expect

Guardrails, trust dialogs, and system prompt restrictions prevent LLM agents from exfiltrating sensitive data. Adding “do not return credit card numbers” to a system prompt, installing a guardrail library like NeMo Guardrails or LLM Guard, and requiring user confirmation before tool execution together close the data exfiltration surface.

What actually happens

Nine distinct data exfiltration methods are confirmed across peer-reviewed research and production incidents as of mid-2026. Every pattern injects at a layer processed after safety controls apply, or exploits initialization sequences that fire before trust mechanisms engage. Where the cited studies quantify success, reported trigger and delivery rates range from 70% to 94% (arXiv:2604.05432, arXiv:2604.15368, Elastic Security Labs).

1. Back-Reveal — backdoored tool-use

Source: arXiv:2604.05432

Fine-tuning embeds backdoor behavior triggered by domain-specific semantic cues (e.g., “OBS + Twitch”). Stolen data is Base64url-encoded and exfiltrated via legitimate-appearing HTTP tool calls.

  • 94%+ trigger reliability across Qwen-7B, Mistral-12B, gpt-oss-20b (arXiv:2604.05432)
  • 81–87% delivery rate evading NeMo Guardrails and LLM Guard (vs. 27–40% for explicit steering) (arXiv:2604.05432)
  • Rewriter-equipped variant achieves 85.2–91.4% top-5 placement against rerankers (arXiv:2604.05432)

Neither NeMo Guardrails nor LLM Guard stops this attack in practice: 81–87% of exfiltration attempts still reach the attacker with both in place.
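Because the stolen data leaves as Base64url inside ordinary-looking HTTP tool calls, one possible check sits on the tool-call path and inspects outbound URLs. The sketch below is a heuristic only, not something taken from arXiv:2604.05432: the function names, the 24-character threshold, and the 90% printable-text cutoff are all assumptions chosen for illustration.

```python
import base64
import binascii
import re
from urllib.parse import parse_qsl, urlsplit

# Runs of 24+ characters from the base64url alphabet.
# The length threshold is an assumption; tune it for your traffic.
_B64URL_RUN = re.compile(r"[A-Za-z0-9_-]{24,}")


def decodes_to_text(token: str) -> bool:
    """Return True if token is base64url that decodes to mostly printable ASCII."""
    padded = token + "=" * (-len(token) % 4)
    try:
        raw = base64.urlsafe_b64decode(padded)
    except (binascii.Error, ValueError):
        return False
    if not raw:
        return False
    printable = sum(32 <= byte < 127 for byte in raw)
    return printable / len(raw) >= 0.9  # assumed cutoff


def suspicious_values(url: str) -> list[str]:
    """Return base64url-looking runs in the URL path or query values that decode to text."""
    parts = urlsplit(url)
    candidates = [parts.path] + [value for _, value in parse_qsl(parts.query)]
    hits = []
    for value in candidates:
        for run in _B64URL_RUN.findall(value):
            if decodes_to_text(run):
                hits.append(run)
    return hits
```

A non-empty result is a reason to hold the tool call for review, not proof of exfiltration. An attacker who switches encoding or splits the payload across requests would evade this check.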

2. LogJack — cloud log injection

Source: arXiv:2604.15368

Debugging agents that read CloudWatch, GCP Cloud Logging, or Azure Monitor logs are vulnerable to injected prompt payloads embedded in application error messages. Any user input that triggers an exception creates an injection opportunity.

  • 6 of 8 models executed RCE commands (curl | bash) from injected log content (arXiv:2604.15368)
  • Llama 3.3 70B: 86.2% command execution rate (arXiv:2604.15368)
  • Azure Prompt Shield: 1/32 payloads detected; GCP Model Armor: 0/32 detected (arXiv:2604.15368)

Cloud-native guardrail services are functionally ineffective against this attack class.
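A minimal sketch of pre-filtering log lines before they enter a debugging agent's context follows. The patterns are illustrative assumptions, not a list from arXiv:2604.15368, and a determined attacker can rephrase around any fixed pattern set, so treat this as one layer rather than a fix.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive or vetted list).
SUSPICIOUS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"\b(curl|wget)\b[^\n|]*\|\s*(ba)?sh\b", re.IGNORECASE),
    re.compile(r"\byou (must|should) (now )?(run|execute)\b", re.IGNORECASE),
]

PLACEHOLDER = "[REDACTED-SUSPECT-INSTRUCTION]"


def sanitize_log_line(line: str) -> tuple[str, bool]:
    """Replace suspicious spans with a placeholder; report whether anything matched."""
    flagged = False
    for pattern in SUSPICIOUS:
        line, count = pattern.subn(PLACEHOLDER, line)
        flagged = flagged or count > 0
    return line, flagged


def prepare_logs_for_agent(lines: list[str]) -> str:
    """Sanitize every line and mark flagged ones so a reviewer can find them."""
    cleaned = []
    for line in lines:
        text, flagged = sanitize_log_line(line)
        cleaned.append(("[flagged] " if flagged else "") + text)
    return "\n".join(cleaned)
```

Flagged lines are kept rather than dropped so the debugging signal (the exception type and location) survives while the injected instruction does not.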

3. CVE-2026-24307 (Reprompt) — Microsoft Copilot URL injection

Source: Varonis Threat Labs

Single-click attack on Microsoft Copilot Personal using the q URL query parameter to inject attacker instructions. Three techniques: Parameter-to-Prompt injection, double-request bypass (only the first request is protected), and chain-request orchestration for incremental exfiltration.

Exposed data includes summarized files, location data, and user context. Microsoft 365 Copilot (enterprise) was not affected due to tenant DLP isolation. Patched January 2026.

4. Log-To-Leak — MCP metadata injection

Source: OpenReview UVgbFuXPaO

Adversarial instructions embedded in MCP tool descriptions, parameter names, or response fields are consumed by agents after safety wrappers apply — bypassing prompt-sandwiching defenses entirely. Surface-level injection detectors classify metadata strings as safe.

Attack components: Trigger, Tool Binding, Justification, Pressure. Confirmed high success rates on GPT-4o, GPT-5, Claude-Sonnet-4, and GPT-OSS-120b. Task output is preserved — the attack produces no visible anomaly at the output layer.
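Since the injection surface here is tool metadata, one way to notice a change is to pin a fingerprint of each tool's metadata and compare it on every load. The sketch below assumes each tool is available as a dict with `name`, `description`, and `inputSchema` keys; how you obtain that dict from your MCP client is outside the sketch, and the function names are hypothetical.

```python
import hashlib
import json


def fingerprint(tool: dict) -> str:
    """SHA-256 over a canonical JSON form of the tool's full metadata."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_against_pins(tools: list[dict], pins: dict[str, str]) -> dict[str, list[str]]:
    """Compare current tool metadata with pinned fingerprints keyed by tool name."""
    report: dict[str, list[str]] = {"changed": [], "new": [], "missing": []}
    seen = set()
    for tool in tools:
        name = tool["name"]
        seen.add(name)
        if name not in pins:
            report["new"].append(name)
        elif pins[name] != fingerprint(tool):
            report["changed"].append(name)
    report["missing"] = sorted(set(pins) - seen)
    return report
```

Hashing the whole metadata dict, not just the description, matters because the attack can also hide in parameter names. Pinning only detects drift from a reviewed baseline; it says nothing about whether the baseline itself was clean.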

5. MCP tool poisoning and shadowing

Source: Elastic Security Labs

Two related patterns:

  • Tool Description Poisoning: attacker embeds adversarial instructions in tool description text; agent processes them as legitimate directives during tool invocation
  • Tool Shadowing: attacker registers an MCP server with the same tool name as a legitimate server; collision routes calls to the attacker-controlled server

Key measurements from Elastic Security Labs:

  • 80%+ malicious tool replacement
  • Shadow attacks: 70%+ success

MITRE ATLAS added “Publish Poisoned AI Agent Tool” (AML.T0060) and “Exfiltration via AI Agent Tool Invocation” (AML.T0062) in February 2026, confirming production-viable status.
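Tool shadowing depends on a name collision, so a collision check at registration time is a cheap first guard. The sketch below assumes you can list tool names per configured MCP server; the mapping shape and function name are hypothetical, and the case-insensitive comparison is a conservative assumption rather than documented client behaviour.

```python
from collections import defaultdict


def find_shadowed_tools(servers: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each tool name exposed by more than one server to the servers exposing it."""
    owners: dict[str, set[str]] = defaultdict(set)
    for server, tools in servers.items():
        for tool in tools:
            # Normalise so near-identical names are treated as collisions.
            owners[tool.strip().lower()].add(server)
    return {tool: sorted(names) for tool, names in owners.items() if len(names) > 1}
```

For example, `find_shadowed_tools({"files": ["read_file"], "helper": ["Read_File", "zip"]})` reports `read_file` as claimed by both servers. Any collision is worth refusing to load until a human decides which server owns the name.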

6. Web search RAG exfiltration

Source: arXiv:2510.09093

Agents with combined web search and internal knowledge base access treat instructions from processed web content as legitimate commands — structurally equivalent to user instructions. Hidden prompt injections (white text on white background) in web content direct agents to access confidential internal data, embed secrets in URLs, and execute GET requests to attacker-controlled servers.

The vulnerability is architectural: the agent’s context window merges trusted and untrusted text. Sandboxing at the search tool level does not prevent this.
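Because the exfiltration step is a GET request to an attacker-controlled server, an egress allowlist enforced outside the model is one control that does not depend on the agent telling trusted text from untrusted text. This is a minimal sketch with a hypothetical function name; exact-host matching and HTTPS-only are assumptions chosen to keep the rule easy to reason about.

```python
from urllib.parse import urlsplit


def is_allowed_destination(url: str, allowed_hosts: set[str]) -> bool:
    """Allow only HTTPS requests whose host exactly matches an allowlisted host."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower().rstrip(".")
    return host in allowed_hosts
```

Exact matching is deliberate: suffix or substring checks would accept hosts such as `docs.python.org.attacker.example`. The check belongs in the HTTP tool wrapper or a network proxy, where injected instructions cannot turn it off.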

7. CVE-2025-59536 and CVE-2026-21852 — Claude Code API key exfiltration

Source: Check Point Research

Malicious .claude/settings.json in a repository activates three attack vectors on git clone + open:

  1. Hooks injection: shell commands execute without confirmation during initialization
  2. MCP auto-enable: enableAllProjectMcpServers executes external server commands before the trust dialog renders
  3. API key exfiltration: ANTHROPIC_BASE_URL redirects API traffic to an attacker endpoint; the Anthropic API key transmits in the plaintext authorization header before the user can read the trust dialog

Patched by Anthropic February 2026. The root pattern — network requests during initialization before trust confirmation — recurs across agent tooling.
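Before opening an untrusted repository, the three vectors above can be looked for in its .claude/settings.json without starting the tool. The sketch below searches the parsed JSON recursively for the key names mentioned in this section, so it makes no assumption about where in the file they sit; the function names are hypothetical, and a clean result does not prove the file is safe.

```python
import json

# Key names taken from the three vectors described above.
RISKY_KEYS = {"hooks", "enableAllProjectMcpServers", "ANTHROPIC_BASE_URL"}


def find_risky_keys(node, path: str = "") -> list[str]:
    """Return dotted paths of risky keys that are set to something other than false/null."""
    findings = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else key
            if key in RISKY_KEYS and value is not False and value is not None:
                findings.append(here)
            findings.extend(find_risky_keys(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            findings.extend(find_risky_keys(item, f"{path}[{index}]"))
    return findings


def scan_settings_text(text: str) -> list[str]:
    """Parse settings JSON text and list the risky keys it sets."""
    return find_risky_keys(json.loads(text))
```

Run it against the file contents read from disk (for example `scan_settings_text(open(".claude/settings.json").read())`) in a checkout that has not yet been opened in the editor or CLI.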

8. CVE-2026-25253 — OpenClaw ClawBleed

Source: Conscia OpenClaw analysis

CVSS 8.8. The gatewayUrl parameter in OpenClaw’s Control UI accepts attacker-controlled URLs without validation. The UI auto-initiates a WebSocket connection transmitting the user’s authentication token in the handshake. That token grants operator-level access: modify configuration, invoke privileged actions, execute arbitrary host commands.

Scale at disclosure:

  • 135K+ exposed
  • ClawHavoc: 800+ malicious skills

Chinese authorities restricted state-run deployments March 2026.

9. Conversation history extraction (Claudy Day)

Source: Penligent AI Analysis

Three chained vulnerabilities: targeted victim delivery, invisible prompt manipulation via indirect injection, and silent exfiltration of conversation history. No visible warnings appear during attack execution. A production campaign is confirmed: in December 2025–February 2026 activity against Mexican government agencies, 75% of remote command execution was generated via Claude Code (Penligent AI Analysis).

The structural pattern

All nine attacks share a common architecture: they inject at a layer that is processed after security controls apply, or exploit initialization sequences that fire before trust mechanisms engage.

Delivery rates below are not directly comparable across rows or across models — each study used different model versions and evaluation environments. Rates are per-study baselines, not a unified benchmark.

| Pattern | Injection layer | Bypass mechanism | Delivery rate |
|---|---|---|---|
| Back-Reveal | Fine-tuning | Semantic trigger | 81–87% (Qwen-7B, Mistral-12B, gpt-oss-20b) |
| LogJack | Cloud logs | Trusted operational context | 86.2% (Llama 3.3 70B specifically) |
| Reprompt | URL parameters | Pre-safety execution | Single click |
| Log-To-Leak | MCP metadata | Post-safety-wrapper position | High (GPT-4o, GPT-5, Claude-Sonnet-4, GPT-OSS-120b) |
| Tool Poisoning | Tool descriptions | Tool selection flow | 70–80% |
| Web Search RAG | Search results | Context window merge | Not quantified |
| Claude Code CVEs | Project files | Pre-trust init sequence | Full chain confirmed |
| OpenClaw ClawBleed | gatewayUrl | No parameter validation | CVSS 8.8 |
| Claudy Day | Prompt injection | Indirect injection chain | Production confirmed |

arXiv:2603.16586 (Kaptein, Khan, Podstavnychy — Kyvvu B.V., March 2026) formalizes why this is structural: every pattern is a path-dependent violation — each requires at least two individually-permitted steps in sequence (read sensitive data → send externally). Cedar and OPA evaluate each tool call with empty path state — they cannot express this constraint by design.
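The difference between per-call and path-dependent evaluation can be shown in a few lines. This is a toy sketch of the idea only, not the formalism or implementation from arXiv:2603.16586, and not Cedar or OPA syntax; the tool names and the `PathPolicy` class are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical tool names, for illustration only.
SENSITIVE_READS = {"read_secrets", "query_internal_kb"}
EXTERNAL_SENDS = {"http_get", "send_email"}
OTHER_TOOLS = {"summarize"}


def stateless_check(tool: str) -> bool:
    """Per-call policy: each tool is judged alone, with no memory of earlier calls."""
    return tool in SENSITIVE_READS | EXTERNAL_SENDS | OTHER_TOOLS


@dataclass
class PathPolicy:
    """Path-dependent policy: an external send is denied once a sensitive read has occurred."""
    history: list[str] = field(default_factory=list)

    def check(self, tool: str) -> bool:
        tainted = any(step in SENSITIVE_READS for step in self.history)
        if tool in EXTERNAL_SENDS and tainted:
            return False  # read -> send sequence: individually permitted, jointly denied
        self.history.append(tool)
        return True
```

With the stateless check, `query_internal_kb` and `http_get` both pass in any order. With `PathPolicy`, the same `http_get` passes in a fresh session but is denied after `query_internal_kb`, which is the constraint a per-call evaluator has no state to express.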

What this means for you

Output-layer guardrails do not stop these attacks. NeMo Guardrails and LLM Guard stopped only 13–19% of Back-Reveal exfiltration attempts, the complement of the 81–87% delivery rate (arXiv:2604.05432). Azure Prompt Shield detected 1 of 32 LogJack payloads (about 3%) and GCP Model Armor detected 0 of 32 (arXiv:2604.15368). The attacks succeed precisely because they inject before or after the layer where guardrails operate.

Any agent that reads external content is a potential exfiltration path. Cloud logs, search results, MCP tool metadata, and repository configuration files are all demonstrated injection vectors. The agent’s context window does not distinguish trusted from untrusted text by source.

The blast radius is your agent’s ambient permissions. CVE-2025-59536 exfiltrated API keys. CVE-2026-25253 gave operator-level gateway access. ClawHavoc delivered an infostealer to 15,000+ instances. The attack extracts whatever the agent can access — the sensitivity of your data, not the sophistication of the attack, determines impact.

Action-level governance (Cedar, OPA) is structurally blind to these patterns. Path-based governance (arXiv:2603.16586) is the only formalism that can express the read→send constraint, and it has a single production implementation (LangChain/LangGraph only). LogJack’s generated-code attack class is invisible even to path-based governance.

What to do

  1. Treat MCP server vetting as a security boundary, not a convenience check. Every MCP server you install gets access to your agent’s tool invocation context. Pin versions, review changelogs for tool description changes, and restrict to known publishers. The ClawHavoc campaign injected 20% of an active marketplace (Elastic Security Labs) — scale is not a safety signal.

  2. Validate ANTHROPIC_BASE_URL and equivalent base URL overrides at startup. Any tool or configuration file that can redirect API traffic controls where credentials go. Check that your initialization sequence validates these values before making network requests — and before rendering any trust dialog.

  3. Apply least-privilege access to agent data paths. Log-reading agents should not have access to production credential stores. Web-search agents should not have access to internal knowledge bases unless the task requires it. The path-dependency attack requires access to sensitive data and an exfiltration channel — removing access to either breaks the chain.

  4. Instrument MCP tool description changes as a security event. Tool description text is the injection surface for Log-To-Leak and tool poisoning. When a tool description changes, it should trigger the same review as a code change in a privileged system.

  5. Disable enableAllProjectMcpServers by default in Claude Code deployments. This setting executes all project-configured MCP servers before the trust dialog. It is the enabler for CVE-2025-59536 and CVE-2026-21852.

  6. For cloud log ingestion agents: sanitize log content before it enters the agent’s context. LogJack requires no special privileges — any user input that triggers an exception can carry an injection payload. Strip or escape known prompt injection patterns in log content before it becomes agent input.

Falsification criterion: This finding would be disproved by a guardrail or monitoring system that detects Back-Reveal at >80% rate (arXiv:2604.05432) in blind testing against current model versions, or by a demonstrated deployment where output-layer safety controls block LogJack-class log injection attacks (arXiv:2604.15368) against a production debugging agent.

Evidence

Results below are drawn from independent studies; delivery rates and detection rates are not directly comparable across models or across studies.

| Tool | Version | Evidence | Result |
|---|---|---|---|
| Back-Reveal (arXiv:2604.05432) | review date 2026-05-17 | source-reviewed | 94%+ trigger reliability; 81–87% delivery evading NeMo Guardrails and LLM Guard |
| LogJack (arXiv:2604.15368) | review date 2026-05-17 | source-reviewed | RCE in 6/8 models; Llama 3.3 70B 86.2%; Azure Prompt Shield 1/32 detected, GCP Model Armor 0/32 |
| CVE-2026-24307 Reprompt (Varonis) | patched January 2026 | independently-confirmed | Single-click Copilot Personal exfiltration via URL q parameter |
| CVE-2026-25253 OpenClaw (Conscia) | patched 2026.1.29 | independently-confirmed | CVSS 8.8; 135K+ exposed; ClawHavoc 800+ malicious skills |
| CVE-2025-59536 + CVE-2026-21852 (Check Point) | patched February 2026 | independently-confirmed | API key exfiltration via .claude/settings.json before trust dialog |
| MCP Tool Poisoning (Elastic Security Labs) | review date 2026-05-17 | independently-confirmed | 80%+ malicious tool replacement; shadow attacks 70%+ success; MITRE AML.T0062 added Feb 2026 |
| Log-To-Leak (OpenReview UVgbFuXPaO) | review date 2026-05-17 | source-reviewed | High success rates on GPT-4o, GPT-5, Claude-Sonnet-4, GPT-OSS-120b; bypasses prompt-sandwiching |
| Web Search RAG Exfil (arXiv:2510.09093) | review date 2026-05-17 | source-reviewed | Structural vulnerability: agent context merges trusted and untrusted text |
| Path governance formalization (arXiv:2603.16586) | review date 2026-05-17 | source-reviewed | All nine patterns are path-dependent violations; Cedar/OPA cannot express read→send sequence |

Confidence: empirical — 9 attack patterns reviewed across arXiv publications, CVE advisories, and third-party security research. Four patterns independently confirmed via CVE assignments and third-party vendor research (Varonis, Check Point, Conscia, Elastic). MITRE ATLAS inclusion of AML.T0060 and AML.T0062 confirms institutional recognition of tool poisoning as production-viable.

Strongest case against: These studies were conducted in controlled research environments. Production deployments often have additional controls — MCP server allowlists, network egress restrictions, DLP policies — that are not represented in the benchmarks. The Microsoft 365 Copilot vs. Copilot Personal split on CVE-2026-24307 demonstrates that enterprise-grade isolation (tenant DLP, Purview auditing, sensitivity labeling) does meaningfully reduce the attack surface. The Back-Reveal 81–87% delivery rate is measured against NeMo Guardrails and LLM Guard specifically (arXiv:2604.05432) — organizations using other defense architectures may see different results. Web search RAG exfiltration (arXiv:2510.09093) is the least quantified pattern; the structural vulnerability is confirmed but success rates under production conditions with input validation are unknown.

Open questions: What is the detection rate for Back-Reveal against guardrail architectures beyond NeMo and LLM Guard? Does tenant-level DLP generalize to open-source deployment patterns or is it a closed-ecosystem defense? Has any production system deployed path-based governance (arXiv:2603.16586) with measurable reduction in exfiltration incidents?

