Prompt Injection Defense in Depth for Multi-Agent Systems: Why Input Sanitization Is Not a Security Strategy

The Perimeter Illusion

Most production AI systems implement prompt injection defense as input sanitization: scan the user message at the API boundary, strip suspicious patterns, and forward the cleaned text to the model. This approach borrows from web application security thinking -- validate inputs at the edge, trust everything inside.

For single-model systems, perimeter defense provides baseline protection. For multi-agent systems, it is catastrophically insufficient. The attack surface of a multi-agent pipeline is not the user input alone -- it is every data boundary between every component: agent-to-agent messages, tool call responses, retrieved documents, memory reads, and external API payloads.

A prompt injection payload that enters through a tool response bypasses your input sanitizer entirely. A poisoned document in your RAG corpus never touches the edge. An adversarial payload in a database field flows through your pipeline as "trusted internal data." Your perimeter is intact. Your system is compromised.

The Multi-Agent Attack Surface

In a single-model system, you have one trust boundary: user input to model. In a multi-agent system with N agents and M tools, you have approximately N*M+N^2 trust boundaries. Each boundary is a potential injection vector.

Agent-to-agent communication. When Agent A passes context to Agent B, that context may contain injection payloads. If Agent A was tricked into including adversarial instructions in its output, Agent B inherits the compromise. This is prompt injection lateral movement -- the agent equivalent of a network pivot.

Tool output injection. An agent calls a web search tool. The search results contain a page specifically crafted to inject instructions. The agent processes the "tool output" as trusted data because it came from an internal tool call, not from user input. Your input sanitizer never saw it.

RAG document poisoning. Your retrieval pipeline pulls documents from a corpus that includes user-submitted content, scraped web pages, or third-party data sources. A poisoned document retrieved at query time injects instructions directly into the agent's context window.

Memory injection. Agents with persistent memory can be attacked across sessions. An adversarial input in session one plants a payload in memory. In session ten, when that memory is recalled, the payload activates -- long after the original input has been forgotten.

This maps to the broader challenge of AI guardrails in production -- safety mechanisms must operate at every layer, not just the edge.

Why Input Sanitization Fails

Input sanitization fails for multi-agent systems because it assumes a single entry point and a trusted interior. Multi-agent systems have neither.

The encoding problem. Injection payloads can be encoded in ways that bypass pattern matching: base64, unicode homoglyphs, instruction embedding in structured data, or semantic injection that uses natural language indistinguishable from legitimate content. No sanitizer can catch semantic injection because there is no syntactic pattern to match.

The context accumulation problem. Even if each individual message is clean, the accumulated context across multiple agent interactions can construct an injection through composition. Agent A contributes one fragment, Agent B contributes another, and Agent C receives a complete injection payload that no single agent transmitted.

The false negative cost. Aggressive sanitization that strips anything suspicious destroys legitimate data. Production systems cannot afford to corrupt user requests to prevent hypothetical attacks. The tradeoff between false positives (blocking legitimate use) and false negatives (missing attacks) has no satisfactory solution at the perimeter alone.

Defense in Depth Architecture

Production-grade prompt injection defense requires multiple independent layers, each operating on different principles:

Layer 1: Perimeter scanning. Yes, still sanitize inputs -- but recognize this as your weakest layer. It catches naive attacks and script-kiddie payloads. It will not stop determined adversaries.

Layer 2: Privilege separation. Each agent operates with minimum necessary permissions. An agent that only needs read access to a database cannot be tricked into writing. An agent that only calls specific tools cannot be redirected to call others. This limits blast radius when injection succeeds. The principles of capability-based access control are essential here -- RBAC is insufficient for autonomous systems.

Layer 3: Output validation. Before any agent's output is passed to another agent or executed as a tool call, validate it against expected schemas and behavioral bounds. If Agent A is supposed to produce a JSON customer record and instead produces free-text instructions, something went wrong.

Layer 4: Instruction-data separation. Architecturally separate system instructions from data in every context window. Mark data regions explicitly. Use techniques like instruction hierarchy (system > user > retrieved) with strict enforcement. Never allow retrieved content or tool outputs to occupy the same trust level as system instructions.

Layer 5: Behavioral monitoring. Monitor agent behavior for anomalies: unexpected tool calls, outputs that deviate from historical patterns, sudden changes in response style or content. This is the observability layer applied specifically to security -- detecting compromise through behavioral deviation rather than payload pattern matching.

Layer 6: Circuit breaking. When behavioral anomalies are detected, circuit breakers should halt the pipeline before a compromised agent can propagate damage. This is not graceful degradation -- it is active containment.

Practical Implementation Patterns

The sandwich architecture. Wrap every agent call in pre-processing and post-processing guards. Pre-processing: validate inputs against expected schemas, check for known injection patterns. Post-processing: validate outputs against expected formats, check for behavioral deviation, strip any content that matches injection patterns before forwarding.

Context windowing. Limit what each agent can see. An agent that processes customer data should not see system prompts from other agents. An agent that performs tool calls should not see raw user inputs. Minimize the information available for an attacker to manipulate.

Deterministic control planes. As explored in deterministic control planes for agentic AI, the orchestration layer should be deterministic code, not another LLM. A compromised agent cannot inject instructions into a state machine. The control plane decides what happens next based on structured outputs, not free-text reasoning.

Canary tokens. Embed invisible canary tokens in system prompts and data regions. If an agent's output contains a canary token from a data region, you know instruction-data separation was violated. This gives you detection even when prevention fails.

Multi-model consensus. For high-security operations, route the same request through multiple models and compare outputs. Injection payloads that work on one model often fail on another. Divergent outputs signal potential compromise.

The Governance Dimension

Prompt injection defense is not purely an engineering problem. It is a governance problem that requires audit trails and organizational accountability.

Every agent interaction must be logged. When an incident occurs, you need the full chain of context: what entered the system, how it propagated, which agents processed it, and what outputs were produced. Without comprehensive logging, post-incident analysis is impossible.

Red team continuously. Production multi-agent systems need continuous adversarial testing, not one-time penetration tests. The attack landscape evolves as models change and new techniques emerge. What was safe yesterday may be vulnerable today after a model provider update -- a risk explored in AI model supply chain security.

Define acceptable risk. Not every injection attempt needs prevention. Some are low-impact (an agent produces slightly off-topic output) while others are catastrophic (an agent exfiltrates data or executes unauthorized actions). Defense investment should be proportional to impact.

What Most Teams Get Wrong

The most common failure mode is treating prompt injection as a solved problem after implementing input sanitization. Teams deploy a regex-based scanner, pass a penetration test, and move on. Six months later, their multi-agent system is processing poisoned RAG documents and executing adversarial instructions that entered through tool outputs.

The second failure mode is security theater: implementing elaborate defenses that look impressive but operate at the wrong layer. A sophisticated input scanner that checks for injection patterns is useless against a poisoned document that enters through the retrieval pipeline.

The third failure mode is treating security as separate from architecture. Defense in depth is not a bolt-on -- it must be designed into the agent communication patterns, the orchestration layer, and the data flow from the beginning. Retrofitting security into a multi-agent system that was built with implicit trust between components is exponentially harder than building it in from the start.

The Path Forward

Production multi-agent security requires abandoning the perimeter model entirely and adopting zero-trust principles: never trust any data regardless of source, validate at every boundary, minimize privileges, and monitor continuously.

This is expensive. It adds latency (validation takes time), complexity (multiple security layers to maintain), and operational overhead (monitoring and incident response). But the alternative -- a multi-agent system that trusts its own internal communications -- is a system waiting to be compromised.

The organizations building production multi-agent systems today are learning what web security learned two decades ago: the perimeter is an illusion, trust is a vulnerability, and defense only works in depth.