Request Coalescing for AI Agent Systems: Why Duplicate Tool Calls Waste 40% of Your Compute Budget

The Duplicate Request Problem

Run an agent fleet at scale and watch your API gateway logs. You will see the same external API call -- identical parameters, identical payload -- executed dozens of times within the same second. Different agent sessions, different users, same request. Each one hits your rate limits independently. Each one consumes tokens independently. Each one adds latency independently.

This is not a bug in your agent logic. It is a missing infrastructure primitive. Your agents are doing exactly what they are told: when they need data, they fetch it. The problem is that nobody told them -- or the infrastructure beneath them -- that another agent already made the same request 200 milliseconds ago and the response is in flight.

Request coalescing solves this by intercepting identical in-flight requests and merging them into a single upstream call. Every waiting caller gets the same response. The external API sees one request instead of fifty. Your compute bill drops. Your rate limit headroom expands. Your p99 latency improves because fewer requests compete for the same connection pool.

Why Agent Systems Produce Massive Duplication

Traditional web applications have natural deduplication through caching layers and user session isolation. Agent systems are different:

Concurrent session explosion. A single user might spawn multiple agent sessions handling parallel subtasks. A platform with 1,000 concurrent users might have 5,000 active agent sessions -- all potentially requesting the same reference data, the same API lookups, the same model completions with identical prompts.

Deterministic tool selection. When agents use semantic routing for tool selection, similar user intents produce identical tool call sequences. Ten users asking similar questions trigger ten identical API calls within the same time window.

Retry amplification. When a tool call times out, agents retry. If the timeout was caused by upstream pressure from duplicate requests, the retries compound the problem. Ten agents timing out simultaneously produce ten retries, so twenty identical requests have now hit an already-overloaded endpoint. This connects directly to why circuit breakers are essential in agent pipelines -- but circuit breakers prevent cascading failure without eliminating the underlying waste.

Shared context retrieval. RAG-based agents serving similar queries retrieve identical document chunks. The embedding lookup, the vector search, the document fetch -- all duplicated across sessions that could share a single retrieval result.

Coalescing Architecture Patterns

Pattern 1: In-Flight Request Deduplication

The simplest coalescing pattern maintains a map of currently in-flight requests keyed by a normalized request hash. When a new request arrives:

Compute request hash (method + URL + normalized parameters)
Check if an identical request is already in flight
If yes: attach this caller to the existing request future
If no: issue the request and register it in the in-flight map
When the response arrives: resolve all attached futures with the same response
Remove the entry from the in-flight map

This pattern requires zero changes to agent logic. It operates at the HTTP client layer or the tool execution middleware. The agent never knows its request was coalesced -- it receives the response exactly as if it made the call independently.
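The steps above can be sketched in a few lines of asyncio. This is a minimal illustration, not a reference implementation: the class name `InFlightCoalescer`, the `fetch` method, and the `upstream` callable are all hypothetical, and the caller is assumed to supply an already-normalized key.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class InFlightCoalescer:
    """Merge concurrent calls that share a key into one upstream call (hypothetical helper)."""

    def __init__(self) -> None:
        # Normalized request key -> task running the single upstream call.
        self._in_flight: Dict[str, "asyncio.Future[Any]"] = {}

    async def fetch(self, key: str, upstream: Callable[[], Awaitable[Any]]) -> Any:
        task = self._in_flight.get(key)
        if task is None:
            # No identical request in flight: issue it and register it.
            task = asyncio.ensure_future(upstream())
            self._in_flight[key] = task
            # Remove the entry once the response (or error) has arrived.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared call.
        return await asyncio.shield(task)
```

Every caller, including the first, awaits the same task, so a success or a failure reaches all of them identically. Fifty concurrent `fetch` calls with the same key result in one invocation of `upstream`.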

Pattern 2: Time-Window Batching

Some APIs perform better with batched requests than individual calls. Time-window batching collects requests over a short window (10-50ms) and combines them into a single batch call:

Request arrives, gets queued in a time-window buffer
Buffer flushes on window expiry or when batch size limit is reached
Single batch API call executes
Response is demultiplexed back to individual callers

This pattern works exceptionally well for embedding APIs, search APIs, and any endpoint that supports batch operations. The 10-50ms of added latency is rarely noticeable to users, but the cost reduction can reach 60-80%.
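A rough sketch of the buffer-and-flush loop follows. The names `WindowBatcher`, `submit`, and `batch_call` are made up for illustration, and the sketch assumes the batch endpoint returns one result per input, in the same order.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional, Set, Tuple


class WindowBatcher:
    """Collect items for a short window, then issue one batch call (hypothetical helper)."""

    def __init__(self, batch_call: Callable[[List[Any]], Awaitable[List[Any]]],
                 window_s: float = 0.02, max_batch: int = 64) -> None:
        self._batch_call = batch_call
        self._window_s = window_s
        self._max_batch = max_batch
        self._pending: List[Tuple[Any, "asyncio.Future[Any]"]] = []
        self._timer: Optional[asyncio.TimerHandle] = None
        self._tasks: Set["asyncio.Future[None]"] = set()

    async def submit(self, item: Any) -> Any:
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self._pending.append((item, fut))
        if len(self._pending) >= self._max_batch:
            self._flush()  # size limit reached: flush immediately
        elif self._timer is None:
            self._timer = loop.call_later(self._window_s, self._flush)
        return await fut

    def _flush(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._pending = self._pending, []
        if batch:
            task = asyncio.ensure_future(self._run(batch))
            self._tasks.add(task)  # hold a reference until the task finishes
            task.add_done_callback(self._tasks.discard)

    async def _run(self, batch: List[Tuple[Any, "asyncio.Future[Any]"]]) -> None:
        try:
            results = await self._batch_call([item for item, _ in batch])
            if len(results) != len(batch):
                raise ValueError("batch endpoint returned a mismatched result count")
            # Demultiplex: the i-th result goes back to the i-th caller.
            for (_, fut), result in zip(batch, results):
                if not fut.done():
                    fut.set_result(result)
        except Exception as exc:
            for _, fut in batch:
                if not fut.done():
                    fut.set_exception(exc)
```

The `max_batch` cap matters as much as the window: most batch endpoints limit input count, and an uncapped buffer turns a traffic spike into one oversized request.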

Pattern 3: Speculative Coalescing With TTL

For data that changes infrequently, extend coalescing beyond in-flight requests to include recently-completed requests:

Request completes, response is stored in a short-TTL cache (1-30 seconds)
Subsequent identical requests within the TTL window receive the cached response immediately
After TTL expiry, the next request goes upstream and refreshes the cache

This bridges the gap between request coalescing and traditional caching. The TTL is short enough that staleness is rarely a concern, but long enough to eliminate burst duplication during high-traffic periods.
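As a sketch of the short-TTL layer on its own, the snippet below keeps completed responses for a few seconds. `ShortTTLCache` and `get_or_fetch` are illustrative names, the clock is injectable only to make the behavior easy to test, and the sketch is synchronous and unbounded for brevity; a real deployment would likely sit it behind in-flight deduplication and cap its size.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ShortTTLCache:
    """Serve recently completed responses for a few seconds (hypothetical helper)."""

    def __init__(self, ttl_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        # key -> (expiry time, cached response)
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, upstream: Callable[[], Any]) -> Any:
        entry = self._entries.get(key)
        if entry is not None and self._clock() < entry[0]:
            return entry[1]  # still fresh: no upstream call
        # Expired or missing: go upstream and refresh the entry.
        # If upstream() raises, nothing is stored, so failures are not cached.
        value = upstream()
        self._entries[key] = (self._clock() + self._ttl_s, value)
        return value
```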

Implementation Considerations

Request normalization is critical. Two requests that are semantically identical but structurally different (different parameter ordering, different whitespace, timestamps in headers) must hash to the same key. Build normalization that strips volatile fields and canonicalizes the remainder.
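One possible shape for that normalization is sketched below. The function name `coalescing_key` and the `VOLATILE_HEADERS` set are assumptions chosen for illustration; which fields count as volatile depends on your upstream APIs. Sorting query parameters also assumes the upstream does not care about their order.

```python
import hashlib
import json
from typing import Any, Mapping, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: headers that differ per call without changing the response.
VOLATILE_HEADERS = {"date", "x-request-id", "traceparent", "user-agent"}


def coalescing_key(method: str, url: str,
                   headers: Optional[Mapping[str, str]] = None,
                   body: Any = None) -> str:
    """Hash a request into a key that ignores ordering, casing, and volatile fields."""
    parts = urlsplit(url)
    # Canonical query string: parameters sorted by name, then value.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical_url = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    kept_headers = sorted(
        (name.lower(), value.strip())
        for name, value in (headers or {}).items()
        if name.lower() not in VOLATILE_HEADERS
    )
    # JSON-compatible bodies are serialized with sorted keys and no whitespace.
    canonical_body = (
        "" if body is None else json.dumps(body, sort_keys=True, separators=(",", ":"))
    )
    material = json.dumps([method.upper(), canonical_url, kept_headers, canonical_body])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Headers that are not on the volatile list, including `Authorization`, stay in the key, so requests made with different credentials hash differently.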

Error propagation must be careful. When a coalesced request fails, all waiters receive the failure. But should a transient failure (network blip) be propagated to fifty callers, or should the system retry transparently? The answer depends on idempotency guarantees -- a consideration deeply explored in idempotency patterns for agent actions.
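If you decide to retry transparently, one option is to wrap the single upstream call so that transient errors are retried before any waiter sees a failure. The sketch below assumes the call is a read that is safe to repeat; the function name, the choice of which exceptions count as transient, and the backoff values are all illustrative.

```python
import asyncio
from typing import Any, Awaitable, Callable, Tuple, Type


async def call_with_transient_retry(
    upstream: Callable[[], Awaitable[Any]],
    transient: Tuple[Type[BaseException], ...] = (ConnectionError, asyncio.TimeoutError),
    attempts: int = 3,
    base_delay_s: float = 0.05,
) -> Any:
    """Retry transient failures of one upstream call; re-raise everything else."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await upstream()
        except transient:
            if attempt == attempts:
                raise  # retries exhausted: every waiter now sees the failure
            # Exponential backoff: base, 2x base, 4x base, ...
            await asyncio.sleep(base_delay_s * 2 ** (attempt - 1))
```

Because the retry happens inside the one coalesced call, fifty waiters produce at most `attempts` upstream requests rather than fifty independent retry loops.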

Memory pressure from waiting futures. Under extreme load, thousands of callers might attach to a single in-flight request. Bound the maximum waiters per request and reject excess callers with backpressure signals rather than accumulating unbounded memory. The backpressure patterns for agent systems apply directly here.
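A waiter cap can be added with one counter per in-flight key. The sketch below is self-contained and hypothetical: `BoundedCoalescer`, `TooManyWaiters`, and the default limit are illustrative, and a real system would map the exception onto whatever backpressure signal it already uses.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class TooManyWaiters(Exception):
    """Backpressure signal: this in-flight request already has enough waiters."""


class BoundedCoalescer:
    def __init__(self, max_waiters: int = 100) -> None:
        self._max_waiters = max_waiters
        self._in_flight: Dict[str, "asyncio.Future[Any]"] = {}
        self._waiters: Dict[str, int] = {}

    async def fetch(self, key: str, upstream: Callable[[], Awaitable[Any]]) -> Any:
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.ensure_future(upstream())
            self._in_flight[key] = task
            self._waiters[key] = 0
            task.add_done_callback(lambda _t: self._cleanup(key))
        if self._waiters[key] >= self._max_waiters:
            # Reject instead of queueing: memory per key stays bounded.
            raise TooManyWaiters(key)
        self._waiters[key] += 1
        return await asyncio.shield(task)

    def _cleanup(self, key: str) -> None:
        self._in_flight.pop(key, None)
        self._waiters.pop(key, None)
```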

Observability must distinguish coalesced from direct requests. Your monitoring needs to show both the logical request count (what agents requested) and the physical request count (what actually hit the upstream). The ratio between these numbers is your coalescing efficiency metric. Without this distinction, you cannot measure whether coalescing is actually working -- a fundamental principle of AI systems observability.
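In its simplest form this is two counters. The sketch below uses a hypothetical `CoalescingStats` class and defines the coalescing ratio as the share of logical requests that never reached the upstream; in practice you would export the two counters to your metrics system and compute the ratio there.

```python
from dataclasses import dataclass


@dataclass
class CoalescingStats:
    logical: int = 0   # requests the agents asked for
    physical: int = 0  # requests that actually went upstream

    def record(self, coalesced: bool) -> None:
        self.logical += 1
        if not coalesced:
            self.physical += 1

    @property
    def coalescing_ratio(self) -> float:
        """Fraction of logical requests served without a separate upstream call."""
        if self.logical == 0:
            return 0.0
        return 1.0 - self.physical / self.logical
```

For example, fifty logical requests served by one physical request give a ratio of 0.98.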

Measuring Impact

Deploy coalescing in shadow mode first: log what would have been coalesced without actually deduplicating. This gives you baseline metrics:

Coalescing ratio: Percentage of requests that would have been deduplicated (typical range: 20-60% for agent systems)
Cost reduction projection: Coalescing ratio multiplied by logical request volume and per-request cost
Latency improvement projection: Reduced queue depth at upstream APIs translates to lower p50 and p99
Rate limit headroom gained: Fewer physical requests means more capacity for genuinely unique requests
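A shadow-mode report can be computed offline from gateway logs. The sketch below assumes each log entry is a `(start_time_s, duration_s, key)` tuple with an already-normalized key, and that every request costs the same flat amount; the function name and the report fields are illustrative.

```python
from typing import Dict, Iterable, Tuple

LogEntry = Tuple[float, float, str]  # (start_time_s, duration_s, normalized key)


def shadow_report(log: Iterable[LogEntry], cost_per_request: float) -> Dict[str, float]:
    """Estimate what in-flight deduplication would have saved, without changing traffic."""
    entries = sorted(log)
    in_flight_until: Dict[str, float] = {}
    coalescible = 0
    for start, duration, key in entries:
        if start < in_flight_until.get(key, float("-inf")):
            # An identical request was still in flight: this one would have attached.
            coalescible += 1
        else:
            # This request would have gone upstream and become the in-flight leader.
            in_flight_until[key] = start + duration
    total = len(entries)
    return {
        "logical": total,
        "physical": total - coalescible,
        "coalescing_ratio": coalescible / total if total else 0.0,
        "projected_savings": coalescible * cost_per_request,
    }
```

With five logged requests of which two overlap an identical in-flight call, the report shows a coalescing ratio of 0.4 and projected savings of two requests' worth of cost.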

Production agent systems we have observed typically see 30-45% reduction in upstream API calls after implementing basic in-flight deduplication, with cost savings scaling linearly. Adding time-window batching pushes this to 50-70% for embedding and search workloads.

When Coalescing Breaks

Not every request can be safely coalesced:

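Writes and mutations. Never coalesce POST/PUT/DELETE requests that create side effects. Two agents both wanting to send an email should send two emails, not one. Coalescing must be restricted to read-only operations. Idempotency alone is not enough: an idempotent write such as a DELETE still changes state on behalf of a specific caller.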

User-specific authorization. Two requests to the same endpoint with different authorization tokens are not identical even if the parameters match. The coalescing key must include authorization context to prevent cross-tenant data leakage.

Time-sensitive data. If the response is expected to change between requests (real-time pricing, live inventory), coalescing with any TTL introduces staleness risk. Use pure in-flight deduplication only -- no speculative caching.
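These three rules can be encoded as a small policy check in front of the coalescing layer. Everything in the sketch below is hypothetical: the `CoalesceMode` enum, the `declared_read_only` flag (for read-style tools that happen to use POST, such as some search endpoints), and the `auth_scope` string, which stands in for whatever tenant or token identity your system uses.

```python
import hashlib
import json
from enum import Enum


class CoalesceMode(Enum):
    NONE = "none"                        # always go upstream
    IN_FLIGHT_ONLY = "in_flight_only"    # share in-flight responses, no caching
    IN_FLIGHT_AND_TTL = "in_flight_ttl"  # also serve recent responses

# Assumption: only these HTTP methods are treated as read-only by default.
READ_ONLY_METHODS = {"GET", "HEAD"}


def coalesce_mode(method: str, time_sensitive: bool,
                  declared_read_only: bool = False) -> CoalesceMode:
    """Pick how aggressively a request may be coalesced."""
    if method.upper() not in READ_ONLY_METHODS and not declared_read_only:
        return CoalesceMode.NONE  # writes and mutations are never merged
    if time_sensitive:
        return CoalesceMode.IN_FLIGHT_ONLY  # no TTL cache for live data
    return CoalesceMode.IN_FLIGHT_AND_TTL


def scoped_key(request_key: str, auth_scope: str) -> str:
    """Bind a request key to its authorization context to avoid cross-tenant sharing."""
    material = json.dumps([auth_scope, request_key])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Defaulting to `NONE` for anything not known to be read-only keeps mistakes on the expensive side rather than the unsafe side.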

The Compound Effect

Request coalescing is not glamorous infrastructure. It rarely appears in architecture diagrams or conference talks. But for agent systems operating at scale, it is the difference between sustainable unit economics and runaway costs. Combined with semantic caching for LLM responses and cost engineering discipline at the model layer, coalescing completes the trifecta of agent infrastructure that makes production economics viable.

The teams running profitable agent systems at scale all have some form of request coalescing -- whether they call it that or not. The teams drowning in API bills and rate limit errors are the ones treating every agent session as an isolated system that owns its own connection to the world. In production, sharing is not optional. It is infrastructure.