Hierarchical agent teams work at depth 2, but only if you compress context at every boundary

Published: 2026-05-14 · Last verified: 2026-05-14 · Confidence: empirical

Fact-checked 2026-05-14 · 0 corrections

From Theory Delta | Methodology | Published 2026-05-14

What you expect

Hierarchical agent orchestration — a coordinator managing other coordinators — should work by wiring existing frameworks together. If LangGraph supports subgraphs and AutoGen supports nested teams, depth-2 or depth-3 orchestration should follow naturally from composing those primitives. The framework does the heavy lifting.

What actually happens

Depth > 1 orchestration works in multiple frameworks, but with one invariant that no framework enforces automatically: every level boundary must compress context. The coordinator at level N must not receive the full conversation history of level N+1. Every reviewed implementation compresses at each transition — the mechanism differs by framework, but the rule is universal.

Five source-reviewed frameworks confirm this. A sixth entry, Claude Code, is the failure case: its subagents cannot nest further.

Framework	Mechanism	Depth Confirmed	Context Solution
LangGraph	Compiled subgraph as node	Unlimited (no cap)	Manual boundary compression — no guardrail if omitted
AutoGen	SocietyOfMindAgent wraps team as single agent	3 levels (confirmed)	LLM synthesis gateway
Google ADK	`sub_agents` field, recursively defined	Unlimited (no cap)	InvocationContext (implicit)
Swarms	SwarmRouter + HierarchicalSwarm	2 native, 3 composable	Parallel branches
CrewAI	Hierarchical process (1 manager + N workers) is 2-level; Flow is a separate event-driven layer — arbitrary crew chaining via @router, loops, conditional branching	2 (hierarchical process) / arbitrary (Flow)	Flow state passing
Claude Code (Agent tool)	Coordinator spawns subagents; subagent-to-subagent nesting not supported	2 from coordinator	N/A — Agent tool unavailable inside subagents

Three boundary compression mechanisms

1. Manual last-message selection (LangGraph)

The wrapper function is the compression point. Passing only the final message in and only the final message out keeps context bounded. Forgetting this produces a sub-team whose full internal deliberation arrives at the top coordinator:

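from langchain_core.messages import HumanMessage
from langgraph.types import Command

def call_research_team(state: State) -> Command:
    # Compression point: only the latest message goes in, only the final message comes out.
    response = research_graph.invoke({"messages": state["messages"][-1]})
    return Command(update={"messages": [HumanMessage(content=response["messages"][-1].content)]})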

LangGraph provides no guardrail — the boundary compression is the developer’s responsibility.

2. LLM synthesis gateway (AutoGen SocietyOfMindAgent)

An entire team is wrapped as a single agent. After the inner team finishes, a dedicated LLM call synthesizes the full internal transcript into one coherent response. The outer coordinator receives a clean answer, not raw deliberation. The inner team resets after each invocation. This is the most robust mechanism: even noisy inner team output is distilled before crossing the boundary. The cost is one additional LLM call per level transition.
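The gateway pattern can be sketched without any framework. This is a minimal illustration, not AutoGen's API: `TeamAsAgent`, `fake_team`, and `fake_synthesis` are hypothetical names, and the synthesis step is a stub standing in for the one extra LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TeamAsAgent:
    """Wraps an inner team so the outer coordinator sees a single reply."""

    # Hypothetical inner team: takes a task, returns its full transcript.
    run_team: Callable[[str], List[str]]
    # Hypothetical synthesis step; in practice this would be one LLM call.
    synthesize: Callable[[str, List[str]], str]
    transcript: List[str] = field(default_factory=list)

    def reply(self, task: str) -> str:
        self.transcript = self.run_team(task)
        answer = self.synthesize(task, self.transcript)
        self.transcript = []  # reset inner state after each invocation
        return answer


def fake_team(task: str) -> List[str]:
    # Stand-in for inner deliberation: nine working turns plus a final line.
    return [f"turn {i}: thinking about {task}" for i in range(9)] + ["final: 42"]


def fake_synthesis(task: str, transcript: List[str]) -> str:
    # Stand-in for the LLM call: keep only the line marked as final.
    finals = [line for line in transcript if line.startswith("final:")]
    return finals[-1] if finals else "no result"


agent = TeamAsAgent(run_team=fake_team, synthesize=fake_synthesis)
outer_view = agent.reply("estimate cost")  # one message crosses the boundary
```

The outer coordinator only ever holds `outer_view`. The ten inner messages never cross the boundary, and the wrapper starts clean on the next call.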

3. Shared state schema with level-aware fields (ADK/LangGraph)

All levels operate on the same schema but write to different fields. The top coordinator reads `result`, not `subtask_3_result`. No level overwrites another level’s fields. This eliminates context accumulation entirely but requires designing the schema up front.
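A minimal sketch of level-aware fields, assuming a plain `TypedDict` and a hand-rolled ownership check. Neither ADK nor LangGraph is claimed to enforce ownership this way. `PipelineState`, `OWNED_FIELDS`, and `apply_update` are hypothetical names, and only `result` and `subtask_3_result` come from the text above.

```python
from typing import Dict, FrozenSet, TypedDict, cast


class PipelineState(TypedDict, total=False):
    task: str              # written by the top coordinator
    result: str            # written by the mid-level coordinator
    subtask_1_result: str  # written by workers
    subtask_2_result: str
    subtask_3_result: str


# Hypothetical ownership map: which level may write which fields.
OWNED_FIELDS: Dict[str, FrozenSet[str]] = {
    "top": frozenset({"task"}),
    "mid": frozenset({"result"}),
    "worker": frozenset({"subtask_1_result", "subtask_2_result", "subtask_3_result"}),
}


def apply_update(state: PipelineState, level: str, update: Dict[str, str]) -> PipelineState:
    """Merge an update, rejecting writes to fields another level owns."""
    illegal = set(update) - OWNED_FIELDS[level]
    if illegal:
        raise PermissionError(f"{level} may not write {sorted(illegal)}")
    return cast(PipelineState, {**state, **update})


state: PipelineState = {}
state = apply_update(state, "top", {"task": "survey vendors"})
state = apply_update(state, "worker", {"subtask_3_result": "vendor C: 12 pages of notes"})
state = apply_update(state, "mid", {"result": "vendor C is cheapest"})
top_reads = state["result"]  # the top coordinator never touches subtask_3_result
```

The upfront cost is the ownership map: every field needs exactly one writer level before any team is wired in.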

What breaks at depth 3

Context accumulation. Without explicit compression, transcripts add up at every level: 3 levels × 10 turns each puts 30+ messages in front of the top coordinator, and more once sub-teams fan out. Without one of the three mechanisms above, depth 3 is infeasible in practice.
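The arithmetic can be checked with a small counting model. It assumes a single chain of levels with no fan-out, a fixed number of turns per level, and that compression forwards exactly one message per boundary. `messages_at_top` is a hypothetical helper, not part of any framework.

```python
def messages_at_top(levels: int, turns_per_level: int, compress: bool) -> int:
    """Messages the top coordinator holds after one pass up a single chain."""
    held = 0
    for _ in range(levels):  # walk from the deepest level upward
        if held == 0:
            forwarded = 0    # the deepest level receives nothing from below
        elif compress:
            forwarded = 1    # one summary message crosses the boundary
        else:
            forwarded = held  # the full transcript crosses the boundary
        held = turns_per_level + forwarded
    return held


uncompressed = messages_at_top(levels=3, turns_per_level=10, compress=False)  # 30
compressed = messages_at_top(levels=3, turns_per_level=10, compress=True)     # 11
```

Under these assumptions the compressed count stays at `turns_per_level + 1` however deep the chain goes, while the uncompressed count grows with every added level.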

Coordinator confusion. The mid-level coordinator must reason about its obligation upward and its orchestration downward simultaneously. LLMs in this role tend to either over-report (bloating context) or under-report (losing sub-results in synthesis). No reviewed framework prevents this automatically.

Cost multiplication. Depth 3 with 3 workers per level and 3 feedback loops requires at least 81 LLM calls. Model stratification (a frontier model at level 1, cheaper models at levels 2–3) is a required mitigation.
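The 81-call figure is stated without a derivation. One reading that reproduces it, offered as an assumption rather than the author's method, is full fan-out at every level (3 × 3 × 3 leaf-level paths) with each path re-run once per feedback loop. `min_leaf_calls` is a hypothetical helper for that reading.

```python
def min_leaf_calls(workers_per_level: int, depth: int, feedback_loops: int) -> int:
    # Assumed model: full fan-out at every level, every leaf path re-run per loop.
    return workers_per_level ** depth * feedback_loops


depth_3_calls = min_leaf_calls(workers_per_level=3, depth=3, feedback_loops=3)  # 81
depth_2_calls = min_leaf_calls(workers_per_level=3, depth=2, feedback_loops=3)  # 27
```

Under this model each added level multiplies the call count by the fan-out, which is why the cheapest model that can do the job belongs at the deepest level.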

Termination signal loss. A failure at level 3 must propagate correctly through level 2 to reach level 1. Without explicit error channels, the mid-coordinator may summarize a failure as success. LangGraph uses typed Command returns for this; AutoGen relies on termination conditions configured at each level.
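A typed error channel can be as small as a status field that travels next to the summary. The sketch below is framework-neutral: it is not LangGraph's `Command` or AutoGen's termination conditions, and `Status`, `BoundaryResult`, and `mid_coordinator` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    OK = "ok"
    FAILED = "failed"


@dataclass(frozen=True)
class BoundaryResult:
    status: Status  # machine-readable, never rewritten by summarization
    summary: str    # natural-language text, free to be compressed


def mid_coordinator(worker_results: List[BoundaryResult]) -> BoundaryResult:
    """Summarize worker output, but derive status from the typed field."""
    any_failed = any(r.status is Status.FAILED for r in worker_results)
    summary = "; ".join(r.summary for r in worker_results)
    return BoundaryResult(
        status=Status.FAILED if any_failed else Status.OK,
        summary=summary,
    )


workers = [
    BoundaryResult(Status.OK, "parsed 3 files"),
    BoundaryResult(Status.FAILED, "API timeout on vendor lookup"),
]
upward = mid_coordinator(workers)  # status is FAILED regardless of summary wording
```

Because the status is computed from the typed field and not from the text, a mid-level summary that reads as optimistic cannot turn a failure into a success on its way up.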

True recursion does not exist

No reviewed framework has an agent that spawns a copy of itself with a sub-task and re-joins. What the ecosystem calls “recursive” orchestration is fixed 2–3 level structures with distinct agent types at each level. Arbitrary nesting — same agent type at every depth — has not been implemented in any reviewed framework.

What this means for you

If you are building a depth-2 pipeline today, the choice of framework matters less than whether you have boundary compression. A LangGraph subgraph without a compression wrapper will pass full sub-team transcripts to your top coordinator, burning tokens and degrading decision quality as conversation history grows. The failure is silent — the pipeline works, it just becomes progressively more expensive and unreliable.

If you are attempting depth 3, the cost and reliability properties change qualitatively, not just quantitatively. Budget for the extra LLM synthesis call per boundary and for model stratification. Without both, depth-3 pipelines degrade faster than experience with depth-2 failures would predict.

AutoGen’s SocietyOfMindAgent is the only reviewed primitive that makes the boundary problem architectural rather than discipline-dependent. Wrapping a team as an agent means compression is built into the abstraction. The other reviewed frameworks leave compression as an implementation detail the developer must not forget.

What to do

For depth-2 pipelines: Pick your compression mechanism before wiring teams together, not after. LangGraph’s subgraph pattern requires explicit last-message selection at both entry and exit. AutoGen’s SocietyOfMindAgent handles this for you.
For depth-3 pipelines: Use model stratification from the start. Assign a frontier model to the level-1 coordinator and cheaper models to levels 2–3. Add typed error channels at each boundary — do not rely on natural-language error propagation across levels.
For Claude Code specifically: A main session can spawn subagents via the Agent tool, which gives depth 2 from the coordinator. Subagents cannot spawn further subagents, so depth-3+ orchestration still requires an external coordinator (LangGraph, AutoGen) calling Claude Code instances as black-box workers.
For team integration: Use MCP as the boundary interface. The level-1 coordinator calls level-2 teams via MCP tools. The coordinator is decoupled from whether the implementation is a LangGraph subgraph, an AutoGen team, or a Claude Code instance. The interface is stable; the implementation can evolve independently.
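The decoupling in the last recommendation can be illustrated with a plain interface. This sketch does not use the MCP SDK. It uses a `typing.Protocol` to stand in for the tool boundary, and `TeamTool`, `SubgraphTeam`, `RemoteTeam`, and `coordinator` are hypothetical names.

```python
from typing import Dict, Protocol


class TeamTool(Protocol):
    """Stable boundary: a task string in, one answer string out."""

    def call(self, task: str) -> str: ...


class SubgraphTeam:
    # Stand-in for a team implemented as an in-process subgraph.
    def call(self, task: str) -> str:
        return f"[subgraph] done: {task}"


class RemoteTeam:
    # Stand-in for a team running behind a remote tool server.
    def call(self, task: str) -> str:
        return f"[remote] done: {task}"


def coordinator(tools: Dict[str, TeamTool], tool_name: str, task: str) -> str:
    # The coordinator depends only on the interface, not the implementation.
    return tools[tool_name].call(task)


before = coordinator({"research": SubgraphTeam()}, "research", "compare vendors")
after = coordinator({"research": RemoteTeam()}, "research", "compare vendors")
```

Swapping the implementation behind `"research"` changes nothing in `coordinator`. A single-string return type also makes the interface a natural compression point.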

Falsification criterion: This finding would be disproved by a production depth-3 pipeline where context is not compressed at level boundaries and the system maintains stable token usage and decision quality over 50+ multi-turn conversations.

Evidence

Tool	Reviewed	Evidence type	Result
LangGraph	source-reviewed March 2026	source-reviewed	Subgraph-as-node has no depth cap; boundary compression is manual — no framework enforcement
AutoGen SocietyOfMindAgent	source-reviewed March 2026	source-reviewed	LLM synthesis gateway wraps team as agent; depth-3 confirmed; inner team resets per invocation
Google ADK	source-reviewed March 2026	source-reviewed	`sub_agents` recursively defined; InvocationContext scopes context implicitly
Swarms	source-reviewed March 2026	source-reviewed	SwarmRouter + HierarchicalSwarm supports depth-2 natively, depth-3 via explicit nesting
CrewAI	source-reviewed March 2026	source-reviewed	Hard 2-level limit (1 manager + N workers) for hierarchical process; Flow is a separate event-driven system with arbitrary crew chaining

Confidence: empirical — 5 frameworks source-reviewed, boundary compression invariant independently observed in each implementation. No independent CVEs or GHSAs exist (this is an architectural behavior gap, not a reported vulnerability). Re-verified 2026-03-26.

Strongest case against: The boundary compression invariant could be a symptom of current model context window limits rather than a fundamental architectural constraint. As context windows scale past 1M tokens, passing full sub-team transcripts to top coordinators may become tractable — eliminating the need for compression as a design discipline. In that scenario, the advice to compress is conservative but not wrong; it becomes unnecessary overhead rather than a correctness requirement.

Open questions: Does Google ADK’s InvocationContext compression actually bound context in pathological cases (e.g., a sub-agent that generates 100K token transcripts)? The source-reviewed evidence confirms the mechanism exists; it has not been tested under adversarial conditions. What is the latency profile of AutoGen’s LLM synthesis gateway at depth 3 under load?

Seen different? Contribute your evidence — share a repro or counter-example and we’ll review it against this finding. Reader evidence is what keeps these findings accurate.