14 minute read

“Make agents reliable at large tasks: plan at multiple levels, execute in small verified steps, and stop when budgets say so.”

1. The problem hierarchical planning solves

Simple “single-shot” prompting breaks down when tasks are:

  • multi-step (10+ actions)
  • ambiguous (requires clarification)
  • tool-heavy (search, code execution, APIs)
  • long-running (state must persist)

Without structure, agents fail in predictable ways:

  • they jump into details too early
  • they forget constraints
  • they loop (“try again”) instead of changing strategy
  • they burn budgets without progress

Hierarchical planning is the pattern of making the agent plan at multiple resolutions:

  • high-level goal decomposition (what are the big chunks?)
  • mid-level step planning (what’s the sequence of actions?)
  • low-level execution (tool calls and micro-decisions)

The hierarchy is what keeps the agent oriented when tasks are larger than a single context window or a single reasoning pass.


2. The mental model: strategy vs. tactics vs. actions

Think of planning as three layers:

  1. Strategy (why): what approach are we taking and why?
  2. Tactics (what/when): what steps will we do and in what order?
  3. Actions (how): the concrete tool calls and operations

If your agent tries to do strategy, tactics, and actions all at once, it tends to become inconsistent. Hierarchical planning separates these concerns so each layer can be validated and updated.


2.5 A practical rule: “One level up decides, one level down executes”

Hierarchies can get messy if every level tries to do everything. A clean division of responsibility is:

  • Level N: decides what to do next (selects subgoals, allocates budget)
  • Level N+1: decides how to do it (concrete steps, tool calls)

Example:

  • Strategy layer decides: “We’ll solve this by collecting evidence first, then drafting, then critiquing.”
  • Tactics layer decides: “For evidence, run 3 searches and fetch up to 5 sources.”
  • Action layer decides: “Call search(q1), then fetch(url1)…”

This prevents a common failure mode where the action layer “changes strategy mid-run” and the whole system loses coherence.


3. A reference architecture: H-Plan + H-Execute + H-Verify

Here’s a robust architecture for hierarchical planning:

Objective + Constraints
   |
   v
High-Level Planner (H-Plan)
   |
   v
Task Graph / Outline (big chunks)
   |
   v
Step Planner (per chunk)
   |
   v
Executor (tools + actions)
   |
   v
Verifier (checks + gates)
   |
   v
State Update + Stop Check

Key design choice: commit plans to state

Don’t treat “the plan” as text in the model’s head. Store it as structured state:

  • tasks with IDs
  • dependencies
  • status
  • evidence requirements
  • budgets

This makes planning inspectable and debuggable.


3.5 Plan schemas: the simplest structure that works

If you can only add one structured artifact to your agent system, make it a plan schema. Even a minimal schema helps:

{
  "goal": "…",
  "constraints": ["…"],
  "tasks": [
    {
      "id": "T1",
      "title": "Collect evidence",
      "status": "pending",
      "done_definition": "Have ≥2 sources for key claims",
      "budget": {"max_steps": 6, "max_tool_calls": 6},
      "depends_on": []
    }
  ]
}

Why include done_definition and budget?

  • done_definition stops the agent from “finishing early” without evidence.
  • budget stops the agent from “never finishing” due to endless refinement.
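
To make the schema enforceable rather than advisory, validate it in code before any execution. A minimal sketch in Python (the field names mirror the example above; validate_plan and the status set are illustrative assumptions, not a fixed spec):

from typing import Any

REQUIRED_TASK_FIELDS = {"id", "title", "status", "done_definition", "budget", "depends_on"}
VALID_STATUSES = {"pending", "in_progress", "done", "failed"}  # assumed status values

def validate_plan(plan: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the plan passes."""
    problems: list[str] = []
    if not plan.get("goal"):
        problems.append("plan has no goal")
    tasks = plan.get("tasks", [])
    ids = {t.get("id") for t in tasks}
    for task in tasks:
        missing = REQUIRED_TASK_FIELDS - task.keys()
        if missing:
            problems.append(f"task {task.get('id', '?')} missing fields: {sorted(missing)}")
            continue
        if task["status"] not in VALID_STATUSES:
            problems.append(f"task {task['id']} has unknown status {task['status']!r}")
        for dep in task["depends_on"]:
            if dep not in ids:
                problems.append(f"task {task['id']} depends on unknown task {dep!r}")
    return problems

Rejecting a malformed plan at this point is far cheaper than discovering the problem ten tool calls later.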

Further reading (optional): for schema-first reliability patterns, see Structured Output Patterns.


4. Plan representations: list, tree, and DAG

4.1 Linear plan (list)

Best for:

  • small tasks
  • low parallelism
  • simple execution environments

Example:

  1. Gather requirements
  2. Fetch sources
  3. Extract facts
  4. Write answer

4.2 Tree plan (hierarchical outline)

Best for:

  • writing documents
  • building multi-section outputs
  • tasks that naturally decompose into “chapters”

Example:

  • Section A: requirements
    • A1: functional
    • A2: non-functional
  • Section B: design
    • B1: components
    • B2: data flow

4.3 DAG plan (dependencies + parallelism)

Best for:

  • tool calls that can be parallelized
  • workflows with shared prerequisites

Example:

  • fetch docs (A)
  • fetch logs (B)
  • parse docs (depends on A)
  • parse logs (depends on B)
  • synthesize (depends on both)

If your system supports parallel tool execution, DAG plans reduce latency and keep the agent from sitting idle while independent work is available.
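
To make that parallelism concrete, here is a small sketch that executes a DAG plan by repeatedly running every task whose dependencies are complete. The task names mirror the example above; run_task is a stand-in for real tool calls:

import concurrent.futures

# task -> set of prerequisite tasks (hypothetical names from the example)
DAG = {
    "fetch_docs": set(),
    "fetch_logs": set(),
    "parse_docs": {"fetch_docs"},
    "parse_logs": {"fetch_logs"},
    "synthesize": {"parse_docs", "parse_logs"},
}

def run_task(name: str) -> str:
    return f"{name}: done"  # stand-in for a real tool call

def run_dag(dag: dict[str, set[str]]) -> dict[str, str]:
    done: dict[str, str] = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        while len(done) < len(dag):
            # every not-yet-run task whose prerequisites are all complete is ready
            ready = [t for t, deps in dag.items() if t not in done and deps.issubset(done)]
            if not ready:
                raise ValueError("cycle or unsatisfiable dependency in plan")
            for task, result in zip(ready, pool.map(run_task, ready)):
                done[task] = result
    return done

print(run_dag(DAG))  # fetch_docs and fetch_logs run in the same wave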

Further reading (optional): if you want strict plan schemas, see Structured Output Patterns.


5. The key technique: plan refinement (coarse → fine)

Hierarchical planning works best when the agent refines plans progressively:

  1. Coarse plan: identify 3–7 big tasks
  2. Refine one chunk: expand only the next chunk into concrete steps
  3. Execute: run that chunk with verification
  4. Update: revise plan if reality differs

This avoids the most common planning failure:

The agent writes a beautiful 30-step plan, then step 2 fails, and the rest becomes irrelevant.

Progressive refinement keeps the plan aligned with reality.


5.5 Replanning triggers: when the plan must change

Good planning systems replan when reality changes. Define explicit triggers like:

5.5.1 Tool failure trigger

If the same tool call fails twice (timeout, 403, schema mismatch), replan instead of retrying forever.

5.5.2 Evidence gap trigger

If a chunk’s done-definition requires evidence but the agent cannot obtain it within budget, replan:

  • broaden query
  • change source strategy
  • ask user for constraints

5.5.3 Budget pressure trigger

If remaining budget is low and key tasks are incomplete, replan for a smaller scope:

  • deliver partial result
  • mark what’s missing and why

This is what makes hierarchical planning robust: it explicitly handles “plans becoming stale.”
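
One way to make the triggers concrete is a single check the loop runs after every step, returning a reason string when a replan is warranted. A sketch; the thresholds and state keys are illustrative assumptions:

def replan_trigger(state: dict) -> str | None:
    """Return a replan reason, or None if the current plan can continue."""
    # 5.5.1 Tool failure: the same call has now failed twice
    if state.get("consecutive_tool_failures", 0) >= 2:
        return "TOOL_FAILURE"
    # 5.5.2 Evidence gap: chunk budget spent but its done-definition is unmet
    chunk = state.get("current_chunk", {})
    if chunk.get("steps_used", 0) >= chunk.get("max_steps", 1) and not chunk.get("done"):
        return "EVIDENCE_GAP"
    # 5.5.3 Budget pressure: little global budget left, key tasks still open
    if state.get("steps_remaining", 0) <= 2 and state.get("open_critical_tasks", 0) > 0:
        return "BUDGET_PRESSURE"
    return None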


6. Planning with constraints: make constraints first-class

Constraints are where most agent plans fail. Typical constraints:

  • “no unsafe writes”
  • “cite sources”
  • “use a specific format”
  • “budget limit”

Implementation rule:

Constraints must be stored in state and checked by the verifier at every step.

If constraints are only in the initial prompt, they drift over long runs.


6.5 Constraint propagation: enforce constraints at every level

Constraints must be applied at:

  1. High-level plan: tasks should not violate constraints (no “delete database” tasks if writes are forbidden).
  2. Step plan: tools and steps must remain within allowlists.
  3. Execution: tool calls must be validated in code.

Practical technique:

  • Store constraints once in state.
  • Every planner/verifier prompt includes the same constraints field.
  • The verifier’s output must explicitly say which constraint each step satisfies (or violates).

This turns constraints from “text” into “guardrails.”
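
At the execution level (point 3 above), the check should be plain code rather than another model call. A minimal sketch, assuming a tool allowlist and the "no unsafe writes" constraint from earlier; the tool names are hypothetical:

ALLOWED_TOOLS = {"search", "fetch", "read_file", "write_file"}  # hypothetical allowlist
WRITE_TOOLS = {"write_file", "delete_file", "deploy"}           # write-capable tools

def check_tool_call(tool: str, constraints: list[str]) -> None:
    """Raise before the call happens; never rely on the model to self-police."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} is not on the allowlist")
    if "no unsafe writes" in constraints and tool in WRITE_TOOLS:
        raise PermissionError(f"tool {tool!r} violates the 'no unsafe writes' constraint")

Because this runs in code, a planner that drifts cannot talk its way past it.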


7. Stop conditions and budgets (planning without budgets is wishful thinking)

Hierarchical planning needs stop conditions; otherwise, agents keep refining forever.

Useful budgets:

  • max total steps
  • max tool calls
  • max retries per step
  • max tokens
  • max time

Useful stop states:

  • SUCCESS (requirements met)
  • NEEDS_INPUT (ambiguous / missing data)
  • NEEDS_APPROVAL (high-risk action)
  • FAILURE (cannot proceed within constraints/budgets)

This is how you turn “autonomous” into “predictable.”
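
A minimal sketch of budgets and stop states as data rather than prose; the specific fields and limits are assumptions to adapt:

import enum
import time
from dataclasses import dataclass, field

class Stop(enum.Enum):
    SUCCESS = "SUCCESS"
    NEEDS_INPUT = "NEEDS_INPUT"
    NEEDS_APPROVAL = "NEEDS_APPROVAL"
    FAILURE = "FAILURE"

@dataclass
class Budget:
    max_steps: int = 20
    max_tool_calls: int = 30
    max_seconds: float = 300.0
    steps: int = 0
    tool_calls: int = 0
    started: float = field(default_factory=time.monotonic)

    def exceeded(self) -> bool:
        # any single limit being hit is enough to stop the run
        return (self.steps >= self.max_steps
                or self.tool_calls >= self.max_tool_calls
                or time.monotonic() - self.started >= self.max_seconds)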


7.5 Budget allocation across the hierarchy

If you give one global budget (e.g., max 20 steps) without allocation, agents tend to overspend on early phases (usually searching or over-planning).

Instead, allocate budgets by chunk:

  • 20% requirements and clarification
  • 40% evidence gathering / tool work
  • 25% drafting or execution
  • 15% critique + finalization

These are not universal numbers—pick what matches your workload—but the idea is:

Spending should reflect value. Evidence gathering and execution often deserve more budget than prose.
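
Allocating a global step budget by phase takes only a few lines; a sketch using the example split above (phase names are illustrative):

PHASE_SPLIT = {"requirements": 0.20, "evidence": 0.40, "execution": 0.25, "critique": 0.15}

def allocate_budget(total_steps: int, split: dict[str, float]) -> dict[str, int]:
    """Give each phase its share, rounding down; leftover steps go to the largest phase."""
    alloc = {phase: int(total_steps * share) for phase, share in split.items()}
    alloc[max(split, key=split.get)] += total_steps - sum(alloc.values())
    return alloc

print(allocate_budget(20, PHASE_SPLIT))
# {'requirements': 4, 'evidence': 8, 'execution': 5, 'critique': 3}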


8. Verification gates: how you prevent plans from becoming hallucinations

At each layer, add gates:

8.1 Gate the high-level plan

Check:

  • does it cover all requirements?
  • is it realistic within budgets?
  • does it separate read vs write steps?

8.2 Gate the step plan

Check:

  • are tool calls necessary?
  • are tool calls safe (allowlists)?
  • are dependencies satisfied?

8.3 Gate execution

Check:

  • tool outputs are valid (schemas)
  • evidence exists (quotes/tests)
  • progress is real (not repeating)

Further reading (optional): see Self-Reflection and Critique for how to structure the critic role.


8.5 Verification checklists per plan layer (copy/paste)

8.5.1 High-level plan checklist

  • Does every task contribute to the goal?
  • Are constraints reflected (no forbidden actions)?
  • Are done-definitions measurable?
  • Are budgets realistic?

8.5.2 Step plan checklist

  • Does each step have a clear expected output?
  • Are tool calls necessary and safe (allowlists)?
  • Are dependencies satisfied?
  • Is there an obvious repetition risk?

8.5.3 Execution checklist

  • Did the tool output validate (schema + semantics)?
  • Did we record evidence (quotes, test output, logs)?
  • Did we update state (facts, failures, budgets)?
  • Did we hit stop conditions correctly?

These checklists can be used by a verifier model or by deterministic validators.
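
The execution checklist in particular maps directly onto a deterministic validator. A sketch, assuming your executor logs these fields for each step (the record keys are assumptions):

def check_execution_record(record: dict) -> list[str]:
    """Apply checklist 8.5.3 to one executed step; return the checks that failed."""
    failures = []
    if not record.get("output_valid"):   # schema + semantic validation passed?
        failures.append("tool output did not validate")
    if not record.get("evidence"):       # quotes, test output, or logs recorded?
        failures.append("no evidence recorded")
    if not record.get("state_updated"):  # facts, failures, budgets written back?
        failures.append("state was not updated")
    return failures

The high-level and step checklists usually need a verifier model; this one does not.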


9. Common failure modes (and fixes)

9.1 Over-planning

Symptom: agent writes huge plans and never executes.

Fix: force refinement only for the next chunk; enforce a planning budget (e.g., max 200 tokens for the plan).

9.2 Under-planning

Symptom: agent executes immediately and gets lost.

Fix: require at least a coarse plan before the first tool call.

9.3 Plan brittleness

Symptom: step 2 fails and the plan collapses.

Fix: progressive refinement + replanning triggers (“if tool fails twice, replan”).

9.4 Infinite loops

Symptom: repeating similar steps with no new evidence.

Fix: repetition detectors + negative memory + circuit breakers.
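
A repetition detector can be as simple as hashing each tool call and counting near-duplicates in a sliding window; a minimal sketch (window size and threshold are arbitrary choices):

import hashlib
import json
from collections import Counter, deque

class RepetitionDetector:
    def __init__(self, window: int = 8, threshold: int = 3):
        self.recent: deque[str] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, tool: str, args: dict) -> bool:
        """Record a call; return True if it has repeated too often recently."""
        key = hashlib.sha256(json.dumps([tool, args], sort_keys=True).encode()).hexdigest()
        self.recent.append(key)
        return Counter(self.recent)[key] >= self.threshold

Pair it with negative memory: when the breaker trips, record the failed approach in state so the replanner does not pick it again.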

Further reading (optional): see Error Handling and Recovery.


9.5 Planning + state: persistence for long-running tasks

Hierarchical planning is much more useful when tasks run long enough that you can’t keep everything in context.

Practical pattern:

  • After each chunk completes (or fails), checkpoint:
    • plan status
    • evidence list
    • budgets spent
    • failures

This gives you:

  • resumability (“pick up from where we stopped”)
  • auditability (“what did the agent do and why?”)
  • cost control (you can stop mid-run safely)
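
A checkpoint can be as simple as one JSON file per run; a minimal sketch (the path layout is an assumption):

import json
from pathlib import Path

def checkpoint(run_id: str, state: dict, root: str = "checkpoints") -> Path:
    """Persist the full planning state after each chunk so the run can resume."""
    path = Path(root) / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2, default=str))
    return path

def resume(run_id: str, root: str = "checkpoints") -> dict | None:
    path = Path(root) / f"{run_id}.json"
    return json.loads(path.read_text()) if path.exists() else None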

Further reading (optional): see State Management and Checkpoints for checkpointing mechanics and schema versioning.


10. Implementation sketch (pseudocode)

def hierarchical_agent(objective: str, constraints: list[str], state: dict) -> dict:
    while True:
        # 1) Ensure we have a high-level plan (plans are data, stored in state)
        if not state.get("high_plan"):
            state["high_plan"] = llm.make_high_plan(objective, constraints)
            state["high_plan"] = validate_plan_schema(state["high_plan"])

        # 2) Pick the next chunk; none left means the plan is complete
        chunk = select_next_chunk(state["high_plan"])
        if chunk is None:
            return {"status": "SUCCESS", "state": state}

        # 3) Refine only this chunk into concrete steps (bounded)
        steps = llm.refine_chunk(objective, constraints, chunk, state)
        steps = validate_step_schema(steps)

        # 4) Execute steps with verification gates
        for step in steps:
            if budget_exceeded(state):
                return {"status": "FAILURE", "reason": "BUDGET_EXCEEDED", "state": state}

            decision = verifier.precheck(step, constraints, state)
            if decision["status"] == "BLOCK":
                return {"status": "NEEDS_INPUT", "question": decision["question"], "state": state}

            result = tools.run(step["tool"], step["args"])
            state = update_state(state, step, result)

            if repetition_detected(state):
                return {"status": "FAILURE", "reason": "REPETITION_DETECTED", "state": state}

        # 5) Mark the chunk done and loop to the next one
        mark_chunk_complete(state["high_plan"], chunk)

The key idea: you only refine and execute a small piece at a time, and you validate each layer.


10.5 Practical tip: write plans in terms of artifacts, not vague actions

Plans become actionable when each step produces an artifact:

  • “Requirements doc” (bullet list)
  • “Evidence table” (claims + citations)
  • “Patch/diff” (for code changes)
  • “Test report” (pass/fail + logs)

If steps do not produce artifacts, it’s hard to verify progress and easy for the agent to drift into “thinking.”

This is especially important for long tasks: artifacts allow you to resume and audit without re-running everything.


11. Case study: building a “research + write” agent that doesn’t spiral

Goal: produce a short report with citations.

Hierarchical plan:

  • Chunk 1: define scope and queries
  • Chunk 2: collect sources and extract evidence
  • Chunk 3: write draft using evidence
  • Chunk 4: critique for unsupported claims
  • Chunk 5: finalize and format

Where the hierarchy helps:

  • browsing is bounded to chunk 2 (no endless searching)
  • writing doesn’t start until evidence exists
  • critique happens after writing, and revisions are bounded

Further reading (optional): Web Browsing Agents and Self-Reflection and Critique.


11.5 Case study: “Autonomous fix-it agent” (bounded remediation)

Goal: help fix a failing CI build safely.

High-level plan (hierarchical):

  • Chunk 1: gather context (logs, failing tests)
  • Chunk 2: hypothesize causes
  • Chunk 3: propose minimal patch
  • Chunk 4: run tests in a sandbox
  • Chunk 5: produce final diff for review

Key safety gate:

  • the agent must never apply changes directly; it produces a patch and evidence (test output).

Where hierarchical planning helps:

  • it stops the agent from “randomly editing files” without a hypothesis
  • it forces small patches and verification after each patch
  • it enforces budgets (“if 2 patch attempts fail, stop and ask for human input”)

Further reading (optional): if the agent runs code/tests, see Code Execution Agents.


11.6 A concrete evaluation approach for hierarchical planners

If you can’t measure planning quality, you’ll end up optimizing for “plans that look nice” instead of “plans that finish.”

A practical evaluation harness:

11.6.1 Step-budget success rate

Run a set of tasks with a fixed step budget and measure:

  • success rate
  • average steps used
  • p95 steps used

Good hierarchical planners should improve success rate and reduce wasted steps.
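
Computing these from run records is straightforward; a sketch assuming each record carries success and steps_used fields (an assumption about your logging):

import statistics

def step_budget_metrics(runs: list[dict]) -> dict:
    """Success rate plus average and p95 steps over a batch of fixed-budget runs."""
    steps = sorted(r["steps_used"] for r in runs)
    return {
        "success_rate": sum(r["success"] for r in runs) / len(runs),
        "avg_steps": statistics.mean(steps),
        "p95_steps": steps[min(len(steps) - 1, int(0.95 * len(steps)))],
    }

runs = [{"success": True, "steps_used": 9}, {"success": False, "steps_used": 20},
        {"success": True, "steps_used": 12}]
print(step_budget_metrics(runs))  # success_rate 0.67, avg ~13.7, p95 20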

11.6.2 Plan stability

Measure how often the agent replans:

  • replans per task
  • replans triggered by tool failure vs. triggered by new evidence

Replanning is not bad; thrashing is bad. If the agent replans every step, it’s not really planning—it’s improvising.

11.6.3 Evidence coverage

For tasks that require evidence (citations, test outputs), measure:

  • fraction of key claims with evidence
  • fraction of outputs passing verification gates

This aligns planning with real-world correctness.

Further reading (optional): for instrumentation patterns, see Observability and Tracing.


12. Summary & Junior Engineer Roadmap

Hierarchical planning turns large tasks into manageable, verifiable loops:

  1. Plan at multiple levels: strategy → tactics → actions.
  2. Refine progressively: coarse plan first, details only when needed.
  3. Store plans in state: plans are data, not vibes.
  4. Verify at every layer: gates prevent plan hallucinations.
  5. Use budgets + stop conditions: autonomy must be bounded.

Build a small hierarchical agent:

  • It always writes a coarse plan (max 7 tasks).
  • It refines only the next task into 3–6 steps.
  • It runs tools with strict validation.
  • It stops with a clear reason when budgets are exceeded.

If you can make this predictable, you’re ready for bigger autonomous systems.

Common rookie mistakes (avoid these)

  1. Planning in prose only: if the plan isn’t structured data, you can’t validate it or execute it reliably.
  2. No definition of done: “finish the task” is not a done-definition. Define measurable completion criteria.
  3. No replanning triggers: without triggers, agents either never replan (and get stuck) or replan constantly (thrash).
  4. No budgets per chunk: one global budget usually gets spent on early tasks; allocate budgets by phase.
  5. No verification gates: if you don’t gate each layer, the agent can “complete” a plan that never met requirements.
