5 minute read

“Intelligence is cheap. Reliable, scalable intelligence is expensive.”

1. Introduction

When you move from a “Demo” (10 queries/day) to “Production” (10M queries/day), the economics of AI Agents shift dramatically. A generic GPT-4 agent running a ReAct loop (Reason+Act) might cost $0.30 per task. If you have 10,000 active users running 5 tasks a day, your daily bill is $15,000 (roughly $5.5M/year).

Cost Management is not just about “switching to cheaper models”. It involves architecting systems that are financially aware, using routing, caching, and budget enforcement as core primitives.


2. Core Concepts: The Token Economy

To optimize cost, we must understand the billing unit.

  1. Input Tokens: Cheaper. (Reading context).
  2. Output Tokens: 3x-10x More Expensive. (Generation).
  3. Frequency: Agents are “chatty”. A single user “task” might involve 10 back-and-forth LLM calls (Thought -> Tool -> Observation -> Thought…).

The Formula: Cost = (Input_Vol * Input_Rate) + (Output_Vol * Output_Rate) + (Tool_Compute_Cost)

Optimization targets the Frequency (fewer calls) and the Rate (cheaper models).
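
To make the formula concrete, here is a quick back-of-the-envelope calculation in Python. The rates and token counts are illustrative only, not any provider’s actual pricing:

# Illustrative rates ($ per 1K tokens); output is priced higher than input
input_rate = 0.0015
output_rate = 0.0060

# A "chatty" ReAct task: 10 LLM calls, each reading ~2K tokens of context
# and writing ~300 tokens of output
input_tokens = 10 * 2000
output_tokens = 10 * 300

cost = (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate
print(f"${cost:.3f} per task")  # ~$0.048 -- frequency and rate drive the bill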


3. Architecture Patterns: The Cost Gateway

Agent code shouldn’t call providers directly with hardcoded API keys. Instead, route every request through a Model Gateway (like LiteLLM or Helicone).

[Agent Logic] -> [Cost Gateway] -> [Provider (OpenAI/Anthropic)]
                      |
              +-------+-----------+
              |                   |
        [Budget Check]      [Semantic Cache]
        (Stop if over)      (Return cached)

The Router Pattern: The Gateway inspects the prompt.

  • Tier 1 (Complex): “Write a legal contract” -> Route to GPT-4.
  • Tier 2 (Simple): “Extract the date from this string” -> Route to GPT-3.5-Turbo or Claude-Haiku.
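
A minimal sketch of the kind of heuristic a gateway might apply. The keyword list and length threshold here are invented for illustration; real routers often use a small classifier model or explicit task metadata instead:

def pick_tier(prompt: str) -> str:
    # Invented markers for "complex" work; tune these for your own traffic
    complex_markers = ("contract", "legal", "multi-step", "analyze", "architecture")
    if len(prompt) > 4000 or any(m in prompt.lower() for m in complex_markers):
        return "gpt-4"           # Tier 1: complex reasoning
    return "gpt-3.5-turbo"       # Tier 2: simple extraction / classification

pick_tier("Extract the date from this string")   # -> "gpt-3.5-turbo"
pick_tier("Write a legal contract for ...")      # -> "gpt-4"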

4. Implementation Approaches

4.1 Semantic Caching

Exact string matching (e.g. a plain Redis key lookup) tends to have a hit rate in the low single digits: “How are you?” != “How are you doing?”. Semantic Caching uses Embeddings instead.

  1. Embed query: vec = embed("How are you?")
  2. Search Vector DB for neighbors.
  3. If distance < 0.1 (very similar), return cached response.
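
A minimal in-memory sketch of that flow, assuming you already have an embed(text) function that returns a vector (any embedding model will do). A production system would use a real vector database instead of a linear scan:

import numpy as np

class SemanticCache:
    def __init__(self, threshold=0.1):
        self.threshold = threshold   # max cosine distance to count as a hit
        self.entries = []            # list of (embedding, cached_response)

    def lookup(self, query_vec):
        for vec, response in self.entries:
            # cosine distance = 1 - cosine similarity
            sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
            if 1 - sim < self.threshold:
                return response      # cache hit: skip the LLM call entirely
        return None                  # cache miss: call the model, then store()

    def store(self, query_vec, response):
        self.entries.append((query_vec, response))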

4.2 Waterfall Routing / Fallbacks

Try the cheapest model first. If it fails (or returns low confidence/bad format), retry with the expensive model.
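
A sketch of waterfall routing, assuming the OpenAI Python client and a caller-supplied validate function (e.g. “does the output parse as JSON?”):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def waterfall_call(prompt, validate, models=("gpt-3.5-turbo", "gpt-4")):
    last_answer = None
    for model in models:                       # ordered cheapest -> most expensive
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        last_answer = response.choices[0].message.content
        if validate(last_answer):              # good enough: stop escalating
            return last_answer
    return last_answer                         # every tier failed validation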


5. Code Examples: The Budget-Aware Router

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Mock Pricing Table ($ per 1k tokens; one blended rate per model for
# simplicity -- real output tokens cost more than input tokens)
PRICING = {
    "gpt-4": 0.03,
    "gpt-3.5-turbo": 0.0015
}

class CostRouter:
    def __init__(self, budget_limit=5.0):
        self.total_spend = 0.0
        self.budget_limit = budget_limit

    def estimate_cost(self, model, prompt_len, output_len_est=500):
        # Very rough pre-flight estimate: total tokens * blended rate
        rate = PRICING.get(model, 0.03)
        return (prompt_len + output_len_est) / 1000 * rate

    def route(self, prompt, complexity="low"):
        # 1. Budget Check: refuse before spending, not after
        if self.total_spend >= self.budget_limit:
            raise RuntimeError("Budget Exceeded! Refusing to run.")

        # 2. Model Selection Logic: default cheap, upgrade only when needed
        model = "gpt-3.5-turbo"
        if complexity == "high" or len(prompt) > 8000:
            model = "gpt-4"

        # 3. Execution
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )

        # 4. Accounting (post-execution true-up from actual usage)
        usage = response.usage
        total_tokens = usage.prompt_tokens + usage.completion_tokens
        cost = total_tokens * PRICING[model] / 1000  # simplified blended rate
        self.total_spend += cost

        return response

router = CostRouter(budget_limit=10.0)
# Simple query -> Cheap
router.route("What is 2+2?", complexity="low") 
# Hard query -> Expensive
router.route("Draft a patent claim for...", complexity="high")

6. Production Considerations

6.1 The “Agent Loop” Trap

An agent gets stuck in a loop:

  1. Thought: “I need to search Google.”
  2. Action: Search “Python”.
  3. Observation: “Python is a snake.”
  4. Thought: “That’s not code. I need to search Google.”
  5. Action: Search “Python”. …

System Design Fix: Implement a Max Steps Circuit Breaker. Hard limit of 10 steps. If not solved, return “I failed” rather than burning $100.
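
A sketch of that circuit breaker, assuming an agent_step callback that runs one Thought -> Action -> Observation iteration (the callback is a placeholder, not a specific framework API):

MAX_STEPS = 10   # hard ceiling on iterations per task

def run_with_circuit_breaker(task, agent_step, max_steps=MAX_STEPS):
    history = []
    for _ in range(max_steps):
        result = agent_step(task, history)     # one Reason+Act iteration
        history.append(result)
        if result.get("final_answer"):
            return result["final_answer"]
    # Step budget exhausted: fail cheaply instead of looping forever
    return "I was unable to complete this task within the step limit."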

6.2 FinOps Tagging

Every request should have metadata: {"user_id": "123", "feature": "email_writer"}. This allows you to answer: “Is the Email Writer feature profitable?”
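
A toy illustration of why the tags matter: once every request carries them, per-feature spend becomes a simple aggregation. The in-memory ledger here stands in for a real metrics/billing pipeline:

from collections import defaultdict

spend_by_feature = defaultdict(float)

def record_spend(metadata, cost_usd):
    spend_by_feature[metadata["feature"]] += cost_usd

record_spend({"user_id": "123", "feature": "email_writer"}, 0.012)
record_spend({"user_id": "456", "feature": "email_writer"}, 0.030)

print(spend_by_feature["email_writer"])   # ~0.042 -- compare against revenue per user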


7. Common Pitfalls

  1. Summarization Recursion: You summarize history to save tokens. But generating the summary itself costs tokens (both input and output). If the thread is short, summarization can cost more than just sending the raw history.
  2. Over-Caching: Caching “Write me a poem about X” is bad (User wants variety). Caching “What is the capital of X?” is good.
    • Fix: Only cache deterministic queries.

8. Best Practices

  1. Usage-Based Throttling: Rate limit users not just by Request Count, but by Dollar Amount. “You have $1.00 credit per day.”
  2. Separation of Concerns: Don’t ask the “Planner” (GPT-4) to do the “Extraction” (JSON formatting). Extract using GPT-3.5 or RegEx.
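
For example, pulling a date out of a string needs no LLM call at all; a regex (or a cheap model) does the extraction, and the Planner never sees the task:

import re

text = "The contract was signed on 2024-03-15 by both parties."
match = re.search(r"\d{4}-\d{2}-\d{2}", text)
date = match.group(0) if match else None   # "2024-03-15", zero tokens spent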

9. Connections to Other Topics

This connects to Model Serialization (ML Track).

  • Serialization: Optimizing storage size (disk cost).
  • Cost Mgmt: Optimizing token size (compute cost). In both, “Compression” (of weights or of prompts) is the key lever for efficiency.

10. Real-World Examples

  • Zapier AI Actions: Uses a router. Simple logic runs on cheaper models. Complex reasoning upgrades to GPT-4.
  • Microsoft Copilot: Likely caches code snippets. If 10,000 developers type def qsort(arr):, the completion is fetched from a KV store, not re-generated by the GPU.

11. Future Directions

  • Speculative Decoding: Using a small model to “guess” the next few tokens, and the large model to “verify” them in parallel. Reduces cost and latency.
  • Local-First Agents: Running a small open model (e.g. an 8B Llama 3) on the user’s laptop for free, falling back to Cloud GPT-4 only when stuck.

12. Key Takeaways

  1. Routing is ROI: Getting 80% quality for 10% price (GPT-3.5) is better than 99% quality for 100% price (GPT-4) for most tasks.
  2. Cache Aggressively: Semantic caching turns repeated queries into near-instant, near-$0 responses.
  3. Circuit Breakers: Never let an agent run while(true).

Originally published at: arunbaby.com/ai-agents/0047-cost-management-for-agents

If you found this helpful, consider sharing it with others who might benefit.