Cost Management for Agents
“Intelligence is cheap. Reliable, scalable intelligence is expensive.”
1. Introduction
When you move from a “Demo” (10 queries/day) to “Production” (10M queries/day), the economics of AI Agents shift dramatically. A generic GPT-4 agent running a ReAct loop (Reason+Act) might cost $0.30 per task. If you have 10,000 active users doing 5 tasks a day, that is 50,000 tasks and a daily bill of $15,000 (about $5.5M/year).
Cost Management is not just about “switching to cheaper models”. It involves architecting systems that are financially aware, using routing, caching, and budget enforcement as core primitives.
2. Core Concepts: The Token Economy
To optimize cost, we must understand the billing unit.
- Input Tokens: Cheaper. (Reading context).
- Output Tokens: Typically 2x-5x more expensive per token, depending on the model. (Generation).
- Frequency: Agents are “chatty”. A single user “task” might involve 10 back-and-forth LLM calls (Thought -> Tool -> Observation -> Thought…).
The Formula:
Cost = (Input_Vol * Input_Rate) + (Output_Vol * Output_Rate) + (Tool_Compute_Cost)
Optimization targets the Frequency (fewer calls) and the Rate (cheaper models).
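As a rough illustration of the formula, the sketch below multiplies per-call token volumes by the number of calls. The function name `task_cost`, the token counts, and the per-1k rates are assumptions chosen for the example, not real provider prices.

```python
def task_cost(calls, input_tokens_per_call, output_tokens_per_call,
              input_rate_per_1k, output_rate_per_1k, tool_cost=0.0):
    """Cost of one agent task: token volume times rate, plus tool compute."""
    input_vol = calls * input_tokens_per_call
    output_vol = calls * output_tokens_per_call
    return (input_vol / 1000 * input_rate_per_1k
            + output_vol / 1000 * output_rate_per_1k
            + tool_cost)


# Hypothetical task: 10 LLM calls, 800 input and 100 output tokens per call.
baseline = task_cost(10, 800, 100, input_rate_per_1k=0.03, output_rate_per_1k=0.06)
# Lever 1: Frequency (halve the number of calls).
fewer_calls = task_cost(5, 800, 100, input_rate_per_1k=0.03, output_rate_per_1k=0.06)
# Lever 2: Rate (same calls on a cheaper model).
cheaper_rate = task_cost(10, 800, 100, input_rate_per_1k=0.0015, output_rate_per_1k=0.002)

print(f"baseline=${baseline:.3f} fewer_calls=${fewer_calls:.3f} cheaper_rate=${cheaper_rate:.3f}")
```

With these assumed numbers the baseline lands near the $0.30 per task quoted in the introduction, and each lever can be compared on the same footing.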
3. Architecture Patterns: The Cost Gateway
We shouldn’t hardcode API keys in our agent code. We need a Model Gateway (like LiteLLM or Helicone).
[Agent Logic] -> [Cost Gateway] -> [Provider (OpenAI/Anthropic)]
                       |
             +---------+---------+
             |                   |
      [Budget Check]      [Semantic Cache]
      (Stop if over)      (Return cached)
The Router Pattern: The Gateway inspects the prompt.
- Tier 1 (Complex): “Write a legal contract” -> Route to GPT-4.
- Tier 2 (Simple): “Extract the date from this string” -> Route to GPT-3.5-Turbo or Claude-Haiku.
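One way the gateway could inspect a prompt is with cheap heuristics, sketched below. The keyword patterns, the length cutoff, and the names `pick_tier` and `pick_model` are all hypothetical; a production router might use a small classifier model instead.

```python
import re

# Hypothetical signals that a prompt is a simple, mechanical task.
SIMPLE_PATTERNS = [r"\bextract\b", r"\bclassify\b", r"\btranslate\b", r"\bformat\b"]
TIER_MODELS = {"complex": "gpt-4", "simple": "gpt-3.5-turbo"}


def pick_tier(prompt, max_simple_chars=2000):
    """Return 'simple' for short prompts that match a mechanical-task verb."""
    text = prompt.lower()
    looks_simple = any(re.search(pattern, text) for pattern in SIMPLE_PATTERNS)
    if looks_simple and len(prompt) <= max_simple_chars:
        return "simple"
    # When unsure, default to the stronger (more expensive) tier.
    return "complex"


def pick_model(prompt):
    return TIER_MODELS[pick_tier(prompt)]


print(pick_model("Write a legal contract"))
print(pick_model("Extract the date from this string"))
```

Defaulting to the complex tier trades some cost for safety: a misrouted hard task hurts quality more than a misrouted easy task hurts the bill.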
4. Implementation Approaches
4.1 Semantic Caching
Exact string matching (e.g. a Redis key lookup) typically has a low hit rate (on the order of 5%), because “How are you?” != “How are you doing?”. Semantic Caching uses Embeddings to match on meaning instead.
- Embed query: vec = embed("How are you?")
- Search Vector DB for neighbors.
- If distance < 0.1 (very similar), return cached response.
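The three steps above can be sketched as follows. To stay self-contained, `embed` here is a toy bag-of-words stand-in for a real embedding model, and the linear scan stands in for a vector DB; both are assumptions for illustration. The right distance threshold depends on the embedding, so the example passes a looser value than 0.1 for this toy vector space.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


class SemanticCache:
    def __init__(self, max_distance=0.1):
        self.max_distance = max_distance
        self.entries = []  # (vector, response) pairs; a vector DB would replace this

    def get(self, query):
        vec = embed(query)
        best = min(self.entries, key=lambda e: cosine_distance(vec, e[0]), default=None)
        if best is not None and cosine_distance(vec, best[0]) < self.max_distance:
            return best[1]
        return None  # cache miss: the caller falls through to the LLM

    def put(self, query, response):
        self.entries.append((embed(query), response))


cache = SemanticCache(max_distance=0.2)  # tuned for the toy embedding
cache.put("How are you?", "I'm doing well, thanks!")
print(cache.get("How are you doing?"))
print(cache.get("What is the capital of France?"))
```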
4.2 Waterfall Routing / Fallbacks
Try the cheapest model first. If it fails (or returns low confidence/bad format), retry with the expensive model.
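A minimal sketch of the waterfall, assuming the "bad format" check is a JSON validator. The model functions are stubs standing in for real client calls, and `waterfall` and `is_valid_json_with` are hypothetical helper names.

```python
import json


def is_valid_json_with(keys):
    """Build a validator that accepts only JSON objects containing all keys."""
    def validate(text):
        try:
            data = json.loads(text)
        except (TypeError, ValueError):
            return False
        return isinstance(data, dict) and all(key in data for key in keys)
    return validate


def waterfall(prompt, models, validate):
    """Try (name, call) pairs in order, cheapest first; return the first valid answer."""
    attempts = []
    for name, call in models:
        attempts.append(name)
        try:
            answer = call(prompt)
        except Exception:
            continue  # treat provider errors like a failed attempt
        if validate(answer):
            return answer, attempts
    raise RuntimeError(f"All models failed validation: {attempts}")


# Stub model calls (hypothetical); swap in real client calls.
def cheap_model(prompt):
    return "The date is probably March 3rd"  # wrong format, triggers fallback


def expensive_model(prompt):
    return '{"date": "2024-03-03"}'


answer, attempts = waterfall(
    "Extract the date as JSON",
    [("cheap", cheap_model), ("expensive", expensive_model)],
    is_valid_json_with(["date"]),
)
print(answer, attempts)
```

Note that a failed cheap attempt is not free: the fallback path pays for both calls, so the waterfall only saves money when the cheap model succeeds most of the time.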
5. Code Examples: The Budget-Aware Router
from openai import OpenAI  # openai>=1.0; reads OPENAI_API_KEY from the environment

# Mock Pricing Table ($ per 1k tokens); illustrative, check current provider rates
PRICING = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
}

class CostRouter:
    def __init__(self, budget_limit=5.0):
        self.client = OpenAI()
        self.total_spend = 0.0
        self.budget_limit = budget_limit

    def estimate_cost(self, model, prompt_tokens, output_tokens_est=500):
        # Very rough estimation; unknown models are priced like gpt-4
        rates = PRICING.get(model, PRICING["gpt-4"])
        return (prompt_tokens * rates["input"] + output_tokens_est * rates["output"]) / 1000

    def route(self, prompt, complexity="low"):
        # 1. Budget Check
        if self.total_spend >= self.budget_limit:
            raise RuntimeError("Budget Exceeded! Refusing to run.")

        # 2. Model Selection Logic
        model = "gpt-3.5-turbo"
        if complexity == "high" or len(prompt) > 8000:
            model = "gpt-4"

        # 3. Execution
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )

        # 4. Accounting (post-execution true-up using real input and output token counts)
        usage = response.usage
        self.total_spend += self.estimate_cost(
            model, usage.prompt_tokens, usage.completion_tokens
        )
        return response

router = CostRouter(budget_limit=10.0)

# Simple query -> Cheap
router.route("What is 2+2?", complexity="low")

# Hard query -> Expensive
router.route("Draft a patent claim for...", complexity="high")
6. Production Considerations
6.1 The “Agent Loop” Trap
An agent gets stuck in a loop:
- Thought: “I need to search Google.”
- Action: Search “Python”.
- Observation: “Python is a snake.”
- Thought: “That’s not code. I need to search Google.”
- Action: Search “Python”. …
System Design Fix: Implement a Max Steps Circuit Breaker. Hard limit of 10 steps. If not solved, return “I failed” rather than burning $100.
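A Max Steps Circuit Breaker can be as small as a bounded loop. In this sketch, `step_fn` is a hypothetical callable that performs one Thought/Action/Observation cycle and returns `(action, done)`; the returned dictionary shape is also an assumption.

```python
def run_agent(step_fn, max_steps=10):
    """Run step_fn until it reports completion or the step limit is hit."""
    history = []
    for _ in range(max_steps):  # bounded loop: never while(true)
        action, done = step_fn(history)
        history.append(action)
        if done:
            return {"status": "solved", "steps": len(history)}
    # Circuit breaker tripped: fail loudly instead of spending more.
    return {
        "status": "failed",
        "reason": f"no answer after {max_steps} steps",
        "steps": len(history),
    }


# A stuck agent that keeps issuing the same search.
stuck = run_agent(lambda history: ('search("Python")', False))
print(stuck)
```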
6.2 FinOps Tagging
Every request should have metadata: {"user_id": "123", "feature": "email_writer"}.
This allows you to answer: “Is the Email Writer feature profitable?”
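A minimal sketch of how tagged requests could be rolled up per feature. `CostLedger` and its methods are hypothetical names, the costs are made up, and a real system would likely write these records to a metrics store rather than an in-memory list.

```python
from collections import defaultdict


class CostLedger:
    def __init__(self):
        self.records = []

    def record(self, cost_usd, **tags):
        """Store one request's cost together with its metadata tags."""
        self.records.append({"cost_usd": cost_usd, **tags})

    def spend_by(self, tag):
        """Total spend grouped by one tag, e.g. 'feature' or 'user_id'."""
        totals = defaultdict(float)
        for rec in self.records:
            totals[rec.get(tag, "untagged")] += rec["cost_usd"]
        return dict(totals)


ledger = CostLedger()
ledger.record(0.30, user_id="123", feature="email_writer")
ledger.record(0.02, user_id="456", feature="date_extractor")
ledger.record(0.25, user_id="123", feature="email_writer")

by_feature = ledger.spend_by("feature")
print(by_feature)
```

Comparing `by_feature["email_writer"]` against the revenue attributed to that feature gives the profitability answer.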
7. Common Pitfalls
- Summarization Recursion: You summarize history to save tokens, but generating the summary itself costs tokens. On a short thread, summarizing can cost more than just re-reading the raw logs.
- Over-Caching: Caching “Write me a poem about X” is bad (User wants variety). Caching “What is the capital of X?” is good.
- Fix: Only cache deterministic queries.
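For the summarization pitfall, a back-of-envelope break-even check helps. The sketch assumes the summary is generated once (reading the full history, writing the summary) and then replaces the history on every later call; the function name and the relative rates are illustrative assumptions.

```python
import math


def summarization_break_even(history_tokens, summary_tokens, input_rate, output_rate):
    """Number of future calls needed before summarizing pays for itself.

    Rates can be in any consistent unit (e.g. $ per token).
    Returns math.inf if the summary is not shorter than the history.
    """
    one_time_cost = history_tokens * input_rate + summary_tokens * output_rate
    saving_per_call = (history_tokens - summary_tokens) * input_rate
    if saving_per_call <= 0:
        return math.inf
    return math.ceil(one_time_cost / saving_per_call)


# Relative rates: output tokens assumed 3x the price of input tokens.
long_thread = summarization_break_even(4000, 400, input_rate=1, output_rate=3)
short_thread = summarization_break_even(300, 250, input_rate=1, output_rate=3)
print(long_thread, short_thread)
```

If the agent is expected to make fewer calls than the break-even number, passing the raw history is the cheaper option.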
8. Best Practices
- Usage-Based Throttling: Rate limit users not just by Request Count, but by Dollar Amount. “You have $1.00 credit per day.”
- Separation of Concerns: Don’t ask the “Planner” (GPT-4) to do the “Extraction” (JSON formatting). Extract using GPT-3.5 or RegEx.
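Usage-Based Throttling could look like the sketch below: a per-user, per-day dollar allowance checked before each call and charged afterwards. `DollarThrottle` is a hypothetical in-memory helper; a real deployment would keep the counters in a shared store.

```python
from collections import defaultdict
from datetime import date


class DollarThrottle:
    def __init__(self, daily_credit_usd=1.00):
        self.daily_credit_usd = daily_credit_usd
        self.spent = defaultdict(float)  # (user_id, day) -> dollars spent

    def remaining(self, user_id, day=None):
        day = day or date.today()
        return max(0.0, self.daily_credit_usd - self.spent[(user_id, day)])

    def allow(self, user_id, estimated_cost_usd, day=None):
        """Check before the call, using an estimated cost."""
        return estimated_cost_usd <= self.remaining(user_id, day)

    def charge(self, user_id, actual_cost_usd, day=None):
        """Record the real cost after the call completes."""
        day = day or date.today()
        self.spent[(user_id, day)] += actual_cost_usd


throttle = DollarThrottle(daily_credit_usd=1.00)
today = date(2024, 1, 1)  # fixed date so the example is reproducible
throttle.charge("user-1", 0.90, day=today)
print(throttle.allow("user-1", 0.30, day=today))             # over today's credit
print(throttle.allow("user-1", 0.30, day=date(2024, 1, 2)))  # credit resets next day
```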
9. Connections to Other Topics
This connects to Model Serialization (ML Track).
- Serialization: Optimizing storage size (disk cost).
- Cost Mgmt: Optimizing token size (compute cost).
In both, “Compression” (of weights or of prompts) is the key lever for efficiency.
10. Real-World Examples
- Zapier AI Actions: Uses a router. Simple logic runs on cheaper models. Complex reasoning upgrades to GPT-4.
- GitHub Copilot: Likely caches code snippets. If 10,000 developers type def qsort(arr):, the completion can be fetched from a KV store rather than re-generated on the GPU.
11. Future Directions
- Speculative Decoding: Using a small model to “guess” the next few tokens, and the large model to “verify” them in parallel. Reduces cost and latency.
- Local-First Agents: Running an 8B Llama-3 model on the user’s laptop with no per-token API cost, falling back to Cloud GPT-4 only when stuck.
12. Key Takeaways
- Routing is ROI: Getting 80% quality for 10% price (GPT-3.5) is better than 99% quality for 100% price (GPT-4) for most tasks.
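- Cache Aggressively: A cache hit is the cheapest response you can serve. An exact-match hit costs essentially nothing; a semantic hit still pays for one embedding call and a vector lookup, which is far cheaper and faster than a generation.
- Circuit Breakers: Never let an agent run while(true).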
Originally published at: arunbaby.com/ai-agents/0047-cost-management-for-agents