Token Efficiency Optimization
“The most expensive token is the one you didn’t need to send.”
1. Introduction
In the world of AI Agents, Tokens are Currency.
- Cost: You pay per 1M input tokens. A verbose system prompt sent 100,000 times a day drains the budget.
- Latency: The more tokens the LLM reads, the slower the Time-To-First-Token (TTFT). Reading 10k tokens takes seconds.
- Performance: The “Lost in the Middle” phenomenon means LLMs forget instructions if the context is too stuffed.
Token Efficiency Optimization is the practice of compressing your agent’s cognitive load without lobotomizing its intelligence. It is the “garbage collection” and “compression” of the Agentic world.
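To make the cost point concrete, here is a back-of-the-envelope calculation; the prompt size, traffic, and per-token price below are illustrative assumptions, not real provider pricing:

```python
# Back-of-the-envelope: what a verbose system prompt costs per day.
# All numbers are illustrative assumptions, not actual pricing.
system_prompt_tokens = 2_000        # a fairly verbose system prompt
requests_per_day = 100_000
price_per_million_input = 3.00      # USD per 1M input tokens (assumed)

daily_cost = system_prompt_tokens * requests_per_day / 1_000_000 * price_per_million_input
print(f"${daily_cost:,.0f}/day just for the system prompt")  # -> $600/day
```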
2. Core Concepts: The Anatomy of a Prompt
An Agent’s context window isn’t just a string. It has rigid sections, each with a “Token Tax”:
- System Prompt: The static instructions (“You are a helpful assistant…”). Sent every single turn.
- Tool Definitions: The JSON schemas describing functions (search_google, calculator). Can be huge.
- Conversation History: The dynamic chat log. Grows linearly (or super-linearly if tools are verbose).
- RAG Context: The retrieved documents. Often the biggest chunk (3k-10k tokens).
Optimization attacks each of these layers.
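A quick way to see where the “Token Tax” goes is to count each layer separately, for example with tiktoken (the section strings below are placeholders):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Placeholder strings -- swap in your agent's real prompt sections.
sections = {
    "system_prompt": "You are a helpful assistant...",
    "tool_definitions": '{"name": "search_google", "parameters": {"query": "string"}}',
    "history": "user: Hello\nassistant: Hi, how can I help?",
    "rag_context": "Retrieved document text goes here...",
}

for name, text in sections.items():
    print(f"{name:18s} {len(enc.encode(text))} tokens")
```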
3. Architecture Patterns for Efficiency
We treat the Context Window as a Cache.
3.1 The “Context Manager” Pattern
Instead of blindly calling messages.append(response), we use a dedicated manager class.
- FIFO Buffer: Keep last N turns.
- Summarization: When history > 10 turns, ask a cheap model (GPT-3.5) to summarize turns 1-5 into a single “Memory” string.
- Tool Pruning: Only inject tool definitions relevant to the current state.
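Below is a minimal sketch of the summarization step; summarize_fn stands in for a call to whatever cheaper model you use, and the thresholds are arbitrary:

```python
MAX_TURNS = 10         # compress once history grows past this many chat turns
SUMMARIZE_OLDEST = 5   # how many early turns to fold into one "memory" string

def maybe_summarize(history, summarize_fn):
    """Fold the oldest turns into a single memory message once history grows."""
    chat = [m for m in history if m["role"] != "system"]
    if len(chat) <= MAX_TURNS:
        return history

    system = [m for m in history if m["role"] == "system"]
    old, recent = chat[:SUMMARIZE_OLDEST], chat[SUMMARIZE_OLDEST:]

    summary = summarize_fn(old)  # e.g., a call to a cheaper model
    memory = {"role": "system", "content": f"Conversation so far: {summary}"}
    return system + [memory] + recent
```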
3.2 Prompt Compression (The “Minification”)
Much like minifying JavaScript (longVariableName -> a), we can minify prompts.
- Original: “Please analyze the following text and determine the sentiment score between 0 and 1.” (15 tokens)
- Optimized: “Analyze text. Return sentiment 0-1.” (6 tokens)
- Savings: roughly 60% (exact counts depend on the tokenizer).
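You can verify the counts yourself; exact numbers depend on the tokenizer version:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

verbose = "Please analyze the following text and determine the sentiment score between 0 and 1."
terse = "Analyze text. Return sentiment 0-1."

print(len(enc.encode(verbose)), len(enc.encode(terse)))
# Exact counts vary by tokenizer, but the terse version is a fraction of the size.
```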
4. Implementation Approaches
Strategy A: Dynamic Tool Loading
If you have 50 tools, don’t put 50 schemas in the system prompt. Use a Router.
- Classifier Step: “User asks about Math.”
- Loader Step: Load only Calculator, WolframAlpha.
- Execution Step: Run Agent.
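A minimal routing sketch under these assumptions (the registry contents and classify_fn are illustrative placeholders, not a real tool API):

```python
# Illustrative registry mapping a coarse intent to the tool schemas it needs.
TOOL_REGISTRY = {
    "math": [
        {"name": "calculator", "description": "evaluate arithmetic expr"},
        {"name": "wolfram_alpha", "description": "symbolic math / facts"},
    ],
    "web": [
        {"name": "search_google", "description": "web search"},
    ],
}

def route_tools(user_message: str, classify_fn) -> list[dict]:
    """Classifier step -> loader step: inject only the relevant schemas."""
    category = classify_fn(user_message)    # e.g., a cheap model or keyword rules
    return TOOL_REGISTRY.get(category, [])  # unknown intent -> no tools injected

# Usage with a trivial keyword classifier standing in for an LLM router:
tools = route_tools(
    "What is 17 * 23?",
    lambda msg: "math" if any(c.isdigit() for c in msg) else "web",
)
```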
Strategy B: System Prompt Refactoring
Standardize on terse, expert language. LLMs (especially GPT-4) understand compressed instructions. Instead of “If the user does this, then you should do that”, use “User: X -> Action: Y”.
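As a concrete (made-up) example of the rewrite:

```python
# Verbose instruction (roughly 30 tokens):
verbose_rule = (
    "If the user asks you to delete a file, then you should first ask them "
    "to confirm that they really want to delete it before doing anything."
)

# Compressed "User: X -> Action: Y" form (roughly a third of that):
terse_rule = "User: delete file -> Action: ask confirmation first."
```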
5. Code Examples: The Token Budget Manager
Here is a Python class that actively manages the context window, ensuring we never overflow or overspend.
```python
import tiktoken


class TokenBudgetManager:
    def __init__(self, model="gpt-4", max_tokens=8000):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.encoding_for_model(model)
        # Reserve 1000 tokens for the generated reply
        self.safety_margin = 1000

    def count(self, text):
        return len(self.encoding.encode(text))

    def compress_history(self, history):
        """
        Compresses conversation history to fit the budget.
        Strategy: keep the System Prompt + the last N messages that fit.
        """
        current_tokens = 0
        preserved_messages = []

        # 1. Always keep the System Prompt (critical instructions)
        system_msg = next((m for m in history if m['role'] == 'system'), None)
        if system_msg:
            current_tokens += self.count(system_msg['content'])
            preserved_messages.append(system_msg)

        # 2. Add recent messages backwards until the budget is hit
        budget = self.max_tokens - current_tokens - self.safety_margin

        # Walk the chat (excluding system) from newest to oldest
        chat_msgs = [m for m in history if m['role'] != 'system']
        insert_pos = 1 if system_msg else 0  # keep chronological order after the system prompt
        for msg in reversed(chat_msgs):
            msg_tokens = self.count(msg['content'])
            if budget - msg_tokens >= 0:
                preserved_messages.insert(insert_pos, msg)  # Insert after system
                budget -= msg_tokens
            else:
                break  # Stop adding older messages

        return preserved_messages


# Usage
manager = TokenBudgetManager()
history = [
    {"role": "system", "content": "You are a concise agent..."},
    {"role": "user", "content": "Hello world"},
    # ... 50 more messages
]
optimized_history = manager.compress_history(history)
```
6. Production Considerations
6.1 Cost vs. Quality Curve
There is a “Pareto Frontier” of optimization.
- Removing adjectives: No quality loss.
- Removing examples (Few-Shot): Slight quality loss.
- Removing constraints: High quality loss (the agent hallucinates).
Rule of Thumb: Never compress the Safety Guidelines.
6.2 Caching (The Semantic Cache)
We discussed caching in ML System Design. For Agents, a Semantic Cache saves 100% of the tokens on a hit.
Ref: Cost Management.
If User: "Hello" has a cached response, we send 0 tokens to the LLM.
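A toy semantic-cache sketch; embed_fn is whatever embedding model you already use, and the 0.95 similarity threshold is an arbitrary assumption:

```python
import numpy as np

class SemanticCache:
    """Return a stored reply when a new query is semantically close enough."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn        # any embedding function: str -> vector
        self.threshold = threshold      # assumed cosine-similarity cutoff
        self.entries = []               # list of (embedding, response)

    def get(self, query):
        q = np.asarray(self.embed_fn(query), dtype=float)
        for emb, response in self.entries:
            sim = float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb)))
            if sim >= self.threshold:
                return response         # cache hit: zero LLM tokens spent
        return None                     # cache miss: caller falls through to the LLM

    def put(self, query, response):
        self.entries.append((np.asarray(self.embed_fn(query), dtype=float), response))
```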
7. Common Pitfalls
- JSON Schema Bloat: Pydantic models with verbose descriptions.
  - Fix: Use terse descriptions. Field(description="The user's age in years") -> Field(description="age (yrs)").
- HTML/XML Residue: Scraping a website often leaves <div> tags and class="..." attributes. These are junk tokens.
  - Fix: Use html2text or Markdown converters before injecting into context.
- Recursive Summarization: If you summarize a summary of a summary, details wash out (Chinese Whispers).
  - Fix: Keep “Key Claims” separate from “Conversation Log”.
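The first two fixes in one illustrative snippet (the field names and the HTML string are made up):

```python
import html2text
from pydantic import BaseModel, Field

# Terse field descriptions: same schema semantics, fewer tokens in the tool definition.
class UserProfile(BaseModel):
    age: int = Field(description="age (yrs)")        # instead of "The user's age in years"
    country: str = Field(description="ISO country")  # instead of a full sentence

# Strip HTML before it ever reaches the context window.
raw_html = "<div class='post'><p>Quarterly revenue grew 12%.</p></div>"
clean_markdown = html2text.html2text(raw_html)
print(clean_markdown)  # roughly: "Quarterly revenue grew 12%."
```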
8. Best Practices: The “Chain of Density”
A prompting technique from Salesforce Research. Instead of asking for a summary once, ask the model to:
- Write a summary.
- Identify missing entities from the text.
- Rewrite the summary to include those entities without increasing length.
- Repeat 5 times.
This produces densely packed summaries, maximizing the information carried per token.
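A paraphrased prompt template (not the paper's exact wording) looks roughly like this:

```python
# Paraphrased Chain-of-Density instruction; wording is illustrative.
COD_PROMPT = """Article: {article}

Write a summary of the article. Then repeat the following 5 times:
1. Identify 1-3 informative entities from the article that are missing from the summary.
2. Rewrite the summary to include them WITHOUT increasing its length.
Return only the final, densest summary."""
```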
9. Connections to Other Topics
This connects deeply to the Transfer Learning theme of Day 46.
- Transfer Learning (ML): Freezes the backbone to save compute.
- Token Optimization (Agents): Freezes/Compresses the System Prompt (the “backbone” of the agent’s personality) to save IO.
- Both are about identifying the “Invariant” (core knowledge) vs the “Variant” (current input) and optimizing the ratio.
10. Real-World Examples
- GitHub Copilot: Uses “Fill-in-the-middle” models but aggressively prunes the surrounding code context to fit the most relevant file imports into the window.
- AutoGPT: Famously struggled with runaway cost loops ($10 runs). Newer versions implement “sliding windows” and “memory summarization” by default.
11. Future Directions
- Context Caching (Google Gemini 1.5): You pay once to upload and process a huge manual (up to ~1M tokens). Subsequent requests referencing that manual are much cheaper, because the provider reuses the cached prefix instead of re-reading it.
- Infinite Attention: Architectures (like Ring Attention) that aim to make context length effectively unbounded, shifting the bottleneck purely to compute/cost.
12. Key Takeaways
- Count your tokens: Use tiktoken. Don't guess. Length $\neq$ Tokens.
- Minify everything: Prompts, Schemas, HTML.
- Manage History: It’s a sliding window, not an infinite scroll.
- Density is Quality: Concise prompts often yield smarter agents because the attention mechanism is less diluted.
Originally published at: arunbaby.com/ai-agents/0046-token-efficiency-optimization
If you found this helpful, consider sharing it with others who might benefit.