
“The user knows what they want. Your job is to tell them before they finish typing.”

TL;DR

Typeahead systems must return suggestions in under 50ms at millions of QPS. A weighted trie with pre-computed top-K candidates at each node enables near-constant-time lookups (proportional only to prefix length) by avoiding runtime subtree traversal. The architecture combines a static trie built offline from query logs with a dynamic trie for real-time trending topics, all served behind CDN caching and prefix-based sharding. Memory optimization through Radix Trees reduces pointer overhead by roughly 60%. For the data structure foundations behind this, see the broader ML System Design series, and for how vector search complements keyword retrieval, see Vector Databases.

[Image: A massive filing cabinet with cascading drawer labels getting more specific at each level]

1. Problem Statement

Typeahead (or Autocomplete) is one of the most heavily used features on the internet.

  • Volume: Every keystroke triggers a query. If a user types “iphone case”, that’s 11 keystrokes and up to 11 suggestion requests.
  • Latency: Must be < 50ms (Human perception of “instant”).
  • Freshness: If “Earthquake in Japan” happens, it must appear in suggestions within minutes.

The System Design: Build a backend for a Google-style search bar that suggests the top 10 completions for a given prefix.


2. Understanding the Requirements

2.1 The Scale

  • DAU: 500 Million.
  • Queries: 5 billion searches / day.
  • Keystrokes: ~50 billion typeahead requests / day.
  • QPS: Peak 1 million QPS. (This requires massive horizontal scaling).
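Back of the envelope (a rough sketch; the ~10 keystrokes per search and the 2x peak-to-average factor are assumptions for illustration):

# Rough capacity math. AVG_QUERY_LENGTH and PEAK_FACTOR are assumptions.
DAILY_SEARCHES = 5e9          # 5 billion searches / day
AVG_QUERY_LENGTH = 10         # ~10 keystrokes per search
PEAK_FACTOR = 2               # peak-to-average traffic ratio

daily_keystrokes = DAILY_SEARCHES * AVG_QUERY_LENGTH   # ~50 billion requests / day
avg_qps = daily_keystrokes / 86_400                    # seconds in a day
peak_qps = avg_qps * PEAK_FACTOR

print(f"avg ~{avg_qps:,.0f} QPS, peak ~{peak_qps:,.0f} QPS")
# avg ~578,704 QPS, peak ~1,157,407 QPS -> on the order of 1 million QPS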

2.2 The Ranking Problem

We don’t just want matching strings.

  • Prefix: “ja”
  • Candidates: “java”, “japan”, “javascript”, “jacket”.
  • We need to return the top K (e.g., 10) sorted by Score, where Score = Transformation(Frequency, User_History, Location, Freshness).
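As a hedged illustration only (the weights and feature names below are hypothetical, not a production formula), the transformation is often a simple weighted combination of the signals above:

import math

# Hypothetical scoring function. Weights and features are illustrative assumptions.
def score(frequency, personal_clicks, same_location, age_hours):
    return (
        1.0 * math.log1p(frequency)               # global popularity (log-dampened)
        + 2.0 * math.log1p(personal_clicks)       # this user's own history
        + 0.5 * (1.0 if same_location else 0.0)   # locality boost
        + 3.0 * math.exp(-age_hours / 24.0)       # freshness decay over ~a day
    )

# Example: a fresh, local, personally relevant query beats a merely popular one.
print(score(frequency=1_000_000, personal_clicks=0, same_location=False, age_hours=720))
print(score(frequency=50_000, personal_clicks=3, same_location=True, age_hours=2))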

3. High-Level Architecture

We need a dedicated Typeahead Service separate from the main Search Service.

[Browser]
 | (GET /suggest?q=sys)
 v
[Load Balancer] --> [CDN (Caches popular: "fa"->"facebook")]
 |
 v
[Typeahead Aggregator]
 |
 +---> [Service Node 1 (A-C)]
 +---> [Service Node 2 (D-F)]
 +---> [Personalization Service (Reranking)]
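A minimal sketch of the read path, assuming hypothetical in-memory stand-ins for the CDN cache, the trie shard nodes, and the personalization service (in production these are all separate network hops):

# Read-path sketch: edge cache -> shard lookup -> personalized rerank.
# All three backends are toy stand-ins for illustration.
CDN_CACHE = {"fa": ["facebook", "fashion", "fantasy football"]}          # hot prefixes
TRIE_SHARDS = {"sy": ["system design", "sysadmin", "systolic array"]}    # prefix -> top-K

def rerank(candidates, user_id):
    # The personalization service would reorder by user history; identity here.
    return candidates

def suggest(prefix, user_id, k=10):
    if prefix in CDN_CACHE:                       # served at the edge, never hits origin
        return CDN_CACHE[prefix][:k]
    candidates = TRIE_SHARDS.get(prefix[:2], [])  # route to the shard owning this prefix
    return rerank([c for c in candidates if c.startswith(prefix)], user_id)[:k]

print(suggest("sys", user_id=42))   # ['system design', 'sysadmin', 'systolic array']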

4. Component Deep-Dives

4.1 The Data Structure: The Weighted Trie

In RAM, we store a specialized Trie. Each Node contains:

  1. Children Pointers: {'a': Node, 'b': Node}.
  2. Top-K Cache: A pre-sorted list of the top 10 most popular completed queries that pass through this node.

Why Pre-compute Top-K? If we wait until runtime to traverse the whole subtree of “sys” to find “system”, “systolic”… it’s too slow. We aggregate scores offline. At node “s-y-s”, we store ["system", "sysadmin"] directly. Time Complexity: O(prefix length) to walk to the node, plus O(1) to return the cached list; the cost is independent of dictionary size.
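Conceptually (the scores and entries below are made up for illustration), the node reached by walking s → y → s looks like this:

# Illustrative contents of the trie node for prefix "sys".
# The expensive aggregation over the whole "sys" subtree happened offline.
sys_node = {
    "children": {"t": "<node for 'syst'>", "a": "<node for 'sysa'>"},
    "top_candidates": [
        (9_800, "system design"),
        (7_200, "system of a down"),
        (3_100, "sysadmin"),
    ],
}
# Serving = walk 3 pointers (s -> y -> s), then return sys_node["top_candidates"].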

4.2 Data Sharding

The Dictionary is too big for one machine.

  • Option A: Range Partitioning (A-C, D-F)
    • Risk: Hotspots. ‘S’ is popular; ‘X’ is quiet.
  • Option B: Hash Partitioning
    • ShardID = Hash(prefix) % 100.
    • Risk: We can only hash the prefix we have. If the user types just “s”, do we query all shards? No. In practice we shard by the first 2-3 characters of the query.
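A hedged sketch of Option B, sharding on the first two characters (the 100-shard count mirrors the formula above; the choice of MD5 is an assumption):

import hashlib

NUM_SHARDS = 100

def shard_for(prefix):
    """Route a prefix to a shard by hashing its first 2 characters.

    Every prefix that extends "sy" ("sys", "syst", ...) lands on the same shard,
    so one node owns the whole subtree below that 2-character prefix.
    """
    key = prefix[:2].lower()
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("sy"), shard_for("system"))   # same shard
print(shard_for("xylophone"))                 # likely a different, much quieter shard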

5. Data Flow: The Log Processing Pipeline

How do queries get into the Trie?

  1. Log Collection: Browser emits “User searched: java” -> Kafka.
  2. Aggregation (Spark/MapReduce):
    • Window: Last 7 days.
    • Group By Query. Count Frequency.
  3. Trie Building (Offline):
    • Worker reads aggregated list.
    • Builds the Weighted Trie.
    • Serializes to file (trie_snapshot_v100.bin).
  4. Deployment:
    • Push file to S3.
    • Typeahead Servers download and hot-swap memory.
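At toy scale, the aggregation step above is just a windowed group-by-and-count; the log rows and schema (timestamp, query) below are made up for illustration of what Spark/MapReduce does over billions of rows:

from collections import Counter
from datetime import datetime, timedelta

# Toy stand-in for the Spark job: filter to a 7-day window, group by query, count.
logs = [
    ("2024-05-01T10:00:00", "java"),
    ("2024-05-02T11:30:00", "java"),
    ("2024-05-03T09:15:00", "javascript"),
    ("2024-04-01T08:00:00", "java applets"),   # too old, falls outside the window
]

cutoff = datetime(2024, 5, 4) - timedelta(days=7)
counts = Counter(
    query for ts, query in logs
    if datetime.fromisoformat(ts) >= cutoff
)
print(counts.most_common())   # [('java', 2), ('javascript', 1)]
# This (query, frequency) list is what the offline worker feeds into the trie builder.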

6. Scaling Strategies

6.1 The Hybrid Architecture

Offline builds happen weekly, but “Breaking News” needs to appear now. We therefore merge two tries:

  • Result = Merge(Static_Trie, Dynamic_Trie).
  • Dynamic Trie: Size limited (e.g., 200MB). Stores only high-velocity queries from the last 15 mins (via a Storm/Flink stream).
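A minimal merge sketch (the candidate lists, scores, and the keep-best-score-per-query dedup rule are illustrative assumptions):

# Merge static (weekly) and dynamic (last ~15 min) candidates for one prefix.
def merge(static_candidates, dynamic_candidates, k=10):
    best = {}
    for score, text in static_candidates + dynamic_candidates:
        if text not in best or score > best[text]:
            best[text] = score                      # keep the best score per query
    ranked = sorted(best.items(), key=lambda item: -item[1])
    return [text for text, score in ranked[:k]]

static_trie  = [(9_000, "japan travel"), (7_000, "java download")]
dynamic_trie = [(50_000, "japan earthquake")]       # trending in the last 15 minutes
print(merge(static_trie, dynamic_trie))
# ['japan earthquake', 'japan travel', 'java download']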

6.2 Sampling

We don’t log every keystroke; we log the final submitted query, or we sample ~1% of traffic for analytics.
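A tiny sketch of request-level sampling (the 1% rate comes from above; hashing the request id instead of calling random() is an assumption, chosen so a given request is deterministically in or out):

import hashlib

SAMPLE_RATE = 0.01   # log ~1% of typeahead traffic for analytics

def should_sample(request_id):
    # Hash-based sampling is deterministic, which simplifies downstream joins and debugging.
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < SAMPLE_RATE * 10_000

sampled = sum(should_sample(f"req-{i}") for i in range(100_000))
print(f"sampled {sampled} of 100,000 requests (~1%)")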


7. Implementation: A Basic Typeahead Class

class TrieNode:
    def __init__(self):
        self.children = {}        # char -> TrieNode
        self.top_candidates = []  # pre-sorted list of (score, text), top 5

    def insert(self, word, score):
        # Keep this node's cached top-K up to date at write time.
        self.top_candidates.append((score, word))
        self.top_candidates.sort(key=lambda x: -x[0])
        self.top_candidates = self.top_candidates[:5]


class TypeaheadSystem:
    def __init__(self):
        self.root = TrieNode()

    def add_query(self, query, score):
        node = self.root
        for char in query:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
            # Crucial: update the pre-computed top-K at every node on the path
            node.insert(query, score)

    def suggest(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        return [text for _, text in node.top_candidates]


# Usage
ta = TypeaheadSystem()
ta.add_query("amazon", 100)
ta.add_query("amazing", 50)
ta.add_query("am", 10)

print(ta.suggest("am"))
# ['amazon', 'amazing', 'am'] -- already sorted by score because we sorted on insert

8. Monitoring & Metrics

  1. Zero Results Rate: % of prefixes that yield no suggestions.
  2. Keystroke Latency: p99 latency (must be < 50ms).
  3. Selection Rate: How often users click a suggestion (and at which position) versus typing out the full query themselves.
    • High click rate on #1 = Good Ranking.
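A toy sketch of how these might be computed from sampled events (the event schema below is hypothetical):

# Hypothetical event schema: (prefix, num_suggestions_shown, clicked_rank or None).
events = [
    ("sys", 10, 1),      # user clicked suggestion #1
    ("qzx", 0, None),    # zero results, user kept typing
    ("jav", 10, None),   # suggestions shown but user typed the full query
    ("amaz", 10, 2),
]

zero_results_rate = sum(1 for _, n, _c in events if n == 0) / len(events)
selection_rate = sum(1 for _, _n, c in events if c is not None) / len(events)
top1_share = (
    sum(1 for _, _n, c in events if c == 1)
    / max(1, sum(1 for _, _n, c in events if c is not None))
)
print(f"zero-results {zero_results_rate:.0%}, selection {selection_rate:.0%}, top-1 share {top1_share:.0%}")
# zero-results 25%, selection 50%, top-1 share 50%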

9. Failure Modes

  1. The “Dirty Words” Problem: The algorithm has no judgment of its own. If users search for hate speech, the Trie suggests it.
    • Mitigation: Strict “Blacklist” filter applied during the Offline Build phase.
  2. Shard Failure: If the ‘S’ shard goes down, nobody can search “System”.
    • Mitigation: Replica sets (3 replicas per shard).
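A minimal sketch of the offline blacklist filter (the blocked terms and the simple substring rule are placeholder assumptions; production systems use far more sophisticated classifiers):

# Applied once per offline build, before the trie is serialized.
BLOCKLIST = {"badword", "slur_example"}   # placeholder terms

def is_allowed(query):
    q = query.lower()
    return not any(term in q for term in BLOCKLIST)

aggregated = [("java", 1_000_000), ("badword memes", 90_000), ("japan", 500_000)]
clean = [(q, freq) for q, freq in aggregated if is_allowed(q)]
print(clean)   # [('java', 1000000), ('japan', 500000)]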

10. Real-World Example: Facebook Unicorn

Facebook uses a system called Unicorn. It’s a graph-based search engine. When you type “John”, it doesn’t just look for the string “John”. It looks for: Friends named John > Friends of Friends > Celebrities. They use Social Graph Context as a ranking signal, not just global frequency.


11. Cost Analysis

Storing 1 Billion queries in a naive Trie is RAM heavy. Optimization: Ternary Search Trees (TST) or Radix Trees. These compress the storage.

  • Chains of single-child nodes are merged into one multi-character segment (e.g., “s-y-s” collapses into a single “sys” node).
  • Reduces RAM/pointer overhead by roughly 60%.
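A minimal sketch of path compression on a plain trie (the node layout and word set are illustrative; a production radix tree would also carry the top-K caches and use compact arrays instead of dicts):

class Node:
    def __init__(self):
        self.children = {}   # edge label (1+ chars) -> Node
        self.is_word = False

def build_trie(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.is_word = True
    return root

def compress(node):
    """Radix-style path compression: merge chains of single-child, non-terminal nodes."""
    new_children = {}
    for label, child in node.children.items():
        while len(child.children) == 1 and not child.is_word:
            (ch, grandchild), = child.children.items()
            label += ch
            child = grandchild
        compress(child)
        new_children[label] = child
    node.children = new_children

root = build_trie(["system", "systolic"])
compress(root)
print(list(root.children))                    # ['syst'] -- one edge instead of four single-char nodes
print(list(root.children["syst"].children))   # ['em', 'olic']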

12. Key Takeaways

  1. Pre-computation: Never sort at query time. Sort at write time.
  2. Prefix Sharding: Hashing is messy for ranges. Use prefix-based sharding with careful heat-balancing.
  3. Browser Caching: The fastest query is the one served from localStorage.
  4. Privacy: Typeahead leaks intent. Encrypt the channel (HTTPS) and anonymize the logs.

FAQ

How does a typeahead system return results so fast?

Instead of traversing the entire subtree at query time, each trie node stores a pre-sorted list of the top-K most popular completions that pass through it. Looking up suggestions for any prefix is just a short pointer chase to the right node followed by returning the cached list, so the cost depends only on the prefix length, not on the total dictionary size.

How do you handle trending searches in a typeahead system?

A hybrid architecture merges a static trie built from weekly offline aggregation with a small dynamic trie (limited to around 200MB) that ingests high-velocity queries from the last 15 minutes via a stream processor like Flink or Storm. The final result is a merge of both sources, ensuring breaking news appears within minutes.

What data structure optimizations reduce trie memory usage?

Ternary Search Trees and Radix Trees compress chains of single-child nodes into multi-character path segments (e.g., “s-y-s” becomes one “sys” node). This eliminates intermediate pointer overhead and reduces RAM consumption by roughly 60%, making it feasible to store billions of queries in memory.

How is a typeahead system sharded across multiple servers?

Prefix-based sharding assigns query ranges to different servers (e.g., A-C on Node 1, D-F on Node 2), though careful heat-balancing is needed since prefixes like “S” see far more traffic than “X”. Hash partitioning on the first 2-3 characters is an alternative that distributes load more evenly but complicates range queries.


Originally published at: arunbaby.com/ml-system-design/0048-trie-based-search
