
“Finding ‘Jon’ when the user types ‘John’, or ‘Symphony’ when they say ‘Simfoni’.”

1. The Problem: Spelling vs. Sound

In text search, we assume the user knows the spelling. In speech applications (Voice Search, Podcast Search), users often search by sound.

Challenges:

  1. Homophones: “Meat” vs “Meet”, “Night” vs “Knight”.
  2. Name Variations: “Jon”, “John”, “Jhon”.
  3. ASR Errors: User said “Keras”, ASR transcribed “Carrots”.
  4. Out-of-Vocabulary (OOV): Searching for a new artist name not in the ASR dictionary.

Phonetic Search indexes content by how it sounds, not just how it’s spelled. This is critical for music search (“Play [Artist]”), contact search (“Call [Name]”), and podcast retrieval.

2. Classical Phonetic Algorithms

These algorithms map words to a phonetic code. Words with similar pronunciation get the same code.

Soundex (1918)

Maps a name to a 4-character code (Letter + 3 Digits). Rules:

  1. Keep the first letter.
  2. Remove vowels (a, e, i, o, u, y) and ‘h’, ‘w’.
  3. Map consonants to digits:
    • b, f, p, v $\to$ 1
    • c, g, j, k, q, s, x, z $\to$ 2
    • d, t $\to$ 3
    • l $\to$ 4
    • m, n $\to$ 5
    • r $\to$ 6
  4. Merge adjacent duplicates.
  5. Pad with zeros or truncate to length 4.

Example:

  • Smith:
    • Keep ‘S’.
    • m $\to$ 5, i (drop), t $\to$ 3, h (drop).
    • Result: S530.
  • Smyth:
    • Keep ‘S’.
    • m $\to$ 5, y (drop), t $\to$ 3, h (drop).
    • Result: S530.
  • Match!
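
A from-scratch sketch of the simplified rules above (named simple_soundex to avoid clashing with library versions; real implementations handle extra edge cases, e.g., first-letter duplicates in "Pfister"):

def simple_soundex(word: str) -> str:
    """Soundex per the rules above: keep the first letter, drop
    vowels/h/w, code consonants, merge duplicates, pad to length 4."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    first = word[0].upper()
    # Steps 2-3: drop vowels/y/h/w, map remaining consonants to digits.
    digits = [codes[ch] for ch in word[1:] if ch in codes]
    # Step 4: merge adjacent duplicate digits.
    merged = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    # Step 5: pad with zeros / truncate to length 4.
    return (first + "".join(merged) + "000")[:4]

print(simple_soundex("Smith"), simple_soundex("Smyth"))  # S530 S530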

Metaphone & Double Metaphone (1990, 2000)

Soundex is too simple: it fails on “Phone” vs “Fawn” (P500 vs F500, no match, because the first letter is kept verbatim). Metaphone uses more complex rules based on English pronunciation.

  • Double Metaphone: Returns two codes (Primary and Alternate) to handle ambiguity (e.g., foreign names).
    • “Schmidt” $\to$ Primary: XMT (German), Alternate: SMT (Anglicized).
import jellyfish

# Soundex
code1 = jellyfish.soundex("Smith")  # 'S530'
code2 = jellyfish.soundex("Smyth")  # 'S530'
print(f"Soundex: {code1} == {code2}: {code1 == code2}")  # True

# Metaphone
code3 = jellyfish.metaphone("Phone")  # 'FN'
code4 = jellyfish.metaphone("Fawn")   # 'FN'
print(f"Metaphone: {code3} == {code4}: {code3 == code4}")  # True

# Double Metaphone is not part of jellyfish (its metaphone() returns a
# single string). For the two-code variant, use e.g. the `metaphone` package:
# from metaphone import doublemetaphone
# doublemetaphone("Schmidt")  # ('XMT', 'SMT')

3. Neural Phonetic Embeddings

Classical algorithms are rule-based and English-centric. They fail on names like “Siobhan” (pronounced “Shiv-on”) unless explicitly coded. Neural approaches learn phonetic similarity from data.

Siamese Network on Phoneme Sequences:

  1. G2P (Grapheme-to-Phoneme): Convert word to phonemes.
    • “Phone” $\to$ F OW N
    • “Fawn” $\to$ F AO N
  2. Encoder: Use an LSTM or Transformer to encode the phoneme sequence into a vector.
  3. Triplet Loss: Train the model such that:
    • Distance(Phone, Fawn) is small (Positive pair).
    • Distance(Phone, Cake) is large (Negative pair).

Architecture:

Word A -> G2P -> Phonemes A -> [Bi-LSTM] -> Vector A
                                              ↓
                                          Similarity (Cosine)
                                              ↑
Word B -> G2P -> Phonemes B -> [Bi-LSTM] -> Vector B
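
A minimal PyTorch sketch of this setup, using nn.TripletMarginLoss for the triplet objective (the phoneme IDs below are random placeholders rather than real G2P output):

import torch
import torch.nn as nn

class PhoneticEncoder(nn.Module):
    """Bi-LSTM over phoneme IDs -> fixed-size embedding (the diagram above)."""
    def __init__(self, phone_vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(phone_vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, phones):  # phones: [batch, seq_len]
        out, _ = self.lstm(self.embedding(phones))
        return out.mean(dim=1)  # mean-pool over time -> [batch, 2*hidden_dim]

encoder = PhoneticEncoder(phone_vocab_size=50)
triplet = nn.TripletMarginLoss(margin=1.0)

# Toy batch: anchor="Phone", positive="Fawn", negative="Cake".
anchor, pos, neg = (torch.randint(0, 50, (8, 5)) for _ in range(3))
loss = triplet(encoder(anchor), encoder(pos), encoder(neg))
loss.backward()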

Benefit:

  • Handles cross-lingual sounds.
  • Learns subtle variations (accents).
  • Can be trained on “misspelled” search logs (e.g., users typing “Beyonce” as “Beyonse”).

4. Fuzzy Search on ASR Lattices

When searching inside audio (e.g., “Find where they mentioned ‘TensorFlow’ in this podcast”), relying on the 1-best transcript is risky.

  • Audio: “I used Keras today.”
  • ASR 1-best: “I used Carrots today.”
  • Search for “Keras” fails.

ASR Lattice: A graph of alternative transcriptions.

        /-- Carrots (0.6) --\
I used                       today
        \-- Keras (0.4) ----/

The lattice contains “Keras” with a lower probability.

Phonetic Indexing Strategy:

  1. Lattice-to-Phonemes: Convert all paths in the lattice to phoneme sequences.
    • Path 1: K AE R AH T S (Carrots)
    • Path 2: K EH R AH S (Keras)
  2. Index Phoneme N-grams: Index 3-grams like K EH R, EH R AH, R AH S.
  3. Query Processing:
    • Query: “TensorFlow” $\to$ T EH N S ER F L OW.
    • Search for phoneme n-gram matches in the index.
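
A minimal sketch of steps 1–3 over plain Python dicts (pronunciations hard-coded for illustration; a real system derives them from the lattice and G2P):

from collections import defaultdict

def phoneme_ngrams(phones, n=3):
    """Overlapping phoneme n-grams of a pronunciation."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

# Steps 1-2: index every lattice path's phoneme n-grams.
index = defaultdict(set)
lattice_paths = {
    "path1": "K AE R AH T S".split(),  # "Carrots" (1-best)
    "path2": "K EH R AH S".split(),    # "Keras" (alternative)
}
for path_id, phones in lattice_paths.items():
    for gram in phoneme_ngrams(phones):
        index[gram].add(path_id)

# Step 3: count n-gram hits per path and rank by overlap.
def search(query_phones):
    hits = defaultdict(int)
    for gram in phoneme_ngrams(query_phones):
        for path_id in index.get(gram, ()):
            hits[path_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

print(search("K EH R AH S".split()))  # ['path2'] -> "Keras" is findable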

Elasticsearch Phonetic Token Filter: Elasticsearch supports this via the official analysis-phonetic plugin (installed separately).

{
  "settings": {
    "analysis": {
      "filter": {
        "my_metaphone": {
          "type": "phonetic",
          "encoder": "metaphone",
          "replace": false
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_metaphone"]
        }
      }
    }
  }
}
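
With this analyzer on a field, “Smith” and “Smyth” index to the same Metaphone token, so a plain match query retrieves both spellings. A query-side sketch (the field name person_name and its mapping to my_analyzer are assumptions):

{
  "query": {
    "match": {
      "person_name": "Smyth"
    }
  }
}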

5. Deep Dive: Grapheme-to-Phoneme (G2P)

To perform phonetic search, we need to convert text (Graphemes) to sound (Phonemes).

Dictionary Lookup (CMU Dict):

  • HELLO HH AH L OW
  • Fast, highly accurate for common words.
  • Fails on OOV words (names, slang, new brands).
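
A quick dictionary-lookup sketch, assuming the cmudict package from PyPI (NLTK bundles the same data):

import cmudict  # pip install cmudict

pronunciations = cmudict.dict()  # word -> list of phoneme sequences
print(pronunciations.get("hello"))    # [['HH', 'AH0', 'L', 'OW1']]
print(pronunciations.get("chatgpt"))  # None -> OOV, needs neural G2P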

Neural G2P:

  • Seq2Seq Model: Transformer trained to translate spelling to pronunciation.
  • Input: C H A T G P T
  • Output: CH AE T JH IY P IY T IY
import torch
import torch.nn as nn

# Simplified G2P model. Note: as written, this tags one phoneme per input
# character; a real G2P model is an encoder-decoder (seq2seq), since the
# phoneme sequence is usually a different length than the spelling.
class G2PModel(nn.Module):
    def __init__(self, char_vocab_size, phone_vocab_size, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(char_vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, phone_vocab_size)

    def forward(self, x):
        # x: [batch, seq_len] (character indices)
        embedded = self.embedding(x)
        output, (hidden, cell) = self.lstm(embedded)
        # output: [batch, seq_len, hidden_dim]
        logits = self.fc(output)  # [batch, seq_len, phone_vocab_size]
        return logits

6. Deep Dive: Posteriorgram Search

End-to-end ASR models (like Wav2Vec2) output a probability distribution over characters/phonemes at each time step (frame).

Posteriorgram Search: Instead of decoding to text (which makes hard decisions), store the Phonetic Posteriorgram (matrix of Time $\times$ Phoneme probabilities).

Query: “Alexa” $\to$ AH L EH K S AH. Search:

  1. Convert Query to a template matrix.
  2. Slide the query template over the stored posteriorgram.
  3. Compute Dynamic Time Warping (DTW) distance at each position.
  4. If distance < threshold, return timestamp.
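
A minimal NumPy sketch of steps 1–4, assuming query and posteriorgram are arrays of per-frame phoneme probabilities (shapes [Tq, P] and [T, P]); the 0.4 threshold and 1.5x window are illustrative:

import numpy as np

def dtw_distance(query: np.ndarray, window: np.ndarray) -> float:
    """DTW cost between a query template [Tq, P] and a posteriorgram
    window [Tw, P], using (1 - cosine similarity) as the frame cost."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-9)
    w = window / (np.linalg.norm(window, axis=1, keepdims=True) + 1e-9)
    cost = 1.0 - q @ w.T  # [Tq, Tw] frame-pair costs
    Tq, Tw = cost.shape
    acc = np.full((Tq + 1, Tw + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Tw + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[Tq, Tw] / (Tq + Tw)  # length-normalized

def sliding_search(query, posteriorgram, threshold=0.4):
    """Slide a window over the posteriorgram; return (frame, score) hits."""
    win = int(1.5 * len(query))  # allow some temporal warping
    hits = []
    for start in range(len(posteriorgram) - win + 1):
        d = dtw_distance(query, posteriorgram[start:start + win])
        if d < threshold:
            hits.append((start, d))
    return hits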

Pros:

  • No Vocabulary Limit: Can search for words that didn’t exist when the model was trained.
  • Robust: Works even if the ASR is unsure (high entropy).

Cons:

  • Storage: Storing float matrices is expensive compared to text.
  • Compute: DTW is $O(N \cdot M)$.

7. Deep Dive: Weighted Finite State Transducers (WFSTs)

In professional speech systems (like Kaldi), search is often implemented using WFSTs.

Concept:

  • FST: A graph where edges have input labels, output labels, and weights.
  • Composition ($A \circ B$): Combining two FSTs.

Search as Composition: To search for a query $Q$ in a lattice $L$:

  1. Represent Query $Q$ as an FST (accepts the query phonemes).
  2. Represent Lattice $L$ as an FST (paths are phoneme sequences).
  3. Compute Intersection: $I = Q \circ L$.
  4. If $I$ is non-empty, the query exists in the lattice.
  5. The shortest path in $I$ gives the best time alignment and confidence score.

Why WFSTs?

  • Efficiency: Operations like determinization and minimization optimize the graph for fast search.
  • Flexibility: Can easily add “fuzzy” matching by composing with an Edit Distance FST ($E$).
    • Search = $Q \circ E \circ L$.
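
A toy sketch using the pynini WFST library (phonemes collapsed to single characters purely for brevity):

import pynini  # pip install pynini

# Lattice L: two weighted alternatives ("Carrots" vs "Keras"), with
# tropical weights ~ -log(probability), so lower = more likely.
lattice = pynini.union(
    pynini.accep("KARATS", weight=0.51),  # "Carrots", p=0.6
    pynini.accep("KERAS", weight=0.92),   # "Keras",   p=0.4
).optimize()

# Query Q as an acceptor over the same (toy) alphabet.
query = pynini.accep("KERAS")

# Search = composition; a non-empty result means the query is in the lattice.
result = (query @ lattice).optimize()
print(result.num_states() > 0)               # True -> "Keras" was found
print(pynini.shortestpath(result).string())  # 'KERAS'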

8. Phonetic Search vs. Audio Fingerprinting

Don’t confuse Phonetic Search with Audio Fingerprinting (e.g., Shazam).

| Feature | Audio Fingerprinting (Shazam) | Phonetic Search |
| --- | --- | --- |
| Goal | Identify the exact recording | Find spoken words |
| Input | Audio snippet | Text query or audio snippet |
| Matching | Spectrogram peaks (constellation map) | Phonemes / words |
| Robustness | Robust to noise, but needs the exact recording | Robust to different speakers/accents |
| Use case | “What song is this?” | “Find the podcast about AI” |

9. Case Study: Brand Names in Voice Shopping

Scenario: The user says “I want to buy a Swarovski necklace”. Challenge: “Swarovski” is hard to spell and pronounce, so the ASR might output “Swaroski”, “Swarofski”, or “Svaroski”.

Solution:

  1. Brand Name Expansion:
    • Generate phonetic variants for all brand names offline.
    • Swarovski $\to$ S W AO R AA F S K IY, S W ER AA F S K IY, etc.
  2. Fuzzy Indexing:
    • Index the product catalog using these phonetic variants.
  3. Query Expansion:
    • At runtime, generate phonetic variants for the user’s query.
    • Search for all variants in the index.
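
A minimal sketch of steps 1 and 3 using a hand-written variant table (the variants and the canonicalize helper are illustrative assumptions; offline, the variants would come from a G2P/phonetic model):

# Offline: phonetic variants per brand (illustrative, hand-curated here).
BRAND_VARIANTS = {
    "Swarovski": ["swarovski", "swaroski", "swarofski", "svaroski"],
}

# Build a variant -> canonical-name map for O(1) runtime lookup.
variant_to_brand = {
    v: brand for brand, variants in BRAND_VARIANTS.items() for v in variants
}

def canonicalize(asr_token: str):
    """Map a possibly misrecognized ASR token back to a known brand."""
    return variant_to_brand.get(asr_token.lower())

print(canonicalize("Swaroski"))  # 'Swarovski'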

10. Deep Dive: Evaluation Metrics

How do we measure if our phonetic search is working?

  1. Term Error Rate (TER):
    • Specific to Keyword Search.
    • (False Negatives + False Positives) / Total Occurrences of Keyword.
  2. Mean Average Precision (MAP):
    • Standard IR metric.
  3. Word Error Rate (WER):
    • Standard ASR metric, but often uncorrelated with Search success.
    • ASR might get “to” vs “too” wrong (WER increase), but search doesn’t care.
    • ASR might get “Taylor” vs “Tyler” wrong (WER increase), and search fails catastrophically.
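
A trivial sketch of the TER computation above, with made-up counts:

def term_error_rate(misses: int, false_alarms: int, occurrences: int) -> float:
    """(False Negatives + False Positives) / total true occurrences."""
    return (misses + false_alarms) / occurrences

# Keyword truly occurs 10 times; we find 7 (3 misses) and raise 2 false alarms.
print(term_error_rate(misses=3, false_alarms=2, occurrences=10))  # 0.5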

Implementation: Phonetic Search Engine

import jellyfish
from collections import defaultdict

class PhoneticSearchEngine:
    def __init__(self):
        self.index = defaultdict(set)  # phonetic code -> doc ids
        self.documents = {}

    def add_document(self, doc_id, text):
        self.documents[doc_id] = text

        # Tokenize
        tokens = text.split()
        for token in tokens:
            # Index by Soundex (high recall, low precision)
            code = jellyfish.soundex(token)
            self.index[code].add(doc_id)

            # Index by Metaphone (better precision)
            meta = jellyfish.metaphone(token)
            self.index[meta].add(doc_id)

    def search(self, query):
        results = set()
        tokens = query.split()

        for token in tokens:
            # Try Soundex
            code = jellyfish.soundex(token)
            results.update(self.index.get(code, ()))

            # Try Metaphone
            meta = jellyfish.metaphone(token)
            results.update(self.index.get(meta, ()))

        return [self.documents[did] for did in results]

# Usage
engine = PhoneticSearchEngine()
engine.add_document(1, "John Smith is a developer")
engine.add_document(2, "Jon Smyth is a coder")
engine.add_document(3, "Joan Smite is a manager")

print("Query: Jhon Smith")
results = engine.search("Jhon Smith")
for doc in results:
    print(f"- {doc}")
# Returns all three documents: Soundex maps Jhon/John/Jon/Joan to J500
# and Smith/Smyth/Smite to S530 - a good illustration of Soundex's low
# precision. A Metaphone-only index (or a ranking step) is needed to
# push the closer matches (Docs 1 and 2) above Doc 3.

Top Interview Questions

Q1: How do you handle names with multiple pronunciations? Answer: Use a Probabilistic Lexicon:

  • Data $\to$ D EY T AH (0.6)
  • Data $\to$ D AE T AH (0.4)

Index both phonetic sequences; during search, match against either.

Q2: Soundex vs. Metaphone vs. Neural? Answer:

  • Soundex: Fast, low precision (many false positives), English only. Good for blocking/filtering.
  • Metaphone: Better precision, handles more rules. Good for general text search.
  • Neural: Best for cross-lingual, complex names, and noisy audio. Slower inference.

Q3: How to search for a keyword in 10,000 hours of audio? Answer: Do not run ASR on demand (too slow).

  1. Offline: Run ASR/Phonetic decoding and index the output (Lattices or Text).
  2. Online: Search the index (Inverted Index).
  3. Keyword Spotting (KWS): If you only care about a specific wake word (e.g., “Hey Siri”), use a small streaming model on raw audio, not a search index.

Q4: How to improve recall for “out of vocabulary” terms? Answer: Use Subword/Phonetic Indexing. Instead of indexing whole words (which requires them to be in the dictionary), index character n-grams or phoneme n-grams. This allows partial matching even if the word itself is unknown.

Q5: What is the difference between Keyword Spotting and Phonetic Search? Answer:

  • KWS: Detects a pre-defined set of keywords in real-time audio (Binary Classification).
  • Phonetic Search: Finds arbitrary queries in a large database of recorded audio (Information Retrieval).

Key Takeaways

  1. Phonetic Search bridges the gap between spelling and sound, essential for voice interfaces.
  2. Classical Algorithms (Soundex, Metaphone) are effective baselines for text-based phonetic matching.
  3. Neural G2P + Embeddings offer state-of-the-art accuracy and cross-lingual support.
  4. ASR Lattices contain rich information (alternatives) that prevent search failures due to 1-best errors.
  5. Elasticsearch supports phonetic matching via the official analysis-phonetic plugin, making it easy to deploy.

Summary

| Aspect | Insight |
| --- | --- |
| Core Problem | Searching by sound, not spelling |
| Algorithms | Soundex, Metaphone, Neural Embeddings |
| Indexing | Phoneme n-grams, ASR Lattices |
| Applications | Voice Search, Podcast Indexing, Name Matching |

Originally published at: arunbaby.com/speech-tech/0032-phonetic-search

If you found this helpful, consider sharing it with others who might benefit.