End-to-End Speech Model Design
Goodbye HMMs. Goodbye Phonemes. Goodbye Lexicons. We are teaching the machine to Listen, Attend, and Spell.
Problem Statement
Traditional ASR systems (Hybrid HMM-DNN) are a “Frankenstein” of separate components:
- Acoustic Model: Maps audio to phonemes.
- Lexicon: Maps phonemes to words.
- Language Model: Maps words to sentences.
- G2P (Grapheme-to-Phoneme): Handles unknown words.
The Problem: Errors cascade. If the G2P fails, the Lexicon fails. If the Lexicon fails, the ASR fails. Optimizing one component doesn’t necessarily improve the whole system (WER).
The Solution: End-to-End (E2E) ASR. One Neural Network. Input: Audio. Output: Text. Optimize a single loss function.
Fundamentals: The Three Pillars of E2E
There are three main architectures for E2E ASR. Each solves the alignment problem (Audio Length (T) » Text Length (L)) differently.
- CTC (Connectionist Temporal Classification):
- Mechanism: Predicts a token (or “blank”) for every frame. Merges repeats.
- Assumption: Frames are conditionally independent.
- Pros: Fast, non-autoregressive.
- Cons: Weak Language Modeling capabilities.
- AED (Attention-based Encoder-Decoder) / LAS:
- Mechanism: “Listen, Attend and Spell”. Encoder processes audio. Decoder attends to encoder outputs to generate text.
- Pros: Best accuracy (Global context).
- Cons: Not streaming friendly (needs full audio). (O(T \cdot L)) complexity.
- RNN-T (Transducer):
- Mechanism: Combines an Acoustic Encoder and a Label Encoder (LM) via a Joint Network.
- Pros: Streaming friendly. Strong LM integration.
- Cons: Memory intensive training.
Architecture 1: Listen, Attend and Spell (LAS)
LAS (Google, 2015) was the breakthrough that proved E2E could match Hybrid systems.
[ "C", "A", "T" ] <-- Output
^
|
+-------------+
| Speller | (Decoder / RNN)
| (Attention) |
+-------------+
^
| Context Vector
+-------------+
| Listener | (Encoder / Pyramidal RNN)
+-------------+
^
|
[ Spectrogram ] <-- Input
The Listener (Encoder)
A deep LSTM (or Conformer) that converts low-level features (Filterbanks) into high-level acoustic features.
Pyramidal Structure: We must reduce the time resolution. Audio features arrive at ~100 frames/sec, while speech unfolds at only ~3-4 syllables/sec.
The Listener performs subsampling (stride 2 pooling) to reduce (T) by 4x or 8x.
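One simple way to get this reduction is to concatenate adjacent frame pairs between stacked encoder layers, halving (T) and doubling the feature dimension each time. A minimal sketch of that idea (the function name is illustrative):
import torch

def pyramid_reduce(x):
    # x: [Batch, Time, Dim] -> [Batch, Time // 2, Dim * 2]
    B, T, D = x.shape
    T = T - (T % 2)                      # drop a trailing frame if Time is odd
    return x[:, :T, :].reshape(B, T // 2, D * 2)

x = torch.randn(8, 100, 80)              # 8 utterances, 100 frames, 80 filterbanks
print(pyramid_reduce(pyramid_reduce(x)).shape)  # torch.Size([8, 25, 320]) -> 4x reduction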
The Speller (Decoder)
An RNN that generates one character at a time. At step (i), it computes an Attention score over all encoder states (h). [ c_i = \sum \alpha_{i,j} h_j ] It uses (c_i) and the previous character (y_{i-1}) to predict (y_i).
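As a concrete illustration of the context-vector equation, here is a minimal attention sketch. LAS uses an MLP-based energy function; a plain dot product is used here for brevity, and the names are illustrative.
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    # decoder_state:  [Batch, Dim]        (Speller state s_i)
    # encoder_states: [Batch, Time, Dim]  (Listener outputs h)
    energy = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)  # [B, T]
    alpha = F.softmax(energy, dim=-1)                                            # attention weights
    context = torch.bmm(alpha.unsqueeze(1), encoder_states).squeeze(1)           # c_i = sum_j alpha_ij h_j
    return context, alpha

h = torch.randn(4, 200, 256)   # 4 utterances, 200 encoder frames
s = torch.randn(4, 256)
c, alpha = attend(s, h)
print(c.shape, alpha.shape)    # torch.Size([4, 256]) torch.Size([4, 200])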
Architecture 2: RNN-Transducer (RNN-T)
RNN-T is the industry standard for Streaming ASR (Siri, Assistant).
[ Softmax ]
^
|
+-------------+
| Joint Net | (Feed Forward)
+-------------+
^ ^
| |
+-------------+ +-------------+
| Encoder | | Prediction |
| (Audio) | | Network |
+-------------+ | (Text) |
^ +-------------+
| ^
[ Spectrogram ] |
[ Previous Token ]
The Components
- Encoder (Transcription Network): Analogous to the Acoustic Model. Processes audio.
- Prediction Network: Analogous to the Language Model. Processes the history of non-blank tokens.
- Joint Network: Combines them. [ J(t, u) = \text{ReLU}(W_e h_t + W_p g_u) ] [ P(y | t, u) = \text{Softmax}(W_o J(t, u)) ]
Why it wins: It is monotonic. It can only move forward in time. This makes it perfect for streaming.
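To make the shapes concrete, here is a hedged sketch of the Joint Network above: it broadcasts encoder outputs over (T) and prediction-network outputs over (U). Module and dimension names are illustrative, not a specific production recipe.
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Sketch of the RNN-T joint: J(t, u) = ReLU(W_e h_t + W_p g_u)."""
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, joint_dim)
        self.w_pred = nn.Linear(pred_dim, joint_dim)
        self.w_out = nn.Linear(joint_dim, vocab_size)  # vocab includes the blank token

    def forward(self, enc_out, pred_out):
        # enc_out: [B, T, enc_dim], pred_out: [B, U, pred_dim]
        # Broadcast over the (T, U) grid -> [B, T, U, joint_dim]
        j = torch.relu(self.w_enc(enc_out).unsqueeze(2) + self.w_pred(pred_out).unsqueeze(1))
        return self.w_out(j)  # [B, T, U, V] logits; softmax gives P(y | t, u)

joint = JointNetwork(enc_dim=512, pred_dim=320, joint_dim=640, vocab_size=1000)
logits = joint(torch.randn(2, 100, 512), torch.randn(2, 20, 320))
print(logits.shape)  # torch.Size([2, 100, 20, 1000])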
The Mathematics of the Blank Token
Why do we need a blank?
Consider the word “too”.
Characters: t -> o -> o.
Acoustically (10ms frames): t t t o o o o o o.
If we just collapse repeats: t o (“to”). We lost the second ‘o’.
With Blank (-):
t t t - o o o - o o o -> t - o - o -> too.
The blank is a mandatory separator for repeated characters.
Probability Distribution: Usually, the Blank probability dominates.
- P(-) > 0.9 for most frames (silence or steady state).
- P(char) spikes only at the transition boundaries.
- This “Spiky” behavior is characteristic of CTC.
Deep Dive: End-of-Sentence (EOS) Detection
In streaming ASR, the user never presses “Stop”. The model must decide when to stop listening.
1. Voice Activity Detection (VAD)
- Energy-based: If volume < threshold for 500ms.
- Model-based: A small NN (Silero VAD) classifies frames as Speech or Silence.
- Logic: if silence_duration > 700ms: send_eos().
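A toy sketch of the energy-based endpointing logic above; the threshold values are illustrative, not tuned.
import numpy as np

def energy_endpoint(frames, energy_threshold=1e-4, max_trailing_silence_ms=700, frame_shift_ms=10):
    """Return the frame index at which to send EOS, or None if speech is still ongoing.

    frames: [N, frame_len] array of audio frames.
    """
    silence_run = 0
    for i, frame in enumerate(frames):
        energy = float(np.mean(frame ** 2))
        silence_run = silence_run + 1 if energy < energy_threshold else 0
        if silence_run * frame_shift_ms > max_trailing_silence_ms:
            return i  # enough trailing silence: close the microphone
    return None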
2. Decoder-based EOS
- The ASR model itself can predict a special <EOS> token.
- Problem: E2E models are trained on trimmed audio. They rarely see long silences. They tend to hallucinate during silence.
- Fix: Train with “Endpointing” data (audio with trailing silence).
3. Semantic Endpointing
- Wait for the NLU to confirm the command is complete.
- “Turn off the…” (Wait)
- “…lights” (Execute).
- If the user pauses after “lights”, the NLU says “Complete Intent”, so we close the mic.
Deep Dive: Shallow Fusion Math
Shallow Fusion is the most common way to boost ASR with an external Language Model (trained on text).
The Equation: [ \hat{y} = \text{argmax}_y \left( \log P_{ASR}(y|x) + \lambda \log P_{LM}(y) + \beta \cdot \text{len}(y) \right) ]
- (P_{ASR}(y|x)): The probability from the E2E model (AM).
- (P_{LM}(y)): The probability from the external LM (e.g., GPT-2).
- (\lambda) (Lambda): The weight of the LM (usually 0.1 - 0.5).
- (\beta) (Beta): Length reward. E2E models tend to prefer short sentences. This forces them to generate longer output.
Why it works: The ASR model is good at acoustics (“It sounds like ‘red’”). The LM is good at grammar (“‘The read apple’ is wrong, ‘The red apple’ is right”). By combining them, we fix homophone errors.
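As a tiny illustration of the fusion equation, here is a hypothetical scoring helper with made-up log-probabilities; in a real decoder this score is computed per hypothesis inside Beam Search, and the weights are tuned on a dev set.
def shallow_fusion_score(log_p_asr, log_p_lm, length, lam=0.3, beta=0.5):
    """Combined beam-search score for one hypothesis.

    log_p_asr: log P_ASR(y | x) from the E2E model
    log_p_lm:  log P_LM(y) from the external text LM
    lam, beta: LM weight and length reward (values here are illustrative).
    """
    return log_p_asr + lam * log_p_lm + beta * length

# "the red apple" vs "the read apple": the LM term breaks the near-tie in acoustics
print(shallow_fusion_score(log_p_asr=-4.1, log_p_lm=-6.0, length=3))   # -4.4
print(shallow_fusion_score(log_p_asr=-4.0, log_p_lm=-11.0, length=3))  # -5.8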
Deep Dive: The Cocktail Party Problem (Multi-Speaker)
Standard ASR fails when two people talk at once. Solution: Permutation Invariant Training (PIT).
- Output: The model outputs two streams of text: (y_1) and (y_2).
- Loss: We calculate the loss for both permutations:
- Loss A: (L(y_1, \text{Ref}_1) + L(y_2, \text{Ref}_2))
- Loss B: (L(y_1, \text{Ref}_2) + L(y_2, \text{Ref}_1))
- Update: We backpropagate the minimum of Loss A and Loss B. [ L = \min(\text{Loss A}, \text{Loss B}) ]
This teaches the model to separate the speakers without forcing it to assign “Speaker 1” to a specific output channel.
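A minimal sketch of the PIT loss for two output streams. Cross-entropy stands in for the ASR loss here; any sequence loss works the same way, and the names are illustrative.
import torch
import torch.nn.functional as F

def pit_loss_two_speakers(pred1, pred2, ref1, ref2):
    # pred*: [Batch, Len, Vocab] logits, ref*: [Batch, Len] token ids
    def seq_loss(pred, ref):
        return F.cross_entropy(pred.transpose(1, 2), ref)
    loss_a = seq_loss(pred1, ref1) + seq_loss(pred2, ref2)
    loss_b = seq_loss(pred1, ref2) + seq_loss(pred2, ref1)
    return torch.minimum(loss_a, loss_b)  # backprop through the better assignment

p1, p2 = torch.randn(4, 50, 100), torch.randn(4, 50, 100)
r1, r2 = torch.randint(0, 100, (4, 50)), torch.randint(0, 100, (4, 50))
print(pit_loss_two_speakers(p1, p2, r1, r2))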
Deep Dive: The Alignment Problem
The core difficulty in ASR is that we don’t know which audio frame corresponds to which character.
- HMM-GMM (Old): Used Viterbi Alignment (Hard alignment). We explicitly assigned frame_5 to phoneme /k/.
- E2E (New): Uses Soft Alignment (Attention/CTC). The model learns a probability distribution over alignments.
- CTC: Sums over all valid monotonic alignments.
- Attention: Computes a “Soft” weight vector for every output step.
Deep Dive: Connectionist Temporal Classification (CTC)
CTC is the “Hello World” of E2E ASR. It solves the problem: “I have 1000 audio frames but only 50 characters. How do I align them?”
The Logic of CTC
CTC introduces a special Blank Token (<eps> or -).
It predicts a probability distribution over Vocabulary + {Blank} for every frame.
Decoding Rules:
- Collapse Repeats: aa -> a.
- Remove Blanks: - -> `` (empty).
Example:
- Audio: [frame1, frame2, frame3, frame4, frame5]
- Model Output: c, c, -, a, t
- Collapse: c, -, a, t
- Remove Blanks: cat
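These two rules fit in a few lines; here is a minimal greedy CTC decoder over per-frame labels:
def ctc_greedy_decode(frame_tokens, blank="-"):
    """Collapse repeats, then remove blanks (the CTC decoding rules above)."""
    collapsed = []
    prev = None
    for tok in frame_tokens:
        if tok != prev:          # rule 1: collapse repeated frame labels
            collapsed.append(tok)
        prev = tok
    return "".join(t for t in collapsed if t != blank)  # rule 2: drop blanks

print(ctc_greedy_decode(["c", "c", "-", "a", "t"]))            # cat
print(ctc_greedy_decode(["t", "t", "-", "o", "o", "-", "o"]))  # too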
The CTC Loss Function
We don’t know the exact alignment (e.g., did “c” start at frame 1 or 2?). CTC sums the probability of all valid alignments.
[ P(Y|X) = \sum_{A \in \mathcal{B}^{-1}(Y)} P(A|X) ] Where (\mathcal{B}) is the collapse function.
Forward-Backward Algorithm: Calculating the sum of exponentially many paths is hard. We use Dynamic Programming.
- (\alpha_t(s)): Probability of generating the first (s) tokens of the target by time (t).
- Similar to HMM training.
- Complexity: (O(T \cdot L)).
Limitations of CTC
- Conditional Independence: It assumes the prediction at time (t) depends only on the audio at time (t). It doesn’t know that “q” is usually followed by “u”.
- Spiky Output: CTC tends to wait until it is 100% sure, emits a spike, and then predicts Blanks. This makes it bad for timestamp estimation.
Deep Dive: RNN-Transducer (RNN-T)
RNN-T fixes the “Conditional Independence” problem of CTC.
The Architecture
It has two encoders:
- Audio Encoder: (f^{enc} = \text{Encoder}(x_t))
- Label Encoder (Prediction Network): (g^{pred} = \text{PredNet}(y_{u-1}))
The Joint Network combines them: [ z_{t,u} = \text{Joint}(f^{enc}_t, g^{pred}_u) ]
The Decoding Grid
Imagine a grid where:
- X-axis: Time frames ((T)).
- Y-axis: Output tokens ((U)).
We start at ((0,0)). At each step, we can:
- Emit a Token: Move Up ((t, u+1)). (We output a character, stay at the same audio frame).
- Emit Blank: Move Right ((t+1, u)). (We consume an audio frame, output nothing).
Why is this better? The Prediction Network acts as an Internal Language Model. It knows that after “q”, “u” is likely, regardless of the audio. This allows RNN-T to model language structure much better than CTC.
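A hedged sketch of greedy decoding over this grid. Here pred_net and joint are stand-ins for trained modules; the random logits exist only to make the walk runnable and show the "up vs right" moves.
import torch

def greedy_rnnt_decode(enc_out, pred_net, joint, blank_id=0, max_symbols_per_frame=3):
    """Greedy walk through the (T, U) grid.

    enc_out:  [T, enc_dim] encoder outputs for one utterance.
    pred_net: maps the last emitted token id (or None at start) to a state vector.
    joint:    maps (enc_frame, pred_state) to logits over Vocab + blank.
    Emitting blank moves right (next frame); emitting a token moves up (same frame).
    """
    hypothesis = []
    pred_state = pred_net(None)
    for t in range(enc_out.size(0)):                   # move right through time
        for _ in range(max_symbols_per_frame):         # allow a few "up" moves per frame
            logits = joint(enc_out[t], pred_state)
            token = int(torch.argmax(logits))
            if token == blank_id:
                break                                   # blank: consume the next frame
            hypothesis.append(token)
            pred_state = pred_net(token)                # token: update prediction network
    return hypothesis

# Toy stand-ins so the sketch runs; a real system plugs in trained modules here.
V, D = 6, 8
emb = torch.randn(V, D)
pred_net = lambda tok: torch.zeros(D) if tok is None else emb[tok]
joint = lambda h, g: torch.randn(V)   # random logits, for flow illustration only
print(greedy_rnnt_decode(torch.randn(20, D), pred_net, joint))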
Training RNN-T
The loss function is the negative log-likelihood of the target sequence. Like CTC, we sum over all valid paths through the grid. Memory Issue: The Joint Network computes a tensor of size ((B, T, U, V)).
- (B): Batch Size (32)
- (T): Time Frames (1000)
- (U): Text Length (100)
- (V): Vocabulary (1000)
- Total: (3.2 \times 10^9) floats = ~12GB memory! Fix: Use Pruned RNN-T (k2/icefall), which restricts the loss to a narrow band of the grid around the likely alignment, or memory-efficient CUDA kernels (warp-rnnt).
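A quick back-of-the-envelope check of that figure:
B, T, U, V = 32, 1000, 100, 1000
floats = B * T * U * V               # 3.2e9 elements in the joint tensor
print(floats * 4 / 1024**3)          # ~11.9 GiB at 4 bytes per float32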
Architecture 3: Conformer (CNN + Transformer)
Whether you use LAS or RNN-T, you need a powerful Encoder. Google introduced the Conformer, combining the best of both worlds:
- Transformers: Good at capturing global context (Long-range dependencies).
- CNNs: Good at capturing local context (Edges, Formants).
The Conformer Block:
- Feed Forward Module.
- Multi-Head Self Attention.
- Convolution Module.
- Feed Forward Module.
- Layer Norm.
This “Macaron” style (FFN at start and end) proved superior to standard Transformers.
Implementation: A Conformer Block in PyTorch
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model, n_head, kernel_size, dropout=0.1):
        super().__init__()
        # 1. Feed Forward (Half Step)
        self.ff1 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * 4),
            nn.SiLU(),  # Swish
            nn.Dropout(dropout),
            nn.Linear(d_model * 4, d_model),
            nn.Dropout(dropout)
        )
        # 2. Self-Attention
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout)
        # 3. Convolution Module
        # LayerNorm is kept separate so it runs on [Time, Batch, Dim] before the
        # tensor is permuted to the [Batch, Dim, Time] layout that Conv1d expects.
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv_module = nn.Sequential(
            # Pointwise
            nn.Conv1d(d_model, d_model * 2, 1),
            nn.GLU(dim=1),
            # Depthwise
            nn.Conv1d(d_model, d_model, kernel_size, groups=d_model, padding=kernel_size // 2),
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            # Pointwise
            nn.Conv1d(d_model, d_model, 1),
            nn.Dropout(dropout)
        )
        # 4. Feed Forward (Half Step)
        self.ff2 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * 4),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(d_model * 4, d_model),
            nn.Dropout(dropout)
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: [Time, Batch, Dim]
        # Macaron Style: 0.5 * FF1
        x = x + 0.5 * self.ff1(x)
        # Attention
        residual = x
        x = self.attn_norm(x)
        x, _ = self.attn(x, x, x)
        x = residual + x
        # Convolution (Conv1d requires [Batch, Dim, Time])
        residual = x
        x = self.conv_norm(x)
        x = x.permute(1, 2, 0)  # T, B, D -> B, D, T
        x = self.conv_module(x)
        x = x.permute(2, 0, 1)  # B, D, T -> T, B, D
        x = residual + x
        # Macaron Style: 0.5 * FF2
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# Test
block = ConformerBlock(d_model=256, n_head=4, kernel_size=31)
x = torch.randn(100, 8, 256)  # Time=100, Batch=8, Dim=256
y = block(x)
print(y.shape)  # torch.Size([100, 8, 256])
Deep Dive: Streaming Constraints (The Lookahead)
In a bidirectional LSTM or Transformer, the model sees the future. In Streaming, we can’t see the future. Compromise: Limited Lookahead.
- Latency: If we look ahead 300ms, we add 300ms latency.
- Accuracy: If we look ahead 0ms, accuracy drops (we can’t distinguish “The” vs “A” without context).
- Sweet Spot: 100ms - 300ms lookahead.
Streaming Conformer: Uses Block Processing.
- It processes a “Central Block” (current audio).
- It attends to a “Left Context” (past, cached).
- It attends to a “Right Context” (future, lookahead).
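One common way to implement block processing is through the attention mask. A sketch that lets each frame attend to its own block, a limited left context, and a limited lookahead (parameter names are mine):
import torch

def block_streaming_mask(num_frames, block_size, left_blocks=2, right_blocks=1):
    """Boolean attention mask (True = may attend) for block-wise streaming attention."""
    block_id = torch.arange(num_frames) // block_size        # block index per frame
    diff = block_id.unsqueeze(1) - block_id.unsqueeze(0)     # query block - key block
    return (diff <= left_blocks) & (-diff <= right_blocks)

mask = block_streaming_mask(num_frames=12, block_size=4, left_blocks=1, right_blocks=1)
print(mask.int())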
Deep Dive: On-Device ASR (TinyML)
Running ASR on a Pixel phone (without cloud) requires extreme optimization.
1. Quantization
- Convert weights from float32 (4 bytes) to int8 (1 byte).
- Size: 4x smaller.
- Speed: 2-3x faster (using NEON/DSP instructions).
- Accuracy: < 1% WER degradation if done correctly (Quantization Aware Training).
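A minimal post-training sketch using PyTorch's dynamic quantization API (the module path is torch.ao.quantization in recent versions); production on-device ASR typically goes further with full Quantization-Aware Training.
import torch
import torch.nn as nn

# Linear weights become int8, activations stay float
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)  # Linear layers are replaced by DynamicQuantizedLinear modules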
2. SVD (Singular Value Decomposition)
- Factorize large weight matrices into two smaller matrices.
- (W (1024 \times 1024) \approx U (1024 \times 128) \times V (128 \times 1024)).
- Reduces parameters by 4x.
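A sketch of this factorization applied to a single nn.Linear layer; the rank and helper name are illustrative.
import torch
import torch.nn as nn

def svd_compress_linear(layer, rank):
    """Approximate a Linear layer with two low-rank Linears (W ≈ B @ A)."""
    W = layer.weight.data                        # [out_features, in_features]
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = Vh[:rank, :] * S[:rank].unsqueeze(1)     # [rank, in] (absorb singular values)
    B = U[:, :rank]                              # [out, rank]
    first = nn.Linear(W.shape[1], rank, bias=False)
    second = nn.Linear(rank, W.shape[0], bias=True)
    first.weight.data.copy_(A)
    second.weight.data.copy_(B)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

big = nn.Linear(1024, 1024)
small = svd_compress_linear(big, rank=128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(big), count(small))  # ~1.05M vs ~0.26M parameters (≈4x fewer)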
Deep Dive: Integrating Language Models
E2E models learn “Audio -> Text” directly. But text data is much more abundant than audio-text pairs. How do we use a text-only LM (like GPT) to improve ASR?
1. Shallow Fusion
- Inference Time Only.
- We linearly interpolate the scores during Beam Search. [ \text{Score}(y) = \log P_{ASR}(y|x) + \lambda \log P_{LM}(y) ]
- Pros: Simple. No retraining of ASR.
- Cons: The ASR model doesn’t know about the LM.
2. Deep Fusion
- Training Time Integration.
- We fuse the hidden states of the LM into the ASR decoder.
- Mechanism: Concatenate hidden_ASR and hidden_LM, then pass through a Gating mechanism.
- Pros: Better integration.
- Cons: Requires retraining.
3. Cold Fusion
- Idea: Train the ASR decoder conditional on the LM state.
- The ASR decoder learns to “correct” the LM or rely on it when the audio is noisy.
Deep Dive: Beam Search Decoding
How do we turn probabilities into text?
P(c | audio) gives us a matrix of probabilities.
1. Greedy Decoding
- Algorithm: At each step, pick the token with the highest probability.
- Problem: It makes local decisions. It can’t backtrack.
- Example: Audio sounds like “The red…”.
- Greedy: “The read” (because ‘read’ is more common).
- Next word is “apple”.
- Greedy is stuck with “The read apple”.
2. Beam Search
- Algorithm: Keep the top (K) (Beam Width) most likely hypotheses at each step.
- Example (K=2):
- Step 1: [“The”, “A”]
- Step 2: [“The red”, “The read”, “A red”, “A read”] -> Prune to top 2 -> [“The red”, “The read”]
- Step 3: [“The red apple”, …]
- Result: Finds the global optimum (mostly).
3. Prefix Beam Search (for CTC)
CTC is tricky because multiple paths map to the same string (aa -> a, a- -> a).
- We merge paths that result in the same prefix.
- We track two probabilities for each prefix:
  - P_b: Probability ending in Blank.
  - P_nb: Probability ending in Non-Blank.
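A minimal sketch of CTC prefix beam search in this style, following the commonly used formulation; variable names are mine and some edge cases of the full algorithm are omitted.
import numpy as np
from collections import defaultdict

NEG_INF = -float("inf")

def logsumexp(*xs):
    xs = [x for x in xs if x > NEG_INF]
    if not xs:
        return NEG_INF
    m = max(xs)
    return m + np.log(sum(np.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, beam_width=8, blank=0):
    """log_probs: [T, V] per-frame log probabilities (e.g., log-softmax output)."""
    T, V = log_probs.shape
    beams = {(): (0.0, NEG_INF)}  # prefix (tuple of ids) -> (P_b, P_nb) in log space
    for t in range(T):
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for v in range(V):
                p = log_probs[t, v]
                if v == blank:
                    # Blank keeps the prefix unchanged and contributes to P_b
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                    continue
                new_prefix = prefix + (v,)
                nb_b, nb_nb = next_beams[new_prefix]
                if prefix and v == prefix[-1]:
                    # Repeated token: extending the prefix requires an intervening blank...
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p))
                    # ...otherwise the emission merges into the existing (unextended) prefix
                    ob_b, ob_nb = next_beams[prefix]
                    next_beams[prefix] = (ob_b, logsumexp(ob_nb, p_nb + p))
                else:
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p, p_nb + p))
        # Prune to the top prefixes by total probability (P_b + P_nb)
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: logsumexp(*kv[1]), reverse=True)[:beam_width])
    best_prefix, _ = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return list(best_prefix)

# Toy run: 5 frames, vocab {blank, a, b}
np.random.seed(0)
logits = np.random.randn(5, 3)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(ctc_prefix_beam_search(log_probs, beam_width=4))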
Deep Dive: SpecAugment Details
SpecAugment is the “Dropout” of Speech.
1. Time Masking
- Operation: Select a time interval ([t, t+\tau)) and set all frequency channels to mean/zero.
- Effect: Forces the model to rely on context. If “banana” is masked, it infers it from “I ate a …”.
2. Frequency Masking
- Operation: Select a frequency band ([f, f+\nu)) and set all time steps to mean/zero.
- Effect: Makes the model robust to microphone variations (e.g., loss of high frequencies).
3. Time Warping
- Operation: Select a point in time and warp the spectrogram to the left or right.
- Effect: Makes the model robust to speaking rate variations (fast/slow speech).
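A minimal sketch using torchaudio's masking transforms; time warping is omitted for brevity, and the mask sizes below are illustrative.
import torch
import torchaudio.transforms as T

spec = torch.randn(1, 80, 400)               # [batch, mel_bins, time_frames]
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=27),  # mask up to 27 consecutive mel channels
    T.TimeMasking(time_mask_param=100),      # mask up to 100 consecutive frames
)
print(augment(spec).shape)                   # shape unchanged; masked regions are zeroed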
Training Considerations
1. The CTC Loss
Even in Encoder-Decoder models, we often add an auxiliary CTC loss to the Encoder. [ L = \lambda L_{att} + (1-\lambda) L_{ctc} ] Why?
- CTC helps convergence (monotonic alignment).
- CTC enforces left-to-right constraints.
2. SpecAugment
The most important data augmentation for E2E models. Instead of augmenting the waveform (speed, noise), we augment the Spectrogram.
- Time Masking: Mask out (t) consecutive time steps. (Simulates dropped packets).
- Frequency Masking: Mask out (f) consecutive frequency channels. (Simulates microphone EQ issues).
- Time Warping: Stretch/squeeze the image.
3. Curriculum Learning
Start by training on short utterances (2-3 seconds). Gradually increase to long utterances (15-20 seconds). This stabilizes the Attention mechanism.
4. Self-Supervised Pre-training (Wav2Vec 2.0)
Before training the E2E model on labeled text, we can pre-train the Encoder on unlabeled audio (which is cheap and abundant).
Wav2Vec 2.0 Mechanism:
- Masking: Mask parts of the latent speech representation.
- Contrastive Loss: The model tries to predict the true quantized representation of the masked segment among a set of distractors.
- Result: The Encoder learns a rich representation of phonemes and speech structure without ever seeing a transcript.
- Fine-tuning: Add a linear layer on top and train on labeled data with CTC loss. This achieves SOTA with 100x less labeled data.
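To show how accessible these pre-trained encoders are, here is a hedged sketch using torchaudio's bundled wav2vec 2.0 CTC pipeline; "utterance.wav" is a hypothetical file, and downloading the pretrained weights requires network access.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

waveform, sr = torchaudio.load("utterance.wav")                       # hypothetical input
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
with torch.inference_mode():
    emissions, _ = model(waveform)           # [batch, frames, num_labels]
tokens = emissions.argmax(dim=-1)[0]         # greedy CTC: collapse repeats, drop blanks next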
Common Failure Modes
- Attention Failure (Looping):
- Symptom: “The cat cat cat cat…”
- Cause: The attention mechanism gets stuck on a specific frame.
- Fix: Add “Location-Aware” attention (let the model know where it attended previously). Use Windowed Attention.
- The “Long-Tail” Problem:
- Symptom: Fails on proper nouns (“Arun”, “PyTorch”).
- Cause: E2E models rely on sub-word units (BPE). If a word is rare, its BPE sequence is rare.
- Fix: Contextual Biasing. Inject a list of expected phrases (Contact names) into the Beam Search decoding graph.
State-of-the-Art: Whisper (Weakly Supervised)
OpenAI’s Whisper (2022) is an E2E Transformer trained on 680,000 hours of weakly labeled web data.
- Architecture: Standard Encoder-Decoder Transformer.
- Innovation: It’s not the architecture; it’s the Data.
- Multitasking: It predicts special tokens: <|transcribe|>, <|translate|>, <|timestamps|>.
- Robustness: Because it saw noisy, messy web data, it is incredibly robust to accents and background noise compared to models trained on clean LibriSpeech.
Deep Dive: Training Loop Implementation
Training E2E models requires handling variable-length sequences: batches are padded (e.g., with pad_sequence), and the true input and transcript lengths are passed alongside them to the loss.
import torch
import torch.nn as nn

def train_ctc(model, train_loader, optimizer, epoch):
    model.train()
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
    for batch_idx, (waveform, valid_lengths, transcripts, transcript_lengths) in enumerate(train_loader):
        # waveform: [Batch, Time, Channels]
        # transcripts: [Batch, Max_Len]
        optimizer.zero_grad()
        # 1. Forward Pass
        # output: [Time, Batch, Vocab] (Required by PyTorch CTCLoss)
        output = model(waveform)
        output = output.log_softmax(2)
        # 2. Calculate Loss
        # input_lengths must be the length of the output after subsampling
        input_lengths = valid_lengths // 4  # Assuming 4x subsampling
        loss = ctc_loss(output, transcripts, input_lengths, transcript_lengths)
        # 3. Backward
        loss.backward()
        # 4. Gradient Clipping (Crucial for RNNs/Transformers)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
        if batch_idx % 100 == 0:
            print(f"Epoch {epoch} | Batch {batch_idx} | Loss {loss.item():.4f}")
Top Interview Questions
Q1: Explain the difference between CTC and RNN-T. Answer:
- CTC: Assumes conditional independence. Output length <= Input length. Non-autoregressive (fast). Weak at language modeling.
- RNN-T: Removes independence assumption. Output length can be > Input length (technically). Autoregressive (slower). Strong language modeling via Prediction Network.
Q2: Why do we need “Subsampling” in the Encoder? Answer: Audio has a high frame rate (100 Hz for 10ms shift). Speech is slow (~3-4 syllables/sec). Without subsampling (e.g., Stride 2 Conv layers), the sequence length (T) is too long for the Attention mechanism ((O(T^2))) or LSTM ((O(T))). Subsampling by 4x or 8x matches the acoustic rate to the linguistic rate.
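A common way to realize this is a small convolutional frontend with two stride-2 layers; a sketch (channel counts and the output projection are illustrative):
import torch
import torch.nn as nn

class Conv2dSubsampling(nn.Module):
    """Two stride-2 Conv2d layers: reduces the frame rate by 4x before the encoder."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(d_model * ((n_mels + 3) // 4), d_model)

    def forward(self, x):
        # x: [Batch, Time, Mels] -> [Batch, Time // 4, d_model]
        x = self.conv(x.unsqueeze(1))               # [B, C, T/4, Mels/4]
        b, c, t, f = x.shape
        return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))

sub = Conv2dSubsampling()
print(sub(torch.randn(8, 400, 80)).shape)   # torch.Size([8, 100, 256])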
Q3: How does Beam Search work for CTC?
Answer:
Standard Beam Search keeps the top K paths. In CTC, multiple paths map to the same string (aa -> a, a- -> a).
CTC Beam Search merges these paths. It maintains two probabilities for each prefix: (P_b) (ending in blank) and (P_{nb}) (ending in a non-blank symbol).
Q4: What is the “Exposure Bias” problem in Autoregressive models (LAS/RNN-T)? Answer: During training, we use Teacher Forcing (feed the ground truth previous token). During inference, we feed the predicted previous token. If the model makes a mistake during inference, it enters a state it never saw during training, leading to cascading errors. Fix: Scheduled Sampling (occasionally feed predicted tokens during training).
Q5: Why is Conformer better than Transformer for Speech? Answer: Speech has both local structure (formants, phoneme transitions) and global structure (sentence meaning).
- CNNs capture local structure efficiently.
- Transformers capture global structure. Conformer combines both. A pure Transformer needs many layers to learn local patterns that a single Conv layer can capture instantly.
Deep Dive: Whisper Architecture Details
OpenAI’s Whisper is a masterclass in Weak Supervision.
1. The Data
- 680,000 hours of audio from the internet.
- Includes non-English audio, background noise, and “hallucinations” (bad transcripts).
- Filtering: They used a heuristic to remove machine-generated transcripts (which are too clean).
2. The Model
- Standard Encoder-Decoder Transformer.
- Input: Log-Mel Spectrogram (80 channels).
- Positional Encoding: Sinusoidal.
3. The Multitask Format
The decoder is prompted with special tokens to control behavior:
- <|startoftranscript|>
- <|en|> (Language ID)
- <|transcribe|> (Task: ASR) or <|translate|> (Task: S2T Translation)
- <|timestamps|> (Predict start/end times)
This allows one model to replace a pipeline of (LID -> ASR -> Translation -> Alignment).
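With the openai-whisper package, these tokens are set for you via the task and language arguments; a hedged usage sketch ("meeting.wav" is a hypothetical file):
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
asr = model.transcribe("meeting.wav", task="transcribe", language="en")
translation = model.transcribe("meeting.wav", task="translate")  # any language -> English
print(asr["text"])
print(translation["text"])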
Deep Dive: Word Error Rate (WER)
WER is the standard metric for ASR. It is the Levenshtein Distance normalized by sequence length.
[ \text{WER} = \frac{S + D + I}{N} ]
- S (Substitutions): “cat” -> “bat”
- D (Deletions): “the cat” -> “cat”
- I (Insertions): “cat” -> “the cat”
- N: Total words in reference.
Python Implementation:
import numpy as np

def calculate_wer(reference, hypothesis):
    r = reference.split()
    h = hypothesis.split()
    # Levenshtein distance via dynamic programming
    d = np.zeros((len(r) + 1, len(h) + 1))
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i-1] == h[j-1]:
                d[i][j] = d[i-1][j-1]
            else:
                sub = d[i-1][j-1] + 1
                ins = d[i][j-1] + 1
                dele = d[i-1][j] + 1
                d[i][j] = min(sub, ins, dele)
    return d[len(r)][len(h)] / len(r)
Note: WER can be > 100% if the model inserts many hallucinations.
Case Study: The Evolution of NVIDIA’s ASR Models
NVIDIA has pushed the boundaries of CNN-based ASR (unlike Google’s Transformer push).
1. Jasper (Just Another Speech Recognizer)
- Architecture: Deep stack of 1D Convolutions + Residual connections.
- Key: Uses ReLU and Dropout heavily.
- Result: Matched state-of-the-art with simple Conv blocks.
2. QuartzNet
- Architecture: Like Jasper, but uses Time-Channel Separable Convolutions (Depthwise Separable).
- Result: 96% fewer parameters than Jasper for the same accuracy. Runs on edge devices.
3. Citrinet
- Architecture: QuartzNet + Squeeze-and-Excitation (SE) blocks.
- Result: Even better accuracy/parameter ratio.
This shows that Efficiency (Separable Convs) and Attention (SE Blocks) are universal principles, applicable to both Vision and Speech.
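A minimal sketch of the time-channel separable (depthwise + pointwise) 1D convolution idea, with a parameter-count comparison against a standard Conv1d; this shows the general pattern, not NVIDIA's exact block definition.
import torch.nn as nn

def separable_conv1d(channels, kernel_size):
    return nn.Sequential(
        # Depthwise: one filter per channel, mixes information across time only
        nn.Conv1d(channels, channels, kernel_size, groups=channels, padding=kernel_size // 2),
        # Pointwise: 1x1 conv mixes information across channels
        nn.Conv1d(channels, channels, 1),
    )

std = nn.Conv1d(256, 256, 33, padding=16)
sep = separable_conv1d(256, 33)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), count(sep))  # far fewer parameters (≈29x here)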
Deep Dive: Hardware Acceleration (TPU vs GPU)
Speech models are often trained on TPUs (Tensor Processing Units).
1. TPUs (Google)
- Architecture: Systolic Array. Optimized for massive Matrix Multiplications (MXU).
- Pros: Extremely fast for large batch sizes. High bandwidth interconnect (ICI).
- Cons: Hard to debug (XLA compilation).
2. GPUs (NVIDIA)
- Architecture: SIMT (Single Instruction Multiple Threads).
- Pros: Flexible. Great ecosystem (PyTorch/CUDA).
- Cons: Memory bandwidth can be a bottleneck for RNNs.
Warp-RNNT: A CUDA kernel optimization that maps the RNN-T loss calculation to GPU warps, achieving 30x speedup over naive PyTorch implementation.
Further Reading
- CTC: Connectionist Temporal Classification (Graves et al., 2006)
- LAS: Listen, Attend and Spell (Chan et al., 2015)
- RNN-T: Sequence Transduction with Recurrent Neural Networks (Graves, 2012)
- SpecAugment: A Simple Data Augmentation Method for ASR (Park et al., 2019)
- Conformer: Convolution-augmented Transformer for Speech Recognition (Gulati et al., 2020)
Key Takeaways
- E2E simplifies the stack: One model, one loss, direct optimization of WER.
- RNN-T for Streaming: If you need low latency, use Transducers.
- Conformer for Encoding: The combination of CNN (local) and Transformer (global) is the current gold standard for acoustic encoding.
- SpecAugment is mandatory: It prevents overfitting and forces the model to learn robust features.
- Hybrid isn’t dead: For domains with very little data or massive vocabulary constraints (e.g., Medical Dictation), Hybrid systems with explicit Lexicons can still outperform E2E.
Originally published at: arunbaby.com/speech-tech/0027-end-to-end-speech-model-design