Hierarchical Classification Systems
“Organizing the world’s information into a structured hierarchy.”
TL;DR
Hierarchical classification assigns items to taxonomy tree nodes using global classifiers, local per-node classifiers, or end-to-end hierarchical models with shared encoders. Hierarchy-aware losses like hierarchical softmax reduce computation from O(K) to O(log K), while consistency losses ensure parent probabilities exceed child probabilities. For extreme multi-label scenarios with millions of categories, embedding-based retrieval via FAISS decouples label count from model size. See ranking systems for how classification feeds into multi-stage ranking, and semantic search for the vector retrieval infrastructure underlying label embedding approaches.

1. What is Hierarchical Classification?
Hierarchical classification is the task of assigning an item to one or more nodes in a taxonomy tree.
Example: Product Categorization
```
Electronics
├── Computers
│   ├── Laptops
│   │   ├── Gaming Laptops
│   │   └── Business Laptops
│   └── Desktops
└── Mobile Devices
    ├── Smartphones
    └── Tablets
```
Problem: Given a product description “Dell XPS 15 with RTX 3050”, classify it into:
- Electronics > Computers > Laptops > Business Laptops (or Gaming Laptops?)
Challenges:
- Large Taxonomies: Amazon has 30,000+ categories.
- Multi-path: An item can belong to multiple leaf nodes (e.g., “Wireless Gaming Mouse” → Computers/Accessories AND Gaming/Peripherals).
- Imbalanced Data: “Electronics” has millions of items, “Vintage Typewriters” has 100.
- Hierarchy Violations: A model might predict “Gaming Laptop” without predicting “Laptop” (parent).
2. Flat vs. Hierarchical Classification
| Aspect | Flat Classification | Hierarchical Classification |
|---|---|---|
| Model | Single multi-class classifier | Tree-structured classifiers |
| Predictions | One label | Path in tree |
| Training | One model | Multiple models (global or local) |
| Hierarchy | Ignored | Exploited |
| Example | CIFAR-10 (10 classes) | ImageNet (1000 classes in hierarchy) |
Why Hierarchical?
- Scalability: A single flat classifier over 30,000+ classes is expensive to train and serve, and is dominated by head categories.
- Interpretability: Users navigate taxonomies (“Show me all Electronics > Computers”).
- Zero-shot: New subcategories can be added without retraining the entire model.
3. Hierarchical Classification Approaches
Approach 1: Global Classifier (Flat with Post-Processing)
Train a single multi-class classifier predicting all leaf nodes, then use the hierarchy to ensure consistency.
Model:
```python
# Predict all 30,000 leaf categories
logits = model(input)    # shape: [batch, 30000]
probs = softmax(logits)

# Post-process: ensure parent probabilities >= child probabilities
for node in taxonomy.postorder():
    if node.children:
        node.prob = max(node.prob, max(child.prob for child in node.children))
```
Pros:
- Simple: One model.
- High accuracy if data is sufficient.
Cons:
- Class Imbalance: Popular categories dominate.
- No hierarchy exploitation during training.
Approach 2: Local Classifiers Per Node (LCN)
Train a separate classifier at each internal node to choose among its children.
Example:
```
Root Classifier: Electronics vs. Clothing vs. Books
├─ Electronics Classifier: Computers vs. Mobile Devices
│  ├─ Computers Classifier: Laptops vs. Desktops
```
Inference:
```python
def predict(x, root):
    node = root
    while not node.is_leaf():
        probs = node.classifier(x)
        node = node.children[argmax(probs)]
    return node
```
Pros:
- Balanced: Each classifier handles a small, focused problem.
- Modular: Can update one classifier without retraining others.
Cons:
- Error Propagation: If the root classifier is wrong, the entire path is wrong.
- Many Models: Need to train and deploy one classifier per internal node; a large taxonomy can mean thousands of models.
Approach 3: End-to-End Hierarchical Model
Use a shared encoder with hierarchical output heads.
Architecture:
```python
class HierarchicalClassifier(nn.Module):
    def __init__(self, taxonomy):
        super().__init__()
        self.encoder = ResNet()  # shared backbone
        self.heads = nn.ModuleDict({
            str(node.id): nn.Linear(2048, len(node.children))  # ModuleDict keys must be strings
            for node in taxonomy.internal_nodes()
        })

    def forward(self, x):
        features = self.encoder(x)
        outputs = {}
        for node_id, head in self.heads.items():
            outputs[node_id] = head(features)
        return outputs
```
Loss:
```python
total_loss = 0
for node in taxonomy.internal_nodes():
    if node in ground_truth_path:
        target = ground_truth_path[node].child_index
        total_loss += cross_entropy(outputs[str(node.id)], target)
```
Pros:
- Shared Representations: Lower layers learn general features.
- End-to-End Training: Optimizes the entire path jointly.
Cons:
- Complex: Harder to debug and tune.
- Memory: All heads must fit in GPU memory.
4. Handling Multi-Label Hierarchy
Some items belong to multiple paths in the tree.
Example: “Logitech Wireless Gaming Mouse”
- Path 1: Electronics > Computers > Accessories > Mouse
- Path 2: Electronics > Gaming > Peripherals > Mouse
Approach: Multi-Task Learning
- Treat each path as a separate task.
- Loss = Sum of losses for all valid paths.
```python
loss = sum(path_loss(model(x), path) for path in ground_truth_paths)
```
5. Hierarchy-Aware Loss Functions
Loss 1: Hierarchical Softmax
Instead of standard softmax over 30,000 classes, factorize: \[ P(\text{Leaf} | x) = P(\text{Root} \to \text{Child}_1 | x) \times P(\text{Child}_1 \to \text{Child}_2 | x) \times \ldots \]
Benefit: Reduces computation from \(O(K)\) to \(O(\log K)\) where K is number of classes.
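A minimal sketch of this factorization, assuming each internal node exposes its `children` and an id keying into a dict of per-node logits (the node structure is illustrative, not a specific library's API):
```python
import torch.nn.functional as F

def leaf_log_prob(outputs, path):
    """Log-probability of a leaf: sum of log-softmax routing decisions
    along its root-to-leaf path (only O(depth) softmaxes, not one over all K).

    outputs: dict mapping internal-node id -> logits over that node's children
    path:    list of nodes from root to leaf
    """
    log_prob = 0.0
    for parent, child in zip(path[:-1], path[1:]):
        child_idx = parent.children.index(child)
        log_prob = log_prob + F.log_softmax(outputs[parent.id], dim=-1)[child_idx]
    return log_prob
```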
Loss 2: Hierarchical Cross-Entropy (H-Loss)
Penalize mistakes based on the distance in the tree. \[ \mathcal{L} = \sum_{i=1}^{D} \alpha_i \cdot \text{CE}(\text{pred}_i, \text{true}_i) \] where \(D\) is the depth and \(\alpha_i\) increases with depth (leaf nodes weighted more).
Intuition: Mistaking “Gaming Laptop” for “Business Laptop” (siblings) is less bad than mistaking it for “Tablet” (cousin).
Loss 3: Parent-Child Consistency Loss
Encourage the model to keep each ancestor's probability at or above its descendants': \[ \mathcal{L}_{\text{consistency}} = \sum_{(\text{parent}, \text{child})} \max(0, P(\text{child}) - P(\text{parent})) \]
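A short PyTorch sketch of this penalty, assuming `probs` maps node ids to predicted marginal probabilities (the data layout is an assumption for illustration):
```python
import torch

def consistency_loss(probs, edges):
    """Hinge penalty for any child whose probability exceeds its parent's.

    probs: dict node_id -> scalar tensor P(node | x)
    edges: iterable of (parent_id, child_id) pairs in the taxonomy
    """
    zero = torch.tensor(0.0)
    return sum(torch.maximum(zero, probs[child] - probs[parent])
               for parent, child in edges)
```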
Deep Dive: Amazon’s Product Taxonomy
Amazon has a forest of taxonomies (one per marketplace). Challenge: Items listed in multiple marketplaces need consistent categorization.
Solution: Transfer Learning
- Train a base model on US taxonomy (most data).
- Fine-tune on UK, Japan, India taxonomies.
- Use adapter layers to handle marketplace-specific categories.
Scale:
- Items: 350M+
- Categories: 30,000+
- Languages: 15+
Deep Dive: Extreme Multi-Label Classification (XML)
When the number of labels is in the millions (e.g., Wikipedia categories), standard approaches fail.
Approaches:
- Embedding-Based (AnnexML, Bonsai, Parabel):
  - Embed labels and inputs into the same space.
  - Use ANN (approximate nearest neighbor) search to retrieve the top-K labels.
- Attention Mechanisms:
  - Label attention: attend to label descriptions during encoding.
- Tree Pruning:
  - Prune unlikely branches early using a lightweight model.
Parabel Architecture:
1. Build a label tree (clustering similar labels).
2. Train classifiers at each node (like LCN).
3. Beam search during inference to explore top-K branches.
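A simplified sketch of step 3; the `score` and `children` interfaces are illustrative stand-ins, not Parabel's actual implementation:
```python
import heapq

def beam_search(x, root, beam_width=10):
    """Keep only the top-scoring partial paths while descending the label tree.

    Each node is assumed to expose `is_leaf()`, `children`, and `score(x)`
    (a log-probability for routing x through that node).
    """
    beam = [(0.0, root)]  # (cumulative log-prob, node)
    while any(not node.is_leaf() for _, node in beam):
        candidates = []
        for log_p, node in beam:
            if node.is_leaf():
                candidates.append((log_p, node))  # finished paths stay in the beam
            else:
                for child in node.children:
                    candidates.append((log_p + child.score(x), child))
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam  # top-K leaf labels with scores
```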
Deep Dive: Google’s Knowledge Graph Categories
Google Search uses a hierarchical taxonomy for entities. Example:
```
Thing
├── Creative Work
│   └── Movie
└── Person
    └── Actor
```
Challenge: New entities appear daily (new movies, new people). Solution:
- Zero-Shot Classification: Use a text encoder (BERT) to embed the entity description.
- Nearest Ancestor: Find the closest category in the embedding space.
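A minimal sketch of this matching step; `encode` stands in for any sentence encoder that returns unit-normalized vectors (an assumption for illustration, not Google's actual system):
```python
import numpy as np

def classify_zero_shot(entity_description, category_names, encode):
    """Assign a new entity to the closest category in embedding space.

    encode: callable mapping a string to a unit-normalized vector
            (hypothetical; plug in any BERT-style sentence encoder).
    """
    entity_vec = encode(entity_description)                        # [d]
    category_vecs = np.stack([encode(c) for c in category_names])  # [C, d]
    sims = category_vecs @ entity_vec  # cosine similarity (vectors normalized)
    return category_names[int(np.argmax(sims))]
```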
Deep Dive: Hierarchical Multi-Task Learning (HMTL)
In HMTL, tasks are organized in a hierarchy where:
- Lower tasks are easier (e.g., “Is this electronics?”).
- Higher tasks are harder (e.g., “Is this a gaming laptop?”).
Architecture:
```
Input → Shared Encoder ─┬─→ Task 1 (Electronics?) ─┐
                        ├─→ Task 2 (Computers?)  ──┼─→ Final Prediction
                        └─→ Task 3 (Laptops?)  ────┘
```
Loss: \[ \mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2 + \lambda_3 \mathcal{L}_3 \] where \(\lambda_i\) are learned or hand-tuned.
Deep Dive: Taxonomy Expansion (Adding New Categories)
Problem: A new category “Foldable Smartphones” needs to be added.
Approach 1: Retrain from Scratch
- Expensive and slow.
Approach 2: Continual Learning
- Fine-tune the model on new data while preserving old knowledge.
- Challenge: Catastrophic forgetting.
- Solution: Elastic Weight Consolidation (EWC) or rehearsal buffers.
Approach 3: Few-Shot Learning
- Train a meta-learner that can adapt to new categories with < 100 examples.
- Use Prototypical Networks or MAML.
Deep Dive: Handling Imbalanced Hierarchies
Problem: “Laptops” has 1M examples, “Typewriters” has 50.
Solutions:
- Class Weighting: \[ w_i = \frac{\text{total samples}}{\text{samples in class } i} \]
- Focal Loss (sketched after this list): \[ \mathcal{L} = -\alpha (1 - p_t)^\gamma \log(p_t) \] Focuses on hard-to-classify examples.
- Oversampling: augment rare classes.
- Hierarchical Sampling: sample uniformly at each level of the tree, not uniformly across all leaves.
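A minimal PyTorch sketch of the focal loss above (the default `alpha` and `gamma` values are illustrative):
```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=1.0, gamma=2.0):
    """Focal loss: down-weights easy examples by (1 - p_t)^gamma.

    logits:  [batch, num_classes]
    targets: [batch] integer class indices
    """
    log_pt = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                       # p_t: probability of the true class
    return (-alpha * (1 - pt) ** gamma * log_pt).mean()
```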
Deep Dive: Evaluation Metrics
Metric 1: Hierarchical Precision and Recall
Standard precision/recall don’t account for hierarchy. Hierarchical precision and recall compare the sets of nodes on the predicted and true root-to-node paths: \[ hP = \frac{|\text{predicted path} \cap \text{true path}|}{|\text{predicted path}|}, \qquad hR = \frac{|\text{predicted path} \cap \text{true path}|}{|\text{true path}|} \]
Example:
- True: Electronics > Computers > Laptops
- Predicted: Electronics > Computers > Desktops
- \( hP = \frac{2}{3} \) (got Electronics and Computers right, but wrong at Laptops level).
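Both metrics reduce to set operations on the two paths; a small sketch:
```python
def hierarchical_pr(predicted_path, true_path):
    """Hierarchical precision and recall over root-to-node paths,
    given as sequences of category names."""
    overlap = len(set(predicted_path) & set(true_path))
    return overlap / len(predicted_path), overlap / len(true_path)

# The example above: hP = hR = 2/3
print(hierarchical_pr(("Electronics", "Computers", "Desktops"),
                      ("Electronics", "Computers", "Laptops")))
```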
Metric 2: Tree-Induced Distance
Measure the number of edges on the shortest path between the predicted and true nodes in the tree: \[ d(\text{pred}, \text{true}) = \text{depth}(\text{pred}) + \text{depth}(\text{true}) - 2 \cdot \text{depth}(\text{LCA}(\text{pred}, \text{true})) \]
Example:
- pred = Gaming Laptop, true = Business Laptop
- LCA = Laptops (depth 2, with both leaves at depth 3)
- Distance = 3 + 3 − 2·2 = 2 (smaller is better)
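A sketch of this distance, assuming each node stores its `parent` and `depth` (an illustrative node structure):
```python
def tree_distance(pred, true):
    """Edges on the shortest path between two nodes:
    depth(pred) + depth(true) - 2 * depth(LCA)."""
    a, b = pred, true
    while a.depth > b.depth:   # lift the deeper node first
        a = a.parent
    while b.depth > a.depth:
        b = b.parent
    while a is not b:          # walk up together until the paths meet
        a, b = a.parent, b.parent
    return pred.depth + true.depth - 2 * a.depth  # a is now the LCA
```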
Metric 3: F1 at Different Levels
Compute F1 separately at each level of the hierarchy.
- Level 0 (Root): F1 = 100% (trivial).
- Level 1: F1 on {Electronics, Clothing, Books}.
- Level 2: F1 on {Computers, Mobile, etc.}.
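A sketch of level-wise F1, truncating each path at level \( l \) and scoring with scikit-learn (paths as tuples of category names):
```python
from sklearn.metrics import f1_score

def f1_at_level(pred_paths, true_paths, level):
    """Macro-F1 over the category each path passes through at `level`;
    paths that end above that level are mapped to a sentinel label."""
    pick = lambda p: p[level] if len(p) > level else "<none>"
    return f1_score([pick(p) for p in true_paths],
                    [pick(p) for p in pred_paths],
                    average="macro")
```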
Deep Dive: Active Learning for Taxonomy Labeling
Problem: Labeling 10M products manually is expensive.
Solution: Active Learning
- Train an initial model on a small labeled set.
- Query: find the samples the model is most uncertain about.
  - Entropy: \( H = -\sum_i p_i \log p_i \)
  - Least Confident: \( 1 - \max_i p_i \)
- A human labels the queried samples.
- Retrain and repeat.
Hierarchical Active Learning:
- Prioritize samples where the model is uncertain at multiple levels of the tree.
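A minimal sketch of one query round using the entropy criterion above (the model outputs and unlabeled pool are placeholders):
```python
import numpy as np

def query_most_uncertain(probs, k=100):
    """Pick the k unlabeled samples with the highest predictive entropy.

    probs: [num_unlabeled, num_classes] softmax outputs of the current model
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:k]  # indices to send to human annotators
```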
Deep Dive: Hierarchical Attention Networks (HAN)
Idea: Use attention at each level of the hierarchy to focus on relevant features.
Architecture:
```python
class HierarchicalAttention(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_attention = Attention()      # pools word vectors into a sentence vector
        self.sentence_attention = Attention()  # pools sentence vectors into a document vector
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, document):
        # Level 1: words → sentence representations
        sentence_reps = []
        for sentence in document:
            word_reps = [self.embed(word) for word in sentence]
            sentence_rep = self.word_attention(word_reps)
            sentence_reps.append(sentence_rep)
        # Level 2: sentences → document representation
        doc_rep = self.sentence_attention(sentence_reps)
        # Level 3: document representation → category logits
        return self.classifier(doc_rep)
```
Deep Dive: Label Embedding and Matching
Idea: Embed both inputs and labels into the same space.
Training:
```python
input_emb = encoder(product_description)  # [batch, 512]
label_embs = label_encoder(all_labels)    # [30000, 512]

# Dot-product similarity against every label embedding
scores = input_emb @ label_embs.T         # [batch, 30000]
loss = cross_entropy(scores, target)
```
Inference:
```python
# Use FAISS for fast top-K retrieval over the precomputed label embeddings
scores, top_k_labels = faiss_index.search(input_emb, 10)  # returns (scores, label ids)
```
Benefit: Decouples the number of labels from model size. Can add new labels without retraining.
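Building the index is a few lines; a sketch using FAISS's exact inner-product index, which matches the dot-product scoring used in training (the random embeddings are placeholders):
```python
import faiss
import numpy as np

d = 512
label_embs = np.random.rand(30000, d).astype("float32")  # placeholder label embeddings

index = faiss.IndexFlatIP(d)   # exact inner-product search
index.add(label_embs)          # one row per label

query = np.random.rand(1, d).astype("float32")
scores, label_ids = index.search(query, 10)  # top-10 label ids for the query
```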
Deep Dive: Hierarchical Reinforcement Learning (HRL) for Sequential Classification
Problem: Some taxonomies require a sequence of decisions.
Example: “Classify this support ticket”
- Department? (Sales, Support, Engineering)
- If Support, Priority? (Low, Medium, High)
- If High, Assign to? (Agent 1, Agent 2, Agent 3)
Approach: Hierarchical RL
- High-Level Policy: Chooses department.
- Low-Level Policies: Choose priority, assign agent.
- Reward: +1 if ticket is resolved quickly.
Implementation Example: PyTorch Hierarchical Classifier
```python
import torch
import torch.nn as nn

class HierarchicalModel(nn.Module):
    def __init__(self, taxonomy, embedding_dim=768):
        super().__init__()
        self.taxonomy = taxonomy
        self.encoder = nn.Sequential(
            nn.Linear(embedding_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512)
        )
        # One classifier per internal node
        self.classifiers = nn.ModuleDict()
        for node in taxonomy.internal_nodes():
            self.classifiers[str(node.id)] = nn.Linear(512, len(node.children))

    def forward(self, x, return_all=False):
        features = self.encoder(x)
        if return_all:
            # Return logits for all nodes (for training)
            outputs = {}
            for node_id, classifier in self.classifiers.items():
                outputs[node_id] = classifier(features)
            return outputs
        else:
            # Greedy path prediction (for inference, batch size 1)
            node = self.taxonomy.root
            path = [node]
            while not node.is_leaf():
                logits = self.classifiers[str(node.id)](features)
                child_idx = torch.argmax(logits, dim=1)
                node = node.children[child_idx.item()]
                path.append(node)
            return path

def hierarchical_loss(outputs, true_path):
    loss = 0
    # Walk parent -> child pairs along the ground-truth path
    for parent, child in zip(true_path[:-1], true_path[1:]):
        logits = outputs[str(parent.id)]
        target = parent.children.index(child)  # child's index among the parent's children
        loss += nn.CrossEntropyLoss()(logits, torch.tensor([target]))
    return loss
```
Top Interview Questions
Q1: How do you handle products that fit multiple categories? Answer: Use multi-label classification. Train the model to predict multiple paths. At inference, use a threshold (e.g., predict all paths with confidence > 0.3).
Q2: What if the taxonomy is updated (categories added/removed)? Answer: Use modular design (local classifiers) so you can retrain only affected nodes. Or use label embeddings which allow adding new categories without retraining.
Q3: How do you ensure hierarchy consistency? Answer: Post-process predictions to ensure \(P(\text{child}) \leq P(\text{parent})\). During training, add a consistency loss term.
Q4: How do you deal with extreme imbalance (popular vs. rare categories)? Answer:
- Focal loss to focus on hard examples.
- Hierarchical sampling (sample uniformly at each level).
- Data augmentation for rare categories.
Key Takeaways
- Hierarchy Exploitation: Use the tree structure during both training and inference.
- Local vs. Global: Trade-off between modular (easy to update) and end-to-end (higher accuracy).
- Multi-Label: Real-world taxonomies often have overlapping categories.
- Scalability: For millions of classes, use embedding-based retrieval (ANN).
- Evaluation: Standard metrics don’t account for hierarchy; use hierarchical precision/recall.
Summary
| Aspect | Insight |
|---|---|
| Approaches | Global, Local (LCN), End-to-End Hierarchical |
| Loss Functions | Hierarchical Softmax, H-Loss, Consistency Loss |
| Scalability | Label embeddings + FAISS for extreme multi-label |
| Evaluation | Hierarchical precision, tree-induced distance |
FAQ
What are the three main approaches to hierarchical classification?
The three approaches are: global classifier (single model predicts all leaf categories with hierarchy-based post-processing), local classifiers per node (separate classifier at each internal node choosing among children), and end-to-end hierarchical model (shared encoder with hierarchical output heads trained jointly). Global is simplest, local is most modular, and end-to-end achieves the highest accuracy but is most complex.
How does hierarchical softmax reduce computation for large taxonomies?
Instead of computing softmax over all 30K+ classes, hierarchical softmax factorizes the probability as a product of decisions along the tree path: P(Leaf) = P(Root to Child1) times P(Child1 to Child2) and so on. This reduces computation from O(K) to O(log K) where K is the number of classes, making it feasible to train on taxonomies with millions of categories.
How do you handle products that belong to multiple categories?
Use multi-label classification where the model predicts multiple paths through the taxonomy. Train using multi-task learning with the loss being the sum of losses for all valid paths. At inference, predict all paths with confidence above a threshold (e.g., 0.3). A wireless gaming mouse might correctly be classified into both Computers/Accessories/Mouse and Gaming/Peripherals/Mouse.
How do you add new categories to a hierarchical taxonomy without retraining?
Use label embeddings where both inputs and labels are embedded in the same vector space. New labels can be added by computing their embedding from the label description, without retraining the model. At inference, use FAISS for fast top-K label retrieval. Alternatively, use few-shot learning with Prototypical Networks or MAML to adapt to new categories with fewer than 100 examples.
Originally published at: arunbaby.com/ml-system-design/0029-hierarchical-classification