Object Detection & Segmentation (YOLO, SAM)
“Knowing WHAT is where, and precisely WHERE it acts.”
1. Introduction: Bounding Boxes vs. Pixels
In Day 21 (Vision Fundamentals), we examined ViT and CLIP. These models give us Semantic Understanding—they tell us “There is a cat in this image”. However, for an agent to Act—to pick up the cat, to avoid the obstacle, or to click the button—semantics are not enough. We need Geometry. “Where exactly is the cat?”
There are two primary levels of geometric understanding that every agent engineer must distinguish:
- Object Detection: “The cat is located inside the rectangle defined by `(x1, y1)` and `(x2, y2)`.”
- Output: A Bounding Box.
- Use Case: Clicking a UI button, counting people, finding a face.
- Segmentation: “These precise 45,201 pixels belong to the cat, and the adjacent 10,000 pixels are the rug.”
- Output: A Mask (Binary matrix).
- Use Case: A Robot Arm grasping a mug handle, removing a background in Photoshop, Medical analysis of a tumor shape.
In this deep dive, we will explore the two dominant architectures handling these tasks for agents: The speed-optimized YOLO for detection, and the prompt-optimized SAM for segmentation.
2. Real-Time Detection: The YOLO Family
YOLO (You Only Look Once) is the industry standard for fast object detection. Unlike Transformers (like ViT), whose self-attention over image patches is computationally heavy ($O(N^2)$ in the number of patches), YOLO is a customized CNN designed primarily for Inference Speed.
2.1 How it works (The Single Shot)
Traditional detectors (R-CNN) were slow because they ran in two stages:
- Guess 2,000 places where an object might be (Region Proposals).
- Run a classifier on each of those 2,000 boxes. This was slow (0.5 FPS).
YOLO reframed detection as a Single Regression Problem:
- Grid: It divides the input image into an $S \times S$ grid (e.g., 13x13).
- Responsibility: If the center of an object falls into a grid cell, that cell is “responsible” for detecting it.
- Prediction: Each cell predicts $B$ bounding boxes and $C$ class probabilities simultaneously in one forward pass (see the decoding sketch after this list).
- Speed: Modern variants (YOLOv8, YOLOv10) can run at 100+ FPS on a GPU, and the nano-sized variants reach usable frame rates even on edge devices like a Raspberry Pi.
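To make this concrete, here is a minimal NumPy sketch of decoding one cell of a YOLOv1-style output tensor. The shapes ($S=13$, $B=2$, $C=80$) and the random tensor are illustrative stand-ins for a real forward pass; production variants add anchors, strides, and Non-Max Suppression on top.

```python
import numpy as np

S, B, C = 13, 2, 80                      # grid size, boxes per cell, classes
preds = np.random.rand(S, S, B * 5 + C)  # stand-in for one network forward pass

def decode_cell(preds, row, col, conf_thresh=0.5):
    """Decode the B candidate boxes predicted by a single grid cell."""
    cell = preds[row, col]
    class_probs = cell[B * 5:]           # class distribution shared by the cell
    detections = []
    for b in range(B):
        x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
        if conf < conf_thresh:
            continue
        # (x, y) are offsets inside the cell; convert to image-relative coords
        cx, cy = (col + x) / S, (row + y) / S
        detections.append((cx, cy, w, h, conf, int(np.argmax(class_probs))))
    return detections
```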
2.2 Agent Use-Case: The “Wake Word” for Vision
Why do we care about speed if GPT-4V exists? Because GPT-4V is expensive (on the order of $0.01 per frame) and slow (roughly 2 seconds per call). If you are building a Security Camera Agent, you cannot send 24/7 video to GPT-4V. You will go bankrupt in an hour.
The Cascade Architecture: We use YOLO as a “Visual Wake Word”, sketched in code after this list.
- Layer 1 (Edge): Run YOLO-Nano (Cheap, Fast) on every frame locally.
- Task: Check for “Person” or “Car”.
- Cost: Zero (Device compute).
- Layer 2 (Cloud): If (and only if) YOLO detects a Person in a restricted zone:
- Crop the image.
- Send it to GPT-4V.
- Prompt: “Is this person a delivery driver or a burglar?”
- Cost: High, but only paid when necessary.
This hierarchy is essential for economic viability in real-world agent systems.
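Here is a minimal sketch of the cascade, assuming the `ultralytics` package for Layer 1. The helpers `in_restricted_zone` and `ask_vlm` are hypothetical placeholders for your own zone geometry and your cloud VLM call:

```python
from ultralytics import YOLO  # pip install ultralytics

detector = YOLO("yolov8n.pt")  # Layer 1: tiny "wake word" model, runs locally

def in_restricted_zone(box, frame_width=1280):
    # Hypothetical zone check: flag anything in the left half of the frame.
    x1, _, x2, _ = box
    return (x1 + x2) / 2 < frame_width / 2

def ask_vlm(crop, prompt):
    # Hypothetical Layer 2 call to a cloud VLM (e.g., GPT-4V); stubbed here.
    raise NotImplementedError("wire up your VLM API of choice")

def process_frame(frame):
    results = detector(frame, classes=[0], verbose=False)  # COCO class 0 = person
    for box in results[0].boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        if in_restricted_zone((x1, y1, x2, y2)):
            crop = frame[y1:y2, x1:x2]  # send only the crop, not the full frame
            return ask_vlm(crop, "Is this person a delivery driver or a burglar?")
    return None  # no escalation -> zero cloud cost for this frame
```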
3. The Segmentation Revolution: SAM
While YOLO helps us find boxes, SAM (Segment Anything Model) by Meta revolutionized how we find precise shapes. Before 2023, Segmentation was a “Supervised Learning” problem. If you wanted to segment “Screws” on a factory line, you had to hire humans to hand-paint pixels on 5,000 images of screws to train a custom U-Net. This took months.
SAM changed Segmentation into a “Prompting” problem.
3.1 Promptable Segmentation
SAM is a foundation model trained on over 1 billion masks (the SA-1B dataset). It is “Zero-Shot”. It doesn’t ask “Train me”. It asks “Point to it”.
- Point Prompt: You simulate a click at pixel `(500, 200)`. SAM instantly understands the texture/color boundaries expanding from that point and returns the mask of the entire object (e.g., the shirt someone is wearing).
- Box Prompt: You feed SAM a rough Bounding Box (perhaps from YOLO). SAM fills the box with the exact pixel-perfect shape.
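A minimal point-prompt sketch, assuming Meta’s `segment_anything` package and its public ViT-B checkpoint downloaded locally (adjust the paths for your setup):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM once; the checkpoint file name is Meta's public ViT-B release.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the heavy step: encodes the image once

# Point prompt: one foreground click at pixel (500, 200)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 200]]),
    point_labels=np.array([1]),   # 1 = foreground, 0 = background
    multimask_output=True,        # SAM proposes masks at 3 scales
)
best_mask = masks[np.argmax(scores)]  # boolean H x W array
```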
3.2 The Usage for Agents
- Robotic Manipulation:
- An agent sees a messy table with a red cup.
- Goal: Pick up the cup.
- Problem: If the agent treats the cup as its bounding box, it might try to grab the empty air in a corner of the box.
- Solution: The agent uses YOLO to find the rough box, then uses SAM to get the Mask. The “Grasp Planner” algorithm then finds a friction point strictly inside the mask’s foreground pixels.
- Creative Agents (Photo Editing):
- User: “Make the sky blue.”
- Agent (Text-to-Box): Finds the box for “Sky”.
- Agent (SAM): Gets the mask of the sky (handling complex tree branches cutting into it).
- Agent (Filter): Applies the blue filter only to the masked pixels.
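Once the mask exists, the final filter step is a few lines of NumPy. This sketch assumes `image` is an RGB `uint8` array and `sky_mask` is the boolean array returned by SAM:

```python
import numpy as np

def tint_sky(image: np.ndarray, sky_mask: np.ndarray) -> np.ndarray:
    """Blend a sky-blue tint into the masked pixels only."""
    out = image.astype(np.float32)
    blue = np.array([135, 206, 235], dtype=np.float32)  # RGB sky blue
    out[sky_mask] = 0.5 * out[sky_mask] + 0.5 * blue    # 50/50 blend
    return out.astype(np.uint8)
```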
4. Open Vocabulary Detection (Grounding DINO)
Standard YOLO is trained on the COCO dataset, which has only 80 classes (Person, Car, Dog, Apple…). It does not know what a “WiFi Router” or “Stapler” is. If you need an agent to find arbitrary objects, you cannot use standard YOLO.
Grounding DINO is the solution. It fuses a BERT text encoder (Language) with a DETR-style Transformer detector (Vision).
- Input: Image + Arbitrary Text Prompt: “Find me the green lego block.”
- Output: Bounding Box.
Recipe for a General Purpose Vision Tool: Because Grounding DINO is slow (~500 ms per image), we often combine it with SAM to create a robust tool:
```python
def find_item_pixels(image_path, text_query):
    # Assumes `grounding_dino` and `sam` are pre-loaded model wrappers
    # (e.g., from the groundingdino and segment_anything packages).
    # 1. Use Grounding DINO to get the Semantic Box (Open Vocab).
    #    Slow, but finds the "green lego" specifically.
    boxes = grounding_dino.predict(image_path, text_query)
    # 2. Use SAM to refine each box into a Mask (Geometry).
    #    Fast refinement of the box into exact pixels.
    masks = sam.predict(image_path, boxes)
    return boxes, masks
```
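In an agent loop, this becomes a single tool call (the file name and query here are just illustrative):

```python
boxes, masks = find_item_pixels("workbench.jpg", "green lego block")
```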
This function allows an agent to find anything mentioned in a user prompt without needing Retraining.
5. Fine-Tuning for Specialized Domains
Sometimes “General” isn’t enough.
- Medical: Detecting distinct types of tumors. Grounding DINO might just see “Blob”.
- Manufacturing: Detecting microscopic scratch defects on a PCB.
- Satellite: Detecting specific tank models from orbit.
In these cases, we use a Data Engine Strategy:
- Auto-Labeling: Use the large models (Grounding DINO + SAM) to generate initial noisy labels for your specific data (see the sketch after this list).
- Human Verification: A human reviews the masks. Correcting a mask is 10x faster than drawing it from scratch.
- Distillation: Once you have 500 clean labeled images, you train a small YOLO model on this dataset.
- Deployment: You deploy the tiny, fast YOLO model to the edge device.
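Step 1 can be sketched by reusing `find_item_pixels` from Section 4 to emit labels in the standard YOLO txt format (`class_id cx cy w h`, normalized to image size). The directory layout and single-class setup are illustrative assumptions:

```python
from pathlib import Path
from PIL import Image

def auto_label(image_dir: str, text_query: str, class_id: int = 0) -> None:
    """Write one YOLO-format .txt label file per image, for human review."""
    for img_path in Path(image_dir).glob("*.jpg"):
        w, h = Image.open(img_path).size
        boxes, _ = find_item_pixels(str(img_path), text_query)  # from Section 4
        lines = [
            f"{class_id} {(x1 + x2) / 2 / w:.6f} {(y1 + y2) / 2 / h:.6f} "
            f"{(x2 - x1) / w:.6f} {(y2 - y1) / h:.6f}"
            for (x1, y1, x2, y2) in boxes  # assumes pixel-space xyxy boxes
        ]
        img_path.with_suffix(".txt").write_text("\n".join(lines))
```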
6. Summary
- YOLO: The workhorse. Use it for high-speed, fixed-class detection where latency or cost matters (Security, Traffic, Wake Words).
- Grounding DINO: The explorer. Use it for open-world, text-based detection where flexibility matters (Generic Agents searching for objects).
- SAM: The surgeon. Use it when you need pixel-perfect precision (Robotics, Photo Editing).
Combining these creates the “Spatial Awareness” loop. The agent knows what is there, and exactly where it is. In the final post of this vision section, we will add the dimension of Time—looking at Video Analysis and Real-World perception.