Phase 04 - Lesson 27
Multi-Object Tracking & Video Memory
This lesson includes a graded coding exercise that runs in your browser, unlocked with lifetime access.
Tracking is detection plus association. Detect every frame. Match this frame's detections to last frame's tracks by ID.
Type: Build Languages: Python Prerequisites: Phase 4 Lesson 06 (YOLO Detection), Phase 4 Lesson 08 (Mask R-CNN), Phase 4 Lesson 24 (SAM 3) Time: ~60 minutes
Learning Objectives
- Distinguish tracking-by-detection from query-based tracking and name the algorithm families (SORT, DeepSORT, ByteTrack, BoT-SORT, SAM 2 memory tracker, SAM 3.1 Object Multiplex)
- Implement IoU + Hungarian assignment from scratch for classic tracking-by-detection
- Explain SAM 2's memory bank and why it handles occlusion better than IoU-based association
- Read the three tracking metrics (MOTA, IDF1, HOTA) and pick which one matters for a given use case
The Problem
A detector tells you where the objects are in a single frame. A tracker tells you which detection in frame t is the same object as a detection in frame t-1. Without that, you cannot count objects crossing a line, follow a ball through an occlusion, or know "car #4 has been in the lane for 8 seconds."
Tracking is essential to every video-facing product: sports analytics, surveillance, autonomous driving, medical video analysis, wildlife monitoring, wordmark counting. The core building blocks are shared: a per-frame detector, a motion model (Kalman filter or something richer), an association step (Hungarian algorithm on IoU / cosine / learned features), and a track lifecycle (birth, update, death).
2026 brought two new patterns: SAM 2 memory-based tracking (feature-memory instead of motion-model association) and SAM 3.1 Object Multiplex (shared memory for many instances of the same concept). This lesson walks the classical stack first, then the memory-based approach.
The Concept
Tracking-by-detection
flowchart LR
F1["Frame t"] --> DET["Detector"] --> D1["Detections at t"]
PREV["Tracks up to t-1"] --> PREDICT["Motion predict<br/>(Kalman)"]
PREDICT --> PRED["Predicted tracks at t"]
D1 --> ASSOC["Hungarian assignment<br/>(IoU / cosine / motion)"]
PRED --> ASSOC
ASSOC --> UPDATE["Update matched tracks"]
ASSOC --> NEW["Birth new tracks"]
ASSOC --> DEAD["Age unmatched tracks; delete after N"]
UPDATE --> NEXT["Tracks at t"]
NEW --> NEXT
DEAD --> NEXT
style DET fill:#dbeafe,stroke:#2563eb
style ASSOC fill:#fef3c7,stroke:#d97706
style NEXT fill:#dcfce7,stroke:#16a34a
Every tracker you will encounter in 2026 is a variation on this loop. The differences:
- SORT (2016): Kalman filter + IoU Hungarian. Simple, fast, no appearance model.
- DeepSORT (2017): SORT + a CNN-based appearance feature per track (ReID embedding). Handles crossings better.
- ByteTrack (2021): associates low-confidence detections as a second stage; no appearance features needed but top performer on MOT17.
- BoT-SORT (2022): Byte + camera motion compensation + ReID.
- StrongSORT / OC-SORT — ByteTrack descendants with better motion and appearance.
Kalman filter in one paragraph
A Kalman filter maintains a per-track state (x, y, w, h, dx, dy, dw, dh) with a covariance. At each frame, predict the state using a constant-velocity model, then update with the matched detection. The update trusts the detection more when the predict uncertainty is high. This gives smooth trajectories and the ability to continue a track through a short occlusion (1-5 frames).
Every classical tracker uses a Kalman filter in the motion-prediction step.
The Hungarian algorithm
Given a M x N cost matrix (tracks x detections), find the one-to-one assignment that minimises total cost. Cost is usually 1 - IoU(track_bbox, detection_bbox) or negative cosine similarity of appearance features. Runtime is O((M+N)^3); for M, N up to ~1000 it is fast enough in Python via scipy.optimize.linear_sum_assignment.
ByteTrack's key idea
Standard trackers drop low-confidence detections (< 0.5). ByteTrack keeps them around as second-stage candidates: after matching tracks to high-confidence detections, unmatched tracks try to match low-confidence detections with a slightly looser IoU threshold. Recovers short occlusions, ID switches near crowds.
SAM 2 memory-based tracking
SAM 2 handles video by keeping a memory bank of per-instance spatio-temporal features. Given a prompt (click, box, text) on one frame, it encodes the instance into memory. On subsequent frames, the memory is cross-attended against the new frame's features, and the decoder produces a mask for the same instance in the new frame.
No Kalman filter, no Hungarian assignment. The association is implicit in the memory-attention operation.
Pros:
- Robust to large occlusions (memory carries instance identity across many frames).
- Open-vocabulary when combined with SAM 3's text prompts.
- Works without a separate motion model.
Cons:
- Slower than ByteTrack for many-object tracking.
- Memory bank grows; limits the context window.
SAM 3.1 Object Multiplex
Prior SAM 2 / SAM 3 tracking keeps a separate memory bank per instance. For 50 objects, 50 memory banks. Object Multiplex (March 2026) collapses them into one shared memory with per-instance query tokens. Cost scales sub-linearly in number of instances.
Multiplex is the new default for crowd tracking in 2026: concert crowds, warehouse workers, traffic intersections.
Three metrics to know
- MOTA (Multi-Object Tracking Accuracy) — 1 - (FN + FP + ID switches) / GT. Weighted by error type; a single metric that conflates detection and association failures.
- IDF1 (ID F1) — harmonic mean of ID precision and recall. Focuses specifically on how well each ground-truth track keeps its ID over time. Better than MOTA for ID-switch-sensitive tasks.
- HOTA (Higher Order Tracking Accuracy) — decomposes into detection accuracy (DetA) and association accuracy (AssA). The community standard since 2020; most comprehensive.
For surveillance (who is who): IDF1 is what you report. For sports analytics (counting passes): HOTA. For general academic comparison: HOTA.
Build It
Step 1: IoU-based cost matrix
import numpy as np
def bbox_iou(a, b):
"""
a, b: (N, 4) arrays of [x1, y1, x2, y2].
Returns (N_a, N_b) IoU matrix.
"""
ax1, ay1, ax2, ay2 = a[:, 0], a[:, 1], a[:, 2], a[:, 3]
bx1, by1, bx2, by2 = b[:, 0], b[:, 1], b[:, 2], b[:, 3]
inter_x1 = np.maximum(ax1[:, None], bx1[None, :])
inter_y1 = np.maximum(ay1[:, None], by1[None, :])
inter_x2 = np.minimum(ax2[:, None], bx2[None, :])
inter_y2 = np.minimum(ay2[:, None], by2[None, :])
inter = np.clip(inter_x2 - inter_x1, 0, None) * np.clip(inter_y2 - inter_y1, 0, None)
area_a = (ax2 - ax1) * (ay2 - ay1)
area_b = (bx2 - bx1) * (by2 - by1)
union = area_a[:, None] + area_b[None, :] - inter
return inter / np.clip(union, 1e-8, None)
Step 2: Minimal SORT-style tracker
Fixed constant-velocity Kalman omitted for brevity — we use a simple IoU association here; in production the Kalman predict is essential. The sort Python package provides the full version.
from scipy.optimize import linear_sum_assignment
class Track:
def __init__(self, tid, bbox, frame):
self.id = tid
self.bbox = bbox
self.last_frame = frame
self.hits = 1
def update(self, bbox, frame):
self.bbox = bbox
self.last_frame = frame
self.hits += 1
class SimpleTracker:
def __init__(self, iou_threshold=0.3, max_age=5):
self.tracks = []
self.next_id = 1
self.iou_threshold = iou_threshold
self.max_age = max_age
def step(self, detections, frame):
if not self.tracks:
for d in detections:
self.tracks.append(Track(self.next_id, d, frame))
self.next_id += 1
return [(t.id, t.bbox) for t in self.tracks]
track_boxes = np.array([t.bbox for t in self.tracks])
det_boxes = np.array(detections) if len(detections) else np.empty((0, 4))
iou = bbox_iou(track_boxes, det_boxes) if len(det_boxes) else np.zeros((len(track_boxes), 0))
cost = 1 - iou
cost[iou < self.iou_threshold] = 1e6
matched_track = set()
matched_det = set()
if cost.size > 0:
row, col = linear_sum_assignment(cost)
for r, c in zip(row, col):
if cost[r, c] < 1.0:
self.tracks[r].update(det_boxes[c], frame)
matched_track.add(r); matched_det.add(c)
for i, d in enumerate(det_boxes):
if i not in matched_det:
self.tracks.append(Track(self.next_id, d, frame))
self.next_id += 1
self.tracks = [t for t in self.tracks if frame - t.last_frame <= self.max_age]
return [(t.id, t.bbox) for t in self.tracks]
60 lines. Takes per-frame detections, returns per-frame track IDs. Real systems add the Kalman predict, ByteTrack's second-stage re-match, and appearance features.
Step 3: Synthetic trajectory test
def synthetic_frames(num_frames=20, num_objects=3, H=240, W=320, seed=0):
rng = np.random.default_rng(seed)
starts = rng.uniform(20, 200, size=(num_objects, 2))
velocities = rng.uniform(-5, 5, size=(num_objects, 2))
frames = []
for f in range(num_frames):
dets = []
for i in range(num_objects):
cx, cy = starts[i] + f * velocities[i]
dets.append([cx - 10, cy - 10, cx + 10, cy + 10])
frames.append(dets)
return frames
tracker = SimpleTracker()
for f, dets in enumerate(synthetic_frames()):
tracks = tracker.step(dets, f)
Three objects moving in straight lines should keep their IDs across all 20 frames.
Step 4: ID-switch metric
def count_id_switches(tracks_per_frame, gt_per_frame):
"""
tracks_per_frame: list of list of (track_id, bbox)
gt_per_frame: list of list of (gt_id, bbox)
Returns number of ID switches.
"""
prev_assignment = {}
switches = 0
for tracks, gts in zip(tracks_per_frame, gt_per_frame):
if not tracks or not gts:
continue
t_boxes = np.array([b for _, b in tracks])
g_boxes = np.array([b for _, b in gts])
iou = bbox_iou(g_boxes, t_boxes)
for g_idx, (gt_id, _) in enumerate(gts):
j = iou[g_idx].argmax()
if iou[g_idx, j] > 0.5:
t_id = tracks[j][0]
if gt_id in prev_assignment and prev_assignment[gt_id] != t_id:
switches += 1
prev_assignment[gt_id] = t_id
return switches
This is a simplified IDF1-adjacent metric: count how many times a ground-truth object changes its assigned predicted track ID. Real MOTA / IDF1 / HOTA tooling lives in py-motmetrics and TrackEval.
Use It
Production trackers in 2026:
ultralytics— YOLOv8 + ByteTrack / BoT-SORT built-in.results = model.track(source, tracker="bytetrack.yaml"). The default.supervision(Roboflow) — ByteTrack wrappers plus annotation utilities.- SAM 2 / SAM 3.1 — memory-based tracking via
processor.track(). - Custom stack: detector (YOLOv8 / RT-DETR) +
sort-tracker/OC-SORT/StrongSORT.
Picking:
- Pedestrians / cars / boxes at 30+ fps: ByteTrack with ultralytics.
- Many instances of one class in a crowd: SAM 3.1 Object Multiplex.
- Heavy occlusions with identifiable appearance: DeepSORT / StrongSORT (ReID features).
- Sports / complex interactions: BoT-SORT or learned trackers (MOTRv3).
Ship It
This lesson produces:
outputs/prompt-tracker-picker.md— picks SORT / ByteTrack / BoT-SORT / SAM 2 / SAM 3.1 given scene type, occlusion patterns, and latency budget.outputs/skill-mot-evaluator.md— writes a complete evaluation harness for MOTA / IDF1 / HOTA against ground-truth tracks.
Exercises
- (Easy) Run the synthetic tracker above with 3, 10, and 30 objects. Report ID-switch count in each case. Identify where the simple IoU-only association starts to fail.
- (Medium) Add a constant-velocity Kalman predict step before association. Show that short (2-3 frame) occlusions no longer cause ID switches.
- (Hard) Integrate SAM 2's memory-based tracker (via
transformers) as an alternative tracker backend. Run both SimpleTracker and SAM 2 on a 30-second clip of a crowd and compare ID-switch counts, manually labelling ground-truth IDs for 5 salient people.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Tracking-by-detection | "Detect then associate" | Per-frame detector + Hungarian assignment on IoU / appearance |
| Kalman filter | "Motion predict" | Linear dynamics + covariance for smooth track predictions and occlusion handling |
| Hungarian algorithm | "Optimal assignment" | Solves the minimum-cost bipartite matching problem; scipy.optimize.linear_sum_assignment |
| ByteTrack | "Low-confidence second pass" | Re-match unmatched tracks to low-confidence detections to recover short occlusions |
| DeepSORT | "SORT + appearance" | Adds a ReID feature for cross-frame matching; better for ID preservation |
| Memory bank | "SAM 2 trick" | Per-instance spatio-temporal features stored across frames; cross-attention replaces explicit association |
| Object Multiplex | "SAM 3.1 shared memory" | Single shared memory with per-instance queries for fast many-object tracking |
| HOTA | "Modern tracking metric" | Decomposes into detection and association accuracy; community standard |
Further Reading
- SORT (Bewley et al., 2016) — the minimal tracking-by-detection paper
- DeepSORT (Wojke et al., 2017) — adds appearance feature
- ByteTrack (Zhang et al., 2022) — low-confidence second pass
- BoT-SORT (Aharon et al., 2022) — camera motion compensation
- HOTA (Luiten et al., 2020) — decomposed tracking metric
- SAM 2 video segmentation (Meta, 2024) — memory-based tracker
- SAM 3.1 Object Multiplex (Meta, March 2026)