Use conditional breakpoint() and debug_print to inspect tensor shapes, dtypes, and NaN values mid-training
Profile training loops with cProfile, line_profiler, and tracemalloc to find bottlenecks
Detect common AI bugs: shape mismatches, NaN loss, data leakage, and wrong-device tensors
Set up TensorBoard to visualize loss curves, weight histograms, and gradient distributions
The Problem
AI code fails differently than regular code. A web app crashes with a stack trace. A misconfigured training loop runs for 8 hours, burns hundreds of dollars in GPU time, and produces a model that predicts the mean of every input. The code never raised an error. The bug was a tensor on the wrong device, a forgotten .detach(), or labels leaking into features.
You need debugging tools that catch these silent failures before they waste your time and compute.
Most people jump straight to level 3 (staring at TensorBoard). But 80% of AI bugs live at levels 1 and 2.
Build It
Part 1: Print Debugging (Yes, It Works)
Print debugging gets dismissed. It shouldn't. For tensor code, a targeted print statement beats stepping through a debugger because you need to see shapes, dtypes, and value ranges all at once.
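A minimal sketch of what such a targeted print might look like. The name debug_print matches the helper this lesson refers to, but the body below is an assumption, not the lesson's own implementation:

```python
import torch


def debug_print(name, tensor):
    """Print and return a one-line summary of a tensor (hypothetical helper)."""
    t = tensor.detach()
    summary = f"{name}: shape={tuple(t.shape)} dtype={t.dtype} device={t.device}"
    if t.numel() > 0 and t.is_floating_point():
        # Compute value ranges over finite entries only, so one NaN
        # does not hide the rest of the statistics.
        finite = t[torch.isfinite(t)].float()
        if finite.numel() > 0:
            summary += (
                f" min={finite.min().item():.4g}"
                f" max={finite.max().item():.4g}"
                f" mean={finite.mean().item():.4g}"
            )
        summary += f" nan={int(torch.isnan(t).sum())} inf={int(torch.isinf(t).sum())}"
    print(summary)
    return summary
```

Calling debug_print("logits", outputs) inside the first few training steps shows shape, dtype, device, and value range on a single line.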
Logging gives you timestamps, severity levels, and file output. When a training run fails at 3 AM, you want a log file, not terminal output that scrolled off screen.
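One way to set this up with the standard logging module is sketched below. The function name make_logger, the file name train.log, and the message format are illustrative choices, not something the lesson prescribes:

```python
import logging


def make_logger(log_path="train.log", name="train"):
    """Return a logger that writes DEBUG+ to a file and INFO+ to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate lines if called twice
    logger.propagate = False  # keep messages out of the root logger

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    file_handler = logging.FileHandler(log_path)
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(fmt)
    logger.addHandler(file_handler)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    console.setFormatter(fmt)
    logger.addHandler(console)
    return logger
```

In a training loop, something like log.debug(f"step={step} loss={loss_value:.4f}") keeps per-step detail in the file, while log.warning(...) also reaches the terminal.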
Part 4: Timing Code Sections
Knowing where time goes is the first step to optimization.
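import time

class Timer:
    """Context manager that prints the wall-clock time of the wrapped block."""

    def __init__(self, name=""):
        self.name = name

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        elapsed = time.perf_counter() - self.start
        print(f"[{self.name}] {elapsed:.4f}s")

# dataloader_iter, model, and loss come from your own training loop.
# On GPU, kernels launch asynchronously: call torch.cuda.synchronize()
# before each block ends, or the forward/backward timings will read too low.
with Timer("data loading"):
    batch = next(dataloader_iter)
with Timer("forward pass"):
    outputs = model(batch)
with Timer("backward pass"):
    loss.backward()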
Common finding: data loading takes more time than the forward and backward passes combined, for example 60% of each training step. The first fix to try is num_workers > 0 in your DataLoader, not a faster GPU.
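To turn individual timings into percentages, the timer can accumulate totals per section across many steps. The sketch below is one possible variant; the class name SectionTimer and its methods are hypothetical:

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulate wall-clock time per named section over many iterations."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each section's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}

    def report(self):
        ranked = sorted(self.shares().items(), key=lambda kv: kv[1], reverse=True)
        for name, share in ranked:
            print(f"{name:<15} {self.totals[name]:8.4f}s {share:6.1%}")
```

Wrap each phase in with timer.section("data loading"): inside the loop and call timer.report() after a few hundred steps. The same caveat about asynchronous GPU kernels applies here.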
Part 5: cProfile and line_profiler
When you need more than manual timers:
python -m cProfile -s cumtime train.py
This shows every function call sorted by cumulative time. For line-by-line profiling:
pip install line_profiler
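import torch.nn.functional as F

# kernprof injects the @profile decorator at run time; do not import it.
@profile
def train_step(model, data, target):
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    return loss

# Run with: kernprof -l -v train.py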
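cProfile can also be driven from inside a script, which helps when only one function is of interest rather than the whole run. A small sketch using the standard cProfile and pstats modules follows; the wrapper name profile_call is an assumption:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the `top` most expensive entries.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

For example, result, report = profile_call(train_one_epoch, model, loader) followed by print(report), where train_one_epoch stands in for your own function.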
Part 6: Memory Profiling
CPU Memory with tracemalloc
import tracemalloc

tracemalloc.start()

# your code here
model = build_model()
data = load_dataset()

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)
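A single snapshot shows everything allocated since tracing began. To isolate one step, two snapshots can be compared. The sketch below uses Snapshot.compare_to from the standard library; the wrapper name top_allocations_between is hypothetical:

```python
import tracemalloc


def top_allocations_between(func, limit=5):
    """Run func and return (result, lines whose allocations changed the most)."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = func()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Each entry is a StatisticDiff with size_diff in bytes, largest change first.
    diffs = after.compare_to(before, "lineno")
    return result, diffs[:limit]
```

Printing each returned entry shows the file, line number, and how many bytes that line added while func ran.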
CPU Memory with memory_profiler
pip install memory_profiler
import pandas as pd
from memory_profiler import profile

@profile
def load_data():
    raw = pd.read_csv("data.csv")  # watch memory jump here
    processed = preprocess(raw)    # and here (preprocess is your own function)
    return processed
Run with python -m memory_profiler your_script.py to see line-by-line memory usage.
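GPU Memory
If you are close to running out of GPU memory:
Use torch.cuda.empty_cache() to release cached memory that no live tensor is using
Use del tensor followed by torch.cuda.empty_cache() for large intermediates
Use mixed precision (torch.amp, formerly torch.cuda.amp) to shrink activation memory, by up to roughly half
Use gradient checkpointing for very deep models, trading extra compute for lower memory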
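Before applying any of these, it helps to see how much memory live tensors hold versus how much the caching allocator has merely reserved. A minimal sketch using PyTorch's memory counters follows; the function name gpu_memory_report is an assumption, and it returns None on machines without CUDA:

```python
import torch


def gpu_memory_report(device=None):
    """Return allocated, reserved, and peak GPU memory in MB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mb = 1024 ** 2
    return {
        # Memory held by live tensors.
        "allocated_mb": torch.cuda.memory_allocated(device) / mb,
        # Memory held by the caching allocator (live tensors plus cache).
        "reserved_mb": torch.cuda.memory_reserved(device) / mb,
        # High-water mark of allocated memory so far.
        "peak_allocated_mb": torch.cuda.max_memory_allocated(device) / mb,
    }
```

A large gap between reserved and allocated is the part that torch.cuda.empty_cache() can hand back. If allocated itself is high, only freeing tensors or shrinking the workload helps.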
Part 7: Common AI Bugs and How to Catch Them
Shape Mismatch
The most frequent bug. A tensor has shape [batch, features] when the model expects [batch, channels, height, width].
import torch

def check_shapes(model, sample_input):
    """Print the input and output shape of every module for one forward pass."""
    print(f"Input: {sample_input.shape}")
    hooks = []

    def make_hook(name):
        def hook(module, inp, out):
            # inp is a tuple of positional inputs; report the first one
            in_shape = inp[0].shape if isinstance(inp, tuple) else inp.shape
            out_shape = out.shape if hasattr(out, "shape") else type(out)
            print(f"  {name}: {in_shape} -> {out_shape}")
        return hook

    for name, module in model.named_modules():
        hooks.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            model(sample_input)
    finally:
        # Remove hooks even if the forward pass raises on a shape mismatch
        for h in hooks:
            h.remove()
Run this once with a sample batch. It maps every shape transformation in your model.
NaN Loss
NaN loss means something exploded. Common causes:
Learning rate too high
Division by zero in custom loss
Log of zero or negative number
Exploding gradients in RNNs
import torch

def detect_nan(model, loss, step):
    """Call after loss.backward(); reports which parameters have bad gradients."""
    if torch.isnan(loss):
        print(f"NaN loss at step {step}")
        for name, param in model.named_parameters():
            if param.grad is not None:
                if torch.isnan(param.grad).any():
                    print(f"  NaN gradient in {name}")
                if torch.isinf(param.grad).any():
                    print(f"  Inf gradient in {name}")
        return True
    return False
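Detection tells you where the NaN appeared. For the log-of-zero cause listed above, a common guard is to clamp the input before taking the log. The helper below is a sketch; the name safe_log and the eps value are illustrative assumptions:

```python
import torch


def safe_log(x, eps=1e-8):
    """Log that never sees values below eps, so it cannot return -inf or NaN
    for zero or slightly negative inputs."""
    return torch.log(torch.clamp(x, min=eps))


probs = torch.tensor([0.0, 0.5, 1.0])
unsafe = torch.log(probs)  # first element is -inf
guarded = safe_log(probs)  # every element is finite
```

Clamping changes the loss slightly near zero, so eps is a trade-off between stability and exactness. NaN values already present in x still pass through.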
Data Leakage
Your model gets 99% accuracy on the test set. Sounds great. On a hard problem, it's usually a bug: the same samples ended up in both train and test.
def check_data_leakage(train_set, test_set, id_column="id"):
    """Report IDs present in both splits (expects pandas DataFrames)."""
    train_ids = set(train_set[id_column].tolist())
    test_ids = set(test_set[id_column].tolist())
    overlap = train_ids & test_ids
    if overlap:
        print(f"DATA LEAKAGE: {len(overlap)} samples in both train and test")
        return True
    return False
Also check for temporal leakage: using future data to predict the past. Sort by timestamp and split chronologically, so that every training sample is older than every test sample.
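A chronological split could be sketched as follows, assuming the data is a pandas DataFrame. The function name temporal_split and the column name timestamp are hypothetical:

```python
import pandas as pd


def temporal_split(df, time_column="timestamp", test_fraction=0.2):
    """Split so that the most recent rows form the test set."""
    ordered = df.sort_values(time_column).reset_index(drop=True)
    n_test = max(1, round(len(ordered) * test_fraction))
    cutoff = len(ordered) - n_test
    # Rows that share the boundary timestamp can land on both sides;
    # split on a timestamp value instead if that matters for your data.
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]
```

The key property to verify afterwards is that the latest training timestamp is not later than the earliest test timestamp.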
Wrong Device
Mixing tensors from different devices (CPU vs GPU) in one operation raises a runtime error. The silent failures are different: the model and data never leave the CPU, or tensors get copied between CPU and GPU every step, and training just runs slowly.
def check_devices(model, *tensors):
    model_device = next(model.parameters()).device
    print(f"Model device: {model_device}")
    for i, t in enumerate(tensors):
        if t.device != model_device:
            print(f"  WARNING: tensor {i} on {t.device}, model on {model_device}")
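Once a mismatch is found, the usual remedy is to move the whole batch to the model's device in one place. The recursive helper below is a sketch; the name move_to_device is an assumption, and named tuples come back as plain tuples:

```python
import torch


def move_to_device(batch, device):
    """Move every tensor in a nested dict/list/tuple batch to device."""
    if torch.is_tensor(batch):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: move_to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, list):
        return [move_to_device(value, device) for value in batch]
    if isinstance(batch, tuple):
        return tuple(move_to_device(value, device) for value in batch)
    return batch  # non-tensor values (ints, strings) are returned unchanged
```

Typical use is device = next(model.parameters()).device followed by batch = move_to_device(batch, device) at the top of each step.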
Part 8: TensorBoard Basics
TensorBoard shows you what's happening inside training over time.
pip install tensorboard
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/experiment_1")

for step in range(num_steps):
    loss = train_step(model, batch)
    writer.add_scalar("loss/train", loss.item(), step)
    writer.add_scalar("lr", optimizer.param_groups[0]["lr"], step)

    if step % 100 == 0:
        for name, param in model.named_parameters():
            writer.add_histogram(f"weights/{name}", param, step)
            if param.grad is not None:
                writer.add_histogram(f"grads/{name}", param.grad, step)

writer.close()
Launch it:
tensorboard --logdir=runs
What to look for:
Loss not decreasing: Learning rate too low, or model architecture issue
Loss oscillating wildly: Learning rate too high
Loss goes to NaN: Numerical instability (see NaN section above)
Train loss decreasing, val loss increasing: Overfitting
Weight histograms collapsing to zero: Vanishing gradients
Gradient histograms exploding: Need gradient clipping
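For the exploding-gradient case, clipping and monitoring can share one call, because torch.nn.utils.clip_grad_norm_ returns the total gradient norm measured before clipping. The wrapper name clip_and_measure and the max_norm default are illustrative assumptions:

```python
import torch
from torch import nn


def clip_and_measure(model, max_norm=1.0):
    """Clip gradients in place and return the total norm before clipping."""
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    return float(total_norm)
```

Call it after loss.backward() and before optimizer.step(). The returned value can be logged as a scalar, and a norm that keeps growing across steps is an early sign of instability.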
Part 9: VS Code Debugger
For interactive debugging, configure VS Code with a launch.json that starts your training script under the Python debugger.
Set breakpoints by clicking the gutter. Use the Variables pane to inspect tensor properties. The Debug Console lets you run arbitrary Python expressions mid-execution.
Useful for stepping through data preprocessing pipelines where you want to see each transformation.
Use It
Here's the debugging workflow that catches most AI bugs:
Before training: Run check_shapes with a sample batch. Verify input and output dimensions match expectations.
First 10 steps: Use debug_print on loss, outputs, and gradients. Confirm nothing is NaN and values are in reasonable ranges.
During training: Log loss, learning rate, and gradient norms. Use TensorBoard for visualization.
When something breaks: Drop breakpoint() at the failure point. Inspect tensors interactively.
For performance: Time your data loading vs forward vs backward pass. Profile memory if you're near OOM.
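The "when something breaks" step can be made conditional, so the debugger opens only when a value goes bad instead of on every iteration. The sketch below uses only the standard library; the function name break_if_bad_loss and its start_step parameter are hypothetical:

```python
import math


def break_if_bad_loss(loss_value, step, start_step=0):
    """Enter the debugger when the loss is NaN or infinite at or after start_step."""
    if step >= start_step and not math.isfinite(loss_value):
        print(f"Non-finite loss {loss_value} at step {step}; entering debugger")
        # breakpoint() honors the PYTHONBREAKPOINT environment variable;
        # PYTHONBREAKPOINT=0 disables it for unattended runs.
        breakpoint()
        return True
    return False
```

Inside the loop this would be called as break_if_bad_loss(loss.item(), step). The debugger stops inside the helper, so type up at the prompt to reach the training loop's frame and inspect its tensors.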
See outputs/prompt-debug-ai-code.md for a prompt that helps diagnose AI-specific bugs.
Exercises
Run debug_tools.py and read through each section's output. Modify the dummy model to introduce a NaN (hint: divide by zero in the forward pass) and watch the detector catch it.
Profile a training loop with cProfile and identify the slowest function.
Use tracemalloc to find which line in your data loading pipeline allocates the most memory.
Set up TensorBoard for a simple training run and identify whether the model is overfitting.
Use breakpoint() inside a training loop. Practice inspecting tensor shapes, devices, and gradient values from the debugger prompt.