Weights & Biases

Monitor, analyze, and compare W&B training runs.

Setup

wandb login
# Or set the WANDB_API_KEY environment variable instead

Scripts

Characterize a Run (Full Health Analysis)

~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/characterize_run.py ENTITY/PROJECT/RUN_ID

Analyzes:

Loss curve trend (start → current, % change, direction)
Gradient norm health (exploding/vanishing detection)
Eval metrics (if present)
Stall detection (heartbeat age)
Progress & ETA estimate
Config highlights
Overall health verdict

Options: --json for machine-readable output.
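
The loss-curve trend item above (start → current, % change, direction) can be sketched in a few lines. This is an illustration only, not the code in characterize_run.py: the function name loss_trend, the 1% "flat" tolerance, and the direction labels are assumptions made for the example.

```python
def loss_trend(losses, flat_tolerance_pct=1.0):
    """Summarize a loss series as start, current, % change, and direction.

    Hypothetical helper: the name, the tolerance, and the labels are
    illustrative assumptions, not taken from characterize_run.py.
    """
    # Drop missing values so sparse logging does not break the summary.
    values = [v for v in losses if v is not None]
    if len(values) < 2:
        return None  # not enough points to describe a trend

    start, current = values[0], values[-1]
    if start == 0:
        return None  # percentage change is undefined for a zero start

    pct_change = (current - start) / abs(start) * 100.0
    if abs(pct_change) <= flat_tolerance_pct:
        direction = "flat"
    elif pct_change < 0:
        direction = "decreasing"
    else:
        direction = "increasing"

    return {
        "start": start,
        "current": current,
        "pct_change": pct_change,
        "direction": direction,
    }


print(loss_trend([2.0, 1.5, 1.0]))
# {'start': 2.0, 'current': 1.0, 'pct_change': -50.0, 'direction': 'decreasing'}
```

Comparing only the first and last points is the simplest reading of "start → current"; a noisier run may call for averaging a window at each end instead.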

Watch All Running Jobs

~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/watch_runs.py ENTITY [--projects p1,p2]

Quick health summary of all running jobs plus recent failures/completions. Ideal for morning briefings.

Options:

--projects p1,p2 — Specific projects to check
--all-projects — Check all projects
--hours N — Hours to look back for finished runs (default: 24)
--json — Machine-readable output

Compare Two Runs

~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/compare_runs.py ENTITY/PROJECT/RUN_A ENTITY/PROJECT/RUN_B

Side-by-side comparison:

Config differences (highlights important params)
Loss curves at same steps
Gradient norm comparison
Eval metrics
Performance (tokens/sec, steps/hour)
Winner verdict
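
Comparing loss curves "at same steps" means pairing values only where both runs logged a point. The sketch below shows one way to do that on rows shaped like the dicts returned by run.scan_history. It is not the logic in compare_runs.py; the function name align_on_common_steps and the default key names are assumptions, and each row is assumed to carry both the step key and the metric key.

```python
def align_on_common_steps(history_a, history_b,
                          step_key="_step", metric_key="train/loss"):
    """Pair metric values from two runs at the steps both runs logged.

    Hypothetical helper for illustration; rows are plain dicts.
    Returns a list of (step, value_a, value_b) tuples sorted by step.
    """
    def to_series(history):
        # Map step -> metric value, skipping rows missing either field.
        return {
            row[step_key]: row[metric_key]
            for row in history
            if row.get(step_key) is not None
            and row.get(metric_key) is not None
        }

    series_a = to_series(history_a)
    series_b = to_series(history_b)
    common_steps = sorted(series_a.keys() & series_b.keys())
    return [(step, series_a[step], series_b[step]) for step in common_steps]


run_a = [
    {"_step": 0, "train/loss": 2.0},
    {"_step": 100, "train/loss": 1.5},
    {"_step": 200, "train/loss": 1.2},
]
run_b = [
    {"_step": 0, "train/loss": 2.1},
    {"_step": 200, "train/loss": 1.0},
]
print(align_on_common_steps(run_a, run_b))
# [(0, 2.0, 2.1), (200, 1.2, 1.0)]
```

Intersecting on exact steps avoids interpolation, at the cost of dropping points when the two runs log at different intervals.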

Python API Quick Reference

import wandb
api = wandb.Api()

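# Get runs (the second argument is a filter dict)
runs = api.runs("entity/project", {"state": "running"})

# Get a single run by its path
run = api.run("entity/project/run_id")

# Run properties
run.state      # running | finished | failed | crashed | canceled
run.name       # display name
run.id         # unique identifier
run.summary    # final/current metrics
run.config     # hyperparameters
run.heartbeat_at # last heartbeat timestamp, used for stall detection

# Get history (only rows that contain all of the listed keys)
history = list(run.scan_history(keys=["train/loss", "train/grad_norm"]))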

Metric Key Variations

The scripts recognize these metric key variants automatically:

Loss: train/loss, loss, train_loss, training_loss
Gradients: train/grad_norm, grad_norm, gradient_norm
Steps: train/global_step, global_step, step, _step
Eval: eval/loss, eval_loss, eval/accuracy, eval_acc
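
One way to handle such variants is to try each candidate key in order and use the first one present. The sketch below illustrates the idea; it is not the scripts' actual lookup code. The names METRIC_ALIASES and resolve_metric_key are hypothetical, and treating the listed order as a priority order is an assumption. Eval keys are left out because they name several distinct metrics, not aliases of one.

```python
# Candidate keys per metric kind, copied from the list above.
# Using the listed order as priority is an assumption.
METRIC_ALIASES = {
    "loss": ["train/loss", "loss", "train_loss", "training_loss"],
    "grad_norm": ["train/grad_norm", "grad_norm", "gradient_norm"],
    "step": ["train/global_step", "global_step", "step", "_step"],
}


def resolve_metric_key(available_keys, kind):
    """Return the first alias for `kind` found in `available_keys`, else None.

    Hypothetical helper for illustration.
    """
    available = set(available_keys)
    for candidate in METRIC_ALIASES[kind]:
        if candidate in available:
            return candidate
    return None


logged = ["train_loss", "grad_norm", "_step", "learning_rate"]
print(resolve_metric_key(logged, "loss"))       # train_loss
print(resolve_metric_key(logged, "grad_norm"))  # grad_norm
print(resolve_metric_key(logged, "step"))       # _step
```

In practice, available_keys could come from the keys of run.summary or from the first row of a run's history.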

Health Thresholds

Gradient norm > 10: Exploding (critical)
Gradient norm > 5: Spiky (warning)
Gradient norm < 0.0001: Vanishing (warning)
Heartbeat age > 30 min: Stalled (critical)
Heartbeat age > 10 min: Slow (warning)
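
The thresholds above translate directly into two small classifiers. The sketch below is illustrative and is not taken from the scripts: the function names, the ("severity", "label") return shape, and the "ok"/"healthy" fallback are assumptions. The heartbeat check takes datetime objects, so converting run.heartbeat_at into a datetime is left to the caller.

```python
from datetime import datetime, timedelta, timezone


def grad_norm_status(grad_norm):
    """Classify a gradient norm using the thresholds listed above."""
    # Check the more severe threshold first so > 10 is not reported as spiky.
    if grad_norm > 10:
        return ("critical", "exploding")
    if grad_norm > 5:
        return ("warning", "spiky")
    if grad_norm < 0.0001:
        return ("warning", "vanishing")
    return ("ok", "healthy")  # fallback label is an assumption


def heartbeat_status(heartbeat_at, now):
    """Classify heartbeat age; both arguments are datetime objects."""
    age = now - heartbeat_at
    if age > timedelta(minutes=30):
        return ("critical", "stalled")
    if age > timedelta(minutes=10):
        return ("warning", "slow")
    return ("ok", "healthy")  # fallback label is an assumption


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(grad_norm_status(7.3))
# ('warning', 'spiky')
print(heartbeat_status(now - timedelta(minutes=45), now))
# ('critical', 'stalled')
```

Passing now explicitly keeps the heartbeat check deterministic and easy to test.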

Integration Notes

For morning briefings, use watch_runs.py --json and parse the output.

For detailed analysis of a specific run, use characterize_run.py.

For A/B testing or hyperparameter comparisons, use compare_runs.py.