Category: Development & Engineering
No API key required

slurm

Help write, debug, and manage SLURM jobs. Use when the user asks about sbatch, salloc, squeue, job scripts, or cluster resource allocation.

Author: jakexiaohubgithub

SLURM Assistant

Help the user write job scripts, debug failed jobs, and manage cluster resources.

Job Script Guidelines

  • Always include: --job-name, --output, --error, --time, --mem, --gres (for GPUs), --cpus-per-task
  • Place scripts in a dedicated folder (e.g. scripts/)
  • Use set -euo pipefail in the bash portion
  • Log key info at the start: hostname, GPU info (nvidia-smi), date, git commit hash
  • Activate the correct virtual environment before running Python
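
A minimal sketch of a job script that follows the guidelines above. The resource values, the logs/ directory, the venv path, and train.py are hypothetical placeholders, not values from this guide. The nvidia-smi and git calls are guarded so the logging still works outside a GPU node or a git checkout, and the workload only runs inside a SLURM allocation.

```shell
#!/bin/bash
#SBATCH --job-name=example-train
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --time=02:00:00
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4

set -euo pipefail

# Print the context needed to reproduce or debug this run.
log_job_info() {
    echo "host: $(hostname 2>/dev/null || echo unknown)"
    echo "date: $(date)"
    echo "commit: $(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi || echo "gpu: nvidia-smi failed"
    else
        echo "gpu: nvidia-smi not found"
    fi
}

log_job_info

# Hypothetical venv location; adjust to your project.
VENV_DIR="${VENV_DIR:-.venv-example}"

# Run the workload only inside a SLURM job, so the script can be rehearsed locally.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    . "$VENV_DIR/bin/activate"
    python train.py   # hypothetical entry point
fi
```

SLURM does not create the directory named in --output and --error, so the logs/ folder assumed here has to exist before submission.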

Resource Allocation Rules

  • Small experiments (<1M params): 1 GPU, 4-8 CPUs, 16-32GB RAM
  • Medium experiments (1M-1B params): 1-2 GPUs, 8-16 CPUs, 32-64GB RAM
  • Large models (7B+): multiple GPUs, 64-128GB+ RAM
  • 32B+ inference: 4+ GPUs, match tensor parallelism to GPU count
  • Rule of thumb: ~4-8 CPUs per GPU; for VRAM, FP16 weights take ~2 bytes per parameter (about 2GB per billion parameters), before activations
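
The VRAM rule of thumb can be turned into quick arithmetic. This sketch reads it as 2 bytes per parameter in FP16 and counts weights only, so the result is a lower bound; the function names are made up for illustration.

```shell
# Weights-only FP16 footprint: 2 bytes per parameter,
# so about 2 GB per billion parameters. $1 = parameters in billions.
fp16_weights_gb() {
    echo $(( $1 * 2 ))
}

# Smallest GPU count whose combined VRAM holds need_gb (ceiling division).
min_gpus_for() {
    local need_gb=$1 vram_gb=$2
    echo $(( (need_gb + vram_gb - 1) / vram_gb ))
}

need=$(fp16_weights_gb 32)
echo "32B model: ~${need} GB of weights, at least $(min_gpus_for "$need" 48) x 48GB GPUs"
```

The bound ignores activations and inference caches, which is one reason the guideline above asks for 4+ GPUs for 32B+ inference rather than the bare minimum computed here.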

Known GPU Types & Selection

GPU types (use with --gres=gpu:<type>:N)

  • a100: A100 40GB HBM2e
  • a100l: A100 80GB HBM2e
  • a6000: RTX A6000 48GB GDDR6
  • h100: H100 80GB HBM3
  • l40s: L40S 48GB GDDR6 (~45GB usable)
  • rtx8000: Quadro RTX 8000 48GB GDDR6
  • v100: V100 32GB HBM2

GPU selection by attribute

You can also request GPUs by memory, architecture, or feature:

  • By memory: --gres=gpu:48gb:1 (any 48GB GPU: RTX8000, A6000, L40S)
  • By arch: --gres=gpu:ampere:1 (A100, A6000; the L40S is an Ada Lovelace part, so check how the cluster tags it before relying on ampere to match it)
  • By interconnect: --gres=gpu:nvlink:1
  • By system: --gres=gpu:dgx:1
  • Memory tags: 12gb, 32gb, 40gb, 48gb, 80gb
  • Arch tags: volta, turing, ampere
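
The type and attribute selectors above all share one flag shape. A small sketch, with gres_flag as a hypothetical helper name, that builds the flag from a selector and a count:

```shell
# Build a --gres flag from a selector (GPU type or attribute tag) and a count.
# The count defaults to 1 when omitted.
gres_flag() {
    local selector=$1 count=${2:-1}
    echo "--gres=gpu:${selector}:${count}"
}

gres_flag a100l 2     # by type
gres_flag 48gb        # by memory tag
gres_flag ampere 1    # by architecture tag
```

The output can be spliced into a request, for example `salloc $(gres_flag 48gb 1) --time=1:00:00`.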

Node Inventory

| Nodes | Count | GPUs | CPUs | RAM |
|---|---|---|---|---|
| cn-l[001-091] | 91 | 4x L40S (48GB) | 48 | 1024GB |
| cn-c[001-040] | 40 | 8x RTX8000 (48GB) | 64 | 384GB |
| cn-g[001-029] | 29 | 4x A100 (80GB) | 64 | 1024GB |
| cn-a[001-011] | 11 | 8x RTX8000 (48GB) | 40 | 384GB |
| cn-b[001-005] | 5 | 8x V100 (32GB) | 40 | 384GB |
| cn-k[001-004] | 4 | 4x A100 (40GB) | 48 | 512GB |
| cn-n[001-002] | 2 | 8x H100 (80GB) | 192 | 2048GB |
| cn-d[001-004] (DGX) | 4 | 8x A100 (40/80GB) | 128 | 1024-2048GB |
| cn-j001 | 1 | 8x A6000 (48GB) | 64 | 1024GB |

Each node has either 4 or 8 GPUs, so don't request more GPUs per node than that node type has.
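
A request can be checked against the inventory before it is written into a script. This is a sketch only: the lookup is transcribed by hand from the table above, the function names are hypothetical, and it assumes the DGX nodes answer to the a100 and a100l tags.

```shell
# Largest GPU count on any node carrying this type, per the inventory table.
max_gpus_per_node() {
    case "$1" in
        l40s) echo 4 ;;
        rtx8000|v100|h100|a6000) echo 8 ;;
        a100|a100l) echo 8 ;;   # assumption: DGX nodes carry these tags; other A100 nodes have 4
        *) return 1 ;;
    esac
}

# Succeeds when `count` GPUs of type `gpu` can fit on a single node.
check_gpu_request() {
    local gpu=$1 count=$2 max
    max=$(max_gpus_per_node "$gpu") || { echo "unknown GPU type: $gpu" >&2; return 2; }
    if [ "$count" -gt "$max" ]; then
        echo "too many: $count x $gpu requested, nodes have at most $max" >&2
        return 1
    fi
    echo "ok: $count x $gpu fits on one node"
}

check_gpu_request l40s 4
```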

Partitions & Preemption

| Partition | Time Limit | Per-User Limits |
|---|---|---|
| long (default) | 7 days | No per-user GPU cap |
| main | 5 days | 2 GPUs, 8 CPUs, 48GB |
| short | 3 hours | 4 GPUs, 1TB mem |
| unkillable | 2 days | 1 GPU, 6 CPUs, 32GB |

Preemption hierarchy: unkillable > main > long. A preempted job is killed and automatically requeued. main jobs do NOT preempt other main jobs. The -grace partition variants send SIGTERM and allow a grace period before the kill. Checkpoint frequently on the long partition.
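
Because preempted jobs are requeued, a job script can save progress and resume from it. The sketch below assumes the grace-period SIGTERM reaches the batch shell; the checkpoint path and the counting loop are hypothetical stand-ins for real training state.

```shell
# Hypothetical checkpoint file; on the cluster this would live under $SCRATCH.
CKPT_FILE="${CKPT_FILE:-${TMPDIR:-/tmp}/example.ckpt}"
STOP_REQUESTED=0

# The handler only sets a flag; the loop saves and exits at a safe point.
request_stop() { STOP_REQUESTED=1; }
trap request_stop TERM

save_checkpoint() { echo "step=$1" > "$CKPT_FILE"; }

run_steps() {
    local total=$1 step=0
    # Resume from the last saved step after a requeue.
    if [ -f "$CKPT_FILE" ]; then
        step=$(sed 's/^step=//' "$CKPT_FILE")
    fi
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))          # stand-in for one unit of real work
        save_checkpoint "$step"
        if [ "$STOP_REQUESTED" -eq 1 ]; then
            echo "stopping at step $step"
            return 0
        fi
    done
    echo "finished at step $step"
}
```

Bash runs a trap only after the current foreground command returns, so a long-running Python process would need to be started in the background and waited on, or handle the signal itself.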

Storage

| Path | Quota | Key Policy |
|---|---|---|
| $HOME | 100GB / 1M files | Daily backup, low I/O — don't write logs here |
| $SCRATCH | 5TB / unlimited | Files unused >90 days deleted |
| $SLURM_TMPDIR | No quota | Fastest I/O, cleared after job |
| /network/projects/<group>/ | 1TB / 1M files | Shared project storage |
| $ARCHIVE | 5TB | No backup, not on GPU nodes |

Always copy data to $SLURM_TMPDIR at job start for performance. Write logs/outputs to $SCRATCH, not $HOME. Check usage with disk-quota.
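
The staging advice above could look like this at the top of a job script. The helper names and the runs/ layout are hypothetical, and the fallbacks to $TMPDIR and the current directory exist only so the sketch can be tried outside a job.

```shell
# Copy a dataset to node-local storage and print the staged path.
# Falls back to $TMPDIR or /tmp when run outside a SLURM job.
stage_data() {
    local src=$1
    local dest="${SLURM_TMPDIR:-${TMPDIR:-/tmp}}/data"
    mkdir -p "$dest"
    cp -r "$src"/. "$dest"/
    echo "$dest"
}

# Per-job output directory under $SCRATCH (hypothetical runs/ layout).
job_output_dir() {
    echo "${SCRATCH:-.}/runs/${SLURM_JOB_ID:-local}"
}
```

In a job script this might be used as `DATA_DIR=$(stage_data "$SCRATCH/datasets/example")`, where the dataset path is a placeholder. Anything written under $SLURM_TMPDIR has to be copied out before the job ends, since it is cleared afterwards.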

Module System

  • module load python/3.10 — required before creating venvs on cluster
  • module load miniconda/3 — for conda environments
  • module avail / module spider <term> — search available modules
  • Pre-built PyTorch/TF modules exist for Mila GPUs
  • On login/CPU nodes without GPUs: set CONDA_OVERRIDE_CUDA=11.8 before running conda commands
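
A sketch of how the module and conda notes above might be wrapped in a setup snippet. The function names are hypothetical, and GPU detection via `nvidia-smi -L` is an assumption about how to tell a GPU node from a login node.

```shell
# True when a working NVIDIA driver and at least one GPU are visible.
has_gpu() {
    command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1
}

# Load the Python module needed before creating a venv on the cluster.
load_python_module() {
    if command -v module >/dev/null 2>&1; then
        module load python/3.10
    else
        echo "module command not available in this shell" >&2
        return 1
    fi
}

# On nodes without GPUs, tell conda which CUDA version to assume.
set_conda_cuda_override() {
    if ! has_gpu; then
        export CONDA_OVERRIDE_CUDA=11.8
    fi
}

set_conda_cuda_override
```

A typical sequence would be `load_python_module` followed by `python -m venv <path>`, or `module load miniconda/3` followed by the conda commands.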

Debugging Failed Jobs

  • Check .err files first — experiment logs go to stderr
  • sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,Elapsed,NodeList for completed jobs
  • Common issues: OOM (check MaxRSS), time limit, bad path, missing module/env
  • For OOM: check batch size, model size, gradient accumulation, and whether --mem was sufficient
  • torch.autograd.set_detect_anomaly(True) causes extreme filesystem IOPS — never leave on in batch jobs, admins will flag it
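
The first triage pass over sacct output can be scripted. This sketch maps SLURM job states to the checks listed above; the function names and hint wording are illustrative, and it expects pipe-separated rows as produced by sacct's -n -P options.

```shell
# Map a sacct State value to a first thing to check.
diagnose_state() {
    case "$1" in
        COMPLETED*)     echo "completed normally" ;;
        OUT_OF_MEMORY*) echo "out of memory: compare MaxRSS with --mem, then batch and model size" ;;
        TIMEOUT*)       echo "hit the time limit: raise --time or checkpoint and resume" ;;
        PREEMPTED*)     echo "preempted: confirm the job requeued and resumed from a checkpoint" ;;
        CANCELLED*)     echo "cancelled: by the user or an administrator" ;;
        FAILED*)        echo "non-zero exit: read the .err file for a bad path or missing module/env" ;;
        *)              echo "unrecognized state: $1" ;;
    esac
}

# Read JobID|State|ExitCode|MaxRSS rows from stdin and annotate each one.
diagnose_jobs() {
    while IFS='|' read -r jobid state exitcode maxrss; do
        echo "$jobid [$state, exit $exitcode, MaxRSS ${maxrss:-n/a}]: $(diagnose_state "$state")"
    done
}
```

Usage on the cluster: `sacct -j <jobid> -n -P --format=JobID,State,ExitCode,MaxRSS | diagnose_jobs`.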

Monitoring

  • disk-quota — check storage usage
  • squeue -u $USER — your active jobs
  • echo $SLURM_JOB_GPUS — which GPU(s) your job got
  • Netdata per-node: <node>.server.mila.quebec:19999 (requires Mila wifi or SSH tunnel)
  • Grafana dashboard: dashboard.server.mila.quebec
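
For a quick overview of a busy queue, the squeue output can be condensed into per-state counts. A minimal sketch; count_job_states is a hypothetical helper that reads one state per line.

```shell
# Count jobs per state from one-state-per-line input on stdin.
# Output is sorted by state name, one STATE=count pair per line.
count_job_states() {
    sort | uniq -c | awk '{ printf "%s=%s\n", $2, $1 }'
}
```

Usage on the cluster: `squeue -u "$USER" -h -o %T | count_job_states`, where -h drops the header and %T prints each job's state.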

Limits

  • Max 1000 jobs per user in the system at any time
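
Before a large sweep, the remaining headroom under this limit can be computed from the current job count. A sketch with hypothetical names; whether the limit counts each array task separately is an assumption to verify.

```shell
MAX_JOBS_PER_USER=1000

# How many more jobs can be submitted, given the current count in the system.
jobs_headroom() {
    local current=$1
    local left=$(( MAX_JOBS_PER_USER - current ))
    if [ "$left" -lt 0 ]; then
        left=0
    fi
    echo "$left"
}
```

Usage on the cluster: `jobs_headroom "$(squeue -u "$USER" -h -r | wc -l)"`, where -r lists each array task on its own line.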

Safety

  • Never submit jobs (sbatch) without explicit user confirmation
  • Verify paths and configs before submission
  • Test on small instances first when possible
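
One way to enforce the confirmation rule is a wrapper that refuses to submit without an explicit yes. This is a sketch: confirm_submit and the SUBMIT_CMD override are hypothetical names, and the override exists so the flow can be rehearsed without a scheduler.

```shell
# Ask for explicit confirmation before submitting a job script.
# SUBMIT_CMD can be overridden (e.g. with `echo`) for a dry run.
confirm_submit() {
    local script=$1 answer
    if [ ! -f "$script" ]; then
        echo "no such script: $script" >&2
        return 2
    fi
    printf 'Submit %s with sbatch? [y/N] ' "$script" >&2
    read -r answer || answer=""
    case "$answer" in
        y|Y|yes|YES) ${SUBMIT_CMD:-sbatch} "$script" ;;
        *) echo "not submitted" >&2; return 1 ;;
    esac
}
```

Setting `SUBMIT_CMD="sbatch --test-only"` asks SLURM to validate the script and estimate a start time without actually queuing the job.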

Scope

$ARGUMENTS