GPU Container Setup Skill

This skill automates multi-vendor GPU container setup for PyTorch workloads.

Supported GPU Vendors

| Vendor | PyTorch Backend | Detection |
|--------|-----------------|-----------|
| NVIDIA | CUDA | nvidia-smi |
| AMD | ROCm (HIP) | rocm-smi, /opt/rocm |
| Ascend | torch_npu | npu-smi, /usr/local/Ascend |
| Metax | torch_musa | mx-smi, /opt/metax |
| Iluvatar | torch_corex | ixsmi, /opt/iluvatar |

Execution Flow

When invoked, follow these steps:

Step 1: Parse Arguments

Check if user provided:

--vendor <name> - Force specific vendor (skip detection)
--image <image> - Force specific container image
--data <path> - Force specific data mount path
--name <name> - Container name (default: pytorch-gpu)
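
The four flags can be sketched with Python's standard argparse module. This is only an illustration of the argument contract, not the skill's actual parser; the function name `parse_skill_args` is hypothetical, and `None` is assumed here to mean "decide later by detection".

```python
import argparse


def parse_skill_args(argv):
    """Parse the skill's optional flags; None means 'not forced by the user'."""
    parser = argparse.ArgumentParser(prog="gpu-container-setup")
    parser.add_argument("--vendor", default=None,
                        help="force a specific vendor and skip detection")
    parser.add_argument("--image", default=None,
                        help="force a specific container image")
    parser.add_argument("--data", default=None,
                        help="force a specific data mount path")
    parser.add_argument("--name", default="pytorch-gpu",
                        help="container name")
    return parser.parse_args(argv)


args = parse_skill_args(["--vendor", "ascend", "--data", "/mnt/data"])
print(args.vendor, args.image, args.data, args.name)
# prints: ascend None /mnt/data pytorch-gpu
```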

Step 2: Detect GPU Vendor

Run the detection script:

python3 .claude/skills/gpu-container-setup/scripts/detect_gpu.py

Expected output:

{"vendor": "ascend", "devices": ["Ascend 910B"], "count": 8}

If detection fails and no --vendor flag was provided, ask the user which vendor to use.
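
A minimal sketch of how the detection result and the --vendor override might be combined. The function name `resolve_vendor` is hypothetical, and treating unparseable output, a missing vendor, or a zero device count as "detection failed" is an assumption; the source does not say how detect_gpu.py signals failure.

```python
import json


def resolve_vendor(detect_stdout, forced_vendor=None):
    """Return the vendor to use, or None when the user has to be asked."""
    if forced_vendor:
        # --vendor skips detection entirely.
        return forced_vendor.lower()
    try:
        info = json.loads(detect_stdout)
    except (json.JSONDecodeError, TypeError):
        return None  # assumed failure mode: output is not valid JSON
    if not isinstance(info, dict):
        return None
    vendor = info.get("vendor")
    if not vendor or not info.get("count"):
        return None  # assumed failure mode: no vendor or no devices
    return vendor


sample = '{"vendor": "ascend", "devices": ["Ascend 910B"], "count": 8}'
print(resolve_vendor(sample))  # prints: ascend
```

A return value of `None` corresponds to the "ask user" branch described above.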

Step 3: Find Data Disk

Run the data disk detection:

python3 .claude/skills/gpu-container-setup/scripts/find_data_disk.py

Expected output:

{"data_disk": "/mnt/data", "found": true, "size": "2.0T", "available": "1.5T"}

If no suitable disk is found, ask the user for a data mount path.
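
The same decision for the data path could look roughly like this. `choose_data_path` is a hypothetical helper; it assumes the `found` and `data_disk` keys shown in the expected output are always present on success.

```python
import json


def choose_data_path(find_disk_stdout, forced_path=None):
    """Return the host path to mount at /data, or None to ask the user."""
    if forced_path:
        return forced_path  # --data wins over detection
    try:
        info = json.loads(find_disk_stdout)
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(info, dict) and info.get("found") and info.get("data_disk"):
        return info["data_disk"]
    return None


sample = '{"data_disk": "/mnt/data", "found": true, "size": "2.0T", "available": "1.5T"}'
print(choose_data_path(sample))  # prints: /mnt/data
```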

Step 4: Find Container Image

Follow this strict priority order, moving to the next source only if the current one fails:

1. Primary Vendor Hub (hardcoded) → 2. BAAI Harbor → 3. Web Search → 4. Local Images → 5. Ask User
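
The priority order amounts to a first-success fallback chain. The sketch below shows that control flow only: each source is a hypothetical zero-argument lookup callable, and `works` stands in for the pull-and-import test. None of these names come from the skill itself.

```python
from typing import Callable, Iterable, Optional, Tuple

Source = Tuple[str, Callable[[], Optional[str]]]


def find_image(sources: Iterable[Source],
               works: Callable[[str], bool]) -> Optional[Tuple[str, str]]:
    """Try sources in order; return (source_name, image) for the first usable image."""
    for name, lookup in sources:
        try:
            image = lookup()
        except Exception:
            image = None  # e.g. registry unreachable: treat as a miss
        if image and works(image):
            return name, image
    return None  # every source failed: the caller asks the user


# Stub sources for illustration; the image names are made up.
sources = [
    ("vendor_hub", lambda: None),
    ("baai_harbor", lambda: "harbor.example/img:1"),
    ("web_search", lambda: "never-reached"),
]
print(find_image(sources, works=lambda image: True))
# prints: ('baai_harbor', 'harbor.example/img:1')
```

Later sources are never queried once an earlier one yields a working image, which is what "strict priority" requires.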

Step 4.1: Primary Vendor Hub (hardcoded URLs)

| Vendor | Registry | API/Query |
|--------|----------|-----------|
| NVIDIA | nvcr.io | https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags |
| Ascend | ascendhub.huawei.com | Portal: https://ascendhub.huawei.com |
| Metax | registry.metax-tech.com | https://registry.metax-tech.com/v2/pytorch/metax-pytorch/tags/list |
| Iluvatar | hub.iluvatar.com | https://hub.iluvatar.com/v2/pytorch/iluvatar-pytorch/tags/list |
| AMD | docker.io (rocm/pytorch) | https://hub.docker.com/v2/repositories/rocm/pytorch/tags |

# Example: Query NGC for latest NVIDIA PyTorch
TAG=$(curl -s "https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags" | jq -r '.tags[].name' | grep -E '^[0-9]{2}\.[0-9]{2}-py3$' | sort -rV | head -1)
IMAGE="nvcr.io/nvidia/pytorch:${TAG}"
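
For reference, the tag selection done by the grep and sort stages above can be expressed in Python as follows. This is a sketch of the selection logic only (keep YY.MM-py3 tags, take the newest); the function name is hypothetical and the sample tags are illustrative rather than a real registry listing.

```python
import re

# Matches release tags such as "24.01-py3" and nothing with extra suffixes.
TAG_PATTERN = re.compile(r"^(\d{2})\.(\d{2})-py3$")


def latest_release_tag(tags):
    """Return the newest YY.MM-py3 tag, or None if no tag matches."""
    candidates = []
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if match:
            year, month = int(match.group(1)), int(match.group(2))
            candidates.append(((year, month), tag))
    if not candidates:
        return None
    return max(candidates)[1]


tags = ["23.12-py3", "24.01-py3", "24.01-py3-igpu", "latest"]
print(latest_release_tag(tags))  # prints: 24.01-py3
```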

Step 4.2: BAAI Harbor (fallback)

Use this only if Step 4.1 fails (registry unreachable, no matching image, or pull fails).

# Query BAAI Harbor
curl -s "https://harbor.baai.ac.cn/api/v2.0/projects/flagrelease-public/repositories?page_size=100" | jq -r '.[].name' | grep "flagrelease-<vendor>"

Step 4.3: Web Search (fallback)

Use this only if Steps 4.1 and 4.2 fail. Search for "<vendor> pytorch docker official".

Step 4.4: Local Images (fallback)

Use this only if Steps 4.1-4.3 fail. Check the output of docker images | grep pytorch.

Test Before Use

docker pull "${IMAGE}" && docker run --rm "${IMAGE}" python -c "import torch; print(torch.__version__)"

If the test fails, try the next source. If all sources fail, ask the user for an image.
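
If the smoke test needs to be scripted, one possible shape is shown below. It wraps the same two docker commands in subprocess calls; the helper names are hypothetical, and the `run` parameter exists only so the logic can be exercised without Docker installed.

```python
import subprocess


def smoke_test_commands(image):
    """The pull and import-torch commands from the test above, as argv lists."""
    return [
        ["docker", "pull", image],
        ["docker", "run", "--rm", image,
         "python", "-c", "import torch; print(torch.__version__)"],
    ]


def image_works(image, run=subprocess.run):
    """Return True only if every smoke-test command exits with status 0."""
    for cmd in smoke_test_commands(image):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return False  # stop at the first failure, like `&&` in the shell
    return True
```

Calling `image_works("nvcr.io/nvidia/pytorch:24.01-py3")` would execute both commands for real; a `False` result corresponds to moving on to the next source.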

Step 4.5: Update Skill (self-improvement)

IMPORTANT: If an image found via Web Search (Step 4.3) passes all tests, update references/image-sources.md to add the newly discovered vendor hub as a primary source. This makes future lookups faster.

# After successful web search discovery:
# 1. Verify image works (pull + pytorch test + GPU test)
# 2. Extract registry URL pattern
# 3. Update references/image-sources.md Step 1 section with new vendor hub

Step 5: Build Docker Command

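Refer to references/mount-requirements.md for vendor-specific requirements. In the commands below, replace <data_disk> and <image> with the values from Steps 3 and 4, and replace pytorch-gpu with the --name value if one was provided.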

NVIDIA:

docker run -d --gpus all \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity

AMD/ROCm:

docker run -d \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity

Ascend:

docker run -d \
  --device=/dev/davinci0 --device=/dev/davinci1 ... \
  --device=/dev/davinci_manager \
  --device=/dev/devmm_svm \
  --device=/dev/hisi_hdc \
  -v /usr/local/Ascend:/usr/local/Ascend:ro \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity

Metax:

docker run -d \
  --device=/dev/mx0 --device=/dev/mx1 ... \
  -v /opt/metax:/opt/metax:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity

Iluvatar:

docker run -d \
  --device=/dev/bi0 --device=/dev/bi1 ... \
  -v /opt/iluvatar:/opt/iluvatar:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
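
The five templates share a common tail and differ only in their device and mount flags, so they can be assembled from a small table. The sketch below mirrors the commands above; the lowercase vendor keys and the `build_docker_run` name are assumptions, and the caller is expected to supply the per-device nodes (the `...` in the templates) itself.

```python
# Fixed, vendor-specific flags copied from the command templates.
VENDOR_FLAGS = {
    "nvidia": ["--gpus", "all"],
    "amd": ["--device=/dev/kfd", "--device=/dev/dri",
            "--group-add", "video", "--group-add", "render"],
    "ascend": ["--device=/dev/davinci_manager", "--device=/dev/devmm_svm",
               "--device=/dev/hisi_hdc",
               "-v", "/usr/local/Ascend:/usr/local/Ascend:ro",
               "-v", "/usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro"],
    "metax": ["-v", "/opt/metax:/opt/metax:ro"],
    "iluvatar": ["-v", "/opt/iluvatar:/opt/iluvatar:ro"],
}
# Vendors whose templates list one --device flag per accelerator node.
PER_DEVICE_VENDORS = {"ascend", "metax", "iluvatar"}


def build_docker_run(vendor, image, data_disk, name="pytorch-gpu", device_nodes=()):
    """Return the docker run argv for one vendor."""
    if vendor not in VENDOR_FLAGS:
        raise ValueError(f"unsupported vendor: {vendor}")
    cmd = ["docker", "run", "-d"]
    if vendor in PER_DEVICE_VENDORS:
        cmd += [f"--device={node}" for node in device_nodes]
    cmd += VENDOR_FLAGS[vendor]
    cmd += ["--name", name, "--shm-size=16g",
            "-v", f"{data_disk}:/data", image, "sleep", "infinity"]
    return cmd


print(" ".join(build_docker_run("nvidia", "<image>", "<data_disk>")))
# prints: docker run -d --gpus all --name pytorch-gpu --shm-size=16g -v <data_disk>:/data <image> sleep infinity
```

Returning an argv list rather than a shell string avoids quoting problems when the data path contains spaces.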

Step 6: Start Container

Execute the docker run command. If a container with the same name already exists:

If it is running, offer to use the existing container or replace it.
If it is stopped, offer to restart or replace it.
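
The name-collision rule can be written as a small decision function. In this sketch the `state` argument is assumed to be the status string reported by `docker inspect -f '{{.State.Status}}' <name>`, or `None` when no such container exists; treating every non-running status as "stopped" is a simplification.

```python
def existing_container_options(state):
    """Map a container's status to the choices offered to the user."""
    if state is None:
        return ["create"]                  # no name collision
    if state == "running":
        return ["use existing", "replace"]
    return ["restart", "replace"]          # exited, created, paused, ...


print(existing_container_options("running"))  # prints: ['use existing', 'replace']
print(existing_container_options("exited"))   # prints: ['restart', 'replace']
```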

Step 7: Validate PyTorch GPU

Copy the validation script into the container and run it:

docker cp .claude/skills/gpu-container-setup/scripts/validate_pytorch.py pytorch-gpu:/tmp/
docker exec pytorch-gpu python3 /tmp/validate_pytorch.py

Expected output:

{
  "status": "PASS",
  "backend": "npu",
  "device_count": 8,
  "device_names": ["Ascend 910B", ...],
  "tests": {
    "device_detection": true,
    "tensor_creation": true,
    "matrix_multiply": true,
    "gpu_to_cpu_transfer": true
  }
}
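
A short sketch of how the validation report might be checked once parsed. It assumes the report has the `status` and `tests` keys shown above; `failed_checks` is a hypothetical helper, not part of validate_pytorch.py.

```python
def failed_checks(report):
    """Return the names of failing tests; an empty list means validation passed."""
    failures = [name for name, ok in report.get("tests", {}).items() if not ok]
    if report.get("status") != "PASS" and not failures:
        failures.append("status")  # non-PASS status without a specific failing test
    return failures


report = {
    "status": "FAIL",
    "backend": "npu",
    "device_count": 8,
    "tests": {
        "device_detection": True,
        "tensor_creation": True,
        "matrix_multiply": False,
        "gpu_to_cpu_transfer": True,
    },
}
print(failed_checks(report))  # prints: ['matrix_multiply']
```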

Step 8: Report Results

Summarize to user:

GPU vendor and devices detected
Container name and image used
Data mount path
Validation status
How to access: docker exec -it pytorch-gpu bash

Error Handling

| Error | Action |
|-------|--------|
| No GPU detected | Ask user for vendor or check drivers |
| Image pull fails | Try alternative registry or web search |
| Container start fails | Check device permissions, show error |
| Validation fails | Show detailed error, suggest fixes |

Reference Files

references/gpu-detection.md - Detection methods by vendor
references/image-sources.md - Image discovery guide (registry APIs, priority order, selection criteria)
references/mount-requirements.md - Vendor mount specifications

Example Usage

User: /gpu-container-setup
User: setup a pytorch container
User: start container with ascend GPU
User: /gpu-container-setup --image nvcr.io/nvidia/pytorch:24.01-py3
User: /gpu-container-setup --image harbor.baai.ac.cn/flagrelease-public/ngctorch:2601
