Install Multi-Chip Software Stack

Install 5 packages inside a running GPU container, in dependency order, with per-package validation and structured error reporting.

Skill Components

install-stack/
├── SKILL.md                           # This file — execution flow
├── scripts/
│   ├── detect_network.py              # Probe GitHub/PyPI, return mirror config (JSON)
│   ├── collect_env_info.py            # Python/glibc/arch/vendor/disk info (JSON)
│   ├── select_flagtree_wheel.py       # Match vendor+python+glibc → wheel specifier (JSON)
│   └── validate_packages.py           # Import-test all 5 packages, report status (JSON)
└── references/
    ├── vendor-mappings.md             # FlagCX make flags, adaptor names, dependency chain
    └── network-mirrors.md             # GitHub/PyPI mirror config rules

Prerequisites

A running Docker container with PyTorch + GPU access (from /gpu-container-setup)
Know the container name and GPU vendor

If invoked standalone, ask the user for container name and GPU vendor. If invoked from /flagrelease orchestrator, these are passed as context.

Execution Flow

Step 0: Resolve Container & Vendor

Verify the container is running:

docker inspect --format='{{.State.Status}}' <CONTAINER> | grep -q running

Copy and run scripts/collect_env_info.py inside the container to get vendor, Python version, glibc version, architecture, and free disk space:

docker cp <SKILL_DIR>/scripts/collect_env_info.py <CONTAINER>:/tmp/
docker exec <CONTAINER> python3 /tmp/collect_env_info.py

If vendor is unknown and user didn't provide --vendor, ask the user.

Step 1: Detect Network Environment

Copy and run scripts/detect_network.py inside the container:

docker cp <SKILL_DIR>/scripts/detect_network.py <CONTAINER>:/tmp/
docker exec <CONTAINER> python3 /tmp/detect_network.py

Parse the JSON output to get GITHUB_PREFIX and PIP_INDEX for all subsequent commands. See references/network-mirrors.md for fallback rules.

Step 2: Check Disk Space

From collect_env_info.py output, verify at least 10GB free. If not, warn user and ask whether to proceed.

Step 3: Install Packages (in dependency order)

See references/vendor-mappings.md for dependency chain and install order: vLLM → FlagTree → FlagGems → FlagCX → vllm-plugin-FL

3.1: vLLM 0.13.0

docker exec <CONTAINER> pip install ${PIP_INDEX} vllm==0.13.0

Quick validate:

docker exec <CONTAINER> python3 -c "import vllm; assert vllm.__version__ == '0.13.0'"

GATE: If vLLM install fails → record error and EXIT the skill.

3.2: FlagTree (pre-compiled wheel)

Run scripts/select_flagtree_wheel.py to find the correct wheel:

python3 <SKILL_DIR>/scripts/select_flagtree_wheel.py \
    --vendor <VENDOR> --python <PY_VER> --glibc <GLIBC_VER>

If status is FOUND, uninstall stock triton and install the wheel:

docker exec <CONTAINER> bash -c '
python3 -m pip uninstall -y triton
python3 -m pip uninstall -y triton
python3 -m pip install <SPECIFIER> <PIP_ARGS>
'

If status is NOT_FOUND, record the mismatch and continue (do not exit).

3.3: FlagGems

docker exec <CONTAINER> bash -c "
cd /tmp && git clone ${GITHUB_PREFIX}/FlagOpen/FlagGems
cd FlagGems && pip install ${PIP_INDEX} -e .
"

Failure → record and continue.

3.4: FlagCX (two-phase build)

Read references/vendor-mappings.md to look up the correct Make flag and FLAGCX_ADAPTOR for the detected vendor.

Phase 1: Build C++ library:

docker exec <CONTAINER> bash -c "
cd /tmp && git clone ${GITHUB_PREFIX}/flagos-ai/FlagCX
cd FlagCX && git submodule update --init --recursive
make <MAKE_FLAG> -j\$(nproc)
"

Phase 2: Install PyTorch plugin:

docker exec <CONTAINER> bash -c "
cd /tmp/FlagCX/plugin/torch
FLAGCX_ADAPTOR=<ADAPTOR> pip install -e . --no-build-isolation
"

Failure → record and continue.

3.5: vllm-plugin-FL

docker exec <CONTAINER> bash -c "
cd /tmp && git clone ${GITHUB_PREFIX}/flagos-ai/vllm-plugin-FL
cd vllm-plugin-FL
pip install ${PIP_INDEX} -r requirements.txt
pip install --no-build-isolation -e .
"

On Iluvatar, if requirements.txt fails, retry with requirements_iluvatar.txt.

GATE: If vllm-plugin-FL fails → record error and EXIT the skill.

Step 4: Validate All Packages

Copy and run scripts/validate_packages.py inside the container:

docker cp <SKILL_DIR>/scripts/validate_packages.py <CONTAINER>:/tmp/
docker exec <CONTAINER> python3 /tmp/validate_packages.py

This produces a comprehensive JSON report of all 5 packages with import status, versions, and gate check.

Step 5: Set Runtime Environment

If FlagCX installed successfully, persist FLAGCX_PATH:

docker exec <CONTAINER> bash -c "echo 'export FLAGCX_PATH=/tmp/FlagCX' >> ~/.bashrc"

Step 6: Produce Final Report

Combine all results into structured output:

{
  "status": "PASS | PARTIAL | FAIL",
  "stage": "install-stack",
  "container": "<name>",
  "vendor": "<vendor>",
  "network": {"github_mirror": true, "pypi_mirror": true},
  "python_version": "3.11",
  "glibc_version": "2.34",
  "packages": {
    "vllm": {"status": "PASS", "version": "0.13.0"},
    "flagtree": {"status": "PASS", "version": "0.4.1+ascend3.2"},
    "flaggems": {"status": "PASS", "version": "..."},
    "flagcx": {"status": "PASS", "version": "0.10.0"},
    "vllm_plugin_fl": {"status": "PASS", "version": "..."}
  },
  "flagcx_path": "/tmp/FlagCX",
  "gate_passed": true,
  "errors": []
}

Status logic:

PASS — all 5 packages installed and validated
PARTIAL — vLLM + plugin installed, some of FlagTree/FlagGems/FlagCX failed
FAIL — vLLM or plugin failed (gate failed)

Error Handling

| Failure | Behavior | |---------|----------| | Container not running | Report error, exit | | Network unreachable (both direct and mirror) | Report, exit | | pip install fails | Report package, full error, continue (unless gate) | | Build from source fails | Report compiler error, continue | | No matching FlagTree wheel | Report vendor/python/glibc mismatch, continue | | Import fails after install | Report traceback, continue | | Disk space < 10GB | Warn user, ask whether to proceed | | Timeout (any command) | All commands use timeout; report which step |

Timeout Rules

| Operation | Timeout | |-----------|---------| | pip install (per package) | 300s | | git clone | 120s | | make (FlagCX) | 300s | | Network probe | 5s | | Validation (import) | 30s |