LigandMPNN Ligand-Aware Design

Prerequisites

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| Python | 3.8+ | 3.10 |
| CUDA | 11.0+ | 11.7+ |
| GPU VRAM | 8GB | 16GB (T4) |
| RAM | 8GB | 16GB |

How to run

First time? See Getting started to set up Modal and biomodals.

Option 1: Modal (recommended)

cd biomodals
# modal_ligandmpnn.py takes --input-pdb; LigandMPNN run.py args go in --params-str
modal run modal_ligandmpnn.py \
  --input-pdb protein_ligand.pdb \
  --params-str "--model_type ligand_mpnn --number_of_batches 16 --temperature 0.1"

GPU: A10G default | Timeout: 900s default

Option 2: Local installation

git clone https://github.com/dauparas/LigandMPNN.git
cd LigandMPNN
bash get_model_params.sh "./model_params"  # one-time download of model weights

python run.py \
  --model_type ligand_mpnn \
  --pdb_path protein_ligand.pdb \
  --out_folder output/ \
  --number_of_batches 16 \
  --temperature 0.1

Key parameters (LigandMPNN run.py)

| Parameter | Default | Description |
|-----------|---------|-------------|
| --pdb_path | required | PDB with ligand |
| --model_type | protein_mpnn | ligand_mpnn, soluble_mpnn, etc. |
| --temperature | 0.1 | Sampling temperature |
| --number_of_batches | 1 | Number of batches (total sequences = batch_size × number_of_batches) |
| --batch_size | 1 | Sequences per batch |
| --ligand_mpnn_use_side_chain_context | 0 | Set to 1 to use side-chain atoms of fixed residues as additional context |
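The relationship between the two batch flags and the number of sequences produced can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper `build_run_args` (not part of LigandMPNN) that only uses flags listed in the table above.

```python
import shlex


def build_run_args(pdb_path, out_folder, batch_size=1, number_of_batches=1,
                   temperature=0.1, model_type="ligand_mpnn"):
    """Return (argument list, total sequences) for one run.py call.

    Hypothetical helper for illustration; it only assembles flags that
    appear in the parameter table and does not run anything.
    """
    args = [
        "python", "run.py",
        "--model_type", model_type,
        "--pdb_path", pdb_path,
        "--out_folder", out_folder,
        "--batch_size", str(batch_size),
        "--number_of_batches", str(number_of_batches),
        "--temperature", str(temperature),
    ]
    # Total sequences per input structure = batch_size x number_of_batches.
    total = batch_size * number_of_batches
    return args, total


args, total = build_run_args("protein_ligand.pdb", "output/",
                             batch_size=4, number_of_batches=4)
print(shlex.join(args))  # shell-ready command line
print(total)             # 16
```

The argument list can be passed to `subprocess.run(args, check=True)` from inside the LigandMPNN checkout if launching from Python is preferred.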

Ligand Specification

In PDB File

Ligand must be present as HETATM records:

ATOM    ...protein atoms...
HETATM  1  C1  LIG A 999      x.xxx  y.yyy  z.zzz  1.00  0.00           C
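Before running a design, it can help to confirm that the ligand really is present as HETATM records. The sketch below is not part of LigandMPNN; `list_hetatm_residues` is a hypothetical helper that assumes standard fixed-width PDB columns and treats `HOH`/`WAT` as water to skip.

```python
from collections import Counter


def list_hetatm_residues(pdb_text, skip=("HOH", "WAT")):
    """Count HETATM atoms per (residue name, chain, residue number)."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        resname = line[17:20].strip()  # PDB columns 18-20
        chain = line[21]               # PDB column 22
        resnum = line[22:26].strip()   # PDB columns 23-26
        if resname in skip:
            continue  # ignore water molecules
        counts[(resname, chain, resnum)] += 1
    return counts


# Toy input with made-up coordinates, for illustration only.
example = "\n".join([
    "ATOM      1  CA  ALA A   1      10.000  10.000  10.000  1.00  0.00           C",
    "HETATM    2  C1  LIG A 999      11.000  12.000  13.000  1.00  0.00           C",
    "HETATM    3  C2  LIG A 999      11.500  12.500  13.500  1.00  0.00           C",
    "HETATM    4  O   HOH A1001      20.000  20.000  20.000  1.00  0.00           O",
])
found = list_hetatm_residues(example)
print(dict(found))  # {('LIG', 'A', '999'): 2}
```

An empty result means the file has no non-water HETATM records, which is the first thing to check when a ligand is not recognized.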

Supported Ligand Types

Small molecules (HETATM)
Metals (Zn, Fe, Mg, Ca, etc.)
Cofactors (NAD, FAD, ATP)
DNA/RNA

Output format

output/
├── seqs/
│   └── protein.fa          # FASTA sequences
└── backbones/
    └── protein_1.pdb       # PDBs with designed sequence

Sample output

Successful run

$ python run.py --model_type ligand_mpnn --pdb_path enzyme_substrate.pdb --out_folder output/ --batch_size 8
Loading LigandMPNN model weights...
Processing enzyme_substrate.pdb
Found ligand: LIG (12 atoms)
Generated 8 sequences in 3.1 seconds

output/seqs/enzyme_substrate.fa:
>enzyme_substrate_0001, score=1.45, global_score=1.38
MKTAYIAKQRQISFVKSHFSRQLE...
>enzyme_substrate_0002, score=1.52, global_score=1.41
MKTAYIAKQRQISFVKSQFSRQLD...
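To work with the FASTA output programmatically, the headers can be split into name and key=value fields. This is a minimal sketch that assumes headers follow the `>name, key=value, ...` pattern shown in the sample; `parse_design_fasta` is a hypothetical helper and the input is toy data.

```python
def parse_design_fasta(text):
    """Parse FASTA records whose headers look like '>name, key=value, ...'."""
    records = []
    for block in text.split(">")[1:]:
        header, *seq_lines = block.strip().splitlines()
        name, *fields = [part.strip() for part in header.split(",")]
        meta = {}
        for field in fields:
            key, sep, value = field.partition("=")
            if sep:  # keep only well-formed key=value fields
                meta[key] = value
        records.append({
            "name": name,
            "meta": meta,
            "sequence": "".join(seq_lines),
        })
    return records


# Toy records for illustration; sequences are made up.
toy_fasta = """>design_0001, score=1.45, global_score=1.38
MKTAYIAK
QRQISFVK
>design_0002, score=1.52, global_score=1.41
MKTAYIAKQRQISFVR
"""
records = parse_design_fasta(toy_fasta)
# Lower score is treated as better, matching the guidance in this document.
best = min(records, key=lambda r: float(r["meta"]["score"]))
print(best["name"], best["sequence"])
```

Values are kept as strings so that unexpected header fields do not raise; convert with `float()` only for the fields you rank on.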

What good output looks like:

Score: 1.0-2.0 (lower = more confident)
Ligand detected and incorporated in context
Active site residues preserved or optimized
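Whether active site residues were preserved can be checked by comparing each design to the input sequence at the positions of interest. The sketch below is a hypothetical helper, not a LigandMPNN feature; it assumes both sequences have equal length and that positions are 1-based indices into the sequence (not PDB residue numbers).

```python
def changed_positions(reference, design, positions):
    """Return {position: (ref_aa, design_aa)} for 1-based positions that differ."""
    if len(reference) != len(design):
        raise ValueError("sequences must have equal length")
    return {
        pos: (reference[pos - 1], design[pos - 1])
        for pos in positions
        if reference[pos - 1] != design[pos - 1]
    }


# Toy sequences for illustration only.
reference = "MKTAYIAKQR"
design = "MKTAYLAKQR"
changes = changed_positions(reference, design, positions=[3, 6])
print(changes)  # {6: ('I', 'L')}
```

An empty dictionary means every listed position kept its original residue.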

Decision tree

Should I use LigandMPNN?
│
├─ What's in your binding site?
│  ├─ Small molecule / ligand → LigandMPNN ✓
│  ├─ Metal ion (Zn, Fe, etc.) → LigandMPNN ✓
│  ├─ Cofactor (NAD, FAD, ATP) → LigandMPNN ✓
│  ├─ DNA/RNA → LigandMPNN ✓
│  └─ Nothing / protein only → Use ProteinMPNN
│
├─ What type of design?
│  ├─ Enzyme active site → LigandMPNN ✓
│  ├─ Metal binding site → LigandMPNN ✓
│  ├─ Protein-protein binder → Use ProteinMPNN
│  └─ De novo scaffold → Use ProteinMPNN
│
└─ Priority?
   ├─ Solubility/expression → Consider SolubleMPNN
   └─ Ligand context accuracy → LigandMPNN ✓
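The tree above can be condensed into a small lookup. This is a rough sketch under one stated assumption: when a ligand is present and solubility is also a priority, the tree does not say which wins, so this hypothetical `recommend_model` function lets ligand context take precedence.

```python
# Binding-site contents for which the tree recommends LigandMPNN.
LIGAND_CONTEXTS = {"small_molecule", "metal_ion", "cofactor", "nucleic_acid"}


def recommend_model(binding_site, prioritize_solubility=False):
    """Map binding-site contents to a suggested model name.

    Assumption (not stated in the tree): ligand context outranks
    the solubility priority when both apply.
    """
    if binding_site in LIGAND_CONTEXTS:
        return "LigandMPNN"
    if prioritize_solubility:
        return "SolubleMPNN"
    return "ProteinMPNN"


print(recommend_model("metal_ion"))                         # LigandMPNN
print(recommend_model("none"))                              # ProteinMPNN
print(recommend_model("none", prioritize_solubility=True))  # SolubleMPNN
```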

Typical performance

| Campaign Size | Time (T4) | Cost (Modal) | Notes |
|---------------|-----------|--------------|-------|
| 100 backbones × 8 seq | 15-20 min | ~$2 | Standard |
| 500 backbones × 8 seq | 1-1.5h | ~$8 | Large campaign |

Throughput: ~50-100 sequences/minute on T4 GPU.
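The throughput figure gives a quick way to size a campaign. A minimal sketch, assuming the 50-100 sequences/minute range quoted above and ignoring model loading and file I/O, so the result is best read as a lower bound on wall-clock time:

```python
def estimate_minutes(backbones, seqs_per_backbone, rate_low=50, rate_high=100):
    """Return (total sequences, fastest minutes, slowest minutes).

    rate_low and rate_high are sequences per minute, taken from the
    throughput range quoted in this section.
    """
    total = backbones * seqs_per_backbone
    return total, total / rate_high, total / rate_low


total, fastest, slowest = estimate_minutes(100, 8)
print(total, fastest, slowest)  # 800 8.0 16.0
```

For 500 backbones × 8 sequences the same arithmetic gives 4000 sequences and roughly 40 to 80 minutes of pure sampling time.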

Verify

grep -c "^>" output/seqs/*.fa  # Per-file record count; expect batch_size × number_of_batches designs per input (plus one if the file also lists the input sequence)
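The same check can be scripted so that only files with an unexpected count are reported. A minimal sketch: `count_records` and `files_with_unexpected_count` are hypothetical helpers, and the expected count is whatever you requested for your run. The demo writes two toy files to a temporary directory.

```python
import tempfile
from pathlib import Path


def count_records(fasta_dir):
    """Map each .fa file name to its number of '>' header lines."""
    counts = {}
    for path in sorted(Path(fasta_dir).glob("*.fa")):
        with path.open() as handle:
            counts[path.name] = sum(1 for line in handle if line.startswith(">"))
    return counts


def files_with_unexpected_count(counts, expected):
    """Return only the files whose record count differs from expected."""
    return {name: n for name, n in counts.items() if n != expected}


# Demo on toy files; point count_records at output/seqs for a real run.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.fa").write_text(">a_1\nMKT\n>a_2\nMKS\n")
    Path(tmp, "b.fa").write_text(">b_1\nMKT\n")
    counts = count_records(tmp)
    mismatched = files_with_unexpected_count(counts, expected=2)

print(counts)      # {'a.fa': 2, 'b.fa': 1}
print(mismatched)  # {'b.fa': 1}
```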

Troubleshooting

Ligand not recognized: Check HETATM format, verify ligand residue name
Poor binding residues: Increase sampling around active site
Missing contacts: Verify ligand coordinates in PDB

Error interpretation

| Error | Cause | Fix |
|-------|-------|-----|
| RuntimeError: CUDA out of memory | Long protein or large batch | Reduce batch_size |
| KeyError: 'LIG' | Ligand not found in PDB | Check HETATM records |
| ValueError: no ligand atoms | Empty ligand | Verify ligand has atoms in PDB |
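When reducing batch_size after an out-of-memory error, the number of batches can be raised to keep the same total number of sequences. A minimal sketch of that arithmetic, using a hypothetical helper `shrink_batch` (not part of LigandMPNN):

```python
def shrink_batch(batch_size, number_of_batches):
    """Halve batch_size and scale number_of_batches to keep the total.

    Returns (new_batch_size, new_number_of_batches).
    """
    if batch_size <= 1:
        raise ValueError("batch_size is already 1; cannot shrink further")
    total = batch_size * number_of_batches
    new_batch = batch_size // 2
    # Ceiling division so the total never drops below the original request.
    new_batches = -(-total // new_batch)
    return new_batch, new_batches


print(shrink_batch(8, 2))  # (4, 4): still 16 sequences
print(shrink_batch(3, 1))  # (1, 3): still 3 sequences
```

If batch_size is already 1, the remaining options are a GPU with more memory or a shorter input structure.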

Next: Structure prediction for validation → protein-qc for filtering.