Files
Reverso/docs/gpu_plan.md
Junior B. 07705a5884 GPU Phase 1: co-fold cofactors/metals (the binding-mode determinants)
Add metal/cofactor handling to the Boltz-2 YAML as CCD ligand entries -
the modes classical docking couldn't model:
- HDAC2 + catalytic Zn (vorinostat chelates it)
- PKR + FBP + Mg (allosteric activator + metal)
- hemoglobin + heme
Same cofactors present when co-folding negatives into a target (fair test).

build_boltz_yaml() gains a cofactor_ccds arg (emits `ligand: {ccd: ...}`
entries); TARGETS carries per-target cofactors; cofold()/main() thread them
through. Verified locally: YAML builds correctly with Zn / FBP+Mg.

Honest limitation noted: Hb's voxelotor site is at the tetramer centre and
covalent (Schiff base), so single-chain+heme only approximates it - HDAC2
(Zn) and PKR (cofactor) are the real co-folding tests. Ready for
`modal run gpu/modal_app.py`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 17:16:06 +02:00

113 lines
6.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# GPU plan — ephemeral AF3-class co-folding for the binding track
Goal: run AF3-class co-folding (Boltz-2 / DiffDock) on a GPU for the structure-binding track
(§12), then **pay nothing when idle**. The work is *bursty* (a validation run, then a screen),
the inputs are *tiny*, so the design optimises for zero idle cost, not for a persistent box.
## What actually has to move (small!)
| Thing | Size | How it gets to the GPU |
|---|---|---|
| Code | KB | `git clone` the gitea repo (or mount) |
| Target structures (PDBs) | a few MB | in the repo / git |
| Ligands (SMILES, drug set) | KB | in the repo / git |
| **Model weights** (Boltz-2 / DiffDock) | **~26 GB** | downloaded once, **cached in a persistent volume** |
| Results (poses, scores, RMSDs) | KBMB | `git push` back / download |
The 27 GB LINCS data is **not** part of this track — nothing big to upload. The only thing worth
persisting is the model-weights cache (so we don't re-download = re-pay GPU time every run).
## How the model weights persist (the cost-saver)
A `modal.Volume` is a **named, cloud-backed filesystem that lives independently of any container
or GPU** — it survives every teardown. Mounted into the function at `/weights`:
- **Run 1:** `/weights` is empty → the model downloads weights there (the one-time slow cost).
- **Run 2+:** the same Volume mounts with the files already present → download skipped → **no
GPU-billed seconds wasted re-fetching 5 GB.**
Two things make it actually cache:
1. **Point the downloader at the mount** (weights only persist if written under `/weights`):
`HF_HOME=/weights/hf` (HuggingFace), `TORCH_HOME=/weights/torch`, `boltz --cache /weights/boltz`.
2. **Commit semantics:** writes persist on `weights.commit()` (modern Modal also auto-commits on a
clean exit); other containers see them after `weights.reload()`. Pattern: `reload()` → run →
`commit()`.
The Volume itself costs pennies (~$/GB-month of storage), *separate from the GPU* — so caching ~5 GB
of weights is near-free and saves real GPU time on every subsequent run.
(Alternative: bake weights into the image at build time via `image.run_function(download)` — fastest
cold start, but the image rebuilds when weights change. The skeleton uses the Volume approach.)
## Provider choice
| Option | Billing | Idle cost | "Kill" model | Best for |
|---|---|---|---|---|
| **Modal** (recommended) | per-second | **$0 (scales to zero)** | automatic — nothing to remember | bursty batch runs |
| RunPod | per-minute, on-demand/spot | only while pod exists | manual `terminate` | interactive SSH box |
| Vast.ai | per-minute spot | only while rented | manual destroy | cheapest, more fiddly |
| Lambda / AWS-GCP spot | per-hour/second | until you stop it | manual stop | if you already have credits |
**Recommend Modal.** You define the run as a function with a `gpu=` decorator; on call it
provisions the GPU, runs, and releases it — **there is no GPU to forget to kill**, and you pay only
for the seconds it ran. That *is* the "kill the GPU to save money" requirement, automated.
## The lifecycle (Modal)
1. **Define image** once: CUDA base + `boltz` (or DiffDock) + `rdkit`, `meeko`, `spyrmsd`, `gemmi`.
2. **Weights volume**: `modal.Volume` mounted at `/weights`; the model downloads into it on first
run and is cached forever after (no re-download cost).
3. **Run**: `modal run gpu/modal_app.py` → provisions GPU → runs the test → returns results to your
laptop. GPU released the moment the function returns.
4. **Persist results**: write the returned scores/poses into `data/processed/binding/` and
`git commit` (small, text).
5. **Teardown**: nothing to do — Modal scaled to zero. `modal app stop` only if a run is hung.
RunPod alternative (if you want an interactive box): start pod → `git clone` → run → `git push`
results → **`runpodctl remove pod <id>`** (or Stop in the UI). Set an **idle auto-terminate** so a
forgotten box can't bleed money.
## What runs on the GPU (in cost order — cheap validation first)
- **Phase 1 — modality validation (~minutes, ~$1):** co-fold each known binder + 2 negative
controls (caffeine, hydroxyurea) into each target (Hb, PKR, HDAC2) **with the binding-mode
cofactors co-folded in** — HDAC2 + catalytic **Zn**, PKR + **FBP/Mg**, Hb + **heme** (as CCD
ligand entries) — and check the **known binder has the highest Boltz-2 P(binder)** for its own
target. This is the discrimination Vina couldn't manage precisely because it can't model
Zn-chelation / cofactors. (Ranking uses P(binder), not the raw affinity value, whose
sign is version-dependent.) Pose-RMSD vs crystal is a deeper check but needs receptor
superposition (align predicted protein to crystal, transform ligand) — a later refinement. If
Phase 1 passes, the modality is real; if not, stop before paying for a screen.
- **Phase 2 — screen (~tens of minutes, a few $):** run the ~300-drug set (or a focused subset)
against the sickle targets; rank by Boltz-2 predicted affinity; redo the §12.4 positive-control
recovery test. Output a ranked CSV, same shape as the connectivity `ranked_candidates`.
## Model choice
- **Boltz-2** (MIT, pip-installable) — predicts the proteinligand complex **and** a binding
affinity → directly gives a rankable score. Primary choice. Fits a 2440 GB GPU for these
single-domain targets.
- **DiffDock-L** — lighter, pose-only (needs a separate scorer); fallback if Boltz memory is tight.
- GPU: an **L4 (24 GB, ~$0.60.8/hr)** or **A10/L40S (2448 GB)** is plenty; no multi-GPU, no A100
needed for these sizes.
## Cost controls (the save-money checklist)
- Serverless (Modal) → **zero idle cost** by construction; or an **idle-timeout** auto-kill on a box.
- **Cache weights** in a persistent volume — re-downloading 5 GB on a $1/hr GPU is wasted money.
- **Validate on one target before screening** — don't pay for a 300-drug screen until Phase 1 passes.
- Prefer **spot/interruptible** for the batch screen (Phase 2 is restartable).
- Keep prep (Meeko/RDKit) and result-scoring on the **laptop**; only the model forward pass needs GPU.
- Estimated total to validate + a first screen: **~$515**, not a standing bill.
## Repo integration
- `gpu/modal_app.py` — the Modal app (skeleton committed alongside this plan).
- Results land in `data/processed/binding/` (gitignored) + a small committed summary.
- Pin model + weights version in the image for reproducibility.
## Next step
Scaffold `gpu/modal_app.py` for Phase-1 validation (3 known binders), do a dry run locally
(`modal run --detach`? no — just `modal run`), confirm cost, then Phase 2. Requires a Modal account
+ `pip install modal` + `modal token new` (one-time auth).