GPU plan: make weight persistence concrete (Modal Volume cache)

Document and wire the weight-caching mechanism:

- modal.Volume is a cloud-backed FS independent of the GPU/container; run 1 downloads weights
  into /weights, run 2+ reuses them (no GPU time wasted re-downloading).
- Point downloaders at the mount: HF_HOME/TORCH_HOME/boltz --cache; persist via
  weights.commit(), see updates via weights.reload().
- Volume storage costs pennies, separate from GPU = near-free caching.

modal_app.py cofold(): set cache env vars to /weights, reload()/commit() around the (stubbed)
boltz call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 16:48:50 +02:00
parent 08ed713cc8
commit 81d56b7a76
2 changed files with 47 additions and 6 deletions
--- a/docs/gpu_plan.md
+++ b/docs/gpu_plan.md
@@ -17,6 +17,27 @@ the inputs are *tiny*, so the design optimises for zero idle cost, not for a per
 The 27 GB LINCS data is **not** part of this track — nothing big to upload. The only thing worth
 persisting is the model-weights cache (so we don't re-download = re-pay GPU time every run).

+## How the model weights persist (the cost-saver)
+
+A `modal.Volume` is a **named, cloud-backed filesystem that lives independently of any container
+or GPU** — it survives every teardown. Mounted into the function at `/weights`:
+
+- **Run 1:** `/weights` is empty → the model downloads weights there (the one-time slow cost).
+- **Run 2+:** the same Volume mounts with the files already present → download skipped → **no
+  GPU-billed seconds wasted re-fetching ~5 GB.**
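The run-1/run-2 behaviour above boils down to "download only if the cache is missing". A minimal sketch of that check, assuming a hypothetical `ensure_weights` helper, a hypothetical `.complete` marker file, and a `download` callback standing in for the real fetcher. HuggingFace, torch hub and Boltz each do their own existence check, so this only illustrates the idea rather than reproducing any of them:

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(cache_dir: Path, download: Callable[[Path], None]) -> bool:
    """Fetch weights into cache_dir only if a previous run has not already done so.

    Returns True when a download happened (run 1) and False when the cached copy
    was reused (run 2+). `download` is a hypothetical stand-in for the real fetcher.
    """
    marker = cache_dir / ".complete"  # hypothetical marker: written only after a full download
    if marker.exists():
        return False  # weights already on the Volume: skip the slow, GPU-billed fetch
    cache_dir.mkdir(parents=True, exist_ok=True)
    download(cache_dir)  # if this raises, no marker is written and the next run retries
    marker.touch()
    return True


if __name__ == "__main__":
    def fake_download(d: Path) -> None:
        (d / "model.ckpt").write_bytes(b"fake weights")

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "weights" / "boltz"
        print(ensure_weights(root, fake_download))  # True: first run downloads
        print(ensure_weights(root, fake_download))  # False: second run reuses the cache
```

On a real Volume the marker only helps other containers once it has been committed, which is why this pairs with the `reload()` → run → `commit()` pattern.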
+
+Two things are needed for the cache to actually work:
+1. **Point the downloader at the mount** (weights only persist if written under `/weights`):
+   `HF_HOME=/weights/hf` (HuggingFace), `TORCH_HOME=/weights/torch`, `boltz --cache /weights/boltz`.
+2. **Commit semantics:** writes persist on `weights.commit()` (modern Modal also auto-commits on a
+   clean exit); other containers see them after `weights.reload()`. Pattern: `reload()` → run →
+   `commit()`.
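The two requirements can be combined into one wrapper around the model call. This is a minimal sketch, not the skeleton's `cofold()`: the helper names (`cache_env`, `boltz_cmd`, `with_volume`) are hypothetical, and `volume` is anything with `reload()`/`commit()`, which inside the real function would be the `modal.Volume` mounted at `/weights`. It calls `commit()` only after the run succeeds, a choice made here for illustration; committing in a `finally` block is the alternative if partial downloads are acceptable.

```python
import os
from typing import Callable, Protocol, TypeVar

T = TypeVar("T")

WEIGHTS_ROOT = "/weights"  # mount point of the Volume in the skeleton


class CommittableVolume(Protocol):
    """The two modal.Volume methods this pattern relies on."""

    def reload(self) -> None: ...

    def commit(self) -> None: ...


def cache_env(root: str = WEIGHTS_ROOT) -> dict[str, str]:
    """Env vars that point the HuggingFace and torch caches at the Volume."""
    return {"HF_HOME": f"{root}/hf", "TORCH_HOME": f"{root}/torch"}


def boltz_cmd(input_path: str, root: str = WEIGHTS_ROOT) -> list[str]:
    """`boltz predict` argv with its weight cache on the Volume."""
    return ["boltz", "predict", input_path, "--cache", f"{root}/boltz"]


def with_volume(volume: CommittableVolume, run: Callable[[], T]) -> T:
    """reload() -> run -> commit(): see the latest committed weights, then persist new ones."""
    # Set before the model libraries read their cache paths (huggingface_hub reads
    # HF_HOME at import time, so this must also precede that import).
    os.environ.update(cache_env())
    volume.reload()  # pick up weights committed by earlier containers
    result = run()  # e.g. subprocess.run(boltz_cmd("input.yaml"), check=True)
    volume.commit()  # persist anything this run downloaded under /weights
    return result
```

Typing `volume` as a `Protocol` keeps the helper importable and testable without Modal installed; a `modal.Volume` satisfies it because it has both methods.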
+
+The Volume itself costs pennies (storage is billed per GB-month), *separate from the GPU*, so caching ~5 GB
+of weights is near-free and saves real GPU time on every subsequent run.
+(Alternative: bake weights into the image at build time via `image.run_function(download)`. This gives the
+fastest cold start, but changing the weights means rebuilding the image. The skeleton uses the Volume approach.)
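To check the "near-free" claim for a given setup, weigh storage billed per GB-month against the GPU seconds no longer spent re-downloading. A back-of-the-envelope sketch: the function names are hypothetical, and the rates are deliberately left as parameters because this plan does not pin them down (plug in current Modal pricing):

```python
import math


def monthly_storage_usd(gb: float, usd_per_gb_month: float) -> float:
    """What keeping the cached weights on the Volume costs per month."""
    return gb * usd_per_gb_month


def saved_per_run_usd(download_seconds: float, gpu_usd_per_second: float) -> float:
    """GPU time a warm cache avoids on each run (the re-download it skips)."""
    return download_seconds * gpu_usd_per_second


def break_even_runs(gb: float, usd_per_gb_month: float,
                    download_seconds: float, gpu_usd_per_second: float) -> int:
    """Cache hits per month after which the Volume has paid for itself."""
    storage = monthly_storage_usd(gb, usd_per_gb_month)
    per_run = saved_per_run_usd(download_seconds, gpu_usd_per_second)
    return math.ceil(storage / per_run)
```

With purely illustrative (not quoted) rates of $0.10/GB-month and $0.001/GPU-second, and an assumed 60 s download of 5 GB, `break_even_runs(5, 0.10, 60, 0.001)` returns 9: nine cache hits a month and the Volume is already cheaper than re-downloading.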
+
 ## Provider choice

 | Option | Billing | Idle cost | "Kill" model | Best for |