GPU Phase 1: co-fold cofactors/metals (the binding-mode determinants)

Add metal/cofactor handling to the Boltz-2 YAML as CCD ligand entries -
the modes classical docking couldn't model:
- HDAC2 + catalytic Zn (vorinostat chelates it)
- PKR + FBP + Mg (allosteric activator + metal)
- hemoglobin + heme
Same cofactors present when co-folding negatives into a target (fair test).

build_boltz_yaml() gains a cofactor_ccds arg (emits `ligand: {ccd: ...}`
entries); TARGETS carries per-target cofactors; cofold()/main() thread them
through. Verified locally: YAML builds correctly with Zn / FBP+Mg.

Honest limitation noted: Hb's voxelotor site is at the tetramer centre and
covalent (Schiff base), so single-chain+heme only approximates it - HDAC2
(Zn) and PKR (cofactor) are the real co-folding tests. Ready for
`modal run gpu/modal_app.py`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-24 17:16:06 +02:00
parent 4022c0cb94
commit 07705a5884
2 changed files with 43 additions and 31 deletions

View File

@@ -69,9 +69,11 @@ forgotten box can't bleed money.
## What runs on the GPU (in cost order — cheap validation first) ## What runs on the GPU (in cost order — cheap validation first)
- **Phase 1 — modality validation (~minutes, ~$1):** co-fold each known binder + 2 negative - **Phase 1 — modality validation (~minutes, ~$1):** co-fold each known binder + 2 negative
controls (caffeine, hydroxyurea) into each target (Hb, PKR, HDAC2) and check the **known binder controls (caffeine, hydroxyurea) into each target (Hb, PKR, HDAC2) **with the binding-mode
has the highest Boltz-2 P(binder)** for its own target — the discrimination Vina couldn't manage cofactors co-folded in** — HDAC2 + catalytic **Zn**, PKR + **FBP/Mg**, Hb + **heme** (as CCD
on metal/covalent/allosteric modes. (Ranking uses P(binder), not the raw affinity value, whose ligand entries) — and check the **known binder has the highest Boltz-2 P(binder)** for its own
target. This is the discrimination Vina couldn't manage precisely because it can't model
Zn-chelation / cofactors. (Ranking uses P(binder), not the raw affinity value, whose
sign is version-dependent.) Pose-RMSD vs crystal is a deeper check but needs receptor sign is version-dependent.) Pose-RMSD vs crystal is a deeper check but needs receptor
superposition (align predicted protein to crystal, transform ligand) — a later refinement. If superposition (align predicted protein to crystal, transform ligand) — a later refinement. If
Phase 1 passes, the modality is real; if not, stop before paying for a screen. Phase 1 passes, the modality is real; if not, stop before paying for a screen.

View File

@@ -32,14 +32,23 @@ image = (
weights = modal.Volume.from_name("reverso-binding-weights", create_if_missing=True) weights = modal.Volume.from_name("reverso-binding-weights", create_if_missing=True)
WEIGHTS = "/weights" WEIGHTS = "/weights"
# target name -> (PDB id, crystal-ligand resname, drug name). Plus negatives co-folded into each. # target -> (PDB id, crystal-ligand resname, drug name, cofactor/metal CCD codes to co-fold).
# The cofactors are the binding-mode determinants Vina couldn't model: HDAC2 needs the catalytic
# Zn (vorinostat chelates it), PKR the allosteric FBP + Mg, hemoglobin the heme. Co-folding them
# in as CCD ligands is the whole point of the AF3-class pivot. The same cofactors are present when
# co-folding the negatives into that target, for a fair comparison.
TARGETS = { TARGETS = {
"hemoglobin": ("5E83", "5L7", "voxelotor"), "hemoglobin": ("5E83", "5L7", "voxelotor", ["HEM"]),
"PKR": ("8XFD", "WV2", "mitapivat"), "PKR": ("8XFD", "WV2", "mitapivat", ["FBP", "MG"]),
"HDAC2": ("4LXZ", "SHH", "vorinostat"), "HDAC2": ("4LXZ", "SHH", "vorinostat", ["ZN"]),
} }
NEGATIVES = ["caffeine", "hydroxyurea"] NEGATIVES = ["caffeine", "hydroxyurea"]
# Honest limitation: hemoglobin's voxelotor site sits at the tetramer centre and the bond is
# covalent (Schiff base) — a single-chain + heme model only approximates it, so Hb is the weak
# case. HDAC2 (Zn chelation) and PKR (allosteric + cofactor) are the real tests of whether
# co-folding handles the modes classical docking could not.
# --------------------------------------------------------------------------- local helpers (CPU) # --------------------------------------------------------------------------- local helpers (CPU)
@@ -92,28 +101,27 @@ def pubchem_smiles(name: str) -> str:
raise ValueError(f"no SMILES for {name}") raise ValueError(f"no SMILES for {name}")
def build_boltz_yaml(protein_seq: str, ligand_smiles: str) -> str: def build_boltz_yaml(protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str] | None = None) -> str:
"""Boltz-2 YAML for a protein+ligand complex with an affinity prediction on the ligand.""" """Boltz-2 YAML: protein + the drug ligand (affinity binder) + any cofactor/metal CCD ligands.
return (
"version: 1\n" Cofactors/ions are added as `ligand` entries referencing their CCD code (e.g. ZN, MG, FBP,
"sequences:\n" HEM) so the model places the metal/cofactor that defines the binding mode. Affinity is
" - protein:\n" predicted on the drug ligand L only.
" id: A\n" """
f" sequence: {protein_seq}\n" lines = ["version: 1", "sequences:",
" - ligand:\n" " - protein:", " id: A", f" sequence: {protein_seq}",
" id: L\n" " - ligand:", " id: L", f" smiles: '{ligand_smiles}'"]
f" smiles: '{ligand_smiles}'\n" for i, ccd in enumerate(cofactor_ccds or []):
"properties:\n" lines += [" - ligand:", f" id: M{i}", f" ccd: {ccd}"]
" - affinity:\n" lines += ["properties:", " - affinity:", " binder: L"]
" binder: L\n" return "\n".join(lines) + "\n"
)
# ------------------------------------------------------------------------------- GPU function # ------------------------------------------------------------------------------- GPU function
@app.function(gpu="L4", image=image, volumes={WEIGHTS: weights}, timeout=3600) @app.function(gpu="L4", image=image, volumes={WEIGHTS: weights}, timeout=3600)
def cofold(label: str, protein_seq: str, ligand_smiles: str) -> dict: def cofold(label: str, protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str]) -> dict:
"""Co-fold one complex on the GPU; return predicted affinity + binder probability. """Co-fold one complex (protein + drug + cofactors) on the GPU; return affinity + P(binder).
Weights persist on the mounted Volume (HF_HOME/boltz --cache under /weights), so run 2+ skips Weights persist on the mounted Volume (HF_HOME/boltz --cache under /weights), so run 2+ skips
the download. GPU is released the moment this returns. the download. GPU is released the moment this returns.
@@ -126,7 +134,7 @@ def cofold(label: str, protein_seq: str, ligand_smiles: str) -> dict:
work = Path("/tmp") / label work = Path("/tmp") / label
work.mkdir(parents=True, exist_ok=True) work.mkdir(parents=True, exist_ok=True)
(work / "in.yaml").write_text(build_boltz_yaml(protein_seq, ligand_smiles)) (work / "in.yaml").write_text(build_boltz_yaml(protein_seq, ligand_smiles, cofactor_ccds))
out = work / "out" out = work / "out"
subprocess.run( subprocess.run(
["boltz", "predict", str(work / "in.yaml"), "--use_msa_server", ["boltz", "predict", str(work / "in.yaml"), "--use_msa_server",
@@ -154,19 +162,21 @@ def cofold(label: str, protein_seq: str, ligand_smiles: str) -> dict:
@app.local_entrypoint() @app.local_entrypoint()
def main() -> None: def main() -> None:
"""Fan out one GPU call per (target, ligand) pair; tabulate affinities; positive-control test.""" """Fan out one GPU call per (target, ligand) pair; tabulate affinities; positive-control test."""
jobs = [] # (target, ligand_name, protein_seq, smiles) jobs = [] # (label, protein_seq, smiles, cofactor_ccds)
for target, (pdb, res, drug) in TARGETS.items(): for target, (pdb, res, drug, cofactors) in TARGETS.items():
seq = binding_chain_sequence(pdb, res) seq = binding_chain_sequence(pdb, res)
for lig in [drug, *NEGATIVES]: for lig in [drug, *NEGATIVES]:
jobs.append((target, lig, seq, pubchem_smiles(lig))) jobs.append((f"{target}_{lig}", seq, pubchem_smiles(lig), cofactors))
print(f"co-folding {len(jobs)} complexes ({len(TARGETS)} targets x {1+len(NEGATIVES)} ligands)") cofactor_summary = {t: c[3] for t, c in TARGETS.items()}
print(f"co-folding {len(jobs)} complexes ({len(TARGETS)} targets x {1+len(NEGATIVES)} ligands); "
f"cofactors per target: {cofactor_summary}")
results = list(cofold.starmap([(f"{t}_{l}", s, smi) for t, l, s, smi in jobs])) results = list(cofold.starmap(jobs))
by = {r["label"]: r for r in results} by = {r["label"]: r for r in results}
print(f"\n{'target':12s}{'ligand':14s}{'affinity':>10s}{'P(binder)':>11s}") print(f"\n{'target':12s}{'ligand':14s}{'affinity':>10s}{'P(binder)':>11s}")
out_rows = [] out_rows = []
for target, (pdb, res, drug) in TARGETS.items(): for target, (pdb, res, drug, cofactors) in TARGETS.items():
prob = {} # rank by P(binder): unambiguous (higher = more likely a binder). Boltz prob = {} # rank by P(binder): unambiguous (higher = more likely a binder). Boltz
for lig in [drug, *NEGATIVES]: # affinity_pred_value is ~log(IC50) (lower=stronger) -- sign for lig in [drug, *NEGATIVES]: # affinity_pred_value is ~log(IC50) (lower=stronger) -- sign
r = by.get(f"{target}_{lig}", {}) # is version-dependent, so don't rank on it. r = by.get(f"{target}_{lig}", {}) # is version-dependent, so don't rank on it.