Files
Reverso/docs/structure_binding_notes.md
Junior B. 7c6cef1aef Production docking: prep helps (7.9->4.8A) but Vina wrong tool for sickle
§12.4 pushed to its limit. Meeko ligand prep + in-place symmetry RMSD
(spyrmsd, not obrms) on clean HDAC2/vorinostat: 7.9A -> 4.76A. Prep and
metric mattered, but still FAIL.

Residual cause is fundamental: vorinostat binds via hydroxamate-Zn
chelation and Vina has no metal-coordination term. Real finding: sickle's
druggable targets bind via non-classical chemistry classical docking
handles poorly -- Hb (covalent), PKR (allosteric+cofactor), HDAC (Zn
chelation). Vina is the wrong tool for this target landscape.

Redirect: data-driven AF3-class co-folding (Boltz-2/Chai-1/DiffDock)
handles these modes -- the indicated next tool, gated by the 24GB local
memory ceiling (cloud GPU needed). The "GPU breaks all-local" §12.6
prediction is now the binding constraint of the track.

Adds: scripts/dock_production.py; deps meeko, spyrmsd, gemmi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 16:38:54 +02:00

114 lines
6.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Structure-based binding track — working notes
Branch `structure-based-binding`. Implements PLAN §12. Baseline-first, start with the two cleanest
targets (Hemoglobin + PKR), de-risk the harness before scaling.
## Status (2026-06-23)
**Toolchain check (PLAN §12.6 pitfall 4, confirmed real):**
- ✅ RDKit installs on ARM Mac — ligand side ready.
- ❌ AutoDock Vina does NOT pip-install on ARM Mac; no docking binary available. Docking (§12.3)
is **blocked on toolchain** — must resolve via conda/micromamba (`vina`/`smina`), a GPU AF3-class
model (Boltz-2/Chai-1/DiffDock), or an x86 Vina binary under Rosetta.
**Structures obtained:** `5E83` (hemoglobin + voxelotor), `8XFD` (PKR + mitapivat) in
`data/raw/structures/`.
**Step 0 — ligand-based retrieval baseline (`scripts/binding_ligand_baseline.py`):**
RDKit Tanimoto of our 300 drugs vs known sickle binders.
- Engine VALIDATED on in-set classes: `decitabine`→azacitidine (0.62); `vorinostat`→scriptaid
(0.42), belinostat (0.28). Correctly clusters DNMT1 / HDAC HbF-inducers.
- But voxelotor / mitapivat have **no analog** in our set (max Tanimoto ~0.200.26). A 300-drug
library is too sparse to contain look-alikes of distinctive scaffolds.
**Takeaways:**
1. Ligand retrieval works but needs a **bigger drug library** to be useful for distinctive targets.
2. The targets without in-set analogs (Hb, PKR) need **actual docking** (§12.3) — which scores
binding directly, no look-alike required. That is the gating next step, and it needs the
toolchain solved.
## Step 1 — docking baseline (2026-06-24)
**Toolchain SOLVED on ARM Mac:** AutoDock Vina 1.2.5 mac binary (`tools/vina`, runs under Rosetta)
+ open-babel (brew) for prep. Docking runs end-to-end (`scripts/dock_positive_controls.py`).
Co-crystal ligands identified: 5L7 = voxelotor (5E83), WV2 = mitapivat (8XFD).
**Positive-control cross-docking — inconclusive, and instructively so.** Affinities (kcal/mol):
| ligand | hemoglobin | PKR |
|---|---|---|
| voxelotor | 8.1 | 9.3 |
| mitapivat | 10.0 | 11.2 |
| decitabine | 6.6 | 7.0 |
| hydroxyurea | 3.9 | 3.6 |
| caffeine | 6.1 | 6.4 |
The scores rank almost perfectly by **molecular size** (mitapivat > voxelotor > decitabine/caffeine
> hydroxyurea) in *both* pockets — mitapivat wins even on hemoglobin, which it doesn't target. So
raw Vina affinity is confounded by ligand size and per-pocket stickiness; it cannot yet
distinguish target-specific binding. This is the **docking analog of the connectivity specificity
problem** — raw scores carry a systematic bias (size here, broadness there) that masquerades as
signal. voxelotor *does* dock to Hb (8.1, a real score); the cross-target test just isn't the
right validation.
## Step 2 — redocking-RMSD validation FAILS across the board (2026-06-24)
Redocked each co-crystal ligand into its own structure (`scripts/dock_validate.py`); RMSD vs
crystal pose via obrms:
| redock | RMSD | note |
|---|---|---|
| voxelotor → Hb (5E83) | NA | covalent binder (Schiff base, αVal1) — out of scope §12.7 |
| mitapivat → PKR (8XFD) | 8.2 Å | allosteric, cofactor (FBP/Mg) stripped |
| **vorinostat → HDAC2 (4LXZ, Zn kept)** | **7.9 Å** | classical non-covalent target — should have worked |
**The clean target also failing means this is a systematic PIPELINE-QUALITY problem, not target
choice.** The cheap Vina + open-babel setup produces scores but does not reproduce known binding
geometry, so its affinities are not yet trustworthy. Ligand efficiency (affinity / heavy atoms)
also doesn't fix it — it over-corrects, ranking tiny hydroxyurea (0.78) "best".
Likely causes (in priority order):
1. **Low-quality receptor prep** — open-babel `-xr` is not production docking prep. Need
AutoDockTools `prepare_receptor` or **Meeko** + `reduce`/pdb2pqr for protonation, charges, and
proper AutoDock atom typing.
2. **Ligand prep** — should use Meeko (correct rotatable bonds / typing), not bare obabel `--gen3d`.
3. **RMSD metric** — obrms superimposes before RMSD; redocking validation wants symmetry-corrected
RMSD **in place** (receptor frame). Worth confirming with an in-place metric.
**Honest takeaway:** consistent with the whole project — the *quick* version of each method runs
but doesn't survive honest validation. Credible structure-based docking needs production prep
tooling (Meeko/ADFR), which is the real next investment for this track.
## Step 3 — production prep helps, but classical docking is the wrong tool here (2026-06-24)
`scripts/dock_production.py`: Meeko ligand prep (proper rotatable-bond/AD typing) + in-place
symmetry-corrected RMSD (spyrmsd, not obrms which superimposes). On the clean HDAC2/vorinostat
target (Zn kept):
- **7.9 Å → 4.76 Å** with proper ligand prep + correct metric. Prep and metric genuinely mattered.
- But still FAIL (>2 Å). The residual is the deeper problem: **vorinostat binding is defined by its
hydroxamate chelating the catalytic Zn, and Vina has no metal-coordination term** — it cannot
score the interaction that determines the pose.
**The real finding: sickle's druggable targets bind via non-classical chemistry that classical
docking handles poorly** — Hb/voxelotor (covalent), PKR/mitapivat (allosteric + cofactor),
HDAC/vorinostat (Zn chelation). This is the target landscape, not bad luck. AutoDock Vina is the
wrong tool for it.
**Redirect:** the modality that DOES handle covalent/metal/induced-fit binding is **data-driven
AF3-class co-folding** (Boltz-2 / Chai-1 / DiffDock — they learn these modes from the PDB). That is
the indicated next tool for this disease — and it's gated by the **24 GB local memory ceiling**
(PLAN §12.6 pitfall 4): needs a cloud GPU or a bigger box. The "GPU breaks all-local" prediction is
now the binding constraint of the whole track.
## Next steps
- [ ] AF3-class co-folding on a GPU (Boltz-2 affinity / Chai-1 / DiffDock); redo the §12.4
positive-control recovery test there — it should handle the metal/covalent modes Vina can't.
- [ ] (optional) Salvage one classical Vina case: PKR with FBP/Mg cofactors RETAINED, to confirm
the harness can validate on a non-metal sickle target.
- [ ] Production receptor prep (Meeko mk_prepare_receptor + protonation) if staying with Vina.
- [ ] §12.9 generative beacon — only after a validated scoring function exists.
> **Hardware note:** this machine is **24 GB** unified memory (not the 96 GB PLAN §2 assumed),
> which caps local AF3-class model inference. Classical docking (above) is unaffected.