diff --git a/PLAN.md b/PLAN.md
index a61f268..b202cea 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -426,3 +426,108 @@ This MVP exists in a broader strategic context that was built through ~7 expert
 - **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.
 
 The MVP's job is to produce one credible result. Everything else follows from that.
+
+---
+
+## 12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
+
+> **Status: scoped, not committed.** This is a follow-on track proposed *after* the MVP and its
+> follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what
+> those experiments empirically showed. Read §9–11 and the experiment commits first.
+
+### 12.1 Why pivot modality (the evidence, not a hunch)
+
+The expression-connectivity approach was built, validated, and pushed hard (gene-space
+expansion, cell-composition deconvolution, reference-library tau, supervised learning). The
+empirical verdict:
+
+- Connectivity **recovers hydroxyurea** (top ~6–8%) but **cannot achieve specificity** —
+  unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Unfixed by four
+  independent methods.
+- A supervised model on indication labels hit **0.925 CV AUC** — but it was a **degree-bias
+  mirage**: it learned drug popularity, not disease matching (it ranked hydroxyurea *231/300*).
+- The decisive test: with drug-popularity features removed, the model trained on the actual
+  drug↔disease connectivity scored **AUC 0.491 — pure chance**. **The expression-connectivity
+  modality contains essentially no disease-specific therapeutic signal for this task.**
+
+This is a *signal* problem, not a *model* problem — no amount of model sophistication (diffusion,
+GNNs, etc.) extracts signal that isn't in the data. The response is to **change modality** to one
+with a strong, physical, drug-specific signal: **does a molecule bind a sickle-relevant target?**
+A drug that binds HbS is mechanistically specific by construction — the opposite of a coincidental
+expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is
+itself a diffusion model) actually has traction, because structure has physical ground truth.
+
+### 12.2 Targets (sickle-specific, druggable, structurally characterised)
+
+Small molecules only (§2). Curated shortlist with public structures and, crucially, **known
+small-molecule binders to serve as positive controls**:
+
+| Target | Mechanism in sickle | Known binder (positive control) |
+|---|---|---|
+| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | **voxelotor** (binds α-Val1) |
+| PKR (PKLR, red-cell pyruvate kinase) | Activator → ↓2,3-BPG → ↑O2 affinity | **mitapivat**, etavopivat |
+| DNMT1 | HbF induction (de-repress γ-globin) | **decitabine**, azacitidine |
+| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
+| HDAC1/2 | HbF induction | vorinostat, panobinostat |
+| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
+| PDE9 | ↑cGMP, anti-adhesion | PF-04447943 (sickle trial) |
+
+Hard/excluded for v1: **BCL11A** (transcription factor, no classic pocket — the γ-globin master
+repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.
+
+### 12.3 Method (baseline → generative co-folding)
+
+1. **Prepare structures.** Pull target structures from the PDB; AF3/Boltz-predict any missing.
+2. **Prepare ligands.** Reuse the existing ~300-drug set (we already have canonical SMILES from
+   ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
+3. **Dock + score**, in increasing sophistication:
+   - **Baseline:** classical docking (AutoDock Vina / smina) — fast, CPU, well-understood.
+   - **Generative co-folding:** an open AlphaFold3-class model — **Boltz-2** (predicts the
+     protein–ligand complex *and* a binding-affinity estimate, MIT-licensed), **Chai-1**, or
+     **DiffDock** (a diffusion model for docking — the legitimate home for the "diffusion"
+     instinct). These predict the bound pose directly and tend to beat classical docking.
+   - Report both; the baseline keeps us honest about whether the ML model adds anything.
+
+### 12.4 Validation (a real recovery test, like §6 Week 4)
+
+Pre-register before scoring: **the known structure-based sickle drugs must rank as top binders to
+their targets** — voxelotor→hemoglobin, mitapivat→PKR, decitabine→DNMT1. Negative controls
+(unrelated drugs) must *not* bind these pockets. This is a cleaner recovery test than the
+expression one, because binding is mechanistically specific — it should not have the
+coincidental-reverser problem that sank the connectivity approach.
+
+### 12.5 The real prize — integrate, don't replace
+
+The long-term value is **both modalities together**: a candidate that *reverses the disease
+signature* (expression) **and** *binds a sickle-relevant target* (structure) is far more credible
+than either alone. Structure supplies the specificity the expression layer lacks; expression
+supplies the systems-level, target-agnostic view structure lacks. The platform thesis (§11) —
+two databases + a matching engine — extends naturally to a third (structures) feeding the same
+confidence-tiered data layer.
+
+### 12.6 Honest pitfalls (do not ignore)
+
+1. **Binding ≠ efficacy.** A molecule can bind and do nothing therapeutic. Structure-based hits
+   are still hypotheses (cf. §9.7).
+2. **Only covers the enzyme/pocket subset.** Sickle's biggest lever (γ-globin reactivation via
+   BCL11A) is largely transcriptional and not small-molecule-tractable — structure-based screening
+   is blind to it. Be explicit about coverage.
+3. **Docking/affinity accuracy is limited.** Pose prediction is decent; absolute affinity is hard.
+   Validate on known binders before trusting novel scores.
+4. **Compute.** AF3-class models are GPU-heavy; the local Mac Studio (§2) may not suffice — this
+   track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
+5. **Moat.** Structures and tools are public; the proprietary value is the curated target list,
+   the integration with the expression layer, and provenance/tiering — not the docker.
+
+### 12.7 Explicitly NOT in this track
+
+Free energy perturbation / MD-based affinity; covalent docking; de novo molecule *generation*
+(that's design, not repurposing); BCL11A or any non-pocket target; biologics; combination binding.
+
+### 12.8 Open decisions before committing
+
+- **Tooling:** classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend:
+  baseline first, for an honest reference — the lesson of the whole expression arc.)
+- **Compute:** secure a GPU environment (the "all local" §2 assumption breaks here).
+- **Scope of v1:** the 7-target shortlist above, or start with just Hb + PKR (the two with the
+  cleanest positive controls) to de-risk the harness before scaling targets.
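
The §12.4 recovery test is easier to pre-register if the pass/fail rule is written down as code before any scoring is run. Below is a minimal sketch in Python, not the project's harness. Everything named here is an assumption for illustration: the function names `rank_of` and `recovery_test`, the `top_fraction` threshold (the plan does not fix one), the reading of "must not bind" as "must not land inside the top fraction", and the convention that a lower score means stronger predicted binding (as with Vina-style energies; a model that reports higher-is-better would need its scores negated first).

```python
from math import ceil


def rank_of(scores: dict[str, float], ligand: str) -> int:
    """1-based competition rank of `ligand`; lower score = stronger binding (assumed)."""
    own = scores[ligand]
    return 1 + sum(1 for s in scores.values() if s < own)


def recovery_test(
    scores_by_target: dict[str, dict[str, float]],
    positive_controls: dict[str, str],
    negative_controls: list[str],
    top_fraction: float = 0.05,  # hypothetical threshold; fix it before scoring
) -> dict[str, dict]:
    """Check each target's known binder ranks in the top fraction and no negative control does."""
    report = {}
    for target, control in positive_controls.items():
        scores = scores_by_target[target]
        # Number of ligands that count as "top binders" for this target.
        cutoff = max(1, ceil(top_fraction * len(scores)))
        control_rank = rank_of(scores, control)
        # Negative controls that score as well as a top binder are a specificity failure.
        leaked = [
            drug
            for drug in negative_controls
            if drug in scores and rank_of(scores, drug) <= cutoff
        ]
        report[target] = {
            "control": control,
            "rank": control_rank,
            "cutoff": cutoff,
            "leaked_negatives": leaked,
            "passed": control_rank <= cutoff and not leaked,
        }
    return report
```

A call would pass one score table per target (for example the output of the docking baseline over the ~300-drug set), the positive controls from §12.2 as `{"Hb": "voxelotor", "PKR": "mitapivat", "DNMT1": "decitabine"}` (target keys are hypothetical), and the unrelated drugs from §12.1 as negative controls. Reporting `leaked_negatives` separately from the control's rank keeps the two failure modes apart: a missed positive control is a recall problem, while a leaked negative is the same specificity problem that sank the connectivity approach.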