Scope Phase 2 structure-based binding track into PLAN (§12)

Add a scoped (not committed) follow-on track pivoting modality from
expression-connectivity to structure-based drug-target binding, motivated
by the empirical finding that the expression modality is signal-dead for
this task (relational-only supervised AUC = 0.49, chance).

§12 covers: the evidence for the pivot, a sickle-specific druggable target
shortlist with known-binder positive controls (Hb/voxelotor, PKR/mitapivat,
DNMT1/decitabine, LSD1, HDAC, EHMT2, PDE9), method (classical docking
baseline -> AF3-class co-folding: Boltz-2/Chai-1/DiffDock), a pre-registered
binding recovery test, integration with the expression layer as the real
prize, honest pitfalls (binding != efficacy, BCL11A untractable, GPU breaks
the all-local assumption), and open decisions before committing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:40:18 +02:00
parent 649f617019
commit 7449dbeefb

PLAN.md

@@ -426,3 +426,108 @@ This MVP exists in a broader strategic context that was built through ~7 expert
- **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.
The MVP's job is to produce one credible result. Everything else follows from that.
---
## 12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
> **Status: scoped, not committed.** This is a follow-on track proposed *after* the MVP and its
> follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what
> those experiments empirically showed. Read §9–11 and the experiment commits first.
### 12.1 Why pivot modality (the evidence, not a hunch)
The expression-connectivity approach was built, validated, and pushed hard (gene-space
expansion, cell-composition deconvolution, reference-library tau, supervised learning). The
empirical verdict:
- Connectivity **recovers hydroxyurea** (top ~68%) but **cannot achieve specificity**:
unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Four independent
methods failed to fix this.
- A supervised model on indication labels hit **0.925 CV AUC**, but it was a **degree-bias
mirage**: it learned drug popularity, not disease matching (it ranked hydroxyurea *231/300*).
- The decisive test: with drug-popularity features removed, the model trained on the actual
drug–disease connectivity scored **AUC 0.491, pure chance**. **The expression-connectivity
modality contains essentially no disease-specific therapeutic signal for this task.**
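
The degree-bias failure mode can be reproduced on synthetic data. The sketch below is an illustration only, not the experiment's code: labels are drawn so that they depend solely on a hypothetical per-drug `popularity` value, and a rank-based AUC is computed for that feature and for an unrelated `noise` feature. The data-generating process and all names are assumptions made for the example.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def simulate(n_pairs=2000, seed=0):
    """Synthetic drug-disease pairs whose label depends only on drug popularity."""
    rng = random.Random(seed)
    popularity, noise, labels = [], [], []
    for _ in range(n_pairs):
        pop = rng.random()           # hypothetical popularity (e.g. number of indications)
        popularity.append(pop)
        noise.append(rng.random())   # feature carrying no information about the label
        labels.append(1 if rng.random() < pop ** 3 else 0)
    return auc(popularity, labels), auc(noise, labels)


auc_popularity, auc_noise = simulate()
```

In this toy setup the popularity feature alone scores well above chance while the noise feature stays near 0.5, which mirrors the pattern described above: a high AUC says nothing about disease matching if popularity leaks into the features.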
This is a *signal* problem, not a *model* problem: no amount of model sophistication (diffusion,
GNNs, etc.) extracts signal that isn't in the data. The response is to **change modality** to one
with a strong, physical, drug-specific signal: **does a molecule bind a sickle-relevant target?**
A drug that binds HbS is mechanistically specific by construction, the opposite of a coincidental
expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is
itself a diffusion model) actually has traction, because structure has physical ground truth.
### 12.2 Targets (sickle-specific, druggable, structurally characterised)
Small molecules only (§2). Curated shortlist with public structures and, crucially, **known
small-molecule binders to serve as positive controls**:
| Target | Mechanism in sickle | Known binder (positive control) |
|---|---|---|
| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | **voxelotor** (binds α-Val1) |
| PKR (PKLR, red-cell pyruvate kinase) | Activator; lowers 2,3-BPG, raising O2 affinity | **mitapivat**, etavopivat |
| DNMT1 | HbF induction (de-repress γ-globin) | **decitabine**, azacitidine |
| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
| HDAC1/2 | HbF induction | vorinostat, panobinostat |
| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
| PDE9 | Inhibition raises cGMP; anti-adhesion | PF-04447943 (sickle trial) |
Hard/excluded for v1: **BCL11A** (a transcription factor with no classic pocket; it is the γ-globin
master repressor but not yet small-molecule-tractable) and any gene-therapy / biologic mechanism.
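
One way to keep the shortlist machine-readable is a small registry that the docking harness iterates over. This is a minimal sketch, assuming a hypothetical `Target` record and `screening_targets` helper; the entries are copied from the table above and no structure identifiers are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical record for one row of the shortlist."""
    name: str
    mechanism: str
    positive_controls: tuple  # known binders used as positive controls
    in_scope_v1: bool = True


TARGETS = (
    Target("Hemoglobin (HbS)", "anti-polymerisation; R-state stabiliser", ("voxelotor",)),
    Target("PKR (PKLR)", "activator", ("mitapivat", "etavopivat")),
    Target("DNMT1", "HbF induction", ("decitabine", "azacitidine")),
    Target("LSD1 / KDM1A", "HbF induction", ("tranylcypromine analogues",)),
    Target("HDAC1/2", "HbF induction", ("vorinostat", "panobinostat")),
    Target("EHMT2 (G9a)", "HbF induction", ("UNC0642",)),
    Target("PDE9", "cGMP, anti-adhesion", ("PF-04447943",)),
    # Listed for completeness but excluded from v1: no classic pocket.
    Target("BCL11A", "gamma-globin master repressor", (), in_scope_v1=False),
)


def screening_targets(targets=TARGETS):
    """Targets that are in scope and have at least one positive control."""
    return [t for t in targets if t.in_scope_v1 and t.positive_controls]
```

Keeping the excluded target in the registry with an explicit flag makes the coverage gap visible in the data rather than only in prose.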
### 12.3 Method (baseline → generative co-folding)
1. **Prepare structures.** Pull target structures from the PDB; AF3/Boltz-predict any missing.
2. **Prepare ligands.** Reuse the existing ~300-drug set (we already have canonical SMILES from
ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
3. **Dock + score**, in increasing sophistication:
   - **Baseline:** classical docking (AutoDock Vina / smina): fast, CPU-only, well understood.
   - **Generative co-folding:** an open AlphaFold3-class model such as **Boltz-2** (predicts the
     protein–ligand complex *and* a binding-affinity estimate, MIT-licensed) or **Chai-1**, or
     **DiffDock** (a diffusion model for docking, the legitimate home for the "diffusion"
     instinct). These predict the bound pose directly and tend to beat classical docking.
   - Report both; the baseline keeps us honest about whether the ML model adds anything.
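
The classical baseline could be driven from a thin wrapper like the one below. It is a sketch under assumptions: it targets the AutoDock Vina command line (not smina), the file names and search-box values are placeholders, and the parser relies on the `REMARK VINA RESULT:` lines Vina writes into its output PDBQT. The helper names `vina_command` and `best_affinity` are hypothetical.

```python
import re

# Vina writes one line per pose: "REMARK VINA RESULT:  <kcal/mol>  <rmsd l.b.>  <rmsd u.b.>"
VINA_RESULT = re.compile(r"^REMARK VINA RESULT:\s+(-?\d+(?:\.\d+)?)", re.MULTILINE)


def vina_command(receptor, ligand, center, size, out, exhaustiveness=8):
    """Build the argv list for one Vina run (does not execute anything)."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        "vina",
        "--receptor", receptor,
        "--ligand", ligand,
        "--center_x", str(cx), "--center_y", str(cy), "--center_z", str(cz),
        "--size_x", str(sx), "--size_y", str(sy), "--size_z", str(sz),
        "--exhaustiveness", str(exhaustiveness),
        "--out", out,
    ]


def best_affinity(pdbqt_text):
    """Most negative predicted affinity across poses, or None if no pose was found."""
    scores = [float(value) for value in VINA_RESULT.findall(pdbqt_text)]
    return min(scores) if scores else None
```

The command list can be passed to `subprocess.run(cmd, check=True)` once receptor and ligand PDBQT files exist; building the list separately keeps the harness testable without Vina installed.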
### 12.4 Validation (a real recovery test, like §6 Week 4)
Pre-register before scoring: **the known structure-based sickle drugs must rank as top binders to
their targets**: voxelotor→hemoglobin, mitapivat→PKR, decitabine→DNMT1. Negative controls
(unrelated drugs) must *not* bind these pockets. This is a cleaner recovery test than the
expression one because binding is mechanistically specific; it should not have the
coincidental-reverser problem that sank the connectivity approach.
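
The recovery criterion can be fixed in code before any scores exist. The sketch below assumes docking-style scores where lower (more negative) means stronger predicted binding, and a hypothetical `top_fraction` cutoff; the actual threshold is one of the things to pre-register and is not specified in this plan.

```python
def rank_of(ligand, scores):
    """1-based rank of `ligand`; lower score = stronger predicted binding.

    Ties are resolved optimistically (only strictly better scores push the rank down).
    """
    target_score = scores[ligand]
    return 1 + sum(1 for s in scores.values() if s < target_score)


def recovery_test(scores, positives, negatives, top_fraction=0.05):
    """Check that known binders land in the top fraction and negative controls do not."""
    cutoff = max(1, int(len(scores) * top_fraction))
    positive_ranks = {p: rank_of(p, scores) for p in positives}
    negative_ranks = {n: rank_of(n, scores) for n in negatives}
    return {
        "cutoff": cutoff,
        "positive_ranks": positive_ranks,
        "negative_ranks": negative_ranks,
        "positives_recovered": all(r <= cutoff for r in positive_ranks.values()),
        "negatives_excluded": all(r > cutoff for r in negative_ranks.values()),
    }
```

`scores` maps ligand name to score for a single target, so the test runs once per target (for example, voxelotor as the positive for hemoglobin, with norethindrone and ciprofloxacin as negatives).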
### 12.5 The real prize — integrate, don't replace
The long-term value is **both modalities together**: a candidate that *reverses the disease
signature* (expression) **and** *binds a sickle-relevant target* (structure) is far more credible
than either alone. Structure supplies the specificity the expression layer lacks; expression
supplies the systems-level, target-agnostic view structure lacks. The platform thesis (§11),
two databases plus a matching engine, extends naturally to a third (structures) feeding the same
confidence-tiered data layer.
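
As a rough illustration of how the two modalities might feed one tiered layer, the sketch below assigns each drug a tier from two boolean evidence flags. The tier names and the `assign_tier` / `tier_candidates` helpers are hypothetical; how a drug earns each flag (thresholds, scoring) is left open here.

```python
from enum import Enum


class Tier(Enum):
    """Hypothetical evidence tiers, strongest first."""
    CONVERGENT = "reverses signature and binds a sickle-relevant target"
    STRUCTURE_ONLY = "binds a target, no expression support"
    EXPRESSION_ONLY = "reverses signature, no binding support"
    UNSUPPORTED = "no support from either modality"


def assign_tier(reverses_signature, binds_target):
    """Map the two evidence flags onto a tier."""
    if reverses_signature and binds_target:
        return Tier.CONVERGENT
    if binds_target:
        return Tier.STRUCTURE_ONLY
    if reverses_signature:
        return Tier.EXPRESSION_ONLY
    return Tier.UNSUPPORTED


def tier_candidates(expression_hits, structure_hits, all_drugs):
    """Tier every drug given the set of hits from each modality."""
    return {d: assign_tier(d in expression_hits, d in structure_hits) for d in all_drugs}
```

Ranking structure-only above expression-only is a design choice that follows the argument above (binding is the more specific signal); it is an assumption of this sketch, not a decision recorded in the plan.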
### 12.6 Honest pitfalls (do not ignore)
1. **Binding ≠ efficacy.** A molecule can bind and do nothing therapeutic. Structure-based hits
are still hypotheses (cf. §9.7).
2. **Only covers the enzyme/pocket subset.** Sickle's biggest lever (γ-globin reactivation via
BCL11A) is largely transcriptional and not small-molecule-tractable, so structure-based screening
is blind to it. Be explicit about coverage.
3. **Docking/affinity accuracy is limited.** Pose prediction is decent; absolute affinity is hard.
Validate on known binders before trusting novel scores.
4. **Compute.** AF3-class models are GPU-heavy; the local Mac Studio (§2) may not suffice. This
track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
5. **Moat.** Structures and tools are public; the proprietary value is the curated target list,
the integration with the expression layer, and provenance/tiering, not the docking tool itself.
### 12.7 Explicitly NOT in this track
Free energy perturbation / MD-based affinity; covalent docking; de novo molecule *generation*
(that's design, not repurposing); BCL11A or any non-pocket target; biologics; combination binding.
### 12.8 Open decisions before committing
- **Tooling:** classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend:
baseline first, for an honest reference; that is the lesson of the whole expression arc.)
- **Compute:** secure a GPU environment (the "all local" §2 assumption breaks here).
- **Scope of v1:** the 7-target shortlist above, or start with just Hb + PKR (the two with the
cleanest positive controls) to de-risk the harness before scaling targets.