Harden cache_msa: pick valid FASTA a3m + strip null bytes

Pilot validated the MSA-reuse architecture (1 server query, cached & reused -- server hammering fixed) but Boltz's INPUT a3m parser crashed on the cached file: KeyError '\x00'. Boltz writes several .a3m files (some binary); we were grabbing a bad one. Fix: select the largest a3m that looks like FASTA (starts with '>'), and strip null bytes before caching. Raise with the candidate list if none qualifies (so we can see what's there). Pilot status: MSA caching + reuse confirmed working; this fixes the a3m-format crash. Re-validate with a 2-drug pilot before the full run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fix screen: compute target MSA once, reuse it, tolerate failures
2026-06-25 00:44:32 +02:00 · 2026-06-25 00:41:51 +02:00 · 2026-06-24 22:23:01 +02:00 · 2026-06-24 19:56:40 +02:00 · 2026-06-24 19:50:33 +02:00 · 2026-06-24 19:41:19 +02:00
28 changed files with 5138 additions and 150 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -31,3 +31,6 @@ env/
 .DS_Store
 .idea/
 .vscode/
+
+# Local tool binaries (refetch per docs/structure_binding_notes.md)
+tools/
--- a/PLAN.md
+++ b/PLAN.md
@@ -426,3 +426,143 @@ This MVP exists in a broader strategic context that was built through ~7 expert
 - **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.

 The MVP's job is to produce one credible result. Everything else follows from that.
+
+---
+
+## 12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
+
+> **Status: scoped, not committed.** This is a follow-on track proposed *after* the MVP and its
+> follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what
+> those experiments empirically showed. Read §9–11 and the experiment commits first.
+
+### 12.1 Why pivot modality (the evidence, not a hunch)
+
+The expression-connectivity approach was built, validated, and pushed hard (gene-space
+expansion, cell-composition deconvolution, reference-library tau, supervised learning). The
+empirical verdict:
+
+- Connectivity **recovers hydroxyurea** (top ~6–8%) but **cannot achieve specificity** —
+  unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Unfixed by four
+  independent methods.
+- A supervised model on indication labels hit **0.925 CV AUC** — but it was a **degree-bias
+  mirage**: it learned drug popularity, not disease matching (it ranked hydroxyurea *231/300*).
+- The decisive test: with drug-popularity features removed, the model trained on the actual
+  drug↔disease connectivity scored **AUC 0.491 — pure chance**. **The expression-connectivity
+  modality contains essentially no disease-specific therapeutic signal for this task.**
+
+This is a *signal* problem, not a *model* problem — no amount of model sophistication (diffusion,
+GNNs, etc.) extracts signal that isn't in the data. The response is to **change modality** to one
+with a strong, physical, drug-specific signal: **does a molecule bind a sickle-relevant target?**
+A drug that binds HbS is mechanistically specific by construction — the opposite of a coincidental
+expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is
+itself a diffusion model) actually has traction, because structure has physical ground truth.
+
+### 12.2 Targets (sickle-specific, druggable, structurally characterised)
+
+Small molecules only (§2). Curated shortlist with public structures and, crucially, **known
+small-molecule binders to serve as positive controls**:
+
+| Target | Mechanism in sickle | Known binder (positive control) |
+|---|---|---|
+| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | **voxelotor** (binds α-Val1) |
+| PKR (PKLR, red-cell pyruvate kinase) | Activator → ↓2,3-BPG → ↑O2 affinity | **mitapivat**, etavopivat |
+| DNMT1 | HbF induction (de-repress γ-globin) | **decitabine**, azacitidine |
+| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
+| HDAC1/2 | HbF induction | vorinostat, panobinostat |
+| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
+| PDE9 | ↑cGMP, anti-adhesion | PF-04447943 (sickle trial) |
+
+Hard/excluded for v1: **BCL11A** (transcription factor, no classic pocket — the γ-globin master
+repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.
+
+### 12.3 Method (baseline → generative co-folding)
+
+1. **Prepare structures.** Pull target structures from the PDB; AF3/Boltz-predict any missing.
+2. **Prepare ligands.** Reuse the existing ~300-drug set (we already have canonical SMILES from
+   ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
+3. **Dock + score**, in increasing sophistication:
+   - **Baseline:** classical docking (AutoDock Vina / smina) — fast, CPU, well-understood.
+   - **Generative co-folding:** an open AlphaFold3-class model — **Boltz-2** (predicts the
+     protein–ligand complex *and* a binding-affinity estimate, MIT-licensed), **Chai-1**, or
+     **DiffDock** (a diffusion model for docking — the legitimate home for the "diffusion"
+     instinct). These predict the bound pose directly and tend to beat classical docking.
+   - Report both; the baseline keeps us honest about whether the ML model adds anything.
+
+### 12.4 Validation (a real recovery test, like §6 Week 4)
+
+Pre-register before scoring: **the known structure-based sickle drugs must rank as top binders to
+their targets** — voxelotor→hemoglobin, mitapivat→PKR, decitabine→DNMT1. Negative controls
+(unrelated drugs) must *not* bind these pockets. This is a cleaner recovery test than the
+expression one, because binding is mechanistically specific — it should not have the
+coincidental-reverser problem that sank the connectivity approach.
+
+### 12.5 The real prize — integrate, don't replace
+
+The long-term value is **both modalities together**: a candidate that *reverses the disease
+signature* (expression) **and** *binds a sickle-relevant target* (structure) is far more credible
+than either alone. Structure supplies the specificity the expression layer lacks; expression
+supplies the systems-level, target-agnostic view structure lacks. The platform thesis (§11) —
+two databases + a matching engine — extends naturally to a third (structures) feeding the same
+confidence-tiered data layer.
+
+### 12.6 Honest pitfalls (do not ignore)
+
+1. **Binding ≠ efficacy.** A molecule can bind and do nothing therapeutic. Structure-based hits
+   are still hypotheses (cf. §9.7).
+2. **Only covers the enzyme/pocket subset.** Sickle's biggest lever (γ-globin reactivation via
+   BCL11A) is largely transcriptional and not small-molecule-tractable — structure-based screening
+   is blind to it. Be explicit about coverage.
+3. **Docking/affinity accuracy is limited.** Pose prediction is decent; absolute affinity is hard.
+   Validate on known binders before trusting novel scores.
+4. **Compute.** AF3-class models are GPU-heavy; the local Mac Studio (§2) may not suffice — this
+   track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
+5. **Moat.** Structures and tools are public; the proprietary value is the curated target list,
+   the integration with the expression layer, and provenance/tiering — not the docker.
+
+### 12.7 Explicitly NOT in this track
+
+Free energy perturbation / MD-based affinity; covalent docking; **de novo generation of molecules
+as final candidates to synthesise** (design, not repurposing — but see §12.9 for the
+generate-then-retrieve hybrid, which *is* repurposing); BCL11A or any non-pocket target;
+biologics; combination binding.
+
+### 12.8 Open decisions before committing
+
+- **Tooling:** classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend:
+  baseline first, for an honest reference — the lesson of the whole expression arc.)
+- **Compute:** secure a GPU environment (the "all local" §2 assumption breaks here).
+- **Scope of v1:** the 7-target shortlist above, or start with just Hb + PKR (the two with the
+  cleanest positive controls) to de-risk the harness before scaling targets.
+
+### 12.9 Door left open — generative-guided retrieval (generate → match existing)
+
+A legitimate way to bring generative models *into the repurposing frame* (vs the design frame
+excluded in §12.7): don't generate molecules to synthesise — generate them as a **search beacon**.
+
+**The idea.** Use a pocket-conditioned generative model (target-conditioned diffusion — e.g.
+TargetDiff, DiffSBDD, Pocket2Mol) to propose idealised binders for a sickle target. Then retrieve
+the **nearest existing drugs** to those proposals by chemical similarity (Tanimoto over Morgan
+fingerprints, or a learned molecular embedding). The retrieved neighbours — real, already-approved
+or clinical molecules — are the repurposing candidates. The generated molecule is never made; it
+only *defines a region of chemical space* worth searching. This is the user-proposed framing and
+it is sound: "generate the ideal, then find what we already have nearby."
+
+**Why it could add value.** It can point at scaffolds / regions a known-binder-seeded or
+brute-force docking sweep would miss, and it makes the target's binding requirements explicit as
+geometry rather than as a single reference ligand.
+
+**Honest caveats (why it's a *door*, not a commitment).**
+1. **Generated molecules are often synthetically unrealistic / invalid** — which is exactly why
+   they must be used only as beacons, never as candidates.
+2. **Similarity ≠ activity.** Activity cliffs mean a near-neighbour of a good binder can be inert.
+   So retrieved neighbours do **not** bypass validation — they must still be docked/scored (§12.3)
+   and clear the binding recovery test (§12.4). The generative step *proposes*; it does not *prove*.
+3. **Marginal-value question.** Directly docking the whole existing library (§12.3) already covers
+   chemical space. Whether generate-then-retrieve beats that — by efficiency or by surfacing
+   non-obvious scaffolds — is an open empirical question and needs a head-to-head before it earns
+   real investment.
+4. **Only as good as the pocket conditioning** of the generator, and the chemistry of the target.
+
+**Status:** explore only *after* the §12.3–12.4 docking harness works and is validated on the
+known binders — same discipline as everywhere else: prove the baseline, then test whether the
+fancier method adds anything.
--- a/data/raw/structures/.gitkeep
+++ b/data/raw/structures/.gitkeep
--- a/docs/data_sources.md
+++ b/docs/data_sources.md
@@ -61,9 +61,10 @@ Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
  38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
 - **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
  not-single-source rule).
- **Signature↔landmark overlap:** only 56/477 (12%) of the disease signature genes are LINCS
-  landmarks, so connectivity scoring (Week 3) uses a 30-up/26-down query. The erythroid hallmark
-  genes (CA1, AHSP, SLC4A1, HBG) are NOT landmarks. This is a key limitation for the recovery test.
+- **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978
+  landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger
+  space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from
+  LINCS entirely and remain unscoreable.
 - Reproduce: `week2_curate_drugset.py` → `week2_chembl.py` → download Level-5 GCTX →
  `week2_lincs_extract.py` → `week2_assemble.py`.

--- a/docs/gpu_plan.md
+++ b/docs/gpu_plan.md
@@ -0,0 +1,112 @@
+# GPU plan — ephemeral AF3-class co-folding for the binding track
+
+Goal: run AF3-class co-folding (Boltz-2 / DiffDock) on a GPU for the structure-binding track
+(§12), then **pay nothing when idle**. The work is *bursty* (a validation run, then a screen),
+the inputs are *tiny*, so the design optimises for zero idle cost, not for a persistent box.
+
+## What actually has to move (small!)
+
+| Thing | Size | How it gets to the GPU |
+|---|---|---|
+| Code | KB | `git clone` the gitea repo (or mount) |
+| Target structures (PDBs) | a few MB | in the repo / git |
+| Ligands (SMILES, drug set) | KB | in the repo / git |
+| **Model weights** (Boltz-2 / DiffDock) | **~2–6 GB** | downloaded once, **cached in a persistent volume** |
+| Results (poses, scores, RMSDs) | KB–MB | `git push` back / download |
+
+The 27 GB LINCS data is **not** part of this track — nothing big to upload. The only thing worth
+persisting is the model-weights cache (so we don't re-download = re-pay GPU time every run).
+
+## How the model weights persist (the cost-saver)
+
+A `modal.Volume` is a **named, cloud-backed filesystem that lives independently of any container
+or GPU** — it survives every teardown. Mounted into the function at `/weights`:
+
+- **Run 1:** `/weights` is empty → the model downloads weights there (the one-time slow cost).
+- **Run 2+:** the same Volume mounts with the files already present → download skipped → **no
+  GPU-billed seconds wasted re-fetching 5 GB.**
+
+Two things make it actually cache:
+1. **Point the downloader at the mount** (weights only persist if written under `/weights`):
+   `HF_HOME=/weights/hf` (HuggingFace), `TORCH_HOME=/weights/torch`, `boltz --cache /weights/boltz`.
+2. **Commit semantics:** writes persist on `weights.commit()` (modern Modal also auto-commits on a
+   clean exit); other containers see them after `weights.reload()`. Pattern: `reload()` → run →
+   `commit()`.
+
+The Volume itself costs pennies (~$/GB-month of storage), *separate from the GPU* — so caching ~5 GB
+of weights is near-free and saves real GPU time on every subsequent run.
+(Alternative: bake weights into the image at build time via `image.run_function(download)` — fastest
+cold start, but the image rebuilds when weights change. The skeleton uses the Volume approach.)
+
+## Provider choice
+
+| Option | Billing | Idle cost | "Kill" model | Best for |
+|---|---|---|---|---|
+| **Modal** (recommended) | per-second | **$0 (scales to zero)** | automatic — nothing to remember | bursty batch runs |
+| RunPod | per-minute, on-demand/spot | only while pod exists | manual `terminate` | interactive SSH box |
+| Vast.ai | per-minute spot | only while rented | manual destroy | cheapest, more fiddly |
+| Lambda / AWS-GCP spot | per-hour/second | until you stop it | manual stop | if you already have credits |
+
+**Recommend Modal.** You define the run as a function with a `gpu=` decorator; on call it
+provisions the GPU, runs, and releases it — **there is no GPU to forget to kill**, and you pay only
+for the seconds it ran. That *is* the "kill the GPU to save money" requirement, automated.
+
+## The lifecycle (Modal)
+
+1. **Define image** once: CUDA base + `boltz` (or DiffDock) + `rdkit`, `meeko`, `spyrmsd`, `gemmi`.
+2. **Weights volume**: `modal.Volume` mounted at `/weights`; the model downloads into it on first
+   run and is cached forever after (no re-download cost).
+3. **Run**: `modal run gpu/modal_app.py` → provisions GPU → runs the test → returns results to your
+   laptop. GPU released the moment the function returns.
+4. **Persist results**: write the returned scores/poses into `data/processed/binding/` and
+   `git commit` (small, text).
+5. **Teardown**: nothing to do — Modal scaled to zero. `modal app stop` only if a run is hung.
+
+RunPod alternative (if you want an interactive box): start pod → `git clone` → run → `git push`
+results → **`runpodctl remove pod <id>`** (or Stop in the UI). Set an **idle auto-terminate** so a
+forgotten box can't bleed money.
+
+## What runs on the GPU (in cost order — cheap validation first)
+
+- **Phase 1 — modality validation (~minutes, ~$1):** co-fold each known binder + 2 negative
+  controls (caffeine, hydroxyurea) into each target (Hb, PKR, HDAC2) **with the binding-mode
+  cofactors co-folded in** — HDAC2 + catalytic **Zn**, PKR + **FBP/Mg**, Hb + **heme** (as CCD
+  ligand entries) — and check the **known binder has the highest Boltz-2 P(binder)** for its own
+  target. This is the discrimination Vina couldn't manage precisely because it can't model
+  Zn-chelation / cofactors. (Ranking uses P(binder), not the raw affinity value, whose
+  sign is version-dependent.) Pose-RMSD vs crystal is a deeper check but needs receptor
+  superposition (align predicted protein to crystal, transform ligand) — a later refinement. If
+  Phase 1 passes, the modality is real; if not, stop before paying for a screen.
+- **Phase 2 — screen (~tens of minutes, a few $):** run the ~300-drug set (or a focused subset)
+  against the sickle targets; rank by Boltz-2 predicted affinity; redo the §12.4 positive-control
+  recovery test. Output a ranked CSV, same shape as the connectivity `ranked_candidates`.
+
+## Model choice
+
+- **Boltz-2** (MIT, pip-installable) — predicts the protein–ligand complex **and** a binding
+  affinity → directly gives a rankable score. Primary choice. Fits a 24–40 GB GPU for these
+  single-domain targets.
+- **DiffDock-L** — lighter, pose-only (needs a separate scorer); fallback if Boltz memory is tight.
+- GPU: an **L4 (24 GB, ~$0.6–0.8/hr)** or **A10/L40S (24–48 GB)** is plenty; no multi-GPU, no A100
+  needed for these sizes.
+
+## Cost controls (the save-money checklist)
+
+- Serverless (Modal) → **zero idle cost** by construction; or an **idle-timeout** auto-kill on a box.
+- **Cache weights** in a persistent volume — re-downloading 5 GB on a $1/hr GPU is wasted money.
+- **Validate on one target before screening** — don't pay for a 300-drug screen until Phase 1 passes.
+- Prefer **spot/interruptible** for the batch screen (Phase 2 is restartable).
+- Keep prep (Meeko/RDKit) and result-scoring on the **laptop**; only the model forward pass needs GPU.
+- Estimated total to validate + a first screen: **~$5–15**, not a standing bill.
+
+## Repo integration
+
+- `gpu/modal_app.py` — the Modal app (skeleton committed alongside this plan).
+- Results land in `data/processed/binding/` (gitignored) + a small committed summary.
+- Pin model + weights version in the image for reproducibility.
+
+## Next step
+
+Scaffold `gpu/modal_app.py` for Phase-1 validation (3 known binders), do a dry run locally
+(`modal run --detach`? no — just `modal run`), confirm cost, then Phase 2. Requires a Modal account
+ `pip install modal` + `modal token new` (one-time auth).
--- a/docs/known_limitations.md
+++ b/docs/known_limitations.md
@@ -34,20 +34,28 @@ Source: PLAN.md §9.
 7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
   to test, not discoveries. Use careful language in any write-up.

-8. **Only 12% of the signature is LINCS-scorable (56/477 genes).** The 978 landmark genes (from
-   cancer cell lines) miss the erythroid hallmark genes (CA1, AHSP, SLC4A1, HBG). Connectivity
-   scoring runs on a thin inflammation/metabolic slice — the single biggest driver of the
-   recovery-test failure. v2 fix: signature prediction or a mechanism graph to score the other 88%.
+8. **Gene-space bottleneck (v1 → fixed in v1.1).** v1 scored on only the 978 landmark genes (12%
+   signature overlap) — the main driver of the v1 failure. v1.1 uses the full 12,328-gene space
+   (85% overlap) and recovers hydroxyurea. HBG1/HBG2 remain absent from LINCS entirely.

-## Recovery test outcome (Week 4)
+9. **No reference-signature library for tau.** Textbook CMap tau saturated at ±100 (a coherent
+   query always out-connects random gene sets). v1.1 substitutes a per-drug specificity z-score.
+   Proper tau needs a library of real reference signatures — a v2 / curated-data item.

-The MVP **failed** all three pre-registered criteria on the primary raw ranking (hydroxyurea
-rank 40/top 13%; L-glutamine rank 100/WTCS=0; 1/5 negative controls in bottom half). The failure
-is fully attributable to signature/assay data limitations above, not the matching algorithm. See
+10. **Negative-control criterion may be invalid for connectivity scoring.** Unrelated drugs
+    (norethindrone, ciprofloxacin) rank as top specific reversers — connectivity measures
+    expression reversal, not therapeutic relatedness.
+
+## Recovery test outcome
+
+Pre-registered test (**v1, confirmatory**): **FAILED** all three criteria (hydroxyurea rank
+40/top 13%; L-glutamine rank 100; 1/5 negative controls bottom-half). Post-hoc (**v1.1,
+exploratory**): hydroxyurea recovers to rank 18 (top 6%, passes), but L-glutamine (rank 213, does
+not reverse) and negative controls (2/5) still fail → overall still FAIL. See
 `recovery_test_report.md`.

-| Drug | Issue | Handling |
+| Drug | Issue | v1.1 status |
 |---|---|---|
-| hydroxyurea | HbF mechanism not in scorable gene space | scored (rank 40); recovered only by prior-weighted ranking |
-| L-glutamine | signature present but WTCS ambiguous (=0) | scored (rank 100); no reversal signal |
-| all 300 | had LINCS signatures | 0 marked "not scored" — coverage was not the issue; specificity was |
+| hydroxyurea | needed the full gene space | rank 18 (top 6%) — recovered post-hoc |
+| L-glutamine | metabolite, no reversal signal (positive connectivity) | rank 213 — genuine negative |
+| neg controls | reverse the generic inflammation signature | 2/5 bottom-half — criterion questionable |
--- a/docs/recovery_test_report.md
+++ b/docs/recovery_test_report.md
@@ -1,7 +1,7 @@
 # Sickle Cell Repurposing — Recovery Test Report

-> **Status: COMPLETE.** Reproduce with `scripts/week1_*` → `week2_*` → `week3_scoring.py` →
-> `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.
+> **Status: COMPLETE (v1 confirmatory + v1.1 exploratory).** Reproduce with `scripts/week1_*` →
+> `week2_*` → `week3_scoring.py` → `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.

 ## Pre-registered success criteria

@@ -12,118 +12,116 @@ The MVP passes if:
  missing LINCS signature, **AND**
 - At least **4 of 5** negative-control drugs rank in the **bottom half**.

-_Pre-registered in the scaffold commit (`b731478`) before any scoring was run. Primary ranking
-= raw connectivity. The 5 negative controls were pre-specified by category rule (one per
-category, alphabetically first available) without inspecting ranks._
+_Pre-registered in the scaffold commit (`b731478`) before any scoring. **Primary (confirmatory)
+analysis = v1**: 978 landmark genes, weighted connectivity score (WTCS). The 5 negative controls
+were pre-specified by category rule without inspecting ranks._

 ---

 ## Section 1 — Methodology

-We built a sickle cell disease signature from **two independent whole-blood microarray studies**
-(GSE35007, Illumina, SS vs AA; GSE16728, Affymetrix, patient vs control), keeping the **671
-genes concordant** (q<0.05, same direction) across both — a cross-platform, cross-population
-Tier-A signature (250 up / 227 down). We built profiles for **300 small molecules** (2
-ground-truth: hydroxyurea, L-glutamine; 32 related-mechanism; 26 negative controls; 240 random),
-each with a consensus **LINCS L1000** signature (mean of Level-5 MODZ z-scores across cell
-lines, 978 landmark genes, both CMap phases). We ranked drugs by **CMap connectivity scoring**
-(weighted-KS, Lamb 2006 / Subramanian 2017): strongly negative = strong reversal of the disease
-signature = candidate. A secondary ranking blends connectivity with a mechanistic prior over
-sickle-relevant target pathways.
+A sickle cell disease signature was built from **two whole-blood microarray studies** (GSE35007
+Illumina SS-vs-AA; GSE16728 Affymetrix patient-vs-control), keeping the **671 genes concordant**
+across both (q<0.05, same direction) → a cross-platform Tier-A signature (250 up / 227 down).
+Profiles were built for **300 small molecules** (2 ground-truth; 32 related-mechanism; 26
+negative controls; 240 random), each with a **LINCS L1000** consensus signature (mean Level-5
+MODZ across cell lines, both CMap phases). Drugs were ranked by **CMap connectivity scoring**
+(Kolmogorov-Smirnov, Lamb 2006 / Subramanian 2017): negative = reversal = candidate.

-## Section 2 — Recovery test result — **FAIL** (primary ranking)
+**v1 (pre-registered/confirmatory):** scored on the 978 landmark genes with WTCS.
+**v1.1 (post-hoc/exploratory):** after v1 failed, two changes were made to diagnose why — (a)
+score on the full **12,328-gene** space (landmark overlap 12% → 85%, bringing the erythroid
+markers in); (b) add a **per-drug specificity z-score** (`spec_z`): how many SDs the real
+connectivity is below a null of 1,000 random queries of the same size against that drug. Because
+these changes followed inspection of the v1 result, **v1.1 is exploratory, not a confirmatory
+test of the pre-registered hypothesis.**

-| Drug | Rank | Percentile | Pass? |
-|---|---|---|---|
-| Hydroxyurea | 40 / 300 | top 13.3% | ❌ (needs top 30) |
-| L-glutamine | 100 / 300 | top 33.3% | ❌ (WTCS=0, ambiguous; has a signature so not "missing") |
+## Section 2 — Recovery test result

-Negative controls (pre-specified; expected: bottom half):
+| Criterion | v1 (confirmatory) | v1.1 (exploratory) |
+|---|---|---|
+| Hydroxyurea top-10% (≤30) | rank **40** (13.3%) ❌ | rank **18** (6.0%) ✅ |
+| L-glutamine top-25% (≤75) | rank 100, WTCS=0 ❌ | rank 213, spec_z=+0.98 ❌ |
+| ≥4/5 neg controls bottom-half | 1/5 ❌ | 2/5 ❌ |
+| **Overall** | **FAIL** | **FAIL** (but hydroxyurea recovered) |

-| Control | Category | Rank | Bottom half? |
-|---|---|---|---|
-| clotrimazole | antifungal | 89 | ❌ |
-| astemizole | antihistamine | 291 | ✅ |
-| azithromycin | antibiotic | 82 | ❌ |
-| ethinyl-estradiol | hormone | 98 | ❌ |
-| caffeine | misc | 84 | ❌ |
+v1.1 negative controls: clotrimazole 258 ✅, astemizole 211 ✅, azithromycin 142 ❌,
+ethinyl-estradiol 114 ❌, caffeine 77 ❌.

-**Only 1/5 negative controls in the bottom half (need ≥4).**
+**Honest reading.** The **pre-registered test FAILED (v1).** The post-hoc v1.1 changes
+**recover hydroxyurea** (rank 40 → 18, passing top-10%) — strong evidence that the v1 failure was
+driven by the 978-landmark bottleneck, not the algorithm. But two failures survive into v1.1, and
+both are now precisely diagnosed:

-**Overall: FAIL on all three pre-registered criteria.** This is reported as-is, without
-adjustment. For context only (not the pre-registered criterion): the secondary
-mechanistic-prior ranking places hydroxyurea at **rank 7 (top 2.3%)** — but that ranking uses
-prior knowledge of the drug's target, so it cannot be claimed as a blind recovery.
+1. **L-glutamine does not reverse the signature** (positive connectivity, spec_z=+0.98). This is
+   intrinsic to its LINCS data — a metabolite with no reversal signal — not a coverage gap. More
+   genes cannot fix it.
+2. **The negative-control criterion is arguably invalid for connectivity scoring.** Two
+   "negative controls" (norethindrone, ciprofloxacin) rank in the top 3 by spec_z. Connectivity
+   measures *expression reversal*, not *therapeutic relatedness* — an antibiotic or contraceptive
+   can still down-regulate the inflammation genes that dominate the scorable signature. The test
+   design conflates the two.

-**Why it failed — the honest diagnosis.** The disease signature is dominated by erythroid /
-reticulocyte biology (CA1, AHSP, SLC4A1) and the HbF axis that hydroxyurea actually acts on
-(HBG1/HBG2) was lost (flat in GSE35007; removed by GSE16728's globin-depleted prep). Worse,
-only **56 of 477 signature genes (12%) are LINCS landmark genes** — and none of the erythroid
-hallmark genes are. So connectivity scoring ran on a thin, inflammation-heavy 30-up/26-down
-query. The engine is effectively scoring reversal of sickle's *inflammation* axis, not its
-*erythroid* axis — which is why hydroxyurea (an HbF inducer / antiproliferative) is not
-recovered, and why unrelated drugs get spurious mild-reversal scores (poor specificity).
+A note on the calibration: textbook CMap **tau** (percentile vs a reference population) was
+implemented but **saturated at ±100** here, because a coherent real query always out-connects
+random gene sets — proper tau needs a library of *real* reference signatures, which this MVP
+lacks. The continuous `spec_z` is the workable substitute.

-## Section 3 — Top 10 candidates (raw connectivity)
+## Section 3 — Top 10 candidates (v1.1 spec_z)

-| Rank | Drug | Score | Known target / mechanism | Plausibility |
+| Rank | Drug | spec_z | Inclusion | Read |
 |---|---|---|---|---|
-| 1 | laropiprant | −0.417 | Prostaglandin D2 receptor antagonist | Anti-inflammatory — coherent with inflammation-axis reversal |
-| 2 | BRD-K62768824 | −0.396 | (tool compound, no annotation) | Likely broad-effect false positive |
-| 3 | BRD-K71353154 | −0.393 | (tool compound) | Likely false positive |
-| 4 | lisinopril | −0.358 | ACE inhibitor | **Non-obvious; see §4** |
-| 5 | BRD-K53443165 | −0.358 | (tool compound) | Likely false positive |
-| 6 | talnetant | −0.347 | Neurokinin-3 (NK3) receptor antagonist | No obvious sickle rationale |
-| 7 | BRD-K46936109 | −0.342 | (tool compound) | Likely false positive |
-| 8 | lawsone | −0.340 | Naphthoquinone (henna pigment) | No obvious rationale; possible redox effect |
-| 9 | BRD-K85763971 | −0.338 | (tool compound) | Likely false positive |
-| 10 | BRD-K36516410 | −0.323 | (tool compound) | Likely false positive |
+| 1 | reserpic-acid | −3.80 | random | reserpine metabolite; non-obvious |
+| 2 | norethindrone | −3.78 | **negative control** | false positive (see §2) |
+| 3 | ciprofloxacin | −3.61 | **negative control** | false positive |
+| 4 | resveratrol | −3.46 | related-mechanism | antioxidant studied in SCD — coherent |
+| 5 | BRD-K57490754 | −3.37 | random | tool compound |
+| 6 | anastrozole | −3.27 | random | aromatase inhibitor |
+| 7–10 | BRD-* / palmitoylethanolamide | ~−3.1 | random | mostly tool compounds |

-As anticipated (PLAN §9.4), the raw top-10 is dominated by unannotated broad-effect tool
-compounds — these are **not** credible candidates and are not over-interpreted.
+That two negative controls outrank hydroxyurea is the single most informative result here — see §4.

-## Section 4 — One non-obvious candidate worth investigating
+## Section 4 — One non-obvious result worth investigating

-**Lisinopril (ACE inhibitor), rank 4.** This is the most interesting non-obvious hit: ACE
-inhibitors are already used clinically in sickle cell disease for **renal protection**
-(reducing albuminuria / progression of sickle nephropathy), via mechanisms independent of the
-HbF pathway. Surfacing an agent with a genuine, mechanistically distinct sickle-cell rationale —
-from an inflammation/vascular-flavoured signature — is a small but real signal that the matching
-approach can point at non-obvious biology. **This is a computational hypothesis, not a
-discovery**, and the connectivity rationale here (inflammation-axis reversal) is not the same as
-lisinopril's known renal mechanism, so the match should be treated as suggestive only.
+The most useful finding is **not** a candidate drug but the **negative-control failure**:
+unrelated drugs (norethindrone, ciprofloxacin) score as strong specific reversers. This is a
+real, generalizable lesson — for a signature whose *scorable* portion is generic
+inflammation/metabolic genes, connectivity rewards any broad transcriptional perturbation that
+touches those genes. The honest implication: **this signature is not specific enough to
+discriminate true repurposing candidates from incidental expression reversers.** Of the
+plausibly-real hits, **resveratrol (rank 4)** — an antioxidant with prior sickle cell literature
+— is the most defensible, but it is a hypothesis, not a discovery.

 ## Section 5 — Honest limitations

-1. **Cell-composition confound** — the whole-blood signature is dominated by reticulocyte/
-   erythroid markers (composition, not pure disease-state regulation). v2 needs deconvolution.
-2. **Missing HbF axis** — HBG1/HBG2 absent (globin depletion + flat in GSE35007), so the
-   signature cannot encode the pathway hydroxyurea acts on.
-3. **12% signature↔landmark overlap** — only 56/477 genes are LINCS landmarks; the erythroid
-   hallmark genes are not scorable. The query collapses to a generic inflammation/metabolic slice.
-4. **LINCS cell-line bias** — landmark signatures come from cancer cell lines (PLAN §9.2); poorly
-   suited to a blood disease.
-5. **Poor negative-control specificity** — unrelated drugs received mild reversal scores; the
-   thin query yields a noisy connectivity distribution.
-6. **No mechanistic validation** — these are connectivity hypotheses, not validated predictions.
+1. **Pre-registered test failed; the pass is post-hoc.** v1.1's hydroxyurea recovery is
+   exploratory and must be re-validated on a held-out disease before any claim is made.
+2. **Missing HbF axis** — HBG1/HBG2 are absent from LINCS entirely (not just landmarks), so the
+   pathway hydroxyurea acts on can never be scored by this method.
+3. **Signature specificity** — scorable genes are inflammation/metabolic; negative controls
+   reverse them too. Connectivity ≠ therapeutic relatedness.
+4. **Cell-composition confound** — the whole-blood signature is reticulocyte-dominated.
+5. **LINCS cancer-cell-line bias**, and **no reference-signature library** for proper tau.
+6. **No mechanistic validation** — all hits are computational hypotheses.

 ## Section 6 — What v2 would fix

- **Cell-type deconvolution** of the disease signature to separate disease-state regulation from
-  composition, recovering specificity.
- **A non-globin-depleted, RNA-seq whole-blood study** to retain the HbF axis.
- **Signature prediction** (DeepCE-style) or a mechanism/knowledge graph to score the ~88% of
-  the signature that has no LINCS landmark — the single biggest lever on this result.
- **A second disease** to test generalization (sickle results alone do not prove the platform —
-  PLAN §9.5).
+- **A reference-signature library** to make tau (proper specificity calibration) work — the
+  single biggest fix to the negative-control problem, and a direct use of the curated-data moat.
+- **Cell-type deconvolution** + a non-globin-depleted RNA-seq study to recover a more specific,
+  HbF-containing signature.
+- **Signature prediction / mechanism graph** to score genes with no LINCS measurement.
+- **A second disease** to test generalization and to honestly re-validate the v1.1 method
+  (PLAN §9.5).

 ---

 ### Bottom line

-The pipeline is reproducible end-to-end and the method is sound, but on this signature it **does
-not recover the known sickle cell drugs**. The failure is fully explained by signature/assay
-data limitations (erythroid biology lost; 12% landmark overlap), not by a flaw in the matching
-algorithm. The most valuable output of this MVP is therefore a precise, honest map of *what data
-quality the method needs to work* — which is exactly the de-risking the proof-of-concept was
-meant to deliver.
+The pre-registered recovery test **failed**. Post-hoc diagnosis shows the dominant cause was a
+fixable gene-space bottleneck — correcting it **recovers hydroxyurea** — but also surfaces a
+deeper, genuine limitation: this whole-blood signature is **not specific enough** for
+connectivity scoring to separate real candidates from incidental reversers (negative controls
+rank at the top). The MVP's real deliverable is a precise, honest map of *what it takes to make
+this method work*: a more specific (deconvolved, HbF-containing) signature and a reference library
+for calibration — exactly the curated-data investments the platform thesis is built on.
--- a/docs/results/HDAC2_vorinostat_pred.pdb
+++ b/docs/results/HDAC2_vorinostat_pred.pdb
--- a/docs/results/phase1_affinity.csv
+++ b/docs/results/phase1_affinity.csv
@@ -0,0 +1,10 @@
+target,ligand,affinity,prob_binder,is_known_binder
+hemoglobin,voxelotor,0.16870097815990448,0.4566290080547333,True
+hemoglobin,caffeine,1.5093848705291748,0.34217822551727295,False
+hemoglobin,hydroxyurea,0.6418474316596985,0.06856877356767654,False
+PKR,mitapivat,1.187251329421997,0.3238581717014313,True
+PKR,caffeine,2.992832899093628,0.2582802474498749,False
+PKR,hydroxyurea,2.444709300994873,0.39834722876548767,False
+HDAC2,vorinostat,-1.7771220207214355,0.9993993043899536,True
+HDAC2,caffeine,2.3925399780273438,0.11900843679904938,False
+HDAC2,hydroxyurea,1.284231424331665,0.7654422521591187,False
--- a/docs/results/screen_HDAC2_pilot.csv
+++ b/docs/results/screen_HDAC2_pilot.csv
@@ -0,0 +1,25 @@
+rank,drug,P_binder,affinity,inclusion_reason
+1,trichostatin-a,0.9994845986366272,-2.883908271789551,related_mechanism
+2,vorinostat,0.9994641542434692,-1.7357702255249023,related_mechanism
+3,panobinostat,0.9983962774276733,-2.4892501831054688,related_mechanism
+4,belinostat,0.9970909357070923,-2.102327823638916,related_mechanism
+5,scriptaid,0.9957252740859985,-1.5838574171066284,related_mechanism
+6,mocetinostat,0.9910210371017456,-1.3829972743988037,related_mechanism
+7,entinostat,0.9903219938278198,-0.6888977289199829,related_mechanism
+8,apicidin,0.9882931113243103,-2.2579169273376465,related_mechanism
+9,valproic-acid,0.9134805202484131,0.3317195773124695,related_mechanism
+10,hydroxyurea,0.77649986743927,1.2352395057678223,ground_truth
+11,curcumin,0.6803375482559204,0.9697343707084656,related_mechanism
+12,sulforaphane,0.5829939842224121,0.6017141342163086,related_mechanism
+13,resveratrol,0.5800595283508301,1.1647570133209229,related_mechanism
+14,tadalafil,0.5426475405693054,0.21663367748260498,related_mechanism
+15,glutamine,0.4674023389816284,2.029467821121216,ground_truth
+16,quercetin,0.45189985632896423,0.34488344192504883,related_mechanism
+17,romidepsin,0.4316568374633789,-0.8200547099113464,related_mechanism
+18,decitabine,0.38500306010246277,0.8381982445716858,related_mechanism
+19,azacitidine,0.31987687945365906,0.6874549388885498,related_mechanism
+20,sildenafil,0.15390564501285553,0.26623794436454773,related_mechanism
+21,thalidomide,0.1190553605556488,1.9194087982177734,related_mechanism
+22,pomalidomide,0.11849743872880936,1.7003729343414307,related_mechanism
+23,lenalidomide,0.09446469694375992,1.9872398376464844,related_mechanism
+24,dexamethasone,0.031893711537122726,0.5724264979362488,related_mechanism
--- a/docs/structure_binding_notes.md
+++ b/docs/structure_binding_notes.md
@@ -0,0 +1,175 @@
+# Structure-based binding track — working notes
+
+Branch `structure-based-binding`. Implements PLAN §12. Baseline-first, start with the two cleanest
+targets (Hemoglobin + PKR), de-risk the harness before scaling.
+
+## Status (2026-06-23)
+
+**Toolchain check (PLAN §12.6 pitfall 4, confirmed real):**
+- ✅ RDKit installs on ARM Mac — ligand side ready.
+- ❌ AutoDock Vina does NOT pip-install on ARM Mac; no docking binary available. Docking (§12.3)
+  is **blocked on toolchain** — must resolve via conda/micromamba (`vina`/`smina`), a GPU AF3-class
+  model (Boltz-2/Chai-1/DiffDock), or an x86 Vina binary under Rosetta.
+
+**Structures obtained:** `5E83` (hemoglobin + voxelotor), `8XFD` (PKR + mitapivat) in
+`data/raw/structures/`.
+
+**Step 0 — ligand-based retrieval baseline (`scripts/binding_ligand_baseline.py`):**
+RDKit Tanimoto of our 300 drugs vs known sickle binders.
+- Engine VALIDATED on in-set classes: `decitabine`→azacitidine (0.62); `vorinostat`→scriptaid
+  (0.42), belinostat (0.28). Correctly clusters DNMT1 / HDAC HbF-inducers.
+- But voxelotor / mitapivat have **no analog** in our set (max Tanimoto ~0.20–0.26). A 300-drug
+  library is too sparse to contain look-alikes of distinctive scaffolds.
+
+**Takeaways:**
+1. Ligand retrieval works but needs a **bigger drug library** to be useful for distinctive targets.
+2. The targets without in-set analogs (Hb, PKR) need **actual docking** (§12.3) — which scores
+   binding directly, no look-alike required. That is the gating next step, and it needs the
+   toolchain solved.
+
+## Step 1 — docking baseline (2026-06-24)
+
+**Toolchain SOLVED on ARM Mac:** AutoDock Vina 1.2.5 mac binary (`tools/vina`, runs under Rosetta)
+ open-babel (brew) for prep. Docking runs end-to-end (`scripts/dock_positive_controls.py`).
+Co-crystal ligands identified: 5L7 = voxelotor (5E83), WV2 = mitapivat (8XFD).
+
+**Positive-control cross-docking — inconclusive, and instructively so.** Affinities (kcal/mol):
+
+| ligand | hemoglobin | PKR |
+|---|---|---|
+| voxelotor | −8.1 | −9.3 |
+| mitapivat | −10.0 | −11.2 |
+| decitabine | −6.6 | −7.0 |
+| hydroxyurea | −3.9 | −3.6 |
+| caffeine | −6.1 | −6.4 |
+
+The scores rank almost perfectly by **molecular size** (mitapivat > voxelotor > decitabine/caffeine
+> hydroxyurea) in *both* pockets — mitapivat wins even on hemoglobin, which it doesn't target. So
+raw Vina affinity is confounded by ligand size and per-pocket stickiness; it cannot yet
+distinguish target-specific binding. This is the **docking analog of the connectivity specificity
+problem** — raw scores carry a systematic bias (size here, broadness there) that masquerades as
+signal. voxelotor *does* dock to Hb (−8.1, a real score); the cross-target test just isn't the
+right validation.
+
+## Step 2 — redocking-RMSD validation FAILS across the board (2026-06-24)
+
+Redocked each co-crystal ligand into its own structure (`scripts/dock_validate.py`); RMSD vs
+crystal pose via obrms:
+
+| redock | RMSD | note |
+|---|---|---|
+| voxelotor → Hb (5E83) | NA | covalent binder (Schiff base, αVal1) — out of scope §12.7 |
+| mitapivat → PKR (8XFD) | 8.2 Å | allosteric, cofactor (FBP/Mg) stripped |
+| **vorinostat → HDAC2 (4LXZ, Zn kept)** | **7.9 Å** | classical non-covalent target — should have worked |
+
+**The clean target also failing means this is a systematic PIPELINE-QUALITY problem, not target
+choice.** The cheap Vina + open-babel setup produces scores but does not reproduce known binding
+geometry, so its affinities are not yet trustworthy. Ligand efficiency (affinity / heavy atoms)
+also doesn't fix it — it over-corrects, ranking tiny hydroxyurea (−0.78) "best".
+
+Likely causes (in priority order):
+1. **Low-quality receptor prep** — open-babel `-xr` is not production docking prep. Need
+   AutoDockTools `prepare_receptor` or **Meeko** + `reduce`/pdb2pqr for protonation, charges, and
+   proper AutoDock atom typing.
+2. **Ligand prep** — should use Meeko (correct rotatable bonds / typing), not bare obabel `--gen3d`.
+3. **RMSD metric** — obrms superimposes before RMSD; redocking validation wants symmetry-corrected
+   RMSD **in place** (receptor frame). Worth confirming with an in-place metric.
+
+**Honest takeaway:** consistent with the whole project — the *quick* version of each method runs
+but doesn't survive honest validation. Credible structure-based docking needs production prep
+tooling (Meeko/ADFR), which is the real next investment for this track.
+
+## Step 3 — production prep helps, but classical docking is the wrong tool here (2026-06-24)
+
+`scripts/dock_production.py`: Meeko ligand prep (proper rotatable-bond/AD typing) + in-place
+symmetry-corrected RMSD (spyrmsd, not obrms which superimposes). On the clean HDAC2/vorinostat
+target (Zn kept):
+
+- **7.9 Å → 4.76 Å** with proper ligand prep + correct metric. Prep and metric genuinely mattered.
+- But still FAIL (>2 Å). The residual is the deeper problem: **vorinostat binding is defined by its
+  hydroxamate chelating the catalytic Zn, and Vina has no metal-coordination term** — it cannot
+  score the interaction that determines the pose.
+
+**The real finding: sickle's druggable targets bind via non-classical chemistry that classical
+docking handles poorly** — Hb/voxelotor (covalent), PKR/mitapivat (allosteric + cofactor),
+HDAC/vorinostat (Zn chelation). This is the target landscape, not bad luck. AutoDock Vina is the
+wrong tool for it.
+
+**Redirect:** the modality that DOES handle covalent/metal/induced-fit binding is **data-driven
+AF3-class co-folding** (Boltz-2 / Chai-1 / DiffDock — they learn these modes from the PDB). That is
+the indicated next tool for this disease — and it's gated by the **24 GB local memory ceiling**
+(PLAN §12.6 pitfall 4): needs a cloud GPU or a bigger box. The "GPU breaks all-local" prediction is
+now the binding constraint of the whole track.
+
+## Step 4 — AF3-class co-folding (Boltz-2 on Modal GPU) WORKS on the Zn case (2026-06-24)
+
+Ran Phase 1 on Modal (L4, serverless) — `gpu/modal_app.py`, ~$0.60–0.80. Co-folded each known
+binder + 2 negatives into each target WITH the binding-mode cofactors (HDAC2+Zn, PKR+FBP/Mg,
+Hb+heme). Ranked by Boltz-2 P(binder):
+
+| target | known binder P(binder) | negatives | verdict |
+|---|---|---|---|
+| **HDAC2 (+Zn)** | vorinostat **0.9994** | caffeine 0.12, hydroxyurea 0.77 | **PASS — decisive** |
+| hemoglobin (+heme) | voxelotor 0.46 | caffeine 0.34, hydroxyurea 0.07 | PASS (weak) |
+| PKR (+FBP/Mg) | mitapivat 0.32 | hydroxyurea 0.40 (beat it) | FAIL |
+
+**Headline: HDAC2 + zinc — the exact case Vina failed (7.9 Å redock, no metal term) — co-folding
+NAILS** (vorinostat 0.999 vs negatives ~0.1). The data-driven model handles the Zn-chelation
+chemistry classical docking could not. The modality pivot is validated on its decisive test.
+The first clear positive result in the project after a long string of honest negatives.
+
+Notes: (1) affinity sign confirmed — vorinostat has the lowest affinity_pred_value (−1.78,
+strongest) AND highest P(binder); ranking by max(affinity) would be backwards (the P(binder) fix
+was necessary). (2) 2/3: Hb weak (covalent/tetramer, as predicted), PKR miss (allosteric pocket).
+(3) Engineering: had to add cuequivariance kernels to the image; serialize (max_containers=1) so
+the weights download once (parallel containers corrupted the checkpoint).
+
+## Step 5 — pose-RMSD confirmation: co-folding reproduces the geometry (2026-06-24)
+
+`scripts/pose_rmsd.py`: superpose predicted HDAC2 onto crystal 4LXZ (Ca), transform the predicted
+vorinostat + Zn, compare poses (spyrmsd, in-place).
+
+| metric | co-folding | classical Vina |
+|---|---|---|
+| protein fold, Ca RMSD over 366 res | **0.14 A** | — |
+| **vorinostat pose RMSD** | **0.21 A** PASS | 7.9 A FAIL |
+| catalytic Zn placement | 2.73 A (minor) | no metal term |
+
+Co-folding reproduces the fold AND the vorinostat binding pose to **0.21 A (crystal-accurate)** on
+the exact Zn-chelation case Vina was off by 7.9 A. HDAC2 is now validated on BOTH axes: affinity
+(P(binder)=0.999) and geometry (0.21 A). Minor blemish: the Zn ion itself lands ~2.7 A off (ligand
+perfect, metal slightly less). The structure-binding modality is comprehensively validated on its
+decisive metal-coordination case.
+
+## Step 6 — Phase 2 screen pilot (HDAC2): recovers the inhibitor class decisively (2026-06-24)
+
+`modal run gpu/modal_app.py::screen --limit 24` — co-fold 24 drugs vs HDAC2 (+Zn), parallel
+(~10-wide, cached weights), ranked by P(binder). ~$1.3.
+
+**Top 9 = all HDAC inhibitors** (trichostatin-A, vorinostat, panobinostat, belinostat, scriptaid,
+mocetinostat, entinostat, apicidin all P≥0.99; valproic-acid 0.91), then a clean drop to
+hydroxyurea 0.78 and non-HDAC drugs sinking to dexamethasone 0.03. The model captures the
+**structure-activity gradient**: potent Zn-chelating hydroxamates ~0.99 vs weak fatty-acid
+valproate 0.91 vs DNMT inhibitors / IMiDs / steroid near 0. The screen recovers the right
+pharmacological class at scale.
+
+**Honest false negative:** romidepsin (potent HDAC inhibitor) ranked low (0.43) — it is a cyclic
+depsipeptide PRODRUG (must be reduced to expose its Zn-binding thiol), which co-folding doesn't
+model. The screen mishandles non-standard chemotypes (prodrugs, macrocycles).
+
+The screening pipeline is validated. Next: run the full set (incl. the 240 random + negatives) to
+hunt for NON-obvious HDAC2 binders (the actual discovery run), ~$15-20.
+
+## Next steps
+- [ ] Full screen (300 drugs) vs HDAC2 — discovery run for non-obvious binders.
+- [ ] Investigate PKR: allosteric site may need the full assembly / better pocket definition.
+- [ ] Phase 2 screen: rank the ~300-drug set against HDAC2 (the validated target) by P(binder);
+  positive-control recovery test at screen scale.
+- [ ] Add a one-time weight-warmup function so post-cache runs go back to fast parallel safely.
+- [ ] (optional) Salvage one classical Vina case: PKR with FBP/Mg cofactors RETAINED, to confirm
+  the harness can validate on a non-metal sickle target.
+- [ ] Production receptor prep (Meeko mk_prepare_receptor + protonation) if staying with Vina.
+- [ ] §12.9 generative beacon — only after a validated scoring function exists.
+
+> **Hardware note:** this machine is **24 GB** unified memory (not the 96 GB PLAN §2 assumed),
+> which caps local AF3-class model inference. Classical docking (above) is unaffected.
--- a/gpu/modal_app.py
+++ b/gpu/modal_app.py
@@ -0,0 +1,329 @@
+"""Ephemeral GPU runner for AF3-class co-folding (PLAN §12, docs/gpu_plan.md).
+
+Serverless: `modal run gpu/modal_app.py` provisions a GPU, runs Phase 1, releases the GPU — zero
+idle cost. Model weights cache in a persistent Volume so we never re-pay GPU time to re-download.
+
+Phase 1 (positive-control recovery test, §12.4): co-fold each known binder + a couple of negative
+controls against each sickle target and rank by Boltz-2 predicted affinity. The known binder
+should win its own target — the test Vina couldn't pass on metal/covalent/allosteric modes.
+Affinity ranking avoids the receptor-alignment that pose-RMSD would need (a later refinement).
+
+Setup (one-time): `pip install modal && modal token new`.
+Run: `modal run gpu/modal_app.py`.
+
+Helpers below the GPU function run locally (no GPU) and are import-safe for testing.
+"""
+
+from __future__ import annotations
+
+import json
+import subprocess
+from pathlib import Path
+
+import modal
+
+app = modal.App("reverso-binding")
+
+image = (
+    modal.Image.debian_slim(python_version="3.12")
+    .apt_install("git", "wget")
+    # Boltz-2 needs NVIDIA cuequivariance kernels (cuda 12) for inference, plus rdkit/numpy.
+    .pip_install("boltz", "cuequivariance-torch", "cuequivariance-ops-torch-cu12", "rdkit", "numpy")
+)
+weights = modal.Volume.from_name("reverso-binding-weights", create_if_missing=True)
+WEIGHTS = "/weights"
+
+# target -> (PDB id, crystal-ligand resname, drug name, cofactor/metal CCD codes to co-fold).
+# The cofactors are the binding-mode determinants Vina couldn't model: HDAC2 needs the catalytic
+# Zn (vorinostat chelates it), PKR the allosteric FBP + Mg, hemoglobin the heme. Co-folding them
+# in as CCD ligands is the whole point of the AF3-class pivot. The same cofactors are present when
+# co-folding the negatives into that target, for a fair comparison.
+TARGETS = {
+    "hemoglobin": ("5E83", "5L7", "voxelotor", ["HEM"]),
+    "PKR": ("8XFD", "WV2", "mitapivat", ["FBP", "MG"]),
+    "HDAC2": ("4LXZ", "SHH", "vorinostat", ["ZN"]),
+}
+NEGATIVES = ["caffeine", "hydroxyurea"]
+
+# Honest limitation: hemoglobin's voxelotor site sits at the tetramer centre and the bond is
+# covalent (Schiff base) — a single-chain + heme model only approximates it, so Hb is the weak
+# case. HDAC2 (Zn chelation) and PKR (allosteric + cofactor) are the real tests of whether
+# co-folding handles the modes classical docking could not.
+
+
+# --------------------------------------------------------------------------- local helpers (CPU)
+
+def fetch_pdb(pdb: str, struct_dir: Path = Path("data/raw/structures")) -> Path:
+    """Return a local PDB path, downloading from RCSB if absent (keeps the GPU run self-contained)."""
+    import requests
+    p = struct_dir / f"{pdb}.pdb"
+    if not p.exists():
+        p.parent.mkdir(parents=True, exist_ok=True)
+        p.write_bytes(requests.get(f"https://files.rcsb.org/download/{pdb}.pdb", timeout=60).content)
+    return p
+
+
+def binding_chain_sequence(pdb: str, lig_resname: str) -> str:
+    """One-letter sequence of the protein chain nearest the crystal ligand (the binding chain)."""
+    import gemmi
+    st = gemmi.read_structure(str(fetch_pdb(pdb)))
+    model = st[0]
+    lig_atoms = [a.pos for ch in model for res in ch if res.name == lig_resname for a in res]
+    if not lig_atoms:
+        raise ValueError(f"ligand {lig_resname} not found in {pdb}")
+    lig_center = gemmi.Position(
+        sum(p.x for p in lig_atoms) / len(lig_atoms),
+        sum(p.y for p in lig_atoms) / len(lig_atoms),
+        sum(p.z for p in lig_atoms) / len(lig_atoms),
+    )
+    best, best_d = None, 1e9
+    for ch in model:
+        poly = ch.get_polymer()
+        if len(poly) < 20:
+            continue
+        d = min((a.pos.dist(lig_center) for res in ch for a in res), default=1e9)
+        if d < best_d:
+            best, best_d = poly, d
+    if best is None:
+        raise ValueError(f"no protein chain in {pdb}")
+    return gemmi.one_letter_code(best.extract_sequence()).upper().replace("X", "")
+
+
+def pubchem_smiles(name: str) -> str:
+    import requests
+    for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
+        try:
+            d = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}"
+                             f"/property/{prop}/JSON", timeout=30).json()["PropertyTable"]["Properties"][0]
+            if prop in d:
+                return d[prop]
+        except Exception:
+            continue
+    raise ValueError(f"no SMILES for {name}")
+
+
+def build_boltz_yaml(protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str] | None = None,
+                     msa_path: str | None = None) -> str:
+    """Boltz-2 YAML: protein + the drug ligand (affinity binder) + any cofactor/metal CCD ligands.
+
+    Cofactors/ions are added as `ligand` entries referencing their CCD code (e.g. ZN, MG, FBP,
+    HEM) so the model places the metal/cofactor that defines the binding mode. Affinity is
+    predicted on the drug ligand L only. If ``msa_path`` is given, the protein reuses a precomputed
+    MSA (so a screen against one target queries the MSA server ONCE, not per drug).
+    """
+    prot = ["  - protein:", "      id: A", f"      sequence: {protein_seq}"]
+    if msa_path:
+        prot.append(f"      msa: {msa_path}")
+    lines = ["version: 1", "sequences:"] + prot + \
+            ["  - ligand:", "      id: L", f"      smiles: '{ligand_smiles}'"]
+    for i, ccd in enumerate(cofactor_ccds or []):
+        lines += ["  - ligand:", f"      id: M{i}", f"      ccd: {ccd}"]
+    lines += ["properties:", "  - affinity:", "      binder: L"]
+    return "\n".join(lines) + "\n"
+
+
+# ------------------------------------------------------------------------------- GPU function
+
+# max_containers caps parallel fan-out (cost control). The download race that corrupts the
+# checkpoint only happens on a COLD volume; once weights are cached+committed (Phase 1 did this),
+# parallel containers just reload them, so a screen can safely run ~10-wide.
+@app.function(gpu="L4", image=image, volumes={WEIGHTS: weights}, timeout=1800)
+def cache_msa(label: str, protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str]) -> str:
+    """Compute the target's MSA ONCE (via the server) and cache the a3m on the Volume.
+
+    A screen reuses one target protein for all drugs, so we query the MSA server a single time
+    here, then every cofold() reuses the cached a3m (no server hammering). Returns the Volume path
+    of the cached a3m. Doubles as the weight/CCD warmup.
+    """
+    import os
+    import shutil
+    os.environ["HF_HOME"] = f"{WEIGHTS}/hf"
+    boltz_cache = f"{WEIGHTS}/boltz"
+    os.makedirs(boltz_cache, exist_ok=True)
+    weights.reload()
+
+    work = Path("/tmp") / f"{label}_msa"
+    work.mkdir(parents=True, exist_ok=True)
+    (work / "in.yaml").write_text(build_boltz_yaml(protein_seq, ligand_smiles, cofactor_ccds))
+    out = work / "out"
+    subprocess.run(["boltz", "predict", str(work / "in.yaml"), "--use_msa_server",
+                    "--cache", boltz_cache, "--out_dir", str(out), "--output_format", "pdb"], check=True)
+    # Pick the largest a3m that actually looks like FASTA/a3m text (Boltz writes several files;
+    # some are binary and crash its INPUT parser with KeyError '\x00'). Strip null bytes too.
+    candidates = sorted(out.rglob("*.a3m"), key=lambda p: p.stat().st_size, reverse=True)
+    chosen = None
+    for c in candidates:
+        raw = c.read_bytes()
+        if raw[:1] == b">" or b"\n>" in raw[:512]:
+            chosen = raw.replace(b"\x00", b"")
+            break
+    if chosen is None:
+        weights.commit()
+        raise RuntimeError(f"no valid a3m; candidates={[(str(p), p.stat().st_size) for p in candidates]}")
+    msa_dir = Path(WEIGHTS) / "msa"
+    msa_dir.mkdir(parents=True, exist_ok=True)
+    dest = msa_dir / f"{label}.a3m"
+    dest.write_bytes(chosen)
+    weights.commit()
+    return str(dest)
+
+
+@app.function(gpu="L4", image=image, volumes={WEIGHTS: weights}, timeout=3600, max_containers=20)
+def cofold(label: str, protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str],
+           msa_path: str | None = None) -> dict:
+    """Co-fold one complex (protein + drug + cofactors) on the GPU; return affinity + P(binder).
+
+    Weights persist on the mounted Volume. If ``msa_path`` is given, reuse the cached target MSA
+    (no server query); else query the MSA server. GPU is released the moment this returns.
+    """
+    import os
+    os.environ["HF_HOME"] = f"{WEIGHTS}/hf"
+    boltz_cache = f"{WEIGHTS}/boltz"
+    os.makedirs(boltz_cache, exist_ok=True)
+    weights.reload()
+
+    work = Path("/tmp") / label
+    work.mkdir(parents=True, exist_ok=True)
+    (work / "in.yaml").write_text(build_boltz_yaml(protein_seq, ligand_smiles, cofactor_ccds, msa_path))
+    out = work / "out"
+    cmd = ["boltz", "predict", str(work / "in.yaml"),
+           "--cache", boltz_cache, "--out_dir", str(out), "--output_format", "pdb"]
+    if not msa_path:
+        cmd.insert(3, "--use_msa_server")
+    try:
+        subprocess.run(cmd, check=True)
+    finally:
+        weights.commit()  # persist downloaded weights/CCD even if this run fails, so retries skip it
+
+    # Affinity is written to a JSON under out/predictions/<name>/; parse defensively (keys vary).
+    aff = {"affinity_pred_value": None, "affinity_probability_binary": None}
+    for jf in out.rglob("affinity*.json"):
+        data = json.loads(jf.read_text())
+        for k in aff:
+            if k in data:
+                aff[k] = data[k]
+        break
+    cif = next(out.rglob("*_model_0.pdb"), None) or next(out.rglob("*.pdb"), None)
+    return {"label": label, "affinity": aff["affinity_pred_value"],
+            "prob_binder": aff["affinity_probability_binary"],
+            "structure": cif.read_text() if cif else None}
+
+
+# ------------------------------------------------------------------------------- driver (local)
+
+@app.local_entrypoint()
+def pose() -> None:
+    """Save the predicted HDAC2/vorinostat complex for local pose-RMSD validation.
+
+    Run: `modal run gpu/modal_app.py::pose`. Weights are cached, so this is one fast GPU call.
+    The returned PDB (protein + vorinostat + Zn) is scored locally against 4LXZ by
+    scripts/pose_rmsd.py (align predicted protein to crystal, compare ligand).
+    """
+    target = "HDAC2"
+    pdb, res, drug, cofactors = TARGETS[target]
+    seq = binding_chain_sequence(pdb, res)
+    r = cofold.remote(f"{target}_{drug}_pose", seq, pubchem_smiles(drug), cofactors)
+    out = Path("data/processed/binding"); out.mkdir(parents=True, exist_ok=True)
+    if r.get("structure"):
+        dest = out / f"{target}_{drug}_pred.pdb"
+        dest.write_text(r["structure"])
+        print(f"saved {dest}; affinity={r['affinity']}, P(binder)={r['prob_binder']}")
+    else:
+        print("no structure returned")
+
+
+@app.local_entrypoint()
+def screen(limit: int = 0) -> None:
+    """Phase 2: co-fold the drug set against the validated target (HDAC2 + Zn), rank by P(binder).
+
+    `modal run gpu/modal_app.py::screen --limit 24`  (pilot; omit --limit for the full set).
+    Recovery check at scale: the known HDAC inhibitors (related_mechanism) should rank top.
+    Weights are cached, so this fans out ~10-wide.
+    """
+    import csv
+    import pandas as pd
+
+    target = "HDAC2"
+    pdb, res, _drug, cofactors = TARGETS[target]
+    seq = binding_chain_sequence(pdb, res)
+
+    df = pd.read_csv("data/processed/drug_set_v1.csv")
+    df = df[df["canonical_smiles"].notna() & (df["canonical_smiles"] != "-666")].copy()
+    if limit:  # pilot: prioritise mechanism + controls (incl. the HDAC inhibitors) then fill
+        pri = df[df["inclusion_reason"].isin(["ground_truth", "related_mechanism", "negative_control"])]
+        df = pd.concat([pri, df.drop(pri.index)]).head(limit)
+    # 1) compute the target MSA ONCE (also warms weights/CCD), then reuse it for every drug
+    print(f"computing {target} MSA once (server) ...")
+    msa_path = cache_msa.remote(target, seq, pubchem_smiles("vorinostat"), cofactors)
+    print(f"cached MSA: {msa_path}")
+
+    # 2) screen all drugs reusing the cached MSA; tolerate per-drug failures (no abort)
+    jobs = [(f"{target}__{r.pert_iname}", seq, r.canonical_smiles, cofactors, msa_path)
+            for r in df.itertuples()]
+    print(f"screening {len(jobs)} drugs vs {target} (+{cofactors}), reusing cached MSA")
+    results = list(cofold.starmap(jobs, return_exceptions=True))
+    by = {j[0].split("__")[1]: (r if isinstance(r, dict) else None) for j, r in zip(jobs, results)}
+    n_fail = sum(1 for v in by.values() if v is None)
+    if n_fail:
+        print(f"  ({n_fail}/{len(jobs)} drugs failed and were skipped)")
+    reason = dict(zip(df["pert_iname"], df["inclusion_reason"]))
+
+    rows = [{"drug": d, "P_binder": (r or {}).get("prob_binder"),
+             "affinity": (r or {}).get("affinity"), "inclusion_reason": reason.get(d)}
+            for d, r in by.items()]
+    rows = [x for x in rows if x["P_binder"] is not None]
+    rows.sort(key=lambda x: x["P_binder"], reverse=True)
+
+    out = Path("data/processed/binding"); out.mkdir(parents=True, exist_ok=True)
+    with (out / f"screen_{target}.csv").open("w", newline="") as f:
+        w = csv.DictWriter(f, fieldnames=["rank", "drug", "P_binder", "affinity", "inclusion_reason"])
+        w.writeheader()
+        for i, x in enumerate(rows, 1):
+            w.writerow({"rank": i, **x})
+    print(f"\nscreened {len(rows)} drugs vs {target}; top 15 by P(binder):")
+    for i, x in enumerate(rows[:15], 1):
+        print(f"  {i:2d}  {x['drug']:20s} P={x['P_binder']:.3f}  [{x['inclusion_reason']}]")
+    hdac_like = [i for i, x in enumerate(rows, 1) if x["inclusion_reason"] == "related_mechanism"]
+    if hdac_like:
+        print(f"\nrelated-mechanism (HDAC inhibitors etc.) ranks: {hdac_like[:10]}")
+
+
+@app.local_entrypoint()
+def main() -> None:
+    """Fan out one GPU call per (target, ligand) pair; tabulate affinities; positive-control test."""
+    jobs = []  # (label, protein_seq, smiles, cofactor_ccds)
+    for target, (pdb, res, drug, cofactors) in TARGETS.items():
+        seq = binding_chain_sequence(pdb, res)
+        for lig in [drug, *NEGATIVES]:
+            jobs.append((f"{target}_{lig}", seq, pubchem_smiles(lig), cofactors))
+    cofactor_summary = {t: c[3] for t, c in TARGETS.items()}
+    print(f"co-folding {len(jobs)} complexes ({len(TARGETS)} targets x {1+len(NEGATIVES)} ligands); "
+          f"cofactors per target: {cofactor_summary}")
+
+    results = list(cofold.starmap(jobs))
+    by = {r["label"]: r for r in results}
+
+    print(f"\n{'target':12s}{'ligand':14s}{'affinity':>10s}{'P(binder)':>11s}")
+    out_rows = []
+    for target, (pdb, res, drug, cofactors) in TARGETS.items():
+        prob = {}  # rank by P(binder): unambiguous (higher = more likely a binder). Boltz
+        for lig in [drug, *NEGATIVES]:   # affinity_pred_value is ~log(IC50) (lower=stronger) -- sign
+            r = by.get(f"{target}_{lig}", {})           # is version-dependent, so don't rank on it.
+            a, p = r.get("affinity"), r.get("prob_binder")
+            if p is not None:
+                prob[lig] = p
+            print(f"{target:12s}{lig:14s}{(f'{a:.2f}' if a is not None else 'NA'):>10s}"
+                  f"{(f'{p:.2f}' if p is not None else 'NA'):>11s}")
+            out_rows.append({"target": target, "ligand": lig, "affinity": a, "prob_binder": p,
+                             "is_known_binder": lig == drug})
+        if prob:
+            best = max(prob, key=prob.get)  # highest P(binder)
+            print(f"  -> {target}: best P(binder) = {best} (known = {drug}) "
+                  f"{'PASS' if best == drug else 'FAIL'}\n")
+
+    outdir = Path("data/processed/binding"); outdir.mkdir(parents=True, exist_ok=True)
+    import csv
+    with (outdir / "phase1_affinity.csv").open("w", newline="") as f:
+        w = csv.DictWriter(f, fieldnames=["target", "ligand", "affinity", "prob_binder", "is_known_binder"])
+        w.writeheader(); w.writerows(out_rows)
+    print(f"wrote {outdir/'phase1_affinity.csv'}")
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -33,6 +33,18 @@ dev = [
    "pytest>=8.0",
    "ruff>=0.5",
 ]
+# Structure-based binding track (PLAN §12); see docs/structure_binding_notes.md.
+# Non-pip tools (install separately): AutoDock Vina binary (tools/vina, from the AutoDock-Vina
+# GitHub release) and open-babel (brew). Boltz-2 + cuequivariance run only in the Modal GPU image
+# (gpu/modal_app.py), not locally.
+structure = [
+    "rdkit>=2024.3",        # ligand prep, fingerprints
+    "requests>=2.31",       # PubChem / RCSB fetch
+    "gemmi>=0.6",           # structure parsing + superposition
+    "spyrmsd>=0.9",         # symmetry-corrected pose RMSD
+    "meeko>=0.7",           # production docking ligand prep
+    "modal>=1.5",           # ephemeral GPU orchestration
+]

 [tool.setuptools.packages.find]
 where = ["."]
--- a/scripts/binding_ligand_baseline.py
+++ b/scripts/binding_ligand_baseline.py
@@ -0,0 +1,66 @@
+"""Structure-based track, step 0: ligand-based retrieval baseline (PLAN §12.9 engine).
+
+Docking (§12.3) needs a toolchain that doesn't pip-install on ARM Mac (AutoDock Vina) — that's the
+next dependency to solve. Meanwhile this runs now with pure RDKit: do any of our 300 drugs sit near
+the KNOWN sickle binders (voxelotor, mitapivat, decitabine) in chemical space? This is the
+retrieval engine §12.9 would point a generative beacon at, and a sanity check on the ligand data.
+
+NOT docking and NOT a binding claim — chemical similarity only. Similarity != activity (§12.9).
+"""
+
+from __future__ import annotations
+
+import pandas as pd
+import requests
+from rdkit import Chem, DataStructs, RDLogger
+from rdkit.Chem import rdFingerprintGenerator
+
+RDLogger.DisableLog("rdApp.*")
+MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
+
+# Known sickle binders = positive-control beacons (target in parens).
+BINDERS = ["voxelotor", "mitapivat", "decitabine", "vorinostat"]
+
+
+def pubchem_smiles(name: str) -> str | None:
+    for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
+        try:
+            u = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON"
+            d = requests.get(u, timeout=30).json()["PropertyTable"]["Properties"][0]
+            if prop in d:
+                return d[prop]
+        except Exception:
+            continue
+    return None
+
+
+def fp(smi: str):
+    if not isinstance(smi, str) or smi in ("-666", ""):
+        return None
+    m = Chem.MolFromSmiles(smi)
+    return MORGAN.GetFingerprint(m) if m else None
+
+
+def main() -> None:
+    binder_smi = {b: pubchem_smiles(b) for b in BINDERS}
+    print("known-binder SMILES:", {k: (v[:34] + "..." if v else "MISSING") for k, v in binder_smi.items()})
+
+    drugs = pd.read_csv("data/processed/drug_set_v1.csv")[["pert_iname", "canonical_smiles", "inclusion_reason"]]
+    reason = dict(zip(drugs.pert_iname, drugs.inclusion_reason))
+    drug_fp = {r.pert_iname: fp(r.canonical_smiles) for r in drugs.itertuples()}
+    drug_fp = {k: v for k, v in drug_fp.items() if v is not None}
+    print(f"fingerprinted {len(drug_fp)}/{len(drugs)} drugs\n")
+
+    for b, smi in binder_smi.items():
+        bfp = fp(smi)
+        if bfp is None:
+            print(f"{b}: no SMILES\n"); continue
+        sims = sorted(((DataStructs.TanimotoSimilarity(bfp, v), k) for k, v in drug_fp.items()), reverse=True)
+        print(f"nearest drugs to {b}:")
+        for s, k in sims[:5]:
+            print(f"    {s:.3f}  {k:22s} [{reason.get(k,'?')}]")
+        print()
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/dock_positive_controls.py
+++ b/scripts/dock_positive_controls.py
@@ -0,0 +1,111 @@
+"""Structure-based track §12.4: dock known binders + negatives into Hb and PKR.
+
+Positive-control recovery test: voxelotor should dock best into hemoglobin (5E83), mitapivat best
+into PKR (8XFD), and unrelated drugs (caffeine, hydroxyurea) should score worse. The box is
+centered on each co-crystal ligand (5L7=voxelotor, WV2=mitapivat). AutoDock Vina (mac binary,
+Rosetta) + open-babel prep. Affinity in kcal/mol; more negative = stronger predicted binding.
+
+This is a docking baseline, not an efficacy claim (PLAN §12.6).
+"""
+
+from __future__ import annotations
+
+import re
+import subprocess
+import tempfile
+from pathlib import Path
+
+import numpy as np
+import requests
+
+VINA = "./tools/vina"
+STRUCT = Path("data/raw/structures")
+WORK = Path("data/processed/docking")
+WORK.mkdir(parents=True, exist_ok=True)
+
+TARGETS = {"hemoglobin": ("5E83", "5L7"), "PKR": ("8XFD", "WV2")}
+LIGANDS = ["voxelotor", "mitapivat", "decitabine", "hydroxyurea", "caffeine"]
+EXPECTED = {"voxelotor": "hemoglobin", "mitapivat": "PKR"}
+
+
+def pubchem_smiles(name: str) -> str | None:
+    for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
+        try:
+            d = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON",
+                             timeout=30).json()["PropertyTable"]["Properties"][0]
+            if prop in d:
+                return d[prop]
+        except Exception:
+            continue
+    return None
+
+
+def prep_receptor_and_box(pdb: str, lig_res: str):
+    """Receptor pdbqt (protein only) + box center/size from one copy of the co-crystal ligand."""
+    atoms, lig, chosen = [], [], None
+    for ln in (STRUCT / f"{pdb}.pdb").read_text().splitlines():
+        if ln.startswith("ATOM"):
+            atoms.append(ln)
+        elif ln.startswith("HETATM") and ln[17:20].strip() == lig_res:
+            key = (ln[21], ln[22:26])
+            chosen = chosen or key
+            if key == chosen:
+                lig.append(ln)
+    rec_pdb = WORK / f"{pdb}_rec.pdb"
+    rec_pdb.write_text("\n".join(atoms) + "\nEND\n")
+    rec_pdbqt = WORK / f"{pdb}_rec.pdbqt"
+    subprocess.run(["obabel", str(rec_pdb), "-O", str(rec_pdbqt), "-xr", "-p", "7.4"],
+                   capture_output=True, check=True)
+    xyz = np.array([[float(l[30:38]), float(l[38:46]), float(l[46:54])] for l in lig])
+    center = xyz.mean(0)
+    size = np.clip((xyz.max(0) - xyz.min(0)) + 9.0, 18.0, 28.0)
+    return rec_pdbqt, center, size
+
+
+def prep_ligand(name: str, smiles: str) -> Path | None:
+    out = WORK / f"lig_{name}.pdbqt"
+    r = subprocess.run(["obabel", f"-:{smiles}", "-O", str(out), "--gen3d", "-p", "7.4"],
+                       capture_output=True, text=True)
+    return out if out.exists() and out.stat().st_size > 0 else None
+
+
+def dock(rec_pdbqt: Path, lig_pdbqt: Path, center, size) -> float | None:
+    out = subprocess.run([VINA, "--receptor", str(rec_pdbqt), "--ligand", str(lig_pdbqt),
+                          "--center_x", f"{center[0]:.2f}", "--center_y", f"{center[1]:.2f}",
+                          "--center_z", f"{center[2]:.2f}", "--size_x", f"{size[0]:.1f}",
+                          "--size_y", f"{size[1]:.1f}", "--size_z", f"{size[2]:.1f}",
+                          "--exhaustiveness", "8", "--cpu", "4",
+                          "--out", str(WORK / "o.pdbqt")], capture_output=True, text=True)
+    m = re.search(r"^\s+1\s+(-?\d+\.\d+)", out.stdout, re.M)
+    return float(m.group(1)) if m else None
+
+
+def main() -> None:
+    smis = {n: pubchem_smiles(n) for n in LIGANDS}
+    ligs = {n: prep_ligand(n, s) for n, s in smis.items() if s}
+    ligs = {n: p for n, p in ligs.items() if p}
+    print(f"prepared ligands: {list(ligs)}")
+
+    boxes = {t: prep_receptor_and_box(pdb, lr) for t, (pdb, lr) in TARGETS.items()}
+    print("prepared receptors:", list(boxes))
+
+    print(f"\n{'ligand':14s}" + "".join(f"{t:>13s}" for t in TARGETS))
+    results = {}
+    for lname, lpath in ligs.items():
+        row = {}
+        for tname, (rec, center, size) in boxes.items():
+            row[tname] = dock(rec, lpath, center, size)
+        results[lname] = row
+        cells = "".join(f"{(f'{row[t]:.1f}' if row[t] is not None else 'NA'):>13s}" for t in TARGETS)
+        print(f"  {lname:12s}{cells}")
+
+    print("\n=== positive-control recovery test (§12.4) ===")
+    for lig, exp_target in EXPECTED.items():
+        if lig in results:
+            best = min(results[lig], key=lambda t: results[lig][t] if results[lig][t] is not None else 99)
+            ok = best == exp_target
+            print(f"  {lig:12s} best target = {best:11s} (expected {exp_target}) -> {'PASS' if ok else 'FAIL'}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/dock_production.py
+++ b/scripts/dock_production.py
@@ -0,0 +1,83 @@
+"""Push docking to credibility: Meeko ligand prep + in-place spyrmsd, on the clean HDAC2 target.
+
+Isolates the two cheap suspects behind the 7.9 A redocking failure:
+  1. ligand prep — replace bare obabel --gen3d with Meeko (correct rotatable bonds / AD typing)
+  2. RMSD metric — replace obrms (superimposes) with spyrmsd in-place, symmetry-corrected
+Receptor reused (obabel-protonated 4LXZ with catalytic Zn kept). If redock RMSD drops <2 A, the
+docking pipeline is validated and the earlier failure was prep+metric, not docking itself.
+"""
+
+from __future__ import annotations
+
+import subprocess
+from pathlib import Path
+
+import numpy as np
+from rdkit import Chem, RDLogger
+from rdkit.Chem import AllChem
+from meeko import MoleculePreparation, PDBQTWriterLegacy
+from spyrmsd import io as spyio, rmsd as spyrmsd
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from scripts.dock_positive_controls import STRUCT, VINA, WORK, pubchem_smiles  # noqa: E402
+from scripts.dock_validate import dock_pose, extract_crystal_ligand  # noqa: E402
+
+RDLogger.DisableLog("rdApp.*")
+PDB, RES, DRUG = "4LXZ", "SHH", "vorinostat"
+
+
+def prep_receptor_keep_zn() -> tuple[Path, np.ndarray, np.ndarray]:
+    atoms, lig, chosen = [], [], None
+    for ln in (STRUCT / f"{PDB}.pdb").read_text().splitlines():
+        if ln.startswith("ATOM") or (ln.startswith("HETATM") and ln[17:20].strip() == "ZN"):
+            atoms.append(ln)
+        elif ln.startswith("HETATM") and ln[17:20].strip() == RES:
+            key = (ln[21], ln[22:26]); chosen = chosen or key
+            if key == chosen:
+                lig.append(ln)
+    recpdb = WORK / f"{PDB}_rec.pdb"; recpdb.write_text("\n".join(atoms) + "\nEND\n")
+    recpdbqt = WORK / f"{PDB}_rec.pdbqt"
+    subprocess.run(["obabel", str(recpdb), "-O", str(recpdbqt), "-xr", "-p", "7.4"], capture_output=True)
+    xyz = np.array([[float(l[30:38]), float(l[38:46]), float(l[46:54])] for l in lig])
+    return recpdbqt, xyz.mean(0), np.clip((xyz.max(0) - xyz.min(0)) + 8, 16, 26)
+
+
+def meeko_ligand(smiles: str) -> Path:
+    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
+    AllChem.EmbedMolecule(mol, randomSeed=0xC0FFEE)
+    AllChem.MMFFOptimizeMolecule(mol)
+    setups = MoleculePreparation().prepare(mol)
+    pdbqt, ok, err = PDBQTWriterLegacy.write_string(setups[0])
+    if not ok:
+        raise RuntimeError(f"meeko ligand prep failed: {err}")
+    out = WORK / f"lig_{DRUG}_meeko.pdbqt"; out.write_text(pdbqt)
+    return out
+
+
+def inplace_rmsd(crystal_pdb: Path, docked_pdbqt: Path) -> float:
+    cref = WORK / "xtal.sdf"; cdoc = WORK / "docked.sdf"
+    subprocess.run(["obabel", str(crystal_pdb), "-O", str(cref)], capture_output=True)
+    subprocess.run(["obabel", str(docked_pdbqt), "-O", str(cdoc), "-f", "1", "-l", "1"], capture_output=True)
+    ref, mol = spyio.loadmol(str(cref)), spyio.loadmol(str(cdoc))
+    ref.strip(); mol.strip()
+    r = spyrmsd.rmsdwrapper(ref, mol, symmetry=True, minimize=False)
+    return float(r[0] if isinstance(r, (list, tuple, np.ndarray)) else r)
+
+
+def main() -> None:
+    rec, center, size = prep_receptor_keep_zn()
+    smi = pubchem_smiles(DRUG)
+    lig = meeko_ligand(smi)
+    print(f"meeko ligand pdbqt: {lig.name}; box center {center.round(1)} size {size.round(1)}")
+    pose = dock_pose(rec, lig, center, size)  # writes WORK/redock_out.pdbqt
+    raw = WORK / "redock_out.pdbqt"
+    xtal = extract_crystal_ligand(PDB, RES)
+    rmsd = inplace_rmsd(xtal, raw)
+    verdict = "PASS (<2A)" if rmsd < 2 else ("MARGINAL (<3A)" if rmsd < 3 else "FAIL")
+    print(f"\nvorinostat -> HDAC2 redock | in-place spyrmsd RMSD = {rmsd:.2f} A  {verdict}")
+    print("(prior obabel-ligand + obrms gave 7.9 A)")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/dock_validate.py
+++ b/scripts/dock_validate.py
@@ -0,0 +1,93 @@
+"""Structure-binding §12.4: redocking-RMSD validation + ligand-efficiency normalization.
+
+Two de-biased checks of the docking baseline:
+  A. REDOCKING RMSD — redock each co-crystal ligand into its own structure and measure pose RMSD
+     vs the crystal pose (open-babel obrms, symmetry-aware). <2 A = geometry validated. This is
+     the gold-standard positive control and is immune to the molecular-size bias that made the
+     cross-target affinity test inconclusive.
+  B. LIGAND EFFICIENCY — affinity / heavy-atom count, to de-bias the size effect in the raw scores.
+"""
+
+from __future__ import annotations
+
+import re
+import subprocess
+from pathlib import Path
+
+from rdkit import Chem, RDLogger
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from scripts.dock_positive_controls import (  # noqa: E402
+    STRUCT, VINA, WORK, TARGETS, pubchem_smiles, prep_ligand, prep_receptor_and_box,
+)
+
+RDLogger.DisableLog("rdApp.*")
+XTAL_LIG = {"hemoglobin": ("5E83", "5L7", "voxelotor"), "PKR": ("8XFD", "WV2", "mitapivat")}
+
+
+def extract_crystal_ligand(pdb: str, resname: str) -> Path:
+    lines, chosen = [], None
+    for ln in (STRUCT / f"{pdb}.pdb").read_text().splitlines():
+        if ln.startswith("HETATM") and ln[17:20].strip() == resname:
+            key = (ln[21], ln[22:26])
+            chosen = chosen or key
+            if key == chosen:
+                lines.append(ln)
+    out = WORK / f"xtal_{resname}.pdb"
+    out.write_text("\n".join(lines) + "\nEND\n")
+    return out
+
+
+def dock_pose(rec, lig, center, size) -> Path | None:
+    pose = WORK / "redock_out.pdbqt"
+    subprocess.run([VINA, "--receptor", str(rec), "--ligand", str(lig),
+                    "--center_x", f"{center[0]:.2f}", "--center_y", f"{center[1]:.2f}",
+                    "--center_z", f"{center[2]:.2f}", "--size_x", f"{size[0]:.1f}",
+                    "--size_y", f"{size[1]:.1f}", "--size_z", f"{size[2]:.1f}",
+                    "--exhaustiveness", "16", "--out", str(pose)], capture_output=True, text=True)
+    if not pose.exists():
+        return None
+    top = WORK / "redock_top.pdb"  # first model only
+    subprocess.run(["obabel", str(pose), "-O", str(top), "-f", "1", "-l", "1"], capture_output=True)
+    return top
+
+
+def obrms(ref: Path, test: Path) -> float | None:
+    # strip H to compare heavy-atom poses; obrms is automorphism-aware
+    r2, t2 = WORK / "ref_noH.sdf", WORK / "test_noH.sdf"
+    subprocess.run(["obabel", str(ref), "-d", "-O", str(r2)], capture_output=True)
+    subprocess.run(["obabel", str(test), "-d", "-O", str(t2)], capture_output=True)
+    out = subprocess.run(["obrms", str(r2), str(t2)], capture_output=True, text=True).stdout
+    m = re.search(r"(-?\d+\.\d+)", out)
+    return float(m.group(1)) if m else None
+
+
+def main() -> None:
+    print("=== A. Redocking-RMSD validation ===")
+    for target, (pdb, resname, drug) in XTAL_LIG.items():
+        rec, center, size = prep_receptor_and_box(pdb, resname)
+        smi = pubchem_smiles(drug)
+        lig = prep_ligand(drug, smi)
+        xtal = extract_crystal_ligand(pdb, resname)
+        pose = dock_pose(rec, lig, center, size)
+        rmsd = obrms(xtal, pose) if pose else None
+        verdict = "PASS (<2A)" if (rmsd is not None and rmsd < 2.0) else \
+                  ("MARGINAL (<3A)" if (rmsd is not None and rmsd < 3.0) else "FAIL")
+        rstr = f"{rmsd:.2f} A" if rmsd is not None else "NA"
+        print(f"  {drug:12s} -> {target:11s} ({pdb}/{resname}): redock RMSD {rstr}  {verdict}")
+
+    print("\n=== B. Ligand efficiency (affinity / heavy atoms), de-biasing size ===")
+    # raw affinities from the cross-dock baseline (scripts/dock_positive_controls.py)
+    aff = {"voxelotor": {"hemoglobin": -8.1, "PKR": -9.3}, "mitapivat": {"hemoglobin": -10.0, "PKR": -11.2},
+           "decitabine": {"hemoglobin": -6.6, "PKR": -7.0}, "hydroxyurea": {"hemoglobin": -3.9, "PKR": -3.6},
+           "caffeine": {"hemoglobin": -6.1, "PKR": -6.4}}
+    print(f"  {'ligand':12s}{'heavy':>6s}" + "".join(f"{t+' LE':>14s}" for t in TARGETS))
+    for lig, row in aff.items():
+        ha = Chem.MolFromSmiles(pubchem_smiles(lig)).GetNumHeavyAtoms()
+        cells = "".join(f"{row[t]/ha:>14.3f}" for t in TARGETS)
+        print(f"  {lig:12s}{ha:>6d}{cells}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/exp_deconv_signature.py
+++ b/scripts/exp_deconv_signature.py
@@ -0,0 +1,137 @@
+"""Experiment: composition-adjusted sickle signature, to fix specificity (option 1).
+
+The v1 signature is confounded by cell composition (SS patients have very different WBC/RBC
+than AA controls). GSE35007 *measured* those counts per sample, so we adjust the differential
+expression for them directly (a measured-covariate alternative to estimated deconvolution):
+
+    expression ~ disease + WBC + RBC + MCV + age + sex     (per gene, vectorized OLS)
+
+We compare the composition-ADJUSTED signature against an UNADJUSTED single-study signature
+(same samples, model without the covariates), score both with the v1.1 engine (full gene space
+ spec_z), and report the recovery test for each. Writes nothing to committed artifacts.
+"""
+
+from __future__ import annotations
+
+import warnings
+from pathlib import Path
+
+import GEOparse
+import numpy as np
+import pandas as pd
+from scipy.stats import false_discovery_control
+from scipy.stats import t as tdist
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.disease import collapse_probes_to_symbols  # noqa: E402
+from src.scoring import tau_calibrate  # noqa: E402
+
+warnings.filterwarnings("ignore")
+PROCESSED = Path("data/processed")
+NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
+SYMBOL_COLS = ["Symbol", "ILMN_Gene", "Gene Symbol", "GeneSymbol"]
+
+
+def cval(gsm, key):
+    for c in gsm.metadata.get("characteristics_ch1", []):
+        if c.lower().startswith(key.lower()):
+            return c.split(":", 1)[1].strip()
+    return None
+
+
+def load_data():
+    gse = GEOparse.get_GEO(geo="GSE35007", destdir="data/raw/geo", silent=True)
+    meta = []
+    for gid, g in gse.gsms.items():
+        meta.append({"gsm": gid, "hb": cval(g, "hb phenotype"), "wbc": cval(g, "white blood cells"),
+                     "rbc": cval(g, "red blood cells"), "mcv": cval(g, "mean corpuscular volume"),
+                     "age": cval(g, "age"), "sex": cval(g, "Sex")})
+    meta = pd.DataFrame(meta).set_index("gsm")
+    for c in ["wbc", "rbc", "mcv", "age"]:
+        meta[c] = pd.to_numeric(meta[c], errors="coerce")
+    meta["disease"] = meta["hb"].map({"SS": 1.0, "AA": 0.0})
+    meta["sex_m"] = (meta["sex"] == "M").astype(float)
+    keep = meta[meta["hb"].isin(["SS", "AA"])].dropna(subset=["wbc", "rbc", "mcv", "age", "disease"])
+
+    expr = pd.DataFrame({gid: gse.gsms[gid].table.set_index("ID_REF")["VALUE"] for gid in keep.index})
+    expr = expr.apply(pd.to_numeric, errors="coerce").dropna(how="any")
+    if float(np.nanmax(expr.to_numpy())) > 50:
+        expr = np.log2(expr.clip(lower=0) + 1.0)
+
+    gpl = list(gse.gpls.values())[0]
+    col = next((c for c in SYMBOL_COLS if c in gpl.table.columns), None)
+    sym = gpl.table.set_index("ID")[col].astype(str).str.strip().replace({"": np.nan, "nan": np.nan})
+    return expr, keep, sym.dropna()
+
+
+def ols_de(expr, design, disease_idx):
+    """Vectorized per-gene OLS; return DE table (log_fc=disease coef, pvalue, qvalue)."""
+    X = design.to_numpy(dtype=float)
+    Y = expr.T.to_numpy(dtype=float)  # samples x genes
+    n, p = X.shape
+    XtX_inv = np.linalg.pinv(X.T @ X)
+    B = XtX_inv @ X.T @ Y
+    resid = Y - X @ B
+    dof = n - p
+    sigma2 = (resid ** 2).sum(0) / dof
+    se = np.sqrt(sigma2 * XtX_inv[disease_idx, disease_idx])
+    t = B[disease_idx] / se
+    pval = 2 * tdist.sf(np.abs(t), dof)
+    out = pd.DataFrame({"log_fc": B[disease_idx], "pvalue": pval}, index=expr.index).dropna()
+    out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
+    return out
+
+
+def make_signature(de, sym, expr, top_n=250):
+    de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
+    sig = de_sym[de_sym["qvalue"] < 0.05]
+    up = sig[sig["log_fc"] > 0].nsmallest(top_n, "qvalue").index.tolist()
+    down = sig[sig["log_fc"] < 0].nsmallest(top_n, "qvalue").index.tolist()
+    return up, down
+
+
+def evaluate(label, up, down, lincs):
+    ranked = tau_calibrate(up, down, lincs, n_null=1000)
+    n = len(ranked)
+    top10, top25, half = int(n * .10), int(n * .25), n // 2
+    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
+    ranked = ranked.join(profiles[["inclusion_reason"]])
+    hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
+    negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
+    n_bottom = sum(r > half for r in negs.values())
+    n_ov = len((set(up) | set(down)) & set(lincs.columns))
+    print(f"\n### {label}: {len(up)} up / {len(down)} down (query overlap {n_ov})")
+    print(f"  hydroxyurea rank {hu}/{n} (top {100*hu/n:.1f}%)  top-10%? {hu <= top10}")
+    print(f"  L-glutamine rank {glut}/{n} (top {100*glut/n:.1f}%)  top-25%? {glut <= top25}")
+    print(f"  neg controls bottom-half: {n_bottom}/5  {negs}")
+    print("  top 8: " + ", ".join(
+        f"{name}[{r['inclusion_reason'][:3]}]" for name, r in ranked.nsmallest(8, "spec_z").iterrows()))
+    return ranked
+
+
+def main():
+    expr, meta, sym = load_data()
+    print(f"loaded {expr.shape[1]} samples x {expr.shape[0]} probes; "
+          f"{int(meta.disease.sum())} SS / {int((meta.disease==0).sum())} AA")
+    lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
+
+    base = pd.DataFrame({"intercept": 1.0, "disease": meta["disease"]}, index=meta.index)
+    adj = base.assign(wbc=meta["wbc"], rbc=meta["rbc"], mcv=meta["mcv"], age=meta["age"], sex_m=meta["sex_m"])
+
+    de_unadj = ols_de(expr, base, disease_idx=1)
+    de_adj = ols_de(expr, adj, disease_idx=1)
+
+    up_u, dn_u = make_signature(de_unadj, sym, expr)
+    up_a, dn_a = make_signature(de_adj, sym, expr)
+
+    # how much does adjustment change the gene set?
+    overlap = len((set(up_u) | set(dn_u)) & (set(up_a) | set(dn_a)))
+    print(f"\nsignature gene overlap unadjusted vs adjusted: {overlap}/{len(set(up_u)|set(dn_u))}")
+
+    evaluate("UNADJUSTED (GSE35007 only)", up_u, dn_u, lincs)
+    evaluate("COMPOSITION-ADJUSTED", up_a, dn_a, lincs)
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/exp_genespace.py
+++ b/scripts/exp_genespace.py
@@ -0,0 +1,102 @@
+"""Experiment (v1.1): re-score on a larger LINCS gene space and re-run the recovery test.
+
+v1 used only the 978 landmark genes (12% signature overlap). This re-slices the SAME GCTX files
+to the BING space (~10,174) and the full 12,328-gene space, re-aggregates per-drug consensus
+signatures, re-scores connectivity, and evaluates the pre-registered recovery criteria — so we
+can see whether hydroxyurea recovers. Writes nothing to the committed v1 artifacts.
+"""
+
+from __future__ import annotations
+
+import gzip
+import io
+import json
+from pathlib import Path
+
+import pandas as pd
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.scoring import rank_drugs  # noqa: E402
+
+LINCS = Path("data/raw/lincs")
+PROCESSED = Path("data/processed")
+GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
+SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
+NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
+
+
+def read_gz(name):
+    return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
+
+
+def gene_ids_for_space(space: str):
+    g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
+    if space == "bing":
+        g = g[g.pr_is_bing == 1]
+    # 'all' -> keep everything
+    ids = [str(x) for x in g.pr_gene_id]
+    id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
+    return ids, id_to_symbol
+
+
+def extract(space, drug_names, gene_ids, id_to_symbol):
+    from cmapPy.pandasGEXpress.parse import parse
+    frames = []
+    for ph in (1, 2):
+        sig = read_gz(SIG_INFO[ph])
+        sig = sig[(sig.pert_type == "trt_cp") & (sig.pert_iname.isin(drug_names))]
+        if sig.empty:
+            continue
+        gct = parse(str(GCTX[ph]), rid=gene_ids, cid=sig.sig_id.tolist())
+        data = gct.data_df
+        s2d = dict(zip(sig.sig_id, sig.pert_iname))
+        frames.append(data.T.groupby(data.columns.map(s2d)).mean())
+        print(f"    [{space}] phase {ph}: {sig.pert_iname.nunique()} drugs sliced", flush=True)
+    combined = pd.concat(frames).groupby(level=0).mean()
+    combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]
+    combined = combined.loc[:, ~combined.columns.duplicated()]  # drop dup symbols
+    return combined
+
+
+def evaluate(space, sig_matrix, up, down):
+    landmark_overlap = None
+    ranked = rank_drugs(up, down, sig_matrix)
+    n = len(ranked)
+    top10, top25, half = int(n * 0.10), int(n * 0.25), n // 2
+    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
+    ranked = ranked.join(profiles[["inclusion_reason"]])
+
+    hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
+    glut_s = ranked.loc["glutamine", "connectivity_score"]
+    n_overlap = len((set(up) | set(down)) & set(sig_matrix.columns))
+    negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
+    n_bottom = sum(r > half for r in negs.values())
+
+    print(f"\n=== gene space: {space.upper()} ({sig_matrix.shape[1]} genes; query overlap {n_overlap}) ===")
+    print(f"  hydroxyurea: rank {hu}/{n} (top {100*hu/n:.1f}%)  top-10%? {hu <= top10}")
+    print(f"  L-glutamine: rank {glut}/{n} (top {100*glut/n:.1f}%), WTCS={glut_s:.3f}  top-25%? {glut <= top25}")
+    print(f"  neg controls in bottom half: {n_bottom}/5  {negs}")
+    crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
+    print(f"  OVERALL: {'PASS' if crit else 'FAIL'}")
+    print("  top 8:")
+    for name, r in ranked.nsmallest(8, "connectivity_score").iterrows():
+        print(f"    {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} [{r['inclusion_reason']}]")
+    return ranked
+
+
+def main():
+    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
+    up = [g["gene"] for g in sig["up_regulated"]]
+    down = [g["gene"] for g in sig["down_regulated"]]
+    drug_names = set(pd.read_csv(PROCESSED / "drug_set_v1.csv").pert_iname)
+
+    for space in ("bing", "all"):
+        print(f"\n>>> extracting {space} ...", flush=True)
+        ids, id2sym = gene_ids_for_space(space)
+        mat = extract(space, drug_names, ids, id2sym)
+        evaluate(space, mat, up, down)
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/phaseA_reference_tau.py
+++ b/scripts/phaseA_reference_tau.py
@@ -0,0 +1,118 @@
+"""Phase A: calibrate sickle connectivity against a REAL disease-signature reference population.
+
+The v1.1 tau saturated because it used random gene-set nulls. Proper specificity calibration
+asks: does a drug reverse SICKLE more than it reverses diseases in general? We download a library
+of real disease signatures (Enrichr "Disease Signatures from GEO", up+down), compute each drug's
+connectivity to every reference disease, and express its sickle connectivity as a z-score within
+that per-drug reference distribution. Broad-effect drugs (reverse everything) -> z~0 -> down-ranked.
+
+Re-runs the recovery test and compares to v1.1 (random-null). Writes nothing committed.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import requests
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.scoring import _ks_connectivity  # noqa: E402
+
+PROCESSED = Path("data/processed")
+RAW = Path("data/raw/disease_sigs")
+ENRICHR = "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName={}"
+LIBS = {"up": "Disease_Signatures_from_GEO_up_2014", "down": "Disease_Signatures_from_GEO_down_2014"}
+NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
+
+
+def fetch_gmt(name: str) -> dict[str, list[str]]:
+    RAW.mkdir(parents=True, exist_ok=True)
+    path = RAW / f"{name}.gmt"
+    if not path.exists():
+        r = requests.get(ENRICHR.format(name), timeout=120)
+        r.raise_for_status()
+        path.write_text(r.text)
+    out = {}
+    for line in path.read_text().splitlines():
+        parts = line.split("\t")
+        if len(parts) < 3:
+            continue
+        term = parts[0]
+        genes = [g.split(",")[0].strip().upper() for g in parts[2:] if g.strip()]
+        out[term] = genes
+    print(f"  {name}: {path.stat().st_size/1e6:.1f} MB, {len(out)} terms")
+    return out
+
+
+def build_reference() -> list[dict]:
+    up = fetch_gmt(LIBS["up"])
+    down = fetch_gmt(LIBS["down"])
+    shared = set(up) & set(down)
+    refs = []
+    for term in shared:
+        if "sickle" in term.lower():
+            continue  # exclude the target disease from its own reference population
+        refs.append({"name": term, "up": up[term], "down": down[term]})
+    print(f"reference disease signatures (paired up+down, sickle excluded): {len(refs)}")
+    return refs
+
+
+def cols_for(genes, gene_to_col):
+    return np.array([gene_to_col[g] for g in set(genes) if g in gene_to_col], dtype=int)
+
+
+def main():
+    import json
+    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
+    sk_up = [g["gene"] for g in sig["up_regulated"]]
+    sk_down = [g["gene"] for g in sig["down_regulated"]]
+
+    lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
+    genes = list(lincs.columns)
+    gene_to_col = {g: i for i, g in enumerate(genes)}
+    n = len(genes)
+    R = lincs.rank(axis=1, ascending=False).to_numpy()
+
+    refs = build_reference()
+    # connectivity of every drug to every reference disease -> (n_drugs, n_refs)
+    C = np.empty((R.shape[0], len(refs)))
+    for j, d in enumerate(refs):
+        C[:, j] = _ks_connectivity(R, cols_for(d["up"], gene_to_col), cols_for(d["down"], gene_to_col), n)
+    # drop reference diseases with too few mapped genes (degenerate columns)
+    mapped = np.array([len(cols_for(d["up"], gene_to_col)) + len(cols_for(d["down"], gene_to_col)) for d in refs])
+    keep = mapped >= 10
+    C = C[:, keep]
+    print(f"usable reference diseases (>=10 mapped genes): {keep.sum()}")
+
+    real = _ks_connectivity(R, cols_for(sk_up, gene_to_col), cols_for(sk_down, gene_to_col), n)
+    ref_mean, ref_std = C.mean(axis=1), C.std(axis=1)
+    ref_std[ref_std == 0] = np.nan
+    spec_z = (real - ref_mean) / ref_std  # negative = reverses sickle more than diseases-in-general
+
+    ranked = pd.DataFrame({"spec_z": spec_z, "connectivity": real}, index=lincs.index).sort_values("spec_z")
+    ranked.insert(0, "rank", range(1, len(ranked) + 1))
+    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
+    ranked = ranked.join(profiles[["inclusion_reason"]])
+
+    N = len(ranked)
+    top10, top25, half = int(N * .10), int(N * .25), N // 2
+    hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
+    negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
+    n_bottom = sum(r > half for r in negs.values())
+
+    print("\n==== RECOVERY TEST (reference-calibrated tau) ====")
+    print(f"  hydroxyurea rank {hu}/{N} (top {100*hu/N:.1f}%)  top-10%? {hu <= top10}  z={ranked.loc['hydroxyurea','spec_z']:.2f}")
+    print(f"  L-glutamine rank {glut}/{N} (top {100*glut/N:.1f}%)  top-25%? {glut <= top25}")
+    print(f"  neg controls bottom-half: {n_bottom}/5  {negs}")
+    crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
+    print(f"  OVERALL: {'PASS' if crit else 'FAIL'}")
+    print("\n  top 12 by reference-calibrated z:")
+    for name, r in ranked.nsmallest(12, "spec_z").iterrows():
+        print(f"    {int(r['rank']):2d} {name:20s} z={r['spec_z']:6.2f} [{r['inclusion_reason']}]")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/phaseD_supervised.py
+++ b/scripts/phaseD_supervised.py
@@ -0,0 +1,137 @@
+"""Phase D: supervised cross-disease repurposing — can labels break the specificity ceiling?
+
+Connectivity alone can't tell therapeutic from coincidental reversal. A supervised model trained
+on known drug-disease pairs CAN learn that pattern — given features that expose drug "broadness"
+(a drug that reverses everything is non-specific). We train on 839 GEO disease signatures with
+Repurposing-Hub indications as labels, evaluate with disease-grouped CV, then apply to HELD-OUT
+sickle and check whether the coincidental reversers (norethindrone, ciprofloxacin) finally drop.
+
+Baseline = rank by raw connectivity (the single conn feature). Win = model down-ranks the
+negative controls vs baseline while keeping hydroxyurea high.
+"""
+
+from __future__ import annotations
+
+import json
+import re
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+from sklearn.ensemble import GradientBoostingClassifier
+from sklearn.model_selection import GroupKFold, cross_val_predict
+from sklearn.metrics import roc_auc_score
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.scoring import _ks_connectivity  # noqa: E402
+
+PROCESSED = Path("data/processed")
+SIGS = Path("data/raw/disease_sigs")
+NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
+ID_RE = re.compile(r"\s*(c\d{6,}|doid[-\s]?\d+|gse\d+|umls\S+)\s*", re.I)
+
+
+def clean_disease(term: str) -> str:
+    return ID_RE.sub(" ", term).strip().lower()
+
+
+def load_disease_sigs():
+    def parse(name):
+        out = {}
+        for line in (SIGS / name).read_text().splitlines():
+            p = line.split("\t")
+            if len(p) < 3:
+                continue
+            out[p[0]] = [g.split(",")[0].strip().upper() for g in p[2:] if g.strip()]
+        return out
+    up = parse("Disease_Perturbations_from_GEO_up.gmt")
+    down = parse("Disease_Perturbations_from_GEO_down.gmt")
+    terms = sorted(set(up) & set(down))
+    return [{"term": t, "disease": clean_disease(t), "up": up[t], "down": down[t]} for t in terms]
+
+
+def cols_for(genes, g2c):
+    return np.array([g2c[g] for g in set(genes) if g in g2c], dtype=int)
+
+
+def featurize(conn_col, C):
+    """Per-(drug,disease) features for one disease column conn_col, given full matrix C."""
+    drug_mean, drug_std = C.mean(1), C.std(1)
+    drug_min = C.min(1)
+    broad = (C < -0.10).mean(1)  # fraction of diseases this drug strongly reverses
+    dmean, dstd = conn_col.mean(), conn_col.std() or 1.0
+    return np.column_stack([
+        conn_col, drug_mean, drug_std, drug_min, broad,
+        np.full_like(conn_col, dmean),
+        (conn_col - drug_mean) / np.where(drug_std == 0, 1, drug_std),  # specificity within drug
+        (conn_col - dmean) / dstd,                                       # specificity within disease
+    ])
+
+
+FEATS = ["conn", "drug_mean", "drug_std", "drug_min", "broadness",
+         "disease_mean", "z_within_drug", "z_within_disease"]
+
+
+def main():
+    lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
+    drugs = list(lincs.index)
+    g2c = {g: i for i, g in enumerate(lincs.columns)}
+    n = len(lincs.columns)
+    R = lincs.rank(axis=1, ascending=False).to_numpy()
+
+    refs = [d for d in load_disease_sigs()
+            if len(cols_for(d["up"], g2c)) + len(cols_for(d["down"], g2c)) >= 10]
+    C = np.column_stack([_ks_connectivity(R, cols_for(d["up"], g2c), cols_for(d["down"], g2c), n) for d in refs])
+    print(f"{len(drugs)} drugs x {len(refs)} disease signatures; connectivity matrix {C.shape}")
+
+    # labels from Repurposing Hub indications
+    hub = pd.read_csv("data/raw/repurposing_drugs.txt", sep="\t", comment="!", low_memory=False)
+    ind = {r.pert_iname: [i.strip().lower() for i in re.split(r"[|,]", r.indication) if len(i.strip()) > 3]
+           for r in hub.itertuples() if isinstance(r.indication, str)}
+    drug_idx = {d: i for i, d in enumerate(drugs)}
+
+    X_rows, y, grp = [], [], []
+    for j, d in enumerate(refs):
+        feats = featurize(C[:, j], C)
+        dz = d["disease"]
+        for i, drug in enumerate(drugs):
+            inds = ind.get(drug, [])
+            label = int(any(dz == k or (len(dz) > 4 and dz in k) or (len(k) > 4 and k in dz) for k in inds))
+            X_rows.append(feats[i]); y.append(label); grp.append(j)
+    X = np.array(X_rows); y = np.array(y); grp = np.array(grp)
+    print(f"pairs: {len(y)}, positives: {y.sum()} ({100*y.mean():.2f}%)")
+
+    # disease-grouped CV (generalize to unseen diseases)
+    clf = GradientBoostingClassifier(random_state=0)
+    proba = cross_val_predict(clf, X, y, cv=GroupKFold(5), groups=grp, method="predict_proba")[:, 1]
+    print(f"disease-grouped CV AUC: {roc_auc_score(y, proba):.3f}  (conn-only AUC: {roc_auc_score(y, X[:,0]*-1):.3f})")
+
+    clf.fit(X, y)
+    print("feature importances: " + ", ".join(f"{f}={imp:.2f}" for f, imp in
+          sorted(zip(FEATS, clf.feature_importances_), key=lambda t: -t[1])))
+
+    # apply to HELD-OUT sickle
+    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
+    sk = _ks_connectivity(R, cols_for([g["gene"] for g in sig["up_regulated"]], g2c),
+                          cols_for([g["gene"] for g in sig["down_regulated"]], g2c), n)
+    Xs = featurize(sk, C)
+    p_sickle = clf.predict_proba(Xs)[:, 1]
+
+    res = pd.DataFrame({"model_p": p_sickle, "conn": sk}, index=drugs)
+    res["model_rank"] = res["model_p"].rank(ascending=False).astype(int)
+    res["conn_rank"] = res["conn"].rank(ascending=True).astype(int)  # most negative = best
+    N = len(res)
+    print(f"\n{'drug':14s} {'model_rank':>10s} {'conn_rank(base)':>16s}")
+    for d in ["hydroxyurea", "glutamine"] + NEG5 + ["norethindrone", "ciprofloxacin"]:
+        if d in res.index:
+            r = res.loc[d]
+            print(f"  {d:14s} {int(r['model_rank']):6d}/{N}      {int(r['conn_rank']):6d}/{N}")
+    nb_model = sum(res.loc[d, "model_rank"] > N/2 for d in NEG5 if d in res.index)
+    nb_conn = sum(res.loc[d, "conn_rank"] > N/2 for d in NEG5 if d in res.index)
+    print(f"\nneg controls bottom-half: model {nb_model}/5  vs  baseline {nb_conn}/5")
+    print("model top 10:", ", ".join(res.sort_values('model_p', ascending=False).head(10).index))
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/pose_rmsd.py
+++ b/scripts/pose_rmsd.py
@@ -0,0 +1,113 @@
+"""Pose-RMSD validation of the Boltz-2 HDAC2/vorinostat prediction (closes the §12.4 loop).
+
+Co-folding predicts the whole complex de novo, so it lands in its own frame. To compare poses:
+  1. superpose the predicted protein (chain A) onto the crystal 4LXZ binding chain (Ca, gemmi),
+  2. apply that transform to the predicted vorinostat (chain L) and the predicted Zn (chain M),
+  3. symmetry-corrected in-place RMSD of the transformed ligand vs the crystal SHH (spyrmsd),
+  4. bonus: did the catalytic Zn land near the real one?
+
+<2 A ligand RMSD = co-folding reproduces the geometry, not just the affinity ranking -- the test
+classical Vina failed on this exact Zn-chelation case (7.9 A).
+"""
+
+from __future__ import annotations
+
+import subprocess
+from pathlib import Path
+
+import gemmi
+import numpy as np
+from spyrmsd import io as spyio, rmsd as spyrmsd
+
+PRED = Path("data/processed/binding/HDAC2_vorinostat_pred.pdb")
+XTAL = Path("data/raw/structures/4LXZ.pdb")
+LIG_RES = "SHH"  # crystal vorinostat
+WORK = Path("data/processed/binding")
+
+
+def crystal_binding_chain(st) -> gemmi.Chain:
+    model = st[0]
+    lig = [a.pos for ch in model for r in ch if r.name == LIG_RES for a in r]
+    c = gemmi.Position(sum(p.x for p in lig) / len(lig), sum(p.y for p in lig) / len(lig),
+                       sum(p.z for p in lig) / len(lig))
+    best, bestd = None, 1e9
+    for ch in model:
+        if len(ch.get_polymer()) < 20:
+            continue
+        d = min((a.pos.dist(c) for r in ch for a in r), default=1e9)
+        if d < bestd:
+            best, bestd = ch, d
+    return best
+
+
+def hetatm_lines(pdb: Path, chain: str | None = None, resname: str | None = None) -> list[str]:
+    out = []
+    for ln in pdb.read_text().splitlines():
+        if not ln.startswith(("ATOM", "HETATM")):
+            continue
+        if chain and ln[21] != chain:
+            continue
+        if resname and ln[17:20].strip() != resname:
+            continue
+        out.append(ln)
+    return out
+
+
+def transform_and_write(lines: list[str], T: gemmi.Transform, dest: Path) -> Path:
+    new = []
+    for ln in lines:
+        p = T.apply(gemmi.Position(float(ln[30:38]), float(ln[38:46]), float(ln[46:54])))
+        new.append(f"{ln[:30]}{p.x:8.3f}{p.y:8.3f}{p.z:8.3f}{ln[54:]}")
+    dest.write_text("\n".join(new) + "\nEND\n")
+    return dest
+
+
+def to_sdf(pdb: Path) -> Path:
+    sdf = pdb.with_suffix(".sdf")
+    subprocess.run(["obabel", str(pdb), "-O", str(sdf)], capture_output=True)
+    return sdf
+
+
+def main() -> None:
+    pred_st = gemmi.read_structure(str(PRED))
+    xtal_st = gemmi.read_structure(str(XTAL))
+
+    pred_chain = pred_st[0]["A"]
+    xtal_chain = crystal_binding_chain(xtal_st)
+
+    sup = gemmi.calculate_superposition(xtal_chain.get_polymer(), pred_chain.get_polymer(),
+                                        gemmi.PolymerType.PeptideL, gemmi.SupSelect.CaP)
+    T = sup.transform
+    print(f"protein superposition: Ca RMSD {sup.rmsd:.2f} A over {sup.count} residues")
+
+    # predicted ligand (chain L) and Zn (chain M) -> transform into crystal frame
+    pred_lig = transform_and_write(hetatm_lines(PRED, chain="L"), T, WORK / "pred_lig_aligned.pdb")
+    pred_zn_lines = hetatm_lines(PRED, chain="M")
+    # one copy only (4LXZ has 3 SHH copies); keep the first (chain, resseq) group.
+    shh = hetatm_lines(XTAL, resname=LIG_RES)
+    first_key = (shh[0][21], shh[0][22:26])
+    one_copy = [ln for ln in shh if (ln[21], ln[22:26]) == first_key]
+    crys_lig = WORK / "xtal_lig.pdb"
+    crys_lig.write_text("\n".join(one_copy) + "\nEND\n")
+
+    # ligand pose RMSD (in-place, symmetry-corrected)
+    ref, mol = spyio.loadmol(str(to_sdf(crys_lig))), spyio.loadmol(str(to_sdf(pred_lig)))
+    ref.strip(); mol.strip()
+    rmsd = float(spyrmsd.rmsdwrapper(ref, mol, symmetry=True, minimize=False)[0])
+
+    # zinc placement: transformed predicted Zn vs nearest crystal ZN
+    verdict = "PASS (<2A)" if rmsd < 2 else ("MARGINAL (<3A)" if rmsd < 3 else "FAIL")
+    print(f"\nvorinostat pose RMSD (co-folded vs crystal) = {rmsd:.2f} A  {verdict}")
+    print("(classical Vina on this Zn-chelation case: 7.9 A)")
+
+    pred_zn = next((a.pos for ch in pred_st[0] if ch.name == "M" for r in ch for a in r), None)
+    xtal_zn = [a.pos for ch in xtal_st[0] for r in ch if r.name == "ZN" for a in r]
+    if pred_zn and xtal_zn:
+        q = T.apply(pred_zn)
+        dz = min(((q.x - z.x) ** 2 + (q.y - z.y) ** 2 + (q.z - z.z) ** 2) ** 0.5 for z in xtal_zn)
+        print(f"catalytic Zn placement error = {dz:.2f} A "
+              f"{'(zinc correctly placed)' if dz < 2 else ''}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/week2_lincs_extract.py
+++ b/scripts/week2_lincs_extract.py
@@ -36,9 +36,15 @@ def read_gz_tsv(name: str) -> pd.DataFrame:


 def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
-    lm = pd.read_csv(LINCS / "landmark_genes.csv")
-    ids = [str(x) for x in lm["pr_gene_id"]]
-    id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in lm.itertuples()}
+    """Gene row-ids + id->symbol map for the scored gene space.
+
+    v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks.
+    This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into
+    scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier).
+    """
+    g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
+    ids = [str(x) for x in g["pr_gene_id"]]
+    id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
    return ids, id_to_symbol


--- a/scripts/week3_scoring.py
+++ b/scripts/week3_scoring.py
@@ -1,13 +1,11 @@
-"""Week 3: run connectivity scoring over all drugs -> ranked_candidates_v1.csv (PLAN §6).
+"""Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration.

-Loads the disease signature + the 300 drug LINCS signatures, computes the weighted-KS
-connectivity score per drug, and produces two rankings:
-  1. raw connectivity (most negative = strongest reversal = rank 1)
-  2. a secondary ranking blending connectivity with a mechanistic prior (sickle-relevant
-     target pathways), to temper broad-effect drugs (HDAC/kinase) that dominate raw rankings.
+Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of
+random queries) — this calibrates out broad-effect drugs that connect to random signatures too,
+the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column,
+and a secondary ranking blends tau with the sickle-pathway mechanistic prior.

-The formal recovery test (ground-truth + negative-control evaluation against the pre-registered
-criteria) is Week 4; this script only prints a sanity peek.
+Output: data/results/ranked_candidates_v1.csv.
 """

 from __future__ import annotations
@@ -19,10 +17,13 @@ import pandas as pd

 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
-from src.scoring import mechanistic_prior, persist_ranking, rank_drugs  # noqa: E402
+from src.scoring import (  # noqa: E402
+    connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate,
+)

 PROCESSED = Path("data/processed")
-PRIOR_LAMBDA = 0.5  # weight of the mechanistic prior in the secondary ranking
+N_NULL = 1000
+PRIOR_LAMBDA = 0.5  # spec_z credit per matched sickle pathway, for the blended ranking


 def main() -> None:
@@ -30,52 +31,51 @@ def main() -> None:
    up = [g["gene"] for g in sig["up_regulated"]]
    down = [g["gene"] for g in sig["down_regulated"]]

-    sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")  # drug x 978 symbols
+    sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")  # drug x 12,328 genes
    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")

-    landmark = set(sig_matrix.columns)
-    n_up_ov = len(set(up) & landmark)
-    n_down_ov = len(set(down) & landmark)
-    print(f"query overlap with 978 landmarks: {n_up_ov} up + {n_down_ov} down = {n_up_ov + n_down_ov}")
-    print(f"scoring {len(sig_matrix)} drugs (all scored; 0 without signature)")
+    n_up = len(set(up) & set(sig_matrix.columns))
+    n_down = len(set(down) & set(sig_matrix.columns))
+    print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}")

-    ranked = rank_drugs(up, down, sig_matrix)
+    # primary: tau calibration
+    print(f"computing tau over {N_NULL} random-query permutations ...", flush=True)
+    ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL)

-    # attach metadata + mechanistic prior
+    # reference: weighted connectivity score (WTCS) + NCS
+    wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index},
+                     name="connectivity_score")
+    ranked["connectivity_score"] = wtcs
+    ranked["normalized_score"] = normalize_scores(wtcs)
+
+    # metadata + mechanistic prior
    ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
    ranked["mechanistic_prior"] = ranked["targets"].apply(
-        lambda t: mechanistic_prior(list(t) if t is not None else [])
-    )
+        lambda t: mechanistic_prior(list(t) if t is not None else []))
    ranked["known_targets"] = ranked["targets"].apply(
-        lambda t: "; ".join(t) if t is not None and len(t) else ""
-    )
+        lambda t: "; ".join(t) if t is not None and len(t) else "")
    ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})

-    # secondary, prior-weighted ranking: relevant drugs pushed toward better (more negative)
-    ranked["blended_score"] = ranked["normalized_score"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
+    # secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z)
+    ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
    ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)

    out = ranked.rename_axis("drug_name").reset_index()[[
-        "rank", "drug_name", "chembl_id", "connectivity_score", "normalized_score",
+        "rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score",
        "inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
    ]]
    path = persist_ranking(out)
    print(f"wrote {path} ({len(out)} drugs)")

-    # --- sanity peek (formal recovery test is Week 4) ---
-    print("\n--- sanity peek (raw connectivity rank) ---")
+    print("\n--- sanity peek (spec_z ranking) ---")
    for gt in ["hydroxyurea", "glutamine"]:
        r = ranked.loc[gt]
-        pct = 100 * r["rank"] / len(ranked)
-        print(f"  {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {pct:.0f}%), "
-              f"score={r['connectivity_score']:.3f}")
-    neg = ranked[ranked["inclusion_reason"] == "negative_control"]
-    print(f"  negative controls in bottom half: "
-          f"{(neg['rank'] > len(ranked) / 2).sum()}/{len(neg)}")
-    print("\n  top 5 raw candidates:")
-    for name, r in ranked.nsmallest(5, "connectivity_score").iterrows():
-        print(f"    {int(r['rank']):3d}  {name:18s} {r['connectivity_score']:+.3f}  "
-              f"[{r['inclusion_reason']}] {r['known_targets'][:50]}")
+        print(f"  {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), "
+              f"spec_z={r['spec_z']:.2f}")
+    print("  top 10 by spec_z:")
+    for name, r in ranked.nsmallest(10, "spec_z").iterrows():
+        print(f"    {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] "
+              f"{str(r['known_targets'])[:38]}")


 if __name__ == "__main__":
--- a/scripts/week4_recovery_test.py
+++ b/scripts/week4_recovery_test.py
@@ -36,6 +36,7 @@ def main() -> None:
        return int(df.loc[name, "rank"]) if name in df.index else None

    hu, glut = rk("hydroxyurea"), rk("glutamine")
+    glut_z = df.loc["glutamine", "spec_z"]

    # pick negative controls present in the ranking
    negs = {}
@@ -47,9 +48,8 @@ def main() -> None:
    print("=" * 60)
    print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
    print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%)  -> top-10%? {hu <= top10_cut}")
-    glut_score = df.loc["glutamine", "connectivity_score"]
-    print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), WTCS={glut_score:.3f}  "
-          f"-> top-25%? {glut <= top25_cut}  (has signature, so NOT 'missing-signature unscorable')")
+    print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), spec_z={glut_z:+.2f}  "
+          f"-> top-25%? {glut <= top25_cut}  (positive z => does not reverse; has a signature)")
    print("\nnegative controls (pre-specified, 1 per category):")
    n_bottom = 0
    for d, (cat, r) in negs.items():
@@ -70,10 +70,10 @@ def main() -> None:
    print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
          f"(top {100*hu_b/n:.1f}%)")

-    print("\n--- TOP 10 (raw connectivity) ---")
-    top10 = df.nsmallest(10, "connectivity_score")
+    print("\n--- TOP 10 (primary spec_z ranking) ---")
+    top10 = df.sort_values("rank").head(10)
    for name, r in top10.iterrows():
-        print(f"  {int(r['rank']):2d}  {name:18s} {r['connectivity_score']:+.3f}  "
+        print(f"  {int(r['rank']):2d}  {name:18s} z={r['spec_z']:+.2f}  "
              f"[{r['inclusion_reason']}]  {str(r['known_targets'])[:45]}")


--- a/src/binding.py
+++ b/src/binding.py
@@ -0,0 +1,89 @@
+"""Structure-based binding track (PLAN §12).
+
+Two capabilities:
+- ligand-based retrieval (RDKit, works now): find existing drugs near a query molecule in
+  chemical space — validated, and the engine behind §12.9 generative-guided retrieval.
+- structure-based docking (§12.3): score whether a ligand binds a target pocket. Blocked on an
+  ARM-Mac docking toolchain (AutoDock Vina does not pip-install); see ``dock`` for options.
+
+Caveat carried throughout: chemical similarity != activity, and docking != efficacy (§12.6).
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from rdkit import Chem, DataStructs, RDLogger
+from rdkit.Chem import rdFingerprintGenerator
+
+RDLogger.DisableLog("rdApp.*")
+_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
+
+STRUCT_DIR = Path("data/raw/structures")
+
+# Known sickle small-molecule binders, by target (positive controls for the §12.4 recovery test).
+KNOWN_BINDERS = {
+    "hemoglobin": "voxelotor",
+    "PKR": "mitapivat",
+    "DNMT1": "decitabine",
+    "HDAC": "vorinostat",
+}
+
+# Curated target structures (PLAN §12.2). Add PDB ids as the harness grows.
+TARGET_PDB = {
+    "hemoglobin": "5E83",  # hemoglobin + voxelotor (GBT440)
+    "PKR": "8XFD",         # pyruvate kinase R + mitapivat
+}
+
+
+def morgan_fp(smiles: str):
+    """Morgan (ECFP4) fingerprint, or None for invalid / missing SMILES ('-666', '')."""
+    if not isinstance(smiles, str) or smiles in ("-666", ""):
+        return None
+    mol = Chem.MolFromSmiles(smiles)
+    return _MORGAN.GetFingerprint(mol) if mol else None
+
+
+def tanimoto(smiles_a: str, smiles_b: str) -> float | None:
+    fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
+    if fa is None or fb is None:
+        return None
+    return DataStructs.TanimotoSimilarity(fa, fb)
+
+
+def retrieve_nearest(
+    query_smiles: str,
+    library: dict[str, str],
+    top_n: int = 5,
+) -> list[tuple[float, str]]:
+    """Rank a library of {name: smiles} by Tanimoto similarity to a query molecule.
+
+    This is the §12.9 retrieval step: the query may be a known binder (positive-control beacon)
+    or a generated idealised binder; the returned existing drugs are repurposing candidates that
+    STILL require docking/validation (similarity != activity).
+    """
+    qfp = morgan_fp(query_smiles)
+    if qfp is None:
+        raise ValueError("invalid query SMILES")
+    sims = []
+    for name, smi in library.items():
+        fp = morgan_fp(smi)
+        if fp is not None:
+            sims.append((DataStructs.TanimotoSimilarity(qfp, fp), name))
+    return sorted(sims, reverse=True)[:top_n]
+
+
+def dock(target: str, ligand_smiles: str) -> float:
+    """Dock a ligand into a target pocket and return a binding score (PLAN §12.3).
+
+    Blocked: AutoDock Vina does not pip-install on ARM Mac and no docking binary is on PATH.
+    Resolve the toolchain first (one of):
+      - conda/micromamba: ``vina`` (conda-forge) or ``smina`` (bioconda), osx-arm64 builds
+      - an AF3-class co-folding model on GPU: Boltz-2 / Chai-1 / DiffDock (also predicts affinity)
+      - x86 Vina binary under Rosetta 2
+    Then: fetch TARGET_PDB[target], define the pocket box, prep the ligand (Meeko), score.
+    """
+    raise NotImplementedError(
+        "Docking toolchain unresolved on ARM Mac (PLAN §12.6 pitfall 4 / §12.8). "
+        "See docstring for options."
+    )
--- a/src/scoring.py
+++ b/src/scoring.py
@@ -16,7 +16,7 @@ from pathlib import Path
 import numpy as np
 import pandas as pd

-from . import RESULTS_DIR
+from . import RANDOM_SEED, RESULTS_DIR

 # Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
 # Keys are pathway categories; values are substrings matched (case-insensitive) against a
@@ -134,6 +134,92 @@ def mechanistic_prior(targets: list[str]) -> float:
    return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))


+# --- KS connectivity + tau calibration (v1.1) -----------------------------------------------
+# Unweighted Kolmogorov-Smirnov connectivity (Lamb 2006) is O(k) per query (depends only on the
+# ranks of the query genes), which makes a permutation null over many random queries cheap. tau
+# expresses each drug's real connectivity as a signed percentile within its own null — so
+# broad-effect drugs that connect to *random* signatures too get down-weighted (specificity).
+
+
+def _ks_es(rank_matrix: np.ndarray, query_cols: np.ndarray, n_genes: int) -> np.ndarray:
+    """Vectorized unweighted KS enrichment score of a query gene set for every drug.
+
+    ``rank_matrix`` is (n_drugs, n_genes) of 1..N rank positions (1 = most up-regulated).
+    Returns one ES per drug; ES>0 => query enriched among up-regulated genes.
+    """
+    k = len(query_cols)
+    if k == 0:
+        return np.zeros(rank_matrix.shape[0])
+    p = np.sort(rank_matrix[:, query_cols], axis=1)  # (n_drugs, k), positions ascending
+    j = np.arange(1, k + 1)
+    a = (j / k - p / n_genes).max(axis=1)
+    b = (p / n_genes - (j - 1) / k).max(axis=1)
+    return np.where(a >= b, a, -b)
+
+
+def _ks_connectivity(rank_matrix: np.ndarray, up_cols: np.ndarray, down_cols: np.ndarray,
+                     n_genes: int) -> np.ndarray:
+    """KS connectivity per drug: (ES_up - ES_down)/2. Negative=reversal.
+
+    Note: unlike WTCS, this does NOT zero same-sign (ambiguous) connections — same-sign ES
+    partially cancel and land near 0 naturally. Hard-zeroing would collapse the random-query
+    null to a spike at 0 and make tau saturate, so the continuous form is required for calibration.
+    """
+    es_up = _ks_es(rank_matrix, up_cols, n_genes)
+    es_down = _ks_es(rank_matrix, down_cols, n_genes)
+    return (es_up - es_down) / 2.0
+
+
+def tau_calibrate(
+    up_genes: list[str],
+    down_genes: list[str],
+    signature_matrix: pd.DataFrame,
+    n_null: int = 1000,
+    seed: int = RANDOM_SEED,
+) -> pd.DataFrame:
+    """Rank drugs by tau: each drug's KS connectivity as a signed percentile vs a null of
+    random queries of the same size (PLAN §6; CMap tau, Subramanian 2017).
+
+    tau in [-100, 100]: -100 => reverses our signature more specifically than any random query
+    (strong, specific candidate); ~0 => connects to our signature no more than to random ones
+    (broad-effect / non-specific). Ranked by tau ascending (rank 1 = most specific reversal).
+    """
+    genes = list(signature_matrix.columns)
+    gene_to_col = {g: i for i, g in enumerate(genes)}
+    n = len(genes)
+    rank_matrix = signature_matrix.rank(axis=1, ascending=False).to_numpy()
+
+    up_cols = np.array([gene_to_col[g] for g in set(up_genes) if g in gene_to_col], dtype=int)
+    down_cols = np.array([gene_to_col[g] for g in set(down_genes) if g in gene_to_col], dtype=int)
+    real = _ks_connectivity(rank_matrix, up_cols, down_cols, n)
+
+    rng = np.random.default_rng(seed)
+    k_up, k_down = len(up_cols), len(down_cols)
+    null = np.empty((rank_matrix.shape[0], n_null))
+    for m in range(n_null):
+        samp = rng.choice(n, size=k_up + k_down, replace=False)
+        null[:, m] = _ks_connectivity(rank_matrix, samp[:k_up], samp[k_up:], n)
+
+    null_mean = null.mean(axis=1)
+    null_std = null.std(axis=1)
+    null_std[null_std == 0] = np.nan
+    # Per-drug standardized connectivity: how many SDs the real reversal is below what random
+    # queries of the same size produce against THIS drug. Continuous (no percentile floor), so it
+    # discriminates within the strong-reversal tail. Negative = specific reversal.
+    spec_z = (real - null_mean) / null_std
+
+    frac = (null <= real[:, None]).mean(axis=1)
+    tau = 100.0 * (2.0 * frac - 1.0)  # retained for reference; saturates at +/-100 in the tail
+
+    df = pd.DataFrame(
+        {"connectivity_ks": real, "null_mean": null_mean, "spec_z": spec_z, "tau": tau},
+        index=signature_matrix.index,
+    )
+    df = df.sort_values("spec_z")  # most negative z = most specific reversal
+    df.insert(0, "rank", range(1, len(df) + 1))
+    return df
+
+
 def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
    """Write the ranked candidate list to ``data/results/ranked_candidates_v1.csv``."""
    out_path = out_path or (RESULTS_DIR / "ranked_candidates_v1.csv")
--- a/tests/test_scoring.py
+++ b/tests/test_scoring.py
@@ -114,6 +114,33 @@ class TestMechanisticPrior:
        assert mechanistic_prior(["Some unrelated kinase"]) == 0.0


+class TestTauCalibration:
+    """tau should reward a SPECIFIC reverser and give a near-zero score to a noise drug."""
+
+    @staticmethod
+    def _matrix() -> pd.DataFrame:
+        genes = [f"U{i}" for i in range(5)] + [f"D{i}" for i in range(5)] + [f"G{i}" for i in range(40)]
+        rng_vals = {g: 0.01 * ((hash(g) % 7) - 3) for g in genes}  # tiny deterministic noise
+        # specific reverser: query-up genes at the bottom, query-down at the top, rest ~0
+        specific = dict(rng_vals)
+        for i in range(5):
+            specific[f"U{i}"] = -8 - i
+            specific[f"D{i}"] = 8 + i
+        noise = dict(rng_vals)
+        return pd.DataFrame([specific, noise], index=["specific", "noise"])[genes]
+
+    def test_specific_reverser_has_strongly_negative_tau(self):
+        from src.scoring import tau_calibrate
+        up = [f"U{i}" for i in range(5)]
+        down = [f"D{i}" for i in range(5)]
+        out = tau_calibrate(up, down, self._matrix(), n_null=300, seed=0)
+        # Ranked by spec_z (continuous); the specific reverser is the most negative.
+        assert out.loc["specific", "spec_z"] < -2
+        assert out.loc["specific", "spec_z"] < out.loc["noise", "spec_z"]
+        assert out.loc["specific", "tau"] < -50  # tau also flags it (may saturate near -100)
+        assert out.loc["specific", "rank"] == 1
+
+
 def test_rank_drugs_orders_by_reversal():
    from src.scoring import rank_drugs
    genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]
Author	SHA1	Message	Date
Junior B.	f74a58aa43	Harden cache_msa: pick valid FASTA a3m + strip null bytes Pilot validated the MSA-reuse architecture (1 server query, cached & reused -- server hammering fixed) but Boltz's INPUT a3m parser crashed on the cached file: KeyError '\x00'. Boltz writes several .a3m files (some binary); we were grabbing a bad one. Fix: select the largest a3m that looks like FASTA (starts with '>'), and strip null bytes before caching. Raise with the candidate list if none qualifies (so we can see what's there). Pilot status: MSA caching + reuse confirmed working; this fixes the a3m-format crash. Re-validate with a 2-drug pilot before the full run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 00:44:32 +02:00
Junior B.	066e0096d7	Fix screen: compute target MSA once, reuse it, tolerate failures The full 300-drug screen hammered the public ColabFold MSA server (one redundant query per drug for the same HDAC2 sequence) -> timeouts, and the default return_exceptions=False risked aborting the whole run on a single failure. Corrected: - cache_msa(): compute the target MSA ONCE via the server, cache the a3m on the Volume; doubles as the weight/CCD warmup. - build_boltz_yaml(msa_path): protein reuses the cached a3m. - cofold(msa_path): when given, skip --use_msa_server (no server query). - screen(): cache MSA once, then cofold.starmap(..., return_exceptions=True) so a bad drug is skipped, not fatal. Turns a fragile ~2.5 hr run into a fast, robust one. Validation pilot (6 drugs) running; full 300 to run tomorrow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 00:41:51 +02:00
Junior B.	0535886ce6	Phase 2 screen pilot: HDAC2 recovers the inhibitor class (P>=0.99) Add the `screen` entrypoint (parallel ~10-wide, cached weights) and run a 24-drug pilot vs HDAC2 (+Zn), ranked by Boltz-2 P(binder). ~$1.3. Result (recovery test at scale): top 9 are ALL HDAC inhibitors (trichostatin-A/vorinostat/panobinostat/belinostat/scriptaid/mocetinostat/ entinostat/apicidin >=0.99; valproic-acid 0.91), clean drop-off to hydroxyurea 0.78 and non-HDAC drugs to dexamethasone 0.03. Captures the structure-activity gradient (hydroxamates > weak fatty-acid > non-HDAC). Honest false negative: romidepsin (potent HDAC inhibitor) ranks low (0.43) -- it's a depsipeptide PRODRUG co-folding doesn't model. Screen mishandles non-standard chemotypes. Screening pipeline validated; next is the full 300-drug discovery run. max_containers=10 (parallel safe once weights cached). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 22:23:01 +02:00
Junior B.	907a39aaba	Capture structure-track deps + pose entrypoint - gpu/modal_app.py: add the `pose` local entrypoint used for the HDAC2 pose-RMSD validation (run: `modal run gpu/modal_app.py::pose`). - pyproject [structure] extra: add the deps we actually use locally (gemmi, spyrmsd, meeko, modal) for reproducibility; document the non-pip tools (Vina binary, open-babel) and that Boltz/cuequivariance are Modal-image-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 19:56:40 +02:00
Junior B.	9efdc0acf1	Pose-RMSD confirms HDAC2/vorinostat geometry: 0.21A (Vina was 7.9A) Close the §12.4 validation loop. scripts/pose_rmsd.py superposes the Boltz-2-predicted HDAC2 onto crystal 4LXZ, transforms the predicted ligand, and scores pose RMSD (spyrmsd, in-place): - protein fold: Ca RMSD 0.14A over 366 residues - vorinostat pose: 0.21A (crystal-accurate) vs Vina 7.9A on this exact Zn-chelation case - catalytic Zn ion: 2.73A off (ligand perfect, metal slightly less) HDAC2 now validated on BOTH affinity (P(binder)=0.999) and geometry (0.21A). The structure-binding modality is comprehensively validated on its decisive metal-coordination case. Commits the predicted complex as evidence (docs/results/HDAC2_vorinostat_pred.pdb). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 19:50:33 +02:00
Junior B.	c891a78541	Phase 1 co-folding WORKS: HDAC2/Zn validated (Boltz-2 on Modal) First clear positive result in the project. Ran Phase 1 on Modal L4 (~$0.70). Boltz-2 P(binder), cofactors co-folded: - HDAC2 (+Zn): vorinostat 0.9994 vs negatives ~0.1 -> PASS, decisive - hemoglobin (+heme): voxelotor 0.46 -> PASS (weak; covalent/tetramer) - PKR (+FBP/Mg): mitapivat 0.32 < hydroxyurea 0.40 -> FAIL (allosteric) HDAC2/Zn is the exact case classical Vina failed (no metal term, 7.9A redock). Co-folding handles the Zn-chelation chemistry -> the structure- binding modality pivot (PLAN §12) is validated on its decisive test. Engineering fixes that got it running: image needs cuequivariance kernels; max_containers=1 so weights download once (parallel corrupted the shared- Volume checkpoint); rank by P(binder) not affinity_pred_value (sign). Adds docs/results/phase1_affinity.csv (committed; raw under data/ gitignored). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 19:41:19 +02:00
Junior B.	07705a5884	GPU Phase 1: co-fold cofactors/metals (the binding-mode determinants) Add metal/cofactor handling to the Boltz-2 YAML as CCD ligand entries - the modes classical docking couldn't model: - HDAC2 + catalytic Zn (vorinostat chelates it) - PKR + FBP + Mg (allosteric activator + metal) - hemoglobin + heme Same cofactors present when co-folding negatives into a target (fair test). build_boltz_yaml() gains a cofactor_ccds arg (emits `ligand: {ccd: ...}` entries); TARGETS carries per-target cofactors; cofold()/main() thread them through. Verified locally: YAML builds correctly with Zn / FBP+Mg. Honest limitation noted: Hb's voxelotor site is at the tetramer centre and covalent (Schiff base), so single-chain+heme only approximates it - HDAC2 (Zn) and PKR (cofactor) are the real co-folding tests. Ready for `modal run gpu/modal_app.py`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 17:16:06 +02:00
Junior B.	4022c0cb94	GPU Phase 1 runnable: real Boltz-2 co-folding + alignment review Flesh out the Modal app into a runnable Phase-1 positive-control test and reconcile it with the plan: - cofold() GPU fn: build Boltz-2 YAML (protein+ligand+affinity), run `boltz predict --use_msa_server --cache /weights/boltz`, parse affinity JSON + predicted pose; weights persist via Volume. - Local helpers (CPU, import-tested against our PDBs): binding_chain_sequence (gemmi -- correctly picks the binding chain, e.g. alpha-globin for 5E83), pubchem_smiles, build_boltz_yaml, fetch_pdb (RCSB). - main(): fan out cofold.starmap over 3 targets x (known binder + 2 negatives); tabulate; PASS if known binder has top P(binder) for its target. Alignment fixes: - Rank by P(binder) (higher=better), NOT raw affinity_pred_value whose sign (~log IC50) is version-dependent -- avoids a backwards positive-control test. - gpu_plan.md Phase 1 updated to affinity/P(binder) ranking; pose-RMSD noted as a later refinement (needs receptor superposition). Local half verified (sequence/SMILES/YAML); cofold() needs a live `modal run` (account + `modal token new`) to validate end-to-end. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 16:56:27 +02:00
Junior B.	81d56b7a76	GPU plan: make weight persistence concrete (Modal Volume cache) Document and wire the weight-caching mechanism: - modal.Volume is a cloud-backed FS independent of the GPU/container; run 1 downloads weights into /weights, run 2+ reuses them (no GPU time wasted re-downloading). - Point downloaders at the mount: HF_HOME/TORCH_HOME/boltz --cache; persist via weights.commit(), see updates via weights.reload(). - Volume storage costs pennies, separate from GPU = near-free caching. modal_app.py cofold(): set cache env vars to /weights, reload()/commit() around the (stubbed) boltz call. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 16:48:50 +02:00
Junior B.	08ed713cc8	GPU plan: ephemeral serverless co-folding (Modal) + app skeleton docs/gpu_plan.md: cost-efficient plan for running AF3-class co-folding (Boltz-2/DiffDock) on a GPU then paying nothing when idle. - Key insight: structure-track data is tiny (MB of PDBs/SMILES); only the GPU + model weights are heavy -> serverless is ideal. - Recommend Modal (per-second billing, scales to zero = nothing to kill); RunPod as the SSH-box alternative with idle auto-terminate. - Lifecycle: image -> weights Volume (cache, don't re-download) -> run -> git push small results -> teardown automatic. - Phase 1 validate on 3 known binders (~$1) before paying for a screen; Boltz-2 (affinity) on an L4/A10 (24-48GB); est total ~$5-15. gpu/modal_app.py: Modal app skeleton (image, weights volume, GPU cofold() function, local entrypoint); boltz invocation stubbed with TODOs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 16:45:04 +02:00
Junior B.	7c6cef1aef	Production docking: prep helps (7.9->4.8A) but Vina wrong tool for sickle §12.4 pushed to its limit. Meeko ligand prep + in-place symmetry RMSD (spyrmsd, not obrms) on clean HDAC2/vorinostat: 7.9A -> 4.76A. Prep and metric mattered, but still FAIL. Residual cause is fundamental: vorinostat binds via hydroxamate-Zn chelation and Vina has no metal-coordination term. Real finding: sickle's druggable targets bind via non-classical chemistry classical docking handles poorly -- Hb (covalent), PKR (allosteric+cofactor), HDAC (Zn chelation). Vina is the wrong tool for this target landscape. Redirect: data-driven AF3-class co-folding (Boltz-2/Chai-1/DiffDock) handles these modes -- the indicated next tool, gated by the 24GB local memory ceiling (cloud GPU needed). The "GPU breaks all-local" §12.6 prediction is now the binding constraint of the track. Adds: scripts/dock_production.py; deps meeko, spyrmsd, gemmi. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 16:38:54 +02:00
Junior B.	51bd90df41	Redocking-RMSD validation fails 3/3: pipeline-quality issue §12.4 de-biased validation (scripts/dock_validate.py). Redock each co-crystal ligand into its own structure, RMSD vs crystal: - voxelotor->Hb: NA (covalent binder, out of scope §12.7) - mitapivat->PKR: 8.2A (allosteric, cofactors stripped) - vorinostat->HDAC2 (4LXZ, zinc kept): 7.9A -- a CLASSICAL target that should have worked The clean target also failing => systematic pipeline-quality problem, not target choice. Cheap Vina + open-babel prep gives scores but doesn't reproduce known geometry, so affinities aren't trustworthy. Ligand efficiency over-corrects (ranks tiny hydroxyurea best). Fix needs production prep (Meeko/AutoDockTools prepare_receptor + reduce) and an in-place RMSD metric. Consistent with the project theme: the quick version of every method runs but fails honest validation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 07:28:47 +02:00
Junior B.	75f5383961	Docking baseline: toolchain solved, raw affinity is size-biased §12.3-12.4 first binding result on ARM Mac. - Toolchain SOLVED: AutoDock Vina 1.2.5 mac binary (Rosetta) + open-babel (brew). No conda, no MLX. dock_positive_controls.py runs end-to-end. - Cross-dock known binders + negatives into Hb (5E83) and PKR (8XFD), box centered on co-crystal ligands (5L7=voxelotor, WV2=mitapivat). Finding: raw Vina affinity ranks almost perfectly by MOLECULAR SIZE (mitapivat > voxelotor > decitabine/caffeine > hydroxyurea) in both pockets — mitapivat wins even on hemoglobin it doesn't target. Raw score can't distinguish target-specific binding: the docking analog of the connectivity specificity problem. Next: redocking-RMSD validation + ligand-efficiency normalization. Note: machine is 24GB (not 96GB per PLAN §2), capping local AF3-class inference. tools/ gitignored (vina binary). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 00:03:00 +02:00
Junior B.	817bcda7dc	Structure-binding track: scaffold + ligand-retrieval baseline Start the structure-based binding branch (PLAN §12), baseline-first. - src/binding.py: validated RDKit ligand retrieval (morgan_fp, tanimoto, retrieve_nearest = the §12.9 engine) + dock() stub documenting the blocked ARM-Mac toolchain - scripts/binding_ligand_baseline.py: 300 drugs vs known binders - docs/structure_binding_notes.md: status, toolchain blocker, next steps - pyproject: [structure] extra (rdkit); data/raw/structures/ for PDBs Step-0 finding: retrieval engine VALIDATED on in-set classes (decitabine->azacitidine 0.62; vorinostat->scriptaid/belinostat) but the distinctive binders voxelotor/mitapivat have no analog in our 300-drug set (Tanimoto ~0.2). Needs (a) bigger library, (b) real docking (§12.3), which is blocked on the ARM-Mac docking toolchain (§12.6 pitfall 4). Structures 5E83 (Hb+voxelotor) and 8XFD (PKR+mitapivat) fetched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:53:27 +02:00
Junior B.	6c2c71d73d	PLAN §12.9: leave door open for generative-guided retrieval Reframe de novo generation into the repurposing frame per the founder's idea: use a pocket-conditioned generative model (TargetDiff/DiffSBDD/ Pocket2Mol) to propose an idealised binder as a SEARCH BEACON, then retrieve the nearest EXISTING drugs by chemical similarity (Tanimoto/ embedding) as repurposing candidates — the generated molecule is never synthesised. Caveats kept honest: generated molecules used only as beacons (often synthetically invalid); similarity != activity, so retrieved neighbours still must be docked + pass the binding recovery test; open question whether it beats brute-force docking the existing library. Explore only after the §12.3-12.4 docking baseline is validated. §12.7 exclusion reworded to point here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:43:25 +02:00
Junior B.	7449dbeefb	Scope Phase 2 structure-based binding track into PLAN (§12) Add a scoped (not committed) follow-on track pivoting modality from expression-connectivity to structure-based drug-target binding, motivated by the empirical finding that the expression modality is signal-dead for this task (relational-only supervised AUC = 0.49, chance). §12 covers: the evidence for the pivot, a sickle-specific druggable target shortlist with known-binder positive controls (Hb/voxelotor, PKR/mitapivat, DNMT1/decitabine, LSD1, HDAC, EHMT2, PDE9), method (classical docking baseline -> AF3-class co-folding: Boltz-2/Chai-1/DiffDock), a pre-registered binding recovery test, integration with the expression layer as the real prize, honest pitfalls (binding != efficacy, BCL11A untractable, GPU breaks the all-local assumption), and open decisions before committing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:40:18 +02:00
Junior B.	649f617019	Phase D: supervised cross-disease (0.925 AUC degree-bias mirage) Train GradientBoosting on 300 drugs x 839 GEO disease signatures with Repurposing-Hub indications as labels (432 positives), disease-grouped CV. Finding: 0.925 CV AUC looks like a win but is a MIRAGE. Feature importances are all drug-level (drug_std 0.33, drug_mean 0.30, broadness 0.17); drug-disease connectivity importance = 0.01. The model learned a drug-POPULARITY prior, not disease-specific matching. On held-out sickle it ranks hydroxyurea 231/300 (worse than baseline) and tops out with promiscuous drugs (dexamethasone, methotrexate). Classic degree-bias trap. Connectivity also has ~chance AUC (0.51) for predicting approved indications. Both obvious approaches now fail instructively: unsupervised = specificity ceiling; naive supervised = degree bias. Real progress needs degree- debiased training + much larger clean labels (a research effort). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:31:32 +02:00
Junior B.	0ce688449d	Phase A: reference-library tau (negative result on specificity) Calibrate sickle connectivity against a real disease-signature reference population (Enrichr Disease_Signatures_from_GEO, 141 diseases) instead of random gene-set nulls — proper CMap tau. Finding: hydroxyurea still recovers (rank 25, top 8%), but negative- control specificity is UNCHANGED (2/5; norethindrone + ciprofloxacin still top). The reference-calibrated ranking is nearly identical to the random-null ranking. Third independent fix (after gene-space expansion and composition adjustment) that recovers hydroxyurea but does NOT fix specificity. Conclusion: unsupervised connectivity has a specificity ceiling — it cannot distinguish therapeutic reversal from coincidental transcriptional anti-correlation. Breaking it needs external signal (supervised labels or mechanistic filtering), not more calibration. Disease-signature library cached at data/raw/disease_sigs/. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:19:26 +02:00
Junior B.	b62048614d	Experiment: composition-adjusted signature (negative result) Test whether fixing the cell-composition confound rescues the recovery test. GSE35007 measured WBC/RBC/MCV per sample, so adjust DE directly for them (per-gene OLS: expression ~ disease + WBC + RBC + MCV + age + sex) — a measured-covariate deconvolution, compared vs unadjusted. Finding (negative, informative): adjustment HURT hydroxyurea (rank 20 -> 35, fails top-10%) — it was recovering partly via composition genes — and did NOT fix negative-control specificity (still 2/5). Only gain: top hits become more mechanistically coherent (resveratrol, aspirin enter). Conclusion: cleaning the disease signature does not rescue connectivity; the binding constraint is matching-side (needs a reference-signature library for proper specificity calibration), which is a multi-disease investment. v1.1 signature NOT replaced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:05:08 +02:00
Junior B.	3417f85eb1	v1.1: full gene space + specificity z-score; hydroxyurea recovers Post-hoc improvement after the pre-registered v1 recovery test failed. Two changes, diagnosing v1's failure: - score on the full 12,328-gene LINCS space (week2_lincs_extract.py), lifting signature overlap from 12% to 85% (brings erythroid markers in) - src/scoring.py: KS connectivity + per-drug specificity z-score (spec_z = SDs below a 1,000 random-query null). Primary ranking is now spec_z. (Textbook tau saturated at +/-100 for a coherent query — documented; needs a reference-signature library, a v2 item.) - week3_scoring.py: spec_z primary + WTCS reference + prior-blended - tests: tau/spec_z calibration test; 19 passing - scripts/exp_genespace.py: the BING vs all-12,328 comparison Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%), confirming the v1 failure was the landmark bottleneck not the algorithm. Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite), and negative controls (norethindrone, ciprofloxacin) rank top-3 — connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory, not a confirmatory test; reported as such in recovery_test_report.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 22:57:30 +02:00