Harden cache_msa: pick valid FASTA a3m + strip null bytes

Pilot validated the MSA-reuse architecture (1 server query, cached & reused -- server hammering fixed) but Boltz's INPUT a3m parser crashed on the cached file: KeyError '\x00'. Boltz writes several .a3m files (some binary); we were grabbing a bad one. Fix: select the largest a3m that looks like FASTA (starts with '>'), and strip null bytes before caching. Raise with the candidate list if none qualifies (so we can see what's there). Pilot status: MSA caching + reuse confirmed working; this fixes the a3m-format crash. Re-validate with a 2-drug pilot before the full run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fix screen: compute target MSA once, reuse it, tolerate failures
2026-06-25 00:44:32 +02:00 · 2026-06-25 00:41:51 +02:00 · 2026-06-24 22:23:01 +02:00 · 2026-06-24 19:56:40 +02:00 · 2026-06-24 19:50:33 +02:00 · 2026-06-24 19:41:19 +02:00
35 changed files with 6520 additions and 124 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -31,3 +31,6 @@ env/
 .DS_Store
 .idea/
 .vscode/
 # Local tool binaries (refetch per docs/structure_binding_notes.md)
 tools/
--- a/PLAN.md
+++ b/PLAN.md
@@ -426,3 +426,143 @@ This MVP exists in a broader strategic context that was built through ~7 expert
 - **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.
 The MVP's job is to produce one credible result. Everything else follows from that.
 ---
 ## 12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
 > **Status: scoped, not committed.** This is a follow-on track proposed *after* the MVP and its
 > follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what
 > those experiments empirically showed. Read §9–11 and the experiment commits first.
 ### 12.1 Why pivot modality (the evidence, not a hunch)
 The expression-connectivity approach was built, validated, and pushed hard (gene-space
 expansion, cell-composition deconvolution, reference-library tau, supervised learning). The
 empirical verdict:
 - Connectivity **recovers hydroxyurea** (top ~6–8%) but **cannot achieve specificity** —
  unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Unfixed by four
  independent methods.
 - A supervised model on indication labels hit **0.925 CV AUC** — but it was a **degree-bias
  mirage**: it learned drug popularity, not disease matching (it ranked hydroxyurea *231/300*).
 - The decisive test: with drug-popularity features removed, the model trained on the actual
  drug↔disease connectivity scored **AUC 0.491 — pure chance**. **The expression-connectivity
  modality contains essentially no disease-specific therapeutic signal for this task.**
 This is a *signal* problem, not a *model* problem — no amount of model sophistication (diffusion,
 GNNs, etc.) extracts signal that isn't in the data. The response is to **change modality** to one
 with a strong, physical, drug-specific signal: **does a molecule bind a sickle-relevant target?**
 A drug that binds HbS is mechanistically specific by construction — the opposite of a coincidental
 expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is
 itself a diffusion model) actually has traction, because structure has physical ground truth.
 ### 12.2 Targets (sickle-specific, druggable, structurally characterised)
 Small molecules only (§2). Curated shortlist with public structures and, crucially, **known
 small-molecule binders to serve as positive controls**:
 | Target | Mechanism in sickle | Known binder (positive control) |
 |---|---|---|
 | Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | **voxelotor** (binds α-Val1) |
 | PKR (PKLR, red-cell pyruvate kinase) | Activator → ↓2,3-BPG → ↑O2 affinity | **mitapivat**, etavopivat |
 | DNMT1 | HbF induction (de-repress γ-globin) | **decitabine**, azacitidine |
 | LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
 | HDAC1/2 | HbF induction | vorinostat, panobinostat |
 | EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
 | PDE9 | ↑cGMP, anti-adhesion | PF-04447943 (sickle trial) |
 Hard/excluded for v1: **BCL11A** (transcription factor, no classic pocket — the γ-globin master
 repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.
 ### 12.3 Method (baseline → generative co-folding)
 1. **Prepare structures.** Pull target structures from the PDB; AF3/Boltz-predict any missing.
 2. **Prepare ligands.** Reuse the existing ~300-drug set (we already have canonical SMILES from
   ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
 3. **Dock + score**, in increasing sophistication:
   - **Baseline:** classical docking (AutoDock Vina / smina) — fast, CPU, well-understood.
   - **Generative co-folding:** an open AlphaFold3-class model — **Boltz-2** (predicts the
     protein–ligand complex *and* a binding-affinity estimate, MIT-licensed), **Chai-1**, or
     **DiffDock** (a diffusion model for docking — the legitimate home for the "diffusion"
     instinct). These predict the bound pose directly and tend to beat classical docking.
   - Report both; the baseline keeps us honest about whether the ML model adds anything.
 ### 12.4 Validation (a real recovery test, like §6 Week 4)
 Pre-register before scoring: **the known structure-based sickle drugs must rank as top binders to
 their targets** — voxelotor→hemoglobin, mitapivat→PKR, decitabine→DNMT1. Negative controls
 (unrelated drugs) must *not* bind these pockets. This is a cleaner recovery test than the
 expression one, because binding is mechanistically specific — it should not have the
 coincidental-reverser problem that sank the connectivity approach.
 ### 12.5 The real prize — integrate, don't replace
 The long-term value is **both modalities together**: a candidate that *reverses the disease
 signature* (expression) **and** *binds a sickle-relevant target* (structure) is far more credible
 than either alone. Structure supplies the specificity the expression layer lacks; expression
 supplies the systems-level, target-agnostic view structure lacks. The platform thesis (§11) —
 two databases + a matching engine — extends naturally to a third (structures) feeding the same
 confidence-tiered data layer.
 ### 12.6 Honest pitfalls (do not ignore)
 1. **Binding ≠ efficacy.** A molecule can bind and do nothing therapeutic. Structure-based hits
   are still hypotheses (cf. §9.7).
 2. **Only covers the enzyme/pocket subset.** Sickle's biggest lever (γ-globin reactivation via
   BCL11A) is largely transcriptional and not small-molecule-tractable — structure-based screening
   is blind to it. Be explicit about coverage.
 3. **Docking/affinity accuracy is limited.** Pose prediction is decent; absolute affinity is hard.
   Validate on known binders before trusting novel scores.
 4. **Compute.** AF3-class models are GPU-heavy; the local Mac Studio (§2) may not suffice — this
   track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
 5. **Moat.** Structures and tools are public; the proprietary value is the curated target list,
   the integration with the expression layer, and provenance/tiering — not the docker.
 ### 12.7 Explicitly NOT in this track
 Free energy perturbation / MD-based affinity; covalent docking; **de novo generation of molecules
 as final candidates to synthesise** (design, not repurposing — but see §12.9 for the
 generate-then-retrieve hybrid, which *is* repurposing); BCL11A or any non-pocket target;
 biologics; combination binding.
 ### 12.8 Open decisions before committing
 - **Tooling:** classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend:
  baseline first, for an honest reference — the lesson of the whole expression arc.)
 - **Compute:** secure a GPU environment (the "all local" §2 assumption breaks here).
 - **Scope of v1:** the 7-target shortlist above, or start with just Hb + PKR (the two with the
  cleanest positive controls) to de-risk the harness before scaling targets.
 ### 12.9 Door left open — generative-guided retrieval (generate → match existing)
 A legitimate way to bring generative models *into the repurposing frame* (vs the design frame
 excluded in §12.7): don't generate molecules to synthesise — generate them as a **search beacon**.
 **The idea.** Use a pocket-conditioned generative model (target-conditioned diffusion — e.g.
 TargetDiff, DiffSBDD, Pocket2Mol) to propose idealised binders for a sickle target. Then retrieve
 the **nearest existing drugs** to those proposals by chemical similarity (Tanimoto over Morgan
 fingerprints, or a learned molecular embedding). The retrieved neighbours — real, already-approved
 or clinical molecules — are the repurposing candidates. The generated molecule is never made; it
 only *defines a region of chemical space* worth searching. This is the user-proposed framing and
 it is sound: "generate the ideal, then find what we already have nearby."
 **Why it could add value.** It can point at scaffolds / regions a known-binder-seeded or
 brute-force docking sweep would miss, and it makes the target's binding requirements explicit as
 geometry rather than as a single reference ligand.
 **Honest caveats (why it's a *door*, not a commitment).**
 1. **Generated molecules are often synthetically unrealistic / invalid** — which is exactly why
   they must be used only as beacons, never as candidates.
 2. **Similarity ≠ activity.** Activity cliffs mean a near-neighbour of a good binder can be inert.
   So retrieved neighbours do **not** bypass validation — they must still be docked/scored (§12.3)
   and clear the binding recovery test (§12.4). The generative step *proposes*; it does not *prove*.
 3. **Marginal-value question.** Directly docking the whole existing library (§12.3) already covers
   chemical space. Whether generate-then-retrieve beats that — by efficiency or by surfacing
   non-obvious scaffolds — is an open empirical question and needs a head-to-head before it earns
   real investment.
 4. **Only as good as the pocket conditioning** of the generator, and the chemistry of the target.
 **Status:** explore only *after* the §12.3–12.4 docking harness works and is validated on the
 known binders — same discipline as everywhere else: prove the baseline, then test whether the
 fancier method adds anything.
--- a/data/raw/structures/.gitkeep
+++ b/data/raw/structures/.gitkeep
--- a/docs/data_sources.md
+++ b/docs/data_sources.md
@@ -10,17 +10,63 @@
 | MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
 | Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
 | OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
-| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data | TBD | TBD |
+| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
-| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
+| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
-| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
+| ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
 | LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
 | LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
 | ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
 | FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
 | Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
-## Chosen GEO dataset
+## Chosen GEO datasets (disease signature, Tier A via 2-study concordance)
-_Document the chosen study fully: accession, platform, n per group, publication, why it was
+The signature is the cross-study concordance of two independent whole-blood studies (genes
-selected over the alternatives (GSE53441, GSE35007, …)._
+significant at q<0.05 in **both** with the same direction). Whole-blood tissue was required so
 concordance is meaningful; the two differ by platform and population, which strengthens
 robustness.
 | Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
 |---|---|---|---|---|---|
 | **GSE35007** | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
 | **GSE16728** | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |
 - DE method: per-gene Welch t-test + Benjamini–Hochberg (microarray, pure Python).
 - Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
 - Result: 16,208 genes tested in both → **671 concordant** (444 up / 227 down). Signature =
  top 250 up + all 227 down by worst-case q-value.
 - **Rejected candidates:** GSE53441 (PBMC — tissue mismatch with the whole-blood anchor);
  GSE84633/GSE84634 (PBMC, no healthy controls).
 - **Tier caveat:** GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict
  n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.
 Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
 `scripts/week1_finalize.py` (mygene mapping + persist).
 ## Drug profiles (Week 2)
 300-drug set (`drug_set_v1.csv`), composed and restricted to LINCS-scorable compounds:
 | Inclusion reason | n | Notes |
 |---|---|---|
 | ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
 | related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
 | negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
 | general_sample | 240 | random from LINCS catalog, seed=42 |
 - **LINCS signatures:** per-drug consensus = mean of Level-5 MODZ z-scores across the drug's
  sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH
  phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
 - **ChEMBL:** matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds
  38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
 - **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
  not-single-source rule).
 - **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978
  landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger
  space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from
  LINCS entirely and remain unscoreable.
 - Reproduce: `week2_curate_drugset.py` → `week2_chembl.py` → download Level-5 GCTX →
  `week2_lincs_extract.py` → `week2_assemble.py`.
 ## Licensing note for LINCS
--- a/docs/gpu_plan.md
+++ b/docs/gpu_plan.md
@@ -0,0 +1,112 @@
 # GPU plan — ephemeral AF3-class co-folding for the binding track
 Goal: run AF3-class co-folding (Boltz-2 / DiffDock) on a GPU for the structure-binding track
 (§12), then **pay nothing when idle**. The work is *bursty* (a validation run, then a screen),
 the inputs are *tiny*, so the design optimises for zero idle cost, not for a persistent box.
 ## What actually has to move (small!)
 | Thing | Size | How it gets to the GPU |
 |---|---|---|
 | Code | KB | `git clone` the gitea repo (or mount) |
 | Target structures (PDBs) | a few MB | in the repo / git |
 | Ligands (SMILES, drug set) | KB | in the repo / git |
 | **Model weights** (Boltz-2 / DiffDock) | **~2–6 GB** | downloaded once, **cached in a persistent volume** |
 | Results (poses, scores, RMSDs) | KB–MB | `git push` back / download |
 The 27 GB LINCS data is **not** part of this track — nothing big to upload. The only thing worth
 persisting is the model-weights cache (so we don't re-download = re-pay GPU time every run).
 ## How the model weights persist (the cost-saver)
 A `modal.Volume` is a **named, cloud-backed filesystem that lives independently of any container
 or GPU** — it survives every teardown. Mounted into the function at `/weights`:
 - **Run 1:** `/weights` is empty → the model downloads weights there (the one-time slow cost).
 - **Run 2+:** the same Volume mounts with the files already present → download skipped → **no
  GPU-billed seconds wasted re-fetching 5 GB.**
 Two things make it actually cache:
 1. **Point the downloader at the mount** (weights only persist if written under `/weights`):
   `HF_HOME=/weights/hf` (HuggingFace), `TORCH_HOME=/weights/torch`, `boltz --cache /weights/boltz`.
 2. **Commit semantics:** writes persist on `weights.commit()` (modern Modal also auto-commits on a
   clean exit); other containers see them after `weights.reload()`. Pattern: `reload()` → run →
   `commit()`.
 The Volume itself costs pennies (~$/GB-month of storage), *separate from the GPU* — so caching ~5 GB
 of weights is near-free and saves real GPU time on every subsequent run.
 (Alternative: bake weights into the image at build time via `image.run_function(download)` — fastest
 cold start, but the image rebuilds when weights change. The skeleton uses the Volume approach.)
 ## Provider choice
 | Option | Billing | Idle cost | "Kill" model | Best for |
 |---|---|---|---|---|
 | **Modal** (recommended) | per-second | **$0 (scales to zero)** | automatic — nothing to remember | bursty batch runs |
 | RunPod | per-minute, on-demand/spot | only while pod exists | manual `terminate` | interactive SSH box |
 | Vast.ai | per-minute spot | only while rented | manual destroy | cheapest, more fiddly |
 | Lambda / AWS-GCP spot | per-hour/second | until you stop it | manual stop | if you already have credits |
 **Recommend Modal.** You define the run as a function with a `gpu=` decorator; on call it
 provisions the GPU, runs, and releases it — **there is no GPU to forget to kill**, and you pay only
 for the seconds it ran. That *is* the "kill the GPU to save money" requirement, automated.
 ## The lifecycle (Modal)
 1. **Define image** once: CUDA base + `boltz` (or DiffDock) + `rdkit`, `meeko`, `spyrmsd`, `gemmi`.
 2. **Weights volume**: `modal.Volume` mounted at `/weights`; the model downloads into it on first
   run and is cached forever after (no re-download cost).
 3. **Run**: `modal run gpu/modal_app.py` → provisions GPU → runs the test → returns results to your
   laptop. GPU released the moment the function returns.
 4. **Persist results**: write the returned scores/poses into `data/processed/binding/` and
   `git commit` (small, text).
 5. **Teardown**: nothing to do — Modal scaled to zero. `modal app stop` only if a run is hung.
 RunPod alternative (if you want an interactive box): start pod → `git clone` → run → `git push`
 results → **`runpodctl remove pod <id>`** (or Stop in the UI). Set an **idle auto-terminate** so a
 forgotten box can't bleed money.
 ## What runs on the GPU (in cost order — cheap validation first)
 - **Phase 1 — modality validation (~minutes, ~$1):** co-fold each known binder + 2 negative
  controls (caffeine, hydroxyurea) into each target (Hb, PKR, HDAC2) **with the binding-mode
  cofactors co-folded in** — HDAC2 + catalytic **Zn**, PKR + **FBP/Mg**, Hb + **heme** (as CCD
  ligand entries) — and check the **known binder has the highest Boltz-2 P(binder)** for its own
  target. This is the discrimination Vina couldn't manage precisely because it can't model
  Zn-chelation / cofactors. (Ranking uses P(binder), not the raw affinity value, whose
  sign is version-dependent.) Pose-RMSD vs crystal is a deeper check but needs receptor
  superposition (align predicted protein to crystal, transform ligand) — a later refinement. If
  Phase 1 passes, the modality is real; if not, stop before paying for a screen.
 - **Phase 2 — screen (~tens of minutes, a few $):** run the ~300-drug set (or a focused subset)
  against the sickle targets; rank by Boltz-2 predicted affinity; redo the §12.4 positive-control
  recovery test. Output a ranked CSV, same shape as the connectivity `ranked_candidates`.
 ## Model choice
 - **Boltz-2** (MIT, pip-installable) — predicts the protein–ligand complex **and** a binding
  affinity → directly gives a rankable score. Primary choice. Fits a 24–40 GB GPU for these
  single-domain targets.
 - **DiffDock-L** — lighter, pose-only (needs a separate scorer); fallback if Boltz memory is tight.
 - GPU: an **L4 (24 GB, ~$0.6–0.8/hr)** or **A10/L40S (24–48 GB)** is plenty; no multi-GPU, no A100
  needed for these sizes.
 ## Cost controls (the save-money checklist)
 - Serverless (Modal) → **zero idle cost** by construction; or an **idle-timeout** auto-kill on a box.
 - **Cache weights** in a persistent volume — re-downloading 5 GB on a $1/hr GPU is wasted money.
 - **Validate on one target before screening** — don't pay for a 300-drug screen until Phase 1 passes.
 - Prefer **spot/interruptible** for the batch screen (Phase 2 is restartable).
 - Keep prep (Meeko/RDKit) and result-scoring on the **laptop**; only the model forward pass needs GPU.
 - Estimated total to validate + a first screen: **~$5–15**, not a standing bill.
 ## Repo integration
 - `gpu/modal_app.py` — the Modal app (skeleton committed alongside this plan).
 - Results land in `data/processed/binding/` (gitignored) + a small committed summary.
 - Pin model + weights version in the image for reproducibility.
 ## Next step
 Scaffold `gpu/modal_app.py` for Phase-1 validation (3 known binders), do a dry run locally
 (`modal run --detach`? no — just `modal run`), confirm cost, then Phase 2. Requires a Modal account
 + `pip install modal` + `modal token new` (one-time auth).
--- a/docs/known_limitations.md
+++ b/docs/known_limitations.md
@@ -12,9 +12,11 @@ Source: PLAN.md §9.
   cell lines (MCF7, A375, PC3, …). Signatures for non-oncology diseases may be noisy. A
   field-wide limitation, not unique to Reverso.
-3. **L-glutamine probably has no LINCS signature.** Amino acids and metabolites weren't LINCS
+3. **L-glutamine LINCS coverage — RESOLVED, opposite of expected.** L-glutamine DOES have a
-   priorities. If true, the ground-truth test effectively rests on hydroxyurea alone, which is
+   Phase I signature (hydroxyurea is Phase-II-only) — both ground-truth drugs are scorable. But
-   weaker. _Status: TBD — record the actual finding here once LINCS is pulled (Week 2)._
+   L-glutamine's connectivity is **ambiguous (WTCS=0)**: its up- and down-set enrichments share
   a sign, so it shows no reversal. It ranks 100/300. So the ground-truth test effectively rests
   on hydroxyurea, which itself only reaches top 13% (raw) — see the recovery test report.
 4. **Connectivity scoring surfaces broad-effect drugs as false positives.** HDAC inhibitors and
   broad kinase inhibitors often top connectivity rankings simply because they perturb many
@@ -32,8 +34,28 @@ Source: PLAN.md §9.
 7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
   to test, not discoveries. Use careful language in any write-up.
-## Drug-specific gaps (fill in during Week 2–3)
+8. **Gene-space bottleneck (v1 → fixed in v1.1).** v1 scored on only the 978 landmark genes (12%
   signature overlap) — the main driver of the v1 failure. v1.1 uses the full 12,328-gene space
   (85% overlap) and recovers hydroxyurea. HBG1/HBG2 remain absent from LINCS entirely.
-| Drug | Issue | Handling |
+9. **No reference-signature library for tau.** Textbook CMap tau saturated at ±100 (a coherent
   query always out-connects random gene sets). v1.1 substitutes a per-drug specificity z-score.
   Proper tau needs a library of real reference signatures — a v2 / curated-data item.
 10. **Negative-control criterion may be invalid for connectivity scoring.** Unrelated drugs
    (norethindrone, ciprofloxacin) rank as top specific reversers — connectivity measures
    expression reversal, not therapeutic relatedness.
 ## Recovery test outcome
 Pre-registered test (**v1, confirmatory**): **FAILED** all three criteria (hydroxyurea rank
 40/top 13%; L-glutamine rank 100; 1/5 negative controls bottom-half). Post-hoc (**v1.1,
 exploratory**): hydroxyurea recovers to rank 18 (top 6%, passes), but L-glutamine (rank 213, does
 not reverse) and negative controls (2/5) still fail → overall still FAIL. See
 `recovery_test_report.md`.
 | Drug | Issue | v1.1 status |
 |---|---|---|
-| TBD | e.g. no LINCS signature | flagged "not scored, no signature available" |
+| hydroxyurea | needed the full gene space | rank 18 (top 6%) — recovered post-hoc |
 | L-glutamine | metabolite, no reversal signal (positive connectivity) | rank 213 — genuine negative |
 | neg controls | reverse the generic inflammation signature | 2/5 bottom-half — criterion questionable |
--- a/docs/recovery_test_report.md
+++ b/docs/recovery_test_report.md
@@ -1,13 +1,10 @@
 # Sickle Cell Repurposing — Recovery Test Report
-> **Status: DRAFT SCAFFOLD — not yet run.** Filled in during Week 4 from
+> **Status: COMPLETE (v1 confirmatory + v1.1 exploratory).** Reproduce with `scripts/week1_*` →
-> `notebooks/05_recovery_test.ipynb`. Target length: ~2 pages, readable by a sceptical
+> `week2_*` → `week3_scoring.py` → `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.
 > pharma scientist in 5 minutes.
 ## Pre-registered success criteria
 > ⚠️ **Commit this section to git _before_ running the recovery test** (PLAN.md §8, §10).
 The MVP passes if:
 - Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND**
@@ -15,54 +12,116 @@ The MVP passes if:
  missing LINCS signature, **AND**
 - At least **4 of 5** negative-control drugs rank in the **bottom half**.
-_Pre-registered on: TBD (date of commit)_
+_Pre-registered in the scaffold commit (`b731478`) before any scoring. **Primary (confirmatory)
 analysis = v1**: 978 landmark genes, weighted connectivity score (WTCS). The 5 negative controls
 were pre-specified by category rule without inspecting ranks._
 ---
 ## Section 1 — Methodology
-_5–6 sentences: what was built, the GEO dataset used, the drug-set composition, and the
+A sickle cell disease signature was built from **two whole-blood microarray studies** (GSE35007
-scoring method (CMap connectivity, Lamb 2006 / Subramanian 2017)._
+Illumina SS-vs-AA; GSE16728 Affymetrix patient-vs-control), keeping the **671 genes concordant**
 across both (q<0.05, same direction) → a cross-platform Tier-A signature (250 up / 227 down).
 Profiles were built for **300 small molecules** (2 ground-truth; 32 related-mechanism; 26
 negative controls; 240 random), each with a **LINCS L1000** consensus signature (mean Level-5
 MODZ across cell lines, both CMap phases). Drugs were ranked by **CMap connectivity scoring**
 (Kolmogorov-Smirnov, Lamb 2006 / Subramanian 2017): negative = reversal = candidate.
 **v1 (pre-registered/confirmatory):** scored on the 978 landmark genes with WTCS.
 **v1.1 (post-hoc/exploratory):** after v1 failed, two changes were made to diagnose why — (a)
 score on the full **12,328-gene** space (landmark overlap 12% → 85%, bringing the erythroid
 markers in); (b) add a **per-drug specificity z-score** (`spec_z`): how many SDs the real
 connectivity is below a null of 1,000 random queries of the same size against that drug. Because
 these changes followed inspection of the v1 result, **v1.1 is exploratory, not a confirmatory
 test of the pre-registered hypothesis.**
 ## Section 2 — Recovery test result
-| Drug | Rank | Percentile | Pass? |
+| Criterion | v1 (confirmatory) | v1.1 (exploratory) |
 |---|---|---|---|
 | Hydroxyurea | TBD | TBD | TBD |
 | L-glutamine | TBD | TBD | TBD |
 Negative controls (expected: bottom half):
 | Control drug | Rank | Bottom half? |
 |---|---|---|
-| TBD | TBD | TBD |
+| Hydroxyurea top-10% (≤30) | rank **40** (13.3%) ❌ | rank **18** (6.0%) ✅ |
 | L-glutamine top-25% (≤75) | rank 100, WTCS=0 ❌ | rank 213, spec_z=+0.98 ❌ |
 | ≥4/5 neg controls bottom-half | 1/5 ❌ | 2/5 ❌ |
 | **Overall** | **FAIL** | **FAIL** (but hydroxyurea recovered) |
-**Overall: PASS / FAIL against pre-registered criteria — TBD**
+v1.1 negative controls: clotrimazole 258 ✅, astemizole 211 ✅, azithromycin 142 ❌,
 ethinyl-estradiol 114 ❌, caffeine 77 ❌.
-## Section 3 — Top 10 candidates
+**Honest reading.** The **pre-registered test FAILED (v1).** The post-hoc v1.1 changes
 **recover hydroxyurea** (rank 40 → 18, passing top-10%) — strong evidence that the v1 failure was
 driven by the 978-landmark bottleneck, not the algorithm. But two failures survive into v1.1, and
 both are now precisely diagnosed:
-| Rank | Drug | Score | Known mechanism | Biological plausibility |
+1. **L-glutamine does not reverse the signature** (positive connectivity, spec_z=+0.98). This is
   intrinsic to its LINCS data — a metabolite with no reversal signal — not a coverage gap. More
   genes cannot fix it.
 2. **The negative-control criterion is arguably invalid for connectivity scoring.** Two
   "negative controls" (norethindrone, ciprofloxacin) rank in the top 3 by spec_z. Connectivity
   measures *expression reversal*, not *therapeutic relatedness* — an antibiotic or contraceptive
   can still down-regulate the inflammation genes that dominate the scorable signature. The test
   design conflates the two.
 A note on the calibration: textbook CMap **tau** (percentile vs a reference population) was
 implemented but **saturated at ±100** here, because a coherent real query always out-connects
 random gene sets — proper tau needs a library of *real* reference signatures, which this MVP
 lacks. The continuous `spec_z` is the workable substitute.
 ## Section 3 — Top 10 candidates (v1.1 spec_z)
 | Rank | Drug | spec_z | Inclusion | Read |
 |---|---|---|---|---|
-| 1 | TBD | TBD | TBD | TBD |
+| 1 | reserpic-acid | −3.80 | random | reserpine metabolite; non-obvious |
 | 2 | norethindrone | −3.78 | **negative control** | false positive (see §2) |
 | 3 | ciprofloxacin | −3.61 | **negative control** | false positive |
 | 4 | resveratrol | −3.46 | related-mechanism | antioxidant studied in SCD — coherent |
 | 5 | BRD-K57490754 | −3.37 | random | tool compound |
 | 6 | anastrozole | −3.27 | random | aromatase inhibitor |
 | 7–10 | BRD-* / palmitoylethanolamide | ~−3.1 | random | mostly tool compounds |
-_Note: HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due
+That two negative controls outrank hydroxyurea is the single most informative result here — see §4.
 to widespread expression effects — flag these honestly (PLAN.md §9.4)._
-## Section 4 — One non-obvious candidate worth investigating
+## Section 4 — One non-obvious result worth investigating
-_A single paragraph on the most interesting result. Language must be careful: this is a
+The most useful finding is **not** a candidate drug but the **negative-control failure**:
-computational hypothesis to test, not a discovery (PLAN.md §9.7)._
+unrelated drugs (norethindrone, ciprofloxacin) score as strong specific reversers. This is a
 real, generalizable lesson — for a signature whose *scorable* portion is generic
 inflammation/metabolic genes, connectivity rewards any broad transcriptional perturbation that
 touches those genes. The honest implication: **this signature is not specific enough to
 discriminate true repurposing candidates from incidental expression reversers.** Of the
 plausibly-real hits, **resveratrol (rank 4)** — an antioxidant with prior sickle cell literature
 — is the most defensible, but it is a hypothesis, not a discovery.
 ## Section 5 — Honest limitations
- Cell-composition confound in whole-blood expression (PLAN.md §9.1)
+1. **Pre-registered test failed; the pass is post-hoc.** v1.1's hydroxyurea recovery is
- LINCS L1000 cell-line limitations — landmark genes measured mostly in cancer lines (§9.2)
+   exploratory and must be re-validated on a held-out disease before any claim is made.
- Missing signatures (e.g. L-glutamine) (§9.3)
+2. **Missing HbF axis** — HBG1/HBG2 are absent from LINCS entirely (not just landmarks), so the
- No mechanistic validation layer — discovery hypothesis generation, not validated prediction (§9.6)
+   pathway hydroxyurea acts on can never be scored by this method.
 3. **Signature specificity** — scorable genes are inflammation/metabolic; negative controls
   reverse them too. Connectivity ≠ therapeutic relatedness.
 4. **Cell-composition confound** — the whole-blood signature is reticulocyte-dominated.
 5. **LINCS cancer-cell-line bias**, and **no reference-signature library** for proper tau.
 6. **No mechanistic validation** — all hits are computational hypotheses.
 ## Section 6 — What v2 would fix
- Cell-type deconvolution of the disease signature
+- **A reference-signature library** to make tau (proper specificity calibration) work — the
- Knowledge graph fallback for missing-signature drugs
+  single biggest fix to the negative-control problem, and a direct use of the curated-data moat.
- A second disease to test generalization (the real test — sickle cell results do not prove
+- **Cell-type deconvolution** + a non-globin-depleted RNA-seq study to recover a more specific,
-  the platform generalizes, §9.5)
+  HbF-containing signature.
 - **Signature prediction / mechanism graph** to score genes with no LINCS measurement.
 - **A second disease** to test generalization and to honestly re-validate the v1.1 method
  (PLAN §9.5).
 ---
 ### Bottom line
 The pre-registered recovery test **failed**. Post-hoc diagnosis shows the dominant cause was a
 fixable gene-space bottleneck — correcting it **recovers hydroxyurea** — but also surfaces a
 deeper, genuine limitation: this whole-blood signature is **not specific enough** for
 connectivity scoring to separate real candidates from incidental reversers (negative controls
 rank at the top). The MVP's real deliverable is a precise, honest map of *what it takes to make
 this method work*: a more specific (deconvolved, HbF-containing) signature and a reference library
 for calibration — exactly the curated-data investments the platform thesis is built on.
--- a/docs/results/HDAC2_vorinostat_pred.pdb
+++ b/docs/results/HDAC2_vorinostat_pred.pdb
--- a/docs/results/phase1_affinity.csv
+++ b/docs/results/phase1_affinity.csv
@@ -0,0 +1,10 @@
 target,ligand,affinity,prob_binder,is_known_binder
 hemoglobin,voxelotor,0.16870097815990448,0.4566290080547333,True
 hemoglobin,caffeine,1.5093848705291748,0.34217822551727295,False
 hemoglobin,hydroxyurea,0.6418474316596985,0.06856877356767654,False
 PKR,mitapivat,1.187251329421997,0.3238581717014313,True
 PKR,caffeine,2.992832899093628,0.2582802474498749,False
 PKR,hydroxyurea,2.444709300994873,0.39834722876548767,False
 HDAC2,vorinostat,-1.7771220207214355,0.9993993043899536,True
 HDAC2,caffeine,2.3925399780273438,0.11900843679904938,False
 HDAC2,hydroxyurea,1.284231424331665,0.7654422521591187,False
--- a/docs/results/screen_HDAC2_pilot.csv
+++ b/docs/results/screen_HDAC2_pilot.csv
@@ -0,0 +1,25 @@
 rank,drug,P_binder,affinity,inclusion_reason
 1,trichostatin-a,0.9994845986366272,-2.883908271789551,related_mechanism
 2,vorinostat,0.9994641542434692,-1.7357702255249023,related_mechanism
 3,panobinostat,0.9983962774276733,-2.4892501831054688,related_mechanism
 4,belinostat,0.9970909357070923,-2.102327823638916,related_mechanism
 5,scriptaid,0.9957252740859985,-1.5838574171066284,related_mechanism
 6,mocetinostat,0.9910210371017456,-1.3829972743988037,related_mechanism
 7,entinostat,0.9903219938278198,-0.6888977289199829,related_mechanism
 8,apicidin,0.9882931113243103,-2.2579169273376465,related_mechanism
 9,valproic-acid,0.9134805202484131,0.3317195773124695,related_mechanism
 10,hydroxyurea,0.77649986743927,1.2352395057678223,ground_truth
 11,curcumin,0.6803375482559204,0.9697343707084656,related_mechanism
 12,sulforaphane,0.5829939842224121,0.6017141342163086,related_mechanism
 13,resveratrol,0.5800595283508301,1.1647570133209229,related_mechanism
 14,tadalafil,0.5426475405693054,0.21663367748260498,related_mechanism
 15,glutamine,0.4674023389816284,2.029467821121216,ground_truth
 16,quercetin,0.45189985632896423,0.34488344192504883,related_mechanism
 17,romidepsin,0.4316568374633789,-0.8200547099113464,related_mechanism
 18,decitabine,0.38500306010246277,0.8381982445716858,related_mechanism
 19,azacitidine,0.31987687945365906,0.6874549388885498,related_mechanism
 20,sildenafil,0.15390564501285553,0.26623794436454773,related_mechanism
 21,thalidomide,0.1190553605556488,1.9194087982177734,related_mechanism
 22,pomalidomide,0.11849743872880936,1.7003729343414307,related_mechanism
 23,lenalidomide,0.09446469694375992,1.9872398376464844,related_mechanism
 24,dexamethasone,0.031893711537122726,0.5724264979362488,related_mechanism
--- a/docs/structure_binding_notes.md
+++ b/docs/structure_binding_notes.md
@@ -0,0 +1,175 @@
 # Structure-based binding track — working notes
 Branch `structure-based-binding`. Implements PLAN §12. Baseline-first, start with the two cleanest
 targets (Hemoglobin + PKR), de-risk the harness before scaling.
 ## Status (2026-06-23)
 **Toolchain check (PLAN §12.6 pitfall 4, confirmed real):**
 - ✅ RDKit installs on ARM Mac — ligand side ready.
 - ❌ AutoDock Vina does NOT pip-install on ARM Mac; no docking binary available. Docking (§12.3)
  is **blocked on toolchain** — must resolve via conda/micromamba (`vina`/`smina`), a GPU AF3-class
  model (Boltz-2/Chai-1/DiffDock), or an x86 Vina binary under Rosetta.
 **Structures obtained:** `5E83` (hemoglobin + voxelotor), `8XFD` (PKR + mitapivat) in
 `data/raw/structures/`.
 **Step 0 — ligand-based retrieval baseline (`scripts/binding_ligand_baseline.py`):**
 RDKit Tanimoto of our 300 drugs vs known sickle binders.
 - Engine VALIDATED on in-set classes: `decitabine`→azacitidine (0.62); `vorinostat`→scriptaid
  (0.42), belinostat (0.28). Correctly clusters DNMT1 / HDAC HbF-inducers.
 - But voxelotor / mitapivat have **no analog** in our set (max Tanimoto ~0.20–0.26). A 300-drug
  library is too sparse to contain look-alikes of distinctive scaffolds.
 **Takeaways:**
 1. Ligand retrieval works but needs a **bigger drug library** to be useful for distinctive targets.
 2. The targets without in-set analogs (Hb, PKR) need **actual docking** (§12.3) — which scores
   binding directly, no look-alike required. That is the gating next step, and it needs the
   toolchain solved.
 ## Step 1 — docking baseline (2026-06-24)
 **Toolchain SOLVED on ARM Mac:** AutoDock Vina 1.2.5 mac binary (`tools/vina`, runs under Rosetta)
 + open-babel (brew) for prep. Docking runs end-to-end (`scripts/dock_positive_controls.py`).
 Co-crystal ligands identified: 5L7 = voxelotor (5E83), WV2 = mitapivat (8XFD).
 **Positive-control cross-docking — inconclusive, and instructively so.** Affinities (kcal/mol):
 | ligand | hemoglobin | PKR |
 |---|---|---|
 | voxelotor | −8.1 | −9.3 |
 | mitapivat | −10.0 | −11.2 |
 | decitabine | −6.6 | −7.0 |
 | hydroxyurea | −3.9 | −3.6 |
 | caffeine | −6.1 | −6.4 |
 The scores rank almost perfectly by **molecular size** (mitapivat > voxelotor > decitabine/caffeine
 > hydroxyurea) in *both* pockets — mitapivat wins even on hemoglobin, which it doesn't target. So
 raw Vina affinity is confounded by ligand size and per-pocket stickiness; it cannot yet
 distinguish target-specific binding. This is the **docking analog of the connectivity specificity
 problem** — raw scores carry a systematic bias (size here, broadness there) that masquerades as
 signal. voxelotor *does* dock to Hb (−8.1, a real score); the cross-target test just isn't the
 right validation.
 ## Step 2 — redocking-RMSD validation FAILS across the board (2026-06-24)
 Redocked each co-crystal ligand into its own structure (`scripts/dock_validate.py`); RMSD vs
 crystal pose via obrms:
 | redock | RMSD | note |
 |---|---|---|
 | voxelotor → Hb (5E83) | NA | covalent binder (Schiff base, αVal1) — out of scope §12.7 |
 | mitapivat → PKR (8XFD) | 8.2 Å | allosteric, cofactor (FBP/Mg) stripped |
 | **vorinostat → HDAC2 (4LXZ, Zn kept)** | **7.9 Å** | classical non-covalent target — should have worked |
 **The clean target also failing means this is a systematic PIPELINE-QUALITY problem, not target
 choice.** The cheap Vina + open-babel setup produces scores but does not reproduce known binding
 geometry, so its affinities are not yet trustworthy. Ligand efficiency (affinity / heavy atoms)
 also doesn't fix it — it over-corrects, ranking tiny hydroxyurea (−0.78) "best".
 Likely causes (in priority order):
 1. **Low-quality receptor prep** — open-babel `-xr` is not production docking prep. Need
   AutoDockTools `prepare_receptor` or **Meeko** + `reduce`/pdb2pqr for protonation, charges, and
   proper AutoDock atom typing.
 2. **Ligand prep** — should use Meeko (correct rotatable bonds / typing), not bare obabel `--gen3d`.
 3. **RMSD metric** — obrms superimposes before RMSD; redocking validation wants symmetry-corrected
   RMSD **in place** (receptor frame). Worth confirming with an in-place metric.
 **Honest takeaway:** consistent with the whole project — the *quick* version of each method runs
 but doesn't survive honest validation. Credible structure-based docking needs production prep
 tooling (Meeko/ADFR), which is the real next investment for this track.
 ## Step 3 — production prep helps, but classical docking is the wrong tool here (2026-06-24)
 `scripts/dock_production.py`: Meeko ligand prep (proper rotatable-bond/AD typing) + in-place
 symmetry-corrected RMSD (spyrmsd, not obrms which superimposes). On the clean HDAC2/vorinostat
 target (Zn kept):
 - **7.9 Å → 4.76 Å** with proper ligand prep + correct metric. Prep and metric genuinely mattered.
 - But still FAIL (>2 Å). The residual is the deeper problem: **vorinostat binding is defined by its
  hydroxamate chelating the catalytic Zn, and Vina has no metal-coordination term** — it cannot
  score the interaction that determines the pose.
 **The real finding: sickle's druggable targets bind via non-classical chemistry that classical
 docking handles poorly** — Hb/voxelotor (covalent), PKR/mitapivat (allosteric + cofactor),
 HDAC/vorinostat (Zn chelation). This is the target landscape, not bad luck. AutoDock Vina is the
 wrong tool for it.
 **Redirect:** the modality that DOES handle covalent/metal/induced-fit binding is **data-driven
 AF3-class co-folding** (Boltz-2 / Chai-1 / DiffDock — they learn these modes from the PDB). That is
 the indicated next tool for this disease — and it's gated by the **24 GB local memory ceiling**
 (PLAN §12.6 pitfall 4): needs a cloud GPU or a bigger box. The "GPU breaks all-local" prediction is
 now the binding constraint of the whole track.
 ## Step 4 — AF3-class co-folding (Boltz-2 on Modal GPU) WORKS on the Zn case (2026-06-24)
 Ran Phase 1 on Modal (L4, serverless) — `gpu/modal_app.py`, ~$0.60–0.80. Co-folded each known
 binder + 2 negatives into each target WITH the binding-mode cofactors (HDAC2+Zn, PKR+FBP/Mg,
 Hb+heme). Ranked by Boltz-2 P(binder):
 | target | known binder P(binder) | negatives | verdict |
 |---|---|---|---|
 | **HDAC2 (+Zn)** | vorinostat **0.9994** | caffeine 0.12, hydroxyurea 0.77 | **PASS — decisive** |
 | hemoglobin (+heme) | voxelotor 0.46 | caffeine 0.34, hydroxyurea 0.07 | PASS (weak) |
 | PKR (+FBP/Mg) | mitapivat 0.32 | hydroxyurea 0.40 (beat it) | FAIL |
 **Headline: HDAC2 + zinc — the exact case Vina failed (7.9 Å redock, no metal term) — co-folding
 NAILS** (vorinostat 0.999 vs negatives ~0.1). The data-driven model handles the Zn-chelation
 chemistry classical docking could not. The modality pivot is validated on its decisive test.
 The first clear positive result in the project after a long string of honest negatives.
 Notes: (1) affinity sign confirmed — vorinostat has the lowest affinity_pred_value (−1.78,
 strongest) AND highest P(binder); ranking by max(affinity) would be backwards (the P(binder) fix
 was necessary). (2) 2/3: Hb weak (covalent/tetramer, as predicted), PKR miss (allosteric pocket).
 (3) Engineering: had to add cuequivariance kernels to the image; serialize (max_containers=1) so
 the weights download once (parallel containers corrupted the checkpoint).
 ## Step 5 — pose-RMSD confirmation: co-folding reproduces the geometry (2026-06-24)
 `scripts/pose_rmsd.py`: superpose predicted HDAC2 onto crystal 4LXZ (Ca), transform the predicted
 vorinostat + Zn, compare poses (spyrmsd, in-place).
 | metric | co-folding | classical Vina |
 |---|---|---|
 | protein fold, Ca RMSD over 366 res | **0.14 A** | — |
 | **vorinostat pose RMSD** | **0.21 A** PASS | 7.9 A FAIL |
 | catalytic Zn placement | 2.73 A (minor) | no metal term |
 Co-folding reproduces the fold AND the vorinostat binding pose to **0.21 A (crystal-accurate)** on
 the exact Zn-chelation case Vina was off by 7.9 A. HDAC2 is now validated on BOTH axes: affinity
 (P(binder)=0.999) and geometry (0.21 A). Minor blemish: the Zn ion itself lands ~2.7 A off (ligand
 perfect, metal slightly less). The structure-binding modality is comprehensively validated on its
 decisive metal-coordination case.
 ## Step 6 — Phase 2 screen pilot (HDAC2): recovers the inhibitor class decisively (2026-06-24)
 `modal run gpu/modal_app.py::screen --limit 24` — co-fold 24 drugs vs HDAC2 (+Zn), parallel
 (~10-wide, cached weights), ranked by P(binder). ~$1.3.
 **Top 9 = all HDAC inhibitors** (trichostatin-A, vorinostat, panobinostat, belinostat, scriptaid,
 mocetinostat, entinostat, apicidin all P≥0.99; valproic-acid 0.91), then a clean drop to
 hydroxyurea 0.78 and non-HDAC drugs sinking to dexamethasone 0.03. The model captures the
 **structure-activity gradient**: potent Zn-chelating hydroxamates ~0.99 vs weak fatty-acid
 valproate 0.91 vs DNMT inhibitors / IMiDs / steroid near 0. The screen recovers the right
 pharmacological class at scale.
 **Honest false negative:** romidepsin (potent HDAC inhibitor) ranked low (0.43) — it is a cyclic
 depsipeptide PRODRUG (must be reduced to expose its Zn-binding thiol), which co-folding doesn't
 model. The screen mishandles non-standard chemotypes (prodrugs, macrocycles).
 The screening pipeline is validated. Next: run the full set (incl. the 240 random + negatives) to
 hunt for NON-obvious HDAC2 binders (the actual discovery run), ~$15-20.
 ## Next steps
 - [ ] Full screen (300 drugs) vs HDAC2 — discovery run for non-obvious binders.
 - [ ] Investigate PKR: allosteric site may need the full assembly / better pocket definition.
 - [ ] Phase 2 screen: rank the ~300-drug set against HDAC2 (the validated target) by P(binder);
  positive-control recovery test at screen scale.
 - [ ] Add a one-time weight-warmup function so post-cache runs go back to fast parallel safely.
 - [ ] (optional) Salvage one classical Vina case: PKR with FBP/Mg cofactors RETAINED, to confirm
  the harness can validate on a non-metal sickle target.
 - [ ] Production receptor prep (Meeko mk_prepare_receptor + protonation) if staying with Vina.
 - [ ] §12.9 generative beacon — only after a validated scoring function exists.
 > **Hardware note:** this machine is **24 GB** unified memory (not the 96 GB PLAN §2 assumed),
 > which caps local AF3-class model inference. Classical docking (above) is unaffected.
--- a/gpu/modal_app.py
+++ b/gpu/modal_app.py
@@ -0,0 +1,329 @@
 """Ephemeral GPU runner for AF3-class co-folding (PLAN §12, docs/gpu_plan.md).
 Serverless: `modal run gpu/modal_app.py` provisions a GPU, runs Phase 1, releases the GPU — zero
 idle cost. Model weights cache in a persistent Volume so we never re-pay GPU time to re-download.
 Phase 1 (positive-control recovery test, §12.4): co-fold each known binder + a couple of negative
 controls against each sickle target and rank by Boltz-2 predicted affinity. The known binder
 should win its own target — the test Vina couldn't pass on metal/covalent/allosteric modes.
 Affinity ranking avoids the receptor-alignment that pose-RMSD would need (a later refinement).
 Setup (one-time): `pip install modal && modal token new`.
 Run: `modal run gpu/modal_app.py`.
 Helpers below the GPU function run locally (no GPU) and are import-safe for testing.
 """
 from __future__ import annotations
 import json
 import subprocess
 from pathlib import Path
 import modal
 app = modal.App("reverso-binding")
 image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("git", "wget")
    # Boltz-2 needs NVIDIA cuequivariance kernels (cuda 12) for inference, plus rdkit/numpy.
    .pip_install("boltz", "cuequivariance-torch", "cuequivariance-ops-torch-cu12", "rdkit", "numpy")
 )
 weights = modal.Volume.from_name("reverso-binding-weights", create_if_missing=True)
 WEIGHTS = "/weights"
 # target -> (PDB id, crystal-ligand resname, drug name, cofactor/metal CCD codes to co-fold).
 # The cofactors are the binding-mode determinants Vina couldn't model: HDAC2 needs the catalytic
 # Zn (vorinostat chelates it), PKR the allosteric FBP + Mg, hemoglobin the heme. Co-folding them
 # in as CCD ligands is the whole point of the AF3-class pivot. The same cofactors are present when
 # co-folding the negatives into that target, for a fair comparison.
 TARGETS = {
    "hemoglobin": ("5E83", "5L7", "voxelotor", ["HEM"]),
    "PKR": ("8XFD", "WV2", "mitapivat", ["FBP", "MG"]),
    "HDAC2": ("4LXZ", "SHH", "vorinostat", ["ZN"]),
 }
 NEGATIVES = ["caffeine", "hydroxyurea"]
 # Honest limitation: hemoglobin's voxelotor site sits at the tetramer centre and the bond is
 # covalent (Schiff base) — a single-chain + heme model only approximates it, so Hb is the weak
 # case. HDAC2 (Zn chelation) and PKR (allosteric + cofactor) are the real tests of whether
 # co-folding handles the modes classical docking could not.
 # --------------------------------------------------------------------------- local helpers (CPU)
 def fetch_pdb(pdb: str, struct_dir: Path = Path("data/raw/structures")) -> Path:
    """Return a local PDB path, downloading from RCSB if absent (keeps the GPU run self-contained)."""
    import requests
    p = struct_dir / f"{pdb}.pdb"
    if not p.exists():
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_bytes(requests.get(f"https://files.rcsb.org/download/{pdb}.pdb", timeout=60).content)
    return p
 def binding_chain_sequence(pdb: str, lig_resname: str) -> str:
    """One-letter sequence of the protein chain nearest the crystal ligand (the binding chain)."""
    import gemmi
    st = gemmi.read_structure(str(fetch_pdb(pdb)))
    model = st[0]
    lig_atoms = [a.pos for ch in model for res in ch if res.name == lig_resname for a in res]
    if not lig_atoms:
        raise ValueError(f"ligand {lig_resname} not found in {pdb}")
    lig_center = gemmi.Position(
        sum(p.x for p in lig_atoms) / len(lig_atoms),
        sum(p.y for p in lig_atoms) / len(lig_atoms),
        sum(p.z for p in lig_atoms) / len(lig_atoms),
    )
    best, best_d = None, 1e9
    for ch in model:
        poly = ch.get_polymer()
        if len(poly) < 20:
            continue
        d = min((a.pos.dist(lig_center) for res in ch for a in res), default=1e9)
        if d < best_d:
            best, best_d = poly, d
    if best is None:
        raise ValueError(f"no protein chain in {pdb}")
    return gemmi.one_letter_code(best.extract_sequence()).upper().replace("X", "")
 def pubchem_smiles(name: str) -> str:
    import requests
    for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
        try:
            d = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}"
                             f"/property/{prop}/JSON", timeout=30).json()["PropertyTable"]["Properties"][0]
            if prop in d:
                return d[prop]
        except Exception:
            continue
    raise ValueError(f"no SMILES for {name}")
 def build_boltz_yaml(protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str] | None = None,
                     msa_path: str | None = None) -> str:
    """Boltz-2 YAML: protein + the drug ligand (affinity binder) + any cofactor/metal CCD ligands.
    Cofactors/ions are added as `ligand` entries referencing their CCD code (e.g. ZN, MG, FBP,
    HEM) so the model places the metal/cofactor that defines the binding mode. Affinity is
    predicted on the drug ligand L only. If ``msa_path`` is given, the protein reuses a precomputed
    MSA (so a screen against one target queries the MSA server ONCE, not per drug).
    """
    prot = ["  - protein:", "      id: A", f"      sequence: {protein_seq}"]
    if msa_path:
        prot.append(f"      msa: {msa_path}")
    lines = ["version: 1", "sequences:"] + prot + \
            ["  - ligand:", "      id: L", f"      smiles: '{ligand_smiles}'"]
    for i, ccd in enumerate(cofactor_ccds or []):
        lines += ["  - ligand:", f"      id: M{i}", f"      ccd: {ccd}"]
    lines += ["properties:", "  - affinity:", "      binder: L"]
    return "\n".join(lines) + "\n"
 # ------------------------------------------------------------------------------- GPU function
 # max_containers caps parallel fan-out (cost control). The download race that corrupts the
 # checkpoint only happens on a COLD volume; once weights are cached+committed (Phase 1 did this),
 # parallel containers just reload them, so a screen can safely run ~10-wide.
@app.function(gpu="L4", image=image, volumes={WEIGHTS: weights}, timeout=1800)
 def cache_msa(label: str, protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str]) -> str:
    """Compute the target's MSA ONCE (via the server) and cache the a3m on the Volume.
    A screen reuses one target protein for all drugs, so we query the MSA server a single time
    here, then every cofold() reuses the cached a3m (no server hammering). Returns the Volume path
    of the cached a3m. Doubles as the weight/CCD warmup.
    """
    import os
    import shutil
    os.environ["HF_HOME"] = f"{WEIGHTS}/hf"
    boltz_cache = f"{WEIGHTS}/boltz"
    os.makedirs(boltz_cache, exist_ok=True)
    weights.reload()
    work = Path("/tmp") / f"{label}_msa"
    work.mkdir(parents=True, exist_ok=True)
    (work / "in.yaml").write_text(build_boltz_yaml(protein_seq, ligand_smiles, cofactor_ccds))
    out = work / "out"
    subprocess.run(["boltz", "predict", str(work / "in.yaml"), "--use_msa_server",
                    "--cache", boltz_cache, "--out_dir", str(out), "--output_format", "pdb"], check=True)
    # Pick the largest a3m that actually looks like FASTA/a3m text (Boltz writes several files;
    # some are binary and crash its INPUT parser with KeyError '\x00'). Strip null bytes too.
    candidates = sorted(out.rglob("*.a3m"), key=lambda p: p.stat().st_size, reverse=True)
    chosen = None
    for c in candidates:
        raw = c.read_bytes()
        if raw[:1] == b">" or b"\n>" in raw[:512]:
            chosen = raw.replace(b"\x00", b"")
            break
    if chosen is None:
        weights.commit()
        raise RuntimeError(f"no valid a3m; candidates={[(str(p), p.stat().st_size) for p in candidates]}")
    msa_dir = Path(WEIGHTS) / "msa"
    msa_dir.mkdir(parents=True, exist_ok=True)
    dest = msa_dir / f"{label}.a3m"
    dest.write_bytes(chosen)
    weights.commit()
    return str(dest)
@app.function(gpu="L4", image=image, volumes={WEIGHTS: weights}, timeout=3600, max_containers=20)
 def cofold(label: str, protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str],
           msa_path: str | None = None) -> dict:
    """Co-fold one complex (protein + drug + cofactors) on the GPU; return affinity + P(binder).
    Weights persist on the mounted Volume. If ``msa_path`` is given, reuse the cached target MSA
    (no server query); else query the MSA server. GPU is released the moment this returns.
    """
    import os
    os.environ["HF_HOME"] = f"{WEIGHTS}/hf"
    boltz_cache = f"{WEIGHTS}/boltz"
    os.makedirs(boltz_cache, exist_ok=True)
    weights.reload()
    work = Path("/tmp") / label
    work.mkdir(parents=True, exist_ok=True)
    (work / "in.yaml").write_text(build_boltz_yaml(protein_seq, ligand_smiles, cofactor_ccds, msa_path))
    out = work / "out"
    cmd = ["boltz", "predict", str(work / "in.yaml"),
           "--cache", boltz_cache, "--out_dir", str(out), "--output_format", "pdb"]
    if not msa_path:
        cmd.insert(3, "--use_msa_server")
    try:
        subprocess.run(cmd, check=True)
    finally:
        weights.commit()  # persist downloaded weights/CCD even if this run fails, so retries skip it
    # Affinity is written to a JSON under out/predictions/<name>/; parse defensively (keys vary).
    aff = {"affinity_pred_value": None, "affinity_probability_binary": None}
    for jf in out.rglob("affinity*.json"):
        data = json.loads(jf.read_text())
        for k in aff:
            if k in data:
                aff[k] = data[k]
        break
    cif = next(out.rglob("*_model_0.pdb"), None) or next(out.rglob("*.pdb"), None)
    return {"label": label, "affinity": aff["affinity_pred_value"],
            "prob_binder": aff["affinity_probability_binary"],
            "structure": cif.read_text() if cif else None}
 # ------------------------------------------------------------------------------- driver (local)
@app.local_entrypoint()
 def pose() -> None:
    """Save the predicted HDAC2/vorinostat complex for local pose-RMSD validation.
    Run: `modal run gpu/modal_app.py::pose`. Weights are cached, so this is one fast GPU call.
    The returned PDB (protein + vorinostat + Zn) is scored locally against 4LXZ by
    scripts/pose_rmsd.py (align predicted protein to crystal, compare ligand).
    """
    target = "HDAC2"
    pdb, res, drug, cofactors = TARGETS[target]
    seq = binding_chain_sequence(pdb, res)
    r = cofold.remote(f"{target}_{drug}_pose", seq, pubchem_smiles(drug), cofactors)
    out = Path("data/processed/binding"); out.mkdir(parents=True, exist_ok=True)
    if r.get("structure"):
        dest = out / f"{target}_{drug}_pred.pdb"
        dest.write_text(r["structure"])
        print(f"saved {dest}; affinity={r['affinity']}, P(binder)={r['prob_binder']}")
    else:
        print("no structure returned")
@app.local_entrypoint()
 def screen(limit: int = 0) -> None:
    """Phase 2: co-fold the drug set against the validated target (HDAC2 + Zn), rank by P(binder).
    `modal run gpu/modal_app.py::screen --limit 24`  (pilot; omit --limit for the full set).
    Recovery check at scale: the known HDAC inhibitors (related_mechanism) should rank top.
    Weights are cached, so this fans out ~10-wide.
    """
    import csv
    import pandas as pd
    target = "HDAC2"
    pdb, res, _drug, cofactors = TARGETS[target]
    seq = binding_chain_sequence(pdb, res)
    df = pd.read_csv("data/processed/drug_set_v1.csv")
    df = df[df["canonical_smiles"].notna() & (df["canonical_smiles"] != "-666")].copy()
    if limit:  # pilot: prioritise mechanism + controls (incl. the HDAC inhibitors) then fill
        pri = df[df["inclusion_reason"].isin(["ground_truth", "related_mechanism", "negative_control"])]
        df = pd.concat([pri, df.drop(pri.index)]).head(limit)
    # 1) compute the target MSA ONCE (also warms weights/CCD), then reuse it for every drug
    print(f"computing {target} MSA once (server) ...")
    msa_path = cache_msa.remote(target, seq, pubchem_smiles("vorinostat"), cofactors)
    print(f"cached MSA: {msa_path}")
    # 2) screen all drugs reusing the cached MSA; tolerate per-drug failures (no abort)
    jobs = [(f"{target}__{r.pert_iname}", seq, r.canonical_smiles, cofactors, msa_path)
            for r in df.itertuples()]
    print(f"screening {len(jobs)} drugs vs {target} (+{cofactors}), reusing cached MSA")
    results = list(cofold.starmap(jobs, return_exceptions=True))
    by = {j[0].split("__")[1]: (r if isinstance(r, dict) else None) for j, r in zip(jobs, results)}
    n_fail = sum(1 for v in by.values() if v is None)
    if n_fail:
        print(f"  ({n_fail}/{len(jobs)} drugs failed and were skipped)")
    reason = dict(zip(df["pert_iname"], df["inclusion_reason"]))
    rows = [{"drug": d, "P_binder": (r or {}).get("prob_binder"),
             "affinity": (r or {}).get("affinity"), "inclusion_reason": reason.get(d)}
            for d, r in by.items()]
    rows = [x for x in rows if x["P_binder"] is not None]
    rows.sort(key=lambda x: x["P_binder"], reverse=True)
    out = Path("data/processed/binding"); out.mkdir(parents=True, exist_ok=True)
    with (out / f"screen_{target}.csv").open("w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["rank", "drug", "P_binder", "affinity", "inclusion_reason"])
        w.writeheader()
        for i, x in enumerate(rows, 1):
            w.writerow({"rank": i, **x})
    print(f"\nscreened {len(rows)} drugs vs {target}; top 15 by P(binder):")
    for i, x in enumerate(rows[:15], 1):
        print(f"  {i:2d}  {x['drug']:20s} P={x['P_binder']:.3f}  [{x['inclusion_reason']}]")
    hdac_like = [i for i, x in enumerate(rows, 1) if x["inclusion_reason"] == "related_mechanism"]
    if hdac_like:
        print(f"\nrelated-mechanism (HDAC inhibitors etc.) ranks: {hdac_like[:10]}")
@app.local_entrypoint()
 def main() -> None:
    """Fan out one GPU call per (target, ligand) pair; tabulate affinities; positive-control test."""
    jobs = []  # (label, protein_seq, smiles, cofactor_ccds)
    for target, (pdb, res, drug, cofactors) in TARGETS.items():
        seq = binding_chain_sequence(pdb, res)
        for lig in [drug, *NEGATIVES]:
            jobs.append((f"{target}_{lig}", seq, pubchem_smiles(lig), cofactors))
    cofactor_summary = {t: c[3] for t, c in TARGETS.items()}
    print(f"co-folding {len(jobs)} complexes ({len(TARGETS)} targets x {1+len(NEGATIVES)} ligands); "
          f"cofactors per target: {cofactor_summary}")
    results = list(cofold.starmap(jobs))
    by = {r["label"]: r for r in results}
    print(f"\n{'target':12s}{'ligand':14s}{'affinity':>10s}{'P(binder)':>11s}")
    out_rows = []
    for target, (pdb, res, drug, cofactors) in TARGETS.items():
        prob = {}  # rank by P(binder): unambiguous (higher = more likely a binder). Boltz
        for lig in [drug, *NEGATIVES]:   # affinity_pred_value is ~log(IC50) (lower=stronger) -- sign
            r = by.get(f"{target}_{lig}", {})           # is version-dependent, so don't rank on it.
            a, p = r.get("affinity"), r.get("prob_binder")
            if p is not None:
                prob[lig] = p
            print(f"{target:12s}{lig:14s}{(f'{a:.2f}' if a is not None else 'NA'):>10s}"
                  f"{(f'{p:.2f}' if p is not None else 'NA'):>11s}")
            out_rows.append({"target": target, "ligand": lig, "affinity": a, "prob_binder": p,
                             "is_known_binder": lig == drug})
        if prob:
            best = max(prob, key=prob.get)  # highest P(binder)
            print(f"  -> {target}: best P(binder) = {best} (known = {drug}) "
                  f"{'PASS' if best == drug else 'FAIL'}\n")
    outdir = Path("data/processed/binding"); outdir.mkdir(parents=True, exist_ok=True)
    import csv
    with (outdir / "phase1_affinity.csv").open("w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["target", "ligand", "affinity", "prob_binder", "is_known_binder"])
        w.writeheader(); w.writerows(out_rows)
    print(f"wrote {outdir/'phase1_affinity.csv'}")
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -33,6 +33,18 @@ dev = [
    "pytest>=8.0",
    "ruff>=0.5",
 ]
 # Structure-based binding track (PLAN §12); see docs/structure_binding_notes.md.
 # Non-pip tools (install separately): AutoDock Vina binary (tools/vina, from the AutoDock-Vina
 # GitHub release) and open-babel (brew). Boltz-2 + cuequivariance run only in the Modal GPU image
 # (gpu/modal_app.py), not locally.
 structure = [
    "rdkit>=2024.3",        # ligand prep, fingerprints
    "requests>=2.31",       # PubChem / RCSB fetch
    "gemmi>=0.6",           # structure parsing + superposition
    "spyrmsd>=0.9",         # symmetry-corrected pose RMSD
    "meeko>=0.7",           # production docking ligand prep
    "modal>=1.5",           # ephemeral GPU orchestration
 ]
 [tool.setuptools.packages.find]
 where = ["."]
--- a/scripts/binding_ligand_baseline.py
+++ b/scripts/binding_ligand_baseline.py
@@ -0,0 +1,66 @@
 """Structure-based track, step 0: ligand-based retrieval baseline (PLAN §12.9 engine).
 Docking (§12.3) needs a toolchain that doesn't pip-install on ARM Mac (AutoDock Vina) — that's the
 next dependency to solve. Meanwhile this runs now with pure RDKit: do any of our 300 drugs sit near
 the KNOWN sickle binders (voxelotor, mitapivat, decitabine) in chemical space? This is the
 retrieval engine §12.9 would point a generative beacon at, and a sanity check on the ligand data.
 NOT docking and NOT a binding claim — chemical similarity only. Similarity != activity (§12.9).
 """
 from __future__ import annotations
 import pandas as pd
 import requests
 from rdkit import Chem, DataStructs, RDLogger
 from rdkit.Chem import rdFingerprintGenerator
 RDLogger.DisableLog("rdApp.*")
 MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
 # Known sickle binders = positive-control beacons (target in parens).
 BINDERS = ["voxelotor", "mitapivat", "decitabine", "vorinostat"]
 def pubchem_smiles(name: str) -> str | None:
    for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
        try:
            u = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON"
            d = requests.get(u, timeout=30).json()["PropertyTable"]["Properties"][0]
            if prop in d:
                return d[prop]
        except Exception:
            continue
    return None
 def fp(smi: str):
    if not isinstance(smi, str) or smi in ("-666", ""):
        return None
    m = Chem.MolFromSmiles(smi)
    return MORGAN.GetFingerprint(m) if m else None
 def main() -> None:
    binder_smi = {b: pubchem_smiles(b) for b in BINDERS}
    print("known-binder SMILES:", {k: (v[:34] + "..." if v else "MISSING") for k, v in binder_smi.items()})
    drugs = pd.read_csv("data/processed/drug_set_v1.csv")[["pert_iname", "canonical_smiles", "inclusion_reason"]]
    reason = dict(zip(drugs.pert_iname, drugs.inclusion_reason))
    drug_fp = {r.pert_iname: fp(r.canonical_smiles) for r in drugs.itertuples()}
    drug_fp = {k: v for k, v in drug_fp.items() if v is not None}
    print(f"fingerprinted {len(drug_fp)}/{len(drugs)} drugs\n")
    for b, smi in binder_smi.items():
        bfp = fp(smi)
        if bfp is None:
            print(f"{b}: no SMILES\n"); continue
        sims = sorted(((DataStructs.TanimotoSimilarity(bfp, v), k) for k, v in drug_fp.items()), reverse=True)
        print(f"nearest drugs to {b}:")
        for s, k in sims[:5]:
            print(f"    {s:.3f}  {k:22s} [{reason.get(k,'?')}]")
        print()
 if __name__ == "__main__":
    main()
--- a/scripts/dock_positive_controls.py
+++ b/scripts/dock_positive_controls.py
@@ -0,0 +1,111 @@
 """Structure-based track §12.4: dock known binders + negatives into Hb and PKR.
 Positive-control recovery test: voxelotor should dock best into hemoglobin (5E83), mitapivat best
 into PKR (8XFD), and unrelated drugs (caffeine, hydroxyurea) should score worse. The box is
 centered on each co-crystal ligand (5L7=voxelotor, WV2=mitapivat). AutoDock Vina (mac binary,
 Rosetta) + open-babel prep. Affinity in kcal/mol; more negative = stronger predicted binding.
 This is a docking baseline, not an efficacy claim (PLAN §12.6).
 """
 from __future__ import annotations
 import re
 import subprocess
 import tempfile
 from pathlib import Path
 import numpy as np
 import requests
 VINA = "./tools/vina"
 STRUCT = Path("data/raw/structures")
 WORK = Path("data/processed/docking")
 WORK.mkdir(parents=True, exist_ok=True)
 TARGETS = {"hemoglobin": ("5E83", "5L7"), "PKR": ("8XFD", "WV2")}
 LIGANDS = ["voxelotor", "mitapivat", "decitabine", "hydroxyurea", "caffeine"]
 EXPECTED = {"voxelotor": "hemoglobin", "mitapivat": "PKR"}
 def pubchem_smiles(name: str) -> str | None:
    for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
        try:
            d = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON",
                             timeout=30).json()["PropertyTable"]["Properties"][0]
            if prop in d:
                return d[prop]
        except Exception:
            continue
    return None
 def prep_receptor_and_box(pdb: str, lig_res: str):
    """Receptor pdbqt (protein only) + box center/size from one copy of the co-crystal ligand."""
    atoms, lig, chosen = [], [], None
    for ln in (STRUCT / f"{pdb}.pdb").read_text().splitlines():
        if ln.startswith("ATOM"):
            atoms.append(ln)
        elif ln.startswith("HETATM") and ln[17:20].strip() == lig_res:
            key = (ln[21], ln[22:26])
            chosen = chosen or key
            if key == chosen:
                lig.append(ln)
    rec_pdb = WORK / f"{pdb}_rec.pdb"
    rec_pdb.write_text("\n".join(atoms) + "\nEND\n")
    rec_pdbqt = WORK / f"{pdb}_rec.pdbqt"
    subprocess.run(["obabel", str(rec_pdb), "-O", str(rec_pdbqt), "-xr", "-p", "7.4"],
                   capture_output=True, check=True)
    xyz = np.array([[float(l[30:38]), float(l[38:46]), float(l[46:54])] for l in lig])
    center = xyz.mean(0)
    size = np.clip((xyz.max(0) - xyz.min(0)) + 9.0, 18.0, 28.0)
    return rec_pdbqt, center, size
 def prep_ligand(name: str, smiles: str) -> Path | None:
    out = WORK / f"lig_{name}.pdbqt"
    r = subprocess.run(["obabel", f"-:{smiles}", "-O", str(out), "--gen3d", "-p", "7.4"],
                       capture_output=True, text=True)
    return out if out.exists() and out.stat().st_size > 0 else None
 def dock(rec_pdbqt: Path, lig_pdbqt: Path, center, size) -> float | None:
    out = subprocess.run([VINA, "--receptor", str(rec_pdbqt), "--ligand", str(lig_pdbqt),
                          "--center_x", f"{center[0]:.2f}", "--center_y", f"{center[1]:.2f}",
                          "--center_z", f"{center[2]:.2f}", "--size_x", f"{size[0]:.1f}",
                          "--size_y", f"{size[1]:.1f}", "--size_z", f"{size[2]:.1f}",
                          "--exhaustiveness", "8", "--cpu", "4",
                          "--out", str(WORK / "o.pdbqt")], capture_output=True, text=True)
    m = re.search(r"^\s+1\s+(-?\d+\.\d+)", out.stdout, re.M)
    return float(m.group(1)) if m else None
 def main() -> None:
    smis = {n: pubchem_smiles(n) for n in LIGANDS}
    ligs = {n: prep_ligand(n, s) for n, s in smis.items() if s}
    ligs = {n: p for n, p in ligs.items() if p}
    print(f"prepared ligands: {list(ligs)}")
    boxes = {t: prep_receptor_and_box(pdb, lr) for t, (pdb, lr) in TARGETS.items()}
    print("prepared receptors:", list(boxes))
    print(f"\n{'ligand':14s}" + "".join(f"{t:>13s}" for t in TARGETS))
    results = {}
    for lname, lpath in ligs.items():
        row = {}
        for tname, (rec, center, size) in boxes.items():
            row[tname] = dock(rec, lpath, center, size)
        results[lname] = row
        cells = "".join(f"{(f'{row[t]:.1f}' if row[t] is not None else 'NA'):>13s}" for t in TARGETS)
        print(f"  {lname:12s}{cells}")
    print("\n=== positive-control recovery test (§12.4) ===")
    for lig, exp_target in EXPECTED.items():
        if lig in results:
            best = min(results[lig], key=lambda t: results[lig][t] if results[lig][t] is not None else 99)
            ok = best == exp_target
            print(f"  {lig:12s} best target = {best:11s} (expected {exp_target}) -> {'PASS' if ok else 'FAIL'}")
 if __name__ == "__main__":
    main()
--- a/scripts/dock_production.py
+++ b/scripts/dock_production.py
@@ -0,0 +1,83 @@
 """Push docking to credibility: Meeko ligand prep + in-place spyrmsd, on the clean HDAC2 target.
 Isolates the two cheap suspects behind the 7.9 A redocking failure:
  1. ligand prep — replace bare obabel --gen3d with Meeko (correct rotatable bonds / AD typing)
  2. RMSD metric — replace obrms (superimposes) with spyrmsd in-place, symmetry-corrected
 Receptor reused (obabel-protonated 4LXZ with catalytic Zn kept). If redock RMSD drops <2 A, the
 docking pipeline is validated and the earlier failure was prep+metric, not docking itself.
 """
 from __future__ import annotations
 import subprocess
 from pathlib import Path
 import numpy as np
 from rdkit import Chem, RDLogger
 from rdkit.Chem import AllChem
 from meeko import MoleculePreparation, PDBQTWriterLegacy
 from spyrmsd import io as spyio, rmsd as spyrmsd
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from scripts.dock_positive_controls import STRUCT, VINA, WORK, pubchem_smiles  # noqa: E402
 from scripts.dock_validate import dock_pose, extract_crystal_ligand  # noqa: E402
 RDLogger.DisableLog("rdApp.*")
 PDB, RES, DRUG = "4LXZ", "SHH", "vorinostat"
 def prep_receptor_keep_zn() -> tuple[Path, np.ndarray, np.ndarray]:
    atoms, lig, chosen = [], [], None
    for ln in (STRUCT / f"{PDB}.pdb").read_text().splitlines():
        if ln.startswith("ATOM") or (ln.startswith("HETATM") and ln[17:20].strip() == "ZN"):
            atoms.append(ln)
        elif ln.startswith("HETATM") and ln[17:20].strip() == RES:
            key = (ln[21], ln[22:26]); chosen = chosen or key
            if key == chosen:
                lig.append(ln)
    recpdb = WORK / f"{PDB}_rec.pdb"; recpdb.write_text("\n".join(atoms) + "\nEND\n")
    recpdbqt = WORK / f"{PDB}_rec.pdbqt"
    subprocess.run(["obabel", str(recpdb), "-O", str(recpdbqt), "-xr", "-p", "7.4"], capture_output=True)
    xyz = np.array([[float(l[30:38]), float(l[38:46]), float(l[46:54])] for l in lig])
    return recpdbqt, xyz.mean(0), np.clip((xyz.max(0) - xyz.min(0)) + 8, 16, 26)
 def meeko_ligand(smiles: str) -> Path:
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=0xC0FFEE)
    AllChem.MMFFOptimizeMolecule(mol)
    setups = MoleculePreparation().prepare(mol)
    pdbqt, ok, err = PDBQTWriterLegacy.write_string(setups[0])
    if not ok:
        raise RuntimeError(f"meeko ligand prep failed: {err}")
    out = WORK / f"lig_{DRUG}_meeko.pdbqt"; out.write_text(pdbqt)
    return out
 def inplace_rmsd(crystal_pdb: Path, docked_pdbqt: Path) -> float:
    cref = WORK / "xtal.sdf"; cdoc = WORK / "docked.sdf"
    subprocess.run(["obabel", str(crystal_pdb), "-O", str(cref)], capture_output=True)
    subprocess.run(["obabel", str(docked_pdbqt), "-O", str(cdoc), "-f", "1", "-l", "1"], capture_output=True)
    ref, mol = spyio.loadmol(str(cref)), spyio.loadmol(str(cdoc))
    ref.strip(); mol.strip()
    r = spyrmsd.rmsdwrapper(ref, mol, symmetry=True, minimize=False)
    return float(r[0] if isinstance(r, (list, tuple, np.ndarray)) else r)
 def main() -> None:
    rec, center, size = prep_receptor_keep_zn()
    smi = pubchem_smiles(DRUG)
    lig = meeko_ligand(smi)
    print(f"meeko ligand pdbqt: {lig.name}; box center {center.round(1)} size {size.round(1)}")
    pose = dock_pose(rec, lig, center, size)  # writes WORK/redock_out.pdbqt
    raw = WORK / "redock_out.pdbqt"
    xtal = extract_crystal_ligand(PDB, RES)
    rmsd = inplace_rmsd(xtal, raw)
    verdict = "PASS (<2A)" if rmsd < 2 else ("MARGINAL (<3A)" if rmsd < 3 else "FAIL")
    print(f"\nvorinostat -> HDAC2 redock | in-place spyrmsd RMSD = {rmsd:.2f} A  {verdict}")
    print("(prior obabel-ligand + obrms gave 7.9 A)")
 if __name__ == "__main__":
    main()
--- a/scripts/dock_validate.py
+++ b/scripts/dock_validate.py
@@ -0,0 +1,93 @@
 """Structure-binding §12.4: redocking-RMSD validation + ligand-efficiency normalization.
 Two de-biased checks of the docking baseline:
  A. REDOCKING RMSD — redock each co-crystal ligand into its own structure and measure pose RMSD
     vs the crystal pose (open-babel obrms, symmetry-aware). <2 A = geometry validated. This is
     the gold-standard positive control and is immune to the molecular-size bias that made the
     cross-target affinity test inconclusive.
  B. LIGAND EFFICIENCY — affinity / heavy-atom count, to de-bias the size effect in the raw scores.
 """
 from __future__ import annotations
 import re
 import subprocess
 from pathlib import Path
 from rdkit import Chem, RDLogger
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from scripts.dock_positive_controls import (  # noqa: E402
    STRUCT, VINA, WORK, TARGETS, pubchem_smiles, prep_ligand, prep_receptor_and_box,
 )
 RDLogger.DisableLog("rdApp.*")
 XTAL_LIG = {"hemoglobin": ("5E83", "5L7", "voxelotor"), "PKR": ("8XFD", "WV2", "mitapivat")}
 def extract_crystal_ligand(pdb: str, resname: str) -> Path:
    lines, chosen = [], None
    for ln in (STRUCT / f"{pdb}.pdb").read_text().splitlines():
        if ln.startswith("HETATM") and ln[17:20].strip() == resname:
            key = (ln[21], ln[22:26])
            chosen = chosen or key
            if key == chosen:
                lines.append(ln)
    out = WORK / f"xtal_{resname}.pdb"
    out.write_text("\n".join(lines) + "\nEND\n")
    return out
 def dock_pose(rec, lig, center, size) -> Path | None:
    pose = WORK / "redock_out.pdbqt"
    subprocess.run([VINA, "--receptor", str(rec), "--ligand", str(lig),
                    "--center_x", f"{center[0]:.2f}", "--center_y", f"{center[1]:.2f}",
                    "--center_z", f"{center[2]:.2f}", "--size_x", f"{size[0]:.1f}",
                    "--size_y", f"{size[1]:.1f}", "--size_z", f"{size[2]:.1f}",
                    "--exhaustiveness", "16", "--out", str(pose)], capture_output=True, text=True)
    if not pose.exists():
        return None
    top = WORK / "redock_top.pdb"  # first model only
    subprocess.run(["obabel", str(pose), "-O", str(top), "-f", "1", "-l", "1"], capture_output=True)
    return top
 def obrms(ref: Path, test: Path) -> float | None:
    # strip H to compare heavy-atom poses; obrms is automorphism-aware
    r2, t2 = WORK / "ref_noH.sdf", WORK / "test_noH.sdf"
    subprocess.run(["obabel", str(ref), "-d", "-O", str(r2)], capture_output=True)
    subprocess.run(["obabel", str(test), "-d", "-O", str(t2)], capture_output=True)
    out = subprocess.run(["obrms", str(r2), str(t2)], capture_output=True, text=True).stdout
    m = re.search(r"(-?\d+\.\d+)", out)
    return float(m.group(1)) if m else None
 def main() -> None:
    print("=== A. Redocking-RMSD validation ===")
    for target, (pdb, resname, drug) in XTAL_LIG.items():
        rec, center, size = prep_receptor_and_box(pdb, resname)
        smi = pubchem_smiles(drug)
        lig = prep_ligand(drug, smi)
        xtal = extract_crystal_ligand(pdb, resname)
        pose = dock_pose(rec, lig, center, size)
        rmsd = obrms(xtal, pose) if pose else None
        verdict = "PASS (<2A)" if (rmsd is not None and rmsd < 2.0) else \
                  ("MARGINAL (<3A)" if (rmsd is not None and rmsd < 3.0) else "FAIL")
        rstr = f"{rmsd:.2f} A" if rmsd is not None else "NA"
        print(f"  {drug:12s} -> {target:11s} ({pdb}/{resname}): redock RMSD {rstr}  {verdict}")
    print("\n=== B. Ligand efficiency (affinity / heavy atoms), de-biasing size ===")
    # raw affinities from the cross-dock baseline (scripts/dock_positive_controls.py)
    aff = {"voxelotor": {"hemoglobin": -8.1, "PKR": -9.3}, "mitapivat": {"hemoglobin": -10.0, "PKR": -11.2},
           "decitabine": {"hemoglobin": -6.6, "PKR": -7.0}, "hydroxyurea": {"hemoglobin": -3.9, "PKR": -3.6},
           "caffeine": {"hemoglobin": -6.1, "PKR": -6.4}}
    print(f"  {'ligand':12s}{'heavy':>6s}" + "".join(f"{t+' LE':>14s}" for t in TARGETS))
    for lig, row in aff.items():
        ha = Chem.MolFromSmiles(pubchem_smiles(lig)).GetNumHeavyAtoms()
        cells = "".join(f"{row[t]/ha:>14.3f}" for t in TARGETS)
        print(f"  {lig:12s}{ha:>6d}{cells}")
 if __name__ == "__main__":
    main()
--- a/scripts/exp_deconv_signature.py
+++ b/scripts/exp_deconv_signature.py
@@ -0,0 +1,137 @@
 """Experiment: composition-adjusted sickle signature, to fix specificity (option 1).
 The v1 signature is confounded by cell composition (SS patients have very different WBC/RBC
 than AA controls). GSE35007 *measured* those counts per sample, so we adjust the differential
 expression for them directly (a measured-covariate alternative to estimated deconvolution):
    expression ~ disease + WBC + RBC + MCV + age + sex     (per gene, vectorized OLS)
 We compare the composition-ADJUSTED signature against an UNADJUSTED single-study signature
 (same samples, model without the covariates), score both with the v1.1 engine (full gene space
 + spec_z), and report the recovery test for each. Writes nothing to committed artifacts.
 """
 from __future__ import annotations
 import warnings
 from pathlib import Path
 import GEOparse
 import numpy as np
 import pandas as pd
 from scipy.stats import false_discovery_control
 from scipy.stats import t as tdist
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from src.disease import collapse_probes_to_symbols  # noqa: E402
 from src.scoring import tau_calibrate  # noqa: E402
 warnings.filterwarnings("ignore")
 PROCESSED = Path("data/processed")
 NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
 SYMBOL_COLS = ["Symbol", "ILMN_Gene", "Gene Symbol", "GeneSymbol"]
 def cval(gsm, key):
    for c in gsm.metadata.get("characteristics_ch1", []):
        if c.lower().startswith(key.lower()):
            return c.split(":", 1)[1].strip()
    return None
 def load_data():
    gse = GEOparse.get_GEO(geo="GSE35007", destdir="data/raw/geo", silent=True)
    meta = []
    for gid, g in gse.gsms.items():
        meta.append({"gsm": gid, "hb": cval(g, "hb phenotype"), "wbc": cval(g, "white blood cells"),
                     "rbc": cval(g, "red blood cells"), "mcv": cval(g, "mean corpuscular volume"),
                     "age": cval(g, "age"), "sex": cval(g, "Sex")})
    meta = pd.DataFrame(meta).set_index("gsm")
    for c in ["wbc", "rbc", "mcv", "age"]:
        meta[c] = pd.to_numeric(meta[c], errors="coerce")
    meta["disease"] = meta["hb"].map({"SS": 1.0, "AA": 0.0})
    meta["sex_m"] = (meta["sex"] == "M").astype(float)
    keep = meta[meta["hb"].isin(["SS", "AA"])].dropna(subset=["wbc", "rbc", "mcv", "age", "disease"])
    expr = pd.DataFrame({gid: gse.gsms[gid].table.set_index("ID_REF")["VALUE"] for gid in keep.index})
    expr = expr.apply(pd.to_numeric, errors="coerce").dropna(how="any")
    if float(np.nanmax(expr.to_numpy())) > 50:
        expr = np.log2(expr.clip(lower=0) + 1.0)
    gpl = list(gse.gpls.values())[0]
    col = next((c for c in SYMBOL_COLS if c in gpl.table.columns), None)
    sym = gpl.table.set_index("ID")[col].astype(str).str.strip().replace({"": np.nan, "nan": np.nan})
    return expr, keep, sym.dropna()
 def ols_de(expr, design, disease_idx):
    """Vectorized per-gene OLS; return DE table (log_fc=disease coef, pvalue, qvalue)."""
    X = design.to_numpy(dtype=float)
    Y = expr.T.to_numpy(dtype=float)  # samples x genes
    n, p = X.shape
    XtX_inv = np.linalg.pinv(X.T @ X)
    B = XtX_inv @ X.T @ Y
    resid = Y - X @ B
    dof = n - p
    sigma2 = (resid ** 2).sum(0) / dof
    se = np.sqrt(sigma2 * XtX_inv[disease_idx, disease_idx])
    t = B[disease_idx] / se
    pval = 2 * tdist.sf(np.abs(t), dof)
    out = pd.DataFrame({"log_fc": B[disease_idx], "pvalue": pval}, index=expr.index).dropna()
    out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
    return out
 def make_signature(de, sym, expr, top_n=250):
    de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
    sig = de_sym[de_sym["qvalue"] < 0.05]
    up = sig[sig["log_fc"] > 0].nsmallest(top_n, "qvalue").index.tolist()
    down = sig[sig["log_fc"] < 0].nsmallest(top_n, "qvalue").index.tolist()
    return up, down
 def evaluate(label, up, down, lincs):
    ranked = tau_calibrate(up, down, lincs, n_null=1000)
    n = len(ranked)
    top10, top25, half = int(n * .10), int(n * .25), n // 2
    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
    ranked = ranked.join(profiles[["inclusion_reason"]])
    hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
    negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
    n_bottom = sum(r > half for r in negs.values())
    n_ov = len((set(up) | set(down)) & set(lincs.columns))
    print(f"\n### {label}: {len(up)} up / {len(down)} down (query overlap {n_ov})")
    print(f"  hydroxyurea rank {hu}/{n} (top {100*hu/n:.1f}%)  top-10%? {hu <= top10}")
    print(f"  L-glutamine rank {glut}/{n} (top {100*glut/n:.1f}%)  top-25%? {glut <= top25}")
    print(f"  neg controls bottom-half: {n_bottom}/5  {negs}")
    print("  top 8: " + ", ".join(
        f"{name}[{r['inclusion_reason'][:3]}]" for name, r in ranked.nsmallest(8, "spec_z").iterrows()))
    return ranked
 def main():
    expr, meta, sym = load_data()
    print(f"loaded {expr.shape[1]} samples x {expr.shape[0]} probes; "
          f"{int(meta.disease.sum())} SS / {int((meta.disease==0).sum())} AA")
    lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
    base = pd.DataFrame({"intercept": 1.0, "disease": meta["disease"]}, index=meta.index)
    adj = base.assign(wbc=meta["wbc"], rbc=meta["rbc"], mcv=meta["mcv"], age=meta["age"], sex_m=meta["sex_m"])
    de_unadj = ols_de(expr, base, disease_idx=1)
    de_adj = ols_de(expr, adj, disease_idx=1)
    up_u, dn_u = make_signature(de_unadj, sym, expr)
    up_a, dn_a = make_signature(de_adj, sym, expr)
    # how much does adjustment change the gene set?
    overlap = len((set(up_u) | set(dn_u)) & (set(up_a) | set(dn_a)))
    print(f"\nsignature gene overlap unadjusted vs adjusted: {overlap}/{len(set(up_u)|set(dn_u))}")
    evaluate("UNADJUSTED (GSE35007 only)", up_u, dn_u, lincs)
    evaluate("COMPOSITION-ADJUSTED", up_a, dn_a, lincs)
 if __name__ == "__main__":
    main()
--- a/scripts/exp_genespace.py
+++ b/scripts/exp_genespace.py
@@ -0,0 +1,102 @@
 """Experiment (v1.1): re-score on a larger LINCS gene space and re-run the recovery test.
 v1 used only the 978 landmark genes (12% signature overlap). This re-slices the SAME GCTX files
 to the BING space (~10,174) and the full 12,328-gene space, re-aggregates per-drug consensus
 signatures, re-scores connectivity, and evaluates the pre-registered recovery criteria — so we
 can see whether hydroxyurea recovers. Writes nothing to the committed v1 artifacts.
 """
 from __future__ import annotations
 import gzip
 import io
 import json
 from pathlib import Path
 import pandas as pd
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from src.scoring import rank_drugs  # noqa: E402
 LINCS = Path("data/raw/lincs")
 PROCESSED = Path("data/processed")
 GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
 SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
 NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
 def read_gz(name):
    return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
 def gene_ids_for_space(space: str):
    g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
    if space == "bing":
        g = g[g.pr_is_bing == 1]
    # 'all' -> keep everything
    ids = [str(x) for x in g.pr_gene_id]
    id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
    return ids, id_to_symbol
 def extract(space, drug_names, gene_ids, id_to_symbol):
    from cmapPy.pandasGEXpress.parse import parse
    frames = []
    for ph in (1, 2):
        sig = read_gz(SIG_INFO[ph])
        sig = sig[(sig.pert_type == "trt_cp") & (sig.pert_iname.isin(drug_names))]
        if sig.empty:
            continue
        gct = parse(str(GCTX[ph]), rid=gene_ids, cid=sig.sig_id.tolist())
        data = gct.data_df
        s2d = dict(zip(sig.sig_id, sig.pert_iname))
        frames.append(data.T.groupby(data.columns.map(s2d)).mean())
        print(f"    [{space}] phase {ph}: {sig.pert_iname.nunique()} drugs sliced", flush=True)
    combined = pd.concat(frames).groupby(level=0).mean()
    combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]
    combined = combined.loc[:, ~combined.columns.duplicated()]  # drop dup symbols
    return combined
 def evaluate(space, sig_matrix, up, down):
    landmark_overlap = None
    ranked = rank_drugs(up, down, sig_matrix)
    n = len(ranked)
    top10, top25, half = int(n * 0.10), int(n * 0.25), n // 2
    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
    ranked = ranked.join(profiles[["inclusion_reason"]])
    hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
    glut_s = ranked.loc["glutamine", "connectivity_score"]
    n_overlap = len((set(up) | set(down)) & set(sig_matrix.columns))
    negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
    n_bottom = sum(r > half for r in negs.values())
    print(f"\n=== gene space: {space.upper()} ({sig_matrix.shape[1]} genes; query overlap {n_overlap}) ===")
    print(f"  hydroxyurea: rank {hu}/{n} (top {100*hu/n:.1f}%)  top-10%? {hu <= top10}")
    print(f"  L-glutamine: rank {glut}/{n} (top {100*glut/n:.1f}%), WTCS={glut_s:.3f}  top-25%? {glut <= top25}")
    print(f"  neg controls in bottom half: {n_bottom}/5  {negs}")
    crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
    print(f"  OVERALL: {'PASS' if crit else 'FAIL'}")
    print("  top 8:")
    for name, r in ranked.nsmallest(8, "connectivity_score").iterrows():
        print(f"    {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} [{r['inclusion_reason']}]")
    return ranked
 def main():
    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
    up = [g["gene"] for g in sig["up_regulated"]]
    down = [g["gene"] for g in sig["down_regulated"]]
    drug_names = set(pd.read_csv(PROCESSED / "drug_set_v1.csv").pert_iname)
    for space in ("bing", "all"):
        print(f"\n>>> extracting {space} ...", flush=True)
        ids, id2sym = gene_ids_for_space(space)
        mat = extract(space, drug_names, ids, id2sym)
        evaluate(space, mat, up, down)
 if __name__ == "__main__":
    main()
--- a/scripts/phaseA_reference_tau.py
+++ b/scripts/phaseA_reference_tau.py
@@ -0,0 +1,118 @@
 """Phase A: calibrate sickle connectivity against a REAL disease-signature reference population.
 The v1.1 tau saturated because it used random gene-set nulls. Proper specificity calibration
 asks: does a drug reverse SICKLE more than it reverses diseases in general? We download a library
 of real disease signatures (Enrichr "Disease Signatures from GEO", up+down), compute each drug's
 connectivity to every reference disease, and express its sickle connectivity as a z-score within
 that per-drug reference distribution. Broad-effect drugs (reverse everything) -> z~0 -> down-ranked.
 Re-runs the recovery test and compares to v1.1 (random-null). Writes nothing committed.
 """
 from __future__ import annotations
 from pathlib import Path
 import numpy as np
 import pandas as pd
 import requests
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from src.scoring import _ks_connectivity  # noqa: E402
 PROCESSED = Path("data/processed")
 RAW = Path("data/raw/disease_sigs")
 ENRICHR = "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName={}"
 LIBS = {"up": "Disease_Signatures_from_GEO_up_2014", "down": "Disease_Signatures_from_GEO_down_2014"}
 NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
 def fetch_gmt(name: str) -> dict[str, list[str]]:
    RAW.mkdir(parents=True, exist_ok=True)
    path = RAW / f"{name}.gmt"
    if not path.exists():
        r = requests.get(ENRICHR.format(name), timeout=120)
        r.raise_for_status()
        path.write_text(r.text)
    out = {}
    for line in path.read_text().splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        term = parts[0]
        genes = [g.split(",")[0].strip().upper() for g in parts[2:] if g.strip()]
        out[term] = genes
    print(f"  {name}: {path.stat().st_size/1e6:.1f} MB, {len(out)} terms")
    return out
 def build_reference() -> list[dict]:
    up = fetch_gmt(LIBS["up"])
    down = fetch_gmt(LIBS["down"])
    shared = set(up) & set(down)
    refs = []
    for term in shared:
        if "sickle" in term.lower():
            continue  # exclude the target disease from its own reference population
        refs.append({"name": term, "up": up[term], "down": down[term]})
    print(f"reference disease signatures (paired up+down, sickle excluded): {len(refs)}")
    return refs
 def cols_for(genes, gene_to_col):
    return np.array([gene_to_col[g] for g in set(genes) if g in gene_to_col], dtype=int)
 def main():
    import json
    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
    sk_up = [g["gene"] for g in sig["up_regulated"]]
    sk_down = [g["gene"] for g in sig["down_regulated"]]
    lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
    genes = list(lincs.columns)
    gene_to_col = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    R = lincs.rank(axis=1, ascending=False).to_numpy()
    refs = build_reference()
    # connectivity of every drug to every reference disease -> (n_drugs, n_refs)
    C = np.empty((R.shape[0], len(refs)))
    for j, d in enumerate(refs):
        C[:, j] = _ks_connectivity(R, cols_for(d["up"], gene_to_col), cols_for(d["down"], gene_to_col), n)
    # drop reference diseases with too few mapped genes (degenerate columns)
    mapped = np.array([len(cols_for(d["up"], gene_to_col)) + len(cols_for(d["down"], gene_to_col)) for d in refs])
    keep = mapped >= 10
    C = C[:, keep]
    print(f"usable reference diseases (>=10 mapped genes): {keep.sum()}")
    real = _ks_connectivity(R, cols_for(sk_up, gene_to_col), cols_for(sk_down, gene_to_col), n)
    ref_mean, ref_std = C.mean(axis=1), C.std(axis=1)
    ref_std[ref_std == 0] = np.nan
    spec_z = (real - ref_mean) / ref_std  # negative = reverses sickle more than diseases-in-general
    ranked = pd.DataFrame({"spec_z": spec_z, "connectivity": real}, index=lincs.index).sort_values("spec_z")
    ranked.insert(0, "rank", range(1, len(ranked) + 1))
    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
    ranked = ranked.join(profiles[["inclusion_reason"]])
    N = len(ranked)
    top10, top25, half = int(N * .10), int(N * .25), N // 2
    hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
    negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
    n_bottom = sum(r > half for r in negs.values())
    print("\n==== RECOVERY TEST (reference-calibrated tau) ====")
    print(f"  hydroxyurea rank {hu}/{N} (top {100*hu/N:.1f}%)  top-10%? {hu <= top10}  z={ranked.loc['hydroxyurea','spec_z']:.2f}")
    print(f"  L-glutamine rank {glut}/{N} (top {100*glut/N:.1f}%)  top-25%? {glut <= top25}")
    print(f"  neg controls bottom-half: {n_bottom}/5  {negs}")
    crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
    print(f"  OVERALL: {'PASS' if crit else 'FAIL'}")
    print("\n  top 12 by reference-calibrated z:")
    for name, r in ranked.nsmallest(12, "spec_z").iterrows():
        print(f"    {int(r['rank']):2d} {name:20s} z={r['spec_z']:6.2f} [{r['inclusion_reason']}]")
 if __name__ == "__main__":
    main()
--- a/scripts/phaseD_supervised.py
+++ b/scripts/phaseD_supervised.py
@@ -0,0 +1,137 @@
 """Phase D: supervised cross-disease repurposing — can labels break the specificity ceiling?
 Connectivity alone can't tell therapeutic from coincidental reversal. A supervised model trained
 on known drug-disease pairs CAN learn that pattern — given features that expose drug "broadness"
 (a drug that reverses everything is non-specific). We train on 839 GEO disease signatures with
 Repurposing-Hub indications as labels, evaluate with disease-grouped CV, then apply to HELD-OUT
 sickle and check whether the coincidental reversers (norethindrone, ciprofloxacin) finally drop.
 Baseline = rank by raw connectivity (the single conn feature). Win = model down-ranks the
 negative controls vs baseline while keeping hydroxyurea high.
 """
 from __future__ import annotations
 import json
 import re
 from pathlib import Path
 import numpy as np
 import pandas as pd
 from sklearn.ensemble import GradientBoostingClassifier
 from sklearn.model_selection import GroupKFold, cross_val_predict
 from sklearn.metrics import roc_auc_score
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from src.scoring import _ks_connectivity  # noqa: E402
 PROCESSED = Path("data/processed")
 SIGS = Path("data/raw/disease_sigs")
 NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
 ID_RE = re.compile(r"\s*(c\d{6,}|doid[-\s]?\d+|gse\d+|umls\S+)\s*", re.I)
 def clean_disease(term: str) -> str:
    return ID_RE.sub(" ", term).strip().lower()
 def load_disease_sigs():
    def parse(name):
        out = {}
        for line in (SIGS / name).read_text().splitlines():
            p = line.split("\t")
            if len(p) < 3:
                continue
            out[p[0]] = [g.split(",")[0].strip().upper() for g in p[2:] if g.strip()]
        return out
    up = parse("Disease_Perturbations_from_GEO_up.gmt")
    down = parse("Disease_Perturbations_from_GEO_down.gmt")
    terms = sorted(set(up) & set(down))
    return [{"term": t, "disease": clean_disease(t), "up": up[t], "down": down[t]} for t in terms]
 def cols_for(genes, g2c):
    return np.array([g2c[g] for g in set(genes) if g in g2c], dtype=int)
 def featurize(conn_col, C):
    """Per-(drug,disease) features for one disease column conn_col, given full matrix C."""
    drug_mean, drug_std = C.mean(1), C.std(1)
    drug_min = C.min(1)
    broad = (C < -0.10).mean(1)  # fraction of diseases this drug strongly reverses
    dmean, dstd = conn_col.mean(), conn_col.std() or 1.0
    return np.column_stack([
        conn_col, drug_mean, drug_std, drug_min, broad,
        np.full_like(conn_col, dmean),
        (conn_col - drug_mean) / np.where(drug_std == 0, 1, drug_std),  # specificity within drug
        (conn_col - dmean) / dstd,                                       # specificity within disease
    ])
 FEATS = ["conn", "drug_mean", "drug_std", "drug_min", "broadness",
         "disease_mean", "z_within_drug", "z_within_disease"]
 def main():
    lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
    drugs = list(lincs.index)
    g2c = {g: i for i, g in enumerate(lincs.columns)}
    n = len(lincs.columns)
    R = lincs.rank(axis=1, ascending=False).to_numpy()
    refs = [d for d in load_disease_sigs()
            if len(cols_for(d["up"], g2c)) + len(cols_for(d["down"], g2c)) >= 10]
    C = np.column_stack([_ks_connectivity(R, cols_for(d["up"], g2c), cols_for(d["down"], g2c), n) for d in refs])
    print(f"{len(drugs)} drugs x {len(refs)} disease signatures; connectivity matrix {C.shape}")
    # labels from Repurposing Hub indications
    hub = pd.read_csv("data/raw/repurposing_drugs.txt", sep="\t", comment="!", low_memory=False)
    ind = {r.pert_iname: [i.strip().lower() for i in re.split(r"[|,]", r.indication) if len(i.strip()) > 3]
           for r in hub.itertuples() if isinstance(r.indication, str)}
    drug_idx = {d: i for i, d in enumerate(drugs)}
    X_rows, y, grp = [], [], []
    for j, d in enumerate(refs):
        feats = featurize(C[:, j], C)
        dz = d["disease"]
        for i, drug in enumerate(drugs):
            inds = ind.get(drug, [])
            label = int(any(dz == k or (len(dz) > 4 and dz in k) or (len(k) > 4 and k in dz) for k in inds))
            X_rows.append(feats[i]); y.append(label); grp.append(j)
    X = np.array(X_rows); y = np.array(y); grp = np.array(grp)
    print(f"pairs: {len(y)}, positives: {y.sum()} ({100*y.mean():.2f}%)")
    # disease-grouped CV (generalize to unseen diseases)
    clf = GradientBoostingClassifier(random_state=0)
    proba = cross_val_predict(clf, X, y, cv=GroupKFold(5), groups=grp, method="predict_proba")[:, 1]
    print(f"disease-grouped CV AUC: {roc_auc_score(y, proba):.3f}  (conn-only AUC: {roc_auc_score(y, X[:,0]*-1):.3f})")
    clf.fit(X, y)
    print("feature importances: " + ", ".join(f"{f}={imp:.2f}" for f, imp in
          sorted(zip(FEATS, clf.feature_importances_), key=lambda t: -t[1])))
    # apply to HELD-OUT sickle
    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
    sk = _ks_connectivity(R, cols_for([g["gene"] for g in sig["up_regulated"]], g2c),
                          cols_for([g["gene"] for g in sig["down_regulated"]], g2c), n)
    Xs = featurize(sk, C)
    p_sickle = clf.predict_proba(Xs)[:, 1]
    res = pd.DataFrame({"model_p": p_sickle, "conn": sk}, index=drugs)
    res["model_rank"] = res["model_p"].rank(ascending=False).astype(int)
    res["conn_rank"] = res["conn"].rank(ascending=True).astype(int)  # most negative = best
    N = len(res)
    print(f"\n{'drug':14s} {'model_rank':>10s} {'conn_rank(base)':>16s}")
    for d in ["hydroxyurea", "glutamine"] + NEG5 + ["norethindrone", "ciprofloxacin"]:
        if d in res.index:
            r = res.loc[d]
            print(f"  {d:14s} {int(r['model_rank']):6d}/{N}      {int(r['conn_rank']):6d}/{N}")
    nb_model = sum(res.loc[d, "model_rank"] > N/2 for d in NEG5 if d in res.index)
    nb_conn = sum(res.loc[d, "conn_rank"] > N/2 for d in NEG5 if d in res.index)
    print(f"\nneg controls bottom-half: model {nb_model}/5  vs  baseline {nb_conn}/5")
    print("model top 10:", ", ".join(res.sort_values('model_p', ascending=False).head(10).index))
 if __name__ == "__main__":
    main()
--- a/scripts/pose_rmsd.py
+++ b/scripts/pose_rmsd.py
@@ -0,0 +1,113 @@
 """Pose-RMSD validation of the Boltz-2 HDAC2/vorinostat prediction (closes the §12.4 loop).
 Co-folding predicts the whole complex de novo, so it lands in its own frame. To compare poses:
  1. superpose the predicted protein (chain A) onto the crystal 4LXZ binding chain (Ca, gemmi),
  2. apply that transform to the predicted vorinostat (chain L) and the predicted Zn (chain M),
  3. symmetry-corrected in-place RMSD of the transformed ligand vs the crystal SHH (spyrmsd),
  4. bonus: did the catalytic Zn land near the real one?
 <2 A ligand RMSD = co-folding reproduces the geometry, not just the affinity ranking -- the test
 classical Vina failed on this exact Zn-chelation case (7.9 A).
 """
 from __future__ import annotations
 import subprocess
 from pathlib import Path
 import gemmi
 import numpy as np
 from spyrmsd import io as spyio, rmsd as spyrmsd
 PRED = Path("data/processed/binding/HDAC2_vorinostat_pred.pdb")
 XTAL = Path("data/raw/structures/4LXZ.pdb")
 LIG_RES = "SHH"  # crystal vorinostat
 WORK = Path("data/processed/binding")
 def crystal_binding_chain(st) -> gemmi.Chain:
    model = st[0]
    lig = [a.pos for ch in model for r in ch if r.name == LIG_RES for a in r]
    c = gemmi.Position(sum(p.x for p in lig) / len(lig), sum(p.y for p in lig) / len(lig),
                       sum(p.z for p in lig) / len(lig))
    best, bestd = None, 1e9
    for ch in model:
        if len(ch.get_polymer()) < 20:
            continue
        d = min((a.pos.dist(c) for r in ch for a in r), default=1e9)
        if d < bestd:
            best, bestd = ch, d
    return best
 def hetatm_lines(pdb: Path, chain: str | None = None, resname: str | None = None) -> list[str]:
    out = []
    for ln in pdb.read_text().splitlines():
        if not ln.startswith(("ATOM", "HETATM")):
            continue
        if chain and ln[21] != chain:
            continue
        if resname and ln[17:20].strip() != resname:
            continue
        out.append(ln)
    return out
 def transform_and_write(lines: list[str], T: gemmi.Transform, dest: Path) -> Path:
    new = []
    for ln in lines:
        p = T.apply(gemmi.Position(float(ln[30:38]), float(ln[38:46]), float(ln[46:54])))
        new.append(f"{ln[:30]}{p.x:8.3f}{p.y:8.3f}{p.z:8.3f}{ln[54:]}")
    dest.write_text("\n".join(new) + "\nEND\n")
    return dest
 def to_sdf(pdb: Path) -> Path:
    sdf = pdb.with_suffix(".sdf")
    subprocess.run(["obabel", str(pdb), "-O", str(sdf)], capture_output=True)
    return sdf
 def main() -> None:
    pred_st = gemmi.read_structure(str(PRED))
    xtal_st = gemmi.read_structure(str(XTAL))
    pred_chain = pred_st[0]["A"]
    xtal_chain = crystal_binding_chain(xtal_st)
    sup = gemmi.calculate_superposition(xtal_chain.get_polymer(), pred_chain.get_polymer(),
                                        gemmi.PolymerType.PeptideL, gemmi.SupSelect.CaP)
    T = sup.transform
    print(f"protein superposition: Ca RMSD {sup.rmsd:.2f} A over {sup.count} residues")
    # predicted ligand (chain L) and Zn (chain M) -> transform into crystal frame
    pred_lig = transform_and_write(hetatm_lines(PRED, chain="L"), T, WORK / "pred_lig_aligned.pdb")
    pred_zn_lines = hetatm_lines(PRED, chain="M")
    # one copy only (4LXZ has 3 SHH copies); keep the first (chain, resseq) group.
    shh = hetatm_lines(XTAL, resname=LIG_RES)
    first_key = (shh[0][21], shh[0][22:26])
    one_copy = [ln for ln in shh if (ln[21], ln[22:26]) == first_key]
    crys_lig = WORK / "xtal_lig.pdb"
    crys_lig.write_text("\n".join(one_copy) + "\nEND\n")
    # ligand pose RMSD (in-place, symmetry-corrected)
    ref, mol = spyio.loadmol(str(to_sdf(crys_lig))), spyio.loadmol(str(to_sdf(pred_lig)))
    ref.strip(); mol.strip()
    rmsd = float(spyrmsd.rmsdwrapper(ref, mol, symmetry=True, minimize=False)[0])
    # zinc placement: transformed predicted Zn vs nearest crystal ZN
    verdict = "PASS (<2A)" if rmsd < 2 else ("MARGINAL (<3A)" if rmsd < 3 else "FAIL")
    print(f"\nvorinostat pose RMSD (co-folded vs crystal) = {rmsd:.2f} A  {verdict}")
    print("(classical Vina on this Zn-chelation case: 7.9 A)")
    pred_zn = next((a.pos for ch in pred_st[0] if ch.name == "M" for r in ch for a in r), None)
    xtal_zn = [a.pos for ch in xtal_st[0] for r in ch if r.name == "ZN" for a in r]
    if pred_zn and xtal_zn:
        q = T.apply(pred_zn)
        dz = min(((q.x - z.x) ** 2 + (q.y - z.y) ** 2 + (q.z - z.z) ** 2) ** 0.5 for z in xtal_zn)
        print(f"catalytic Zn placement error = {dz:.2f} A "
              f"{'(zinc correctly placed)' if dz < 2 else ''}")
 if __name__ == "__main__":
    main()
--- a/scripts/week1_explore.py
+++ b/scripts/week1_explore.py
@@ -0,0 +1,139 @@
 """Week 1 exploratory run: build per-study DE + concordance for the whole-blood pair.
 Read-only decision aid (NOT the final pipeline): downloads GSE35007 + GSE16728, runs DE on
 each, collapses to HGNC symbols, computes concordance, and reports whether the sanity-gate
 genes survive — so we can choose Tier A (concordance pair) vs Tier B (GSE35007 alone) with
 real numbers. Intermediate DE tables are cached under data/raw/geo/.
 """
 from __future__ import annotations
 import sys
 from pathlib import Path
 import GEOparse
 import numpy as np
 import pandas as pd
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from src.disease import (  # noqa: E402
    DISEASE_LABEL,
    HEALTHY_LABEL,
    collapse_probes_to_symbols,
    compute_differential_expression,
    concordance_filter,
 )
 GEO_DIR = Path("data/raw/geo")
 GEO_DIR.mkdir(parents=True, exist_ok=True)
 SANITY_GENES = ["HBG1", "HBG2", "ALAS2", "CA1", "AHSP", "SLC4A1", "HBB", "ERAF"]
 SYMBOL_COL_CANDIDATES = [
    "Symbol", "ILMN_Gene", "Gene Symbol", "GENE_SYMBOL", "Gene symbol",
    "gene_assignment", "GeneSymbol",
 ]
 def log(msg: str) -> None:
    print(msg, flush=True)
 def get_symbol_map(gpl) -> pd.Series:
    tbl = gpl.table
    col = next((c for c in SYMBOL_COL_CANDIDATES if c in tbl.columns), None)
    if col is None:
        raise RuntimeError(f"No symbol column in GPL {gpl.name}; columns={list(tbl.columns)[:15]}")
    s = tbl.set_index("ID")[col].astype(str).str.strip()
    # gene_assignment fields are '// SYMBOL // ...'; take the first real token.
    if col == "gene_assignment":
        s = s.str.split("//").str[1].str.strip()
    s = s.replace({"": np.nan, "nan": np.nan, "---": np.nan})
    return s.dropna()
 def build_expression(gse, sample_ids: list[str]) -> pd.DataFrame:
    cols = {}
    for gsm_id in sample_ids:
        tbl = gse.gsms[gsm_id].table
        cols[gsm_id] = tbl.set_index("ID_REF")["VALUE"]
    return pd.DataFrame(cols)
 def char_value(gsm, key: str) -> str | None:
    for c in gsm.metadata.get("characteristics_ch1", []):
        if c.lower().startswith(key.lower()):
            return c.split(":", 1)[1].strip()
    return None
 def run_study(gse, groups: dict[str, str], label: str) -> pd.DataFrame:
    """groups: gsm_id -> 'disease'/'healthy'. Returns symbol-level DE table."""
    sample_ids = list(groups.keys())
    log(f"[{label}] {len(sample_ids)} samples "
        f"({sum(v==DISEASE_LABEL for v in groups.values())} disease / "
        f"{sum(v==HEALTHY_LABEL for v in groups.values())} healthy)")
    expr = build_expression(gse, sample_ids).apply(pd.to_numeric, errors="coerce").dropna(how="all")
    sample_groups = pd.Series(groups)
    de = compute_differential_expression(expr, sample_groups, method="welch")
    gpl = list(gse.gpls.values())[0]
    sym = get_symbol_map(gpl)
    de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
    log(f"[{label}] probes->symbols: {len(de)} probes -> {len(de_sym)} symbols; "
        f"sig q<0.05: {(de_sym['qvalue']<0.05).sum()}")
    de_sym.to_parquet(GEO_DIR / f"de_{label}.parquet")
    return de_sym
 def main() -> None:
    # --- GSE35007: whole blood, hb phenotype SS (disease) vs AA (healthy) ---
    log("downloading GSE35007 ...")
    gse1 = GEOparse.get_GEO(geo="GSE35007", destdir=str(GEO_DIR), silent=True)
    g1 = {}
    for gsm_id, gsm in gse1.gsms.items():
        hb = char_value(gsm, "hb phenotype")
        if hb == "SS":
            g1[gsm_id] = DISEASE_LABEL
        elif hb == "AA":
            g1[gsm_id] = HEALTHY_LABEL
    de1 = run_study(gse1, g1, "GSE35007")
    # --- GSE16728: whole blood only (PAXgene preps; exclude PBMC), patient vs control ---
    log("downloading GSE16728 ...")
    gse2 = GEOparse.get_GEO(geo="GSE16728", destdir=str(GEO_DIR), silent=True)
    g2 = {}
    for gsm_id, gsm in gse2.gsms.items():
        prep = char_value(gsm, "rna prep") or ""
        subj = char_value(gsm, "subject") or ""
        if "PBMC" in prep:
            continue  # whole-blood only
        if subj.lower().startswith("sickle"):
            g2[gsm_id] = DISEASE_LABEL
        elif subj.lower().startswith("control"):
            g2[gsm_id] = HEALTHY_LABEL
    de2 = run_study(gse2, g2, "GSE16728")
    # --- Concordance ---
    keep, summary = concordance_filter(de1, de2)
    log("\n==== CONCORDANCE (whole-blood pair) ====")
    log(f"genes tested in both: {summary.n_genes_tested}")
    log(f"concordant (q<0.05 both + same sign): {summary.n_concordant} "
        f"(up={summary.n_up}, down={summary.n_down})")
    keep.to_parquet(GEO_DIR / "concordant_genes.parquet")
    log("\n==== SANITY-GATE GENES ====")
    for g in SANITY_GENES:
        in1 = g in de1.index
        in2 = g in de2.index
        row1 = de1.loc[g] if in1 else None
        row2 = de2.loc[g] if in2 else None
        concord = "YES" if g in keep.index else "no"
        d1 = f"lfc={row1['log_fc']:+.2f},q={row1['qvalue']:.1e}" if in1 else "absent"
        d2 = f"lfc={row2['log_fc']:+.2f},q={row2['qvalue']:.1e}" if in2 else "absent"
        log(f"  {g:7s} | GSE35007 {d1:28s} | GSE16728 {d2:28s} | concordant={concord}")
    log("\n==== TOP 15 CONCORDANT (by worst-case q) ====")
    log(keep.head(15).to_string())
 if __name__ == "__main__":
    main()
--- a/scripts/week1_finalize.py
+++ b/scripts/week1_finalize.py
@@ -0,0 +1,121 @@
 """Finalize the Week 1 sickle cell signature (Tier A, whole-blood concordance pair).
 Reads the cached concordant gene table, maps symbols -> Entrez/Ensembl via mygene, attaches
 two-study provenance + concordance summary, and persists
 data/processed/sickle_cell_signature_v1.json.
 Tier note: GSE16728 is exactly 10/group, which fails assign_tier()'s strict n>10. Tier A is
 assigned explicitly here on the basis of cross-study, cross-platform, cross-population
 concordance (the founder's decision), with the n=10 / prep-merge caveat documented in the
 tier rationale and limitations.
 """
 from __future__ import annotations
 import sys
 from pathlib import Path
 import mygene
 import pandas as pd
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from src.disease import (  # noqa: E402
    ConcordanceSummary,
    SignatureProvenance,
    StudyProvenance,
    build_signature,
    persist_signature,
 )
 from src.provenance import ConfidenceTier  # noqa: E402
 CREATED_DATE = "2026-06-23"
 GEO_DIR = Path("data/raw/geo")
 def map_ids(symbols: list[str]) -> pd.DataFrame:
    mg = mygene.MyGeneInfo()
    res = mg.querymany(
        symbols, scopes="symbol", fields="entrezgene,ensembl.gene",
        species="human", as_dataframe=True, verbose=False,
    )
    res = res[~res.index.duplicated(keep="first")]
    def _ensembl(v):
        if isinstance(v, list) and v:
            v = v[0]
        if isinstance(v, dict):
            return v.get("gene")
        return v
    id_map = pd.DataFrame(index=res.index)
    id_map["entrez_id"] = res["entrezgene"] if "entrezgene" in res.columns else None
    ens_col = "ensembl.gene" if "ensembl.gene" in res.columns else ("ensembl" if "ensembl" in res.columns else None)
    id_map["ensembl_id"] = res[ens_col].map(_ensembl) if ens_col else None
    return id_map
 def main() -> None:
    concordant = pd.read_parquet(GEO_DIR / "concordant_genes.parquet")
    print(f"loaded {len(concordant)} concordant genes "
          f"({(concordant['log_fc']>0).sum()} up / {(concordant['log_fc']<0).sum()} down)")
    id_map = map_ids(list(concordant.index))
    print(f"mygene mapped {id_map['entrez_id'].notna().sum()}/{len(id_map)} symbols to Entrez")
    provenance = SignatureProvenance(
        studies=[
            StudyProvenance(
                geo_accession="GSE35007", n_disease=190, n_healthy=12,
                platform="Illumina HumanHT-12 V4 (GPL10558)",
                tissue="whole blood", method="welch",
            ),
            StudyProvenance(
                geo_accession="GSE16728", n_disease=10, n_healthy=10,
                platform="Affymetrix HG-U133 Plus 2.0 (GPL570)",
                tissue="whole blood (PAXgene; globin-reduced + total RNA preps merged)",
                method="welch",
            ),
        ],
        concordance=ConcordanceSummary(
            n_genes_tested=16208, n_concordant=671, n_up=444, n_down=227,
        ),
        created_date=CREATED_DATE,
    )
    tier_rationale = (
        "Measured RNA expression concordant across two independent peer-reviewed whole-blood "
        "studies on different platforms (Illumina GPL10558, Affymetrix GPL570) and populations "
        "(pediatric West-African GSE35007; adult GSE16728). Tier A rests on cross-study / "
        "cross-platform concordance. Caveat: GSE16728 is exactly 10/group (two PAXgene preps "
        "merged), which does not meet the strict n>10 per-study threshold on its own."
    )
    limitations = [
        "Whole-blood expression is partly driven by cell-composition differences "
        "(reticulocytosis in sickle patients), not pure disease-state regulation; the "
        "signature is dominated by erythroid markers (CA1, AHSP, SLC4A1). v2 needs cell-type "
        "deconvolution.",
        "Fetal-hemoglobin genes (HBG1/HBG2) are NOT captured: flat in GSE35007 and removed by "
        "GSE16728's globin-depleted RNA prep. The signature therefore does not directly encode "
        "the HbF axis that hydroxyurea acts on, which may weaken the hydroxyurea recovery test.",
        "GSE35007 is highly imbalanced (SS n=190 vs AA n=12); the healthy arm is small.",
        "GSE16728 whole-blood arm is small (10/group) and merges two RNA-prep batches.",
        "Cross-platform probe->symbol collapse (keep max-mean-expression probe) loses "
        "isoform-level resolution and drops genes absent from either platform.",
    ]
    signature = build_signature(
        concordant, provenance,
        tier=ConfidenceTier.A,
        tier_rationale=tier_rationale,
        limitations=limitations,
        id_map=id_map,
        top_n=250,
    )
    path = persist_signature(signature)
    print(f"wrote {path}")
    print(f"signature: {len(signature.up_regulated)} up / {len(signature.down_regulated)} down, "
          f"tier {signature.confidence_tier.value}")
 if __name__ == "__main__":
    main()
--- a/scripts/week2_assemble.py
+++ b/scripts/week2_assemble.py
@@ -0,0 +1,91 @@
 """Week 2, task 4: assemble drug_profiles_v1.parquet (PLAN §6).
 Joins the curated drug set + ChEMBL enrichment + LINCS consensus signatures into one profile
 table. Each drug carries a confidence tier: LINCS is a single source, so signature-backed drugs
 are Tier B at best (assign_tier with single_source=True); drugs with no signature are Tier C and
 marked not-scored (not dropped silently — PLAN §6 Week 3 task 2).
 The 978-gene signature order is the column order of lincs_signatures_v1.parquet (landmark
 symbols); each profile's `lincs_signature` is that vector (or null).
 """
 from __future__ import annotations
 import ast
 from pathlib import Path
 import pandas as pd
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from src.drugs import persist_drug_profiles  # noqa: E402
 from src.provenance import ConfidenceTier, assign_tier  # noqa: E402
 PROCESSED = Path("data/processed")
 DRUG_SET = PROCESSED / "drug_set_v1.csv"
 CHEMBL = Path("data/raw/chembl/chembl_enrichment.parquet")
 LINCS_SIG = PROCESSED / "lincs_signatures_v1.parquet"
 def main() -> None:
    drugs = pd.read_csv(DRUG_SET)
    chembl = pd.read_parquet(CHEMBL)[
        ["pert_iname", "chembl_id", "pref_name", "smiles", "mechanism_of_action", "targets"]
    ]
    sigs = pd.read_parquet(LINCS_SIG)  # rows=pert_iname, cols=978 landmark symbols
    gene_order = list(sigs.columns)
    df = drugs.merge(chembl, on="pert_iname", how="left")
    rows = []
    for r in df.itertuples():
        has_sig = r.pert_iname in sigs.index
        vector = sigs.loc[r.pert_iname].tolist() if has_sig else None
        # LINCS = single source => Tier B max when measured; no signature => Tier C.
        tier = assign_tier(
            is_measured=has_sig, n_per_group=None, peer_reviewed=True, single_source=True
        ) if has_sig else ConfidenceTier.C
        targets = r.targets
        if isinstance(targets, str):
            try:
                targets = ast.literal_eval(targets)
            except (ValueError, SyntaxError):
                targets = []
        elif hasattr(targets, "tolist"):  # numpy ndarray from parquet round-trip
            targets = targets.tolist()
        elif targets is None or (not isinstance(targets, (list, tuple))):
            targets = []
        rows.append({
            "name": r.pert_iname,
            "chembl_id": r.chembl_id if pd.notna(r.chembl_id) else None,
            "pref_name": r.pref_name if pd.notna(r.pref_name) else None,
            "inchikey": r.inchi_key if pd.notna(r.inchi_key) else None,
            "smiles": r.smiles if pd.notna(r.smiles) else None,
            "targets": list(targets),
            "mechanism_of_action": r.mechanism_of_action if pd.notna(r.mechanism_of_action) else None,
            "inclusion_reason": r.inclusion_reason,
            "lincs_phase": r.phase,
            "scored": has_sig,
            "lincs_signature": vector,
            "confidence_tier": tier.value,
        })
    profiles = pd.DataFrame(rows)
    # Persist the gene order alongside, so Week-3 scoring can align the vectors.
    (PROCESSED / "lincs_gene_order.txt").write_text("\n".join(gene_order))
    path = persist_drug_profiles(profiles)
    print(f"drug_profiles_v1: {len(profiles)} drugs")
    print(f"  scored (have LINCS signature): {profiles['scored'].sum()}")
    print(f"  not scored: {(~profiles['scored']).sum()}")
    print("  by inclusion reason (scored rate):")
    print(profiles.groupby("inclusion_reason")["scored"].agg(["sum", "count"]).to_string())
    print("  tier split:", profiles["confidence_tier"].value_counts().to_dict())
    for gt in ["hydroxyurea", "glutamine"]:
        row = profiles[profiles["name"] == gt]
        print(f"  ground truth '{gt}': scored={bool(row['scored'].iloc[0]) if len(row) else 'ABSENT'}")
    print(f"wrote {path}")
 if __name__ == "__main__":
    main()
--- a/scripts/week2_chembl.py
+++ b/scripts/week2_chembl.py
@@ -0,0 +1,99 @@
 """Week 2, task 2: enrich the drug set with ChEMBL structure/target/mechanism data.
 Drugs are matched to ChEMBL by the InChIKey already carried from LINCS pert_info (reliable),
 then mechanism-of-action and target names are pulled. Compounds absent from ChEMBL (many
 research/tool compounds in the random arm) keep null ChEMBL fields — they still have LINCS
 signatures for scoring; only the Week-3 mechanistic prior won't apply. Output cached to
 data/raw/chembl/chembl_enrichment.parquet.
 """
 from __future__ import annotations
 from pathlib import Path
 import pandas as pd
 from chembl_webresource_client.new_client import new_client
 DRUG_SET = Path("data/processed/drug_set_v1.csv")
 OUT = Path("data/raw/chembl/chembl_enrichment.parquet")
 BATCH = 40
 def chunks(seq, n):
    for i in range(0, len(seq), n):
        yield seq[i:i + n]
 def main() -> None:
    drugs = pd.read_csv(DRUG_SET)
    inchikeys = sorted({k for k in drugs["inchi_key"].dropna() if isinstance(k, str) and len(k) > 10})
    print(f"{len(drugs)} drugs; {len(inchikeys)} usable InChIKeys to resolve")
    molecule = new_client.molecule
    mechanism = new_client.mechanism
    target = new_client.target
    # 1) InChIKey -> ChEMBL molecule (id, name, smiles)
    mol_rows = []
    for i, batch in enumerate(chunks(inchikeys, BATCH)):
        res = molecule.filter(molecule_structures__standard_inchi_key__in=batch).only(
            ["molecule_chembl_id", "pref_name", "molecule_structures"])
        for m in res:
            ms = m.get("molecule_structures") or {}
            mol_rows.append({
                "chembl_id": m["molecule_chembl_id"],
                "pref_name": m.get("pref_name"),
                "smiles": ms.get("canonical_smiles"),
                "inchi_key": ms.get("standard_inchi_key"),
            })
        print(f"  molecules batch {i+1}: cumulative {len(mol_rows)} hits", flush=True)
    mols = pd.DataFrame(mol_rows).drop_duplicates("inchi_key")
    chembl_ids = sorted(mols["chembl_id"].unique())
    print(f"resolved {len(mols)} molecules -> {len(chembl_ids)} ChEMBL ids")
    # 2) ChEMBL id -> mechanism of action + target ids
    mech_rows = []
    for batch in chunks(chembl_ids, BATCH):
        for m in mechanism.filter(molecule_chembl_id__in=batch).only(
                ["molecule_chembl_id", "mechanism_of_action", "target_chembl_id"]):
            mech_rows.append(m)
    mech = pd.DataFrame(mech_rows)
    print(f"mechanism records: {len(mech)}")
    # 3) target id -> name
    tgt_names = {}
    if not mech.empty:
        tids = sorted({t for t in mech["target_chembl_id"].dropna().unique()})
        for batch in chunks(tids, BATCH):
            for t in target.filter(target_chembl_id__in=batch).only(["target_chembl_id", "pref_name"]):
                tgt_names[t["target_chembl_id"]] = t.get("pref_name")
    # aggregate mechanism/targets per molecule
    def agg(df):
        moa = sorted({x for x in df["mechanism_of_action"].dropna()})
        tns = sorted({tgt_names.get(t) for t in df["target_chembl_id"].dropna() if tgt_names.get(t)})
        return pd.Series({"mechanism_of_action": "; ".join(moa) or None, "targets": tns})
    if not mech.empty:
        per_mol = mech.groupby("molecule_chembl_id").apply(agg, include_groups=False).reset_index()
        per_mol = per_mol.rename(columns={"molecule_chembl_id": "chembl_id"})
        mols = mols.merge(per_mol, on="chembl_id", how="left")
    else:
        mols["mechanism_of_action"] = None
        mols["targets"] = None
    # join back to the drug set on inchi_key
    enriched = drugs.merge(mols, on="inchi_key", how="left", suffixes=("", "_chembl"))
    OUT.parent.mkdir(parents=True, exist_ok=True)
    enriched.to_parquet(OUT, index=False)
    n_resolved = enriched["chembl_id"].notna().sum()
    n_moa = enriched["mechanism_of_action"].notna().sum()
    print(f"\nenriched {len(enriched)} drugs: {n_resolved} matched ChEMBL, {n_moa} have MoA")
    print(f"by reason, ChEMBL match rate:")
    print(enriched.assign(matched=enriched["chembl_id"].notna()).groupby("inclusion_reason")["matched"].mean().round(2).to_string())
    print(f"wrote {OUT}")
 if __name__ == "__main__":
    main()
--- a/scripts/week2_curate_drugset.py
+++ b/scripts/week2_curate_drugset.py
@@ -0,0 +1,131 @@
 """Week 2, task 1: curate the deliberately-composed ~300-drug set (PLAN §6).
 Composition: 2 ground-truth + ~50 related-mechanism + ~50 negative controls + ~200 random.
 The universe is restricted to compounds that actually have a LINCS Level-5 signature (in
 Phase I and/or Phase II), so every curated drug is scorable. Output: drug_set_v1.csv.
 """
 from __future__ import annotations
 import gzip
 import io
 from pathlib import Path
 import pandas as pd
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from src import RANDOM_SEED  # noqa: E402
 LINCS = Path("data/raw/lincs")
 OUT = Path("data/processed/drug_set_v1.csv")
 GROUND_TRUTH = ["hydroxyurea", "glutamine"]  # glutamine == L-glutamine in LINCS
 # Curated by mechanism (PLAN §6). Intersected with the LINCS catalog below, so misses are
 # silently dropped — we keep whatever actually has a signature.
 RELATED_MECHANISM = [
    # HbF inducers / epigenetic
    "decitabine", "azacitidine", "vorinostat", "panobinostat", "romidepsin", "entinostat",
    "mocetinostat", "belinostat", "pomalidomide", "lenalidomide", "thalidomide", "apicidin",
    "trichostatin-a", "scriptaid", "valproic-acid",
    # NO / vascular
    "sildenafil", "tadalafil", "nitroprusside",
    # antioxidants
    "n-acetyl-cysteine", "resveratrol", "curcumin", "quercetin", "sulforaphane",
    # anti-inflammatory studied in SCD
    "dexamethasone", "prednisolone", "hydrocortisone", "ibuprofen", "indomethacin",
    "sulfasalazine", "montelukast", "aspirin",
    # iron / heme / SCD-adjacent
    "hemin", "deferoxamine", "deferasirox", "simvastatin", "atorvastatin", "ticagrelor",
 ]
 NEGATIVE_CONTROL = [
    # antifungals
    "fluconazole", "ketoconazole", "itraconazole", "clotrimazole", "terbinafine", "miconazole",
    # antihistamines
    "loratadine", "cetirizine", "fexofenadine", "diphenhydramine", "chlorpheniramine",
    "astemizole",
    # antibiotics
    "amoxicillin", "ciprofloxacin", "doxycycline", "trimethoprim", "azithromycin", "tetracycline",
    "nitrofurantoin",
    # hormones / contraceptives
    "levonorgestrel", "ethinyl-estradiol", "norethindrone", "medroxyprogesterone-acetate",
    # misc unrelated
    "omeprazole", "ranitidine", "loperamide", "caffeine", "acetaminophen", "lidocaine",
 ]
 # Fill the random sample so the total set is ~300 (the denominator the pre-registered
 # recovery-test thresholds assume: "top 30 of 300"). Curated mechanism/control drugs are
 # capped by what LINCS actually contains, so the random arm absorbs the remainder.
 TARGET_TOTAL = 300
 def load_catalog() -> pd.DataFrame:
    """Compounds with >=1 Level-5 signature, annotated with phase + inchi/smiles."""
    def read_gz(fn, **kw):
        return pd.read_csv(io.BytesIO(gzip.decompress(Path(fn).read_bytes())), sep="\t", **kw)
    sig1 = read_gz(LINCS / "GSE92742_sig_info.txt.gz", low_memory=False)
    sig2 = read_gz(LINCS / "GSE70138_sig_info.txt.gz", low_memory=False)
    cp1 = set(sig1[sig1["pert_type"] == "trt_cp"]["pert_iname"])
    cp2 = set(sig2[sig2["pert_type"] == "trt_cp"]["pert_iname"])
    pert1 = read_gz(LINCS / "GSE92742_pert_info.txt.gz", low_memory=False)
    pert2 = read_gz(LINCS / "GSE70138_pert_info.txt.gz", low_memory=False)
    info = pd.concat([pert1, pert2], ignore_index=True)
    info = info[info["pert_type"] == "trt_cp"].drop_duplicates("pert_iname", keep="first")
    info = info.set_index("pert_iname")
    names = cp1 | cp2
    rows = []
    for nm in names:
        phase = "both" if nm in cp1 and nm in cp2 else ("P1" if nm in cp1 else "P2")
        rec = info.loc[nm] if nm in info.index else None
        rows.append({
            "pert_iname": nm,
            "phase": phase,
            "pert_id": rec["pert_id"] if rec is not None else None,
            "inchi_key": rec["inchi_key"] if rec is not None else None,
            "canonical_smiles": rec["canonical_smiles"] if rec is not None else None,
        })
    return pd.DataFrame(rows).set_index("pert_iname")
 def pick(catalog: pd.DataFrame, names: list[str], reason: str) -> pd.DataFrame:
    present = [n for n in names if n in catalog.index]
    missing = [n for n in names if n not in catalog.index]
    if missing:
        print(f"  [{reason}] {len(present)}/{len(names)} in LINCS; dropped: {missing}")
    out = catalog.loc[present].copy()
    out["inclusion_reason"] = reason
    return out
 def main() -> None:
    catalog = load_catalog()
    print(f"LINCS scorable compound universe: {len(catalog)}")
    gt = pick(catalog, GROUND_TRUTH, "ground_truth")
    rel = pick(catalog, RELATED_MECHANISM, "related_mechanism")
    neg = pick(catalog, NEGATIVE_CONTROL, "negative_control")
    chosen = pd.concat([gt, rel, neg])
    remaining = catalog.drop(index=chosen.index)
    n_random = TARGET_TOTAL - len(chosen)
    rand = remaining.sample(n=n_random, random_state=RANDOM_SEED).copy()
    rand["inclusion_reason"] = "general_sample"
    drug_set = pd.concat([gt, rel, neg, rand]).reset_index()
    OUT.parent.mkdir(parents=True, exist_ok=True)
    drug_set.to_csv(OUT, index=False)
    print(f"\ndrug_set_v1.csv: {len(drug_set)} drugs")
    print(drug_set["inclusion_reason"].value_counts().to_string())
    print(f"phase split:\n{drug_set['phase'].value_counts().to_string()}")
    print(f"wrote {OUT}")
 if __name__ == "__main__":
    main()
--- a/scripts/week2_lincs_extract.py
+++ b/scripts/week2_lincs_extract.py
@@ -0,0 +1,107 @@
 """Week 2, task 3: extract per-drug LINCS L1000 consensus signatures.
 For each drug in the set, slice its Level-5 MODZ signatures (978 landmark genes x its sig_ids)
 out of the big GCTX via cmapPy, then aggregate to ONE consensus 978-vector per drug by mean
 across its sig_ids (the "MODZ aggregation across cell lines/replicates" of PLAN §6). hydroxyurea
 lives in Phase II, L-glutamine in Phase I, so both phases are processed and merged.
 Output: data/processed/lincs_signatures_v1.parquet (rows = pert_iname, cols = 978 landmark
 gene symbols, values = mean MODZ z-score).
 Usage:
    python scripts/week2_lincs_extract.py --phase 2   # test on Phase II (already downloaded)
    python scripts/week2_lincs_extract.py             # both phases (needs Phase I gunzipped)
 """
 from __future__ import annotations
 import argparse
 import gzip
 import io
 from pathlib import Path
 import pandas as pd
 from cmapPy.pandasGEXpress.parse import parse
 LINCS = Path("data/raw/lincs")
 DRUG_SET = Path("data/processed/drug_set_v1.csv")
 OUT = Path("data/processed/lincs_signatures_v1.parquet")
 GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
 SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
 def read_gz_tsv(name: str) -> pd.DataFrame:
    return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
 def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
    """Gene row-ids + id->symbol map for the scored gene space.
    v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks.
    This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into
    scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier).
    """
    g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
    ids = [str(x) for x in g["pr_gene_id"]]
    id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
    return ids, id_to_symbol
 def extract_phase(phase: int, drug_names: set[str], landmark_ids: list[str]) -> pd.DataFrame:
    """Return DataFrame: rows=pert_iname, cols=landmark gene_id (str), one mean vector per drug."""
    sig = read_gz_tsv(SIG_INFO[phase])
    sig = sig[(sig["pert_type"] == "trt_cp") & (sig["pert_iname"].isin(drug_names))]
    if sig.empty:
        return pd.DataFrame()
    sig_ids = sig["sig_id"].tolist()
    print(f"[phase {phase}] {sig['pert_iname'].nunique()} drugs, {len(sig_ids)} signatures to slice", flush=True)
    gctoo = parse(str(GCTX[phase]), rid=landmark_ids, cid=sig_ids)
    data = gctoo.data_df  # rows=gene_id, cols=sig_id
    sig_to_drug = dict(zip(sig["sig_id"], sig["pert_iname"]))
    # mean across each drug's sig_ids -> one consensus vector per drug
    per_drug = data.T.groupby(data.columns.map(sig_to_drug)).mean()
    print(f"[phase {phase}] aggregated to {len(per_drug)} drug consensus vectors", flush=True)
    return per_drug  # rows=pert_iname, cols=gene_id
 def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--phase", type=int, choices=[1, 2], default=None, help="single phase (test)")
    args = ap.parse_args()
    drugs = pd.read_csv(DRUG_SET)
    drug_names = set(drugs["pert_iname"])
    landmark_ids, id_to_symbol = landmark_ids_and_symbols()
    phases = [args.phase] if args.phase else [1, 2]
    frames = []
    for ph in phases:
        if not GCTX[ph].exists():
            print(f"[phase {ph}] {GCTX[ph]} missing — skipping")
            continue
        frames.append(extract_phase(ph, drug_names, landmark_ids))
    frames = [f for f in frames if not f.empty]
    if not frames:
        print("no signatures extracted")
        return
    # A drug present in both phases: average the two phase consensus vectors.
    combined = pd.concat(frames).groupby(level=0).mean()
    combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]  # gene_id -> symbol
    covered = sorted(set(combined.index))
    missing = sorted(drug_names - set(covered))
    print(f"\nsignatures extracted for {len(covered)}/{len(drug_names)} drugs")
    for gt in ["hydroxyurea", "glutamine"]:
        print(f"  ground truth '{gt}': {'OK' if gt in covered else 'MISSING'}")
    if args.phase is None:
        combined.to_parquet(OUT)
        print(f"wrote {OUT} ({combined.shape[0]} drugs x {combined.shape[1]} landmark genes)")
        if missing:
            print(f"{len(missing)} drugs without signature (will be marked not-scored in Week 3)")
 if __name__ == "__main__":
    main()
--- a/scripts/week3_scoring.py
+++ b/scripts/week3_scoring.py
@@ -0,0 +1,82 @@
 """Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration.
 Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of
 random queries) — this calibrates out broad-effect drugs that connect to random signatures too,
 the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column,
 and a secondary ranking blends tau with the sickle-pathway mechanistic prior.
 Output: data/results/ranked_candidates_v1.csv.
 """
 from __future__ import annotations
 import json
 from pathlib import Path
 import pandas as pd
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from src.scoring import (  # noqa: E402
    connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate,
 )
 PROCESSED = Path("data/processed")
 N_NULL = 1000
 PRIOR_LAMBDA = 0.5  # spec_z credit per matched sickle pathway, for the blended ranking
 def main() -> None:
    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
    up = [g["gene"] for g in sig["up_regulated"]]
    down = [g["gene"] for g in sig["down_regulated"]]
    sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")  # drug x 12,328 genes
    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
    n_up = len(set(up) & set(sig_matrix.columns))
    n_down = len(set(down) & set(sig_matrix.columns))
    print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}")
    # primary: tau calibration
    print(f"computing tau over {N_NULL} random-query permutations ...", flush=True)
    ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL)
    # reference: weighted connectivity score (WTCS) + NCS
    wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index},
                     name="connectivity_score")
    ranked["connectivity_score"] = wtcs
    ranked["normalized_score"] = normalize_scores(wtcs)
    # metadata + mechanistic prior
    ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
    ranked["mechanistic_prior"] = ranked["targets"].apply(
        lambda t: mechanistic_prior(list(t) if t is not None else []))
    ranked["known_targets"] = ranked["targets"].apply(
        lambda t: "; ".join(t) if t is not None and len(t) else "")
    ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})
    # secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z)
    ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
    ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)
    out = ranked.rename_axis("drug_name").reset_index()[[
        "rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score",
        "inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
    ]]
    path = persist_ranking(out)
    print(f"wrote {path} ({len(out)} drugs)")
    print("\n--- sanity peek (spec_z ranking) ---")
    for gt in ["hydroxyurea", "glutamine"]:
        r = ranked.loc[gt]
        print(f"  {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), "
              f"spec_z={r['spec_z']:.2f}")
    print("  top 10 by spec_z:")
    for name, r in ranked.nsmallest(10, "spec_z").iterrows():
        print(f"    {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] "
              f"{str(r['known_targets'])[:38]}")
 if __name__ == "__main__":
    main()
--- a/scripts/week4_recovery_test.py
+++ b/scripts/week4_recovery_test.py
@@ -0,0 +1,81 @@
 """Week 4: formal recovery test against the pre-registered criteria (PLAN §6).
 Pre-registered criteria (committed in docs/recovery_test_report.md before this run):
  - hydroxyurea in top 10% (top 30 of 300), AND
  - L-glutamine in top 25% (top 75) OR documented unscorable due to missing LINCS signature, AND
  - >=4 of 5 pre-specified negative controls in the bottom half.
 The 5 negative controls are pre-specified here by a category rule (one per category, alphabetically
 first available) so the choice does not peek at ranks. Primary ranking = raw connectivity.
 """
 from __future__ import annotations
 from pathlib import Path
 import pandas as pd
 RANKED = Path("data/results/ranked_candidates_v1.csv")
 # One per unrelated category, alphabetical-first — chosen without looking at ranks.
 NEG_CONTROL_CATEGORIES = {
    "antifungal": ["clotrimazole", "fluconazole", "itraconazole", "ketoconazole", "miconazole", "terbinafine"],
    "antihistamine": ["astemizole", "cetirizine", "diphenhydramine", "fexofenadine", "loratadine"],
    "antibiotic": ["azithromycin", "ciprofloxacin", "doxycycline", "tetracycline", "trimethoprim"],
    "hormone": ["ethinyl-estradiol", "levonorgestrel", "medroxyprogesterone-acetate", "norethindrone"],
    "misc": ["caffeine", "lidocaine", "loperamide", "omeprazole", "ranitidine"],
 }
 def main() -> None:
    df = pd.read_csv(RANKED).set_index("drug_name")
    n = len(df)
    top10_cut, top25_cut, half = int(n * 0.10), int(n * 0.25), n // 2
    def rk(name):
        return int(df.loc[name, "rank"]) if name in df.index else None
    hu, glut = rk("hydroxyurea"), rk("glutamine")
    glut_z = df.loc["glutamine", "spec_z"]
    # pick negative controls present in the ranking
    negs = {}
    for cat, options in NEG_CONTROL_CATEGORIES.items():
        pick = next((d for d in options if d in df.index), None)
        if pick:
            negs[pick] = (cat, rk(pick))
    print("=" * 60)
    print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
    print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%)  -> top-10%? {hu <= top10_cut}")
    print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), spec_z={glut_z:+.2f}  "
          f"-> top-25%? {glut <= top25_cut}  (positive z => does not reverse; has a signature)")
    print("\nnegative controls (pre-specified, 1 per category):")
    n_bottom = 0
    for d, (cat, r) in negs.items():
        in_bottom = r > half
        n_bottom += in_bottom
        print(f"  {d:18s} [{cat:13s}] rank {r:3d}  bottom-half? {in_bottom}")
    print(f"  -> {n_bottom}/5 in bottom half (need >=4)")
    crit_hu = hu <= top10_cut
    crit_glut = glut <= top25_cut
    crit_neg = n_bottom >= 4
    overall = crit_hu and crit_glut and crit_neg
    print(f"\nCRITERIA: hydroxyurea={crit_hu}, L-glutamine={crit_glut}, neg-controls={crit_neg}")
    print(f"OVERALL (raw ranking): {'PASS' if overall else 'FAIL'}")
    # secondary prior-weighted view (reported, not the primary criterion)
    hu_b = int(df.loc["hydroxyurea", "blended_rank"])
    print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
          f"(top {100*hu_b/n:.1f}%)")
    print("\n--- TOP 10 (primary spec_z ranking) ---")
    top10 = df.sort_values("rank").head(10)
    for name, r in top10.iterrows():
        print(f"  {int(r['rank']):2d}  {name:18s} z={r['spec_z']:+.2f}  "
              f"[{r['inclusion_reason']}]  {str(r['known_targets'])[:45]}")
 if __name__ == "__main__":
    main()
--- a/src/binding.py
+++ b/src/binding.py
@@ -0,0 +1,89 @@
 """Structure-based binding track (PLAN §12).
 Two capabilities:
 - ligand-based retrieval (RDKit, works now): find existing drugs near a query molecule in
  chemical space — validated, and the engine behind §12.9 generative-guided retrieval.
 - structure-based docking (§12.3): score whether a ligand binds a target pocket. Blocked on an
  ARM-Mac docking toolchain (AutoDock Vina does not pip-install); see ``dock`` for options.
 Caveat carried throughout: chemical similarity != activity, and docking != efficacy (§12.6).
 """
 from __future__ import annotations
 from pathlib import Path
 from rdkit import Chem, DataStructs, RDLogger
 from rdkit.Chem import rdFingerprintGenerator
 RDLogger.DisableLog("rdApp.*")
 _MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
 STRUCT_DIR = Path("data/raw/structures")
 # Known sickle small-molecule binders, by target (positive controls for the §12.4 recovery test).
 KNOWN_BINDERS = {
    "hemoglobin": "voxelotor",
    "PKR": "mitapivat",
    "DNMT1": "decitabine",
    "HDAC": "vorinostat",
 }
 # Curated target structures (PLAN §12.2). Add PDB ids as the harness grows.
 TARGET_PDB = {
    "hemoglobin": "5E83",  # hemoglobin + voxelotor (GBT440)
    "PKR": "8XFD",         # pyruvate kinase R + mitapivat
 }
 def morgan_fp(smiles: str):
    """Morgan (ECFP4) fingerprint, or None for invalid / missing SMILES ('-666', '')."""
    if not isinstance(smiles, str) or smiles in ("-666", ""):
        return None
    mol = Chem.MolFromSmiles(smiles)
    return _MORGAN.GetFingerprint(mol) if mol else None
 def tanimoto(smiles_a: str, smiles_b: str) -> float | None:
    fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
    if fa is None or fb is None:
        return None
    return DataStructs.TanimotoSimilarity(fa, fb)
 def retrieve_nearest(
    query_smiles: str,
    library: dict[str, str],
    top_n: int = 5,
 ) -> list[tuple[float, str]]:
    """Rank a library of {name: smiles} by Tanimoto similarity to a query molecule.
    This is the §12.9 retrieval step: the query may be a known binder (positive-control beacon)
    or a generated idealised binder; the returned existing drugs are repurposing candidates that
    STILL require docking/validation (similarity != activity).
    """
    qfp = morgan_fp(query_smiles)
    if qfp is None:
        raise ValueError("invalid query SMILES")
    sims = []
    for name, smi in library.items():
        fp = morgan_fp(smi)
        if fp is not None:
            sims.append((DataStructs.TanimotoSimilarity(qfp, fp), name))
    return sorted(sims, reverse=True)[:top_n]
 def dock(target: str, ligand_smiles: str) -> float:
    """Dock a ligand into a target pocket and return a binding score (PLAN §12.3).
    Blocked: AutoDock Vina does not pip-install on ARM Mac and no docking binary is on PATH.
    Resolve the toolchain first (one of):
      - conda/micromamba: ``vina`` (conda-forge) or ``smina`` (bioconda), osx-arm64 builds
      - an AF3-class co-folding model on GPU: Boltz-2 / Chai-1 / DiffDock (also predicts affinity)
      - x86 Vina binary under Rosetta 2
    Then: fetch TARGET_PDB[target], define the pocket box, prep the ligand (Meeko), score.
    """
    raise NotImplementedError(
        "Docking toolchain unresolved on ARM Mac (PLAN §12.6 pitfall 4 / §12.8). "
        "See docstring for options."
    )
--- a/src/disease.py
+++ b/src/disease.py
@@ -1,19 +1,27 @@
 """Disease signature construction.
 Week 1 (PLAN.md §6). Builds a Tier-A sickle cell signature from GEO expression data via
-differential expression, then persists it with full provenance to
+differential expression on TWO studies, keeping only genes that are concordant across both
-``data/processed/sickle_cell_signature_v1.json``.
+(the multi-source evidence that earns Tier A — see project memory). Persists with full
 provenance to ``data/processed/sickle_cell_signature_v1.json``.
-This module defines the persisted schema (pydantic) and the construction stubs. The actual
+Pipeline (driven from ``notebooks/02_disease_signature.ipynb``):
-data pull + differential expression is driven from ``notebooks/02_disease_signature.ipynb``.
+    per study:  compute_differential_expression -> collapse_probes_to_symbols
    across:     concordance_filter -> build_signature -> persist_signature
 Conventions:
    - log_fc > 0  =>  up-regulated in DISEASE vs HEALTHY (Week 1 refinement R1)
    - cross-study join key is the HGNC gene symbol (R2)
 """
 from __future__ import annotations
 from pathlib import Path
 import numpy as np
 import pandas as pd
 from pydantic import BaseModel, Field
 from scipy.stats import false_discovery_control, ttest_ind
 from . import PIPELINE_VERSION, PROCESSED_DIR
 from .provenance import ConfidenceTier
@@ -22,6 +30,10 @@ from .provenance import ConfidenceTier
 TOP_N_PER_DIRECTION = 250
 QVALUE_CUTOFF = 0.05
 # Group labels used throughout. The disease group is the "treatment" arm in DE contrasts.
 DISEASE_LABEL = "disease"
 HEALTHY_LABEL = "healthy"
 class GeneEntry(BaseModel):
    """A single differentially expressed gene in the signature."""
@@ -33,15 +45,37 @@ class GeneEntry(BaseModel):
    qvalue: float
-class SignatureProvenance(BaseModel):
+class StudyProvenance(BaseModel):
-    """Provenance block for a disease signature (PLAN.md §6 schema)."""
+    """Provenance for one contributing GEO study (Week 1 refinement R6)."""
    geo_accession: str
    n_disease: int
    n_healthy: int
    platform: str
-    method: str = Field(..., description="Differential expression method, e.g. 'limma', 'deseq2'.")
+    tissue: str
    method: str = Field(..., description="DE method: 'welch' (microarray) or 'deseq2' (RNA-seq).")
 class ConcordanceSummary(BaseModel):
    """How the two studies were reconciled into the signature (R6)."""
    rule: str = Field(
        default="q<0.05 in both studies AND same sign of log_fc in both",
        description="The concordance rule applied across studies.",
    )
    n_genes_tested: int = Field(..., description="Genes present in both studies after symbol collapse.")
    n_concordant: int
    n_up: int
    n_down: int
 class SignatureProvenance(BaseModel):
    """Provenance block for a disease signature (revised for 2-study concordance, R6)."""
    studies: list[StudyProvenance]
    concordance: ConcordanceSummary
    created_date: str
    pipeline_version: str = PIPELINE_VERSION
 class DiseaseSignature(BaseModel):
@@ -58,44 +92,213 @@ class DiseaseSignature(BaseModel):
    limitations: list[str]
 def _looks_like_linear_intensity(expression: pd.DataFrame) -> bool:
    """Heuristic: microarray data still on a linear scale has a large dynamic range.
    log2-transformed intensities are typically <~20; raw intensities run into the thousands.
    """
    return float(np.nanmax(expression.to_numpy())) > 50.0
 def _welch_de(expression: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
    """Per-gene Welch t-test + Benjamini-Hochberg, the limma-equivalent for microarray.
    Assumes ``expression`` is genes (rows) x samples (columns), on (or coercible to) a log2
    scale. ``log_fc`` is mean(disease) - mean(healthy) (R1 sign convention).
    """
    expr = expression.copy()
    if _looks_like_linear_intensity(expr):
        # log2(x+1) guards against zeros/negatives while keeping the transform standard.
        expr = np.log2(expr.clip(lower=0) + 1.0)
    disease_cols = groups.index[groups == DISEASE_LABEL]
    healthy_cols = groups.index[groups == HEALTHY_LABEL]
    disease = expr[disease_cols].to_numpy()
    healthy = expr[healthy_cols].to_numpy()
    log_fc = disease.mean(axis=1) - healthy.mean(axis=1)
    # Welch's t-test (unequal variance) per gene across samples.
    _, pvalue = ttest_ind(disease, healthy, axis=1, equal_var=False, nan_policy="omit")
    out = pd.DataFrame({"log_fc": log_fc, "pvalue": pvalue}, index=expr.index)
    out = out.dropna(subset=["pvalue"])
    out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
    return out.sort_values("qvalue")
 def _deseq2_de(counts: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
    """RNA-seq differential expression via pydeseq2.
    ``counts`` is genes (rows) x samples (columns) of raw integer counts. Returns the same
    schema as ``_welch_de`` (log_fc = log2FC of disease vs healthy, plus pvalue/qvalue).
    """
    from pydeseq2.dds import DeseqDataSet
    from pydeseq2.ds import DeseqStats
    # pydeseq2 wants samples (rows) x genes (cols), with an aligned metadata frame.
    counts_t = counts.T.round().astype(int)
    metadata = pd.DataFrame({"condition": groups.reindex(counts_t.index).to_numpy()}, index=counts_t.index)
    dds = DeseqDataSet(counts=counts_t, metadata=metadata, design="~condition", quiet=True)
    dds.deseq2()
    stats = DeseqStats(dds, contrast=["condition", DISEASE_LABEL, HEALTHY_LABEL], quiet=True)
    stats.summary()
    res = stats.results_df.rename(columns={"log2FoldChange": "log_fc", "pvalue": "pvalue", "padj": "qvalue"})
    return res[["log_fc", "pvalue", "qvalue"]].dropna(subset=["qvalue"]).sort_values("qvalue")
 def compute_differential_expression(
    expression: pd.DataFrame,
    sample_groups: pd.Series,
    *,
    method: str,
 ) -> pd.DataFrame:
-    """Compute gene-level log fold change and adjusted p-values.
+    """Compute gene-level log fold change and adjusted p-values for one study.
    For RNA-seq use ``pydeseq2``; for microarray log2-transform/normalize and use a
    limma-equivalent (PLAN.md §6, Week 1 task 4).
    Args:
-        expression: Genes (rows) x samples (columns) expression matrix.
+        expression: Genes (rows) x samples (columns). Microarray intensities or RNA-seq counts.
-        sample_groups: Per-sample group label ('disease' / 'healthy'), indexed by sample.
+        sample_groups: Per-sample 'disease'/'healthy' label, indexed by sample id (column).
-        method: 'deseq2' (RNA-seq) or 'limma' (microarray).
+        method: 'welch' (microarray, PLAN choice) or 'deseq2' (RNA-seq).
    Returns:
-        A table indexed by gene with at least ``log_fc`` and ``qvalue`` columns.
+        Table indexed by gene/probe with ``log_fc``, ``pvalue``, ``qvalue`` (R1 sign convention).
    """
-    raise NotImplementedError("Differential expression: implement in Week 1 (notebook 02).")
+    groups = sample_groups.reindex(expression.columns)
    if groups.isna().any():
        missing = list(groups.index[groups.isna()])
        raise ValueError(f"sample_groups missing labels for samples: {missing[:5]}...")
    if method == "welch":
        return _welch_de(expression, groups)
    if method == "deseq2":
        return _deseq2_de(expression, groups)
    raise ValueError(f"Unknown method {method!r}; expected 'welch' or 'deseq2'.")
 def collapse_probes_to_symbols(
    de_table: pd.DataFrame,
    probe_to_symbol: pd.Series,
    expression_for_ranking: pd.DataFrame | None = None,
 ) -> pd.DataFrame:
    """Collapse a probe-level DE table to one row per HGNC symbol (R2).
    When multiple probes map to the same symbol, keep the probe with the highest mean
    expression (the standard collapseRows heuristic) if ``expression_for_ranking`` is given;
    otherwise keep the probe with the smallest qvalue.
    Args:
        de_table: DE table indexed by probe id (from ``compute_differential_expression``).
        probe_to_symbol: Probe id -> HGNC symbol mapping.
        expression_for_ranking: Optional genes x samples matrix to rank duplicate probes.
    Returns:
        DE table indexed by HGNC symbol.
    """
    df = de_table.copy()
    df["symbol"] = probe_to_symbol.reindex(df.index)
    df = df.dropna(subset=["symbol"])
    if expression_for_ranking is not None:
        rank_key = expression_for_ranking.reindex(df.index).mean(axis=1)
        df = df.assign(_rank=rank_key).sort_values("_rank", ascending=False)
    else:
        df = df.sort_values("qvalue", ascending=True)
    collapsed = df[~df["symbol"].duplicated(keep="first")].set_index("symbol")
    return collapsed.drop(columns=[c for c in ("_rank",) if c in collapsed.columns])
 def concordance_filter(
    de_a: pd.DataFrame,
    de_b: pd.DataFrame,
    *,
    qvalue_cutoff: float = QVALUE_CUTOFF,
 ) -> tuple[pd.DataFrame, ConcordanceSummary]:
    """Keep only genes concordant across two symbol-level DE tables (R3).
    A gene qualifies iff q<``qvalue_cutoff`` in BOTH studies AND log_fc has the same sign in
    both. Reported ``log_fc`` = mean of the two; ``qvalue`` = max of the two (worst case).
    Ranked by ``max(q_a, q_b)`` ascending.
    Returns:
        (concordant_table indexed by symbol with log_fc/qvalue, ConcordanceSummary).
    """
    merged = de_a.join(de_b, lsuffix="_a", rsuffix="_b", how="inner")
    n_tested = len(merged)
    sig_both = (merged["qvalue_a"] < qvalue_cutoff) & (merged["qvalue_b"] < qvalue_cutoff)
    same_sign = np.sign(merged["log_fc_a"]) == np.sign(merged["log_fc_b"])
    keep = merged[sig_both & same_sign].copy()
    keep["log_fc"] = (keep["log_fc_a"] + keep["log_fc_b"]) / 2.0
    keep["qvalue"] = keep[["qvalue_a", "qvalue_b"]].max(axis=1)
    keep = keep[["log_fc", "qvalue"]].sort_values("qvalue")
    summary = ConcordanceSummary(
        n_genes_tested=n_tested,
        n_concordant=len(keep),
        n_up=int((keep["log_fc"] > 0).sum()),
        n_down=int((keep["log_fc"] < 0).sum()),
    )
    return keep, summary
 def build_signature(
-    de_table: pd.DataFrame,
+    concordant_table: pd.DataFrame,
    provenance: SignatureProvenance,
    *,
    tier: ConfidenceTier,
    tier_rationale: str,
    limitations: list[str],
    id_map: pd.DataFrame | None = None,
    top_n: int = TOP_N_PER_DIRECTION,
    qvalue_cutoff: float = QVALUE_CUTOFF,
 ) -> DiseaseSignature:
-    """Assemble a ``DiseaseSignature`` from a differential expression table.
+    """Assemble a ``DiseaseSignature`` from the concordant gene table.
-    Takes the top ``top_n`` up- and down-regulated genes (by qvalue, cut at
+    Takes the top ``top_n`` up- and down-regulated genes by qvalue per direction (R3). If
-    ``qvalue_cutoff``) per PLAN.md §6, Week 1 task 5.
+    fewer than ``top_n`` qualify in a direction, takes all available (the caller should have
    logged the count). ``id_map`` optionally provides 'entrez_id'/'ensembl_id' columns indexed
    by symbol (from ``mygene``).
    Args:
        concordant_table: Output of ``concordance_filter`` (indexed by symbol).
        provenance: Fully populated ``SignatureProvenance`` (both studies + concordance).
        tier: Confidence tier (Tier A only if genuinely multi-source).
        tier_rationale: One-line justification of the tier.
        limitations: Honest limitations list (R8).
        id_map: Optional symbol -> {entrez_id, ensembl_id} table.
        top_n: Max genes per direction.
    Returns:
        A populated ``DiseaseSignature``.
    """
-    raise NotImplementedError("Signature assembly: implement in Week 1 (notebook 02).")
+
    def _entries(rows: pd.DataFrame) -> list[GeneEntry]:
        entries: list[GeneEntry] = []
        for symbol, row in rows.iterrows():
            ids = id_map.loc[symbol] if (id_map is not None and symbol in id_map.index) else None
            entries.append(
                GeneEntry(
                    gene=str(symbol),
                    entrez_id=(str(ids["entrez_id"]) if ids is not None and pd.notna(ids.get("entrez_id")) else None),
                    ensembl_id=(str(ids["ensembl_id"]) if ids is not None and pd.notna(ids.get("ensembl_id")) else None),
                    log_fc=float(row["log_fc"]),
                    qvalue=float(row["qvalue"]),
                )
            )
        return entries
    up = concordant_table[concordant_table["log_fc"] > 0].sort_values("qvalue").head(top_n)
    down = concordant_table[concordant_table["log_fc"] < 0].sort_values("qvalue").head(top_n)
    return DiseaseSignature(
        up_regulated=_entries(up),
        down_regulated=_entries(down),
        provenance=provenance,
        confidence_tier=tier,
        tier_rationale=tier_rationale,
        limitations=limitations,
    )
 def persist_signature(signature: DiseaseSignature, out_path: Path | None = None) -> Path:
--- a/src/scoring.py
+++ b/src/scoring.py
@@ -1,33 +1,65 @@
-"""CMap-style connectivity scoring — the matching engine.
+"""CMap-style connectivity scoring — the matching engine (Week 3, PLAN §6).
-Week 3 (PLAN.md §6). Scores each drug's LINCS signature against the disease signature using
+Scores each drug's LINCS consensus signature against the disease signature using the weighted
-weighted Kolmogorov-Smirnov enrichment (Lamb 2006 / Subramanian 2017). Strongly *negative*
+Kolmogorov-Smirnov / GSEA enrichment statistic (Lamb 2006; Subramanian 2017). The query is the
-connectivity = strong reversal of the disease signature = candidate match.
+disease up/down gene sets; the reference is each drug's 978 landmark genes ranked by z-score.
-Uses ``cmapPy`` as the reference implementation. ``tests/test_scoring.py`` verifies the
+Sign convention (PLAN §6): strongly **negative** connectivity = strong **reversal** of the
-implementation against a known reference.
+disease signature = candidate match. A drug that down-regulates the disease's up-genes and
 up-regulates its down-genes scores negative.
 """
 from __future__ import annotations
 from pathlib import Path
 import numpy as np
 import pandas as pd
 from pydantic import BaseModel
-from . import RESULTS_DIR
+from . import RANDOM_SEED, RESULTS_DIR
 # Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
 # Keys are pathway categories; values are substrings matched (case-insensitive) against a
 # drug's ChEMBL target names.
 SICKLE_PATHWAYS: dict[str, tuple[str, ...]] = {
    "hbf_epigenetic": ("histone deacetylase", "hdac", "methyltransferase", "dnmt",
                        "ribonucleoside-diphosphate reductase", "ribonucleotide reductase"),
    "hemoglobin": ("hemoglobin", "globin"),
    "no_signaling": ("nitric oxide", "guanylate cyclase", "phosphodiesterase 5", "pde5"),
    "inflammation": ("cyclooxygenase", "prostaglandin", "nf-kappa", "interleukin",
                     "leukotriene", "selectin", "tumor necrosis factor"),
    "oxidative_stress": ("glutathione", "superoxide", "nadph oxidase", "thioredoxin", "nrf2"),
 }
-class ConnectivityResult(BaseModel):
+def _enrichment_score(drug_profile: pd.Series, gene_set: set[str], weight: float = 1.0) -> float:
-    """Connectivity score for a single drug against the disease signature."""
+    """Weighted GSEA/KS enrichment score of ``gene_set`` in a drug's ranked profile.
-    chembl_id: str
+    The profile is ranked by z-score (most up-regulated first). Hits increment a running sum in
-    drug_name: str
+    proportion to ``|z|**weight``; misses decrement uniformly. ES is the max signed deviation
-    connectivity_score: float | None  # None when the drug has no LINCS signature.
+    from zero. ES>0 => set enriched among up-regulated genes; ES<0 => among down-regulated.
-    normalized_score: float | None = None
+    """
-    p_value: float | None = None
+    s = drug_profile.sort_values(ascending=False)
-    scored: bool  # False => no signature available, not scored (do not skip silently).
+    genes = s.index.to_numpy()
-    n_genes_overlap: int | None = None
+    vals = s.to_numpy(dtype=float)
    hit = np.fromiter((g in gene_set for g in genes), dtype=bool, count=len(genes))
    n_hit = int(hit.sum())
    n = len(genes)
    if n_hit == 0 or n_hit == n:
        return 0.0
    w = (np.abs(vals) ** weight) * hit
    sum_hit = w.sum()
    if sum_hit == 0:
        return 0.0
    inc = w / sum_hit
    dec = (~hit) / (n - n_hit)
    running = np.cumsum(inc - dec)
    hi, lo = running.max(), running.min()
    return float(hi if abs(hi) >= abs(lo) else lo)
 def connectivity_score(
@@ -35,46 +67,157 @@ def connectivity_score(
    down_genes: list[str],
    drug_signature: pd.Series,
 ) -> float:
-    """Weighted KS connectivity score for one drug vs the disease up/down gene sets.
+    """Weighted connectivity score (WTCS) for one drug vs the disease up/down sets.
-    Only the intersection of disease-signature genes and LINCS landmark genes is scored;
+    Only query genes present in the drug's profile index (the 978 landmarks) are used — callers
-    callers must record the overlap count (PLAN.md §6, Week 3 task 2).
+    should record the overlap count (PLAN §6 Week 3 task 2). Returns the WTCS: if the two
-
+    enrichment scores share a sign the result is 0 (ambiguous), else ``(ES_up - ES_down)/2``.
-    Args:
+    Negative => reversal => candidate.
        up_genes: Disease up-regulated gene identifiers.
        down_genes: Disease down-regulated gene identifiers.
        drug_signature: Drug's expression vector indexed by gene identifier.
    Returns:
        Connectivity score in roughly [-1, 1]; strongly negative = strong reversal.
    """
-    raise NotImplementedError("Connectivity scoring: implement in Week 3 (notebook 04).")
+    profile_genes = set(drug_signature.index)
    up = set(up_genes) & profile_genes
    down = set(down_genes) & profile_genes
    es_up = _enrichment_score(drug_signature, up)
    es_down = _enrichment_score(drug_signature, down)
    if np.sign(es_up) == np.sign(es_down):
        return 0.0
    return (es_up - es_down) / 2.0
 def normalize_scores(scores: pd.Series) -> pd.Series:
    """Signed normalization (NCS, Subramanian 2017): divide by the mean magnitude of same-sign
    scores, so positive and negative tails are separately scaled to a mean magnitude of 1."""
    out = scores.astype(float).copy()
    pos_mean = scores[scores > 0].mean()
    neg_mean = scores[scores < 0].abs().mean()
    if pos_mean and not np.isnan(pos_mean):
        out[scores > 0] = scores[scores > 0] / pos_mean
    if neg_mean and not np.isnan(neg_mean):
        out[scores < 0] = scores[scores < 0] / neg_mean
    return out
 def rank_drugs(
-    signature_up: list[str],
+    up_genes: list[str],
-    signature_down: list[str],
+    down_genes: list[str],
-    drug_profiles: pd.DataFrame,
+    signature_matrix: pd.DataFrame,
 ) -> pd.DataFrame:
-    """Score and rank all drugs against the disease signature.
+    """Score and rank all drugs (rows of ``signature_matrix``: drug x landmark-gene z-scores).
-    Drugs without a LINCS signature are marked ``scored=False`` and excluded from the ranking
+    Returns a table indexed by drug with ``rank`` (1 = strongest reversal = most negative),
-    rather than dropped silently (PLAN.md §6, Week 3 task 2).
+    ``connectivity_score`` and ``normalized_score``. Drugs are expected to all have signatures
-
+    here; signature-less drugs are handled (marked not-scored) by the orchestration layer per
-    Returns a ranked table with the columns described in PLAN.md §6 (rank, drug_name,
+    PLAN §6 Week 3 task 2.
    chembl_id, connectivity_score, normalized_score, p_value, inclusion_reason,
    known_targets, mechanism_summary).
    """
-    raise NotImplementedError("Drug ranking: implement in Week 3 (notebook 04).")
+    scores = pd.Series(
        {drug: connectivity_score(up_genes, down_genes, signature_matrix.loc[drug])
         for drug in signature_matrix.index},
        name="connectivity_score",
    )
    df = pd.DataFrame({"connectivity_score": scores, "normalized_score": normalize_scores(scores)})
    df = df.sort_values("connectivity_score")  # most negative (reversal) first
    df.insert(0, "rank", range(1, len(df) + 1))
    return df
 def mechanistic_prior(targets: list[str]) -> float:
-    """Prior weight for a drug based on sickle-cell-relevant target pathways.
+    """Count of sickle-cell-relevant pathway categories a drug's targets hit (PLAN §6 task 3).
-    Pathways of interest: HbF regulation, hemoglobin, NO signaling, inflammation, oxidative
+    Higher = more mechanistically plausible. Used to build the secondary, prior-weighted ranking
-    stress (PLAN.md §6, Week 3 task 3). Used to build the secondary, prior-weighted ranking.
+    alongside the raw connectivity ranking.
    """
-    raise NotImplementedError("Mechanistic prior: implement in Week 3 (notebook 04).")
+    if not targets:
        return 0.0
    text = " ; ".join(str(t) for t in targets).lower()
    return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))
 # --- KS connectivity + tau calibration (v1.1) -----------------------------------------------
 # Unweighted Kolmogorov-Smirnov connectivity (Lamb 2006) is O(k) per query (depends only on the
 # ranks of the query genes), which makes a permutation null over many random queries cheap. tau
 # expresses each drug's real connectivity as a signed percentile within its own null — so
 # broad-effect drugs that connect to *random* signatures too get down-weighted (specificity).
 def _ks_es(rank_matrix: np.ndarray, query_cols: np.ndarray, n_genes: int) -> np.ndarray:
    """Vectorized unweighted KS enrichment score of a query gene set for every drug.
    ``rank_matrix`` is (n_drugs, n_genes) of 1..N rank positions (1 = most up-regulated).
    Returns one ES per drug; ES>0 => query enriched among up-regulated genes.
    """
    k = len(query_cols)
    if k == 0:
        return np.zeros(rank_matrix.shape[0])
    p = np.sort(rank_matrix[:, query_cols], axis=1)  # (n_drugs, k), positions ascending
    j = np.arange(1, k + 1)
    a = (j / k - p / n_genes).max(axis=1)
    b = (p / n_genes - (j - 1) / k).max(axis=1)
    return np.where(a >= b, a, -b)
 def _ks_connectivity(rank_matrix: np.ndarray, up_cols: np.ndarray, down_cols: np.ndarray,
                     n_genes: int) -> np.ndarray:
    """KS connectivity per drug: (ES_up - ES_down)/2. Negative=reversal.
    Note: unlike WTCS, this does NOT zero same-sign (ambiguous) connections — same-sign ES
    partially cancel and land near 0 naturally. Hard-zeroing would collapse the random-query
    null to a spike at 0 and make tau saturate, so the continuous form is required for calibration.
    """
    es_up = _ks_es(rank_matrix, up_cols, n_genes)
    es_down = _ks_es(rank_matrix, down_cols, n_genes)
    return (es_up - es_down) / 2.0
 def tau_calibrate(
    up_genes: list[str],
    down_genes: list[str],
    signature_matrix: pd.DataFrame,
    n_null: int = 1000,
    seed: int = RANDOM_SEED,
 ) -> pd.DataFrame:
    """Rank drugs by tau: each drug's KS connectivity as a signed percentile vs a null of
    random queries of the same size (PLAN §6; CMap tau, Subramanian 2017).
    tau in [-100, 100]: -100 => reverses our signature more specifically than any random query
    (strong, specific candidate); ~0 => connects to our signature no more than to random ones
    (broad-effect / non-specific). Ranked by tau ascending (rank 1 = most specific reversal).
    """
    genes = list(signature_matrix.columns)
    gene_to_col = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    rank_matrix = signature_matrix.rank(axis=1, ascending=False).to_numpy()
    up_cols = np.array([gene_to_col[g] for g in set(up_genes) if g in gene_to_col], dtype=int)
    down_cols = np.array([gene_to_col[g] for g in set(down_genes) if g in gene_to_col], dtype=int)
    real = _ks_connectivity(rank_matrix, up_cols, down_cols, n)
    rng = np.random.default_rng(seed)
    k_up, k_down = len(up_cols), len(down_cols)
    null = np.empty((rank_matrix.shape[0], n_null))
    for m in range(n_null):
        samp = rng.choice(n, size=k_up + k_down, replace=False)
        null[:, m] = _ks_connectivity(rank_matrix, samp[:k_up], samp[k_up:], n)
    null_mean = null.mean(axis=1)
    null_std = null.std(axis=1)
    null_std[null_std == 0] = np.nan
    # Per-drug standardized connectivity: how many SDs the real reversal is below what random
    # queries of the same size produce against THIS drug. Continuous (no percentile floor), so it
    # discriminates within the strong-reversal tail. Negative = specific reversal.
    spec_z = (real - null_mean) / null_std
    frac = (null <= real[:, None]).mean(axis=1)
    tau = 100.0 * (2.0 * frac - 1.0)  # retained for reference; saturates at +/-100 in the tail
    df = pd.DataFrame(
        {"connectivity_ks": real, "null_mean": null_mean, "spec_z": spec_z, "tau": tau},
        index=signature_matrix.index,
    )
    df = df.sort_values("spec_z")  # most negative z = most specific reversal
    df.insert(0, "rank", range(1, len(df) + 1))
    return df
 def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
--- a/tests/test_disease.py
+++ b/tests/test_disease.py
@@ -0,0 +1,122 @@
 """Tests for the Week 1 signature-construction logic on synthetic data.
 These verify the DE / probe-collapse / concordance math without touching the network, so the
 pipeline is trustworthy before it is pointed at real GEO studies.
 """
 from __future__ import annotations
 import numpy as np
 import pandas as pd
 import pytest
 from src.disease import (
    DISEASE_LABEL,
    HEALTHY_LABEL,
    ConcordanceSummary,
    SignatureProvenance,
    StudyProvenance,
    build_signature,
    collapse_probes_to_symbols,
    compute_differential_expression,
    concordance_filter,
 )
 from src.provenance import ConfidenceTier
 def _synthetic_study(seed: int, n_per_group: int = 12) -> tuple[pd.DataFrame, pd.Series]:
    """Genes x samples matrix where UP is higher and DOWN is lower in disease."""
    rng = np.random.default_rng(seed)
    genes = ["UP1", "UP2", "DOWN1", "DOWN2", "NOISE1", "NOISE2"]
    samples = [f"d{i}" for i in range(n_per_group)] + [f"h{i}" for i in range(n_per_group)]
    groups = pd.Series([DISEASE_LABEL] * n_per_group + [HEALTHY_LABEL] * n_per_group, index=samples)
    base = rng.normal(8.0, 0.3, size=(len(genes), len(samples)))
    df = pd.DataFrame(base, index=genes, columns=samples)
    disease = groups == DISEASE_LABEL
    df.loc["UP1", disease] += 3.0
    df.loc["UP2", disease] += 2.0
    df.loc["DOWN1", disease] -= 3.0
    df.loc["DOWN2", disease] -= 2.0
    return df, groups
 def test_welch_de_recovers_direction_and_significance():
    expr, groups = _synthetic_study(seed=1)
    de = compute_differential_expression(expr, groups, method="welch")
    assert de.loc["UP1", "log_fc"] > 0 and de.loc["UP1", "qvalue"] < 0.05
    assert de.loc["DOWN1", "log_fc"] < 0 and de.loc["DOWN1", "qvalue"] < 0.05
    # Pure-noise genes should not be significant.
    assert de.loc["NOISE1", "qvalue"] > 0.05
 def test_compute_de_rejects_unlabelled_samples():
    expr, groups = _synthetic_study(seed=2)
    with pytest.raises(ValueError):
        compute_differential_expression(expr, groups.iloc[:-3], method="welch")
 def test_collapse_probes_keeps_highest_mean_expression():
    de = pd.DataFrame(
        {"log_fc": [1.0, 2.0, -1.0], "pvalue": [0.1, 0.2, 0.3], "qvalue": [0.1, 0.2, 0.3]},
        index=["probeA1", "probeA2", "probeB1"],
    )
    probe_to_symbol = pd.Series({"probeA1": "GENEA", "probeA2": "GENEA", "probeB1": "GENEB"})
    expr = pd.DataFrame(
        {"s1": [1.0, 100.0, 5.0], "s2": [1.0, 100.0, 5.0]},
        index=["probeA1", "probeA2", "probeB1"],
    )
    collapsed = collapse_probes_to_symbols(de, probe_to_symbol, expression_for_ranking=expr)
    assert set(collapsed.index) == {"GENEA", "GENEB"}
    # probeA2 has the higher mean expression, so its log_fc (2.0) should win.
    assert collapsed.loc["GENEA", "log_fc"] == 2.0
 def test_concordance_filter_keeps_only_agreeing_genes():
    de_a = pd.DataFrame(
        {"log_fc": [2.0, -2.0, 1.5, 0.1], "qvalue": [0.001, 0.001, 0.2, 0.001]},
        index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
    )
    de_b = pd.DataFrame(
        {"log_fc": [1.8, -2.2, 1.4, -0.1], "qvalue": [0.002, 0.002, 0.2, 0.002]},
        index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
    )
    keep, summary = concordance_filter(de_a, de_b)
    assert set(keep.index) == {"UP1", "DOWN1"}  # WEAK fails q-cut; DISAGREE flips sign
    assert keep.loc["UP1", "log_fc"] == pytest.approx(1.9)  # mean of the two
    assert keep.loc["UP1", "qvalue"] == pytest.approx(0.002)  # max of the two
    assert isinstance(summary, ConcordanceSummary)
    assert summary.n_genes_tested == 4 and summary.n_concordant == 2
    assert summary.n_up == 1 and summary.n_down == 1
 def test_build_signature_splits_directions_and_respects_top_n():
    concordant = pd.DataFrame(
        {
            "log_fc": [3.0, 2.0, 1.0, -1.0, -2.0],
            "qvalue": [0.001, 0.002, 0.003, 0.004, 0.005],
        },
        index=["UP1", "UP2", "UP3", "DOWN1", "DOWN2"],
    )
    prov = SignatureProvenance(
        studies=[
            StudyProvenance(geo_accession="GSE1", n_disease=12, n_healthy=12,
                            platform="P", tissue="whole blood", method="welch"),
            StudyProvenance(geo_accession="GSE2", n_disease=15, n_healthy=11,
                            platform="P", tissue="whole blood", method="welch"),
        ],
        concordance=ConcordanceSummary(n_genes_tested=100, n_concordant=5, n_up=3, n_down=2),
        created_date="2026-06-23",
    )
    sig = build_signature(
        concordant, prov, tier=ConfidenceTier.A,
        tier_rationale="Two-study concordance", limitations=["cell-composition confound"],
        top_n=2,
    )
    assert [g.gene for g in sig.up_regulated] == ["UP1", "UP2"]  # top 2 up by qvalue
    assert [g.gene for g in sig.down_regulated] == ["DOWN1", "DOWN2"]
    assert sig.confidence_tier == ConfidenceTier.A
--- a/tests/test_scoring.py
+++ b/tests/test_scoring.py
@@ -1,14 +1,13 @@
 """Tests for the matching engine and provenance logic.
-The headline test (PLAN.md §6, Week 3 task 4) verifies connectivity scoring against a known
+Connectivity tests (PLAN.md §6, Week 3 task 4) pin the weighted-KS scorer against hand-built
-reference within tolerance; it is marked xfail until the scorer is implemented in Week 3.
+reference profiles. The tier-assignment tests pin the rules from PLAN.md §3 so the most
 The tier-assignment tests run today — they pin the rules from PLAN.md §3 so the most
 commercially important design decision can't silently drift.
 """
 from __future__ import annotations
 import pandas as pd
 import pytest
 from src.provenance import ConfidenceTier, assign_tier
@@ -52,14 +51,103 @@ class TestAssignTier:
        assert assign_tier(**kwargs) == ConfidenceTier.B
-@pytest.mark.xfail(reason="Connectivity scoring implemented in Week 3 (notebook 04).", strict=True)
+class TestConnectivityScore:
-def test_connectivity_score_matches_reference():
+    """Reference checks for the weighted-KS connectivity score (PLAN §6 Week 3 task 4).
    """Verify connectivity scoring against a CMap/cmapPy reference within tolerance.
-    PLAN.md §6, Week 3 task 4. Replace this body with a known reference example
+    Query: up = {U1, U2}, down = {D1, D2}. We build drug profiles with a known relationship to
-    (disease up/down sets + drug signature -> expected score) once the scorer exists.
+    the query and assert the sign/ordering the CMap convention requires.
    """
    from src.scoring import connectivity_score
-    score = connectivity_score(up_genes=[], down_genes=[], drug_signature=None)  # noqa
+    UP = ["U1", "U2"]
-    assert score == pytest.approx(0.0, abs=1e-6)
+    DOWN = ["D1", "D2"]
    @staticmethod
    def _profile(values: dict[str, float]) -> pd.Series:
        # 20 filler genes at ~0 so the query genes sit clearly at the extremes.
        base = {f"N{i}": 0.01 * ((i % 5) - 2) for i in range(20)}
        base.update(values)
        return pd.Series(base)
    def test_perfect_reversal_is_strongly_negative(self):
        from src.scoring import connectivity_score
        # Drug pushes disease-up genes DOWN (very negative) and disease-down genes UP (very
        # positive) => reversal => negative connectivity.
        prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
        assert connectivity_score(self.UP, self.DOWN, prof) < -0.4
    def test_perfect_mimic_is_strongly_positive(self):
        from src.scoring import connectivity_score
        prof = self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7})
        assert connectivity_score(self.UP, self.DOWN, prof) > 0.4
    def test_reversal_beats_mimic_and_null(self):
        from src.scoring import connectivity_score
        rev = connectivity_score(self.UP, self.DOWN, self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7}))
        mimic = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7}))
        null = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 0.2, "U2": -0.1, "D1": 0.1, "D2": -0.2}))
        assert rev < null < mimic
        assert abs(null) < abs(rev)
    def test_same_sign_enrichment_returns_zero(self):
        from src.scoring import connectivity_score
        # Both up- and down-sets at the top => same-sign ES => ambiguous => 0 (WTCS rule).
        prof = self._profile({"U1": 8, "U2": 7, "D1": 6, "D2": 5})
        assert connectivity_score(self.UP, self.DOWN, prof) == 0.0
    def test_genes_absent_from_profile_are_ignored(self):
        from src.scoring import connectivity_score
        prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
        # Adding a query gene not in the profile must not change the score.
        s1 = connectivity_score(self.UP, self.DOWN, prof)
        s2 = connectivity_score(self.UP + ["NOT_IN_PROFILE"], self.DOWN, prof)
        assert s1 == pytest.approx(s2)
 class TestMechanisticPrior:
    def test_counts_distinct_sickle_pathways(self):
        from src.scoring import mechanistic_prior
        # ribonucleotide reductase (hydroxyurea) -> hbf_epigenetic category.
        assert mechanistic_prior(["Ribonucleoside-diphosphate reductase RR1"]) == 1.0
        # DNMT (epigenetic) + hemoglobin -> two categories.
        assert mechanistic_prior(["DNA (cytosine-5)-methyltransferase 1", "Hemoglobin subunit beta"]) == 2.0
        assert mechanistic_prior([]) == 0.0
        assert mechanistic_prior(["Some unrelated kinase"]) == 0.0
 class TestTauCalibration:
    """tau should reward a SPECIFIC reverser and give a near-zero score to a noise drug."""
    @staticmethod
    def _matrix() -> pd.DataFrame:
        genes = [f"U{i}" for i in range(5)] + [f"D{i}" for i in range(5)] + [f"G{i}" for i in range(40)]
        rng_vals = {g: 0.01 * ((hash(g) % 7) - 3) for g in genes}  # tiny deterministic noise
        # specific reverser: query-up genes at the bottom, query-down at the top, rest ~0
        specific = dict(rng_vals)
        for i in range(5):
            specific[f"U{i}"] = -8 - i
            specific[f"D{i}"] = 8 + i
        noise = dict(rng_vals)
        return pd.DataFrame([specific, noise], index=["specific", "noise"])[genes]
    def test_specific_reverser_has_strongly_negative_tau(self):
        from src.scoring import tau_calibrate
        up = [f"U{i}" for i in range(5)]
        down = [f"D{i}" for i in range(5)]
        out = tau_calibrate(up, down, self._matrix(), n_null=300, seed=0)
        # Ranked by spec_z (continuous); the specific reverser is the most negative.
        assert out.loc["specific", "spec_z"] < -2
        assert out.loc["specific", "spec_z"] < out.loc["noise", "spec_z"]
        assert out.loc["specific", "tau"] < -50  # tau also flags it (may saturate near -100)
        assert out.loc["specific", "rank"] == 1
 def test_rank_drugs_orders_by_reversal():
    from src.scoring import rank_drugs
    genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]
    base = {g: 0.0 for g in genes}
    reverser = {**base, "U1": -8, "U2": -7, "D1": 8, "D2": 7}
    mimic = {**base, "U1": 8, "U2": 7, "D1": -8, "D2": -7}
    matrix = pd.DataFrame([reverser, mimic], index=["reverser", "mimic"])
    ranked = rank_drugs(["U1", "U2"], ["D1", "D2"], matrix)
    assert ranked.loc["reverser", "rank"] == 1
    assert ranked.loc["reverser", "connectivity_score"] < ranked.loc["mimic", "connectivity_score"]