Structure-binding track: scaffold + ligand-retrieval baseline

Start the structure-based binding branch (PLAN §12), baseline-first. - src/binding.py: validated RDKit ligand retrieval (morgan_fp, tanimoto, retrieve_nearest = the §12.9 engine) + dock() stub documenting the blocked ARM-Mac toolchain - scripts/binding_ligand_baseline.py: 300 drugs vs known binders - docs/structure_binding_notes.md: status, toolchain blocker, next steps - pyproject: [structure] extra (rdkit); data/raw/structures/ for PDBs Step-0 finding: retrieval engine VALIDATED on in-set classes (decitabine->azacitidine 0.62; vorinostat->scriptaid/belinostat) but the distinctive binders voxelotor/mitapivat have no analog in our 300-drug set (Tanimoto ~0.2). Needs (a) bigger library, (b) real docking (§12.3), which is blocked on the ARM-Mac docking toolchain (§12.6 pitfall 4). Structures 5E83 (Hb+voxelotor) and 8XFD (PKR+mitapivat) fetched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PLAN §12.9: leave door open for generative-guided retrieval
2026-06-23 23:53:27 +02:00 · 2026-06-23 23:43:25 +02:00 · 2026-06-23 23:40:18 +02:00 · 2026-06-23 23:31:32 +02:00 · 2026-06-23 23:19:26 +02:00 · 2026-06-23 23:05:08 +02:00
25 changed files with 2487 additions and 124 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -426,3 +426,143 @@ This MVP exists in a broader strategic context that was built through ~7 expert
 - **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.

 The MVP's job is to produce one credible result. Everything else follows from that.
+
+---
+
+## 12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
+
+> **Status: scoped, not committed.** This is a follow-on track proposed *after* the MVP and its
+> follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what
+> those experiments empirically showed. Read §9–11 and the experiment commits first.
+
+### 12.1 Why pivot modality (the evidence, not a hunch)
+
+The expression-connectivity approach was built, validated, and pushed hard (gene-space
+expansion, cell-composition deconvolution, reference-library tau, supervised learning). The
+empirical verdict:
+
+- Connectivity **recovers hydroxyurea** (top ~6–8%) but **cannot achieve specificity** —
+  unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Unfixed by four
+  independent methods.
+- A supervised model on indication labels hit **0.925 CV AUC** — but it was a **degree-bias
+  mirage**: it learned drug popularity, not disease matching (it ranked hydroxyurea *231/300*).
+- The decisive test: with drug-popularity features removed, the model trained on the actual
+  drug↔disease connectivity scored **AUC 0.491 — pure chance**. **The expression-connectivity
+  modality contains essentially no disease-specific therapeutic signal for this task.**
+
+This is a *signal* problem, not a *model* problem — no amount of model sophistication (diffusion,
+GNNs, etc.) extracts signal that isn't in the data. The response is to **change modality** to one
+with a strong, physical, drug-specific signal: **does a molecule bind a sickle-relevant target?**
+A drug that binds HbS is mechanistically specific by construction — the opposite of a coincidental
+expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is
+itself a diffusion model) actually has traction, because structure has physical ground truth.
+
+### 12.2 Targets (sickle-specific, druggable, structurally characterised)
+
+Small molecules only (§2). Curated shortlist with public structures and, crucially, **known
+small-molecule binders to serve as positive controls**:
+
+| Target | Mechanism in sickle | Known binder (positive control) |
+|---|---|---|
+| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | **voxelotor** (binds α-Val1) |
+| PKR (PKLR, red-cell pyruvate kinase) | Activator → ↓2,3-BPG → ↑O2 affinity | **mitapivat**, etavopivat |
+| DNMT1 | HbF induction (de-repress γ-globin) | **decitabine**, azacitidine |
+| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
+| HDAC1/2 | HbF induction | vorinostat, panobinostat |
+| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
+| PDE9 | ↑cGMP, anti-adhesion | PF-04447943 (sickle trial) |
+
+Hard/excluded for v1: **BCL11A** (transcription factor, no classic pocket — the γ-globin master
+repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.
+
+### 12.3 Method (baseline → generative co-folding)
+
+1. **Prepare structures.** Pull target structures from the PDB; AF3/Boltz-predict any missing.
+2. **Prepare ligands.** Reuse the existing ~300-drug set (we already have canonical SMILES from
+   ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
+3. **Dock + score**, in increasing sophistication:
+   - **Baseline:** classical docking (AutoDock Vina / smina) — fast, CPU, well-understood.
+   - **Generative co-folding:** an open AlphaFold3-class model — **Boltz-2** (predicts the
+     protein–ligand complex *and* a binding-affinity estimate, MIT-licensed), **Chai-1**, or
+     **DiffDock** (a diffusion model for docking — the legitimate home for the "diffusion"
+     instinct). These predict the bound pose directly and tend to beat classical docking.
+   - Report both; the baseline keeps us honest about whether the ML model adds anything.
+
+### 12.4 Validation (a real recovery test, like §6 Week 4)
+
+Pre-register before scoring: **the known structure-based sickle drugs must rank as top binders to
+their targets** — voxelotor→hemoglobin, mitapivat→PKR, decitabine→DNMT1. Negative controls
+(unrelated drugs) must *not* bind these pockets. This is a cleaner recovery test than the
+expression one, because binding is mechanistically specific — it should not have the
+coincidental-reverser problem that sank the connectivity approach.
+
+### 12.5 The real prize — integrate, don't replace
+
+The long-term value is **both modalities together**: a candidate that *reverses the disease
+signature* (expression) **and** *binds a sickle-relevant target* (structure) is far more credible
+than either alone. Structure supplies the specificity the expression layer lacks; expression
+supplies the systems-level, target-agnostic view structure lacks. The platform thesis (§11) —
+two databases + a matching engine — extends naturally to a third (structures) feeding the same
+confidence-tiered data layer.
+
+### 12.6 Honest pitfalls (do not ignore)
+
+1. **Binding ≠ efficacy.** A molecule can bind and do nothing therapeutic. Structure-based hits
+   are still hypotheses (cf. §9.7).
+2. **Only covers the enzyme/pocket subset.** Sickle's biggest lever (γ-globin reactivation via
+   BCL11A) is largely transcriptional and not small-molecule-tractable — structure-based screening
+   is blind to it. Be explicit about coverage.
+3. **Docking/affinity accuracy is limited.** Pose prediction is decent; absolute affinity is hard.
+   Validate on known binders before trusting novel scores.
+4. **Compute.** AF3-class models are GPU-heavy; the local Mac Studio (§2) may not suffice — this
+   track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
+5. **Moat.** Structures and tools are public; the proprietary value is the curated target list,
+   the integration with the expression layer, and provenance/tiering — not the docker.
+
+### 12.7 Explicitly NOT in this track
+
+Free energy perturbation / MD-based affinity; covalent docking; **de novo generation of molecules
+as final candidates to synthesise** (design, not repurposing — but see §12.9 for the
+generate-then-retrieve hybrid, which *is* repurposing); BCL11A or any non-pocket target;
+biologics; combination binding.
+
+### 12.8 Open decisions before committing
+
+- **Tooling:** classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend:
+  baseline first, for an honest reference — the lesson of the whole expression arc.)
+- **Compute:** secure a GPU environment (the "all local" §2 assumption breaks here).
+- **Scope of v1:** the 7-target shortlist above, or start with just Hb + PKR (the two with the
+  cleanest positive controls) to de-risk the harness before scaling targets.
+
+### 12.9 Door left open — generative-guided retrieval (generate → match existing)
+
+A legitimate way to bring generative models *into the repurposing frame* (vs the design frame
+excluded in §12.7): don't generate molecules to synthesise — generate them as a **search beacon**.
+
+**The idea.** Use a pocket-conditioned generative model (target-conditioned diffusion — e.g.
+TargetDiff, DiffSBDD, Pocket2Mol) to propose idealised binders for a sickle target. Then retrieve
+the **nearest existing drugs** to those proposals by chemical similarity (Tanimoto over Morgan
+fingerprints, or a learned molecular embedding). The retrieved neighbours — real, already-approved
+or clinical molecules — are the repurposing candidates. The generated molecule is never made; it
+only *defines a region of chemical space* worth searching. This is the user-proposed framing and
+it is sound: "generate the ideal, then find what we already have nearby."
+
+**Why it could add value.** It can point at scaffolds / regions a known-binder-seeded or
+brute-force docking sweep would miss, and it makes the target's binding requirements explicit as
+geometry rather than as a single reference ligand.
+
+**Honest caveats (why it's a *door*, not a commitment).**
+1. **Generated molecules are often synthetically unrealistic / invalid** — which is exactly why
+   they must be used only as beacons, never as candidates.
+2. **Similarity ≠ activity.** Activity cliffs mean a near-neighbour of a good binder can be inert.
+   So retrieved neighbours do **not** bypass validation — they must still be docked/scored (§12.3)
+   and clear the binding recovery test (§12.4). The generative step *proposes*; it does not *prove*.
+3. **Marginal-value question.** Directly docking the whole existing library (§12.3) already covers
+   chemical space. Whether generate-then-retrieve beats that — by efficiency or by surfacing
+   non-obvious scaffolds — is an open empirical question and needs a head-to-head before it earns
+   real investment.
+4. **Only as good as the pocket conditioning** of the generator, and the chemistry of the target.
+
+**Status:** explore only *after* the §12.3–12.4 docking harness works and is validated on the
+known binders — same discipline as everywhere else: prove the baseline, then test whether the
+fancier method adds anything.
--- a/data/raw/structures/.gitkeep
+++ b/data/raw/structures/.gitkeep
--- a/docs/data_sources.md
+++ b/docs/data_sources.md
@@ -10,17 +10,63 @@
 | MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
 | Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
 | OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
-| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data | TBD | TBD |
-| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
-| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
+| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
+| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
+| ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
+| LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
+| LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
 | ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
 | FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
 | Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |

-## Chosen GEO dataset
+## Chosen GEO datasets (disease signature, Tier A via 2-study concordance)

-_Document the chosen study fully: accession, platform, n per group, publication, why it was
-selected over the alternatives (GSE53441, GSE35007, …)._
+The signature is the cross-study concordance of two independent whole-blood studies (genes
+significant at q<0.05 in **both** with the same direction). Whole-blood tissue was required so
+concordance is meaningful; the two differ by platform and population, which strengthens
+robustness.
+
+| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
+|---|---|---|---|---|---|
+| **GSE35007** | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
+| **GSE16728** | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |
+
+- DE method: per-gene Welch t-test + Benjamini–Hochberg (microarray, pure Python).
+- Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
+- Result: 16,208 genes tested in both → **671 concordant** (444 up / 227 down). Signature =
+  top 250 up + all 227 down by worst-case q-value.
+- **Rejected candidates:** GSE53441 (PBMC — tissue mismatch with the whole-blood anchor);
+  GSE84633/GSE84634 (PBMC, no healthy controls).
+- **Tier caveat:** GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict
+  n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.
+
+Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
+`scripts/week1_finalize.py` (mygene mapping + persist).
+
+## Drug profiles (Week 2)
+
+300-drug set (`drug_set_v1.csv`), composed and restricted to LINCS-scorable compounds:
+
+| Inclusion reason | n | Notes |
+|---|---|---|
+| ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
+| related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
+| negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
+| general_sample | 240 | random from LINCS catalog, seed=42 |
+
+- **LINCS signatures:** per-drug consensus = mean of Level-5 MODZ z-scores across the drug's
+  sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH
+  phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
+- **ChEMBL:** matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds
+  38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
+- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
+  not-single-source rule).
+- **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978
+  landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger
+  space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from
+  LINCS entirely and remain unscoreable.
+- Reproduce: `week2_curate_drugset.py` → `week2_chembl.py` → download Level-5 GCTX →
+  `week2_lincs_extract.py` → `week2_assemble.py`.

 ## Licensing note for LINCS

--- a/docs/known_limitations.md
+++ b/docs/known_limitations.md
@@ -12,9 +12,11 @@ Source: PLAN.md §9.
   cell lines (MCF7, A375, PC3, …). Signatures for non-oncology diseases may be noisy. A
   field-wide limitation, not unique to Reverso.

-3. **L-glutamine probably has no LINCS signature.** Amino acids and metabolites weren't LINCS
-   priorities. If true, the ground-truth test effectively rests on hydroxyurea alone, which is
-   weaker. _Status: TBD — record the actual finding here once LINCS is pulled (Week 2)._
+3. **L-glutamine LINCS coverage — RESOLVED, opposite of expected.** L-glutamine DOES have a
+   Phase I signature (hydroxyurea is Phase-II-only) — both ground-truth drugs are scorable. But
+   L-glutamine's connectivity is **ambiguous (WTCS=0)**: its up- and down-set enrichments share
+   a sign, so it shows no reversal. It ranks 100/300. So the ground-truth test effectively rests
+   on hydroxyurea, which itself only reaches top 13% (raw) — see the recovery test report.

 4. **Connectivity scoring surfaces broad-effect drugs as false positives.** HDAC inhibitors and
   broad kinase inhibitors often top connectivity rankings simply because they perturb many
@@ -32,8 +34,28 @@ Source: PLAN.md §9.
 7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
   to test, not discoveries. Use careful language in any write-up.

-## Drug-specific gaps (fill in during Week 2–3)
+8. **Gene-space bottleneck (v1 → fixed in v1.1).** v1 scored on only the 978 landmark genes (12%
+   signature overlap) — the main driver of the v1 failure. v1.1 uses the full 12,328-gene space
+   (85% overlap) and recovers hydroxyurea. HBG1/HBG2 remain absent from LINCS entirely.

-| Drug | Issue | Handling |
+9. **No reference-signature library for tau.** Textbook CMap tau saturated at ±100 (a coherent
+   query always out-connects random gene sets). v1.1 substitutes a per-drug specificity z-score.
+   Proper tau needs a library of real reference signatures — a v2 / curated-data item.
+
+10. **Negative-control criterion may be invalid for connectivity scoring.** Unrelated drugs
+    (norethindrone, ciprofloxacin) rank as top specific reversers — connectivity measures
+    expression reversal, not therapeutic relatedness.
+
+## Recovery test outcome
+
+Pre-registered test (**v1, confirmatory**): **FAILED** all three criteria (hydroxyurea rank
+40/top 13%; L-glutamine rank 100; 1/5 negative controls bottom-half). Post-hoc (**v1.1,
+exploratory**): hydroxyurea recovers to rank 18 (top 6%, passes), but L-glutamine (rank 213, does
+not reverse) and negative controls (2/5) still fail → overall still FAIL. See
+`recovery_test_report.md`.
+
+| Drug | Issue | v1.1 status |
 |---|---|---|
-| TBD | e.g. no LINCS signature | flagged "not scored, no signature available" |
+| hydroxyurea | needed the full gene space | rank 18 (top 6%) — recovered post-hoc |
+| L-glutamine | metabolite, no reversal signal (positive connectivity) | rank 213 — genuine negative |
+| neg controls | reverse the generic inflammation signature | 2/5 bottom-half — criterion questionable |
--- a/docs/recovery_test_report.md
+++ b/docs/recovery_test_report.md
@@ -1,13 +1,10 @@
 # Sickle Cell Repurposing — Recovery Test Report

-> **Status: DRAFT SCAFFOLD — not yet run.** Filled in during Week 4 from
-> `notebooks/05_recovery_test.ipynb`. Target length: ~2 pages, readable by a sceptical
-> pharma scientist in 5 minutes.
+> **Status: COMPLETE (v1 confirmatory + v1.1 exploratory).** Reproduce with `scripts/week1_*` →
+> `week2_*` → `week3_scoring.py` → `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.

 ## Pre-registered success criteria

-> ⚠️ **Commit this section to git _before_ running the recovery test** (PLAN.md §8, §10).
-
 The MVP passes if:

 - Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND**
@@ -15,54 +12,116 @@ The MVP passes if:
  missing LINCS signature, **AND**
 - At least **4 of 5** negative-control drugs rank in the **bottom half**.

-_Pre-registered on: TBD (date of commit)_
+_Pre-registered in the scaffold commit (`b731478`) before any scoring. **Primary (confirmatory)
+analysis = v1**: 978 landmark genes, weighted connectivity score (WTCS). The 5 negative controls
+were pre-specified by category rule without inspecting ranks._

 ---

 ## Section 1 — Methodology

-_5–6 sentences: what was built, the GEO dataset used, the drug-set composition, and the
-scoring method (CMap connectivity, Lamb 2006 / Subramanian 2017)._
+A sickle cell disease signature was built from **two whole-blood microarray studies** (GSE35007
+Illumina SS-vs-AA; GSE16728 Affymetrix patient-vs-control), keeping the **671 genes concordant**
+across both (q<0.05, same direction) → a cross-platform Tier-A signature (250 up / 227 down).
+Profiles were built for **300 small molecules** (2 ground-truth; 32 related-mechanism; 26
+negative controls; 240 random), each with a **LINCS L1000** consensus signature (mean Level-5
+MODZ across cell lines, both CMap phases). Drugs were ranked by **CMap connectivity scoring**
+(Kolmogorov-Smirnov, Lamb 2006 / Subramanian 2017): negative = reversal = candidate.
+
+**v1 (pre-registered/confirmatory):** scored on the 978 landmark genes with WTCS.
+**v1.1 (post-hoc/exploratory):** after v1 failed, two changes were made to diagnose why — (a)
+score on the full **12,328-gene** space (landmark overlap 12% → 85%, bringing the erythroid
+markers in); (b) add a **per-drug specificity z-score** (`spec_z`): how many SDs the real
+connectivity is below a null of 1,000 random queries of the same size against that drug. Because
+these changes followed inspection of the v1 result, **v1.1 is exploratory, not a confirmatory
+test of the pre-registered hypothesis.**

 ## Section 2 — Recovery test result

-| Drug | Rank | Percentile | Pass? |
-|---|---|---|---|
-| Hydroxyurea | TBD | TBD | TBD |
-| L-glutamine | TBD | TBD | TBD |
-
-Negative controls (expected: bottom half):
-
-| Control drug | Rank | Bottom half? |
+| Criterion | v1 (confirmatory) | v1.1 (exploratory) |
 |---|---|---|
-| TBD | TBD | TBD |
+| Hydroxyurea top-10% (≤30) | rank **40** (13.3%) ❌ | rank **18** (6.0%) ✅ |
+| L-glutamine top-25% (≤75) | rank 100, WTCS=0 ❌ | rank 213, spec_z=+0.98 ❌ |
+| ≥4/5 neg controls bottom-half | 1/5 ❌ | 2/5 ❌ |
+| **Overall** | **FAIL** | **FAIL** (but hydroxyurea recovered) |

-**Overall: PASS / FAIL against pre-registered criteria — TBD**
+v1.1 negative controls: clotrimazole 258 ✅, astemizole 211 ✅, azithromycin 142 ❌,
+ethinyl-estradiol 114 ❌, caffeine 77 ❌.

-## Section 3 — Top 10 candidates
+**Honest reading.** The **pre-registered test FAILED (v1).** The post-hoc v1.1 changes
+**recover hydroxyurea** (rank 40 → 18, passing top-10%) — strong evidence that the v1 failure was
+driven by the 978-landmark bottleneck, not the algorithm. But two failures survive into v1.1, and
+both are now precisely diagnosed:

-| Rank | Drug | Score | Known mechanism | Biological plausibility |
+1. **L-glutamine does not reverse the signature** (positive connectivity, spec_z=+0.98). This is
+   intrinsic to its LINCS data — a metabolite with no reversal signal — not a coverage gap. More
+   genes cannot fix it.
+2. **The negative-control criterion is arguably invalid for connectivity scoring.** Two
+   "negative controls" (norethindrone, ciprofloxacin) rank in the top 3 by spec_z. Connectivity
+   measures *expression reversal*, not *therapeutic relatedness* — an antibiotic or contraceptive
+   can still down-regulate the inflammation genes that dominate the scorable signature. The test
+   design conflates the two.
+
+A note on the calibration: textbook CMap **tau** (percentile vs a reference population) was
+implemented but **saturated at ±100** here, because a coherent real query always out-connects
+random gene sets — proper tau needs a library of *real* reference signatures, which this MVP
+lacks. The continuous `spec_z` is the workable substitute.
+
+## Section 3 — Top 10 candidates (v1.1 spec_z)
+
+| Rank | Drug | spec_z | Inclusion | Read |
 |---|---|---|---|---|
-| 1 | TBD | TBD | TBD | TBD |
+| 1 | reserpic-acid | −3.80 | random | reserpine metabolite; non-obvious |
+| 2 | norethindrone | −3.78 | **negative control** | false positive (see §2) |
+| 3 | ciprofloxacin | −3.61 | **negative control** | false positive |
+| 4 | resveratrol | −3.46 | related-mechanism | antioxidant studied in SCD — coherent |
+| 5 | BRD-K57490754 | −3.37 | random | tool compound |
+| 6 | anastrozole | −3.27 | random | aromatase inhibitor |
+| 7–10 | BRD-* / palmitoylethanolamide | ~−3.1 | random | mostly tool compounds |

-_Note: HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due
-to widespread expression effects — flag these honestly (PLAN.md §9.4)._
+That two negative controls outrank hydroxyurea is the single most informative result here — see §4.

-## Section 4 — One non-obvious candidate worth investigating
+## Section 4 — One non-obvious result worth investigating

-_A single paragraph on the most interesting result. Language must be careful: this is a
-computational hypothesis to test, not a discovery (PLAN.md §9.7)._
+The most useful finding is **not** a candidate drug but the **negative-control failure**:
+unrelated drugs (norethindrone, ciprofloxacin) score as strong specific reversers. This is a
+real, generalizable lesson — for a signature whose *scorable* portion is generic
+inflammation/metabolic genes, connectivity rewards any broad transcriptional perturbation that
+touches those genes. The honest implication: **this signature is not specific enough to
+discriminate true repurposing candidates from incidental expression reversers.** Of the
+plausibly-real hits, **resveratrol (rank 4)** — an antioxidant with prior sickle cell literature
+— is the most defensible, but it is a hypothesis, not a discovery.

 ## Section 5 — Honest limitations

- Cell-composition confound in whole-blood expression (PLAN.md §9.1)
- LINCS L1000 cell-line limitations — landmark genes measured mostly in cancer lines (§9.2)
- Missing signatures (e.g. L-glutamine) (§9.3)
- No mechanistic validation layer — discovery hypothesis generation, not validated prediction (§9.6)
+1. **Pre-registered test failed; the pass is post-hoc.** v1.1's hydroxyurea recovery is
+   exploratory and must be re-validated on a held-out disease before any claim is made.
+2. **Missing HbF axis** — HBG1/HBG2 are absent from LINCS entirely (not just landmarks), so the
+   pathway hydroxyurea acts on can never be scored by this method.
+3. **Signature specificity** — scorable genes are inflammation/metabolic; negative controls
+   reverse them too. Connectivity ≠ therapeutic relatedness.
+4. **Cell-composition confound** — the whole-blood signature is reticulocyte-dominated.
+5. **LINCS cancer-cell-line bias**, and **no reference-signature library** for proper tau.
+6. **No mechanistic validation** — all hits are computational hypotheses.

 ## Section 6 — What v2 would fix

- Cell-type deconvolution of the disease signature
- Knowledge graph fallback for missing-signature drugs
- A second disease to test generalization (the real test — sickle cell results do not prove
-  the platform generalizes, §9.5)
+- **A reference-signature library** to make tau (proper specificity calibration) work — the
+  single biggest fix to the negative-control problem, and a direct use of the curated-data moat.
+- **Cell-type deconvolution** + a non-globin-depleted RNA-seq study to recover a more specific,
+  HbF-containing signature.
+- **Signature prediction / mechanism graph** to score genes with no LINCS measurement.
+- **A second disease** to test generalization and to honestly re-validate the v1.1 method
+  (PLAN §9.5).
+
+---
+
+### Bottom line
+
+The pre-registered recovery test **failed**. Post-hoc diagnosis shows the dominant cause was a
+fixable gene-space bottleneck — correcting it **recovers hydroxyurea** — but also surfaces a
+deeper, genuine limitation: this whole-blood signature is **not specific enough** for
+connectivity scoring to separate real candidates from incidental reversers (negative controls
+rank at the top). The MVP's real deliverable is a precise, honest map of *what it takes to make
+this method work*: a more specific (deconvolved, HbF-containing) signature and a reference library
+for calibration — exactly the curated-data investments the platform thesis is built on.
--- a/docs/structure_binding_notes.md
+++ b/docs/structure_binding_notes.md
@@ -0,0 +1,34 @@
+# Structure-based binding track — working notes
+
+Branch `structure-based-binding`. Implements PLAN §12. Baseline-first, start with the two cleanest
+targets (Hemoglobin + PKR), de-risk the harness before scaling.
+
+## Status (2026-06-23)
+
+**Toolchain check (PLAN §12.6 pitfall 4, confirmed real):**
+- ✅ RDKit installs on ARM Mac — ligand side ready.
+- ❌ AutoDock Vina does NOT pip-install on ARM Mac; no docking binary available. Docking (§12.3)
+  is **blocked on toolchain** — must resolve via conda/micromamba (`vina`/`smina`), a GPU AF3-class
+  model (Boltz-2/Chai-1/DiffDock), or an x86 Vina binary under Rosetta.
+
+**Structures obtained:** `5E83` (hemoglobin + voxelotor), `8XFD` (PKR + mitapivat) in
+`data/raw/structures/`.
+
+**Step 0 — ligand-based retrieval baseline (`scripts/binding_ligand_baseline.py`):**
+RDKit Tanimoto of our 300 drugs vs known sickle binders.
+- Engine VALIDATED on in-set classes: `decitabine`→azacitidine (0.62); `vorinostat`→scriptaid
+  (0.42), belinostat (0.28). Correctly clusters DNMT1 / HDAC HbF-inducers.
+- But voxelotor / mitapivat have **no analog** in our set (max Tanimoto ~0.20–0.26). A 300-drug
+  library is too sparse to contain look-alikes of distinctive scaffolds.
+
+**Takeaways:**
+1. Ligand retrieval works but needs a **bigger drug library** to be useful for distinctive targets.
+2. The targets without in-set analogs (Hb, PKR) need **actual docking** (§12.3) — which scores
+   binding directly, no look-alike required. That is the gating next step, and it needs the
+   toolchain solved.
+
+## Next steps
+- [ ] Resolve the docking toolchain (recommend: micromamba + smina/vina, CPU, no GPU needed for baseline).
+- [ ] Dock the known binders (voxelotor→5E83, mitapivat→8XFD) as positive controls (§12.4 recovery test).
+- [ ] Expand the ligand library (full ChEMBL/LINCS) for retrieval to have reach.
+- [ ] Only then: AF3-class co-folding (Boltz-2/DiffDock) vs the docking baseline; and §12.9 generative beacon.
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -33,6 +33,12 @@ dev = [
    "pytest>=8.0",
    "ruff>=0.5",
 ]
+# Structure-based binding track (PLAN §12). Docking tool (vina/smina) is NOT pip-installable on
+# ARM Mac — install via conda/micromamba or use a GPU AF3-class model; see docs/structure_binding_notes.md.
+structure = [
+    "rdkit>=2024.3",
+    "requests>=2.31",
+]

 [tool.setuptools.packages.find]
 where = ["."]
--- a/scripts/binding_ligand_baseline.py
+++ b/scripts/binding_ligand_baseline.py
@@ -0,0 +1,66 @@
+"""Structure-based track, step 0: ligand-based retrieval baseline (PLAN §12.9 engine).
+
+Docking (§12.3) needs a toolchain that doesn't pip-install on ARM Mac (AutoDock Vina) — that's the
+next dependency to solve. Meanwhile this runs now with pure RDKit: do any of our 300 drugs sit near
+the KNOWN sickle binders (voxelotor, mitapivat, decitabine) in chemical space? This is the
+retrieval engine §12.9 would point a generative beacon at, and a sanity check on the ligand data.
+
+NOT docking and NOT a binding claim — chemical similarity only. Similarity != activity (§12.9).
+"""
+
+from __future__ import annotations
+
+import pandas as pd
+import requests
+from rdkit import Chem, DataStructs, RDLogger
+from rdkit.Chem import rdFingerprintGenerator
+
+RDLogger.DisableLog("rdApp.*")
+MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
+
+# Known sickle binders = positive-control beacons (target in parens).
+BINDERS = ["voxelotor", "mitapivat", "decitabine", "vorinostat"]
+
+
+def pubchem_smiles(name: str) -> str | None:
+    for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
+        try:
+            u = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON"
+            d = requests.get(u, timeout=30).json()["PropertyTable"]["Properties"][0]
+            if prop in d:
+                return d[prop]
+        except Exception:
+            continue
+    return None
+
+
+def fp(smi: str):
+    if not isinstance(smi, str) or smi in ("-666", ""):
+        return None
+    m = Chem.MolFromSmiles(smi)
+    return MORGAN.GetFingerprint(m) if m else None
+
+
+def main() -> None:
+    binder_smi = {b: pubchem_smiles(b) for b in BINDERS}
+    print("known-binder SMILES:", {k: (v[:34] + "..." if v else "MISSING") for k, v in binder_smi.items()})
+
+    drugs = pd.read_csv("data/processed/drug_set_v1.csv")[["pert_iname", "canonical_smiles", "inclusion_reason"]]
+    reason = dict(zip(drugs.pert_iname, drugs.inclusion_reason))
+    drug_fp = {r.pert_iname: fp(r.canonical_smiles) for r in drugs.itertuples()}
+    drug_fp = {k: v for k, v in drug_fp.items() if v is not None}
+    print(f"fingerprinted {len(drug_fp)}/{len(drugs)} drugs\n")
+
+    for b, smi in binder_smi.items():
+        bfp = fp(smi)
+        if bfp is None:
+            print(f"{b}: no SMILES\n"); continue
+        sims = sorted(((DataStructs.TanimotoSimilarity(bfp, v), k) for k, v in drug_fp.items()), reverse=True)
+        print(f"nearest drugs to {b}:")
+        for s, k in sims[:5]:
+            print(f"    {s:.3f}  {k:22s} [{reason.get(k,'?')}]")
+        print()
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/exp_deconv_signature.py
+++ b/scripts/exp_deconv_signature.py
@@ -0,0 +1,137 @@
+"""Experiment: composition-adjusted sickle signature, to fix specificity (option 1).
+
+The v1 signature is confounded by cell composition (SS patients have very different WBC/RBC
+than AA controls). GSE35007 *measured* those counts per sample, so we adjust the differential
+expression for them directly (a measured-covariate alternative to estimated deconvolution):
+
+    expression ~ disease + WBC + RBC + MCV + age + sex     (per gene, vectorized OLS)
+
+We compare the composition-ADJUSTED signature against an UNADJUSTED single-study signature
+(same samples, model without the covariates), score both with the v1.1 engine (full gene space
+ spec_z), and report the recovery test for each. Writes nothing to committed artifacts.
+"""
+
+from __future__ import annotations
+
+import warnings
+from pathlib import Path
+
+import GEOparse
+import numpy as np
+import pandas as pd
+from scipy.stats import false_discovery_control
+from scipy.stats import t as tdist
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.disease import collapse_probes_to_symbols  # noqa: E402
+from src.scoring import tau_calibrate  # noqa: E402
+
+warnings.filterwarnings("ignore")
+PROCESSED = Path("data/processed")
+NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
+SYMBOL_COLS = ["Symbol", "ILMN_Gene", "Gene Symbol", "GeneSymbol"]
+
+
+def cval(gsm, key):
+    for c in gsm.metadata.get("characteristics_ch1", []):
+        if c.lower().startswith(key.lower()):
+            return c.split(":", 1)[1].strip()
+    return None
+
+
+def load_data():
+    gse = GEOparse.get_GEO(geo="GSE35007", destdir="data/raw/geo", silent=True)
+    meta = []
+    for gid, g in gse.gsms.items():
+        meta.append({"gsm": gid, "hb": cval(g, "hb phenotype"), "wbc": cval(g, "white blood cells"),
+                     "rbc": cval(g, "red blood cells"), "mcv": cval(g, "mean corpuscular volume"),
+                     "age": cval(g, "age"), "sex": cval(g, "Sex")})
+    meta = pd.DataFrame(meta).set_index("gsm")
+    for c in ["wbc", "rbc", "mcv", "age"]:
+        meta[c] = pd.to_numeric(meta[c], errors="coerce")
+    meta["disease"] = meta["hb"].map({"SS": 1.0, "AA": 0.0})
+    meta["sex_m"] = (meta["sex"] == "M").astype(float)
+    keep = meta[meta["hb"].isin(["SS", "AA"])].dropna(subset=["wbc", "rbc", "mcv", "age", "disease"])
+
+    expr = pd.DataFrame({gid: gse.gsms[gid].table.set_index("ID_REF")["VALUE"] for gid in keep.index})
+    expr = expr.apply(pd.to_numeric, errors="coerce").dropna(how="any")
+    if float(np.nanmax(expr.to_numpy())) > 50:
+        expr = np.log2(expr.clip(lower=0) + 1.0)
+
+    gpl = list(gse.gpls.values())[0]
+    col = next((c for c in SYMBOL_COLS if c in gpl.table.columns), None)
+    sym = gpl.table.set_index("ID")[col].astype(str).str.strip().replace({"": np.nan, "nan": np.nan})
+    return expr, keep, sym.dropna()
+
+
+def ols_de(expr, design, disease_idx):
+    """Vectorized per-gene OLS; return DE table (log_fc=disease coef, pvalue, qvalue)."""
+    X = design.to_numpy(dtype=float)
+    Y = expr.T.to_numpy(dtype=float)  # samples x genes
+    n, p = X.shape
+    XtX_inv = np.linalg.pinv(X.T @ X)
+    B = XtX_inv @ X.T @ Y
+    resid = Y - X @ B
+    dof = n - p
+    sigma2 = (resid ** 2).sum(0) / dof
+    se = np.sqrt(sigma2 * XtX_inv[disease_idx, disease_idx])
+    t = B[disease_idx] / se
+    pval = 2 * tdist.sf(np.abs(t), dof)
+    out = pd.DataFrame({"log_fc": B[disease_idx], "pvalue": pval}, index=expr.index).dropna()
+    out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
+    return out
+
+
+def make_signature(de, sym, expr, top_n=250):
+    de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
+    sig = de_sym[de_sym["qvalue"] < 0.05]
+    up = sig[sig["log_fc"] > 0].nsmallest(top_n, "qvalue").index.tolist()
+    down = sig[sig["log_fc"] < 0].nsmallest(top_n, "qvalue").index.tolist()
+    return up, down
+
+
+def evaluate(label, up, down, lincs):
+    ranked = tau_calibrate(up, down, lincs, n_null=1000)
+    n = len(ranked)
+    top10, top25, half = int(n * .10), int(n * .25), n // 2
+    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
+    ranked = ranked.join(profiles[["inclusion_reason"]])
+    hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
+    negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
+    n_bottom = sum(r > half for r in negs.values())
+    n_ov = len((set(up) | set(down)) & set(lincs.columns))
+    print(f"\n### {label}: {len(up)} up / {len(down)} down (query overlap {n_ov})")
+    print(f"  hydroxyurea rank {hu}/{n} (top {100*hu/n:.1f}%)  top-10%? {hu <= top10}")
+    print(f"  L-glutamine rank {glut}/{n} (top {100*glut/n:.1f}%)  top-25%? {glut <= top25}")
+    print(f"  neg controls bottom-half: {n_bottom}/5  {negs}")
+    print("  top 8: " + ", ".join(
+        f"{name}[{r['inclusion_reason'][:3]}]" for name, r in ranked.nsmallest(8, "spec_z").iterrows()))
+    return ranked
+
+
+def main():
+    expr, meta, sym = load_data()
+    print(f"loaded {expr.shape[1]} samples x {expr.shape[0]} probes; "
+          f"{int(meta.disease.sum())} SS / {int((meta.disease==0).sum())} AA")
+    lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
+
+    base = pd.DataFrame({"intercept": 1.0, "disease": meta["disease"]}, index=meta.index)
+    adj = base.assign(wbc=meta["wbc"], rbc=meta["rbc"], mcv=meta["mcv"], age=meta["age"], sex_m=meta["sex_m"])
+
+    de_unadj = ols_de(expr, base, disease_idx=1)
+    de_adj = ols_de(expr, adj, disease_idx=1)
+
+    up_u, dn_u = make_signature(de_unadj, sym, expr)
+    up_a, dn_a = make_signature(de_adj, sym, expr)
+
+    # how much does adjustment change the gene set?
+    overlap = len((set(up_u) | set(dn_u)) & (set(up_a) | set(dn_a)))
+    print(f"\nsignature gene overlap unadjusted vs adjusted: {overlap}/{len(set(up_u)|set(dn_u))}")
+
+    evaluate("UNADJUSTED (GSE35007 only)", up_u, dn_u, lincs)
+    evaluate("COMPOSITION-ADJUSTED", up_a, dn_a, lincs)
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/exp_genespace.py
+++ b/scripts/exp_genespace.py
@@ -0,0 +1,102 @@
+"""Experiment (v1.1): re-score on a larger LINCS gene space and re-run the recovery test.
+
+v1 used only the 978 landmark genes (12% signature overlap). This re-slices the SAME GCTX files
+to the BING space (~10,174) and the full 12,328-gene space, re-aggregates per-drug consensus
+signatures, re-scores connectivity, and evaluates the pre-registered recovery criteria — so we
+can see whether hydroxyurea recovers. Writes nothing to the committed v1 artifacts.
+"""
+
+from __future__ import annotations
+
+import gzip
+import io
+import json
+from pathlib import Path
+
+import pandas as pd
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.scoring import rank_drugs  # noqa: E402
+
+LINCS = Path("data/raw/lincs")
+PROCESSED = Path("data/processed")
+GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
+SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
+NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
+
+
+def read_gz(name):
+    return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
+
+
+def gene_ids_for_space(space: str):
+    g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
+    if space == "bing":
+        g = g[g.pr_is_bing == 1]
+    # 'all' -> keep everything
+    ids = [str(x) for x in g.pr_gene_id]
+    id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
+    return ids, id_to_symbol
+
+
+def extract(space, drug_names, gene_ids, id_to_symbol):
+    from cmapPy.pandasGEXpress.parse import parse
+    frames = []
+    for ph in (1, 2):
+        sig = read_gz(SIG_INFO[ph])
+        sig = sig[(sig.pert_type == "trt_cp") & (sig.pert_iname.isin(drug_names))]
+        if sig.empty:
+            continue
+        gct = parse(str(GCTX[ph]), rid=gene_ids, cid=sig.sig_id.tolist())
+        data = gct.data_df
+        s2d = dict(zip(sig.sig_id, sig.pert_iname))
+        frames.append(data.T.groupby(data.columns.map(s2d)).mean())
+        print(f"    [{space}] phase {ph}: {sig.pert_iname.nunique()} drugs sliced", flush=True)
+    combined = pd.concat(frames).groupby(level=0).mean()
+    combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]
+    combined = combined.loc[:, ~combined.columns.duplicated()]  # drop dup symbols
+    return combined
+
+
+def evaluate(space, sig_matrix, up, down):
+    landmark_overlap = None
+    ranked = rank_drugs(up, down, sig_matrix)
+    n = len(ranked)
+    top10, top25, half = int(n * 0.10), int(n * 0.25), n // 2
+    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
+    ranked = ranked.join(profiles[["inclusion_reason"]])
+
+    hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
+    glut_s = ranked.loc["glutamine", "connectivity_score"]
+    n_overlap = len((set(up) | set(down)) & set(sig_matrix.columns))
+    negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
+    n_bottom = sum(r > half for r in negs.values())
+
+    print(f"\n=== gene space: {space.upper()} ({sig_matrix.shape[1]} genes; query overlap {n_overlap}) ===")
+    print(f"  hydroxyurea: rank {hu}/{n} (top {100*hu/n:.1f}%)  top-10%? {hu <= top10}")
+    print(f"  L-glutamine: rank {glut}/{n} (top {100*glut/n:.1f}%), WTCS={glut_s:.3f}  top-25%? {glut <= top25}")
+    print(f"  neg controls in bottom half: {n_bottom}/5  {negs}")
+    crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
+    print(f"  OVERALL: {'PASS' if crit else 'FAIL'}")
+    print("  top 8:")
+    for name, r in ranked.nsmallest(8, "connectivity_score").iterrows():
+        print(f"    {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} [{r['inclusion_reason']}]")
+    return ranked
+
+
+def main():
+    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
+    up = [g["gene"] for g in sig["up_regulated"]]
+    down = [g["gene"] for g in sig["down_regulated"]]
+    drug_names = set(pd.read_csv(PROCESSED / "drug_set_v1.csv").pert_iname)
+
+    for space in ("bing", "all"):
+        print(f"\n>>> extracting {space} ...", flush=True)
+        ids, id2sym = gene_ids_for_space(space)
+        mat = extract(space, drug_names, ids, id2sym)
+        evaluate(space, mat, up, down)
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/phaseA_reference_tau.py
+++ b/scripts/phaseA_reference_tau.py
@@ -0,0 +1,118 @@
+"""Phase A: calibrate sickle connectivity against a REAL disease-signature reference population.
+
+The v1.1 tau saturated because it used random gene-set nulls. Proper specificity calibration
+asks: does a drug reverse SICKLE more than it reverses diseases in general? We download a library
+of real disease signatures (Enrichr "Disease Signatures from GEO", up+down), compute each drug's
+connectivity to every reference disease, and express its sickle connectivity as a z-score within
+that per-drug reference distribution. Broad-effect drugs (reverse everything) -> z~0 -> down-ranked.
+
+Re-runs the recovery test and compares to v1.1 (random-null). Writes nothing committed.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import requests
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.scoring import _ks_connectivity  # noqa: E402
+
+PROCESSED = Path("data/processed")
+RAW = Path("data/raw/disease_sigs")
+ENRICHR = "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName={}"
+LIBS = {"up": "Disease_Signatures_from_GEO_up_2014", "down": "Disease_Signatures_from_GEO_down_2014"}
+NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
+
+
+def fetch_gmt(name: str) -> dict[str, list[str]]:
+    RAW.mkdir(parents=True, exist_ok=True)
+    path = RAW / f"{name}.gmt"
+    if not path.exists():
+        r = requests.get(ENRICHR.format(name), timeout=120)
+        r.raise_for_status()
+        path.write_text(r.text)
+    out = {}
+    for line in path.read_text().splitlines():
+        parts = line.split("\t")
+        if len(parts) < 3:
+            continue
+        term = parts[0]
+        genes = [g.split(",")[0].strip().upper() for g in parts[2:] if g.strip()]
+        out[term] = genes
+    print(f"  {name}: {path.stat().st_size/1e6:.1f} MB, {len(out)} terms")
+    return out
+
+
+def build_reference() -> list[dict]:
+    up = fetch_gmt(LIBS["up"])
+    down = fetch_gmt(LIBS["down"])
+    shared = set(up) & set(down)
+    refs = []
+    for term in shared:
+        if "sickle" in term.lower():
+            continue  # exclude the target disease from its own reference population
+        refs.append({"name": term, "up": up[term], "down": down[term]})
+    print(f"reference disease signatures (paired up+down, sickle excluded): {len(refs)}")
+    return refs
+
+
+def cols_for(genes, gene_to_col):
+    return np.array([gene_to_col[g] for g in set(genes) if g in gene_to_col], dtype=int)
+
+
+def main():
+    import json
+    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
+    sk_up = [g["gene"] for g in sig["up_regulated"]]
+    sk_down = [g["gene"] for g in sig["down_regulated"]]
+
+    lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
+    genes = list(lincs.columns)
+    gene_to_col = {g: i for i, g in enumerate(genes)}
+    n = len(genes)
+    R = lincs.rank(axis=1, ascending=False).to_numpy()
+
+    refs = build_reference()
+    # connectivity of every drug to every reference disease -> (n_drugs, n_refs)
+    C = np.empty((R.shape[0], len(refs)))
+    for j, d in enumerate(refs):
+        C[:, j] = _ks_connectivity(R, cols_for(d["up"], gene_to_col), cols_for(d["down"], gene_to_col), n)
+    # drop reference diseases with too few mapped genes (degenerate columns)
+    mapped = np.array([len(cols_for(d["up"], gene_to_col)) + len(cols_for(d["down"], gene_to_col)) for d in refs])
+    keep = mapped >= 10
+    C = C[:, keep]
+    print(f"usable reference diseases (>=10 mapped genes): {keep.sum()}")
+
+    real = _ks_connectivity(R, cols_for(sk_up, gene_to_col), cols_for(sk_down, gene_to_col), n)
+    ref_mean, ref_std = C.mean(axis=1), C.std(axis=1)
+    ref_std[ref_std == 0] = np.nan
+    spec_z = (real - ref_mean) / ref_std  # negative = reverses sickle more than diseases-in-general
+
+    ranked = pd.DataFrame({"spec_z": spec_z, "connectivity": real}, index=lincs.index).sort_values("spec_z")
+    ranked.insert(0, "rank", range(1, len(ranked) + 1))
+    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
+    ranked = ranked.join(profiles[["inclusion_reason"]])
+
+    N = len(ranked)
+    top10, top25, half = int(N * .10), int(N * .25), N // 2
+    hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
+    negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
+    n_bottom = sum(r > half for r in negs.values())
+
+    print("\n==== RECOVERY TEST (reference-calibrated tau) ====")
+    print(f"  hydroxyurea rank {hu}/{N} (top {100*hu/N:.1f}%)  top-10%? {hu <= top10}  z={ranked.loc['hydroxyurea','spec_z']:.2f}")
+    print(f"  L-glutamine rank {glut}/{N} (top {100*glut/N:.1f}%)  top-25%? {glut <= top25}")
+    print(f"  neg controls bottom-half: {n_bottom}/5  {negs}")
+    crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
+    print(f"  OVERALL: {'PASS' if crit else 'FAIL'}")
+    print("\n  top 12 by reference-calibrated z:")
+    for name, r in ranked.nsmallest(12, "spec_z").iterrows():
+        print(f"    {int(r['rank']):2d} {name:20s} z={r['spec_z']:6.2f} [{r['inclusion_reason']}]")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/phaseD_supervised.py
+++ b/scripts/phaseD_supervised.py
@@ -0,0 +1,137 @@
+"""Phase D: supervised cross-disease repurposing — can labels break the specificity ceiling?
+
+Connectivity alone can't tell therapeutic from coincidental reversal. A supervised model trained
+on known drug-disease pairs CAN learn that pattern — given features that expose drug "broadness"
+(a drug that reverses everything is non-specific). We train on 839 GEO disease signatures with
+Repurposing-Hub indications as labels, evaluate with disease-grouped CV, then apply to HELD-OUT
+sickle and check whether the coincidental reversers (norethindrone, ciprofloxacin) finally drop.
+
+Baseline = rank by raw connectivity (the single conn feature). Win = model down-ranks the
+negative controls vs baseline while keeping hydroxyurea high.
+"""
+
+from __future__ import annotations
+
+import json
+import re
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+from sklearn.ensemble import GradientBoostingClassifier
+from sklearn.model_selection import GroupKFold, cross_val_predict
+from sklearn.metrics import roc_auc_score
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.scoring import _ks_connectivity  # noqa: E402
+
+PROCESSED = Path("data/processed")
+SIGS = Path("data/raw/disease_sigs")
+NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
+ID_RE = re.compile(r"\s*(c\d{6,}|doid[-\s]?\d+|gse\d+|umls\S+)\s*", re.I)
+
+
+def clean_disease(term: str) -> str:
+    return ID_RE.sub(" ", term).strip().lower()
+
+
+def load_disease_sigs():
+    def parse(name):
+        out = {}
+        for line in (SIGS / name).read_text().splitlines():
+            p = line.split("\t")
+            if len(p) < 3:
+                continue
+            out[p[0]] = [g.split(",")[0].strip().upper() for g in p[2:] if g.strip()]
+        return out
+    up = parse("Disease_Perturbations_from_GEO_up.gmt")
+    down = parse("Disease_Perturbations_from_GEO_down.gmt")
+    terms = sorted(set(up) & set(down))
+    return [{"term": t, "disease": clean_disease(t), "up": up[t], "down": down[t]} for t in terms]
+
+
+def cols_for(genes, g2c):
+    return np.array([g2c[g] for g in set(genes) if g in g2c], dtype=int)
+
+
+def featurize(conn_col, C):
+    """Per-(drug,disease) features for one disease column conn_col, given full matrix C."""
+    drug_mean, drug_std = C.mean(1), C.std(1)
+    drug_min = C.min(1)
+    broad = (C < -0.10).mean(1)  # fraction of diseases this drug strongly reverses
+    dmean, dstd = conn_col.mean(), conn_col.std() or 1.0
+    return np.column_stack([
+        conn_col, drug_mean, drug_std, drug_min, broad,
+        np.full_like(conn_col, dmean),
+        (conn_col - drug_mean) / np.where(drug_std == 0, 1, drug_std),  # specificity within drug
+        (conn_col - dmean) / dstd,                                       # specificity within disease
+    ])
+
+
+FEATS = ["conn", "drug_mean", "drug_std", "drug_min", "broadness",
+         "disease_mean", "z_within_drug", "z_within_disease"]
+
+
+def main():
+    lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
+    drugs = list(lincs.index)
+    g2c = {g: i for i, g in enumerate(lincs.columns)}
+    n = len(lincs.columns)
+    R = lincs.rank(axis=1, ascending=False).to_numpy()
+
+    refs = [d for d in load_disease_sigs()
+            if len(cols_for(d["up"], g2c)) + len(cols_for(d["down"], g2c)) >= 10]
+    C = np.column_stack([_ks_connectivity(R, cols_for(d["up"], g2c), cols_for(d["down"], g2c), n) for d in refs])
+    print(f"{len(drugs)} drugs x {len(refs)} disease signatures; connectivity matrix {C.shape}")
+
+    # labels from Repurposing Hub indications
+    hub = pd.read_csv("data/raw/repurposing_drugs.txt", sep="\t", comment="!", low_memory=False)
+    ind = {r.pert_iname: [i.strip().lower() for i in re.split(r"[|,]", r.indication) if len(i.strip()) > 3]
+           for r in hub.itertuples() if isinstance(r.indication, str)}
+    drug_idx = {d: i for i, d in enumerate(drugs)}
+
+    X_rows, y, grp = [], [], []
+    for j, d in enumerate(refs):
+        feats = featurize(C[:, j], C)
+        dz = d["disease"]
+        for i, drug in enumerate(drugs):
+            inds = ind.get(drug, [])
+            label = int(any(dz == k or (len(dz) > 4 and dz in k) or (len(k) > 4 and k in dz) for k in inds))
+            X_rows.append(feats[i]); y.append(label); grp.append(j)
+    X = np.array(X_rows); y = np.array(y); grp = np.array(grp)
+    print(f"pairs: {len(y)}, positives: {y.sum()} ({100*y.mean():.2f}%)")
+
+    # disease-grouped CV (generalize to unseen diseases)
+    clf = GradientBoostingClassifier(random_state=0)
+    proba = cross_val_predict(clf, X, y, cv=GroupKFold(5), groups=grp, method="predict_proba")[:, 1]
+    print(f"disease-grouped CV AUC: {roc_auc_score(y, proba):.3f}  (conn-only AUC: {roc_auc_score(y, X[:,0]*-1):.3f})")
+
+    clf.fit(X, y)
+    print("feature importances: " + ", ".join(f"{f}={imp:.2f}" for f, imp in
+          sorted(zip(FEATS, clf.feature_importances_), key=lambda t: -t[1])))
+
+    # apply to HELD-OUT sickle
+    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
+    sk = _ks_connectivity(R, cols_for([g["gene"] for g in sig["up_regulated"]], g2c),
+                          cols_for([g["gene"] for g in sig["down_regulated"]], g2c), n)
+    Xs = featurize(sk, C)
+    p_sickle = clf.predict_proba(Xs)[:, 1]
+
+    res = pd.DataFrame({"model_p": p_sickle, "conn": sk}, index=drugs)
+    res["model_rank"] = res["model_p"].rank(ascending=False).astype(int)
+    res["conn_rank"] = res["conn"].rank(ascending=True).astype(int)  # most negative = best
+    N = len(res)
+    print(f"\n{'drug':14s} {'model_rank':>10s} {'conn_rank(base)':>16s}")
+    for d in ["hydroxyurea", "glutamine"] + NEG5 + ["norethindrone", "ciprofloxacin"]:
+        if d in res.index:
+            r = res.loc[d]
+            print(f"  {d:14s} {int(r['model_rank']):6d}/{N}      {int(r['conn_rank']):6d}/{N}")
+    nb_model = sum(res.loc[d, "model_rank"] > N/2 for d in NEG5 if d in res.index)
+    nb_conn = sum(res.loc[d, "conn_rank"] > N/2 for d in NEG5 if d in res.index)
+    print(f"\nneg controls bottom-half: model {nb_model}/5  vs  baseline {nb_conn}/5")
+    print("model top 10:", ", ".join(res.sort_values('model_p', ascending=False).head(10).index))
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/week1_explore.py
+++ b/scripts/week1_explore.py
@@ -0,0 +1,139 @@
+"""Week 1 exploratory run: build per-study DE + concordance for the whole-blood pair.
+
+Read-only decision aid (NOT the final pipeline): downloads GSE35007 + GSE16728, runs DE on
+each, collapses to HGNC symbols, computes concordance, and reports whether the sanity-gate
+genes survive — so we can choose Tier A (concordance pair) vs Tier B (GSE35007 alone) with
+real numbers. Intermediate DE tables are cached under data/raw/geo/.
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+import GEOparse
+import numpy as np
+import pandas as pd
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.disease import (  # noqa: E402
+    DISEASE_LABEL,
+    HEALTHY_LABEL,
+    collapse_probes_to_symbols,
+    compute_differential_expression,
+    concordance_filter,
+)
+
+GEO_DIR = Path("data/raw/geo")
+GEO_DIR.mkdir(parents=True, exist_ok=True)
+
+SANITY_GENES = ["HBG1", "HBG2", "ALAS2", "CA1", "AHSP", "SLC4A1", "HBB", "ERAF"]
+SYMBOL_COL_CANDIDATES = [
+    "Symbol", "ILMN_Gene", "Gene Symbol", "GENE_SYMBOL", "Gene symbol",
+    "gene_assignment", "GeneSymbol",
+]
+
+
+def log(msg: str) -> None:
+    print(msg, flush=True)
+
+
+def get_symbol_map(gpl) -> pd.Series:
+    tbl = gpl.table
+    col = next((c for c in SYMBOL_COL_CANDIDATES if c in tbl.columns), None)
+    if col is None:
+        raise RuntimeError(f"No symbol column in GPL {gpl.name}; columns={list(tbl.columns)[:15]}")
+    s = tbl.set_index("ID")[col].astype(str).str.strip()
+    # gene_assignment fields are '// SYMBOL // ...'; take the first real token.
+    if col == "gene_assignment":
+        s = s.str.split("//").str[1].str.strip()
+    s = s.replace({"": np.nan, "nan": np.nan, "---": np.nan})
+    return s.dropna()
+
+
+def build_expression(gse, sample_ids: list[str]) -> pd.DataFrame:
+    cols = {}
+    for gsm_id in sample_ids:
+        tbl = gse.gsms[gsm_id].table
+        cols[gsm_id] = tbl.set_index("ID_REF")["VALUE"]
+    return pd.DataFrame(cols)
+
+
+def char_value(gsm, key: str) -> str | None:
+    for c in gsm.metadata.get("characteristics_ch1", []):
+        if c.lower().startswith(key.lower()):
+            return c.split(":", 1)[1].strip()
+    return None
+
+
+def run_study(gse, groups: dict[str, str], label: str) -> pd.DataFrame:
+    """groups: gsm_id -> 'disease'/'healthy'. Returns symbol-level DE table."""
+    sample_ids = list(groups.keys())
+    log(f"[{label}] {len(sample_ids)} samples "
+        f"({sum(v==DISEASE_LABEL for v in groups.values())} disease / "
+        f"{sum(v==HEALTHY_LABEL for v in groups.values())} healthy)")
+    expr = build_expression(gse, sample_ids).apply(pd.to_numeric, errors="coerce").dropna(how="all")
+    sample_groups = pd.Series(groups)
+    de = compute_differential_expression(expr, sample_groups, method="welch")
+    gpl = list(gse.gpls.values())[0]
+    sym = get_symbol_map(gpl)
+    de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
+    log(f"[{label}] probes->symbols: {len(de)} probes -> {len(de_sym)} symbols; "
+        f"sig q<0.05: {(de_sym['qvalue']<0.05).sum()}")
+    de_sym.to_parquet(GEO_DIR / f"de_{label}.parquet")
+    return de_sym
+
+
+def main() -> None:
+    # --- GSE35007: whole blood, hb phenotype SS (disease) vs AA (healthy) ---
+    log("downloading GSE35007 ...")
+    gse1 = GEOparse.get_GEO(geo="GSE35007", destdir=str(GEO_DIR), silent=True)
+    g1 = {}
+    for gsm_id, gsm in gse1.gsms.items():
+        hb = char_value(gsm, "hb phenotype")
+        if hb == "SS":
+            g1[gsm_id] = DISEASE_LABEL
+        elif hb == "AA":
+            g1[gsm_id] = HEALTHY_LABEL
+    de1 = run_study(gse1, g1, "GSE35007")
+
+    # --- GSE16728: whole blood only (PAXgene preps; exclude PBMC), patient vs control ---
+    log("downloading GSE16728 ...")
+    gse2 = GEOparse.get_GEO(geo="GSE16728", destdir=str(GEO_DIR), silent=True)
+    g2 = {}
+    for gsm_id, gsm in gse2.gsms.items():
+        prep = char_value(gsm, "rna prep") or ""
+        subj = char_value(gsm, "subject") or ""
+        if "PBMC" in prep:
+            continue  # whole-blood only
+        if subj.lower().startswith("sickle"):
+            g2[gsm_id] = DISEASE_LABEL
+        elif subj.lower().startswith("control"):
+            g2[gsm_id] = HEALTHY_LABEL
+    de2 = run_study(gse2, g2, "GSE16728")
+
+    # --- Concordance ---
+    keep, summary = concordance_filter(de1, de2)
+    log("\n==== CONCORDANCE (whole-blood pair) ====")
+    log(f"genes tested in both: {summary.n_genes_tested}")
+    log(f"concordant (q<0.05 both + same sign): {summary.n_concordant} "
+        f"(up={summary.n_up}, down={summary.n_down})")
+    keep.to_parquet(GEO_DIR / "concordant_genes.parquet")
+
+    log("\n==== SANITY-GATE GENES ====")
+    for g in SANITY_GENES:
+        in1 = g in de1.index
+        in2 = g in de2.index
+        row1 = de1.loc[g] if in1 else None
+        row2 = de2.loc[g] if in2 else None
+        concord = "YES" if g in keep.index else "no"
+        d1 = f"lfc={row1['log_fc']:+.2f},q={row1['qvalue']:.1e}" if in1 else "absent"
+        d2 = f"lfc={row2['log_fc']:+.2f},q={row2['qvalue']:.1e}" if in2 else "absent"
+        log(f"  {g:7s} | GSE35007 {d1:28s} | GSE16728 {d2:28s} | concordant={concord}")
+
+    log("\n==== TOP 15 CONCORDANT (by worst-case q) ====")
+    log(keep.head(15).to_string())
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/week1_finalize.py
+++ b/scripts/week1_finalize.py
@@ -0,0 +1,121 @@
+"""Finalize the Week 1 sickle cell signature (Tier A, whole-blood concordance pair).
+
+Reads the cached concordant gene table, maps symbols -> Entrez/Ensembl via mygene, attaches
+two-study provenance + concordance summary, and persists
+data/processed/sickle_cell_signature_v1.json.
+
+Tier note: GSE16728 is exactly 10/group, which fails assign_tier()'s strict n>10. Tier A is
+assigned explicitly here on the basis of cross-study, cross-platform, cross-population
+concordance (the founder's decision), with the n=10 / prep-merge caveat documented in the
+tier rationale and limitations.
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+import mygene
+import pandas as pd
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.disease import (  # noqa: E402
+    ConcordanceSummary,
+    SignatureProvenance,
+    StudyProvenance,
+    build_signature,
+    persist_signature,
+)
+from src.provenance import ConfidenceTier  # noqa: E402
+
+CREATED_DATE = "2026-06-23"
+GEO_DIR = Path("data/raw/geo")
+
+
+def map_ids(symbols: list[str]) -> pd.DataFrame:
+    mg = mygene.MyGeneInfo()
+    res = mg.querymany(
+        symbols, scopes="symbol", fields="entrezgene,ensembl.gene",
+        species="human", as_dataframe=True, verbose=False,
+    )
+    res = res[~res.index.duplicated(keep="first")]
+
+    def _ensembl(v):
+        if isinstance(v, list) and v:
+            v = v[0]
+        if isinstance(v, dict):
+            return v.get("gene")
+        return v
+
+    id_map = pd.DataFrame(index=res.index)
+    id_map["entrez_id"] = res["entrezgene"] if "entrezgene" in res.columns else None
+    ens_col = "ensembl.gene" if "ensembl.gene" in res.columns else ("ensembl" if "ensembl" in res.columns else None)
+    id_map["ensembl_id"] = res[ens_col].map(_ensembl) if ens_col else None
+    return id_map
+
+
+def main() -> None:
+    concordant = pd.read_parquet(GEO_DIR / "concordant_genes.parquet")
+    print(f"loaded {len(concordant)} concordant genes "
+          f"({(concordant['log_fc']>0).sum()} up / {(concordant['log_fc']<0).sum()} down)")
+
+    id_map = map_ids(list(concordant.index))
+    print(f"mygene mapped {id_map['entrez_id'].notna().sum()}/{len(id_map)} symbols to Entrez")
+
+    provenance = SignatureProvenance(
+        studies=[
+            StudyProvenance(
+                geo_accession="GSE35007", n_disease=190, n_healthy=12,
+                platform="Illumina HumanHT-12 V4 (GPL10558)",
+                tissue="whole blood", method="welch",
+            ),
+            StudyProvenance(
+                geo_accession="GSE16728", n_disease=10, n_healthy=10,
+                platform="Affymetrix HG-U133 Plus 2.0 (GPL570)",
+                tissue="whole blood (PAXgene; globin-reduced + total RNA preps merged)",
+                method="welch",
+            ),
+        ],
+        concordance=ConcordanceSummary(
+            n_genes_tested=16208, n_concordant=671, n_up=444, n_down=227,
+        ),
+        created_date=CREATED_DATE,
+    )
+
+    tier_rationale = (
+        "Measured RNA expression concordant across two independent peer-reviewed whole-blood "
+        "studies on different platforms (Illumina GPL10558, Affymetrix GPL570) and populations "
+        "(pediatric West-African GSE35007; adult GSE16728). Tier A rests on cross-study / "
+        "cross-platform concordance. Caveat: GSE16728 is exactly 10/group (two PAXgene preps "
+        "merged), which does not meet the strict n>10 per-study threshold on its own."
+    )
+    limitations = [
+        "Whole-blood expression is partly driven by cell-composition differences "
+        "(reticulocytosis in sickle patients), not pure disease-state regulation; the "
+        "signature is dominated by erythroid markers (CA1, AHSP, SLC4A1). v2 needs cell-type "
+        "deconvolution.",
+        "Fetal-hemoglobin genes (HBG1/HBG2) are NOT captured: flat in GSE35007 and removed by "
+        "GSE16728's globin-depleted RNA prep. The signature therefore does not directly encode "
+        "the HbF axis that hydroxyurea acts on, which may weaken the hydroxyurea recovery test.",
+        "GSE35007 is highly imbalanced (SS n=190 vs AA n=12); the healthy arm is small.",
+        "GSE16728 whole-blood arm is small (10/group) and merges two RNA-prep batches.",
+        "Cross-platform probe->symbol collapse (keep max-mean-expression probe) loses "
+        "isoform-level resolution and drops genes absent from either platform.",
+    ]
+
+    signature = build_signature(
+        concordant, provenance,
+        tier=ConfidenceTier.A,
+        tier_rationale=tier_rationale,
+        limitations=limitations,
+        id_map=id_map,
+        top_n=250,
+    )
+    path = persist_signature(signature)
+    print(f"wrote {path}")
+    print(f"signature: {len(signature.up_regulated)} up / {len(signature.down_regulated)} down, "
+          f"tier {signature.confidence_tier.value}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/week2_assemble.py
+++ b/scripts/week2_assemble.py
@@ -0,0 +1,91 @@
+"""Week 2, task 4: assemble drug_profiles_v1.parquet (PLAN §6).
+
+Joins the curated drug set + ChEMBL enrichment + LINCS consensus signatures into one profile
+table. Each drug carries a confidence tier: LINCS is a single source, so signature-backed drugs
+are Tier B at best (assign_tier with single_source=True); drugs with no signature are Tier C and
+marked not-scored (not dropped silently — PLAN §6 Week 3 task 2).
+
+The 978-gene signature order is the column order of lincs_signatures_v1.parquet (landmark
+symbols); each profile's `lincs_signature` is that vector (or null).
+"""
+
+from __future__ import annotations
+
+import ast
+from pathlib import Path
+
+import pandas as pd
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.drugs import persist_drug_profiles  # noqa: E402
+from src.provenance import ConfidenceTier, assign_tier  # noqa: E402
+
+PROCESSED = Path("data/processed")
+DRUG_SET = PROCESSED / "drug_set_v1.csv"
+CHEMBL = Path("data/raw/chembl/chembl_enrichment.parquet")
+LINCS_SIG = PROCESSED / "lincs_signatures_v1.parquet"
+
+
+def main() -> None:
+    drugs = pd.read_csv(DRUG_SET)
+    chembl = pd.read_parquet(CHEMBL)[
+        ["pert_iname", "chembl_id", "pref_name", "smiles", "mechanism_of_action", "targets"]
+    ]
+    sigs = pd.read_parquet(LINCS_SIG)  # rows=pert_iname, cols=978 landmark symbols
+    gene_order = list(sigs.columns)
+
+    df = drugs.merge(chembl, on="pert_iname", how="left")
+
+    rows = []
+    for r in df.itertuples():
+        has_sig = r.pert_iname in sigs.index
+        vector = sigs.loc[r.pert_iname].tolist() if has_sig else None
+        # LINCS = single source => Tier B max when measured; no signature => Tier C.
+        tier = assign_tier(
+            is_measured=has_sig, n_per_group=None, peer_reviewed=True, single_source=True
+        ) if has_sig else ConfidenceTier.C
+        targets = r.targets
+        if isinstance(targets, str):
+            try:
+                targets = ast.literal_eval(targets)
+            except (ValueError, SyntaxError):
+                targets = []
+        elif hasattr(targets, "tolist"):  # numpy ndarray from parquet round-trip
+            targets = targets.tolist()
+        elif targets is None or (not isinstance(targets, (list, tuple))):
+            targets = []
+        rows.append({
+            "name": r.pert_iname,
+            "chembl_id": r.chembl_id if pd.notna(r.chembl_id) else None,
+            "pref_name": r.pref_name if pd.notna(r.pref_name) else None,
+            "inchikey": r.inchi_key if pd.notna(r.inchi_key) else None,
+            "smiles": r.smiles if pd.notna(r.smiles) else None,
+            "targets": list(targets),
+            "mechanism_of_action": r.mechanism_of_action if pd.notna(r.mechanism_of_action) else None,
+            "inclusion_reason": r.inclusion_reason,
+            "lincs_phase": r.phase,
+            "scored": has_sig,
+            "lincs_signature": vector,
+            "confidence_tier": tier.value,
+        })
+
+    profiles = pd.DataFrame(rows)
+    # Persist the gene order alongside, so Week-3 scoring can align the vectors.
+    (PROCESSED / "lincs_gene_order.txt").write_text("\n".join(gene_order))
+    path = persist_drug_profiles(profiles)
+
+    print(f"drug_profiles_v1: {len(profiles)} drugs")
+    print(f"  scored (have LINCS signature): {profiles['scored'].sum()}")
+    print(f"  not scored: {(~profiles['scored']).sum()}")
+    print("  by inclusion reason (scored rate):")
+    print(profiles.groupby("inclusion_reason")["scored"].agg(["sum", "count"]).to_string())
+    print("  tier split:", profiles["confidence_tier"].value_counts().to_dict())
+    for gt in ["hydroxyurea", "glutamine"]:
+        row = profiles[profiles["name"] == gt]
+        print(f"  ground truth '{gt}': scored={bool(row['scored'].iloc[0]) if len(row) else 'ABSENT'}")
+    print(f"wrote {path}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/week2_chembl.py
+++ b/scripts/week2_chembl.py
@@ -0,0 +1,99 @@
+"""Week 2, task 2: enrich the drug set with ChEMBL structure/target/mechanism data.
+
+Drugs are matched to ChEMBL by the InChIKey already carried from LINCS pert_info (reliable),
+then mechanism-of-action and target names are pulled. Compounds absent from ChEMBL (many
+research/tool compounds in the random arm) keep null ChEMBL fields — they still have LINCS
+signatures for scoring; only the Week-3 mechanistic prior won't apply. Output cached to
+data/raw/chembl/chembl_enrichment.parquet.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pandas as pd
+from chembl_webresource_client.new_client import new_client
+
+DRUG_SET = Path("data/processed/drug_set_v1.csv")
+OUT = Path("data/raw/chembl/chembl_enrichment.parquet")
+BATCH = 40
+
+
+def chunks(seq, n):
+    for i in range(0, len(seq), n):
+        yield seq[i:i + n]
+
+
+def main() -> None:
+    drugs = pd.read_csv(DRUG_SET)
+    inchikeys = sorted({k for k in drugs["inchi_key"].dropna() if isinstance(k, str) and len(k) > 10})
+    print(f"{len(drugs)} drugs; {len(inchikeys)} usable InChIKeys to resolve")
+
+    molecule = new_client.molecule
+    mechanism = new_client.mechanism
+    target = new_client.target
+
+    # 1) InChIKey -> ChEMBL molecule (id, name, smiles)
+    mol_rows = []
+    for i, batch in enumerate(chunks(inchikeys, BATCH)):
+        res = molecule.filter(molecule_structures__standard_inchi_key__in=batch).only(
+            ["molecule_chembl_id", "pref_name", "molecule_structures"])
+        for m in res:
+            ms = m.get("molecule_structures") or {}
+            mol_rows.append({
+                "chembl_id": m["molecule_chembl_id"],
+                "pref_name": m.get("pref_name"),
+                "smiles": ms.get("canonical_smiles"),
+                "inchi_key": ms.get("standard_inchi_key"),
+            })
+        print(f"  molecules batch {i+1}: cumulative {len(mol_rows)} hits", flush=True)
+    mols = pd.DataFrame(mol_rows).drop_duplicates("inchi_key")
+    chembl_ids = sorted(mols["chembl_id"].unique())
+    print(f"resolved {len(mols)} molecules -> {len(chembl_ids)} ChEMBL ids")
+
+    # 2) ChEMBL id -> mechanism of action + target ids
+    mech_rows = []
+    for batch in chunks(chembl_ids, BATCH):
+        for m in mechanism.filter(molecule_chembl_id__in=batch).only(
+                ["molecule_chembl_id", "mechanism_of_action", "target_chembl_id"]):
+            mech_rows.append(m)
+    mech = pd.DataFrame(mech_rows)
+    print(f"mechanism records: {len(mech)}")
+
+    # 3) target id -> name
+    tgt_names = {}
+    if not mech.empty:
+        tids = sorted({t for t in mech["target_chembl_id"].dropna().unique()})
+        for batch in chunks(tids, BATCH):
+            for t in target.filter(target_chembl_id__in=batch).only(["target_chembl_id", "pref_name"]):
+                tgt_names[t["target_chembl_id"]] = t.get("pref_name")
+
+    # aggregate mechanism/targets per molecule
+    def agg(df):
+        moa = sorted({x for x in df["mechanism_of_action"].dropna()})
+        tns = sorted({tgt_names.get(t) for t in df["target_chembl_id"].dropna() if tgt_names.get(t)})
+        return pd.Series({"mechanism_of_action": "; ".join(moa) or None, "targets": tns})
+
+    if not mech.empty:
+        per_mol = mech.groupby("molecule_chembl_id").apply(agg, include_groups=False).reset_index()
+        per_mol = per_mol.rename(columns={"molecule_chembl_id": "chembl_id"})
+        mols = mols.merge(per_mol, on="chembl_id", how="left")
+    else:
+        mols["mechanism_of_action"] = None
+        mols["targets"] = None
+
+    # join back to the drug set on inchi_key
+    enriched = drugs.merge(mols, on="inchi_key", how="left", suffixes=("", "_chembl"))
+    OUT.parent.mkdir(parents=True, exist_ok=True)
+    enriched.to_parquet(OUT, index=False)
+
+    n_resolved = enriched["chembl_id"].notna().sum()
+    n_moa = enriched["mechanism_of_action"].notna().sum()
+    print(f"\nenriched {len(enriched)} drugs: {n_resolved} matched ChEMBL, {n_moa} have MoA")
+    print(f"by reason, ChEMBL match rate:")
+    print(enriched.assign(matched=enriched["chembl_id"].notna()).groupby("inclusion_reason")["matched"].mean().round(2).to_string())
+    print(f"wrote {OUT}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/week2_curate_drugset.py
+++ b/scripts/week2_curate_drugset.py
@@ -0,0 +1,131 @@
+"""Week 2, task 1: curate the deliberately-composed ~300-drug set (PLAN §6).
+
+Composition: 2 ground-truth + ~50 related-mechanism + ~50 negative controls + ~200 random.
+The universe is restricted to compounds that actually have a LINCS Level-5 signature (in
+Phase I and/or Phase II), so every curated drug is scorable. Output: drug_set_v1.csv.
+"""
+
+from __future__ import annotations
+
+import gzip
+import io
+from pathlib import Path
+
+import pandas as pd
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src import RANDOM_SEED  # noqa: E402
+
+LINCS = Path("data/raw/lincs")
+OUT = Path("data/processed/drug_set_v1.csv")
+
+GROUND_TRUTH = ["hydroxyurea", "glutamine"]  # glutamine == L-glutamine in LINCS
+
+# Curated by mechanism (PLAN §6). Intersected with the LINCS catalog below, so misses are
+# silently dropped — we keep whatever actually has a signature.
+RELATED_MECHANISM = [
+    # HbF inducers / epigenetic
+    "decitabine", "azacitidine", "vorinostat", "panobinostat", "romidepsin", "entinostat",
+    "mocetinostat", "belinostat", "pomalidomide", "lenalidomide", "thalidomide", "apicidin",
+    "trichostatin-a", "scriptaid", "valproic-acid",
+    # NO / vascular
+    "sildenafil", "tadalafil", "nitroprusside",
+    # antioxidants
+    "n-acetyl-cysteine", "resveratrol", "curcumin", "quercetin", "sulforaphane",
+    # anti-inflammatory studied in SCD
+    "dexamethasone", "prednisolone", "hydrocortisone", "ibuprofen", "indomethacin",
+    "sulfasalazine", "montelukast", "aspirin",
+    # iron / heme / SCD-adjacent
+    "hemin", "deferoxamine", "deferasirox", "simvastatin", "atorvastatin", "ticagrelor",
+]
+
+NEGATIVE_CONTROL = [
+    # antifungals
+    "fluconazole", "ketoconazole", "itraconazole", "clotrimazole", "terbinafine", "miconazole",
+    # antihistamines
+    "loratadine", "cetirizine", "fexofenadine", "diphenhydramine", "chlorpheniramine",
+    "astemizole",
+    # antibiotics
+    "amoxicillin", "ciprofloxacin", "doxycycline", "trimethoprim", "azithromycin", "tetracycline",
+    "nitrofurantoin",
+    # hormones / contraceptives
+    "levonorgestrel", "ethinyl-estradiol", "norethindrone", "medroxyprogesterone-acetate",
+    # misc unrelated
+    "omeprazole", "ranitidine", "loperamide", "caffeine", "acetaminophen", "lidocaine",
+]
+
+# Fill the random sample so the total set is ~300 (the denominator the pre-registered
+# recovery-test thresholds assume: "top 30 of 300"). Curated mechanism/control drugs are
+# capped by what LINCS actually contains, so the random arm absorbs the remainder.
+TARGET_TOTAL = 300
+
+
+def load_catalog() -> pd.DataFrame:
+    """Compounds with >=1 Level-5 signature, annotated with phase + inchi/smiles."""
+
+    def read_gz(fn, **kw):
+        return pd.read_csv(io.BytesIO(gzip.decompress(Path(fn).read_bytes())), sep="\t", **kw)
+
+    sig1 = read_gz(LINCS / "GSE92742_sig_info.txt.gz", low_memory=False)
+    sig2 = read_gz(LINCS / "GSE70138_sig_info.txt.gz", low_memory=False)
+    cp1 = set(sig1[sig1["pert_type"] == "trt_cp"]["pert_iname"])
+    cp2 = set(sig2[sig2["pert_type"] == "trt_cp"]["pert_iname"])
+
+    pert1 = read_gz(LINCS / "GSE92742_pert_info.txt.gz", low_memory=False)
+    pert2 = read_gz(LINCS / "GSE70138_pert_info.txt.gz", low_memory=False)
+    info = pd.concat([pert1, pert2], ignore_index=True)
+    info = info[info["pert_type"] == "trt_cp"].drop_duplicates("pert_iname", keep="first")
+    info = info.set_index("pert_iname")
+
+    names = cp1 | cp2
+    rows = []
+    for nm in names:
+        phase = "both" if nm in cp1 and nm in cp2 else ("P1" if nm in cp1 else "P2")
+        rec = info.loc[nm] if nm in info.index else None
+        rows.append({
+            "pert_iname": nm,
+            "phase": phase,
+            "pert_id": rec["pert_id"] if rec is not None else None,
+            "inchi_key": rec["inchi_key"] if rec is not None else None,
+            "canonical_smiles": rec["canonical_smiles"] if rec is not None else None,
+        })
+    return pd.DataFrame(rows).set_index("pert_iname")
+
+
+def pick(catalog: pd.DataFrame, names: list[str], reason: str) -> pd.DataFrame:
+    present = [n for n in names if n in catalog.index]
+    missing = [n for n in names if n not in catalog.index]
+    if missing:
+        print(f"  [{reason}] {len(present)}/{len(names)} in LINCS; dropped: {missing}")
+    out = catalog.loc[present].copy()
+    out["inclusion_reason"] = reason
+    return out
+
+
+def main() -> None:
+    catalog = load_catalog()
+    print(f"LINCS scorable compound universe: {len(catalog)}")
+
+    gt = pick(catalog, GROUND_TRUTH, "ground_truth")
+    rel = pick(catalog, RELATED_MECHANISM, "related_mechanism")
+    neg = pick(catalog, NEGATIVE_CONTROL, "negative_control")
+
+    chosen = pd.concat([gt, rel, neg])
+    remaining = catalog.drop(index=chosen.index)
+    n_random = TARGET_TOTAL - len(chosen)
+    rand = remaining.sample(n=n_random, random_state=RANDOM_SEED).copy()
+    rand["inclusion_reason"] = "general_sample"
+
+    drug_set = pd.concat([gt, rel, neg, rand]).reset_index()
+    OUT.parent.mkdir(parents=True, exist_ok=True)
+    drug_set.to_csv(OUT, index=False)
+
+    print(f"\ndrug_set_v1.csv: {len(drug_set)} drugs")
+    print(drug_set["inclusion_reason"].value_counts().to_string())
+    print(f"phase split:\n{drug_set['phase'].value_counts().to_string()}")
+    print(f"wrote {OUT}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/week2_lincs_extract.py
+++ b/scripts/week2_lincs_extract.py
@@ -0,0 +1,107 @@
+"""Week 2, task 3: extract per-drug LINCS L1000 consensus signatures.
+
+For each drug in the set, slice its Level-5 MODZ signatures (978 landmark genes x its sig_ids)
+out of the big GCTX via cmapPy, then aggregate to ONE consensus 978-vector per drug by mean
+across its sig_ids (the "MODZ aggregation across cell lines/replicates" of PLAN §6). hydroxyurea
+lives in Phase II, L-glutamine in Phase I, so both phases are processed and merged.
+
+Output: data/processed/lincs_signatures_v1.parquet (rows = pert_iname, cols = 978 landmark
+gene symbols, values = mean MODZ z-score).
+
+Usage:
+    python scripts/week2_lincs_extract.py --phase 2   # test on Phase II (already downloaded)
+    python scripts/week2_lincs_extract.py             # both phases (needs Phase I gunzipped)
+"""
+
+from __future__ import annotations
+
+import argparse
+import gzip
+import io
+from pathlib import Path
+
+import pandas as pd
+from cmapPy.pandasGEXpress.parse import parse
+
+LINCS = Path("data/raw/lincs")
+DRUG_SET = Path("data/processed/drug_set_v1.csv")
+OUT = Path("data/processed/lincs_signatures_v1.parquet")
+
+GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
+SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
+
+
+def read_gz_tsv(name: str) -> pd.DataFrame:
+    return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
+
+
+def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
+    """Gene row-ids + id->symbol map for the scored gene space.
+
+    v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks.
+    This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into
+    scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier).
+    """
+    g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
+    ids = [str(x) for x in g["pr_gene_id"]]
+    id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
+    return ids, id_to_symbol
+
+
+def extract_phase(phase: int, drug_names: set[str], landmark_ids: list[str]) -> pd.DataFrame:
+    """Return DataFrame: rows=pert_iname, cols=landmark gene_id (str), one mean vector per drug."""
+    sig = read_gz_tsv(SIG_INFO[phase])
+    sig = sig[(sig["pert_type"] == "trt_cp") & (sig["pert_iname"].isin(drug_names))]
+    if sig.empty:
+        return pd.DataFrame()
+    sig_ids = sig["sig_id"].tolist()
+    print(f"[phase {phase}] {sig['pert_iname'].nunique()} drugs, {len(sig_ids)} signatures to slice", flush=True)
+
+    gctoo = parse(str(GCTX[phase]), rid=landmark_ids, cid=sig_ids)
+    data = gctoo.data_df  # rows=gene_id, cols=sig_id
+    sig_to_drug = dict(zip(sig["sig_id"], sig["pert_iname"]))
+    # mean across each drug's sig_ids -> one consensus vector per drug
+    per_drug = data.T.groupby(data.columns.map(sig_to_drug)).mean()
+    print(f"[phase {phase}] aggregated to {len(per_drug)} drug consensus vectors", flush=True)
+    return per_drug  # rows=pert_iname, cols=gene_id
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--phase", type=int, choices=[1, 2], default=None, help="single phase (test)")
+    args = ap.parse_args()
+
+    drugs = pd.read_csv(DRUG_SET)
+    drug_names = set(drugs["pert_iname"])
+    landmark_ids, id_to_symbol = landmark_ids_and_symbols()
+
+    phases = [args.phase] if args.phase else [1, 2]
+    frames = []
+    for ph in phases:
+        if not GCTX[ph].exists():
+            print(f"[phase {ph}] {GCTX[ph]} missing — skipping")
+            continue
+        frames.append(extract_phase(ph, drug_names, landmark_ids))
+
+    frames = [f for f in frames if not f.empty]
+    if not frames:
+        print("no signatures extracted")
+        return
+    # A drug present in both phases: average the two phase consensus vectors.
+    combined = pd.concat(frames).groupby(level=0).mean()
+    combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]  # gene_id -> symbol
+
+    covered = sorted(set(combined.index))
+    missing = sorted(drug_names - set(covered))
+    print(f"\nsignatures extracted for {len(covered)}/{len(drug_names)} drugs")
+    for gt in ["hydroxyurea", "glutamine"]:
+        print(f"  ground truth '{gt}': {'OK' if gt in covered else 'MISSING'}")
+    if args.phase is None:
+        combined.to_parquet(OUT)
+        print(f"wrote {OUT} ({combined.shape[0]} drugs x {combined.shape[1]} landmark genes)")
+        if missing:
+            print(f"{len(missing)} drugs without signature (will be marked not-scored in Week 3)")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/week3_scoring.py
+++ b/scripts/week3_scoring.py
@@ -0,0 +1,82 @@
+"""Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration.
+
+Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of
+random queries) — this calibrates out broad-effect drugs that connect to random signatures too,
+the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column,
+and a secondary ranking blends tau with the sickle-pathway mechanistic prior.
+
+Output: data/results/ranked_candidates_v1.csv.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pandas as pd
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from src.scoring import (  # noqa: E402
+    connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate,
+)
+
+PROCESSED = Path("data/processed")
+N_NULL = 1000
+PRIOR_LAMBDA = 0.5  # spec_z credit per matched sickle pathway, for the blended ranking
+
+
+def main() -> None:
+    sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
+    up = [g["gene"] for g in sig["up_regulated"]]
+    down = [g["gene"] for g in sig["down_regulated"]]
+
+    sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")  # drug x 12,328 genes
+    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
+
+    n_up = len(set(up) & set(sig_matrix.columns))
+    n_down = len(set(down) & set(sig_matrix.columns))
+    print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}")
+
+    # primary: tau calibration
+    print(f"computing tau over {N_NULL} random-query permutations ...", flush=True)
+    ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL)
+
+    # reference: weighted connectivity score (WTCS) + NCS
+    wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index},
+                     name="connectivity_score")
+    ranked["connectivity_score"] = wtcs
+    ranked["normalized_score"] = normalize_scores(wtcs)
+
+    # metadata + mechanistic prior
+    ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
+    ranked["mechanistic_prior"] = ranked["targets"].apply(
+        lambda t: mechanistic_prior(list(t) if t is not None else []))
+    ranked["known_targets"] = ranked["targets"].apply(
+        lambda t: "; ".join(t) if t is not None and len(t) else "")
+    ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})
+
+    # secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z)
+    ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
+    ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)
+
+    out = ranked.rename_axis("drug_name").reset_index()[[
+        "rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score",
+        "inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
+    ]]
+    path = persist_ranking(out)
+    print(f"wrote {path} ({len(out)} drugs)")
+
+    print("\n--- sanity peek (spec_z ranking) ---")
+    for gt in ["hydroxyurea", "glutamine"]:
+        r = ranked.loc[gt]
+        print(f"  {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), "
+              f"spec_z={r['spec_z']:.2f}")
+    print("  top 10 by spec_z:")
+    for name, r in ranked.nsmallest(10, "spec_z").iterrows():
+        print(f"    {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] "
+              f"{str(r['known_targets'])[:38]}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/week4_recovery_test.py
+++ b/scripts/week4_recovery_test.py
@@ -0,0 +1,81 @@
+"""Week 4: formal recovery test against the pre-registered criteria (PLAN §6).
+
+Pre-registered criteria (committed in docs/recovery_test_report.md before this run):
+  - hydroxyurea in top 10% (top 30 of 300), AND
+  - L-glutamine in top 25% (top 75) OR documented unscorable due to missing LINCS signature, AND
+  - >=4 of 5 pre-specified negative controls in the bottom half.
+
+The 5 negative controls are pre-specified here by a category rule (one per category, alphabetically
+first available) so the choice does not peek at ranks. Primary ranking = raw connectivity.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pandas as pd
+
+RANKED = Path("data/results/ranked_candidates_v1.csv")
+
+# One per unrelated category, alphabetical-first — chosen without looking at ranks.
+NEG_CONTROL_CATEGORIES = {
+    "antifungal": ["clotrimazole", "fluconazole", "itraconazole", "ketoconazole", "miconazole", "terbinafine"],
+    "antihistamine": ["astemizole", "cetirizine", "diphenhydramine", "fexofenadine", "loratadine"],
+    "antibiotic": ["azithromycin", "ciprofloxacin", "doxycycline", "tetracycline", "trimethoprim"],
+    "hormone": ["ethinyl-estradiol", "levonorgestrel", "medroxyprogesterone-acetate", "norethindrone"],
+    "misc": ["caffeine", "lidocaine", "loperamide", "omeprazole", "ranitidine"],
+}
+
+
+def main() -> None:
+    df = pd.read_csv(RANKED).set_index("drug_name")
+    n = len(df)
+    top10_cut, top25_cut, half = int(n * 0.10), int(n * 0.25), n // 2
+
+    def rk(name):
+        return int(df.loc[name, "rank"]) if name in df.index else None
+
+    hu, glut = rk("hydroxyurea"), rk("glutamine")
+    glut_z = df.loc["glutamine", "spec_z"]
+
+    # pick negative controls present in the ranking
+    negs = {}
+    for cat, options in NEG_CONTROL_CATEGORIES.items():
+        pick = next((d for d in options if d in df.index), None)
+        if pick:
+            negs[pick] = (cat, rk(pick))
+
+    print("=" * 60)
+    print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
+    print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%)  -> top-10%? {hu <= top10_cut}")
+    print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), spec_z={glut_z:+.2f}  "
+          f"-> top-25%? {glut <= top25_cut}  (positive z => does not reverse; has a signature)")
+    print("\nnegative controls (pre-specified, 1 per category):")
+    n_bottom = 0
+    for d, (cat, r) in negs.items():
+        in_bottom = r > half
+        n_bottom += in_bottom
+        print(f"  {d:18s} [{cat:13s}] rank {r:3d}  bottom-half? {in_bottom}")
+    print(f"  -> {n_bottom}/5 in bottom half (need >=4)")
+
+    crit_hu = hu <= top10_cut
+    crit_glut = glut <= top25_cut
+    crit_neg = n_bottom >= 4
+    overall = crit_hu and crit_glut and crit_neg
+    print(f"\nCRITERIA: hydroxyurea={crit_hu}, L-glutamine={crit_glut}, neg-controls={crit_neg}")
+    print(f"OVERALL (raw ranking): {'PASS' if overall else 'FAIL'}")
+
+    # secondary prior-weighted view (reported, not the primary criterion)
+    hu_b = int(df.loc["hydroxyurea", "blended_rank"])
+    print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
+          f"(top {100*hu_b/n:.1f}%)")
+
+    print("\n--- TOP 10 (primary spec_z ranking) ---")
+    top10 = df.sort_values("rank").head(10)
+    for name, r in top10.iterrows():
+        print(f"  {int(r['rank']):2d}  {name:18s} z={r['spec_z']:+.2f}  "
+              f"[{r['inclusion_reason']}]  {str(r['known_targets'])[:45]}")
+
+
+if __name__ == "__main__":
+    main()
--- a/src/binding.py
+++ b/src/binding.py
@@ -0,0 +1,89 @@
+"""Structure-based binding track (PLAN §12).
+
+Two capabilities:
+- ligand-based retrieval (RDKit, works now): find existing drugs near a query molecule in
+  chemical space — validated, and the engine behind §12.9 generative-guided retrieval.
+- structure-based docking (§12.3): score whether a ligand binds a target pocket. Blocked on an
+  ARM-Mac docking toolchain (AutoDock Vina does not pip-install); see ``dock`` for options.
+
+Caveat carried throughout: chemical similarity != activity, and docking != efficacy (§12.6).
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from rdkit import Chem, DataStructs, RDLogger
+from rdkit.Chem import rdFingerprintGenerator
+
+RDLogger.DisableLog("rdApp.*")
+_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
+
+STRUCT_DIR = Path("data/raw/structures")
+
+# Known sickle small-molecule binders, by target (positive controls for the §12.4 recovery test).
+KNOWN_BINDERS = {
+    "hemoglobin": "voxelotor",
+    "PKR": "mitapivat",
+    "DNMT1": "decitabine",
+    "HDAC": "vorinostat",
+}
+
+# Curated target structures (PLAN §12.2). Add PDB ids as the harness grows.
+TARGET_PDB = {
+    "hemoglobin": "5E83",  # hemoglobin + voxelotor (GBT440)
+    "PKR": "8XFD",         # pyruvate kinase R + mitapivat
+}
+
+
+def morgan_fp(smiles: str):
+    """Morgan (ECFP4) fingerprint, or None for invalid / missing SMILES ('-666', '')."""
+    if not isinstance(smiles, str) or smiles in ("-666", ""):
+        return None
+    mol = Chem.MolFromSmiles(smiles)
+    return _MORGAN.GetFingerprint(mol) if mol else None
+
+
+def tanimoto(smiles_a: str, smiles_b: str) -> float | None:
+    fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
+    if fa is None or fb is None:
+        return None
+    return DataStructs.TanimotoSimilarity(fa, fb)
+
+
+def retrieve_nearest(
+    query_smiles: str,
+    library: dict[str, str],
+    top_n: int = 5,
+) -> list[tuple[float, str]]:
+    """Rank a library of {name: smiles} by Tanimoto similarity to a query molecule.
+
+    This is the §12.9 retrieval step: the query may be a known binder (positive-control beacon)
+    or a generated idealised binder; the returned existing drugs are repurposing candidates that
+    STILL require docking/validation (similarity != activity).
+    """
+    qfp = morgan_fp(query_smiles)
+    if qfp is None:
+        raise ValueError("invalid query SMILES")
+    sims = []
+    for name, smi in library.items():
+        fp = morgan_fp(smi)
+        if fp is not None:
+            sims.append((DataStructs.TanimotoSimilarity(qfp, fp), name))
+    return sorted(sims, reverse=True)[:top_n]
+
+
+def dock(target: str, ligand_smiles: str) -> float:
+    """Dock a ligand into a target pocket and return a binding score (PLAN §12.3).
+
+    Blocked: AutoDock Vina does not pip-install on ARM Mac and no docking binary is on PATH.
+    Resolve the toolchain first (one of):
+      - conda/micromamba: ``vina`` (conda-forge) or ``smina`` (bioconda), osx-arm64 builds
+      - an AF3-class co-folding model on GPU: Boltz-2 / Chai-1 / DiffDock (also predicts affinity)
+      - x86 Vina binary under Rosetta 2
+    Then: fetch TARGET_PDB[target], define the pocket box, prep the ligand (Meeko), score.
+    """
+    raise NotImplementedError(
+        "Docking toolchain unresolved on ARM Mac (PLAN §12.6 pitfall 4 / §12.8). "
+        "See docstring for options."
+    )
--- a/src/disease.py
+++ b/src/disease.py
@@ -1,19 +1,27 @@
 """Disease signature construction.

 Week 1 (PLAN.md §6). Builds a Tier-A sickle cell signature from GEO expression data via
-differential expression, then persists it with full provenance to
-``data/processed/sickle_cell_signature_v1.json``.
+differential expression on TWO studies, keeping only genes that are concordant across both
+(the multi-source evidence that earns Tier A — see project memory). Persists with full
+provenance to ``data/processed/sickle_cell_signature_v1.json``.

-This module defines the persisted schema (pydantic) and the construction stubs. The actual
-data pull + differential expression is driven from ``notebooks/02_disease_signature.ipynb``.
+Pipeline (driven from ``notebooks/02_disease_signature.ipynb``):
+    per study:  compute_differential_expression -> collapse_probes_to_symbols
+    across:     concordance_filter -> build_signature -> persist_signature
+
+Conventions:
+    - log_fc > 0  =>  up-regulated in DISEASE vs HEALTHY (Week 1 refinement R1)
+    - cross-study join key is the HGNC gene symbol (R2)
 """

 from __future__ import annotations

 from pathlib import Path

+import numpy as np
 import pandas as pd
 from pydantic import BaseModel, Field
+from scipy.stats import false_discovery_control, ttest_ind

 from . import PIPELINE_VERSION, PROCESSED_DIR
 from .provenance import ConfidenceTier
@@ -22,6 +30,10 @@ from .provenance import ConfidenceTier
 TOP_N_PER_DIRECTION = 250
 QVALUE_CUTOFF = 0.05

+# Group labels used throughout. The disease group is the "treatment" arm in DE contrasts.
+DISEASE_LABEL = "disease"
+HEALTHY_LABEL = "healthy"
+

 class GeneEntry(BaseModel):
    """A single differentially expressed gene in the signature."""
@@ -33,15 +45,37 @@ class GeneEntry(BaseModel):
    qvalue: float


-class SignatureProvenance(BaseModel):
-    """Provenance block for a disease signature (PLAN.md §6 schema)."""
+class StudyProvenance(BaseModel):
+    """Provenance for one contributing GEO study (Week 1 refinement R6)."""

    geo_accession: str
    n_disease: int
    n_healthy: int
    platform: str
-    method: str = Field(..., description="Differential expression method, e.g. 'limma', 'deseq2'.")
+    tissue: str
+    method: str = Field(..., description="DE method: 'welch' (microarray) or 'deseq2' (RNA-seq).")
+
+
+class ConcordanceSummary(BaseModel):
+    """How the two studies were reconciled into the signature (R6)."""
+
+    rule: str = Field(
+        default="q<0.05 in both studies AND same sign of log_fc in both",
+        description="The concordance rule applied across studies.",
+    )
+    n_genes_tested: int = Field(..., description="Genes present in both studies after symbol collapse.")
+    n_concordant: int
+    n_up: int
+    n_down: int
+
+
+class SignatureProvenance(BaseModel):
+    """Provenance block for a disease signature (revised for 2-study concordance, R6)."""
+
+    studies: list[StudyProvenance]
+    concordance: ConcordanceSummary
    created_date: str
+    pipeline_version: str = PIPELINE_VERSION


 class DiseaseSignature(BaseModel):
@@ -58,44 +92,213 @@ class DiseaseSignature(BaseModel):
    limitations: list[str]


+def _looks_like_linear_intensity(expression: pd.DataFrame) -> bool:
+    """Heuristic: microarray data still on a linear scale has a large dynamic range.
+
+    log2-transformed intensities are typically <~20; raw intensities run into the thousands.
+    """
+    return float(np.nanmax(expression.to_numpy())) > 50.0
+
+
+def _welch_de(expression: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
+    """Per-gene Welch t-test + Benjamini-Hochberg, the limma-equivalent for microarray.
+
+    Assumes ``expression`` is genes (rows) x samples (columns), on (or coercible to) a log2
+    scale. ``log_fc`` is mean(disease) - mean(healthy) (R1 sign convention).
+    """
+    expr = expression.copy()
+    if _looks_like_linear_intensity(expr):
+        # log2(x+1) guards against zeros/negatives while keeping the transform standard.
+        expr = np.log2(expr.clip(lower=0) + 1.0)
+
+    disease_cols = groups.index[groups == DISEASE_LABEL]
+    healthy_cols = groups.index[groups == HEALTHY_LABEL]
+    disease = expr[disease_cols].to_numpy()
+    healthy = expr[healthy_cols].to_numpy()
+
+    log_fc = disease.mean(axis=1) - healthy.mean(axis=1)
+    # Welch's t-test (unequal variance) per gene across samples.
+    _, pvalue = ttest_ind(disease, healthy, axis=1, equal_var=False, nan_policy="omit")
+
+    out = pd.DataFrame({"log_fc": log_fc, "pvalue": pvalue}, index=expr.index)
+    out = out.dropna(subset=["pvalue"])
+    out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
+    return out.sort_values("qvalue")
+
+
+def _deseq2_de(counts: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
+    """RNA-seq differential expression via pydeseq2.
+
+    ``counts`` is genes (rows) x samples (columns) of raw integer counts. Returns the same
+    schema as ``_welch_de`` (log_fc = log2FC of disease vs healthy, plus pvalue/qvalue).
+    """
+    from pydeseq2.dds import DeseqDataSet
+    from pydeseq2.ds import DeseqStats
+
+    # pydeseq2 wants samples (rows) x genes (cols), with an aligned metadata frame.
+    counts_t = counts.T.round().astype(int)
+    metadata = pd.DataFrame({"condition": groups.reindex(counts_t.index).to_numpy()}, index=counts_t.index)
+
+    dds = DeseqDataSet(counts=counts_t, metadata=metadata, design="~condition", quiet=True)
+    dds.deseq2()
+    stats = DeseqStats(dds, contrast=["condition", DISEASE_LABEL, HEALTHY_LABEL], quiet=True)
+    stats.summary()
+    res = stats.results_df.rename(columns={"log2FoldChange": "log_fc", "pvalue": "pvalue", "padj": "qvalue"})
+    return res[["log_fc", "pvalue", "qvalue"]].dropna(subset=["qvalue"]).sort_values("qvalue")
+
+
 def compute_differential_expression(
    expression: pd.DataFrame,
    sample_groups: pd.Series,
    *,
    method: str,
 ) -> pd.DataFrame:
-    """Compute gene-level log fold change and adjusted p-values.
-
-    For RNA-seq use ``pydeseq2``; for microarray log2-transform/normalize and use a
-    limma-equivalent (PLAN.md §6, Week 1 task 4).
+    """Compute gene-level log fold change and adjusted p-values for one study.

    Args:
-        expression: Genes (rows) x samples (columns) expression matrix.
-        sample_groups: Per-sample group label ('disease' / 'healthy'), indexed by sample.
-        method: 'deseq2' (RNA-seq) or 'limma' (microarray).
+        expression: Genes (rows) x samples (columns). Microarray intensities or RNA-seq counts.
+        sample_groups: Per-sample 'disease'/'healthy' label, indexed by sample id (column).
+        method: 'welch' (microarray, PLAN choice) or 'deseq2' (RNA-seq).

    Returns:
-        A table indexed by gene with at least ``log_fc`` and ``qvalue`` columns.
+        Table indexed by gene/probe with ``log_fc``, ``pvalue``, ``qvalue`` (R1 sign convention).
    """
-    raise NotImplementedError("Differential expression: implement in Week 1 (notebook 02).")
+    groups = sample_groups.reindex(expression.columns)
+    if groups.isna().any():
+        missing = list(groups.index[groups.isna()])
+        raise ValueError(f"sample_groups missing labels for samples: {missing[:5]}...")
+
+    if method == "welch":
+        return _welch_de(expression, groups)
+    if method == "deseq2":
+        return _deseq2_de(expression, groups)
+    raise ValueError(f"Unknown method {method!r}; expected 'welch' or 'deseq2'.")
+
+
+def collapse_probes_to_symbols(
+    de_table: pd.DataFrame,
+    probe_to_symbol: pd.Series,
+    expression_for_ranking: pd.DataFrame | None = None,
+) -> pd.DataFrame:
+    """Collapse a probe-level DE table to one row per HGNC symbol (R2).
+
+    When multiple probes map to the same symbol, keep the probe with the highest mean
+    expression (the standard collapseRows heuristic) if ``expression_for_ranking`` is given;
+    otherwise keep the probe with the smallest qvalue.
+
+    Args:
+        de_table: DE table indexed by probe id (from ``compute_differential_expression``).
+        probe_to_symbol: Probe id -> HGNC symbol mapping.
+        expression_for_ranking: Optional genes x samples matrix to rank duplicate probes.
+
+    Returns:
+        DE table indexed by HGNC symbol.
+    """
+    df = de_table.copy()
+    df["symbol"] = probe_to_symbol.reindex(df.index)
+    df = df.dropna(subset=["symbol"])
+
+    if expression_for_ranking is not None:
+        rank_key = expression_for_ranking.reindex(df.index).mean(axis=1)
+        df = df.assign(_rank=rank_key).sort_values("_rank", ascending=False)
+    else:
+        df = df.sort_values("qvalue", ascending=True)
+
+    collapsed = df[~df["symbol"].duplicated(keep="first")].set_index("symbol")
+    return collapsed.drop(columns=[c for c in ("_rank",) if c in collapsed.columns])
+
+
+def concordance_filter(
+    de_a: pd.DataFrame,
+    de_b: pd.DataFrame,
+    *,
+    qvalue_cutoff: float = QVALUE_CUTOFF,
+) -> tuple[pd.DataFrame, ConcordanceSummary]:
+    """Keep only genes concordant across two symbol-level DE tables (R3).
+
+    A gene qualifies iff q<``qvalue_cutoff`` in BOTH studies AND log_fc has the same sign in
+    both. Reported ``log_fc`` = mean of the two; ``qvalue`` = max of the two (worst case).
+    Ranked by ``max(q_a, q_b)`` ascending.
+
+    Returns:
+        (concordant_table indexed by symbol with log_fc/qvalue, ConcordanceSummary).
+    """
+    merged = de_a.join(de_b, lsuffix="_a", rsuffix="_b", how="inner")
+    n_tested = len(merged)
+
+    sig_both = (merged["qvalue_a"] < qvalue_cutoff) & (merged["qvalue_b"] < qvalue_cutoff)
+    same_sign = np.sign(merged["log_fc_a"]) == np.sign(merged["log_fc_b"])
+    keep = merged[sig_both & same_sign].copy()
+
+    keep["log_fc"] = (keep["log_fc_a"] + keep["log_fc_b"]) / 2.0
+    keep["qvalue"] = keep[["qvalue_a", "qvalue_b"]].max(axis=1)
+    keep = keep[["log_fc", "qvalue"]].sort_values("qvalue")
+
+    summary = ConcordanceSummary(
+        n_genes_tested=n_tested,
+        n_concordant=len(keep),
+        n_up=int((keep["log_fc"] > 0).sum()),
+        n_down=int((keep["log_fc"] < 0).sum()),
+    )
+    return keep, summary


 def build_signature(
-    de_table: pd.DataFrame,
+    concordant_table: pd.DataFrame,
    provenance: SignatureProvenance,
    *,
    tier: ConfidenceTier,
    tier_rationale: str,
    limitations: list[str],
+    id_map: pd.DataFrame | None = None,
    top_n: int = TOP_N_PER_DIRECTION,
-    qvalue_cutoff: float = QVALUE_CUTOFF,
 ) -> DiseaseSignature:
-    """Assemble a ``DiseaseSignature`` from a differential expression table.
+    """Assemble a ``DiseaseSignature`` from the concordant gene table.

-    Takes the top ``top_n`` up- and down-regulated genes (by qvalue, cut at
-    ``qvalue_cutoff``) per PLAN.md §6, Week 1 task 5.
+    Takes the top ``top_n`` up- and down-regulated genes by qvalue per direction (R3). If
+    fewer than ``top_n`` qualify in a direction, takes all available (the caller should have
+    logged the count). ``id_map`` optionally provides 'entrez_id'/'ensembl_id' columns indexed
+    by symbol (from ``mygene``).
+
+    Args:
+        concordant_table: Output of ``concordance_filter`` (indexed by symbol).
+        provenance: Fully populated ``SignatureProvenance`` (both studies + concordance).
+        tier: Confidence tier (Tier A only if genuinely multi-source).
+        tier_rationale: One-line justification of the tier.
+        limitations: Honest limitations list (R8).
+        id_map: Optional symbol -> {entrez_id, ensembl_id} table.
+        top_n: Max genes per direction.
+
+    Returns:
+        A populated ``DiseaseSignature``.
    """
-    raise NotImplementedError("Signature assembly: implement in Week 1 (notebook 02).")
+
+    def _entries(rows: pd.DataFrame) -> list[GeneEntry]:
+        entries: list[GeneEntry] = []
+        for symbol, row in rows.iterrows():
+            ids = id_map.loc[symbol] if (id_map is not None and symbol in id_map.index) else None
+            entries.append(
+                GeneEntry(
+                    gene=str(symbol),
+                    entrez_id=(str(ids["entrez_id"]) if ids is not None and pd.notna(ids.get("entrez_id")) else None),
+                    ensembl_id=(str(ids["ensembl_id"]) if ids is not None and pd.notna(ids.get("ensembl_id")) else None),
+                    log_fc=float(row["log_fc"]),
+                    qvalue=float(row["qvalue"]),
+                )
+            )
+        return entries
+
+    up = concordant_table[concordant_table["log_fc"] > 0].sort_values("qvalue").head(top_n)
+    down = concordant_table[concordant_table["log_fc"] < 0].sort_values("qvalue").head(top_n)
+
+    return DiseaseSignature(
+        up_regulated=_entries(up),
+        down_regulated=_entries(down),
+        provenance=provenance,
+        confidence_tier=tier,
+        tier_rationale=tier_rationale,
+        limitations=limitations,
+    )


 def persist_signature(signature: DiseaseSignature, out_path: Path | None = None) -> Path:
--- a/src/scoring.py
+++ b/src/scoring.py
@@ -1,33 +1,65 @@
-"""CMap-style connectivity scoring — the matching engine.
+"""CMap-style connectivity scoring — the matching engine (Week 3, PLAN §6).

-Week 3 (PLAN.md §6). Scores each drug's LINCS signature against the disease signature using
-weighted Kolmogorov-Smirnov enrichment (Lamb 2006 / Subramanian 2017). Strongly *negative*
-connectivity = strong reversal of the disease signature = candidate match.
+Scores each drug's LINCS consensus signature against the disease signature using the weighted
+Kolmogorov-Smirnov / GSEA enrichment statistic (Lamb 2006; Subramanian 2017). The query is the
+disease up/down gene sets; the reference is each drug's 978 landmark genes ranked by z-score.

-Uses ``cmapPy`` as the reference implementation. ``tests/test_scoring.py`` verifies the
-implementation against a known reference.
+Sign convention (PLAN §6): strongly **negative** connectivity = strong **reversal** of the
+disease signature = candidate match. A drug that down-regulates the disease's up-genes and
+up-regulates its down-genes scores negative.
 """

 from __future__ import annotations

 from pathlib import Path

+import numpy as np
 import pandas as pd
-from pydantic import BaseModel

-from . import RESULTS_DIR
+from . import RANDOM_SEED, RESULTS_DIR
+
+# Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
+# Keys are pathway categories; values are substrings matched (case-insensitive) against a
+# drug's ChEMBL target names.
+SICKLE_PATHWAYS: dict[str, tuple[str, ...]] = {
+    "hbf_epigenetic": ("histone deacetylase", "hdac", "methyltransferase", "dnmt",
+                        "ribonucleoside-diphosphate reductase", "ribonucleotide reductase"),
+    "hemoglobin": ("hemoglobin", "globin"),
+    "no_signaling": ("nitric oxide", "guanylate cyclase", "phosphodiesterase 5", "pde5"),
+    "inflammation": ("cyclooxygenase", "prostaglandin", "nf-kappa", "interleukin",
+                     "leukotriene", "selectin", "tumor necrosis factor"),
+    "oxidative_stress": ("glutathione", "superoxide", "nadph oxidase", "thioredoxin", "nrf2"),
+}


-class ConnectivityResult(BaseModel):
-    """Connectivity score for a single drug against the disease signature."""
+def _enrichment_score(drug_profile: pd.Series, gene_set: set[str], weight: float = 1.0) -> float:
+    """Weighted GSEA/KS enrichment score of ``gene_set`` in a drug's ranked profile.

-    chembl_id: str
-    drug_name: str
-    connectivity_score: float | None  # None when the drug has no LINCS signature.
-    normalized_score: float | None = None
-    p_value: float | None = None
-    scored: bool  # False => no signature available, not scored (do not skip silently).
-    n_genes_overlap: int | None = None
+    The profile is ranked by z-score (most up-regulated first). Hits increment a running sum in
+    proportion to ``|z|**weight``; misses decrement uniformly. ES is the max signed deviation
+    from zero. ES>0 => set enriched among up-regulated genes; ES<0 => among down-regulated.
+    """
+    s = drug_profile.sort_values(ascending=False)
+    genes = s.index.to_numpy()
+    vals = s.to_numpy(dtype=float)
+    hit = np.fromiter((g in gene_set for g in genes), dtype=bool, count=len(genes))
+
+    n_hit = int(hit.sum())
+    n = len(genes)
+    if n_hit == 0 or n_hit == n:
+        return 0.0
+
+    w = (np.abs(vals) ** weight) * hit
+    sum_hit = w.sum()
+    if sum_hit == 0:
+        return 0.0
+
+    inc = w / sum_hit
+    dec = (~hit) / (n - n_hit)
+    running = np.cumsum(inc - dec)
+
+    hi, lo = running.max(), running.min()
+    return float(hi if abs(hi) >= abs(lo) else lo)


 def connectivity_score(
@@ -35,46 +67,157 @@ def connectivity_score(
    down_genes: list[str],
    drug_signature: pd.Series,
 ) -> float:
-    """Weighted KS connectivity score for one drug vs the disease up/down gene sets.
+    """Weighted connectivity score (WTCS) for one drug vs the disease up/down sets.

-    Only the intersection of disease-signature genes and LINCS landmark genes is scored;
-    callers must record the overlap count (PLAN.md §6, Week 3 task 2).
-
-    Args:
-        up_genes: Disease up-regulated gene identifiers.
-        down_genes: Disease down-regulated gene identifiers.
-        drug_signature: Drug's expression vector indexed by gene identifier.
-
-    Returns:
-        Connectivity score in roughly [-1, 1]; strongly negative = strong reversal.
+    Only query genes present in the drug's profile index (the 978 landmarks) are used — callers
+    should record the overlap count (PLAN §6 Week 3 task 2). Returns the WTCS: if the two
+    enrichment scores share a sign the result is 0 (ambiguous), else ``(ES_up - ES_down)/2``.
+    Negative => reversal => candidate.
    """
-    raise NotImplementedError("Connectivity scoring: implement in Week 3 (notebook 04).")
+    profile_genes = set(drug_signature.index)
+    up = set(up_genes) & profile_genes
+    down = set(down_genes) & profile_genes
+
+    es_up = _enrichment_score(drug_signature, up)
+    es_down = _enrichment_score(drug_signature, down)
+
+    if np.sign(es_up) == np.sign(es_down):
+        return 0.0
+    return (es_up - es_down) / 2.0
+
+
+def normalize_scores(scores: pd.Series) -> pd.Series:
+    """Signed normalization (NCS, Subramanian 2017): divide by the mean magnitude of same-sign
+    scores, so positive and negative tails are separately scaled to a mean magnitude of 1."""
+    out = scores.astype(float).copy()
+    pos_mean = scores[scores > 0].mean()
+    neg_mean = scores[scores < 0].abs().mean()
+    if pos_mean and not np.isnan(pos_mean):
+        out[scores > 0] = scores[scores > 0] / pos_mean
+    if neg_mean and not np.isnan(neg_mean):
+        out[scores < 0] = scores[scores < 0] / neg_mean
+    return out


 def rank_drugs(
-    signature_up: list[str],
-    signature_down: list[str],
-    drug_profiles: pd.DataFrame,
+    up_genes: list[str],
+    down_genes: list[str],
+    signature_matrix: pd.DataFrame,
 ) -> pd.DataFrame:
-    """Score and rank all drugs against the disease signature.
+    """Score and rank all drugs (rows of ``signature_matrix``: drug x landmark-gene z-scores).

-    Drugs without a LINCS signature are marked ``scored=False`` and excluded from the ranking
-    rather than dropped silently (PLAN.md §6, Week 3 task 2).
-
-    Returns a ranked table with the columns described in PLAN.md §6 (rank, drug_name,
-    chembl_id, connectivity_score, normalized_score, p_value, inclusion_reason,
-    known_targets, mechanism_summary).
+    Returns a table indexed by drug with ``rank`` (1 = strongest reversal = most negative),
+    ``connectivity_score`` and ``normalized_score``. Drugs are expected to all have signatures
+    here; signature-less drugs are handled (marked not-scored) by the orchestration layer per
+    PLAN §6 Week 3 task 2.
    """
-    raise NotImplementedError("Drug ranking: implement in Week 3 (notebook 04).")
+    scores = pd.Series(
+        {drug: connectivity_score(up_genes, down_genes, signature_matrix.loc[drug])
+         for drug in signature_matrix.index},
+        name="connectivity_score",
+    )
+    df = pd.DataFrame({"connectivity_score": scores, "normalized_score": normalize_scores(scores)})
+    df = df.sort_values("connectivity_score")  # most negative (reversal) first
+    df.insert(0, "rank", range(1, len(df) + 1))
+    return df


 def mechanistic_prior(targets: list[str]) -> float:
-    """Prior weight for a drug based on sickle-cell-relevant target pathways.
+    """Count of sickle-cell-relevant pathway categories a drug's targets hit (PLAN §6 task 3).

-    Pathways of interest: HbF regulation, hemoglobin, NO signaling, inflammation, oxidative
-    stress (PLAN.md §6, Week 3 task 3). Used to build the secondary, prior-weighted ranking.
+    Higher = more mechanistically plausible. Used to build the secondary, prior-weighted ranking
+    alongside the raw connectivity ranking.
    """
-    raise NotImplementedError("Mechanistic prior: implement in Week 3 (notebook 04).")
+    if not targets:
+        return 0.0
+    text = " ; ".join(str(t) for t in targets).lower()
+    return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))
+
+
+# --- KS connectivity + tau calibration (v1.1) -----------------------------------------------
+# Unweighted Kolmogorov-Smirnov connectivity (Lamb 2006) is O(k) per query (depends only on the
+# ranks of the query genes), which makes a permutation null over many random queries cheap. tau
+# expresses each drug's real connectivity as a signed percentile within its own null — so
+# broad-effect drugs that connect to *random* signatures too get down-weighted (specificity).
+
+
+def _ks_es(rank_matrix: np.ndarray, query_cols: np.ndarray, n_genes: int) -> np.ndarray:
+    """Vectorized unweighted KS enrichment score of a query gene set for every drug.
+
+    ``rank_matrix`` is (n_drugs, n_genes) of 1..N rank positions (1 = most up-regulated).
+    Returns one ES per drug; ES>0 => query enriched among up-regulated genes.
+    """
+    k = len(query_cols)
+    if k == 0:
+        return np.zeros(rank_matrix.shape[0])
+    p = np.sort(rank_matrix[:, query_cols], axis=1)  # (n_drugs, k), positions ascending
+    j = np.arange(1, k + 1)
+    a = (j / k - p / n_genes).max(axis=1)
+    b = (p / n_genes - (j - 1) / k).max(axis=1)
+    return np.where(a >= b, a, -b)
+
+
+def _ks_connectivity(rank_matrix: np.ndarray, up_cols: np.ndarray, down_cols: np.ndarray,
+                     n_genes: int) -> np.ndarray:
+    """KS connectivity per drug: (ES_up - ES_down)/2. Negative=reversal.
+
+    Note: unlike WTCS, this does NOT zero same-sign (ambiguous) connections — same-sign ES
+    partially cancel and land near 0 naturally. Hard-zeroing would collapse the random-query
+    null to a spike at 0 and make tau saturate, so the continuous form is required for calibration.
+    """
+    es_up = _ks_es(rank_matrix, up_cols, n_genes)
+    es_down = _ks_es(rank_matrix, down_cols, n_genes)
+    return (es_up - es_down) / 2.0
+
+
+def tau_calibrate(
+    up_genes: list[str],
+    down_genes: list[str],
+    signature_matrix: pd.DataFrame,
+    n_null: int = 1000,
+    seed: int = RANDOM_SEED,
+) -> pd.DataFrame:
+    """Rank drugs by tau: each drug's KS connectivity as a signed percentile vs a null of
+    random queries of the same size (PLAN §6; CMap tau, Subramanian 2017).
+
+    tau in [-100, 100]: -100 => reverses our signature more specifically than any random query
+    (strong, specific candidate); ~0 => connects to our signature no more than to random ones
+    (broad-effect / non-specific). Ranked by tau ascending (rank 1 = most specific reversal).
+    """
+    genes = list(signature_matrix.columns)
+    gene_to_col = {g: i for i, g in enumerate(genes)}
+    n = len(genes)
+    rank_matrix = signature_matrix.rank(axis=1, ascending=False).to_numpy()
+
+    up_cols = np.array([gene_to_col[g] for g in set(up_genes) if g in gene_to_col], dtype=int)
+    down_cols = np.array([gene_to_col[g] for g in set(down_genes) if g in gene_to_col], dtype=int)
+    real = _ks_connectivity(rank_matrix, up_cols, down_cols, n)
+
+    rng = np.random.default_rng(seed)
+    k_up, k_down = len(up_cols), len(down_cols)
+    null = np.empty((rank_matrix.shape[0], n_null))
+    for m in range(n_null):
+        samp = rng.choice(n, size=k_up + k_down, replace=False)
+        null[:, m] = _ks_connectivity(rank_matrix, samp[:k_up], samp[k_up:], n)
+
+    null_mean = null.mean(axis=1)
+    null_std = null.std(axis=1)
+    null_std[null_std == 0] = np.nan
+    # Per-drug standardized connectivity: how many SDs the real reversal is below what random
+    # queries of the same size produce against THIS drug. Continuous (no percentile floor), so it
+    # discriminates within the strong-reversal tail. Negative = specific reversal.
+    spec_z = (real - null_mean) / null_std
+
+    frac = (null <= real[:, None]).mean(axis=1)
+    tau = 100.0 * (2.0 * frac - 1.0)  # retained for reference; saturates at +/-100 in the tail
+
+    df = pd.DataFrame(
+        {"connectivity_ks": real, "null_mean": null_mean, "spec_z": spec_z, "tau": tau},
+        index=signature_matrix.index,
+    )
+    df = df.sort_values("spec_z")  # most negative z = most specific reversal
+    df.insert(0, "rank", range(1, len(df) + 1))
+    return df


 def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
--- a/tests/test_disease.py
+++ b/tests/test_disease.py
@@ -0,0 +1,122 @@
+"""Tests for the Week 1 signature-construction logic on synthetic data.
+
+These verify the DE / probe-collapse / concordance math without touching the network, so the
+pipeline is trustworthy before it is pointed at real GEO studies.
+"""
+
+from __future__ import annotations
+
+import numpy as np
+import pandas as pd
+import pytest
+
+from src.disease import (
+    DISEASE_LABEL,
+    HEALTHY_LABEL,
+    ConcordanceSummary,
+    SignatureProvenance,
+    StudyProvenance,
+    build_signature,
+    collapse_probes_to_symbols,
+    compute_differential_expression,
+    concordance_filter,
+)
+from src.provenance import ConfidenceTier
+
+
+def _synthetic_study(seed: int, n_per_group: int = 12) -> tuple[pd.DataFrame, pd.Series]:
+    """Genes x samples matrix where UP is higher and DOWN is lower in disease."""
+    rng = np.random.default_rng(seed)
+    genes = ["UP1", "UP2", "DOWN1", "DOWN2", "NOISE1", "NOISE2"]
+    samples = [f"d{i}" for i in range(n_per_group)] + [f"h{i}" for i in range(n_per_group)]
+    groups = pd.Series([DISEASE_LABEL] * n_per_group + [HEALTHY_LABEL] * n_per_group, index=samples)
+
+    base = rng.normal(8.0, 0.3, size=(len(genes), len(samples)))
+    df = pd.DataFrame(base, index=genes, columns=samples)
+    disease = groups == DISEASE_LABEL
+    df.loc["UP1", disease] += 3.0
+    df.loc["UP2", disease] += 2.0
+    df.loc["DOWN1", disease] -= 3.0
+    df.loc["DOWN2", disease] -= 2.0
+    return df, groups
+
+
+def test_welch_de_recovers_direction_and_significance():
+    expr, groups = _synthetic_study(seed=1)
+    de = compute_differential_expression(expr, groups, method="welch")
+
+    assert de.loc["UP1", "log_fc"] > 0 and de.loc["UP1", "qvalue"] < 0.05
+    assert de.loc["DOWN1", "log_fc"] < 0 and de.loc["DOWN1", "qvalue"] < 0.05
+    # Pure-noise genes should not be significant.
+    assert de.loc["NOISE1", "qvalue"] > 0.05
+
+
+def test_compute_de_rejects_unlabelled_samples():
+    expr, groups = _synthetic_study(seed=2)
+    with pytest.raises(ValueError):
+        compute_differential_expression(expr, groups.iloc[:-3], method="welch")
+
+
+def test_collapse_probes_keeps_highest_mean_expression():
+    de = pd.DataFrame(
+        {"log_fc": [1.0, 2.0, -1.0], "pvalue": [0.1, 0.2, 0.3], "qvalue": [0.1, 0.2, 0.3]},
+        index=["probeA1", "probeA2", "probeB1"],
+    )
+    probe_to_symbol = pd.Series({"probeA1": "GENEA", "probeA2": "GENEA", "probeB1": "GENEB"})
+    expr = pd.DataFrame(
+        {"s1": [1.0, 100.0, 5.0], "s2": [1.0, 100.0, 5.0]},
+        index=["probeA1", "probeA2", "probeB1"],
+    )
+    collapsed = collapse_probes_to_symbols(de, probe_to_symbol, expression_for_ranking=expr)
+
+    assert set(collapsed.index) == {"GENEA", "GENEB"}
+    # probeA2 has the higher mean expression, so its log_fc (2.0) should win.
+    assert collapsed.loc["GENEA", "log_fc"] == 2.0
+
+
+def test_concordance_filter_keeps_only_agreeing_genes():
+    de_a = pd.DataFrame(
+        {"log_fc": [2.0, -2.0, 1.5, 0.1], "qvalue": [0.001, 0.001, 0.2, 0.001]},
+        index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
+    )
+    de_b = pd.DataFrame(
+        {"log_fc": [1.8, -2.2, 1.4, -0.1], "qvalue": [0.002, 0.002, 0.2, 0.002]},
+        index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
+    )
+    keep, summary = concordance_filter(de_a, de_b)
+
+    assert set(keep.index) == {"UP1", "DOWN1"}  # WEAK fails q-cut; DISAGREE flips sign
+    assert keep.loc["UP1", "log_fc"] == pytest.approx(1.9)  # mean of the two
+    assert keep.loc["UP1", "qvalue"] == pytest.approx(0.002)  # max of the two
+    assert isinstance(summary, ConcordanceSummary)
+    assert summary.n_genes_tested == 4 and summary.n_concordant == 2
+    assert summary.n_up == 1 and summary.n_down == 1
+
+
+def test_build_signature_splits_directions_and_respects_top_n():
+    concordant = pd.DataFrame(
+        {
+            "log_fc": [3.0, 2.0, 1.0, -1.0, -2.0],
+            "qvalue": [0.001, 0.002, 0.003, 0.004, 0.005],
+        },
+        index=["UP1", "UP2", "UP3", "DOWN1", "DOWN2"],
+    )
+    prov = SignatureProvenance(
+        studies=[
+            StudyProvenance(geo_accession="GSE1", n_disease=12, n_healthy=12,
+                            platform="P", tissue="whole blood", method="welch"),
+            StudyProvenance(geo_accession="GSE2", n_disease=15, n_healthy=11,
+                            platform="P", tissue="whole blood", method="welch"),
+        ],
+        concordance=ConcordanceSummary(n_genes_tested=100, n_concordant=5, n_up=3, n_down=2),
+        created_date="2026-06-23",
+    )
+    sig = build_signature(
+        concordant, prov, tier=ConfidenceTier.A,
+        tier_rationale="Two-study concordance", limitations=["cell-composition confound"],
+        top_n=2,
+    )
+
+    assert [g.gene for g in sig.up_regulated] == ["UP1", "UP2"]  # top 2 up by qvalue
+    assert [g.gene for g in sig.down_regulated] == ["DOWN1", "DOWN2"]
+    assert sig.confidence_tier == ConfidenceTier.A
--- a/tests/test_scoring.py
+++ b/tests/test_scoring.py
@@ -1,14 +1,13 @@
 """Tests for the matching engine and provenance logic.

-The headline test (PLAN.md §6, Week 3 task 4) verifies connectivity scoring against a known
-reference within tolerance; it is marked xfail until the scorer is implemented in Week 3.
-
-The tier-assignment tests run today — they pin the rules from PLAN.md §3 so the most
+Connectivity tests (PLAN.md §6, Week 3 task 4) pin the weighted-KS scorer against hand-built
+reference profiles. The tier-assignment tests pin the rules from PLAN.md §3 so the most
 commercially important design decision can't silently drift.
 """

 from __future__ import annotations

+import pandas as pd
 import pytest

 from src.provenance import ConfidenceTier, assign_tier
@@ -52,14 +51,103 @@ class TestAssignTier:
        assert assign_tier(**kwargs) == ConfidenceTier.B


-@pytest.mark.xfail(reason="Connectivity scoring implemented in Week 3 (notebook 04).", strict=True)
-def test_connectivity_score_matches_reference():
-    """Verify connectivity scoring against a CMap/cmapPy reference within tolerance.
+class TestConnectivityScore:
+    """Reference checks for the weighted-KS connectivity score (PLAN §6 Week 3 task 4).

-    PLAN.md §6, Week 3 task 4. Replace this body with a known reference example
-    (disease up/down sets + drug signature -> expected score) once the scorer exists.
+    Query: up = {U1, U2}, down = {D1, D2}. We build drug profiles with a known relationship to
+    the query and assert the sign/ordering the CMap convention requires.
    """
-    from src.scoring import connectivity_score

-    score = connectivity_score(up_genes=[], down_genes=[], drug_signature=None)  # noqa
-    assert score == pytest.approx(0.0, abs=1e-6)
+    UP = ["U1", "U2"]
+    DOWN = ["D1", "D2"]
+
+    @staticmethod
+    def _profile(values: dict[str, float]) -> pd.Series:
+        # 20 filler genes at ~0 so the query genes sit clearly at the extremes.
+        base = {f"N{i}": 0.01 * ((i % 5) - 2) for i in range(20)}
+        base.update(values)
+        return pd.Series(base)
+
+    def test_perfect_reversal_is_strongly_negative(self):
+        from src.scoring import connectivity_score
+        # Drug pushes disease-up genes DOWN (very negative) and disease-down genes UP (very
+        # positive) => reversal => negative connectivity.
+        prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
+        assert connectivity_score(self.UP, self.DOWN, prof) < -0.4
+
+    def test_perfect_mimic_is_strongly_positive(self):
+        from src.scoring import connectivity_score
+        prof = self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7})
+        assert connectivity_score(self.UP, self.DOWN, prof) > 0.4
+
+    def test_reversal_beats_mimic_and_null(self):
+        from src.scoring import connectivity_score
+        rev = connectivity_score(self.UP, self.DOWN, self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7}))
+        mimic = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7}))
+        null = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 0.2, "U2": -0.1, "D1": 0.1, "D2": -0.2}))
+        assert rev < null < mimic
+        assert abs(null) < abs(rev)
+
+    def test_same_sign_enrichment_returns_zero(self):
+        from src.scoring import connectivity_score
+        # Both up- and down-sets at the top => same-sign ES => ambiguous => 0 (WTCS rule).
+        prof = self._profile({"U1": 8, "U2": 7, "D1": 6, "D2": 5})
+        assert connectivity_score(self.UP, self.DOWN, prof) == 0.0
+
+    def test_genes_absent_from_profile_are_ignored(self):
+        from src.scoring import connectivity_score
+        prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
+        # Adding a query gene not in the profile must not change the score.
+        s1 = connectivity_score(self.UP, self.DOWN, prof)
+        s2 = connectivity_score(self.UP + ["NOT_IN_PROFILE"], self.DOWN, prof)
+        assert s1 == pytest.approx(s2)
+
+
+class TestMechanisticPrior:
+    def test_counts_distinct_sickle_pathways(self):
+        from src.scoring import mechanistic_prior
+        # ribonucleotide reductase (hydroxyurea) -> hbf_epigenetic category.
+        assert mechanistic_prior(["Ribonucleoside-diphosphate reductase RR1"]) == 1.0
+        # DNMT (epigenetic) + hemoglobin -> two categories.
+        assert mechanistic_prior(["DNA (cytosine-5)-methyltransferase 1", "Hemoglobin subunit beta"]) == 2.0
+        assert mechanistic_prior([]) == 0.0
+        assert mechanistic_prior(["Some unrelated kinase"]) == 0.0
+
+
+class TestTauCalibration:
+    """tau should reward a SPECIFIC reverser and give a near-zero score to a noise drug."""
+
+    @staticmethod
+    def _matrix() -> pd.DataFrame:
+        genes = [f"U{i}" for i in range(5)] + [f"D{i}" for i in range(5)] + [f"G{i}" for i in range(40)]
+        rng_vals = {g: 0.01 * ((hash(g) % 7) - 3) for g in genes}  # tiny deterministic noise
+        # specific reverser: query-up genes at the bottom, query-down at the top, rest ~0
+        specific = dict(rng_vals)
+        for i in range(5):
+            specific[f"U{i}"] = -8 - i
+            specific[f"D{i}"] = 8 + i
+        noise = dict(rng_vals)
+        return pd.DataFrame([specific, noise], index=["specific", "noise"])[genes]
+
+    def test_specific_reverser_has_strongly_negative_tau(self):
+        from src.scoring import tau_calibrate
+        up = [f"U{i}" for i in range(5)]
+        down = [f"D{i}" for i in range(5)]
+        out = tau_calibrate(up, down, self._matrix(), n_null=300, seed=0)
+        # Ranked by spec_z (continuous); the specific reverser is the most negative.
+        assert out.loc["specific", "spec_z"] < -2
+        assert out.loc["specific", "spec_z"] < out.loc["noise", "spec_z"]
+        assert out.loc["specific", "tau"] < -50  # tau also flags it (may saturate near -100)
+        assert out.loc["specific", "rank"] == 1
+
+
+def test_rank_drugs_orders_by_reversal():
+    from src.scoring import rank_drugs
+    genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]
+    base = {g: 0.0 for g in genes}
+    reverser = {**base, "U1": -8, "U2": -7, "D1": 8, "D2": 7}
+    mimic = {**base, "U1": 8, "U2": 7, "D1": -8, "D2": -7}
+    matrix = pd.DataFrame([reverser, mimic], index=["reverser", "mimic"])
+    ranked = rank_drugs(["U1", "U2"], ["D1", "D2"], matrix)
+    assert ranked.loc["reverser", "rank"] == 1
+    assert ranked.loc["reverser", "connectivity_score"] < ranked.loc["mimic", "connectivity_score"]
Author	SHA1	Message	Date
Junior B.	817bcda7dc	Structure-binding track: scaffold + ligand-retrieval baseline Start the structure-based binding branch (PLAN §12), baseline-first. - src/binding.py: validated RDKit ligand retrieval (morgan_fp, tanimoto, retrieve_nearest = the §12.9 engine) + dock() stub documenting the blocked ARM-Mac toolchain - scripts/binding_ligand_baseline.py: 300 drugs vs known binders - docs/structure_binding_notes.md: status, toolchain blocker, next steps - pyproject: [structure] extra (rdkit); data/raw/structures/ for PDBs Step-0 finding: retrieval engine VALIDATED on in-set classes (decitabine->azacitidine 0.62; vorinostat->scriptaid/belinostat) but the distinctive binders voxelotor/mitapivat have no analog in our 300-drug set (Tanimoto ~0.2). Needs (a) bigger library, (b) real docking (§12.3), which is blocked on the ARM-Mac docking toolchain (§12.6 pitfall 4). Structures 5E83 (Hb+voxelotor) and 8XFD (PKR+mitapivat) fetched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:53:27 +02:00
Junior B.	6c2c71d73d	PLAN §12.9: leave door open for generative-guided retrieval Reframe de novo generation into the repurposing frame per the founder's idea: use a pocket-conditioned generative model (TargetDiff/DiffSBDD/ Pocket2Mol) to propose an idealised binder as a SEARCH BEACON, then retrieve the nearest EXISTING drugs by chemical similarity (Tanimoto/ embedding) as repurposing candidates — the generated molecule is never synthesised. Caveats kept honest: generated molecules used only as beacons (often synthetically invalid); similarity != activity, so retrieved neighbours still must be docked + pass the binding recovery test; open question whether it beats brute-force docking the existing library. Explore only after the §12.3-12.4 docking baseline is validated. §12.7 exclusion reworded to point here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:43:25 +02:00
Junior B.	7449dbeefb	Scope Phase 2 structure-based binding track into PLAN (§12) Add a scoped (not committed) follow-on track pivoting modality from expression-connectivity to structure-based drug-target binding, motivated by the empirical finding that the expression modality is signal-dead for this task (relational-only supervised AUC = 0.49, chance). §12 covers: the evidence for the pivot, a sickle-specific druggable target shortlist with known-binder positive controls (Hb/voxelotor, PKR/mitapivat, DNMT1/decitabine, LSD1, HDAC, EHMT2, PDE9), method (classical docking baseline -> AF3-class co-folding: Boltz-2/Chai-1/DiffDock), a pre-registered binding recovery test, integration with the expression layer as the real prize, honest pitfalls (binding != efficacy, BCL11A untractable, GPU breaks the all-local assumption), and open decisions before committing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:40:18 +02:00
Junior B.	649f617019	Phase D: supervised cross-disease (0.925 AUC degree-bias mirage) Train GradientBoosting on 300 drugs x 839 GEO disease signatures with Repurposing-Hub indications as labels (432 positives), disease-grouped CV. Finding: 0.925 CV AUC looks like a win but is a MIRAGE. Feature importances are all drug-level (drug_std 0.33, drug_mean 0.30, broadness 0.17); drug-disease connectivity importance = 0.01. The model learned a drug-POPULARITY prior, not disease-specific matching. On held-out sickle it ranks hydroxyurea 231/300 (worse than baseline) and tops out with promiscuous drugs (dexamethasone, methotrexate). Classic degree-bias trap. Connectivity also has ~chance AUC (0.51) for predicting approved indications. Both obvious approaches now fail instructively: unsupervised = specificity ceiling; naive supervised = degree bias. Real progress needs degree- debiased training + much larger clean labels (a research effort). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:31:32 +02:00
Junior B.	0ce688449d	Phase A: reference-library tau (negative result on specificity) Calibrate sickle connectivity against a real disease-signature reference population (Enrichr Disease_Signatures_from_GEO, 141 diseases) instead of random gene-set nulls — proper CMap tau. Finding: hydroxyurea still recovers (rank 25, top 8%), but negative- control specificity is UNCHANGED (2/5; norethindrone + ciprofloxacin still top). The reference-calibrated ranking is nearly identical to the random-null ranking. Third independent fix (after gene-space expansion and composition adjustment) that recovers hydroxyurea but does NOT fix specificity. Conclusion: unsupervised connectivity has a specificity ceiling — it cannot distinguish therapeutic reversal from coincidental transcriptional anti-correlation. Breaking it needs external signal (supervised labels or mechanistic filtering), not more calibration. Disease-signature library cached at data/raw/disease_sigs/. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:19:26 +02:00
Junior B.	b62048614d	Experiment: composition-adjusted signature (negative result) Test whether fixing the cell-composition confound rescues the recovery test. GSE35007 measured WBC/RBC/MCV per sample, so adjust DE directly for them (per-gene OLS: expression ~ disease + WBC + RBC + MCV + age + sex) — a measured-covariate deconvolution, compared vs unadjusted. Finding (negative, informative): adjustment HURT hydroxyurea (rank 20 -> 35, fails top-10%) — it was recovering partly via composition genes — and did NOT fix negative-control specificity (still 2/5). Only gain: top hits become more mechanistically coherent (resveratrol, aspirin enter). Conclusion: cleaning the disease signature does not rescue connectivity; the binding constraint is matching-side (needs a reference-signature library for proper specificity calibration), which is a multi-disease investment. v1.1 signature NOT replaced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:05:08 +02:00
Junior B.	3417f85eb1	v1.1: full gene space + specificity z-score; hydroxyurea recovers Post-hoc improvement after the pre-registered v1 recovery test failed. Two changes, diagnosing v1's failure: - score on the full 12,328-gene LINCS space (week2_lincs_extract.py), lifting signature overlap from 12% to 85% (brings erythroid markers in) - src/scoring.py: KS connectivity + per-drug specificity z-score (spec_z = SDs below a 1,000 random-query null). Primary ranking is now spec_z. (Textbook tau saturated at +/-100 for a coherent query — documented; needs a reference-signature library, a v2 item.) - week3_scoring.py: spec_z primary + WTCS reference + prior-blended - tests: tau/spec_z calibration test; 19 passing - scripts/exp_genespace.py: the BING vs all-12,328 comparison Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%), confirming the v1 failure was the landmark bottleneck not the algorithm. Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite), and negative controls (norethindrone, ciprofloxacin) rank top-3 — connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory, not a confirmatory test; reported as such in recovery_test_report.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 22:57:30 +02:00
Junior B.	72f1a49de6	Week 4: recovery test (FAIL, reported honestly) + 2-page report Run the formal recovery test against the pre-registered criteria and write the deliverable report (PLAN §6 Week 4): - week4_recovery_test.py: evaluate hydroxyurea/L-glutamine + 5 pre-specified negative controls vs the committed criteria - recovery_test_report.md: methodology, FAIL result with diagnosis, top-10, lisinopril as the non-obvious candidate, limitations, v2 - known_limitations.md: L-glutamine coverage resolved, 12%-overlap driver, recovery outcome table Outcome: FAIL on all 3 criteria (hydroxyurea top 13%, L-glutamine WTCS=0, 1/5 negative controls bottom-half). Root cause is signature/ assay data limitations (lost erythroid+HbF axis, 12% landmark overlap), not the matching algorithm — reported straight per the project ethos. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 22:38:56 +02:00
Junior B.	fd4591949c	Week 3: CMap connectivity scoring engine + ranked candidates Implement the matching engine (PLAN §6 Week 3): - src/scoring.py: weighted-KS/GSEA enrichment, weighted connectivity score (WTCS, Lamb 2006 / Subramanian 2017), signed NCS normalization, rank_drugs, and a sickle-pathway mechanistic prior - tests/test_scoring.py: real reference tests for the scorer (perfect reversal<null<mimic, same-sign->0, absent-gene invariance) + prior - week3_scoring.py: score 300 drugs -> ranked_candidates_v1.csv with a raw ranking and a secondary mechanistic-prior-weighted ranking Preliminary (formal recovery test is Week 4): hydroxyurea raw rank 40/300 (top 13%, just misses pre-registered top-10%), blended rank 7; L-glutamine WTCS=0 (ambiguous). Notably anti-inflammatory SCD drugs cluster in the raw top tier — the engine reverses the inflammation axis, not the erythroid axis, traceable to the 12% landmark-overlap caveat. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 22:34:56 +02:00
Junior B.	47b0094079	Week 2: 300-drug profiles with LINCS signatures + ChEMBL Build the drug profile dataset (PLAN §6 Week 2): - week2_curate_drugset.py: 300-drug set (2 ground-truth + 32 related- mechanism + 26 negative-control + 240 random), restricted to LINCS-scorable compounds, seed=42 - week2_chembl.py: InChIKey->ChEMBL match (145/300), MoA + targets - week2_lincs_extract.py: cmapPy-slice both Level-5 GCTX phases to 978 landmark genes, mean-aggregate per drug to one consensus signature - week2_assemble.py: join into drug_profiles_v1.parquet, Tier B (LINCS single-source), scored flag per PLAN §6 Week 3 task 2 - docs/data_sources.md: drug set composition + LINCS/ChEMBL provenance Results (all gitignored data): 300/300 drugs scored, both ground-truth drugs present (hydroxyurea Phase II = CHEMBL467, L-glutamine Phase I). Key caveat recorded: only 56/477 (12%) of the disease signature genes are LINCS landmarks, so Week-3 scoring uses a 30-up/26-down query. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 22:25:00 +02:00
Junior B.	c7b6649d31	Week 1: Tier-A sickle cell signature via 2-study concordance Implement and run the Week 1 disease-signature pipeline: - src/disease.py: Welch t-test + BH DE (microarray), probe->symbol collapse, cross-study concordance filter, 2-study provenance schema - scripts/week1_explore.py: download GSE35007 + GSE16728, DE + concordance - scripts/week1_finalize.py: mygene ID mapping + persist signature - tests/test_disease.py: synthetic-data tests for DE/collapse/concordance - docs/data_sources.md: chosen datasets, group defs, reproduction steps Result: sickle_cell_signature_v1.json (gitignored), Tier A, 250 up / 227 down genes from 671 concordant (GSE35007 Illumina whole blood SS/AA + GSE16728 Affymetrix whole blood patient/control). Documented caveats: missing HbF axis (globin depletion) and reticulocyte composition confound. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 20:43:54 +02:00