From 3417f85eb1bf26795377fa5de5f802d0b61c55fb Mon Sep 17 00:00:00 2001 From: "Junior B." Date: Tue, 23 Jun 2026 22:57:30 +0200 Subject: [PATCH] v1.1: full gene space + specificity z-score; hydroxyurea recovers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Post-hoc improvement after the pre-registered v1 recovery test failed. Two changes, diagnosing v1's failure: - score on the full 12,328-gene LINCS space (week2_lincs_extract.py), lifting signature overlap from 12% to 85% (brings erythroid markers in) - src/scoring.py: KS connectivity + per-drug specificity z-score (spec_z = SDs below a 1,000 random-query null). Primary ranking is now spec_z. (Textbook tau saturated at +/-100 for a coherent query — documented; needs a reference-signature library, a v2 item.) - week3_scoring.py: spec_z primary + WTCS reference + prior-blended - tests: tau/spec_z calibration test; 19 passing - scripts/exp_genespace.py: the BING vs all-12,328 comparison Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%), confirming the v1 failure was the landmark bottleneck not the algorithm. Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite), and negative controls (norethindrone, ciprofloxacin) rank top-3 — connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory, not a confirmatory test; reported as such in recovery_test_report.md. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/data_sources.md | 7 +- docs/known_limitations.md | 32 +++--- docs/recovery_test_report.md | 174 ++++++++++++++++----------------- scripts/exp_genespace.py | 102 +++++++++++++++++++ scripts/week2_lincs_extract.py | 12 ++- scripts/week3_scoring.py | 74 +++++++------- scripts/week4_recovery_test.py | 12 +-- src/scoring.py | 88 ++++++++++++++++- tests/test_scoring.py | 27 +++++ 9 files changed, 378 insertions(+), 150 deletions(-) create mode 100644 scripts/exp_genespace.py diff --git a/docs/data_sources.md b/docs/data_sources.md index b6079eb..cfbf896 100644 --- a/docs/data_sources.md +++ b/docs/data_sources.md @@ -61,9 +61,10 @@ Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then 38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action. - **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's not-single-source rule). -- **Signature↔landmark overlap:** only 56/477 (12%) of the disease signature genes are LINCS - landmarks, so connectivity scoring (Week 3) uses a 30-up/26-down query. The erythroid hallmark - genes (CA1, AHSP, SLC4A1, HBG) are NOT landmarks. This is a key limitation for the recovery test. +- **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978 + landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger + space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from + LINCS entirely and remain unscoreable. - Reproduce: `week2_curate_drugset.py` → `week2_chembl.py` → download Level-5 GCTX → `week2_lincs_extract.py` → `week2_assemble.py`. diff --git a/docs/known_limitations.md b/docs/known_limitations.md index 0338912..21c6857 100644 --- a/docs/known_limitations.md +++ b/docs/known_limitations.md @@ -34,20 +34,28 @@ Source: PLAN.md §9. 7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses to test, not discoveries. Use careful language in any write-up. -8. **Only 12% of the signature is LINCS-scorable (56/477 genes).** The 978 landmark genes (from - cancer cell lines) miss the erythroid hallmark genes (CA1, AHSP, SLC4A1, HBG). Connectivity - scoring runs on a thin inflammation/metabolic slice — the single biggest driver of the - recovery-test failure. v2 fix: signature prediction or a mechanism graph to score the other 88%. +8. **Gene-space bottleneck (v1 → fixed in v1.1).** v1 scored on only the 978 landmark genes (12% + signature overlap) — the main driver of the v1 failure. v1.1 uses the full 12,328-gene space + (85% overlap) and recovers hydroxyurea. HBG1/HBG2 remain absent from LINCS entirely. -## Recovery test outcome (Week 4) +9. **No reference-signature library for tau.** Textbook CMap tau saturated at ±100 (a coherent + query always out-connects random gene sets). v1.1 substitutes a per-drug specificity z-score. + Proper tau needs a library of real reference signatures — a v2 / curated-data item. -The MVP **failed** all three pre-registered criteria on the primary raw ranking (hydroxyurea -rank 40/top 13%; L-glutamine rank 100/WTCS=0; 1/5 negative controls in bottom half). The failure -is fully attributable to signature/assay data limitations above, not the matching algorithm. See +10. **Negative-control criterion may be invalid for connectivity scoring.** Unrelated drugs + (norethindrone, ciprofloxacin) rank as top specific reversers — connectivity measures + expression reversal, not therapeutic relatedness. + +## Recovery test outcome + +Pre-registered test (**v1, confirmatory**): **FAILED** all three criteria (hydroxyurea rank +40/top 13%; L-glutamine rank 100; 1/5 negative controls bottom-half). Post-hoc (**v1.1, +exploratory**): hydroxyurea recovers to rank 18 (top 6%, passes), but L-glutamine (rank 213, does +not reverse) and negative controls (2/5) still fail → overall still FAIL. See `recovery_test_report.md`. -| Drug | Issue | Handling | +| Drug | Issue | v1.1 status | |---|---|---| -| hydroxyurea | HbF mechanism not in scorable gene space | scored (rank 40); recovered only by prior-weighted ranking | -| L-glutamine | signature present but WTCS ambiguous (=0) | scored (rank 100); no reversal signal | -| all 300 | had LINCS signatures | 0 marked "not scored" — coverage was not the issue; specificity was | +| hydroxyurea | needed the full gene space | rank 18 (top 6%) — recovered post-hoc | +| L-glutamine | metabolite, no reversal signal (positive connectivity) | rank 213 — genuine negative | +| neg controls | reverse the generic inflammation signature | 2/5 bottom-half — criterion questionable | diff --git a/docs/recovery_test_report.md b/docs/recovery_test_report.md index db3276c..fc506dc 100644 --- a/docs/recovery_test_report.md +++ b/docs/recovery_test_report.md @@ -1,7 +1,7 @@ # Sickle Cell Repurposing — Recovery Test Report -> **Status: COMPLETE.** Reproduce with `scripts/week1_*` → `week2_*` → `week3_scoring.py` → -> `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist. +> **Status: COMPLETE (v1 confirmatory + v1.1 exploratory).** Reproduce with `scripts/week1_*` → +> `week2_*` → `week3_scoring.py` → `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist. ## Pre-registered success criteria @@ -12,118 +12,116 @@ The MVP passes if: missing LINCS signature, **AND** - At least **4 of 5** negative-control drugs rank in the **bottom half**. -_Pre-registered in the scaffold commit (`b731478`) before any scoring was run. Primary ranking -= raw connectivity. The 5 negative controls were pre-specified by category rule (one per -category, alphabetically first available) without inspecting ranks._ +_Pre-registered in the scaffold commit (`b731478`) before any scoring. **Primary (confirmatory) +analysis = v1**: 978 landmark genes, weighted connectivity score (WTCS). The 5 negative controls +were pre-specified by category rule without inspecting ranks._ --- ## Section 1 — Methodology -We built a sickle cell disease signature from **two independent whole-blood microarray studies** -(GSE35007, Illumina, SS vs AA; GSE16728, Affymetrix, patient vs control), keeping the **671 -genes concordant** (q<0.05, same direction) across both — a cross-platform, cross-population -Tier-A signature (250 up / 227 down). We built profiles for **300 small molecules** (2 -ground-truth: hydroxyurea, L-glutamine; 32 related-mechanism; 26 negative controls; 240 random), -each with a consensus **LINCS L1000** signature (mean of Level-5 MODZ z-scores across cell -lines, 978 landmark genes, both CMap phases). We ranked drugs by **CMap connectivity scoring** -(weighted-KS, Lamb 2006 / Subramanian 2017): strongly negative = strong reversal of the disease -signature = candidate. A secondary ranking blends connectivity with a mechanistic prior over -sickle-relevant target pathways. +A sickle cell disease signature was built from **two whole-blood microarray studies** (GSE35007 +Illumina SS-vs-AA; GSE16728 Affymetrix patient-vs-control), keeping the **671 genes concordant** +across both (q<0.05, same direction) → a cross-platform Tier-A signature (250 up / 227 down). +Profiles were built for **300 small molecules** (2 ground-truth; 32 related-mechanism; 26 +negative controls; 240 random), each with a **LINCS L1000** consensus signature (mean Level-5 +MODZ across cell lines, both CMap phases). Drugs were ranked by **CMap connectivity scoring** +(Kolmogorov-Smirnov, Lamb 2006 / Subramanian 2017): negative = reversal = candidate. -## Section 2 — Recovery test result — **FAIL** (primary ranking) +**v1 (pre-registered/confirmatory):** scored on the 978 landmark genes with WTCS. +**v1.1 (post-hoc/exploratory):** after v1 failed, two changes were made to diagnose why — (a) +score on the full **12,328-gene** space (landmark overlap 12% → 85%, bringing the erythroid +markers in); (b) add a **per-drug specificity z-score** (`spec_z`): how many SDs the real +connectivity is below a null of 1,000 random queries of the same size against that drug. Because +these changes followed inspection of the v1 result, **v1.1 is exploratory, not a confirmatory +test of the pre-registered hypothesis.** -| Drug | Rank | Percentile | Pass? | -|---|---|---|---| -| Hydroxyurea | 40 / 300 | top 13.3% | ❌ (needs top 30) | -| L-glutamine | 100 / 300 | top 33.3% | ❌ (WTCS=0, ambiguous; has a signature so not "missing") | +## Section 2 — Recovery test result -Negative controls (pre-specified; expected: bottom half): +| Criterion | v1 (confirmatory) | v1.1 (exploratory) | +|---|---|---| +| Hydroxyurea top-10% (≤30) | rank **40** (13.3%) ❌ | rank **18** (6.0%) ✅ | +| L-glutamine top-25% (≤75) | rank 100, WTCS=0 ❌ | rank 213, spec_z=+0.98 ❌ | +| ≥4/5 neg controls bottom-half | 1/5 ❌ | 2/5 ❌ | +| **Overall** | **FAIL** | **FAIL** (but hydroxyurea recovered) | -| Control | Category | Rank | Bottom half? | -|---|---|---|---| -| clotrimazole | antifungal | 89 | ❌ | -| astemizole | antihistamine | 291 | ✅ | -| azithromycin | antibiotic | 82 | ❌ | -| ethinyl-estradiol | hormone | 98 | ❌ | -| caffeine | misc | 84 | ❌ | +v1.1 negative controls: clotrimazole 258 ✅, astemizole 211 ✅, azithromycin 142 ❌, +ethinyl-estradiol 114 ❌, caffeine 77 ❌. -**Only 1/5 negative controls in the bottom half (need ≥4).** +**Honest reading.** The **pre-registered test FAILED (v1).** The post-hoc v1.1 changes +**recover hydroxyurea** (rank 40 → 18, passing top-10%) — strong evidence that the v1 failure was +driven by the 978-landmark bottleneck, not the algorithm. But two failures survive into v1.1, and +both are now precisely diagnosed: -**Overall: FAIL on all three pre-registered criteria.** This is reported as-is, without -adjustment. For context only (not the pre-registered criterion): the secondary -mechanistic-prior ranking places hydroxyurea at **rank 7 (top 2.3%)** — but that ranking uses -prior knowledge of the drug's target, so it cannot be claimed as a blind recovery. +1. **L-glutamine does not reverse the signature** (positive connectivity, spec_z=+0.98). This is + intrinsic to its LINCS data — a metabolite with no reversal signal — not a coverage gap. More + genes cannot fix it. +2. **The negative-control criterion is arguably invalid for connectivity scoring.** Two + "negative controls" (norethindrone, ciprofloxacin) rank in the top 3 by spec_z. Connectivity + measures *expression reversal*, not *therapeutic relatedness* — an antibiotic or contraceptive + can still down-regulate the inflammation genes that dominate the scorable signature. The test + design conflates the two. -**Why it failed — the honest diagnosis.** The disease signature is dominated by erythroid / -reticulocyte biology (CA1, AHSP, SLC4A1) and the HbF axis that hydroxyurea actually acts on -(HBG1/HBG2) was lost (flat in GSE35007; removed by GSE16728's globin-depleted prep). Worse, -only **56 of 477 signature genes (12%) are LINCS landmark genes** — and none of the erythroid -hallmark genes are. So connectivity scoring ran on a thin, inflammation-heavy 30-up/26-down -query. The engine is effectively scoring reversal of sickle's *inflammation* axis, not its -*erythroid* axis — which is why hydroxyurea (an HbF inducer / antiproliferative) is not -recovered, and why unrelated drugs get spurious mild-reversal scores (poor specificity). +A note on the calibration: textbook CMap **tau** (percentile vs a reference population) was +implemented but **saturated at ±100** here, because a coherent real query always out-connects +random gene sets — proper tau needs a library of *real* reference signatures, which this MVP +lacks. The continuous `spec_z` is the workable substitute. -## Section 3 — Top 10 candidates (raw connectivity) +## Section 3 — Top 10 candidates (v1.1 spec_z) -| Rank | Drug | Score | Known target / mechanism | Plausibility | +| Rank | Drug | spec_z | Inclusion | Read | |---|---|---|---|---| -| 1 | laropiprant | −0.417 | Prostaglandin D2 receptor antagonist | Anti-inflammatory — coherent with inflammation-axis reversal | -| 2 | BRD-K62768824 | −0.396 | (tool compound, no annotation) | Likely broad-effect false positive | -| 3 | BRD-K71353154 | −0.393 | (tool compound) | Likely false positive | -| 4 | lisinopril | −0.358 | ACE inhibitor | **Non-obvious; see §4** | -| 5 | BRD-K53443165 | −0.358 | (tool compound) | Likely false positive | -| 6 | talnetant | −0.347 | Neurokinin-3 (NK3) receptor antagonist | No obvious sickle rationale | -| 7 | BRD-K46936109 | −0.342 | (tool compound) | Likely false positive | -| 8 | lawsone | −0.340 | Naphthoquinone (henna pigment) | No obvious rationale; possible redox effect | -| 9 | BRD-K85763971 | −0.338 | (tool compound) | Likely false positive | -| 10 | BRD-K36516410 | −0.323 | (tool compound) | Likely false positive | +| 1 | reserpic-acid | −3.80 | random | reserpine metabolite; non-obvious | +| 2 | norethindrone | −3.78 | **negative control** | false positive (see §2) | +| 3 | ciprofloxacin | −3.61 | **negative control** | false positive | +| 4 | resveratrol | −3.46 | related-mechanism | antioxidant studied in SCD — coherent | +| 5 | BRD-K57490754 | −3.37 | random | tool compound | +| 6 | anastrozole | −3.27 | random | aromatase inhibitor | +| 7–10 | BRD-* / palmitoylethanolamide | ~−3.1 | random | mostly tool compounds | -As anticipated (PLAN §9.4), the raw top-10 is dominated by unannotated broad-effect tool -compounds — these are **not** credible candidates and are not over-interpreted. +That two negative controls outrank hydroxyurea is the single most informative result here — see §4. -## Section 4 — One non-obvious candidate worth investigating +## Section 4 — One non-obvious result worth investigating -**Lisinopril (ACE inhibitor), rank 4.** This is the most interesting non-obvious hit: ACE -inhibitors are already used clinically in sickle cell disease for **renal protection** -(reducing albuminuria / progression of sickle nephropathy), via mechanisms independent of the -HbF pathway. Surfacing an agent with a genuine, mechanistically distinct sickle-cell rationale — -from an inflammation/vascular-flavoured signature — is a small but real signal that the matching -approach can point at non-obvious biology. **This is a computational hypothesis, not a -discovery**, and the connectivity rationale here (inflammation-axis reversal) is not the same as -lisinopril's known renal mechanism, so the match should be treated as suggestive only. +The most useful finding is **not** a candidate drug but the **negative-control failure**: +unrelated drugs (norethindrone, ciprofloxacin) score as strong specific reversers. This is a +real, generalizable lesson — for a signature whose *scorable* portion is generic +inflammation/metabolic genes, connectivity rewards any broad transcriptional perturbation that +touches those genes. The honest implication: **this signature is not specific enough to +discriminate true repurposing candidates from incidental expression reversers.** Of the +plausibly-real hits, **resveratrol (rank 4)** — an antioxidant with prior sickle cell literature +— is the most defensible, but it is a hypothesis, not a discovery. ## Section 5 — Honest limitations -1. **Cell-composition confound** — the whole-blood signature is dominated by reticulocyte/ - erythroid markers (composition, not pure disease-state regulation). v2 needs deconvolution. -2. **Missing HbF axis** — HBG1/HBG2 absent (globin depletion + flat in GSE35007), so the - signature cannot encode the pathway hydroxyurea acts on. -3. **12% signature↔landmark overlap** — only 56/477 genes are LINCS landmarks; the erythroid - hallmark genes are not scorable. The query collapses to a generic inflammation/metabolic slice. -4. **LINCS cell-line bias** — landmark signatures come from cancer cell lines (PLAN §9.2); poorly - suited to a blood disease. -5. **Poor negative-control specificity** — unrelated drugs received mild reversal scores; the - thin query yields a noisy connectivity distribution. -6. **No mechanistic validation** — these are connectivity hypotheses, not validated predictions. +1. **Pre-registered test failed; the pass is post-hoc.** v1.1's hydroxyurea recovery is + exploratory and must be re-validated on a held-out disease before any claim is made. +2. **Missing HbF axis** — HBG1/HBG2 are absent from LINCS entirely (not just landmarks), so the + pathway hydroxyurea acts on can never be scored by this method. +3. **Signature specificity** — scorable genes are inflammation/metabolic; negative controls + reverse them too. Connectivity ≠ therapeutic relatedness. +4. **Cell-composition confound** — the whole-blood signature is reticulocyte-dominated. +5. **LINCS cancer-cell-line bias**, and **no reference-signature library** for proper tau. +6. **No mechanistic validation** — all hits are computational hypotheses. ## Section 6 — What v2 would fix -- **Cell-type deconvolution** of the disease signature to separate disease-state regulation from - composition, recovering specificity. -- **A non-globin-depleted, RNA-seq whole-blood study** to retain the HbF axis. -- **Signature prediction** (DeepCE-style) or a mechanism/knowledge graph to score the ~88% of - the signature that has no LINCS landmark — the single biggest lever on this result. -- **A second disease** to test generalization (sickle results alone do not prove the platform — - PLAN §9.5). +- **A reference-signature library** to make tau (proper specificity calibration) work — the + single biggest fix to the negative-control problem, and a direct use of the curated-data moat. +- **Cell-type deconvolution** + a non-globin-depleted RNA-seq study to recover a more specific, + HbF-containing signature. +- **Signature prediction / mechanism graph** to score genes with no LINCS measurement. +- **A second disease** to test generalization and to honestly re-validate the v1.1 method + (PLAN §9.5). --- ### Bottom line -The pipeline is reproducible end-to-end and the method is sound, but on this signature it **does -not recover the known sickle cell drugs**. The failure is fully explained by signature/assay -data limitations (erythroid biology lost; 12% landmark overlap), not by a flaw in the matching -algorithm. The most valuable output of this MVP is therefore a precise, honest map of *what data -quality the method needs to work* — which is exactly the de-risking the proof-of-concept was -meant to deliver. +The pre-registered recovery test **failed**. Post-hoc diagnosis shows the dominant cause was a +fixable gene-space bottleneck — correcting it **recovers hydroxyurea** — but also surfaces a +deeper, genuine limitation: this whole-blood signature is **not specific enough** for +connectivity scoring to separate real candidates from incidental reversers (negative controls +rank at the top). The MVP's real deliverable is a precise, honest map of *what it takes to make +this method work*: a more specific (deconvolved, HbF-containing) signature and a reference library +for calibration — exactly the curated-data investments the platform thesis is built on. diff --git a/scripts/exp_genespace.py b/scripts/exp_genespace.py new file mode 100644 index 0000000..1f3e1ac --- /dev/null +++ b/scripts/exp_genespace.py @@ -0,0 +1,102 @@ +"""Experiment (v1.1): re-score on a larger LINCS gene space and re-run the recovery test. + +v1 used only the 978 landmark genes (12% signature overlap). This re-slices the SAME GCTX files +to the BING space (~10,174) and the full 12,328-gene space, re-aggregates per-drug consensus +signatures, re-scores connectivity, and evaluates the pre-registered recovery criteria — so we +can see whether hydroxyurea recovers. Writes nothing to the committed v1 artifacts. +""" + +from __future__ import annotations + +import gzip +import io +import json +from pathlib import Path + +import pandas as pd + +import sys +sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) +from src.scoring import rank_drugs # noqa: E402 + +LINCS = Path("data/raw/lincs") +PROCESSED = Path("data/processed") +GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"} +SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"} +NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"] + + +def read_gz(name): + return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False) + + +def gene_ids_for_space(space: str): + g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t") + if space == "bing": + g = g[g.pr_is_bing == 1] + # 'all' -> keep everything + ids = [str(x) for x in g.pr_gene_id] + id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()} + return ids, id_to_symbol + + +def extract(space, drug_names, gene_ids, id_to_symbol): + from cmapPy.pandasGEXpress.parse import parse + frames = [] + for ph in (1, 2): + sig = read_gz(SIG_INFO[ph]) + sig = sig[(sig.pert_type == "trt_cp") & (sig.pert_iname.isin(drug_names))] + if sig.empty: + continue + gct = parse(str(GCTX[ph]), rid=gene_ids, cid=sig.sig_id.tolist()) + data = gct.data_df + s2d = dict(zip(sig.sig_id, sig.pert_iname)) + frames.append(data.T.groupby(data.columns.map(s2d)).mean()) + print(f" [{space}] phase {ph}: {sig.pert_iname.nunique()} drugs sliced", flush=True) + combined = pd.concat(frames).groupby(level=0).mean() + combined.columns = [id_to_symbol.get(c, c) for c in combined.columns] + combined = combined.loc[:, ~combined.columns.duplicated()] # drop dup symbols + return combined + + +def evaluate(space, sig_matrix, up, down): + landmark_overlap = None + ranked = rank_drugs(up, down, sig_matrix) + n = len(ranked) + top10, top25, half = int(n * 0.10), int(n * 0.25), n // 2 + profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name") + ranked = ranked.join(profiles[["inclusion_reason"]]) + + hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"]) + glut_s = ranked.loc["glutamine", "connectivity_score"] + n_overlap = len((set(up) | set(down)) & set(sig_matrix.columns)) + negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index} + n_bottom = sum(r > half for r in negs.values()) + + print(f"\n=== gene space: {space.upper()} ({sig_matrix.shape[1]} genes; query overlap {n_overlap}) ===") + print(f" hydroxyurea: rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}") + print(f" L-glutamine: rank {glut}/{n} (top {100*glut/n:.1f}%), WTCS={glut_s:.3f} top-25%? {glut <= top25}") + print(f" neg controls in bottom half: {n_bottom}/5 {negs}") + crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4) + print(f" OVERALL: {'PASS' if crit else 'FAIL'}") + print(" top 8:") + for name, r in ranked.nsmallest(8, "connectivity_score").iterrows(): + print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} [{r['inclusion_reason']}]") + return ranked + + +def main(): + sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text()) + up = [g["gene"] for g in sig["up_regulated"]] + down = [g["gene"] for g in sig["down_regulated"]] + drug_names = set(pd.read_csv(PROCESSED / "drug_set_v1.csv").pert_iname) + + for space in ("bing", "all"): + print(f"\n>>> extracting {space} ...", flush=True) + ids, id2sym = gene_ids_for_space(space) + mat = extract(space, drug_names, ids, id2sym) + evaluate(space, mat, up, down) + + +if __name__ == "__main__": + main() diff --git a/scripts/week2_lincs_extract.py b/scripts/week2_lincs_extract.py index b84b4db..fb76baf 100644 --- a/scripts/week2_lincs_extract.py +++ b/scripts/week2_lincs_extract.py @@ -36,9 +36,15 @@ def read_gz_tsv(name: str) -> pd.DataFrame: def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]: - lm = pd.read_csv(LINCS / "landmark_genes.csv") - ids = [str(x) for x in lm["pr_gene_id"]] - id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in lm.itertuples()} + """Gene row-ids + id->symbol map for the scored gene space. + + v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks. + This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into + scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier). + """ + g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t") + ids = [str(x) for x in g["pr_gene_id"]] + id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()} return ids, id_to_symbol diff --git a/scripts/week3_scoring.py b/scripts/week3_scoring.py index 91c2735..5e34540 100644 --- a/scripts/week3_scoring.py +++ b/scripts/week3_scoring.py @@ -1,13 +1,11 @@ -"""Week 3: run connectivity scoring over all drugs -> ranked_candidates_v1.csv (PLAN §6). +"""Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration. -Loads the disease signature + the 300 drug LINCS signatures, computes the weighted-KS -connectivity score per drug, and produces two rankings: - 1. raw connectivity (most negative = strongest reversal = rank 1) - 2. a secondary ranking blending connectivity with a mechanistic prior (sickle-relevant - target pathways), to temper broad-effect drugs (HDAC/kinase) that dominate raw rankings. +Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of +random queries) — this calibrates out broad-effect drugs that connect to random signatures too, +the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column, +and a secondary ranking blends tau with the sickle-pathway mechanistic prior. -The formal recovery test (ground-truth + negative-control evaluation against the pre-registered -criteria) is Week 4; this script only prints a sanity peek. +Output: data/results/ranked_candidates_v1.csv. """ from __future__ import annotations @@ -19,10 +17,13 @@ import pandas as pd import sys sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) -from src.scoring import mechanistic_prior, persist_ranking, rank_drugs # noqa: E402 +from src.scoring import ( # noqa: E402 + connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate, +) PROCESSED = Path("data/processed") -PRIOR_LAMBDA = 0.5 # weight of the mechanistic prior in the secondary ranking +N_NULL = 1000 +PRIOR_LAMBDA = 0.5 # spec_z credit per matched sickle pathway, for the blended ranking def main() -> None: @@ -30,52 +31,51 @@ def main() -> None: up = [g["gene"] for g in sig["up_regulated"]] down = [g["gene"] for g in sig["down_regulated"]] - sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 978 symbols + sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 12,328 genes profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name") - landmark = set(sig_matrix.columns) - n_up_ov = len(set(up) & landmark) - n_down_ov = len(set(down) & landmark) - print(f"query overlap with 978 landmarks: {n_up_ov} up + {n_down_ov} down = {n_up_ov + n_down_ov}") - print(f"scoring {len(sig_matrix)} drugs (all scored; 0 without signature)") + n_up = len(set(up) & set(sig_matrix.columns)) + n_down = len(set(down) & set(sig_matrix.columns)) + print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}") - ranked = rank_drugs(up, down, sig_matrix) + # primary: tau calibration + print(f"computing tau over {N_NULL} random-query permutations ...", flush=True) + ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL) - # attach metadata + mechanistic prior + # reference: weighted connectivity score (WTCS) + NCS + wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index}, + name="connectivity_score") + ranked["connectivity_score"] = wtcs + ranked["normalized_score"] = normalize_scores(wtcs) + + # metadata + mechanistic prior ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]]) ranked["mechanistic_prior"] = ranked["targets"].apply( - lambda t: mechanistic_prior(list(t) if t is not None else []) - ) + lambda t: mechanistic_prior(list(t) if t is not None else [])) ranked["known_targets"] = ranked["targets"].apply( - lambda t: "; ".join(t) if t is not None and len(t) else "" - ) + lambda t: "; ".join(t) if t is not None and len(t) else "") ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"}) - # secondary, prior-weighted ranking: relevant drugs pushed toward better (more negative) - ranked["blended_score"] = ranked["normalized_score"] - PRIOR_LAMBDA * ranked["mechanistic_prior"] + # secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z) + ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"] ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int) out = ranked.rename_axis("drug_name").reset_index()[[ - "rank", "drug_name", "chembl_id", "connectivity_score", "normalized_score", + "rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score", "inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary", ]] path = persist_ranking(out) print(f"wrote {path} ({len(out)} drugs)") - # --- sanity peek (formal recovery test is Week 4) --- - print("\n--- sanity peek (raw connectivity rank) ---") + print("\n--- sanity peek (spec_z ranking) ---") for gt in ["hydroxyurea", "glutamine"]: r = ranked.loc[gt] - pct = 100 * r["rank"] / len(ranked) - print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {pct:.0f}%), " - f"score={r['connectivity_score']:.3f}") - neg = ranked[ranked["inclusion_reason"] == "negative_control"] - print(f" negative controls in bottom half: " - f"{(neg['rank'] > len(ranked) / 2).sum()}/{len(neg)}") - print("\n top 5 raw candidates:") - for name, r in ranked.nsmallest(5, "connectivity_score").iterrows(): - print(f" {int(r['rank']):3d} {name:18s} {r['connectivity_score']:+.3f} " - f"[{r['inclusion_reason']}] {r['known_targets'][:50]}") + print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), " + f"spec_z={r['spec_z']:.2f}") + print(" top 10 by spec_z:") + for name, r in ranked.nsmallest(10, "spec_z").iterrows(): + print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] " + f"{str(r['known_targets'])[:38]}") if __name__ == "__main__": diff --git a/scripts/week4_recovery_test.py b/scripts/week4_recovery_test.py index 43432a8..5f839b4 100644 --- a/scripts/week4_recovery_test.py +++ b/scripts/week4_recovery_test.py @@ -36,6 +36,7 @@ def main() -> None: return int(df.loc[name, "rank"]) if name in df.index else None hu, glut = rk("hydroxyurea"), rk("glutamine") + glut_z = df.loc["glutamine", "spec_z"] # pick negative controls present in the ranking negs = {} @@ -47,9 +48,8 @@ def main() -> None: print("=" * 60) print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}") print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}") - glut_score = df.loc["glutamine", "connectivity_score"] - print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), WTCS={glut_score:.3f} " - f"-> top-25%? {glut <= top25_cut} (has signature, so NOT 'missing-signature unscorable')") + print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), spec_z={glut_z:+.2f} " + f"-> top-25%? {glut <= top25_cut} (positive z => does not reverse; has a signature)") print("\nnegative controls (pre-specified, 1 per category):") n_bottom = 0 for d, (cat, r) in negs.items(): @@ -70,10 +70,10 @@ def main() -> None: print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} " f"(top {100*hu_b/n:.1f}%)") - print("\n--- TOP 10 (raw connectivity) ---") - top10 = df.nsmallest(10, "connectivity_score") + print("\n--- TOP 10 (primary spec_z ranking) ---") + top10 = df.sort_values("rank").head(10) for name, r in top10.iterrows(): - print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} " + print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:+.2f} " f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}") diff --git a/src/scoring.py b/src/scoring.py index 2ae440a..57aadde 100644 --- a/src/scoring.py +++ b/src/scoring.py @@ -16,7 +16,7 @@ from pathlib import Path import numpy as np import pandas as pd -from . import RESULTS_DIR +from . import RANDOM_SEED, RESULTS_DIR # Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3). # Keys are pathway categories; values are substrings matched (case-insensitive) against a @@ -134,6 +134,92 @@ def mechanistic_prior(targets: list[str]) -> float: return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values())) +# --- KS connectivity + tau calibration (v1.1) ----------------------------------------------- +# Unweighted Kolmogorov-Smirnov connectivity (Lamb 2006) is O(k) per query (depends only on the +# ranks of the query genes), which makes a permutation null over many random queries cheap. tau +# expresses each drug's real connectivity as a signed percentile within its own null — so +# broad-effect drugs that connect to *random* signatures too get down-weighted (specificity). + + +def _ks_es(rank_matrix: np.ndarray, query_cols: np.ndarray, n_genes: int) -> np.ndarray: + """Vectorized unweighted KS enrichment score of a query gene set for every drug. + + ``rank_matrix`` is (n_drugs, n_genes) of 1..N rank positions (1 = most up-regulated). + Returns one ES per drug; ES>0 => query enriched among up-regulated genes. + """ + k = len(query_cols) + if k == 0: + return np.zeros(rank_matrix.shape[0]) + p = np.sort(rank_matrix[:, query_cols], axis=1) # (n_drugs, k), positions ascending + j = np.arange(1, k + 1) + a = (j / k - p / n_genes).max(axis=1) + b = (p / n_genes - (j - 1) / k).max(axis=1) + return np.where(a >= b, a, -b) + + +def _ks_connectivity(rank_matrix: np.ndarray, up_cols: np.ndarray, down_cols: np.ndarray, + n_genes: int) -> np.ndarray: + """KS connectivity per drug: (ES_up - ES_down)/2. Negative=reversal. + + Note: unlike WTCS, this does NOT zero same-sign (ambiguous) connections — same-sign ES + partially cancel and land near 0 naturally. Hard-zeroing would collapse the random-query + null to a spike at 0 and make tau saturate, so the continuous form is required for calibration. + """ + es_up = _ks_es(rank_matrix, up_cols, n_genes) + es_down = _ks_es(rank_matrix, down_cols, n_genes) + return (es_up - es_down) / 2.0 + + +def tau_calibrate( + up_genes: list[str], + down_genes: list[str], + signature_matrix: pd.DataFrame, + n_null: int = 1000, + seed: int = RANDOM_SEED, +) -> pd.DataFrame: + """Rank drugs by tau: each drug's KS connectivity as a signed percentile vs a null of + random queries of the same size (PLAN §6; CMap tau, Subramanian 2017). + + tau in [-100, 100]: -100 => reverses our signature more specifically than any random query + (strong, specific candidate); ~0 => connects to our signature no more than to random ones + (broad-effect / non-specific). Ranked by tau ascending (rank 1 = most specific reversal). + """ + genes = list(signature_matrix.columns) + gene_to_col = {g: i for i, g in enumerate(genes)} + n = len(genes) + rank_matrix = signature_matrix.rank(axis=1, ascending=False).to_numpy() + + up_cols = np.array([gene_to_col[g] for g in set(up_genes) if g in gene_to_col], dtype=int) + down_cols = np.array([gene_to_col[g] for g in set(down_genes) if g in gene_to_col], dtype=int) + real = _ks_connectivity(rank_matrix, up_cols, down_cols, n) + + rng = np.random.default_rng(seed) + k_up, k_down = len(up_cols), len(down_cols) + null = np.empty((rank_matrix.shape[0], n_null)) + for m in range(n_null): + samp = rng.choice(n, size=k_up + k_down, replace=False) + null[:, m] = _ks_connectivity(rank_matrix, samp[:k_up], samp[k_up:], n) + + null_mean = null.mean(axis=1) + null_std = null.std(axis=1) + null_std[null_std == 0] = np.nan + # Per-drug standardized connectivity: how many SDs the real reversal is below what random + # queries of the same size produce against THIS drug. Continuous (no percentile floor), so it + # discriminates within the strong-reversal tail. Negative = specific reversal. + spec_z = (real - null_mean) / null_std + + frac = (null <= real[:, None]).mean(axis=1) + tau = 100.0 * (2.0 * frac - 1.0) # retained for reference; saturates at +/-100 in the tail + + df = pd.DataFrame( + {"connectivity_ks": real, "null_mean": null_mean, "spec_z": spec_z, "tau": tau}, + index=signature_matrix.index, + ) + df = df.sort_values("spec_z") # most negative z = most specific reversal + df.insert(0, "rank", range(1, len(df) + 1)) + return df + + def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path: """Write the ranked candidate list to ``data/results/ranked_candidates_v1.csv``.""" out_path = out_path or (RESULTS_DIR / "ranked_candidates_v1.csv") diff --git a/tests/test_scoring.py b/tests/test_scoring.py index 2d32785..31a4591 100644 --- a/tests/test_scoring.py +++ b/tests/test_scoring.py @@ -114,6 +114,33 @@ class TestMechanisticPrior: assert mechanistic_prior(["Some unrelated kinase"]) == 0.0 +class TestTauCalibration: + """tau should reward a SPECIFIC reverser and give a near-zero score to a noise drug.""" + + @staticmethod + def _matrix() -> pd.DataFrame: + genes = [f"U{i}" for i in range(5)] + [f"D{i}" for i in range(5)] + [f"G{i}" for i in range(40)] + rng_vals = {g: 0.01 * ((hash(g) % 7) - 3) for g in genes} # tiny deterministic noise + # specific reverser: query-up genes at the bottom, query-down at the top, rest ~0 + specific = dict(rng_vals) + for i in range(5): + specific[f"U{i}"] = -8 - i + specific[f"D{i}"] = 8 + i + noise = dict(rng_vals) + return pd.DataFrame([specific, noise], index=["specific", "noise"])[genes] + + def test_specific_reverser_has_strongly_negative_tau(self): + from src.scoring import tau_calibrate + up = [f"U{i}" for i in range(5)] + down = [f"D{i}" for i in range(5)] + out = tau_calibrate(up, down, self._matrix(), n_null=300, seed=0) + # Ranked by spec_z (continuous); the specific reverser is the most negative. + assert out.loc["specific", "spec_z"] < -2 + assert out.loc["specific", "spec_z"] < out.loc["noise", "spec_z"] + assert out.loc["specific", "tau"] < -50 # tau also flags it (may saturate near -100) + assert out.loc["specific", "rank"] == 1 + + def test_rank_drugs_orders_by_reversal(): from src.scoring import rank_drugs genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]