v1.1: full gene space + specificity z-score; hydroxyurea recovers

Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
  lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
  (spec_z = SDs below a 1,000 random-query null). Primary ranking is
  now spec_z. (Textbook tau saturated at +/-100 for a coherent query —
  documented; needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming the v1 failure was the landmark bottleneck not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3 —
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-23 22:57:30 +02:00
parent 72f1a49de6
commit 3417f85eb1
9 changed files with 378 additions and 150 deletions

View File

@@ -61,9 +61,10 @@ Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action. 38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's - **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
not-single-source rule). not-single-source rule).
- **Signature↔landmark overlap:** only 56/477 (12%) of the disease signature genes are LINCS - **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978
landmarks, so connectivity scoring (Week 3) uses a 30-up/26-down query. The erythroid hallmark landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger
genes (CA1, AHSP, SLC4A1, HBG) are NOT landmarks. This is a key limitation for the recovery test. space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from
LINCS entirely and remain unscoreable.
- Reproduce: `week2_curate_drugset.py``week2_chembl.py` → download Level-5 GCTX → - Reproduce: `week2_curate_drugset.py``week2_chembl.py` → download Level-5 GCTX →
`week2_lincs_extract.py``week2_assemble.py`. `week2_lincs_extract.py``week2_assemble.py`.

View File

@@ -34,20 +34,28 @@ Source: PLAN.md §9.
7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses 7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
to test, not discoveries. Use careful language in any write-up. to test, not discoveries. Use careful language in any write-up.
8. **Only 12% of the signature is LINCS-scorable (56/477 genes).** The 978 landmark genes (from 8. **Gene-space bottleneck (v1 → fixed in v1.1).** v1 scored on only the 978 landmark genes (12%
cancer cell lines) miss the erythroid hallmark genes (CA1, AHSP, SLC4A1, HBG). Connectivity signature overlap) — the main driver of the v1 failure. v1.1 uses the full 12,328-gene space
scoring runs on a thin inflammation/metabolic slice — the single biggest driver of the (85% overlap) and recovers hydroxyurea. HBG1/HBG2 remain absent from LINCS entirely.
recovery-test failure. v2 fix: signature prediction or a mechanism graph to score the other 88%.
## Recovery test outcome (Week 4) 9. **No reference-signature library for tau.** Textbook CMap tau saturated at ±100 (a coherent
query always out-connects random gene sets). v1.1 substitutes a per-drug specificity z-score.
Proper tau needs a library of real reference signatures — a v2 / curated-data item.
The MVP **failed** all three pre-registered criteria on the primary raw ranking (hydroxyurea 10. **Negative-control criterion may be invalid for connectivity scoring.** Unrelated drugs
rank 40/top 13%; L-glutamine rank 100/WTCS=0; 1/5 negative controls in bottom half). The failure (norethindrone, ciprofloxacin) rank as top specific reversers — connectivity measures
is fully attributable to signature/assay data limitations above, not the matching algorithm. See expression reversal, not therapeutic relatedness.
## Recovery test outcome
Pre-registered test (**v1, confirmatory**): **FAILED** all three criteria (hydroxyurea rank
40/top 13%; L-glutamine rank 100; 1/5 negative controls bottom-half). Post-hoc (**v1.1,
exploratory**): hydroxyurea recovers to rank 18 (top 6%, passes), but L-glutamine (rank 213, does
not reverse) and negative controls (2/5) still fail → overall still FAIL. See
`recovery_test_report.md`. `recovery_test_report.md`.
| Drug | Issue | Handling | | Drug | Issue | v1.1 status |
|---|---|---| |---|---|---|
| hydroxyurea | HbF mechanism not in scorable gene space | scored (rank 40); recovered only by prior-weighted ranking | | hydroxyurea | needed the full gene space | rank 18 (top 6%) — recovered post-hoc |
| L-glutamine | signature present but WTCS ambiguous (=0) | scored (rank 100); no reversal signal | | L-glutamine | metabolite, no reversal signal (positive connectivity) | rank 213 — genuine negative |
| all 300 | had LINCS signatures | 0 marked "not scored" — coverage was not the issue; specificity was | | neg controls | reverse the generic inflammation signature | 2/5 bottom-half — criterion questionable |

View File

@@ -1,7 +1,7 @@
# Sickle Cell Repurposing — Recovery Test Report # Sickle Cell Repurposing — Recovery Test Report
> **Status: COMPLETE.** Reproduce with `scripts/week1_*` → `week2_*` → `week3_scoring.py` → > **Status: COMPLETE (v1 confirmatory + v1.1 exploratory).** Reproduce with `scripts/week1_*` →
> `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist. > `week2_*` → `week3_scoring.py` → `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.
## Pre-registered success criteria ## Pre-registered success criteria
@@ -12,118 +12,116 @@ The MVP passes if:
missing LINCS signature, **AND** missing LINCS signature, **AND**
- At least **4 of 5** negative-control drugs rank in the **bottom half**. - At least **4 of 5** negative-control drugs rank in the **bottom half**.
_Pre-registered in the scaffold commit (`b731478`) before any scoring was run. Primary ranking _Pre-registered in the scaffold commit (`b731478`) before any scoring. **Primary (confirmatory)
= raw connectivity. The 5 negative controls were pre-specified by category rule (one per analysis = v1**: 978 landmark genes, weighted connectivity score (WTCS). The 5 negative controls
category, alphabetically first available) without inspecting ranks._ were pre-specified by category rule without inspecting ranks._
--- ---
## Section 1 — Methodology ## Section 1 — Methodology
We built a sickle cell disease signature from **two independent whole-blood microarray studies** A sickle cell disease signature was built from **two whole-blood microarray studies** (GSE35007
(GSE35007, Illumina, SS vs AA; GSE16728, Affymetrix, patient vs control), keeping the **671 Illumina SS-vs-AA; GSE16728 Affymetrix patient-vs-control), keeping the **671 genes concordant**
genes concordant** (q<0.05, same direction) across both a cross-platform, cross-population across both (q<0.05, same direction) a cross-platform Tier-A signature (250 up / 227 down).
Tier-A signature (250 up / 227 down). We built profiles for **300 small molecules** (2 Profiles were built for **300 small molecules** (2 ground-truth; 32 related-mechanism; 26
ground-truth: hydroxyurea, L-glutamine; 32 related-mechanism; 26 negative controls; 240 random), negative controls; 240 random), each with a **LINCS L1000** consensus signature (mean Level-5
each with a consensus **LINCS L1000** signature (mean of Level-5 MODZ z-scores across cell MODZ across cell lines, both CMap phases). Drugs were ranked by **CMap connectivity scoring**
lines, 978 landmark genes, both CMap phases). We ranked drugs by **CMap connectivity scoring** (Kolmogorov-Smirnov, Lamb 2006 / Subramanian 2017): negative = reversal = candidate.
(weighted-KS, Lamb 2006 / Subramanian 2017): strongly negative = strong reversal of the disease
signature = candidate. A secondary ranking blends connectivity with a mechanistic prior over
sickle-relevant target pathways.
## Section 2 — Recovery test result — **FAIL** (primary ranking) **v1 (pre-registered/confirmatory):** scored on the 978 landmark genes with WTCS.
**v1.1 (post-hoc/exploratory):** after v1 failed, two changes were made to diagnose why (a)
score on the full **12,328-gene** space (landmark overlap 12% 85%, bringing the erythroid
markers in); (b) add a **per-drug specificity z-score** (`spec_z`): how many SDs the real
connectivity is below a null of 1,000 random queries of the same size against that drug. Because
these changes followed inspection of the v1 result, **v1.1 is exploratory, not a confirmatory
test of the pre-registered hypothesis.**
| Drug | Rank | Percentile | Pass? | ## Section 2 — Recovery test result
|---|---|---|---|
| Hydroxyurea | 40 / 300 | top 13.3% | (needs top 30) |
| L-glutamine | 100 / 300 | top 33.3% | (WTCS=0, ambiguous; has a signature so not "missing") |
Negative controls (pre-specified; expected: bottom half): | Criterion | v1 (confirmatory) | v1.1 (exploratory) |
|---|---|---|
| Hydroxyurea top-10% (≤30) | rank **40** (13.3%) | rank **18** (6.0%) |
| L-glutamine top-25% (≤75) | rank 100, WTCS=0 | rank 213, spec_z=+0.98 |
| 4/5 neg controls bottom-half | 1/5 | 2/5 |
| **Overall** | **FAIL** | **FAIL** (but hydroxyurea recovered) |
| Control | Category | Rank | Bottom half? | v1.1 negative controls: clotrimazole 258 ✅, astemizole 211 ✅, azithromycin 142 ❌,
|---|---|---|---| ethinyl-estradiol 114 ❌, caffeine 77 ❌.
| clotrimazole | antifungal | 89 | |
| astemizole | antihistamine | 291 | |
| azithromycin | antibiotic | 82 | |
| ethinyl-estradiol | hormone | 98 | |
| caffeine | misc | 84 | |
**Only 1/5 negative controls in the bottom half (need ≥4).** **Honest reading.** The **pre-registered test FAILED (v1).** The post-hoc v1.1 changes
**recover hydroxyurea** (rank 40 18, passing top-10%) strong evidence that the v1 failure was
driven by the 978-landmark bottleneck, not the algorithm. But two failures survive into v1.1, and
both are now precisely diagnosed:
**Overall: FAIL on all three pre-registered criteria.** This is reported as-is, without 1. **L-glutamine does not reverse the signature** (positive connectivity, spec_z=+0.98). This is
adjustment. For context only (not the pre-registered criterion): the secondary intrinsic to its LINCS data a metabolite with no reversal signal not a coverage gap. More
mechanistic-prior ranking places hydroxyurea at **rank 7 (top 2.3%)** but that ranking uses genes cannot fix it.
prior knowledge of the drug's target, so it cannot be claimed as a blind recovery. 2. **The negative-control criterion is arguably invalid for connectivity scoring.** Two
"negative controls" (norethindrone, ciprofloxacin) rank in the top 3 by spec_z. Connectivity
measures *expression reversal*, not *therapeutic relatedness* an antibiotic or contraceptive
can still down-regulate the inflammation genes that dominate the scorable signature. The test
design conflates the two.
**Why it failed — the honest diagnosis.** The disease signature is dominated by erythroid / A note on the calibration: textbook CMap **tau** (percentile vs a reference population) was
reticulocyte biology (CA1, AHSP, SLC4A1) and the HbF axis that hydroxyurea actually acts on implemented but **saturated at ±100** here, because a coherent real query always out-connects
(HBG1/HBG2) was lost (flat in GSE35007; removed by GSE16728's globin-depleted prep). Worse, random gene sets proper tau needs a library of *real* reference signatures, which this MVP
only **56 of 477 signature genes (12%) are LINCS landmark genes** and none of the erythroid lacks. The continuous `spec_z` is the workable substitute.
hallmark genes are. So connectivity scoring ran on a thin, inflammation-heavy 30-up/26-down
query. The engine is effectively scoring reversal of sickle's *inflammation* axis, not its
*erythroid* axis which is why hydroxyurea (an HbF inducer / antiproliferative) is not
recovered, and why unrelated drugs get spurious mild-reversal scores (poor specificity).
## Section 3 — Top 10 candidates (raw connectivity) ## Section 3 — Top 10 candidates (v1.1 spec_z)
| Rank | Drug | Score | Known target / mechanism | Plausibility | | Rank | Drug | spec_z | Inclusion | Read |
|---|---|---|---|---| |---|---|---|---|---|
| 1 | laropiprant | 0.417 | Prostaglandin D2 receptor antagonist | Anti-inflammatory coherent with inflammation-axis reversal | | 1 | reserpic-acid | 3.80 | random | reserpine metabolite; non-obvious |
| 2 | BRD-K62768824 | 0.396 | (tool compound, no annotation) | Likely broad-effect false positive | | 2 | norethindrone | 3.78 | **negative control** | false positive (see §2) |
| 3 | BRD-K71353154 | 0.393 | (tool compound) | Likely false positive | | 3 | ciprofloxacin | 3.61 | **negative control** | false positive |
| 4 | lisinopril | 0.358 | ACE inhibitor | **Non-obvious; see §4** | | 4 | resveratrol | 3.46 | related-mechanism | antioxidant studied in SCD coherent |
| 5 | BRD-K53443165 | 0.358 | (tool compound) | Likely false positive | | 5 | BRD-K57490754 | 3.37 | random | tool compound |
| 6 | talnetant | 0.347 | Neurokinin-3 (NK3) receptor antagonist | No obvious sickle rationale | | 6 | anastrozole | 3.27 | random | aromatase inhibitor |
| 7 | BRD-K46936109 | 0.342 | (tool compound) | Likely false positive | | 710 | BRD-* / palmitoylethanolamide | ~3.1 | random | mostly tool compounds |
| 8 | lawsone | 0.340 | Naphthoquinone (henna pigment) | No obvious rationale; possible redox effect |
| 9 | BRD-K85763971 | 0.338 | (tool compound) | Likely false positive |
| 10 | BRD-K36516410 | 0.323 | (tool compound) | Likely false positive |
As anticipated (PLAN §9.4), the raw top-10 is dominated by unannotated broad-effect tool That two negative controls outrank hydroxyurea is the single most informative result here see §4.
compounds these are **not** credible candidates and are not over-interpreted.
## Section 4 — One non-obvious candidate worth investigating ## Section 4 — One non-obvious result worth investigating
**Lisinopril (ACE inhibitor), rank 4.** This is the most interesting non-obvious hit: ACE The most useful finding is **not** a candidate drug but the **negative-control failure**:
inhibitors are already used clinically in sickle cell disease for **renal protection** unrelated drugs (norethindrone, ciprofloxacin) score as strong specific reversers. This is a
(reducing albuminuria / progression of sickle nephropathy), via mechanisms independent of the real, generalizable lesson for a signature whose *scorable* portion is generic
HbF pathway. Surfacing an agent with a genuine, mechanistically distinct sickle-cell rationale inflammation/metabolic genes, connectivity rewards any broad transcriptional perturbation that
from an inflammation/vascular-flavoured signature is a small but real signal that the matching touches those genes. The honest implication: **this signature is not specific enough to
approach can point at non-obvious biology. **This is a computational hypothesis, not a discriminate true repurposing candidates from incidental expression reversers.** Of the
discovery**, and the connectivity rationale here (inflammation-axis reversal) is not the same as plausibly-real hits, **resveratrol (rank 4)** an antioxidant with prior sickle cell literature
lisinopril's known renal mechanism, so the match should be treated as suggestive only. is the most defensible, but it is a hypothesis, not a discovery.
## Section 5 — Honest limitations ## Section 5 — Honest limitations
1. **Cell-composition confound** the whole-blood signature is dominated by reticulocyte/ 1. **Pre-registered test failed; the pass is post-hoc.** v1.1's hydroxyurea recovery is
erythroid markers (composition, not pure disease-state regulation). v2 needs deconvolution. exploratory and must be re-validated on a held-out disease before any claim is made.
2. **Missing HbF axis** HBG1/HBG2 absent (globin depletion + flat in GSE35007), so the 2. **Missing HbF axis** HBG1/HBG2 are absent from LINCS entirely (not just landmarks), so the
signature cannot encode the pathway hydroxyurea acts on. pathway hydroxyurea acts on can never be scored by this method.
3. **12% signature↔landmark overlap** only 56/477 genes are LINCS landmarks; the erythroid 3. **Signature specificity** scorable genes are inflammation/metabolic; negative controls
hallmark genes are not scorable. The query collapses to a generic inflammation/metabolic slice. reverse them too. Connectivity therapeutic relatedness.
4. **LINCS cell-line bias** landmark signatures come from cancer cell lines (PLAN §9.2); poorly 4. **Cell-composition confound** the whole-blood signature is reticulocyte-dominated.
suited to a blood disease. 5. **LINCS cancer-cell-line bias**, and **no reference-signature library** for proper tau.
5. **Poor negative-control specificity** unrelated drugs received mild reversal scores; the 6. **No mechanistic validation** all hits are computational hypotheses.
thin query yields a noisy connectivity distribution.
6. **No mechanistic validation** these are connectivity hypotheses, not validated predictions.
## Section 6 — What v2 would fix ## Section 6 — What v2 would fix
- **Cell-type deconvolution** of the disease signature to separate disease-state regulation from - **A reference-signature library** to make tau (proper specificity calibration) work the
composition, recovering specificity. single biggest fix to the negative-control problem, and a direct use of the curated-data moat.
- **A non-globin-depleted, RNA-seq whole-blood study** to retain the HbF axis. - **Cell-type deconvolution** + a non-globin-depleted RNA-seq study to recover a more specific,
- **Signature prediction** (DeepCE-style) or a mechanism/knowledge graph to score the ~88% of HbF-containing signature.
the signature that has no LINCS landmark the single biggest lever on this result. - **Signature prediction / mechanism graph** to score genes with no LINCS measurement.
- **A second disease** to test generalization (sickle results alone do not prove the platform - **A second disease** to test generalization and to honestly re-validate the v1.1 method
PLAN §9.5). (PLAN §9.5).
--- ---
### Bottom line ### Bottom line
The pipeline is reproducible end-to-end and the method is sound, but on this signature it **does The pre-registered recovery test **failed**. Post-hoc diagnosis shows the dominant cause was a
not recover the known sickle cell drugs**. The failure is fully explained by signature/assay fixable gene-space bottleneck correcting it **recovers hydroxyurea** but also surfaces a
data limitations (erythroid biology lost; 12% landmark overlap), not by a flaw in the matching deeper, genuine limitation: this whole-blood signature is **not specific enough** for
algorithm. The most valuable output of this MVP is therefore a precise, honest map of *what data connectivity scoring to separate real candidates from incidental reversers (negative controls
quality the method needs to work* which is exactly the de-risking the proof-of-concept was rank at the top). The MVP's real deliverable is a precise, honest map of *what it takes to make
meant to deliver. this method work*: a more specific (deconvolved, HbF-containing) signature and a reference library
for calibration exactly the curated-data investments the platform thesis is built on.

102
scripts/exp_genespace.py Normal file
View File

@@ -0,0 +1,102 @@
"""Experiment (v1.1): re-score on a larger LINCS gene space and re-run the recovery test.
v1 used only the 978 landmark genes (12% signature overlap). This re-slices the SAME GCTX files
to the BING space (~10,174) and the full 12,328-gene space, re-aggregates per-drug consensus
signatures, re-scores connectivity, and evaluates the pre-registered recovery criteria — so we
can see whether hydroxyurea recovers. Writes nothing to the committed v1 artifacts.
"""
from __future__ import annotations
import gzip
import io
import json
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import rank_drugs # noqa: E402
LINCS = Path("data/raw/lincs")
PROCESSED = Path("data/processed")
GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
def read_gz(name):
return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
def gene_ids_for_space(space: str):
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
if space == "bing":
g = g[g.pr_is_bing == 1]
# 'all' -> keep everything
ids = [str(x) for x in g.pr_gene_id]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
return ids, id_to_symbol
def extract(space, drug_names, gene_ids, id_to_symbol):
from cmapPy.pandasGEXpress.parse import parse
frames = []
for ph in (1, 2):
sig = read_gz(SIG_INFO[ph])
sig = sig[(sig.pert_type == "trt_cp") & (sig.pert_iname.isin(drug_names))]
if sig.empty:
continue
gct = parse(str(GCTX[ph]), rid=gene_ids, cid=sig.sig_id.tolist())
data = gct.data_df
s2d = dict(zip(sig.sig_id, sig.pert_iname))
frames.append(data.T.groupby(data.columns.map(s2d)).mean())
print(f" [{space}] phase {ph}: {sig.pert_iname.nunique()} drugs sliced", flush=True)
combined = pd.concat(frames).groupby(level=0).mean()
combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]
combined = combined.loc[:, ~combined.columns.duplicated()] # drop dup symbols
return combined
def evaluate(space, sig_matrix, up, down):
landmark_overlap = None
ranked = rank_drugs(up, down, sig_matrix)
n = len(ranked)
top10, top25, half = int(n * 0.10), int(n * 0.25), n // 2
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
glut_s = ranked.loc["glutamine", "connectivity_score"]
n_overlap = len((set(up) | set(down)) & set(sig_matrix.columns))
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
print(f"\n=== gene space: {space.upper()} ({sig_matrix.shape[1]} genes; query overlap {n_overlap}) ===")
print(f" hydroxyurea: rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}")
print(f" L-glutamine: rank {glut}/{n} (top {100*glut/n:.1f}%), WTCS={glut_s:.3f} top-25%? {glut <= top25}")
print(f" neg controls in bottom half: {n_bottom}/5 {negs}")
crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
print(f" OVERALL: {'PASS' if crit else 'FAIL'}")
print(" top 8:")
for name, r in ranked.nsmallest(8, "connectivity_score").iterrows():
print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} [{r['inclusion_reason']}]")
return ranked
def main():
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
up = [g["gene"] for g in sig["up_regulated"]]
down = [g["gene"] for g in sig["down_regulated"]]
drug_names = set(pd.read_csv(PROCESSED / "drug_set_v1.csv").pert_iname)
for space in ("bing", "all"):
print(f"\n>>> extracting {space} ...", flush=True)
ids, id2sym = gene_ids_for_space(space)
mat = extract(space, drug_names, ids, id2sym)
evaluate(space, mat, up, down)
if __name__ == "__main__":
main()

View File

@@ -36,9 +36,15 @@ def read_gz_tsv(name: str) -> pd.DataFrame:
def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]: def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
lm = pd.read_csv(LINCS / "landmark_genes.csv") """Gene row-ids + id->symbol map for the scored gene space.
ids = [str(x) for x in lm["pr_gene_id"]]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in lm.itertuples()} v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks.
This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into
scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier).
"""
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
ids = [str(x) for x in g["pr_gene_id"]]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
return ids, id_to_symbol return ids, id_to_symbol

View File

@@ -1,13 +1,11 @@
"""Week 3: run connectivity scoring over all drugs -> ranked_candidates_v1.csv (PLAN §6). """Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration.
Loads the disease signature + the 300 drug LINCS signatures, computes the weighted-KS Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of
connectivity score per drug, and produces two rankings: random queries) — this calibrates out broad-effect drugs that connect to random signatures too,
1. raw connectivity (most negative = strongest reversal = rank 1) the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column,
2. a secondary ranking blending connectivity with a mechanistic prior (sickle-relevant and a secondary ranking blends tau with the sickle-pathway mechanistic prior.
target pathways), to temper broad-effect drugs (HDAC/kinase) that dominate raw rankings.
The formal recovery test (ground-truth + negative-control evaluation against the pre-registered Output: data/results/ranked_candidates_v1.csv.
criteria) is Week 4; this script only prints a sanity peek.
""" """
from __future__ import annotations from __future__ import annotations
@@ -19,10 +17,13 @@ import pandas as pd
import sys import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import mechanistic_prior, persist_ranking, rank_drugs # noqa: E402 from src.scoring import ( # noqa: E402
connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate,
)
PROCESSED = Path("data/processed") PROCESSED = Path("data/processed")
PRIOR_LAMBDA = 0.5 # weight of the mechanistic prior in the secondary ranking N_NULL = 1000
PRIOR_LAMBDA = 0.5 # spec_z credit per matched sickle pathway, for the blended ranking
def main() -> None: def main() -> None:
@@ -30,52 +31,51 @@ def main() -> None:
up = [g["gene"] for g in sig["up_regulated"]] up = [g["gene"] for g in sig["up_regulated"]]
down = [g["gene"] for g in sig["down_regulated"]] down = [g["gene"] for g in sig["down_regulated"]]
sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 978 symbols sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 12,328 genes
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name") profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
landmark = set(sig_matrix.columns) n_up = len(set(up) & set(sig_matrix.columns))
n_up_ov = len(set(up) & landmark) n_down = len(set(down) & set(sig_matrix.columns))
n_down_ov = len(set(down) & landmark) print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}")
print(f"query overlap with 978 landmarks: {n_up_ov} up + {n_down_ov} down = {n_up_ov + n_down_ov}")
print(f"scoring {len(sig_matrix)} drugs (all scored; 0 without signature)")
ranked = rank_drugs(up, down, sig_matrix) # primary: tau calibration
print(f"computing tau over {N_NULL} random-query permutations ...", flush=True)
ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL)
# attach metadata + mechanistic prior # reference: weighted connectivity score (WTCS) + NCS
wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index},
name="connectivity_score")
ranked["connectivity_score"] = wtcs
ranked["normalized_score"] = normalize_scores(wtcs)
# metadata + mechanistic prior
ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]]) ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
ranked["mechanistic_prior"] = ranked["targets"].apply( ranked["mechanistic_prior"] = ranked["targets"].apply(
lambda t: mechanistic_prior(list(t) if t is not None else []) lambda t: mechanistic_prior(list(t) if t is not None else []))
)
ranked["known_targets"] = ranked["targets"].apply( ranked["known_targets"] = ranked["targets"].apply(
lambda t: "; ".join(t) if t is not None and len(t) else "" lambda t: "; ".join(t) if t is not None and len(t) else "")
)
ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"}) ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})
# secondary, prior-weighted ranking: relevant drugs pushed toward better (more negative) # secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z)
ranked["blended_score"] = ranked["normalized_score"] - PRIOR_LAMBDA * ranked["mechanistic_prior"] ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int) ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)
out = ranked.rename_axis("drug_name").reset_index()[[ out = ranked.rename_axis("drug_name").reset_index()[[
"rank", "drug_name", "chembl_id", "connectivity_score", "normalized_score", "rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score",
"inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary", "inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
]] ]]
path = persist_ranking(out) path = persist_ranking(out)
print(f"wrote {path} ({len(out)} drugs)") print(f"wrote {path} ({len(out)} drugs)")
# --- sanity peek (formal recovery test is Week 4) --- print("\n--- sanity peek (spec_z ranking) ---")
print("\n--- sanity peek (raw connectivity rank) ---")
for gt in ["hydroxyurea", "glutamine"]: for gt in ["hydroxyurea", "glutamine"]:
r = ranked.loc[gt] r = ranked.loc[gt]
pct = 100 * r["rank"] / len(ranked) print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), "
print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {pct:.0f}%), " f"spec_z={r['spec_z']:.2f}")
f"score={r['connectivity_score']:.3f}") print(" top 10 by spec_z:")
neg = ranked[ranked["inclusion_reason"] == "negative_control"] for name, r in ranked.nsmallest(10, "spec_z").iterrows():
print(f" negative controls in bottom half: " print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] "
f"{(neg['rank'] > len(ranked) / 2).sum()}/{len(neg)}") f"{str(r['known_targets'])[:38]}")
print("\n top 5 raw candidates:")
for name, r in ranked.nsmallest(5, "connectivity_score").iterrows():
print(f" {int(r['rank']):3d} {name:18s} {r['connectivity_score']:+.3f} "
f"[{r['inclusion_reason']}] {r['known_targets'][:50]}")
if __name__ == "__main__": if __name__ == "__main__":

View File

@@ -36,6 +36,7 @@ def main() -> None:
return int(df.loc[name, "rank"]) if name in df.index else None return int(df.loc[name, "rank"]) if name in df.index else None
hu, glut = rk("hydroxyurea"), rk("glutamine") hu, glut = rk("hydroxyurea"), rk("glutamine")
glut_z = df.loc["glutamine", "spec_z"]
# pick negative controls present in the ranking # pick negative controls present in the ranking
negs = {} negs = {}
@@ -47,9 +48,8 @@ def main() -> None:
print("=" * 60) print("=" * 60)
print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}") print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}") print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}")
glut_score = df.loc["glutamine", "connectivity_score"] print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), spec_z={glut_z:+.2f} "
print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), WTCS={glut_score:.3f} " f"-> top-25%? {glut <= top25_cut} (positive z => does not reverse; has a signature)")
f"-> top-25%? {glut <= top25_cut} (has signature, so NOT 'missing-signature unscorable')")
print("\nnegative controls (pre-specified, 1 per category):") print("\nnegative controls (pre-specified, 1 per category):")
n_bottom = 0 n_bottom = 0
for d, (cat, r) in negs.items(): for d, (cat, r) in negs.items():
@@ -70,10 +70,10 @@ def main() -> None:
print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} " print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
f"(top {100*hu_b/n:.1f}%)") f"(top {100*hu_b/n:.1f}%)")
print("\n--- TOP 10 (raw connectivity) ---") print("\n--- TOP 10 (primary spec_z ranking) ---")
top10 = df.nsmallest(10, "connectivity_score") top10 = df.sort_values("rank").head(10)
for name, r in top10.iterrows(): for name, r in top10.iterrows():
print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} " print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:+.2f} "
f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}") f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}")

View File

@@ -16,7 +16,7 @@ from pathlib import Path
import numpy as np import numpy as np
import pandas as pd import pandas as pd
from . import RESULTS_DIR from . import RANDOM_SEED, RESULTS_DIR
# Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3). # Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
# Keys are pathway categories; values are substrings matched (case-insensitive) against a # Keys are pathway categories; values are substrings matched (case-insensitive) against a
@@ -134,6 +134,92 @@ def mechanistic_prior(targets: list[str]) -> float:
return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values())) return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))
# --- KS connectivity + tau calibration (v1.1) -----------------------------------------------
# Unweighted Kolmogorov-Smirnov connectivity (Lamb 2006) is O(k) per query (depends only on the
# ranks of the query genes), which makes a permutation null over many random queries cheap. tau
# expresses each drug's real connectivity as a signed percentile within its own null — so
# broad-effect drugs that connect to *random* signatures too get down-weighted (specificity).
def _ks_es(rank_matrix: np.ndarray, query_cols: np.ndarray, n_genes: int) -> np.ndarray:
"""Vectorized unweighted KS enrichment score of a query gene set for every drug.
``rank_matrix`` is (n_drugs, n_genes) of 1..N rank positions (1 = most up-regulated).
Returns one ES per drug; ES>0 => query enriched among up-regulated genes.
"""
k = len(query_cols)
if k == 0:
return np.zeros(rank_matrix.shape[0])
p = np.sort(rank_matrix[:, query_cols], axis=1) # (n_drugs, k), positions ascending
j = np.arange(1, k + 1)
a = (j / k - p / n_genes).max(axis=1)
b = (p / n_genes - (j - 1) / k).max(axis=1)
return np.where(a >= b, a, -b)
def _ks_connectivity(rank_matrix: np.ndarray, up_cols: np.ndarray, down_cols: np.ndarray,
n_genes: int) -> np.ndarray:
"""KS connectivity per drug: (ES_up - ES_down)/2. Negative=reversal.
Note: unlike WTCS, this does NOT zero same-sign (ambiguous) connections — same-sign ES
partially cancel and land near 0 naturally. Hard-zeroing would collapse the random-query
null to a spike at 0 and make tau saturate, so the continuous form is required for calibration.
"""
es_up = _ks_es(rank_matrix, up_cols, n_genes)
es_down = _ks_es(rank_matrix, down_cols, n_genes)
return (es_up - es_down) / 2.0
def tau_calibrate(
up_genes: list[str],
down_genes: list[str],
signature_matrix: pd.DataFrame,
n_null: int = 1000,
seed: int = RANDOM_SEED,
) -> pd.DataFrame:
"""Rank drugs by tau: each drug's KS connectivity as a signed percentile vs a null of
random queries of the same size (PLAN §6; CMap tau, Subramanian 2017).
tau in [-100, 100]: -100 => reverses our signature more specifically than any random query
(strong, specific candidate); ~0 => connects to our signature no more than to random ones
(broad-effect / non-specific). Ranked by tau ascending (rank 1 = most specific reversal).
"""
genes = list(signature_matrix.columns)
gene_to_col = {g: i for i, g in enumerate(genes)}
n = len(genes)
rank_matrix = signature_matrix.rank(axis=1, ascending=False).to_numpy()
up_cols = np.array([gene_to_col[g] for g in set(up_genes) if g in gene_to_col], dtype=int)
down_cols = np.array([gene_to_col[g] for g in set(down_genes) if g in gene_to_col], dtype=int)
real = _ks_connectivity(rank_matrix, up_cols, down_cols, n)
rng = np.random.default_rng(seed)
k_up, k_down = len(up_cols), len(down_cols)
null = np.empty((rank_matrix.shape[0], n_null))
for m in range(n_null):
samp = rng.choice(n, size=k_up + k_down, replace=False)
null[:, m] = _ks_connectivity(rank_matrix, samp[:k_up], samp[k_up:], n)
null_mean = null.mean(axis=1)
null_std = null.std(axis=1)
null_std[null_std == 0] = np.nan
# Per-drug standardized connectivity: how many SDs the real reversal is below what random
# queries of the same size produce against THIS drug. Continuous (no percentile floor), so it
# discriminates within the strong-reversal tail. Negative = specific reversal.
spec_z = (real - null_mean) / null_std
frac = (null <= real[:, None]).mean(axis=1)
tau = 100.0 * (2.0 * frac - 1.0) # retained for reference; saturates at +/-100 in the tail
df = pd.DataFrame(
{"connectivity_ks": real, "null_mean": null_mean, "spec_z": spec_z, "tau": tau},
index=signature_matrix.index,
)
df = df.sort_values("spec_z") # most negative z = most specific reversal
df.insert(0, "rank", range(1, len(df) + 1))
return df
def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path: def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
"""Write the ranked candidate list to ``data/results/ranked_candidates_v1.csv``.""" """Write the ranked candidate list to ``data/results/ranked_candidates_v1.csv``."""
out_path = out_path or (RESULTS_DIR / "ranked_candidates_v1.csv") out_path = out_path or (RESULTS_DIR / "ranked_candidates_v1.csv")

View File

@@ -114,6 +114,33 @@ class TestMechanisticPrior:
assert mechanistic_prior(["Some unrelated kinase"]) == 0.0 assert mechanistic_prior(["Some unrelated kinase"]) == 0.0
class TestTauCalibration:
"""tau should reward a SPECIFIC reverser and give a near-zero score to a noise drug."""
@staticmethod
def _matrix() -> pd.DataFrame:
genes = [f"U{i}" for i in range(5)] + [f"D{i}" for i in range(5)] + [f"G{i}" for i in range(40)]
rng_vals = {g: 0.01 * ((hash(g) % 7) - 3) for g in genes} # tiny deterministic noise
# specific reverser: query-up genes at the bottom, query-down at the top, rest ~0
specific = dict(rng_vals)
for i in range(5):
specific[f"U{i}"] = -8 - i
specific[f"D{i}"] = 8 + i
noise = dict(rng_vals)
return pd.DataFrame([specific, noise], index=["specific", "noise"])[genes]
def test_specific_reverser_has_strongly_negative_tau(self):
from src.scoring import tau_calibrate
up = [f"U{i}" for i in range(5)]
down = [f"D{i}" for i in range(5)]
out = tau_calibrate(up, down, self._matrix(), n_null=300, seed=0)
# Ranked by spec_z (continuous); the specific reverser is the most negative.
assert out.loc["specific", "spec_z"] < -2
assert out.loc["specific", "spec_z"] < out.loc["noise", "spec_z"]
assert out.loc["specific", "tau"] < -50 # tau also flags it (may saturate near -100)
assert out.loc["specific", "rank"] == 1
def test_rank_drugs_orders_by_reversal(): def test_rank_drugs_orders_by_reversal():
from src.scoring import rank_drugs from src.scoring import rank_drugs
genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)] genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]