v1.1: full gene space + specificity z-score; hydroxyurea recovers

Post-hoc improvement after the pre-registered v1 recovery test failed. Two changes, diagnosing v1's failure:

- score on the full 12,328-gene LINCS space (week2_lincs_extract.py), lifting signature overlap from 12% to ~85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score (spec_z = SDs below a 1,000 random-query null). Primary ranking is now spec_z. (Textbook tau saturated at +/-100 for a coherent query; this is documented, and fixing it needs a reference-signature library, a v2 item.)

Supporting files:

- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%), confirming the v1 failure was the landmark bottleneck, not the algorithm.

Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite), and negative controls (norethindrone, ciprofloxacin) rank top-3, i.e. connectivity != therapeutic relatedness.

v1.1 is post-hoc/exploratory, not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
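
Illustration of the spec_z idea from the message above. This is a minimal sketch, not the code in src/scoring.py. Assumptions: the connectivity score here is a simple stand-in (mean over up genes minus mean over down genes), not the KS statistic; the null is built from random gene sets of the same sizes as the query; and the sign convention (positive spec_z = more reversing than random queries) is a guess at what "SDs below the null" means. All function and variable names are hypothetical.

```python
import numpy as np


def simple_connectivity(drug_profile, up_idx, down_idx):
    """Stand-in score (NOT the KS statistic): mean drug z-score over the
    disease-up genes minus the mean over the disease-down genes.
    Negative values mean the drug opposes the disease signature."""
    return drug_profile[up_idx].mean() - drug_profile[down_idx].mean()


def spec_z(drug_profile, up_idx, down_idx, n_null=1000, seed=0):
    """Number of null SDs by which the observed score falls below the mean
    score of n_null random queries with the same up/down set sizes."""
    rng = np.random.default_rng(seed)
    observed = simple_connectivity(drug_profile, up_idx, down_idx)
    n_up, n_down = len(up_idx), len(down_idx)
    null = np.empty(n_null)
    for i in range(n_null):
        # Draw a random query: disjoint random "up" and "down" gene sets.
        picks = rng.choice(drug_profile.shape[0], size=n_up + n_down, replace=False)
        null[i] = simple_connectivity(drug_profile, picks[:n_up], picks[n_up:])
    return (null.mean() - observed) / null.std(ddof=1)


# Toy data: 12,328 genes, a 50-up / 50-down query, and a drug profile that
# pushes the query genes in the opposite direction (a planted reversal).
rng = np.random.default_rng(42)
n_genes = 12328
query = rng.choice(n_genes, size=100, replace=False)
up_idx, down_idx = query[:50], query[50:]

profile = rng.standard_normal(n_genes)
profile[up_idx] = -3.0   # drug lowers the disease-up genes
profile[down_idx] = 3.0  # drug raises the disease-down genes

z_reverse = spec_z(profile, up_idx, down_idx)
z_mimic = spec_z(-profile, up_idx, down_idx)  # sign-flipped drug mimics the disease
print(f"reversing drug spec_z = {z_reverse:.1f}, mimicking drug spec_z = {z_mimic:.1f}")
```

Because the null is drawn per drug, a drug that scores strongly against almost any query gets a wide null and a small spec_z, which is the specificity property the message relies on.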
2026-06-23 22:57:30 +02:00
parent 72f1a49de6
commit 3417f85eb1
9 changed files with 378 additions and 150 deletions
--- a/scripts/week2_lincs_extract.py
+++ b/scripts/week2_lincs_extract.py
@@ -36,9 +36,15 @@ def read_gz_tsv(name: str) -> pd.DataFrame:


 def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
-    lm = pd.read_csv(LINCS / "landmark_genes.csv")
-    ids = [str(x) for x in lm["pr_gene_id"]]
-    id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in lm.itertuples()}
+    """Gene row-ids + id->symbol map for the scored gene space.
+
+    v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks.
+    This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into
+    scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier).
+    """
+    g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
+    ids = [str(x) for x in g["pr_gene_id"]]
+    id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
    return ids, id_to_symbol