v1.1: full gene space + specificity z-score; hydroxyurea recovers

Post-hoc improvement after the pre-registered v1 recovery test failed. Two changes, diagnosing v1's failure: - score on the full 12,328-gene LINCS space (week2_lincs_extract.py), lifting signature overlap from 12% to 85% (brings erythroid markers in) - src/scoring.py: KS connectivity + per-drug specificity z-score (spec_z = SDs below a 1,000 random-query null). Primary ranking is now spec_z. (Textbook tau saturated at +/-100 for a coherent query — documented; needs a reference-signature library, a v2 item.) - week3_scoring.py: spec_z primary + WTCS reference + prior-blended - tests: tau/spec_z calibration test; 19 passing - scripts/exp_genespace.py: the BING vs all-12,328 comparison Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%), confirming the v1 failure was the landmark bottleneck not the algorithm. Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite), and negative controls (norethindrone, ciprofloxacin) rank top-3 — connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory, not a confirmatory test; reported as such in recovery_test_report.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:57:30 +02:00
parent 72f1a49de6
commit 3417f85eb1
9 changed files with 378 additions and 150 deletions
--- a/scripts/week3_scoring.py
+++ b/scripts/week3_scoring.py
@@ -1,13 +1,11 @@
-"""Week 3: run connectivity scoring over all drugs -> ranked_candidates_v1.csv (PLAN §6).
+"""Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration.

-Loads the disease signature + the 300 drug LINCS signatures, computes the weighted-KS
-connectivity score per drug, and produces two rankings:
-  1. raw connectivity (most negative = strongest reversal = rank 1)
-  2. a secondary ranking blending connectivity with a mechanistic prior (sickle-relevant
-     target pathways), to temper broad-effect drugs (HDAC/kinase) that dominate raw rankings.
+Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of
+random queries) — this calibrates out broad-effect drugs that connect to random signatures too,
+the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column,
+and a secondary ranking blends tau with the sickle-pathway mechanistic prior.

-The formal recovery test (ground-truth + negative-control evaluation against the pre-registered
-criteria) is Week 4; this script only prints a sanity peek.
+Output: data/results/ranked_candidates_v1.csv.
 """

 from __future__ import annotations
@@ -19,10 +17,13 @@ import pandas as pd

 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
-from src.scoring import mechanistic_prior, persist_ranking, rank_drugs  # noqa: E402
+from src.scoring import (  # noqa: E402
+    connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate,
+)

 PROCESSED = Path("data/processed")
-PRIOR_LAMBDA = 0.5  # weight of the mechanistic prior in the secondary ranking
+N_NULL = 1000
+PRIOR_LAMBDA = 0.5  # spec_z credit per matched sickle pathway, for the blended ranking


 def main() -> None:
@@ -30,52 +31,51 @@ def main() -> None:
    up = [g["gene"] for g in sig["up_regulated"]]
    down = [g["gene"] for g in sig["down_regulated"]]

-    sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")  # drug x 978 symbols
+    sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")  # drug x 12,328 genes
    profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")

-    landmark = set(sig_matrix.columns)
-    n_up_ov = len(set(up) & landmark)
-    n_down_ov = len(set(down) & landmark)
-    print(f"query overlap with 978 landmarks: {n_up_ov} up + {n_down_ov} down = {n_up_ov + n_down_ov}")
-    print(f"scoring {len(sig_matrix)} drugs (all scored; 0 without signature)")
+    n_up = len(set(up) & set(sig_matrix.columns))
+    n_down = len(set(down) & set(sig_matrix.columns))
+    print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}")

-    ranked = rank_drugs(up, down, sig_matrix)
+    # primary: tau calibration
+    print(f"computing tau over {N_NULL} random-query permutations ...", flush=True)
+    ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL)

-    # attach metadata + mechanistic prior
+    # reference: weighted connectivity score (WTCS) + NCS
+    wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index},
+                     name="connectivity_score")
+    ranked["connectivity_score"] = wtcs
+    ranked["normalized_score"] = normalize_scores(wtcs)
+
+    # metadata + mechanistic prior
    ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
    ranked["mechanistic_prior"] = ranked["targets"].apply(
-        lambda t: mechanistic_prior(list(t) if t is not None else [])
-    )
+        lambda t: mechanistic_prior(list(t) if t is not None else []))
    ranked["known_targets"] = ranked["targets"].apply(
-        lambda t: "; ".join(t) if t is not None and len(t) else ""
-    )
+        lambda t: "; ".join(t) if t is not None and len(t) else "")
    ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})

-    # secondary, prior-weighted ranking: relevant drugs pushed toward better (more negative)
-    ranked["blended_score"] = ranked["normalized_score"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
+    # secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z)
+    ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
    ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)

    out = ranked.rename_axis("drug_name").reset_index()[[
-        "rank", "drug_name", "chembl_id", "connectivity_score", "normalized_score",
+        "rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score",
        "inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
    ]]
    path = persist_ranking(out)
    print(f"wrote {path} ({len(out)} drugs)")

-    # --- sanity peek (formal recovery test is Week 4) ---
-    print("\n--- sanity peek (raw connectivity rank) ---")
+    print("\n--- sanity peek (spec_z ranking) ---")
    for gt in ["hydroxyurea", "glutamine"]:
        r = ranked.loc[gt]
-        pct = 100 * r["rank"] / len(ranked)
-        print(f"  {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {pct:.0f}%), "
-              f"score={r['connectivity_score']:.3f}")
-    neg = ranked[ranked["inclusion_reason"] == "negative_control"]
-    print(f"  negative controls in bottom half: "
-          f"{(neg['rank'] > len(ranked) / 2).sum()}/{len(neg)}")
-    print("\n  top 5 raw candidates:")
-    for name, r in ranked.nsmallest(5, "connectivity_score").iterrows():
-        print(f"    {int(r['rank']):3d}  {name:18s} {r['connectivity_score']:+.3f}  "
-              f"[{r['inclusion_reason']}] {r['known_targets'][:50]}")
+        print(f"  {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), "
+              f"spec_z={r['spec_z']:.2f}")
+    print("  top 10 by spec_z:")
+    for name, r in ranked.nsmallest(10, "spec_z").iterrows():
+        print(f"    {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] "
+              f"{str(r['known_targets'])[:38]}")


 if __name__ == "__main__":