3 Commits

Author SHA1 Message Date
3417f85eb1 v1.1: full gene space + specificity z-score; hydroxyurea recovers
Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
  lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
  (spec_z = number of SDs a drug's score falls below the mean of a
  null of 1,000 random queries). Primary ranking is now spec_z.
  (Textbook tau saturated at +/-100 for a coherent query; this is
  documented and needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming that the v1 failure came from the landmark-gene bottleneck,
not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3:
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:57:30 +02:00
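
The specificity z-score named in the v1.1 message above can be illustrated with a minimal sketch. This is not the code in src/scoring.py: the function names, the dict-based signature layout, the way random queries are drawn (random gene sets of the same sizes as the real query), and the stand-in mean-difference score are all assumptions made for illustration. Only the idea "SDs below a 1,000 random-query null" comes from the commit message.

```python
import random
import statistics


def mean_difference_score(signature, up_genes, down_genes):
    """Stand-in connectivity score (an assumption, not the repo's KS score).

    Mean drug response of the query's up genes minus that of its down
    genes. Negative values mean the drug opposes the query.
    """
    up = statistics.fmean(signature[g] for g in up_genes)
    down = statistics.fmean(signature[g] for g in down_genes)
    return up - down


def spec_z(signature, up_genes, down_genes, n_null=1000, seed=0,
           score_fn=mean_difference_score):
    """Number of SDs the observed score lies below the null mean.

    signature: dict mapping gene -> drug differential-expression value.
    The null is built by scoring n_null random queries with the same
    up/down set sizes against the same drug signature (assumed scheme).
    Larger positive values mean a more specific reversal.
    """
    rng = random.Random(seed)
    genes = sorted(signature)  # sorted for reproducible sampling
    n_up, n_down = len(up_genes), len(down_genes)
    observed = score_fn(signature, up_genes, down_genes)

    null = []
    for _ in range(n_null):
        picks = rng.sample(genes, n_up + n_down)
        null.append(score_fn(signature, picks[:n_up], picks[n_up:]))

    sd = statistics.stdev(null)
    if sd == 0:
        return 0.0  # degenerate null: no specificity information
    return (statistics.fmean(null) - observed) / sd
```

Under this sign convention, a drug that reverses the query gets a large positive spec_z and a drug that mimics it gets a large negative one. Passing the scorer in as `score_fn` keeps the null construction independent of which connectivity score is used.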
fd4591949c Week 3: CMap connectivity scoring engine + ranked candidates
Implement the matching engine (PLAN §6 Week 3):
- src/scoring.py: weighted-KS/GSEA enrichment, weighted connectivity
  score (WTCS, Lamb 2006 / Subramanian 2017), signed NCS normalization,
  rank_drugs, and a sickle-pathway mechanistic prior
- tests/test_scoring.py: real reference tests for the scorer (perfect
  reversal < null < mimic, same-sign -> 0, absent-gene invariance),
  plus tests for the prior
- week3_scoring.py: score 300 drugs -> ranked_candidates_v1.csv with a
  raw ranking and a secondary mechanistic-prior-weighted ranking

Preliminary (formal recovery test is Week 4): hydroxyurea raw rank
40/300 (top 13%, just misses pre-registered top-10%), blended rank 7;
L-glutamine WTCS=0 (ambiguous). Notably, anti-inflammatory SCD drugs
cluster in the raw top tier: the engine reverses the inflammation axis,
not the erythroid axis, traceable to the 12% landmark-overlap caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:34:56 +02:00
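
The scorer properties listed in the Week 3 message above (reversal below null below mimic, same-sign giving 0, absent-gene invariance) follow from the usual weighted connectivity score definition, sketched here. This is a hedged illustration rather than the contents of src/scoring.py: the function names, the dict-based signature, and the choice to return 0 when either enrichment score is exactly zero are assumptions. NCS normalization and the mechanistic prior are not shown.

```python
def enrichment_score(signature, gene_set, weight=1.0):
    """GSEA-style weighted KS enrichment of gene_set in a drug signature.

    signature: dict mapping gene -> drug differential-expression value.
    Genes in gene_set that are absent from the signature are ignored.
    Returns the signed maximum deviation of the running sum from zero.
    """
    hits = set(gene_set) & set(signature)
    if not hits or len(hits) == len(signature):
        return 0.0
    hit_total = sum(abs(signature[g]) ** weight for g in hits)
    if hit_total == 0:
        return 0.0

    ranked = sorted(signature, key=signature.get, reverse=True)
    miss_step = 1.0 / (len(ranked) - len(hits))
    running = 0.0
    best = 0.0
    for gene in ranked:
        if gene in hits:
            running += abs(signature[gene]) ** weight / hit_total
        else:
            running -= miss_step
        if abs(running) > abs(best):
            best = running
    return best


def wtcs(signature, up_genes, down_genes):
    """Weighted connectivity score of an up/down query against one drug.

    (ES_up - ES_down) / 2 when the two enrichment scores have opposite
    signs, otherwise 0. Negative values indicate reversal of the query.
    """
    es_up = enrichment_score(signature, up_genes)
    es_down = enrichment_score(signature, down_genes)
    if es_up * es_down >= 0:  # same sign, or one is zero (assumed rule)
        return 0.0
    return (es_up - es_down) / 2.0
```

With a signature whose strongest up-regulated genes are exactly the query's down set and whose strongest down-regulated genes are exactly the query's up set, this sketch returns a score close to -1, and swapping the two sets gives a score close to +1.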
b731478f5d Scaffold Reverso MVP pipeline structure
Set up the project skeleton per PLAN.md §4:
- src/ package: identifiers, disease, drugs, scoring, provenance
  with pydantic schemas and confidence-tier logic (working);
  data-pull/compute functions stubbed per their build week
- 5 starter notebooks (01-05) with PLAN-referenced steps
- tests/test_scoring.py: tier-assignment tests pass; scoring
  reference test xfail until Week 3
- docs/: recovery_test_report, data_sources, known_limitations skeletons
- pyproject.toml (requires-python >=3.11,<3.14), .gitignore, README
- data/ tree preserved via .gitkeep; raw/processed/results gitignored

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 20:20:09 +02:00