Start the structure-based binding branch (PLAN §12), baseline-first.
- src/binding.py: validated RDKit ligand retrieval (morgan_fp, tanimoto,
retrieve_nearest = the §12.9 engine) + dock() stub documenting the
blocked ARM-Mac toolchain
- scripts/binding_ligand_baseline.py: 300 drugs vs known binders
- docs/structure_binding_notes.md: status, toolchain blocker, next steps
- pyproject: [structure] extra (rdkit); data/raw/structures/ for PDBs
Step-0 finding: retrieval engine VALIDATED on in-set classes
(decitabine->azacitidine 0.62; vorinostat->scriptaid/belinostat) but the
distinctive binders voxelotor/mitapivat have no analog in our 300-drug
set (Tanimoto ~0.2). Needs (a) bigger library, (b) real docking (§12.3),
which is blocked on the ARM-Mac docking toolchain (§12.6 pitfall 4).
Structures 5E83 (Hb+voxelotor) and 8XFD (PKR+mitapivat) fetched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
(spec_z = SDs below a 1,000 random-query null). Primary ranking is
now spec_z. (Textbook tau saturated at +/-100 for a coherent query —
documented; needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison
Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming the v1 failure was the landmark bottleneck not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3 —
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implement the matching engine (PLAN §6 Week 3):
- src/scoring.py: weighted-KS/GSEA enrichment, weighted connectivity
score (WTCS, Lamb 2006 / Subramanian 2017), signed NCS normalization,
rank_drugs, and a sickle-pathway mechanistic prior
- tests/test_scoring.py: real reference tests for the scorer (perfect
reversal<null<mimic, same-sign->0, absent-gene invariance) + prior
- week3_scoring.py: score 300 drugs -> ranked_candidates_v1.csv with a
raw ranking and a secondary mechanistic-prior-weighted ranking
Preliminary (formal recovery test is Week 4): hydroxyurea raw rank
40/300 (top 13%, just misses pre-registered top-10%), blended rank 7;
L-glutamine WTCS=0 (ambiguous). Notably anti-inflammatory SCD drugs
cluster in the raw top tier — the engine reverses the inflammation axis,
not the erythroid axis, traceable to the 12% landmark-overlap caveat.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>