Commit Graph

11 Commits

Author SHA1 Message Date
51bd90df41 Redocking-RMSD validation fails 3/3: pipeline-quality issue
§12.4 de-biased validation (scripts/dock_validate.py).
Redock each co-crystal ligand into its own structure, RMSD vs crystal:
- voxelotor->Hb: NA (covalent binder, out of scope §12.7)
- mitapivat->PKR: 8.2A (allosteric, cofactors stripped)
- vorinostat->HDAC2 (4LXZ, zinc kept): 7.9A -- a CLASSICAL target that
  should have worked

The clean target also failing => systematic pipeline-quality problem,
not target choice. Cheap Vina + open-babel prep gives scores but doesn't
reproduce known geometry, so affinities aren't trustworthy. Ligand
efficiency over-corrects (ranks tiny hydroxyurea best). Fix needs
production prep (Meeko/AutoDockTools prepare_receptor + reduce) and an
in-place RMSD metric. Consistent with the project theme: the quick
version of every method runs but fails honest validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 07:28:47 +02:00
75f5383961 Docking baseline: toolchain solved, raw affinity is size-biased
§12.3-12.4 first binding result on ARM Mac.
- Toolchain SOLVED: AutoDock Vina 1.2.5 mac binary (Rosetta) + open-babel
  (brew). No conda, no MLX. dock_positive_controls.py runs end-to-end.
- Cross-dock known binders + negatives into Hb (5E83) and PKR (8XFD),
  box centered on co-crystal ligands (5L7=voxelotor, WV2=mitapivat).

Finding: raw Vina affinity ranks almost perfectly by MOLECULAR SIZE
(mitapivat > voxelotor > decitabine/caffeine > hydroxyurea) in both
pockets — mitapivat wins even on hemoglobin it doesn't target. Raw score
can't distinguish target-specific binding: the docking analog of the
connectivity specificity problem. Next: redocking-RMSD validation +
ligand-efficiency normalization.

Note: machine is 24GB (not 96GB per PLAN §2), capping local AF3-class
inference. tools/ gitignored (vina binary).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 00:03:00 +02:00
817bcda7dc Structure-binding track: scaffold + ligand-retrieval baseline
Start the structure-based binding branch (PLAN §12), baseline-first.

- src/binding.py: validated RDKit ligand retrieval (morgan_fp, tanimoto,
  retrieve_nearest = the §12.9 engine) + dock() stub documenting the
  blocked ARM-Mac toolchain
- scripts/binding_ligand_baseline.py: 300 drugs vs known binders
- docs/structure_binding_notes.md: status, toolchain blocker, next steps
- pyproject: [structure] extra (rdkit); data/raw/structures/ for PDBs

Step-0 finding: retrieval engine VALIDATED on in-set classes
(decitabine->azacitidine 0.62; vorinostat->scriptaid/belinostat) but the
distinctive binders voxelotor/mitapivat have no analog in our 300-drug
set (Tanimoto ~0.2). Needs (a) bigger library, (b) real docking (§12.3),
which is blocked on the ARM-Mac docking toolchain (§12.6 pitfall 4).
Structures 5E83 (Hb+voxelotor) and 8XFD (PKR+mitapivat) fetched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:53:27 +02:00
649f617019 Phase D: supervised cross-disease (0.925 AUC degree-bias mirage)
Train GradientBoosting on 300 drugs x 839 GEO disease signatures with
Repurposing-Hub indications as labels (432 positives), disease-grouped CV.

Finding: 0.925 CV AUC looks like a win but is a MIRAGE. Feature
importances are all drug-level (drug_std 0.33, drug_mean 0.30,
broadness 0.17); drug-disease connectivity importance = 0.01. The model
learned a drug-POPULARITY prior, not disease-specific matching. On
held-out sickle it ranks hydroxyurea 231/300 (worse than baseline) and
tops out with promiscuous drugs (dexamethasone, methotrexate). Classic
degree-bias trap. Connectivity also has ~chance AUC (0.51) for predicting
approved indications.

Both obvious approaches now fail instructively: unsupervised = specificity
ceiling; naive supervised = degree bias. Real progress needs degree-
debiased training + much larger clean labels (a research effort).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:31:32 +02:00
0ce688449d Phase A: reference-library tau (negative result on specificity)
Calibrate sickle connectivity against a real disease-signature
reference population (Enrichr Disease_Signatures_from_GEO, 141 diseases)
instead of random gene-set nulls — proper CMap tau.

Finding: hydroxyurea still recovers (rank 25, top 8%), but negative-
control specificity is UNCHANGED (2/5; norethindrone + ciprofloxacin
still top). The reference-calibrated ranking is nearly identical to the
random-null ranking. Third independent fix (after gene-space expansion
and composition adjustment) that recovers hydroxyurea but does NOT fix
specificity.

Conclusion: unsupervised connectivity has a specificity ceiling — it
cannot distinguish therapeutic reversal from coincidental transcriptional
anti-correlation. Breaking it needs external signal (supervised labels
or mechanistic filtering), not more calibration. Disease-signature
library cached at data/raw/disease_sigs/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:19:26 +02:00
b62048614d Experiment: composition-adjusted signature (negative result)
Test whether fixing the cell-composition confound rescues the recovery
test. GSE35007 measured WBC/RBC/MCV per sample, so adjust DE directly
for them (per-gene OLS: expression ~ disease + WBC + RBC + MCV + age +
sex) — a measured-covariate deconvolution, compared vs unadjusted.

Finding (negative, informative): adjustment HURT hydroxyurea (rank
20 -> 35, fails top-10%) — it was recovering partly via composition
genes — and did NOT fix negative-control specificity (still 2/5). Only
gain: top hits become more mechanistically coherent (resveratrol,
aspirin enter). Conclusion: cleaning the disease signature does not
rescue connectivity; the binding constraint is matching-side (needs a
reference-signature library for proper specificity calibration), which
is a multi-disease investment. v1.1 signature NOT replaced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:05:08 +02:00
3417f85eb1 v1.1: full gene space + specificity z-score; hydroxyurea recovers
Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
  lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
  (spec_z = SDs below a 1,000 random-query null). Primary ranking is
  now spec_z. (Textbook tau saturated at +/-100 for a coherent query —
  documented; needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming the v1 failure was the landmark bottleneck not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3 —
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:57:30 +02:00
72f1a49de6 Week 4: recovery test (FAIL, reported honestly) + 2-page report
Run the formal recovery test against the pre-registered criteria and
write the deliverable report (PLAN §6 Week 4):
- week4_recovery_test.py: evaluate hydroxyurea/L-glutamine + 5
  pre-specified negative controls vs the committed criteria
- recovery_test_report.md: methodology, FAIL result with diagnosis,
  top-10, lisinopril as the non-obvious candidate, limitations, v2
- known_limitations.md: L-glutamine coverage resolved, 12%-overlap
  driver, recovery outcome table

Outcome: FAIL on all 3 criteria (hydroxyurea top 13%, L-glutamine
WTCS=0, 1/5 negative controls bottom-half). Root cause is signature/
assay data limitations (lost erythroid+HbF axis, 12% landmark overlap),
not the matching algorithm — reported straight per the project ethos.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:38:56 +02:00
fd4591949c Week 3: CMap connectivity scoring engine + ranked candidates
Implement the matching engine (PLAN §6 Week 3):
- src/scoring.py: weighted-KS/GSEA enrichment, weighted connectivity
  score (WTCS, Lamb 2006 / Subramanian 2017), signed NCS normalization,
  rank_drugs, and a sickle-pathway mechanistic prior
- tests/test_scoring.py: real reference tests for the scorer (perfect
  reversal<null<mimic, same-sign->0, absent-gene invariance) + prior
- week3_scoring.py: score 300 drugs -> ranked_candidates_v1.csv with a
  raw ranking and a secondary mechanistic-prior-weighted ranking

Preliminary (formal recovery test is Week 4): hydroxyurea raw rank
40/300 (top 13%, just misses pre-registered top-10%), blended rank 7;
L-glutamine WTCS=0 (ambiguous). Notably anti-inflammatory SCD drugs
cluster in the raw top tier — the engine reverses the inflammation axis,
not the erythroid axis, traceable to the 12% landmark-overlap caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:34:56 +02:00
47b0094079 Week 2: 300-drug profiles with LINCS signatures + ChEMBL
Build the drug profile dataset (PLAN §6 Week 2):
- week2_curate_drugset.py: 300-drug set (2 ground-truth + 32 related-
  mechanism + 26 negative-control + 240 random), restricted to
  LINCS-scorable compounds, seed=42
- week2_chembl.py: InChIKey->ChEMBL match (145/300), MoA + targets
- week2_lincs_extract.py: cmapPy-slice both Level-5 GCTX phases to 978
  landmark genes, mean-aggregate per drug to one consensus signature
- week2_assemble.py: join into drug_profiles_v1.parquet, Tier B (LINCS
  single-source), scored flag per PLAN §6 Week 3 task 2
- docs/data_sources.md: drug set composition + LINCS/ChEMBL provenance

Results (all gitignored data): 300/300 drugs scored, both ground-truth
drugs present (hydroxyurea Phase II = CHEMBL467, L-glutamine Phase I).
Key caveat recorded: only 56/477 (12%) of the disease signature genes
are LINCS landmarks, so Week-3 scoring uses a 30-up/26-down query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:25:00 +02:00
c7b6649d31 Week 1: Tier-A sickle cell signature via 2-study concordance
Implement and run the Week 1 disease-signature pipeline:
- src/disease.py: Welch t-test + BH DE (microarray), probe->symbol
  collapse, cross-study concordance filter, 2-study provenance schema
- scripts/week1_explore.py: download GSE35007 + GSE16728, DE + concordance
- scripts/week1_finalize.py: mygene ID mapping + persist signature
- tests/test_disease.py: synthetic-data tests for DE/collapse/concordance
- docs/data_sources.md: chosen datasets, group defs, reproduction steps

Result: sickle_cell_signature_v1.json (gitignored), Tier A, 250 up /
227 down genes from 671 concordant (GSE35007 Illumina whole blood SS/AA +
GSE16728 Affymetrix whole blood patient/control). Documented caveats:
missing HbF axis (globin depletion) and reticulocyte composition confound.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 20:43:54 +02:00