Compare commits

...

24 Commits

Author SHA1 Message Date
f74a58aa43 Harden cache_msa: pick valid FASTA a3m + strip null bytes
Pilot validated the MSA-reuse architecture (1 server query, cached &
reused -- server hammering fixed) but Boltz's INPUT a3m parser crashed on
the cached file: KeyError '\x00'. Boltz writes several .a3m files (some
binary); we were grabbing a bad one.

Fix: select the largest a3m that looks like FASTA (starts with '>'), and
strip null bytes before caching. Raise with the candidate list if none
qualifies (so we can see what's there).

Pilot status: MSA caching + reuse confirmed working; this fixes the
a3m-format crash. Re-validate with a 2-drug pilot before the full run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 00:44:32 +02:00
066e0096d7 Fix screen: compute target MSA once, reuse it, tolerate failures
The full 300-drug screen hammered the public ColabFold MSA server (one
redundant query per drug for the same HDAC2 sequence) -> timeouts, and
the default return_exceptions=False risked aborting the whole run on a
single failure.

Corrected:
- cache_msa(): compute the target MSA ONCE via the server, cache the a3m
  on the Volume; doubles as the weight/CCD warmup.
- build_boltz_yaml(msa_path): protein reuses the cached a3m.
- cofold(msa_path): when given, skip --use_msa_server (no server query).
- screen(): cache MSA once, then cofold.starmap(..., return_exceptions=True)
  so a bad drug is skipped, not fatal.

Turns a fragile ~2.5 hr run into a fast, robust one. Validation pilot
(6 drugs) running; full 300 to run tomorrow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 00:41:51 +02:00
0535886ce6 Phase 2 screen pilot: HDAC2 recovers the inhibitor class (P>=0.99)
Add the `screen` entrypoint (parallel ~10-wide, cached weights) and run a
24-drug pilot vs HDAC2 (+Zn), ranked by Boltz-2 P(binder). ~$1.3.

Result (recovery test at scale): top 9 are ALL HDAC inhibitors
(trichostatin-A/vorinostat/panobinostat/belinostat/scriptaid/mocetinostat/
entinostat/apicidin >=0.99; valproic-acid 0.91), clean drop-off to
hydroxyurea 0.78 and non-HDAC drugs to dexamethasone 0.03. Captures the
structure-activity gradient (hydroxamates > weak fatty-acid > non-HDAC).

Honest false negative: romidepsin (potent HDAC inhibitor) ranks low (0.43)
-- it's a depsipeptide PRODRUG co-folding doesn't model. Screen mishandles
non-standard chemotypes.

Screening pipeline validated; next is the full 300-drug discovery run.
max_containers=10 (parallel safe once weights cached).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 22:23:01 +02:00
907a39aaba Capture structure-track deps + pose entrypoint
- gpu/modal_app.py: add the `pose` local entrypoint used for the HDAC2
  pose-RMSD validation (run: `modal run gpu/modal_app.py::pose`).
- pyproject [structure] extra: add the deps we actually use locally
  (gemmi, spyrmsd, meeko, modal) for reproducibility; document the non-pip
  tools (Vina binary, open-babel) and that Boltz/cuequivariance are
  Modal-image-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 19:56:40 +02:00
9efdc0acf1 Pose-RMSD confirms HDAC2/vorinostat geometry: 0.21A (Vina was 7.9A)
Close the §12.4 validation loop. scripts/pose_rmsd.py superposes the
Boltz-2-predicted HDAC2 onto crystal 4LXZ, transforms the predicted
ligand, and scores pose RMSD (spyrmsd, in-place):
- protein fold: Ca RMSD 0.14A over 366 residues
- vorinostat pose: 0.21A (crystal-accurate) vs Vina 7.9A on this exact
  Zn-chelation case
- catalytic Zn ion: 2.73A off (ligand perfect, metal slightly less)

HDAC2 now validated on BOTH affinity (P(binder)=0.999) and geometry
(0.21A). The structure-binding modality is comprehensively validated on
its decisive metal-coordination case. Commits the predicted complex as
evidence (docs/results/HDAC2_vorinostat_pred.pdb).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 19:50:33 +02:00
c891a78541 Phase 1 co-folding WORKS: HDAC2/Zn validated (Boltz-2 on Modal)
First clear positive result in the project. Ran Phase 1 on Modal L4
(~$0.70). Boltz-2 P(binder), cofactors co-folded:
- HDAC2 (+Zn): vorinostat 0.9994 vs negatives ~0.1 -> PASS, decisive
- hemoglobin (+heme): voxelotor 0.46 -> PASS (weak; covalent/tetramer)
- PKR (+FBP/Mg): mitapivat 0.32 < hydroxyurea 0.40 -> FAIL (allosteric)

HDAC2/Zn is the exact case classical Vina failed (no metal term, 7.9A
redock). Co-folding handles the Zn-chelation chemistry -> the structure-
binding modality pivot (PLAN §12) is validated on its decisive test.

Engineering fixes that got it running: image needs cuequivariance kernels;
max_containers=1 so weights download once (parallel corrupted the shared-
Volume checkpoint); rank by P(binder) not affinity_pred_value (sign).

Adds docs/results/phase1_affinity.csv (committed; raw under data/ gitignored).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 19:41:19 +02:00
07705a5884 GPU Phase 1: co-fold cofactors/metals (the binding-mode determinants)
Add metal/cofactor handling to the Boltz-2 YAML as CCD ligand entries -
the modes classical docking couldn't model:
- HDAC2 + catalytic Zn (vorinostat chelates it)
- PKR + FBP + Mg (allosteric activator + metal)
- hemoglobin + heme
Same cofactors present when co-folding negatives into a target (fair test).

build_boltz_yaml() gains a cofactor_ccds arg (emits `ligand: {ccd: ...}`
entries); TARGETS carries per-target cofactors; cofold()/main() thread them
through. Verified locally: YAML builds correctly with Zn / FBP+Mg.

Honest limitation noted: Hb's voxelotor site is at the tetramer centre and
covalent (Schiff base), so single-chain+heme only approximates it - HDAC2
(Zn) and PKR (cofactor) are the real co-folding tests. Ready for
`modal run gpu/modal_app.py`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 17:16:06 +02:00
4022c0cb94 GPU Phase 1 runnable: real Boltz-2 co-folding + alignment review
Flesh out the Modal app into a runnable Phase-1 positive-control test and
reconcile it with the plan:
- cofold() GPU fn: build Boltz-2 YAML (protein+ligand+affinity), run
  `boltz predict --use_msa_server --cache /weights/boltz`, parse affinity
  JSON + predicted pose; weights persist via Volume.
- Local helpers (CPU, import-tested against our PDBs): binding_chain_sequence
  (gemmi -- correctly picks the binding chain, e.g. alpha-globin for 5E83),
  pubchem_smiles, build_boltz_yaml, fetch_pdb (RCSB).
- main(): fan out cofold.starmap over 3 targets x (known binder + 2
  negatives); tabulate; PASS if known binder has top P(binder) for its target.

Alignment fixes:
- Rank by P(binder) (higher=better), NOT raw affinity_pred_value whose sign
  (~log IC50) is version-dependent -- avoids a backwards positive-control test.
- gpu_plan.md Phase 1 updated to affinity/P(binder) ranking; pose-RMSD noted
  as a later refinement (needs receptor superposition).

Local half verified (sequence/SMILES/YAML); cofold() needs a live `modal run`
(account + `modal token new`) to validate end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 16:56:27 +02:00
81d56b7a76 GPU plan: make weight persistence concrete (Modal Volume cache)
Document and wire the weight-caching mechanism:
- modal.Volume is a cloud-backed FS independent of the GPU/container;
  run 1 downloads weights into /weights, run 2+ reuses them (no GPU time
  wasted re-downloading).
- Point downloaders at the mount: HF_HOME/TORCH_HOME/boltz --cache; persist
  via weights.commit(), see updates via weights.reload().
- Volume storage costs pennies, separate from GPU = near-free caching.

modal_app.py cofold(): set cache env vars to /weights, reload()/commit()
around the (stubbed) boltz call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 16:48:50 +02:00
08ed713cc8 GPU plan: ephemeral serverless co-folding (Modal) + app skeleton
docs/gpu_plan.md: cost-efficient plan for running AF3-class co-folding
(Boltz-2/DiffDock) on a GPU then paying nothing when idle.
- Key insight: structure-track data is tiny (MB of PDBs/SMILES); only the
  GPU + model weights are heavy -> serverless is ideal.
- Recommend Modal (per-second billing, scales to zero = nothing to kill);
  RunPod as the SSH-box alternative with idle auto-terminate.
- Lifecycle: image -> weights Volume (cache, don't re-download) -> run ->
  git push small results -> teardown automatic.
- Phase 1 validate on 3 known binders (~$1) before paying for a screen;
  Boltz-2 (affinity) on an L4/A10 (24-48GB); est total ~$5-15.

gpu/modal_app.py: Modal app skeleton (image, weights volume, GPU cofold()
function, local entrypoint); boltz invocation stubbed with TODOs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 16:45:04 +02:00
7c6cef1aef Production docking: prep helps (7.9->4.8A) but Vina wrong tool for sickle
§12.4 pushed to its limit. Meeko ligand prep + in-place symmetry RMSD
(spyrmsd, not obrms) on clean HDAC2/vorinostat: 7.9A -> 4.76A. Prep and
metric mattered, but still FAIL.

Residual cause is fundamental: vorinostat binds via hydroxamate-Zn
chelation and Vina has no metal-coordination term. Real finding: sickle's
druggable targets bind via non-classical chemistry classical docking
handles poorly -- Hb (covalent), PKR (allosteric+cofactor), HDAC (Zn
chelation). Vina is the wrong tool for this target landscape.

Redirect: data-driven AF3-class co-folding (Boltz-2/Chai-1/DiffDock)
handles these modes -- the indicated next tool, gated by the 24GB local
memory ceiling (cloud GPU needed). The "GPU breaks all-local" §12.6
prediction is now the binding constraint of the track.

Adds: scripts/dock_production.py; deps meeko, spyrmsd, gemmi.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 16:38:54 +02:00
51bd90df41 Redocking-RMSD validation fails 3/3: pipeline-quality issue
§12.4 de-biased validation (scripts/dock_validate.py).
Redock each co-crystal ligand into its own structure, RMSD vs crystal:
- voxelotor->Hb: NA (covalent binder, out of scope §12.7)
- mitapivat->PKR: 8.2A (allosteric, cofactors stripped)
- vorinostat->HDAC2 (4LXZ, zinc kept): 7.9A -- a CLASSICAL target that
  should have worked

The clean target also failing => systematic pipeline-quality problem,
not target choice. Cheap Vina + open-babel prep gives scores but doesn't
reproduce known geometry, so affinities aren't trustworthy. Ligand
efficiency over-corrects (ranks tiny hydroxyurea best). Fix needs
production prep (Meeko/AutoDockTools prepare_receptor + reduce) and an
in-place RMSD metric. Consistent with the project theme: the quick
version of every method runs but fails honest validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 07:28:47 +02:00
75f5383961 Docking baseline: toolchain solved, raw affinity is size-biased
§12.3-12.4 first binding result on ARM Mac.
- Toolchain SOLVED: AutoDock Vina 1.2.5 mac binary (Rosetta) + open-babel
  (brew). No conda, no MLX. dock_positive_controls.py runs end-to-end.
- Cross-dock known binders + negatives into Hb (5E83) and PKR (8XFD),
  box centered on co-crystal ligands (5L7=voxelotor, WV2=mitapivat).

Finding: raw Vina affinity ranks almost perfectly by MOLECULAR SIZE
(mitapivat > voxelotor > decitabine/caffeine > hydroxyurea) in both
pockets — mitapivat wins even on hemoglobin it doesn't target. Raw score
can't distinguish target-specific binding: the docking analog of the
connectivity specificity problem. Next: redocking-RMSD validation +
ligand-efficiency normalization.

Note: machine is 24GB (not 96GB per PLAN §2), capping local AF3-class
inference. tools/ gitignored (vina binary).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 00:03:00 +02:00
817bcda7dc Structure-binding track: scaffold + ligand-retrieval baseline
Start the structure-based binding branch (PLAN §12), baseline-first.

- src/binding.py: validated RDKit ligand retrieval (morgan_fp, tanimoto,
  retrieve_nearest = the §12.9 engine) + dock() stub documenting the
  blocked ARM-Mac toolchain
- scripts/binding_ligand_baseline.py: 300 drugs vs known binders
- docs/structure_binding_notes.md: status, toolchain blocker, next steps
- pyproject: [structure] extra (rdkit); data/raw/structures/ for PDBs

Step-0 finding: retrieval engine VALIDATED on in-set classes
(decitabine->azacitidine 0.62; vorinostat->scriptaid/belinostat) but the
distinctive binders voxelotor/mitapivat have no analog in our 300-drug
set (Tanimoto ~0.2). Needs (a) bigger library, (b) real docking (§12.3),
which is blocked on the ARM-Mac docking toolchain (§12.6 pitfall 4).
Structures 5E83 (Hb+voxelotor) and 8XFD (PKR+mitapivat) fetched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:53:27 +02:00
6c2c71d73d PLAN §12.9: leave door open for generative-guided retrieval
Reframe de novo generation into the repurposing frame per the founder's
idea: use a pocket-conditioned generative model (TargetDiff/DiffSBDD/
Pocket2Mol) to propose an idealised binder as a SEARCH BEACON, then
retrieve the nearest EXISTING drugs by chemical similarity (Tanimoto/
embedding) as repurposing candidates — the generated molecule is never
synthesised.

Caveats kept honest: generated molecules used only as beacons (often
synthetically invalid); similarity != activity, so retrieved neighbours
still must be docked + pass the binding recovery test; open question
whether it beats brute-force docking the existing library. Explore only
after the §12.3-12.4 docking baseline is validated. §12.7 exclusion
reworded to point here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:43:25 +02:00
7449dbeefb Scope Phase 2 structure-based binding track into PLAN (§12)
Add a scoped (not committed) follow-on track pivoting modality from
expression-connectivity to structure-based drug-target binding, motivated
by the empirical finding that the expression modality is signal-dead for
this task (relational-only supervised AUC = 0.49, chance).

§12 covers: the evidence for the pivot, a sickle-specific druggable target
shortlist with known-binder positive controls (Hb/voxelotor, PKR/mitapivat,
DNMT1/decitabine, LSD1, HDAC, EHMT2, PDE9), method (classical docking
baseline -> AF3-class co-folding: Boltz-2/Chai-1/DiffDock), a pre-registered
binding recovery test, integration with the expression layer as the real
prize, honest pitfalls (binding != efficacy, BCL11A untractable, GPU breaks
the all-local assumption), and open decisions before committing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:40:18 +02:00
649f617019 Phase D: supervised cross-disease (0.925 AUC degree-bias mirage)
Train GradientBoosting on 300 drugs x 839 GEO disease signatures with
Repurposing-Hub indications as labels (432 positives), disease-grouped CV.

Finding: 0.925 CV AUC looks like a win but is a MIRAGE. Feature
importances are all drug-level (drug_std 0.33, drug_mean 0.30,
broadness 0.17); drug-disease connectivity importance = 0.01. The model
learned a drug-POPULARITY prior, not disease-specific matching. On
held-out sickle it ranks hydroxyurea 231/300 (worse than baseline) and
tops out with promiscuous drugs (dexamethasone, methotrexate). Classic
degree-bias trap. Connectivity also has ~chance AUC (0.51) for predicting
approved indications.

Both obvious approaches now fail instructively: unsupervised = specificity
ceiling; naive supervised = degree bias. Real progress needs degree-
debiased training + much larger clean labels (a research effort).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:31:32 +02:00
0ce688449d Phase A: reference-library tau (negative result on specificity)
Calibrate sickle connectivity against a real disease-signature
reference population (Enrichr Disease_Signatures_from_GEO, 141 diseases)
instead of random gene-set nulls — proper CMap tau.

Finding: hydroxyurea still recovers (rank 25, top 8%), but negative-
control specificity is UNCHANGED (2/5; norethindrone + ciprofloxacin
still top). The reference-calibrated ranking is nearly identical to the
random-null ranking. Third independent fix (after gene-space expansion
and composition adjustment) that recovers hydroxyurea but does NOT fix
specificity.

Conclusion: unsupervised connectivity has a specificity ceiling — it
cannot distinguish therapeutic reversal from coincidental transcriptional
anti-correlation. Breaking it needs external signal (supervised labels
or mechanistic filtering), not more calibration. Disease-signature
library cached at data/raw/disease_sigs/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:19:26 +02:00
b62048614d Experiment: composition-adjusted signature (negative result)
Test whether fixing the cell-composition confound rescues the recovery
test. GSE35007 measured WBC/RBC/MCV per sample, so adjust DE directly
for them (per-gene OLS: expression ~ disease + WBC + RBC + MCV + age +
sex) — a measured-covariate deconvolution, compared vs unadjusted.

Finding (negative, informative): adjustment HURT hydroxyurea (rank
20 -> 35, fails top-10%) — it was recovering partly via composition
genes — and did NOT fix negative-control specificity (still 2/5). Only
gain: top hits become more mechanistically coherent (resveratrol,
aspirin enter). Conclusion: cleaning the disease signature does not
rescue connectivity; the binding constraint is matching-side (needs a
reference-signature library for proper specificity calibration), which
is a multi-disease investment. v1.1 signature NOT replaced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:05:08 +02:00
3417f85eb1 v1.1: full gene space + specificity z-score; hydroxyurea recovers
Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
  lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
  (spec_z = SDs below a 1,000 random-query null). Primary ranking is
  now spec_z. (Textbook tau saturated at +/-100 for a coherent query —
  documented; needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming the v1 failure was the landmark bottleneck not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3 —
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:57:30 +02:00
72f1a49de6 Week 4: recovery test (FAIL, reported honestly) + 2-page report
Run the formal recovery test against the pre-registered criteria and
write the deliverable report (PLAN §6 Week 4):
- week4_recovery_test.py: evaluate hydroxyurea/L-glutamine + 5
  pre-specified negative controls vs the committed criteria
- recovery_test_report.md: methodology, FAIL result with diagnosis,
  top-10, lisinopril as the non-obvious candidate, limitations, v2
- known_limitations.md: L-glutamine coverage resolved, 12%-overlap
  driver, recovery outcome table

Outcome: FAIL on all 3 criteria (hydroxyurea top 13%, L-glutamine
WTCS=0, 1/5 negative controls bottom-half). Root cause is signature/
assay data limitations (lost erythroid+HbF axis, 12% landmark overlap),
not the matching algorithm — reported straight per the project ethos.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:38:56 +02:00
fd4591949c Week 3: CMap connectivity scoring engine + ranked candidates
Implement the matching engine (PLAN §6 Week 3):
- src/scoring.py: weighted-KS/GSEA enrichment, weighted connectivity
  score (WTCS, Lamb 2006 / Subramanian 2017), signed NCS normalization,
  rank_drugs, and a sickle-pathway mechanistic prior
- tests/test_scoring.py: real reference tests for the scorer (perfect
  reversal<null<mimic, same-sign->0, absent-gene invariance) + prior
- week3_scoring.py: score 300 drugs -> ranked_candidates_v1.csv with a
  raw ranking and a secondary mechanistic-prior-weighted ranking

Preliminary (formal recovery test is Week 4): hydroxyurea raw rank
40/300 (top 13%, just misses pre-registered top-10%), blended rank 7;
L-glutamine WTCS=0 (ambiguous). Notably anti-inflammatory SCD drugs
cluster in the raw top tier — the engine reverses the inflammation axis,
not the erythroid axis, traceable to the 12% landmark-overlap caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:34:56 +02:00
47b0094079 Week 2: 300-drug profiles with LINCS signatures + ChEMBL
Build the drug profile dataset (PLAN §6 Week 2):
- week2_curate_drugset.py: 300-drug set (2 ground-truth + 32 related-
  mechanism + 26 negative-control + 240 random), restricted to
  LINCS-scorable compounds, seed=42
- week2_chembl.py: InChIKey->ChEMBL match (145/300), MoA + targets
- week2_lincs_extract.py: cmapPy-slice both Level-5 GCTX phases to 978
  landmark genes, mean-aggregate per drug to one consensus signature
- week2_assemble.py: join into drug_profiles_v1.parquet, Tier B (LINCS
  single-source), scored flag per PLAN §6 Week 3 task 2
- docs/data_sources.md: drug set composition + LINCS/ChEMBL provenance

Results (all gitignored data): 300/300 drugs scored, both ground-truth
drugs present (hydroxyurea Phase II = CHEMBL467, L-glutamine Phase I).
Key caveat recorded: only 56/477 (12%) of the disease signature genes
are LINCS landmarks, so Week-3 scoring uses a 30-up/26-down query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:25:00 +02:00
c7b6649d31 Week 1: Tier-A sickle cell signature via 2-study concordance
Implement and run the Week 1 disease-signature pipeline:
- src/disease.py: Welch t-test + BH DE (microarray), probe->symbol
  collapse, cross-study concordance filter, 2-study provenance schema
- scripts/week1_explore.py: download GSE35007 + GSE16728, DE + concordance
- scripts/week1_finalize.py: mygene ID mapping + persist signature
- tests/test_disease.py: synthetic-data tests for DE/collapse/concordance
- docs/data_sources.md: chosen datasets, group defs, reproduction steps

Result: sickle_cell_signature_v1.json (gitignored), Tier A, 250 up /
227 down genes from 671 concordant (GSE35007 Illumina whole blood SS/AA +
GSE16728 Affymetrix whole blood patient/control). Documented caveats:
missing HbF axis (globin depletion) and reticulocyte composition confound.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 20:43:54 +02:00
35 changed files with 6520 additions and 124 deletions

3
.gitignore vendored
View File

@@ -31,3 +31,6 @@ env/
.DS_Store .DS_Store
.idea/ .idea/
.vscode/ .vscode/
# Local tool binaries (refetch per docs/structure_binding_notes.md)
tools/

140
PLAN.md
View File

@@ -426,3 +426,143 @@ This MVP exists in a broader strategic context that was built through ~7 expert
- **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product. - **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.
The MVP's job is to produce one credible result. Everything else follows from that. The MVP's job is to produce one credible result. Everything else follows from that.
---
## 12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
> **Status: scoped, not committed.** This is a follow-on track proposed *after* the MVP and its
> follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what
> those experiments empirically showed. Read §911 and the experiment commits first.
### 12.1 Why pivot modality (the evidence, not a hunch)
The expression-connectivity approach was built, validated, and pushed hard (gene-space
expansion, cell-composition deconvolution, reference-library tau, supervised learning). The
empirical verdict:
- Connectivity **recovers hydroxyurea** (top ~68%) but **cannot achieve specificity**
unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Unfixed by four
independent methods.
- A supervised model on indication labels hit **0.925 CV AUC** but it was a **degree-bias
mirage**: it learned drug popularity, not disease matching (it ranked hydroxyurea *231/300*).
- The decisive test: with drug-popularity features removed, the model trained on the actual
drugdisease connectivity scored **AUC 0.491 — pure chance**. **The expression-connectivity
modality contains essentially no disease-specific therapeutic signal for this task.**
This is a *signal* problem, not a *model* problem no amount of model sophistication (diffusion,
GNNs, etc.) extracts signal that isn't in the data. The response is to **change modality** to one
with a strong, physical, drug-specific signal: **does a molecule bind a sickle-relevant target?**
A drug that binds HbS is mechanistically specific by construction the opposite of a coincidental
expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is
itself a diffusion model) actually has traction, because structure has physical ground truth.
### 12.2 Targets (sickle-specific, druggable, structurally characterised)
Small molecules only 2). Curated shortlist with public structures and, crucially, **known
small-molecule binders to serve as positive controls**:
| Target | Mechanism in sickle | Known binder (positive control) |
|---|---|---|
| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | **voxelotor** (binds α-Val1) |
| PKR (PKLR, red-cell pyruvate kinase) | Activator 2,3-BPG O2 affinity | **mitapivat**, etavopivat |
| DNMT1 | HbF induction (de-repress γ-globin) | **decitabine**, azacitidine |
| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
| HDAC1/2 | HbF induction | vorinostat, panobinostat |
| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
| PDE9 | cGMP, anti-adhesion | PF-04447943 (sickle trial) |
Hard/excluded for v1: **BCL11A** (transcription factor, no classic pocket the γ-globin master
repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.
### 12.3 Method (baseline → generative co-folding)
1. **Prepare structures.** Pull target structures from the PDB; AF3/Boltz-predict any missing.
2. **Prepare ligands.** Reuse the existing ~300-drug set (we already have canonical SMILES from
ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
3. **Dock + score**, in increasing sophistication:
- **Baseline:** classical docking (AutoDock Vina / smina) fast, CPU, well-understood.
- **Generative co-folding:** an open AlphaFold3-class model **Boltz-2** (predicts the
proteinligand complex *and* a binding-affinity estimate, MIT-licensed), **Chai-1**, or
**DiffDock** (a diffusion model for docking the legitimate home for the "diffusion"
instinct). These predict the bound pose directly and tend to beat classical docking.
- Report both; the baseline keeps us honest about whether the ML model adds anything.
### 12.4 Validation (a real recovery test, like §6 Week 4)
Pre-register before scoring: **the known structure-based sickle drugs must rank as top binders to
their targets** voxelotorhemoglobin, mitapivatPKR, decitabineDNMT1. Negative controls
(unrelated drugs) must *not* bind these pockets. This is a cleaner recovery test than the
expression one, because binding is mechanistically specific it should not have the
coincidental-reverser problem that sank the connectivity approach.
### 12.5 The real prize — integrate, don't replace
The long-term value is **both modalities together**: a candidate that *reverses the disease
signature* (expression) **and** *binds a sickle-relevant target* (structure) is far more credible
than either alone. Structure supplies the specificity the expression layer lacks; expression
supplies the systems-level, target-agnostic view structure lacks. The platform thesis 11)
two databases + a matching engine extends naturally to a third (structures) feeding the same
confidence-tiered data layer.
### 12.6 Honest pitfalls (do not ignore)
1. **Binding ≠ efficacy.** A molecule can bind and do nothing therapeutic. Structure-based hits
are still hypotheses (cf. §9.7).
2. **Only covers the enzyme/pocket subset.** Sickle's biggest lever (γ-globin reactivation via
BCL11A) is largely transcriptional and not small-molecule-tractable structure-based screening
is blind to it. Be explicit about coverage.
3. **Docking/affinity accuracy is limited.** Pose prediction is decent; absolute affinity is hard.
Validate on known binders before trusting novel scores.
4. **Compute.** AF3-class models are GPU-heavy; the local Mac Studio 2) may not suffice this
track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
5. **Moat.** Structures and tools are public; the proprietary value is the curated target list,
the integration with the expression layer, and provenance/tiering not the docker.
### 12.7 Explicitly NOT in this track
Free energy perturbation / MD-based affinity; covalent docking; **de novo generation of molecules
as final candidates to synthesise** (design, not repurposing but see §12.9 for the
generate-then-retrieve hybrid, which *is* repurposing); BCL11A or any non-pocket target;
biologics; combination binding.
### 12.8 Open decisions before committing
- **Tooling:** classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend:
baseline first, for an honest reference the lesson of the whole expression arc.)
- **Compute:** secure a GPU environment (the "all local" §2 assumption breaks here).
- **Scope of v1:** the 7-target shortlist above, or start with just Hb + PKR (the two with the
cleanest positive controls) to de-risk the harness before scaling targets.
### 12.9 Door left open — generative-guided retrieval (generate → match existing)
A legitimate way to bring generative models *into the repurposing frame* (vs the design frame
excluded in §12.7): don't generate molecules to synthesise generate them as a **search beacon**.
**The idea.** Use a pocket-conditioned generative model (target-conditioned diffusion e.g.
TargetDiff, DiffSBDD, Pocket2Mol) to propose idealised binders for a sickle target. Then retrieve
the **nearest existing drugs** to those proposals by chemical similarity (Tanimoto over Morgan
fingerprints, or a learned molecular embedding). The retrieved neighbours real, already-approved
or clinical molecules are the repurposing candidates. The generated molecule is never made; it
only *defines a region of chemical space* worth searching. This is the user-proposed framing and
it is sound: "generate the ideal, then find what we already have nearby."
**Why it could add value.** It can point at scaffolds / regions a known-binder-seeded or
brute-force docking sweep would miss, and it makes the target's binding requirements explicit as
geometry rather than as a single reference ligand.
**Honest caveats (why it's a *door*, not a commitment).**
1. **Generated molecules are often synthetically unrealistic / invalid** which is exactly why
they must be used only as beacons, never as candidates.
2. **Similarity ≠ activity.** Activity cliffs mean a near-neighbour of a good binder can be inert.
So retrieved neighbours do **not** bypass validation they must still be docked/scored 12.3)
and clear the binding recovery test 12.4). The generative step *proposes*; it does not *prove*.
3. **Marginal-value question.** Directly docking the whole existing library 12.3) already covers
chemical space. Whether generate-then-retrieve beats that by efficiency or by surfacing
non-obvious scaffolds is an open empirical question and needs a head-to-head before it earns
real investment.
4. **Only as good as the pocket conditioning** of the generator, and the chemistry of the target.
**Status:** explore only *after* the §12.312.4 docking harness works and is validated on the
known binders same discipline as everywhere else: prove the baseline, then test whether the
fancier method adds anything.

View File

View File

@@ -10,17 +10,63 @@
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD | | MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD | | Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD | | OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data | TBD | TBD | | GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD | | GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD | | ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
| LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
| LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD | | ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD | | FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD | | Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
## Chosen GEO dataset ## Chosen GEO datasets (disease signature, Tier A via 2-study concordance)
_Document the chosen study fully: accession, platform, n per group, publication, why it was The signature is the cross-study concordance of two independent whole-blood studies (genes
selected over the alternatives (GSE53441, GSE35007, …)._ significant at q<0.05 in **both** with the same direction). Whole-blood tissue was required so
concordance is meaningful; the two differ by platform and population, which strengthens
robustness.
| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
|---|---|---|---|---|---|
| **GSE35007** | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
| **GSE16728** | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |
- DE method: per-gene Welch t-test + BenjaminiHochberg (microarray, pure Python).
- Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
- Result: 16,208 genes tested in both **671 concordant** (444 up / 227 down). Signature =
top 250 up + all 227 down by worst-case q-value.
- **Rejected candidates:** GSE53441 (PBMC tissue mismatch with the whole-blood anchor);
GSE84633/GSE84634 (PBMC, no healthy controls).
- **Tier caveat:** GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict
n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.
Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
`scripts/week1_finalize.py` (mygene mapping + persist).
## Drug profiles (Week 2)
300-drug set (`drug_set_v1.csv`), composed and restricted to LINCS-scorable compounds:
| Inclusion reason | n | Notes |
|---|---|---|
| ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
| related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
| negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
| general_sample | 240 | random from LINCS catalog, seed=42 |
- **LINCS signatures:** per-drug consensus = mean of Level-5 MODZ z-scores across the drug's
sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH
phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
- **ChEMBL:** matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds
38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
not-single-source rule).
- **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978
landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger
space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from
LINCS entirely and remain unscoreable.
- Reproduce: `week2_curate_drugset.py``week2_chembl.py` → download Level-5 GCTX →
`week2_lincs_extract.py``week2_assemble.py`.
## Licensing note for LINCS ## Licensing note for LINCS

112
docs/gpu_plan.md Normal file
View File

@@ -0,0 +1,112 @@
# GPU plan — ephemeral AF3-class co-folding for the binding track
Goal: run AF3-class co-folding (Boltz-2 / DiffDock) on a GPU for the structure-binding track
(§12), then **pay nothing when idle**. The work is *bursty* (a validation run, then a screen),
the inputs are *tiny*, so the design optimises for zero idle cost, not for a persistent box.
## What actually has to move (small!)
| Thing | Size | How it gets to the GPU |
|---|---|---|
| Code | KB | `git clone` the gitea repo (or mount) |
| Target structures (PDBs) | a few MB | in the repo / git |
| Ligands (SMILES, drug set) | KB | in the repo / git |
| **Model weights** (Boltz-2 / DiffDock) | **~26 GB** | downloaded once, **cached in a persistent volume** |
| Results (poses, scores, RMSDs) | KBMB | `git push` back / download |
The 27 GB LINCS data is **not** part of this track — nothing big to upload. The only thing worth
persisting is the model-weights cache (so we don't re-download = re-pay GPU time every run).
## How the model weights persist (the cost-saver)
A `modal.Volume` is a **named, cloud-backed filesystem that lives independently of any container
or GPU** — it survives every teardown. Mounted into the function at `/weights`:
- **Run 1:** `/weights` is empty → the model downloads weights there (the one-time slow cost).
- **Run 2+:** the same Volume mounts with the files already present → download skipped → **no
GPU-billed seconds wasted re-fetching 5 GB.**
Two things make it actually cache:
1. **Point the downloader at the mount** (weights only persist if written under `/weights`):
`HF_HOME=/weights/hf` (HuggingFace), `TORCH_HOME=/weights/torch`, `boltz --cache /weights/boltz`.
2. **Commit semantics:** writes persist on `weights.commit()` (modern Modal also auto-commits on a
clean exit); other containers see them after `weights.reload()`. Pattern: `reload()` → run →
`commit()`.
The Volume itself costs pennies (~$/GB-month of storage), *separate from the GPU* — so caching ~5 GB
of weights is near-free and saves real GPU time on every subsequent run.
(Alternative: bake weights into the image at build time via `image.run_function(download)` — fastest
cold start, but the image rebuilds when weights change. The skeleton uses the Volume approach.)
## Provider choice
| Option | Billing | Idle cost | "Kill" model | Best for |
|---|---|---|---|---|
| **Modal** (recommended) | per-second | **$0 (scales to zero)** | automatic — nothing to remember | bursty batch runs |
| RunPod | per-minute, on-demand/spot | only while pod exists | manual `terminate` | interactive SSH box |
| Vast.ai | per-minute spot | only while rented | manual destroy | cheapest, more fiddly |
| Lambda / AWS-GCP spot | per-hour/second | until you stop it | manual stop | if you already have credits |
**Recommend Modal.** You define the run as a function with a `gpu=` decorator; on call it
provisions the GPU, runs, and releases it — **there is no GPU to forget to kill**, and you pay only
for the seconds it ran. That *is* the "kill the GPU to save money" requirement, automated.
## The lifecycle (Modal)
1. **Define image** once: CUDA base + `boltz` (or DiffDock) + `rdkit`, `meeko`, `spyrmsd`, `gemmi`.
2. **Weights volume**: `modal.Volume` mounted at `/weights`; the model downloads into it on first
run and is cached forever after (no re-download cost).
3. **Run**: `modal run gpu/modal_app.py` → provisions GPU → runs the test → returns results to your
laptop. GPU released the moment the function returns.
4. **Persist results**: write the returned scores/poses into `data/processed/binding/` and
`git commit` (small, text).
5. **Teardown**: nothing to do — Modal scaled to zero. `modal app stop` only if a run is hung.
RunPod alternative (if you want an interactive box): start pod → `git clone` → run → `git push`
results → **`runpodctl remove pod <id>`** (or Stop in the UI). Set an **idle auto-terminate** so a
forgotten box can't bleed money.
## What runs on the GPU (in cost order — cheap validation first)
- **Phase 1 — modality validation (~minutes, ~$1):** co-fold each known binder + 2 negative
controls (caffeine, hydroxyurea) into each target (Hb, PKR, HDAC2) **with the binding-mode
cofactors co-folded in** — HDAC2 + catalytic **Zn**, PKR + **FBP/Mg**, Hb + **heme** (as CCD
ligand entries) — and check the **known binder has the highest Boltz-2 P(binder)** for its own
target. This is the discrimination Vina couldn't manage precisely because it can't model
Zn-chelation / cofactors. (Ranking uses P(binder), not the raw affinity value, whose
sign is version-dependent.) Pose-RMSD vs crystal is a deeper check but needs receptor
superposition (align predicted protein to crystal, transform ligand) — a later refinement. If
Phase 1 passes, the modality is real; if not, stop before paying for a screen.
- **Phase 2 — screen (~tens of minutes, a few $):** run the ~300-drug set (or a focused subset)
against the sickle targets; rank by Boltz-2 predicted affinity; redo the §12.4 positive-control
recovery test. Output a ranked CSV, same shape as the connectivity `ranked_candidates`.
## Model choice
- **Boltz-2** (MIT, pip-installable) — predicts the proteinligand complex **and** a binding
affinity → directly gives a rankable score. Primary choice. Fits a 2440 GB GPU for these
single-domain targets.
- **DiffDock-L** — lighter, pose-only (needs a separate scorer); fallback if Boltz memory is tight.
- GPU: an **L4 (24 GB, ~$0.60.8/hr)** or **A10/L40S (2448 GB)** is plenty; no multi-GPU, no A100
needed for these sizes.
## Cost controls (the save-money checklist)
- Serverless (Modal) → **zero idle cost** by construction; or an **idle-timeout** auto-kill on a box.
- **Cache weights** in a persistent volume — re-downloading 5 GB on a $1/hr GPU is wasted money.
- **Validate on one target before screening** — don't pay for a 300-drug screen until Phase 1 passes.
- Prefer **spot/interruptible** for the batch screen (Phase 2 is restartable).
- Keep prep (Meeko/RDKit) and result-scoring on the **laptop**; only the model forward pass needs GPU.
- Estimated total to validate + a first screen: **~$515**, not a standing bill.
## Repo integration
- `gpu/modal_app.py` — the Modal app (skeleton committed alongside this plan).
- Results land in `data/processed/binding/` (gitignored) + a small committed summary.
- Pin model + weights version in the image for reproducibility.
## Next step
Scaffold `gpu/modal_app.py` for Phase-1 validation (3 known binders), do a dry run locally
(`modal run --detach`? no — just `modal run`), confirm cost, then Phase 2. Requires a Modal account
+ `pip install modal` + `modal token new` (one-time auth).

View File

@@ -12,9 +12,11 @@ Source: PLAN.md §9.
cell lines (MCF7, A375, PC3, …). Signatures for non-oncology diseases may be noisy. A cell lines (MCF7, A375, PC3, …). Signatures for non-oncology diseases may be noisy. A
field-wide limitation, not unique to Reverso. field-wide limitation, not unique to Reverso.
3. **L-glutamine probably has no LINCS signature.** Amino acids and metabolites weren't LINCS 3. **L-glutamine LINCS coverage — RESOLVED, opposite of expected.** L-glutamine DOES have a
priorities. If true, the ground-truth test effectively rests on hydroxyurea alone, which is Phase I signature (hydroxyurea is Phase-II-only) — both ground-truth drugs are scorable. But
weaker. _Status: TBD — record the actual finding here once LINCS is pulled (Week 2)._ L-glutamine's connectivity is **ambiguous (WTCS=0)**: its up- and down-set enrichments share
a sign, so it shows no reversal. It ranks 100/300. So the ground-truth test effectively rests
on hydroxyurea, which itself only reaches top 13% (raw) — see the recovery test report.
4. **Connectivity scoring surfaces broad-effect drugs as false positives.** HDAC inhibitors and 4. **Connectivity scoring surfaces broad-effect drugs as false positives.** HDAC inhibitors and
broad kinase inhibitors often top connectivity rankings simply because they perturb many broad kinase inhibitors often top connectivity rankings simply because they perturb many
@@ -32,8 +34,28 @@ Source: PLAN.md §9.
7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses 7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
to test, not discoveries. Use careful language in any write-up. to test, not discoveries. Use careful language in any write-up.
## Drug-specific gaps (fill in during Week 23) 8. **Gene-space bottleneck (v1 → fixed in v1.1).** v1 scored on only the 978 landmark genes (12%
signature overlap) — the main driver of the v1 failure. v1.1 uses the full 12,328-gene space
(85% overlap) and recovers hydroxyurea. HBG1/HBG2 remain absent from LINCS entirely.
| Drug | Issue | Handling | 9. **No reference-signature library for tau.** Textbook CMap tau saturated at ±100 (a coherent
query always out-connects random gene sets). v1.1 substitutes a per-drug specificity z-score.
Proper tau needs a library of real reference signatures — a v2 / curated-data item.
10. **Negative-control criterion may be invalid for connectivity scoring.** Unrelated drugs
(norethindrone, ciprofloxacin) rank as top specific reversers — connectivity measures
expression reversal, not therapeutic relatedness.
## Recovery test outcome
Pre-registered test (**v1, confirmatory**): **FAILED** all three criteria (hydroxyurea rank
40/top 13%; L-glutamine rank 100; 1/5 negative controls bottom-half). Post-hoc (**v1.1,
exploratory**): hydroxyurea recovers to rank 18 (top 6%, passes), but L-glutamine (rank 213, does
not reverse) and negative controls (2/5) still fail → overall still FAIL. See
`recovery_test_report.md`.
| Drug | Issue | v1.1 status |
|---|---|---| |---|---|---|
| TBD | e.g. no LINCS signature | flagged "not scored, no signature available" | | hydroxyurea | needed the full gene space | rank 18 (top 6%) — recovered post-hoc |
| L-glutamine | metabolite, no reversal signal (positive connectivity) | rank 213 — genuine negative |
| neg controls | reverse the generic inflammation signature | 2/5 bottom-half — criterion questionable |

View File

@@ -1,13 +1,10 @@
# Sickle Cell Repurposing — Recovery Test Report # Sickle Cell Repurposing — Recovery Test Report
> **Status: DRAFT SCAFFOLD — not yet run.** Filled in during Week 4 from > **Status: COMPLETE (v1 confirmatory + v1.1 exploratory).** Reproduce with `scripts/week1_*` →
> `notebooks/05_recovery_test.ipynb`. Target length: ~2 pages, readable by a sceptical > `week2_*` → `week3_scoring.py` → `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.
> pharma scientist in 5 minutes.
## Pre-registered success criteria ## Pre-registered success criteria
> ⚠️ **Commit this section to git _before_ running the recovery test** (PLAN.md §8, §10).
The MVP passes if: The MVP passes if:
- Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND** - Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND**
@@ -15,54 +12,116 @@ The MVP passes if:
missing LINCS signature, **AND** missing LINCS signature, **AND**
- At least **4 of 5** negative-control drugs rank in the **bottom half**. - At least **4 of 5** negative-control drugs rank in the **bottom half**.
_Pre-registered on: TBD (date of commit)_ _Pre-registered in the scaffold commit (`b731478`) before any scoring. **Primary (confirmatory)
analysis = v1**: 978 landmark genes, weighted connectivity score (WTCS). The 5 negative controls
were pre-specified by category rule without inspecting ranks._
--- ---
## Section 1 — Methodology ## Section 1 — Methodology
_56 sentences: what was built, the GEO dataset used, the drug-set composition, and the A sickle cell disease signature was built from **two whole-blood microarray studies** (GSE35007
scoring method (CMap connectivity, Lamb 2006 / Subramanian 2017)._ Illumina SS-vs-AA; GSE16728 Affymetrix patient-vs-control), keeping the **671 genes concordant**
across both (q<0.05, same direction) a cross-platform Tier-A signature (250 up / 227 down).
Profiles were built for **300 small molecules** (2 ground-truth; 32 related-mechanism; 26
negative controls; 240 random), each with a **LINCS L1000** consensus signature (mean Level-5
MODZ across cell lines, both CMap phases). Drugs were ranked by **CMap connectivity scoring**
(Kolmogorov-Smirnov, Lamb 2006 / Subramanian 2017): negative = reversal = candidate.
**v1 (pre-registered/confirmatory):** scored on the 978 landmark genes with WTCS.
**v1.1 (post-hoc/exploratory):** after v1 failed, two changes were made to diagnose why (a)
score on the full **12,328-gene** space (landmark overlap 12% 85%, bringing the erythroid
markers in); (b) add a **per-drug specificity z-score** (`spec_z`): how many SDs the real
connectivity is below a null of 1,000 random queries of the same size against that drug. Because
these changes followed inspection of the v1 result, **v1.1 is exploratory, not a confirmatory
test of the pre-registered hypothesis.**
## Section 2 — Recovery test result ## Section 2 — Recovery test result
| Drug | Rank | Percentile | Pass? | | Criterion | v1 (confirmatory) | v1.1 (exploratory) |
|---|---|---|---|
| Hydroxyurea | TBD | TBD | TBD |
| L-glutamine | TBD | TBD | TBD |
Negative controls (expected: bottom half):
| Control drug | Rank | Bottom half? |
|---|---|---| |---|---|---|
| TBD | TBD | TBD | | Hydroxyurea top-10% (≤30) | rank **40** (13.3%) | rank **18** (6.0%) |
| L-glutamine top-25% (≤75) | rank 100, WTCS=0 | rank 213, spec_z=+0.98 |
| 4/5 neg controls bottom-half | 1/5 | 2/5 |
| **Overall** | **FAIL** | **FAIL** (but hydroxyurea recovered) |
**Overall: PASS / FAIL against pre-registered criteria — TBD** v1.1 negative controls: clotrimazole 258 ✅, astemizole 211 ✅, azithromycin 142 ❌,
ethinyl-estradiol 114 ❌, caffeine 77 ❌.
## Section 3 — Top 10 candidates **Honest reading.** The **pre-registered test FAILED (v1).** The post-hoc v1.1 changes
**recover hydroxyurea** (rank 40 18, passing top-10%) strong evidence that the v1 failure was
driven by the 978-landmark bottleneck, not the algorithm. But two failures survive into v1.1, and
both are now precisely diagnosed:
| Rank | Drug | Score | Known mechanism | Biological plausibility | 1. **L-glutamine does not reverse the signature** (positive connectivity, spec_z=+0.98). This is
intrinsic to its LINCS data a metabolite with no reversal signal not a coverage gap. More
genes cannot fix it.
2. **The negative-control criterion is arguably invalid for connectivity scoring.** Two
"negative controls" (norethindrone, ciprofloxacin) rank in the top 3 by spec_z. Connectivity
measures *expression reversal*, not *therapeutic relatedness* an antibiotic or contraceptive
can still down-regulate the inflammation genes that dominate the scorable signature. The test
design conflates the two.
A note on the calibration: textbook CMap **tau** (percentile vs a reference population) was
implemented but **saturated at ±100** here, because a coherent real query always out-connects
random gene sets proper tau needs a library of *real* reference signatures, which this MVP
lacks. The continuous `spec_z` is the workable substitute.
## Section 3 — Top 10 candidates (v1.1 spec_z)
| Rank | Drug | spec_z | Inclusion | Read |
|---|---|---|---|---| |---|---|---|---|---|
| 1 | TBD | TBD | TBD | TBD | | 1 | reserpic-acid | 3.80 | random | reserpine metabolite; non-obvious |
| 2 | norethindrone | 3.78 | **negative control** | false positive (see §2) |
| 3 | ciprofloxacin | 3.61 | **negative control** | false positive |
| 4 | resveratrol | 3.46 | related-mechanism | antioxidant studied in SCD coherent |
| 5 | BRD-K57490754 | 3.37 | random | tool compound |
| 6 | anastrozole | 3.27 | random | aromatase inhibitor |
| 710 | BRD-* / palmitoylethanolamide | ~3.1 | random | mostly tool compounds |
_Note: HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due That two negative controls outrank hydroxyurea is the single most informative result here see §4.
to widespread expression effects — flag these honestly (PLAN.md §9.4)._
## Section 4 — One non-obvious candidate worth investigating ## Section 4 — One non-obvious result worth investigating
_A single paragraph on the most interesting result. Language must be careful: this is a The most useful finding is **not** a candidate drug but the **negative-control failure**:
computational hypothesis to test, not a discovery (PLAN.md §9.7)._ unrelated drugs (norethindrone, ciprofloxacin) score as strong specific reversers. This is a
real, generalizable lesson for a signature whose *scorable* portion is generic
inflammation/metabolic genes, connectivity rewards any broad transcriptional perturbation that
touches those genes. The honest implication: **this signature is not specific enough to
discriminate true repurposing candidates from incidental expression reversers.** Of the
plausibly-real hits, **resveratrol (rank 4)** an antioxidant with prior sickle cell literature
is the most defensible, but it is a hypothesis, not a discovery.
## Section 5 — Honest limitations ## Section 5 — Honest limitations
- Cell-composition confound in whole-blood expression (PLAN.md §9.1) 1. **Pre-registered test failed; the pass is post-hoc.** v1.1's hydroxyurea recovery is
- LINCS L1000 cell-line limitations — landmark genes measured mostly in cancer lines (§9.2) exploratory and must be re-validated on a held-out disease before any claim is made.
- Missing signatures (e.g. L-glutamine) (§9.3) 2. **Missing HbF axis** HBG1/HBG2 are absent from LINCS entirely (not just landmarks), so the
- No mechanistic validation layer — discovery hypothesis generation, not validated prediction (§9.6) pathway hydroxyurea acts on can never be scored by this method.
3. **Signature specificity** scorable genes are inflammation/metabolic; negative controls
reverse them too. Connectivity therapeutic relatedness.
4. **Cell-composition confound** the whole-blood signature is reticulocyte-dominated.
5. **LINCS cancer-cell-line bias**, and **no reference-signature library** for proper tau.
6. **No mechanistic validation** all hits are computational hypotheses.
## Section 6 — What v2 would fix ## Section 6 — What v2 would fix
- Cell-type deconvolution of the disease signature - **A reference-signature library** to make tau (proper specificity calibration) work the
- Knowledge graph fallback for missing-signature drugs single biggest fix to the negative-control problem, and a direct use of the curated-data moat.
- A second disease to test generalization (the real test — sickle cell results do not prove - **Cell-type deconvolution** + a non-globin-depleted RNA-seq study to recover a more specific,
the platform generalizes, §9.5) HbF-containing signature.
- **Signature prediction / mechanism graph** to score genes with no LINCS measurement.
- **A second disease** to test generalization and to honestly re-validate the v1.1 method
(PLAN §9.5).
---
### Bottom line
The pre-registered recovery test **failed**. Post-hoc diagnosis shows the dominant cause was a
fixable gene-space bottleneck correcting it **recovers hydroxyurea** but also surfaces a
deeper, genuine limitation: this whole-blood signature is **not specific enough** for
connectivity scoring to separate real candidates from incidental reversers (negative controls
rank at the top). The MVP's real deliverable is a precise, honest map of *what it takes to make
this method work*: a more specific (deconvolved, HbF-containing) signature and a reference library
for calibration exactly the curated-data investments the platform thesis is built on.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,10 @@
target,ligand,affinity,prob_binder,is_known_binder
hemoglobin,voxelotor,0.16870097815990448,0.4566290080547333,True
hemoglobin,caffeine,1.5093848705291748,0.34217822551727295,False
hemoglobin,hydroxyurea,0.6418474316596985,0.06856877356767654,False
PKR,mitapivat,1.187251329421997,0.3238581717014313,True
PKR,caffeine,2.992832899093628,0.2582802474498749,False
PKR,hydroxyurea,2.444709300994873,0.39834722876548767,False
HDAC2,vorinostat,-1.7771220207214355,0.9993993043899536,True
HDAC2,caffeine,2.3925399780273438,0.11900843679904938,False
HDAC2,hydroxyurea,1.284231424331665,0.7654422521591187,False
1 target ligand affinity prob_binder is_known_binder
2 hemoglobin voxelotor 0.16870097815990448 0.4566290080547333 True
3 hemoglobin caffeine 1.5093848705291748 0.34217822551727295 False
4 hemoglobin hydroxyurea 0.6418474316596985 0.06856877356767654 False
5 PKR mitapivat 1.187251329421997 0.3238581717014313 True
6 PKR caffeine 2.992832899093628 0.2582802474498749 False
7 PKR hydroxyurea 2.444709300994873 0.39834722876548767 False
8 HDAC2 vorinostat -1.7771220207214355 0.9993993043899536 True
9 HDAC2 caffeine 2.3925399780273438 0.11900843679904938 False
10 HDAC2 hydroxyurea 1.284231424331665 0.7654422521591187 False

View File

@@ -0,0 +1,25 @@
rank,drug,P_binder,affinity,inclusion_reason
1,trichostatin-a,0.9994845986366272,-2.883908271789551,related_mechanism
2,vorinostat,0.9994641542434692,-1.7357702255249023,related_mechanism
3,panobinostat,0.9983962774276733,-2.4892501831054688,related_mechanism
4,belinostat,0.9970909357070923,-2.102327823638916,related_mechanism
5,scriptaid,0.9957252740859985,-1.5838574171066284,related_mechanism
6,mocetinostat,0.9910210371017456,-1.3829972743988037,related_mechanism
7,entinostat,0.9903219938278198,-0.6888977289199829,related_mechanism
8,apicidin,0.9882931113243103,-2.2579169273376465,related_mechanism
9,valproic-acid,0.9134805202484131,0.3317195773124695,related_mechanism
10,hydroxyurea,0.77649986743927,1.2352395057678223,ground_truth
11,curcumin,0.6803375482559204,0.9697343707084656,related_mechanism
12,sulforaphane,0.5829939842224121,0.6017141342163086,related_mechanism
13,resveratrol,0.5800595283508301,1.1647570133209229,related_mechanism
14,tadalafil,0.5426475405693054,0.21663367748260498,related_mechanism
15,glutamine,0.4674023389816284,2.029467821121216,ground_truth
16,quercetin,0.45189985632896423,0.34488344192504883,related_mechanism
17,romidepsin,0.4316568374633789,-0.8200547099113464,related_mechanism
18,decitabine,0.38500306010246277,0.8381982445716858,related_mechanism
19,azacitidine,0.31987687945365906,0.6874549388885498,related_mechanism
20,sildenafil,0.15390564501285553,0.26623794436454773,related_mechanism
21,thalidomide,0.1190553605556488,1.9194087982177734,related_mechanism
22,pomalidomide,0.11849743872880936,1.7003729343414307,related_mechanism
23,lenalidomide,0.09446469694375992,1.9872398376464844,related_mechanism
24,dexamethasone,0.031893711537122726,0.5724264979362488,related_mechanism
1 rank drug P_binder affinity inclusion_reason
2 1 trichostatin-a 0.9994845986366272 -2.883908271789551 related_mechanism
3 2 vorinostat 0.9994641542434692 -1.7357702255249023 related_mechanism
4 3 panobinostat 0.9983962774276733 -2.4892501831054688 related_mechanism
5 4 belinostat 0.9970909357070923 -2.102327823638916 related_mechanism
6 5 scriptaid 0.9957252740859985 -1.5838574171066284 related_mechanism
7 6 mocetinostat 0.9910210371017456 -1.3829972743988037 related_mechanism
8 7 entinostat 0.9903219938278198 -0.6888977289199829 related_mechanism
9 8 apicidin 0.9882931113243103 -2.2579169273376465 related_mechanism
10 9 valproic-acid 0.9134805202484131 0.3317195773124695 related_mechanism
11 10 hydroxyurea 0.77649986743927 1.2352395057678223 ground_truth
12 11 curcumin 0.6803375482559204 0.9697343707084656 related_mechanism
13 12 sulforaphane 0.5829939842224121 0.6017141342163086 related_mechanism
14 13 resveratrol 0.5800595283508301 1.1647570133209229 related_mechanism
15 14 tadalafil 0.5426475405693054 0.21663367748260498 related_mechanism
16 15 glutamine 0.4674023389816284 2.029467821121216 ground_truth
17 16 quercetin 0.45189985632896423 0.34488344192504883 related_mechanism
18 17 romidepsin 0.4316568374633789 -0.8200547099113464 related_mechanism
19 18 decitabine 0.38500306010246277 0.8381982445716858 related_mechanism
20 19 azacitidine 0.31987687945365906 0.6874549388885498 related_mechanism
21 20 sildenafil 0.15390564501285553 0.26623794436454773 related_mechanism
22 21 thalidomide 0.1190553605556488 1.9194087982177734 related_mechanism
23 22 pomalidomide 0.11849743872880936 1.7003729343414307 related_mechanism
24 23 lenalidomide 0.09446469694375992 1.9872398376464844 related_mechanism
25 24 dexamethasone 0.031893711537122726 0.5724264979362488 related_mechanism

View File

@@ -0,0 +1,175 @@
# Structure-based binding track — working notes
Branch `structure-based-binding`. Implements PLAN §12. Baseline-first, start with the two cleanest
targets (Hemoglobin + PKR), de-risk the harness before scaling.
## Status (2026-06-23)
**Toolchain check (PLAN §12.6 pitfall 4, confirmed real):**
- ✅ RDKit installs on ARM Mac — ligand side ready.
- ❌ AutoDock Vina does NOT pip-install on ARM Mac; no docking binary available. Docking (§12.3)
is **blocked on toolchain** — must resolve via conda/micromamba (`vina`/`smina`), a GPU AF3-class
model (Boltz-2/Chai-1/DiffDock), or an x86 Vina binary under Rosetta.
**Structures obtained:** `5E83` (hemoglobin + voxelotor), `8XFD` (PKR + mitapivat) in
`data/raw/structures/`.
**Step 0 — ligand-based retrieval baseline (`scripts/binding_ligand_baseline.py`):**
RDKit Tanimoto of our 300 drugs vs known sickle binders.
- Engine VALIDATED on in-set classes: `decitabine`→azacitidine (0.62); `vorinostat`→scriptaid
(0.42), belinostat (0.28). Correctly clusters DNMT1 / HDAC HbF-inducers.
- But voxelotor / mitapivat have **no analog** in our set (max Tanimoto ~0.200.26). A 300-drug
library is too sparse to contain look-alikes of distinctive scaffolds.
**Takeaways:**
1. Ligand retrieval works but needs a **bigger drug library** to be useful for distinctive targets.
2. The targets without in-set analogs (Hb, PKR) need **actual docking** (§12.3) — which scores
binding directly, no look-alike required. That is the gating next step, and it needs the
toolchain solved.
## Step 1 — docking baseline (2026-06-24)
**Toolchain SOLVED on ARM Mac:** AutoDock Vina 1.2.5 mac binary (`tools/vina`, runs under Rosetta)
+ open-babel (brew) for prep. Docking runs end-to-end (`scripts/dock_positive_controls.py`).
Co-crystal ligands identified: 5L7 = voxelotor (5E83), WV2 = mitapivat (8XFD).
**Positive-control cross-docking — inconclusive, and instructively so.** Affinities (kcal/mol):
| ligand | hemoglobin | PKR |
|---|---|---|
| voxelotor | 8.1 | 9.3 |
| mitapivat | 10.0 | 11.2 |
| decitabine | 6.6 | 7.0 |
| hydroxyurea | 3.9 | 3.6 |
| caffeine | 6.1 | 6.4 |
The scores rank almost perfectly by **molecular size** (mitapivat > voxelotor > decitabine/caffeine
> hydroxyurea) in *both* pockets — mitapivat wins even on hemoglobin, which it doesn't target. So
raw Vina affinity is confounded by ligand size and per-pocket stickiness; it cannot yet
distinguish target-specific binding. This is the **docking analog of the connectivity specificity
problem** — raw scores carry a systematic bias (size here, broadness there) that masquerades as
signal. voxelotor *does* dock to Hb (8.1, a real score); the cross-target test just isn't the
right validation.
## Step 2 — redocking-RMSD validation FAILS across the board (2026-06-24)
Redocked each co-crystal ligand into its own structure (`scripts/dock_validate.py`); RMSD vs
crystal pose via obrms:
| redock | RMSD | note |
|---|---|---|
| voxelotor → Hb (5E83) | NA | covalent binder (Schiff base, αVal1) — out of scope §12.7 |
| mitapivat → PKR (8XFD) | 8.2 Å | allosteric, cofactor (FBP/Mg) stripped |
| **vorinostat → HDAC2 (4LXZ, Zn kept)** | **7.9 Å** | classical non-covalent target — should have worked |
**The clean target also failing means this is a systematic PIPELINE-QUALITY problem, not target
choice.** The cheap Vina + open-babel setup produces scores but does not reproduce known binding
geometry, so its affinities are not yet trustworthy. Ligand efficiency (affinity / heavy atoms)
also doesn't fix it — it over-corrects, ranking tiny hydroxyurea (0.78) "best".
Likely causes (in priority order):
1. **Low-quality receptor prep** — open-babel `-xr` is not production docking prep. Need
AutoDockTools `prepare_receptor` or **Meeko** + `reduce`/pdb2pqr for protonation, charges, and
proper AutoDock atom typing.
2. **Ligand prep** — should use Meeko (correct rotatable bonds / typing), not bare obabel `--gen3d`.
3. **RMSD metric** — obrms superimposes before RMSD; redocking validation wants symmetry-corrected
RMSD **in place** (receptor frame). Worth confirming with an in-place metric.
**Honest takeaway:** consistent with the whole project — the *quick* version of each method runs
but doesn't survive honest validation. Credible structure-based docking needs production prep
tooling (Meeko/ADFR), which is the real next investment for this track.
## Step 3 — production prep helps, but classical docking is the wrong tool here (2026-06-24)
`scripts/dock_production.py`: Meeko ligand prep (proper rotatable-bond/AD typing) + in-place
symmetry-corrected RMSD (spyrmsd, not obrms which superimposes). On the clean HDAC2/vorinostat
target (Zn kept):
- **7.9 Å → 4.76 Å** with proper ligand prep + correct metric. Prep and metric genuinely mattered.
- But still FAIL (>2 Å). The residual is the deeper problem: **vorinostat binding is defined by its
hydroxamate chelating the catalytic Zn, and Vina has no metal-coordination term** — it cannot
score the interaction that determines the pose.
**The real finding: sickle's druggable targets bind via non-classical chemistry that classical
docking handles poorly** — Hb/voxelotor (covalent), PKR/mitapivat (allosteric + cofactor),
HDAC/vorinostat (Zn chelation). This is the target landscape, not bad luck. AutoDock Vina is the
wrong tool for it.
**Redirect:** the modality that DOES handle covalent/metal/induced-fit binding is **data-driven
AF3-class co-folding** (Boltz-2 / Chai-1 / DiffDock — they learn these modes from the PDB). That is
the indicated next tool for this disease — and it's gated by the **24 GB local memory ceiling**
(PLAN §12.6 pitfall 4): needs a cloud GPU or a bigger box. The "GPU breaks all-local" prediction is
now the binding constraint of the whole track.
## Step 4 — AF3-class co-folding (Boltz-2 on Modal GPU) WORKS on the Zn case (2026-06-24)
Ran Phase 1 on Modal (L4, serverless) — `gpu/modal_app.py`, ~$0.600.80. Co-folded each known
binder + 2 negatives into each target WITH the binding-mode cofactors (HDAC2+Zn, PKR+FBP/Mg,
Hb+heme). Ranked by Boltz-2 P(binder):
| target | known binder P(binder) | negatives | verdict |
|---|---|---|---|
| **HDAC2 (+Zn)** | vorinostat **0.9994** | caffeine 0.12, hydroxyurea 0.77 | **PASS — decisive** |
| hemoglobin (+heme) | voxelotor 0.46 | caffeine 0.34, hydroxyurea 0.07 | PASS (weak) |
| PKR (+FBP/Mg) | mitapivat 0.32 | hydroxyurea 0.40 (beat it) | FAIL |
**Headline: HDAC2 + zinc — the exact case Vina failed (7.9 Å redock, no metal term) — co-folding
NAILS** (vorinostat 0.999 vs negatives ~0.1). The data-driven model handles the Zn-chelation
chemistry classical docking could not. The modality pivot is validated on its decisive test.
The first clear positive result in the project after a long string of honest negatives.
Notes: (1) affinity sign confirmed — vorinostat has the lowest affinity_pred_value (1.78,
strongest) AND highest P(binder); ranking by max(affinity) would be backwards (the P(binder) fix
was necessary). (2) 2/3: Hb weak (covalent/tetramer, as predicted), PKR miss (allosteric pocket).
(3) Engineering: had to add cuequivariance kernels to the image; serialize (max_containers=1) so
the weights download once (parallel containers corrupted the checkpoint).
## Step 5 — pose-RMSD confirmation: co-folding reproduces the geometry (2026-06-24)
`scripts/pose_rmsd.py`: superpose predicted HDAC2 onto crystal 4LXZ (Ca), transform the predicted
vorinostat + Zn, compare poses (spyrmsd, in-place).
| metric | co-folding | classical Vina |
|---|---|---|
| protein fold, Ca RMSD over 366 res | **0.14 A** | — |
| **vorinostat pose RMSD** | **0.21 A** PASS | 7.9 A FAIL |
| catalytic Zn placement | 2.73 A (minor) | no metal term |
Co-folding reproduces the fold AND the vorinostat binding pose to **0.21 A (crystal-accurate)** on
the exact Zn-chelation case Vina was off by 7.9 A. HDAC2 is now validated on BOTH axes: affinity
(P(binder)=0.999) and geometry (0.21 A). Minor blemish: the Zn ion itself lands ~2.7 A off (ligand
perfect, metal slightly less). The structure-binding modality is comprehensively validated on its
decisive metal-coordination case.
## Step 6 — Phase 2 screen pilot (HDAC2): recovers the inhibitor class decisively (2026-06-24)
`modal run gpu/modal_app.py::screen --limit 24` — co-fold 24 drugs vs HDAC2 (+Zn), parallel
(~10-wide, cached weights), ranked by P(binder). ~$1.3.
**Top 9 = all HDAC inhibitors** (trichostatin-A, vorinostat, panobinostat, belinostat, scriptaid,
mocetinostat, entinostat, apicidin all P≥0.99; valproic-acid 0.91), then a clean drop to
hydroxyurea 0.78 and non-HDAC drugs sinking to dexamethasone 0.03. The model captures the
**structure-activity gradient**: potent Zn-chelating hydroxamates ~0.99 vs weak fatty-acid
valproate 0.91 vs DNMT inhibitors / IMiDs / steroid near 0. The screen recovers the right
pharmacological class at scale.
**Honest false negative:** romidepsin (potent HDAC inhibitor) ranked low (0.43) — it is a cyclic
depsipeptide PRODRUG (must be reduced to expose its Zn-binding thiol), which co-folding doesn't
model. The screen mishandles non-standard chemotypes (prodrugs, macrocycles).
The screening pipeline is validated. Next: run the full set (incl. the 240 random + negatives) to
hunt for NON-obvious HDAC2 binders (the actual discovery run), ~$15-20.
## Next steps
- [ ] Full screen (300 drugs) vs HDAC2 — discovery run for non-obvious binders.
- [ ] Investigate PKR: allosteric site may need the full assembly / better pocket definition.
- [ ] Phase 2 screen: rank the ~300-drug set against HDAC2 (the validated target) by P(binder);
positive-control recovery test at screen scale.
- [ ] Add a one-time weight-warmup function so post-cache runs go back to fast parallel safely.
- [ ] (optional) Salvage one classical Vina case: PKR with FBP/Mg cofactors RETAINED, to confirm
the harness can validate on a non-metal sickle target.
- [ ] Production receptor prep (Meeko mk_prepare_receptor + protonation) if staying with Vina.
- [ ] §12.9 generative beacon — only after a validated scoring function exists.
> **Hardware note:** this machine is **24 GB** unified memory (not the 96 GB PLAN §2 assumed),
> which caps local AF3-class model inference. Classical docking (above) is unaffected.

329
gpu/modal_app.py Normal file
View File

@@ -0,0 +1,329 @@
"""Ephemeral GPU runner for AF3-class co-folding (PLAN §12, docs/gpu_plan.md).
Serverless: `modal run gpu/modal_app.py` provisions a GPU, runs Phase 1, releases the GPU — zero
idle cost. Model weights cache in a persistent Volume so we never re-pay GPU time to re-download.
Phase 1 (positive-control recovery test, §12.4): co-fold each known binder + a couple of negative
controls against each sickle target and rank by Boltz-2 predicted affinity. The known binder
should win its own target — the test Vina couldn't pass on metal/covalent/allosteric modes.
Affinity ranking avoids the receptor-alignment that pose-RMSD would need (a later refinement).
Setup (one-time): `pip install modal && modal token new`.
Run: `modal run gpu/modal_app.py`.
Helpers below the GPU function run locally (no GPU) and are import-safe for testing.
"""
from __future__ import annotations
import json
import subprocess
from pathlib import Path
import modal
app = modal.App("reverso-binding")
image = (
modal.Image.debian_slim(python_version="3.12")
.apt_install("git", "wget")
# Boltz-2 needs NVIDIA cuequivariance kernels (cuda 12) for inference, plus rdkit/numpy.
.pip_install("boltz", "cuequivariance-torch", "cuequivariance-ops-torch-cu12", "rdkit", "numpy")
)
weights = modal.Volume.from_name("reverso-binding-weights", create_if_missing=True)
WEIGHTS = "/weights"
# target -> (PDB id, crystal-ligand resname, drug name, cofactor/metal CCD codes to co-fold).
# The cofactors are the binding-mode determinants Vina couldn't model: HDAC2 needs the catalytic
# Zn (vorinostat chelates it), PKR the allosteric FBP + Mg, hemoglobin the heme. Co-folding them
# in as CCD ligands is the whole point of the AF3-class pivot. The same cofactors are present when
# co-folding the negatives into that target, for a fair comparison.
TARGETS = {
"hemoglobin": ("5E83", "5L7", "voxelotor", ["HEM"]),
"PKR": ("8XFD", "WV2", "mitapivat", ["FBP", "MG"]),
"HDAC2": ("4LXZ", "SHH", "vorinostat", ["ZN"]),
}
NEGATIVES = ["caffeine", "hydroxyurea"]
# Honest limitation: hemoglobin's voxelotor site sits at the tetramer centre and the bond is
# covalent (Schiff base) — a single-chain + heme model only approximates it, so Hb is the weak
# case. HDAC2 (Zn chelation) and PKR (allosteric + cofactor) are the real tests of whether
# co-folding handles the modes classical docking could not.
# --------------------------------------------------------------------------- local helpers (CPU)
def fetch_pdb(pdb: str, struct_dir: Path = Path("data/raw/structures")) -> Path:
"""Return a local PDB path, downloading from RCSB if absent (keeps the GPU run self-contained)."""
import requests
p = struct_dir / f"{pdb}.pdb"
if not p.exists():
p.parent.mkdir(parents=True, exist_ok=True)
p.write_bytes(requests.get(f"https://files.rcsb.org/download/{pdb}.pdb", timeout=60).content)
return p
def binding_chain_sequence(pdb: str, lig_resname: str) -> str:
"""One-letter sequence of the protein chain nearest the crystal ligand (the binding chain)."""
import gemmi
st = gemmi.read_structure(str(fetch_pdb(pdb)))
model = st[0]
lig_atoms = [a.pos for ch in model for res in ch if res.name == lig_resname for a in res]
if not lig_atoms:
raise ValueError(f"ligand {lig_resname} not found in {pdb}")
lig_center = gemmi.Position(
sum(p.x for p in lig_atoms) / len(lig_atoms),
sum(p.y for p in lig_atoms) / len(lig_atoms),
sum(p.z for p in lig_atoms) / len(lig_atoms),
)
best, best_d = None, 1e9
for ch in model:
poly = ch.get_polymer()
if len(poly) < 20:
continue
d = min((a.pos.dist(lig_center) for res in ch for a in res), default=1e9)
if d < best_d:
best, best_d = poly, d
if best is None:
raise ValueError(f"no protein chain in {pdb}")
return gemmi.one_letter_code(best.extract_sequence()).upper().replace("X", "")
def pubchem_smiles(name: str) -> str:
import requests
for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
try:
d = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}"
f"/property/{prop}/JSON", timeout=30).json()["PropertyTable"]["Properties"][0]
if prop in d:
return d[prop]
except Exception:
continue
raise ValueError(f"no SMILES for {name}")
def build_boltz_yaml(protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str] | None = None,
msa_path: str | None = None) -> str:
"""Boltz-2 YAML: protein + the drug ligand (affinity binder) + any cofactor/metal CCD ligands.
Cofactors/ions are added as `ligand` entries referencing their CCD code (e.g. ZN, MG, FBP,
HEM) so the model places the metal/cofactor that defines the binding mode. Affinity is
predicted on the drug ligand L only. If ``msa_path`` is given, the protein reuses a precomputed
MSA (so a screen against one target queries the MSA server ONCE, not per drug).
"""
prot = [" - protein:", " id: A", f" sequence: {protein_seq}"]
if msa_path:
prot.append(f" msa: {msa_path}")
lines = ["version: 1", "sequences:"] + prot + \
[" - ligand:", " id: L", f" smiles: '{ligand_smiles}'"]
for i, ccd in enumerate(cofactor_ccds or []):
lines += [" - ligand:", f" id: M{i}", f" ccd: {ccd}"]
lines += ["properties:", " - affinity:", " binder: L"]
return "\n".join(lines) + "\n"
# ------------------------------------------------------------------------------- GPU function
# max_containers caps parallel fan-out (cost control). The download race that corrupts the
# checkpoint only happens on a COLD volume; once weights are cached+committed (Phase 1 did this),
# parallel containers just reload them, so a screen can safely run ~10-wide.
@app.function(gpu="L4", image=image, volumes={WEIGHTS: weights}, timeout=1800)
def cache_msa(label: str, protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str]) -> str:
"""Compute the target's MSA ONCE (via the server) and cache the a3m on the Volume.
A screen reuses one target protein for all drugs, so we query the MSA server a single time
here, then every cofold() reuses the cached a3m (no server hammering). Returns the Volume path
of the cached a3m. Doubles as the weight/CCD warmup.
"""
import os
import shutil
os.environ["HF_HOME"] = f"{WEIGHTS}/hf"
boltz_cache = f"{WEIGHTS}/boltz"
os.makedirs(boltz_cache, exist_ok=True)
weights.reload()
work = Path("/tmp") / f"{label}_msa"
work.mkdir(parents=True, exist_ok=True)
(work / "in.yaml").write_text(build_boltz_yaml(protein_seq, ligand_smiles, cofactor_ccds))
out = work / "out"
subprocess.run(["boltz", "predict", str(work / "in.yaml"), "--use_msa_server",
"--cache", boltz_cache, "--out_dir", str(out), "--output_format", "pdb"], check=True)
# Pick the largest a3m that actually looks like FASTA/a3m text (Boltz writes several files;
# some are binary and crash its INPUT parser with KeyError '\x00'). Strip null bytes too.
candidates = sorted(out.rglob("*.a3m"), key=lambda p: p.stat().st_size, reverse=True)
chosen = None
for c in candidates:
raw = c.read_bytes()
if raw[:1] == b">" or b"\n>" in raw[:512]:
chosen = raw.replace(b"\x00", b"")
break
if chosen is None:
weights.commit()
raise RuntimeError(f"no valid a3m; candidates={[(str(p), p.stat().st_size) for p in candidates]}")
msa_dir = Path(WEIGHTS) / "msa"
msa_dir.mkdir(parents=True, exist_ok=True)
dest = msa_dir / f"{label}.a3m"
dest.write_bytes(chosen)
weights.commit()
return str(dest)
@app.function(gpu="L4", image=image, volumes={WEIGHTS: weights}, timeout=3600, max_containers=20)
def cofold(label: str, protein_seq: str, ligand_smiles: str, cofactor_ccds: list[str],
msa_path: str | None = None) -> dict:
"""Co-fold one complex (protein + drug + cofactors) on the GPU; return affinity + P(binder).
Weights persist on the mounted Volume. If ``msa_path`` is given, reuse the cached target MSA
(no server query); else query the MSA server. GPU is released the moment this returns.
"""
import os
os.environ["HF_HOME"] = f"{WEIGHTS}/hf"
boltz_cache = f"{WEIGHTS}/boltz"
os.makedirs(boltz_cache, exist_ok=True)
weights.reload()
work = Path("/tmp") / label
work.mkdir(parents=True, exist_ok=True)
(work / "in.yaml").write_text(build_boltz_yaml(protein_seq, ligand_smiles, cofactor_ccds, msa_path))
out = work / "out"
cmd = ["boltz", "predict", str(work / "in.yaml"),
"--cache", boltz_cache, "--out_dir", str(out), "--output_format", "pdb"]
if not msa_path:
cmd.insert(3, "--use_msa_server")
try:
subprocess.run(cmd, check=True)
finally:
weights.commit() # persist downloaded weights/CCD even if this run fails, so retries skip it
# Affinity is written to a JSON under out/predictions/<name>/; parse defensively (keys vary).
aff = {"affinity_pred_value": None, "affinity_probability_binary": None}
for jf in out.rglob("affinity*.json"):
data = json.loads(jf.read_text())
for k in aff:
if k in data:
aff[k] = data[k]
break
cif = next(out.rglob("*_model_0.pdb"), None) or next(out.rglob("*.pdb"), None)
return {"label": label, "affinity": aff["affinity_pred_value"],
"prob_binder": aff["affinity_probability_binary"],
"structure": cif.read_text() if cif else None}
# ------------------------------------------------------------------------------- driver (local)
@app.local_entrypoint()
def pose() -> None:
"""Save the predicted HDAC2/vorinostat complex for local pose-RMSD validation.
Run: `modal run gpu/modal_app.py::pose`. Weights are cached, so this is one fast GPU call.
The returned PDB (protein + vorinostat + Zn) is scored locally against 4LXZ by
scripts/pose_rmsd.py (align predicted protein to crystal, compare ligand).
"""
target = "HDAC2"
pdb, res, drug, cofactors = TARGETS[target]
seq = binding_chain_sequence(pdb, res)
r = cofold.remote(f"{target}_{drug}_pose", seq, pubchem_smiles(drug), cofactors)
out = Path("data/processed/binding"); out.mkdir(parents=True, exist_ok=True)
if r.get("structure"):
dest = out / f"{target}_{drug}_pred.pdb"
dest.write_text(r["structure"])
print(f"saved {dest}; affinity={r['affinity']}, P(binder)={r['prob_binder']}")
else:
print("no structure returned")
@app.local_entrypoint()
def screen(limit: int = 0) -> None:
"""Phase 2: co-fold the drug set against the validated target (HDAC2 + Zn), rank by P(binder).
`modal run gpu/modal_app.py::screen --limit 24` (pilot; omit --limit for the full set).
Recovery check at scale: the known HDAC inhibitors (related_mechanism) should rank top.
Weights are cached, so this fans out ~10-wide.
"""
import csv
import pandas as pd
target = "HDAC2"
pdb, res, _drug, cofactors = TARGETS[target]
seq = binding_chain_sequence(pdb, res)
df = pd.read_csv("data/processed/drug_set_v1.csv")
df = df[df["canonical_smiles"].notna() & (df["canonical_smiles"] != "-666")].copy()
if limit: # pilot: prioritise mechanism + controls (incl. the HDAC inhibitors) then fill
pri = df[df["inclusion_reason"].isin(["ground_truth", "related_mechanism", "negative_control"])]
df = pd.concat([pri, df.drop(pri.index)]).head(limit)
# 1) compute the target MSA ONCE (also warms weights/CCD), then reuse it for every drug
print(f"computing {target} MSA once (server) ...")
msa_path = cache_msa.remote(target, seq, pubchem_smiles("vorinostat"), cofactors)
print(f"cached MSA: {msa_path}")
# 2) screen all drugs reusing the cached MSA; tolerate per-drug failures (no abort)
jobs = [(f"{target}__{r.pert_iname}", seq, r.canonical_smiles, cofactors, msa_path)
for r in df.itertuples()]
print(f"screening {len(jobs)} drugs vs {target} (+{cofactors}), reusing cached MSA")
results = list(cofold.starmap(jobs, return_exceptions=True))
by = {j[0].split("__")[1]: (r if isinstance(r, dict) else None) for j, r in zip(jobs, results)}
n_fail = sum(1 for v in by.values() if v is None)
if n_fail:
print(f" ({n_fail}/{len(jobs)} drugs failed and were skipped)")
reason = dict(zip(df["pert_iname"], df["inclusion_reason"]))
rows = [{"drug": d, "P_binder": (r or {}).get("prob_binder"),
"affinity": (r or {}).get("affinity"), "inclusion_reason": reason.get(d)}
for d, r in by.items()]
rows = [x for x in rows if x["P_binder"] is not None]
rows.sort(key=lambda x: x["P_binder"], reverse=True)
out = Path("data/processed/binding"); out.mkdir(parents=True, exist_ok=True)
with (out / f"screen_{target}.csv").open("w", newline="") as f:
w = csv.DictWriter(f, fieldnames=["rank", "drug", "P_binder", "affinity", "inclusion_reason"])
w.writeheader()
for i, x in enumerate(rows, 1):
w.writerow({"rank": i, **x})
print(f"\nscreened {len(rows)} drugs vs {target}; top 15 by P(binder):")
for i, x in enumerate(rows[:15], 1):
print(f" {i:2d} {x['drug']:20s} P={x['P_binder']:.3f} [{x['inclusion_reason']}]")
hdac_like = [i for i, x in enumerate(rows, 1) if x["inclusion_reason"] == "related_mechanism"]
if hdac_like:
print(f"\nrelated-mechanism (HDAC inhibitors etc.) ranks: {hdac_like[:10]}")
@app.local_entrypoint()
def main() -> None:
"""Fan out one GPU call per (target, ligand) pair; tabulate affinities; positive-control test."""
jobs = [] # (label, protein_seq, smiles, cofactor_ccds)
for target, (pdb, res, drug, cofactors) in TARGETS.items():
seq = binding_chain_sequence(pdb, res)
for lig in [drug, *NEGATIVES]:
jobs.append((f"{target}_{lig}", seq, pubchem_smiles(lig), cofactors))
cofactor_summary = {t: c[3] for t, c in TARGETS.items()}
print(f"co-folding {len(jobs)} complexes ({len(TARGETS)} targets x {1+len(NEGATIVES)} ligands); "
f"cofactors per target: {cofactor_summary}")
results = list(cofold.starmap(jobs))
by = {r["label"]: r for r in results}
print(f"\n{'target':12s}{'ligand':14s}{'affinity':>10s}{'P(binder)':>11s}")
out_rows = []
for target, (pdb, res, drug, cofactors) in TARGETS.items():
prob = {} # rank by P(binder): unambiguous (higher = more likely a binder). Boltz
for lig in [drug, *NEGATIVES]: # affinity_pred_value is ~log(IC50) (lower=stronger) -- sign
r = by.get(f"{target}_{lig}", {}) # is version-dependent, so don't rank on it.
a, p = r.get("affinity"), r.get("prob_binder")
if p is not None:
prob[lig] = p
print(f"{target:12s}{lig:14s}{(f'{a:.2f}' if a is not None else 'NA'):>10s}"
f"{(f'{p:.2f}' if p is not None else 'NA'):>11s}")
out_rows.append({"target": target, "ligand": lig, "affinity": a, "prob_binder": p,
"is_known_binder": lig == drug})
if prob:
best = max(prob, key=prob.get) # highest P(binder)
print(f" -> {target}: best P(binder) = {best} (known = {drug}) "
f"{'PASS' if best == drug else 'FAIL'}\n")
outdir = Path("data/processed/binding"); outdir.mkdir(parents=True, exist_ok=True)
import csv
with (outdir / "phase1_affinity.csv").open("w", newline="") as f:
w = csv.DictWriter(f, fieldnames=["target", "ligand", "affinity", "prob_binder", "is_known_binder"])
w.writeheader(); w.writerows(out_rows)
print(f"wrote {outdir/'phase1_affinity.csv'}")

View File

@@ -33,6 +33,18 @@ dev = [
"pytest>=8.0", "pytest>=8.0",
"ruff>=0.5", "ruff>=0.5",
] ]
# Structure-based binding track (PLAN §12); see docs/structure_binding_notes.md.
# Non-pip tools (install separately): AutoDock Vina binary (tools/vina, from the AutoDock-Vina
# GitHub release) and open-babel (brew). Boltz-2 + cuequivariance run only in the Modal GPU image
# (gpu/modal_app.py), not locally.
structure = [
"rdkit>=2024.3", # ligand prep, fingerprints
"requests>=2.31", # PubChem / RCSB fetch
"gemmi>=0.6", # structure parsing + superposition
"spyrmsd>=0.9", # symmetry-corrected pose RMSD
"meeko>=0.7", # production docking ligand prep
"modal>=1.5", # ephemeral GPU orchestration
]
[tool.setuptools.packages.find] [tool.setuptools.packages.find]
where = ["."] where = ["."]

View File

@@ -0,0 +1,66 @@
"""Structure-based track, step 0: ligand-based retrieval baseline (PLAN §12.9 engine).
Docking (§12.3) needs a toolchain that doesn't pip-install on ARM Mac (AutoDock Vina) — that's the
next dependency to solve. Meanwhile this runs now with pure RDKit: do any of our 300 drugs sit near
the KNOWN sickle binders (voxelotor, mitapivat, decitabine) in chemical space? This is the
retrieval engine §12.9 would point a generative beacon at, and a sanity check on the ligand data.
NOT docking and NOT a binding claim — chemical similarity only. Similarity != activity (§12.9).
"""
from __future__ import annotations
import pandas as pd
import requests
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator
RDLogger.DisableLog("rdApp.*")
MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
# Known sickle binders = positive-control beacons (target in parens).
BINDERS = ["voxelotor", "mitapivat", "decitabine", "vorinostat"]
def pubchem_smiles(name: str) -> str | None:
for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
try:
u = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON"
d = requests.get(u, timeout=30).json()["PropertyTable"]["Properties"][0]
if prop in d:
return d[prop]
except Exception:
continue
return None
def fp(smi: str):
if not isinstance(smi, str) or smi in ("-666", ""):
return None
m = Chem.MolFromSmiles(smi)
return MORGAN.GetFingerprint(m) if m else None
def main() -> None:
binder_smi = {b: pubchem_smiles(b) for b in BINDERS}
print("known-binder SMILES:", {k: (v[:34] + "..." if v else "MISSING") for k, v in binder_smi.items()})
drugs = pd.read_csv("data/processed/drug_set_v1.csv")[["pert_iname", "canonical_smiles", "inclusion_reason"]]
reason = dict(zip(drugs.pert_iname, drugs.inclusion_reason))
drug_fp = {r.pert_iname: fp(r.canonical_smiles) for r in drugs.itertuples()}
drug_fp = {k: v for k, v in drug_fp.items() if v is not None}
print(f"fingerprinted {len(drug_fp)}/{len(drugs)} drugs\n")
for b, smi in binder_smi.items():
bfp = fp(smi)
if bfp is None:
print(f"{b}: no SMILES\n"); continue
sims = sorted(((DataStructs.TanimotoSimilarity(bfp, v), k) for k, v in drug_fp.items()), reverse=True)
print(f"nearest drugs to {b}:")
for s, k in sims[:5]:
print(f" {s:.3f} {k:22s} [{reason.get(k,'?')}]")
print()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,111 @@
"""Structure-based track §12.4: dock known binders + negatives into Hb and PKR.
Positive-control recovery test: voxelotor should dock best into hemoglobin (5E83), mitapivat best
into PKR (8XFD), and unrelated drugs (caffeine, hydroxyurea) should score worse. The box is
centered on each co-crystal ligand (5L7=voxelotor, WV2=mitapivat). AutoDock Vina (mac binary,
Rosetta) + open-babel prep. Affinity in kcal/mol; more negative = stronger predicted binding.
This is a docking baseline, not an efficacy claim (PLAN §12.6).
"""
from __future__ import annotations
import re
import subprocess
import tempfile
from pathlib import Path
import numpy as np
import requests
VINA = "./tools/vina"
STRUCT = Path("data/raw/structures")
WORK = Path("data/processed/docking")
WORK.mkdir(parents=True, exist_ok=True)
TARGETS = {"hemoglobin": ("5E83", "5L7"), "PKR": ("8XFD", "WV2")}
LIGANDS = ["voxelotor", "mitapivat", "decitabine", "hydroxyurea", "caffeine"]
EXPECTED = {"voxelotor": "hemoglobin", "mitapivat": "PKR"}
def pubchem_smiles(name: str) -> str | None:
for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
try:
d = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON",
timeout=30).json()["PropertyTable"]["Properties"][0]
if prop in d:
return d[prop]
except Exception:
continue
return None
def prep_receptor_and_box(pdb: str, lig_res: str):
"""Receptor pdbqt (protein only) + box center/size from one copy of the co-crystal ligand."""
atoms, lig, chosen = [], [], None
for ln in (STRUCT / f"{pdb}.pdb").read_text().splitlines():
if ln.startswith("ATOM"):
atoms.append(ln)
elif ln.startswith("HETATM") and ln[17:20].strip() == lig_res:
key = (ln[21], ln[22:26])
chosen = chosen or key
if key == chosen:
lig.append(ln)
rec_pdb = WORK / f"{pdb}_rec.pdb"
rec_pdb.write_text("\n".join(atoms) + "\nEND\n")
rec_pdbqt = WORK / f"{pdb}_rec.pdbqt"
subprocess.run(["obabel", str(rec_pdb), "-O", str(rec_pdbqt), "-xr", "-p", "7.4"],
capture_output=True, check=True)
xyz = np.array([[float(l[30:38]), float(l[38:46]), float(l[46:54])] for l in lig])
center = xyz.mean(0)
size = np.clip((xyz.max(0) - xyz.min(0)) + 9.0, 18.0, 28.0)
return rec_pdbqt, center, size
def prep_ligand(name: str, smiles: str) -> Path | None:
out = WORK / f"lig_{name}.pdbqt"
r = subprocess.run(["obabel", f"-:{smiles}", "-O", str(out), "--gen3d", "-p", "7.4"],
capture_output=True, text=True)
return out if out.exists() and out.stat().st_size > 0 else None
def dock(rec_pdbqt: Path, lig_pdbqt: Path, center, size) -> float | None:
out = subprocess.run([VINA, "--receptor", str(rec_pdbqt), "--ligand", str(lig_pdbqt),
"--center_x", f"{center[0]:.2f}", "--center_y", f"{center[1]:.2f}",
"--center_z", f"{center[2]:.2f}", "--size_x", f"{size[0]:.1f}",
"--size_y", f"{size[1]:.1f}", "--size_z", f"{size[2]:.1f}",
"--exhaustiveness", "8", "--cpu", "4",
"--out", str(WORK / "o.pdbqt")], capture_output=True, text=True)
m = re.search(r"^\s+1\s+(-?\d+\.\d+)", out.stdout, re.M)
return float(m.group(1)) if m else None
def main() -> None:
smis = {n: pubchem_smiles(n) for n in LIGANDS}
ligs = {n: prep_ligand(n, s) for n, s in smis.items() if s}
ligs = {n: p for n, p in ligs.items() if p}
print(f"prepared ligands: {list(ligs)}")
boxes = {t: prep_receptor_and_box(pdb, lr) for t, (pdb, lr) in TARGETS.items()}
print("prepared receptors:", list(boxes))
print(f"\n{'ligand':14s}" + "".join(f"{t:>13s}" for t in TARGETS))
results = {}
for lname, lpath in ligs.items():
row = {}
for tname, (rec, center, size) in boxes.items():
row[tname] = dock(rec, lpath, center, size)
results[lname] = row
cells = "".join(f"{(f'{row[t]:.1f}' if row[t] is not None else 'NA'):>13s}" for t in TARGETS)
print(f" {lname:12s}{cells}")
print("\n=== positive-control recovery test (§12.4) ===")
for lig, exp_target in EXPECTED.items():
if lig in results:
best = min(results[lig], key=lambda t: results[lig][t] if results[lig][t] is not None else 99)
ok = best == exp_target
print(f" {lig:12s} best target = {best:11s} (expected {exp_target}) -> {'PASS' if ok else 'FAIL'}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,83 @@
"""Push docking to credibility: Meeko ligand prep + in-place spyrmsd, on the clean HDAC2 target.
Isolates the two cheap suspects behind the 7.9 A redocking failure:
1. ligand prep — replace bare obabel --gen3d with Meeko (correct rotatable bonds / AD typing)
2. RMSD metric — replace obrms (superimposes) with spyrmsd in-place, symmetry-corrected
Receptor reused (obabel-protonated 4LXZ with catalytic Zn kept). If redock RMSD drops <2 A, the
docking pipeline is validated and the earlier failure was prep+metric, not docking itself.
"""
from __future__ import annotations
import subprocess
from pathlib import Path
import numpy as np
from rdkit import Chem, RDLogger
from rdkit.Chem import AllChem
from meeko import MoleculePreparation, PDBQTWriterLegacy
from spyrmsd import io as spyio, rmsd as spyrmsd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from scripts.dock_positive_controls import STRUCT, VINA, WORK, pubchem_smiles # noqa: E402
from scripts.dock_validate import dock_pose, extract_crystal_ligand # noqa: E402
RDLogger.DisableLog("rdApp.*")
PDB, RES, DRUG = "4LXZ", "SHH", "vorinostat"
def prep_receptor_keep_zn() -> tuple[Path, np.ndarray, np.ndarray]:
atoms, lig, chosen = [], [], None
for ln in (STRUCT / f"{PDB}.pdb").read_text().splitlines():
if ln.startswith("ATOM") or (ln.startswith("HETATM") and ln[17:20].strip() == "ZN"):
atoms.append(ln)
elif ln.startswith("HETATM") and ln[17:20].strip() == RES:
key = (ln[21], ln[22:26]); chosen = chosen or key
if key == chosen:
lig.append(ln)
recpdb = WORK / f"{PDB}_rec.pdb"; recpdb.write_text("\n".join(atoms) + "\nEND\n")
recpdbqt = WORK / f"{PDB}_rec.pdbqt"
subprocess.run(["obabel", str(recpdb), "-O", str(recpdbqt), "-xr", "-p", "7.4"], capture_output=True)
xyz = np.array([[float(l[30:38]), float(l[38:46]), float(l[46:54])] for l in lig])
return recpdbqt, xyz.mean(0), np.clip((xyz.max(0) - xyz.min(0)) + 8, 16, 26)
def meeko_ligand(smiles: str) -> Path:
mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
AllChem.EmbedMolecule(mol, randomSeed=0xC0FFEE)
AllChem.MMFFOptimizeMolecule(mol)
setups = MoleculePreparation().prepare(mol)
pdbqt, ok, err = PDBQTWriterLegacy.write_string(setups[0])
if not ok:
raise RuntimeError(f"meeko ligand prep failed: {err}")
out = WORK / f"lig_{DRUG}_meeko.pdbqt"; out.write_text(pdbqt)
return out
def inplace_rmsd(crystal_pdb: Path, docked_pdbqt: Path) -> float:
cref = WORK / "xtal.sdf"; cdoc = WORK / "docked.sdf"
subprocess.run(["obabel", str(crystal_pdb), "-O", str(cref)], capture_output=True)
subprocess.run(["obabel", str(docked_pdbqt), "-O", str(cdoc), "-f", "1", "-l", "1"], capture_output=True)
ref, mol = spyio.loadmol(str(cref)), spyio.loadmol(str(cdoc))
ref.strip(); mol.strip()
r = spyrmsd.rmsdwrapper(ref, mol, symmetry=True, minimize=False)
return float(r[0] if isinstance(r, (list, tuple, np.ndarray)) else r)
def main() -> None:
rec, center, size = prep_receptor_keep_zn()
smi = pubchem_smiles(DRUG)
lig = meeko_ligand(smi)
print(f"meeko ligand pdbqt: {lig.name}; box center {center.round(1)} size {size.round(1)}")
pose = dock_pose(rec, lig, center, size) # writes WORK/redock_out.pdbqt
raw = WORK / "redock_out.pdbqt"
xtal = extract_crystal_ligand(PDB, RES)
rmsd = inplace_rmsd(xtal, raw)
verdict = "PASS (<2A)" if rmsd < 2 else ("MARGINAL (<3A)" if rmsd < 3 else "FAIL")
print(f"\nvorinostat -> HDAC2 redock | in-place spyrmsd RMSD = {rmsd:.2f} A {verdict}")
print("(prior obabel-ligand + obrms gave 7.9 A)")
if __name__ == "__main__":
main()

93
scripts/dock_validate.py Normal file
View File

@@ -0,0 +1,93 @@
"""Structure-binding §12.4: redocking-RMSD validation + ligand-efficiency normalization.
Two de-biased checks of the docking baseline:
A. REDOCKING RMSD — redock each co-crystal ligand into its own structure and measure pose RMSD
vs the crystal pose (open-babel obrms, symmetry-aware). <2 A = geometry validated. This is
the gold-standard positive control and is immune to the molecular-size bias that made the
cross-target affinity test inconclusive.
B. LIGAND EFFICIENCY — affinity / heavy-atom count, to de-bias the size effect in the raw scores.
"""
from __future__ import annotations
import re
import subprocess
from pathlib import Path
from rdkit import Chem, RDLogger
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from scripts.dock_positive_controls import ( # noqa: E402
STRUCT, VINA, WORK, TARGETS, pubchem_smiles, prep_ligand, prep_receptor_and_box,
)
RDLogger.DisableLog("rdApp.*")
XTAL_LIG = {"hemoglobin": ("5E83", "5L7", "voxelotor"), "PKR": ("8XFD", "WV2", "mitapivat")}
def extract_crystal_ligand(pdb: str, resname: str) -> Path:
lines, chosen = [], None
for ln in (STRUCT / f"{pdb}.pdb").read_text().splitlines():
if ln.startswith("HETATM") and ln[17:20].strip() == resname:
key = (ln[21], ln[22:26])
chosen = chosen or key
if key == chosen:
lines.append(ln)
out = WORK / f"xtal_{resname}.pdb"
out.write_text("\n".join(lines) + "\nEND\n")
return out
def dock_pose(rec, lig, center, size) -> Path | None:
pose = WORK / "redock_out.pdbqt"
subprocess.run([VINA, "--receptor", str(rec), "--ligand", str(lig),
"--center_x", f"{center[0]:.2f}", "--center_y", f"{center[1]:.2f}",
"--center_z", f"{center[2]:.2f}", "--size_x", f"{size[0]:.1f}",
"--size_y", f"{size[1]:.1f}", "--size_z", f"{size[2]:.1f}",
"--exhaustiveness", "16", "--out", str(pose)], capture_output=True, text=True)
if not pose.exists():
return None
top = WORK / "redock_top.pdb" # first model only
subprocess.run(["obabel", str(pose), "-O", str(top), "-f", "1", "-l", "1"], capture_output=True)
return top
def obrms(ref: Path, test: Path) -> float | None:
# strip H to compare heavy-atom poses; obrms is automorphism-aware
r2, t2 = WORK / "ref_noH.sdf", WORK / "test_noH.sdf"
subprocess.run(["obabel", str(ref), "-d", "-O", str(r2)], capture_output=True)
subprocess.run(["obabel", str(test), "-d", "-O", str(t2)], capture_output=True)
out = subprocess.run(["obrms", str(r2), str(t2)], capture_output=True, text=True).stdout
m = re.search(r"(-?\d+\.\d+)", out)
return float(m.group(1)) if m else None
def main() -> None:
print("=== A. Redocking-RMSD validation ===")
for target, (pdb, resname, drug) in XTAL_LIG.items():
rec, center, size = prep_receptor_and_box(pdb, resname)
smi = pubchem_smiles(drug)
lig = prep_ligand(drug, smi)
xtal = extract_crystal_ligand(pdb, resname)
pose = dock_pose(rec, lig, center, size)
rmsd = obrms(xtal, pose) if pose else None
verdict = "PASS (<2A)" if (rmsd is not None and rmsd < 2.0) else \
("MARGINAL (<3A)" if (rmsd is not None and rmsd < 3.0) else "FAIL")
rstr = f"{rmsd:.2f} A" if rmsd is not None else "NA"
print(f" {drug:12s} -> {target:11s} ({pdb}/{resname}): redock RMSD {rstr} {verdict}")
print("\n=== B. Ligand efficiency (affinity / heavy atoms), de-biasing size ===")
# raw affinities from the cross-dock baseline (scripts/dock_positive_controls.py)
aff = {"voxelotor": {"hemoglobin": -8.1, "PKR": -9.3}, "mitapivat": {"hemoglobin": -10.0, "PKR": -11.2},
"decitabine": {"hemoglobin": -6.6, "PKR": -7.0}, "hydroxyurea": {"hemoglobin": -3.9, "PKR": -3.6},
"caffeine": {"hemoglobin": -6.1, "PKR": -6.4}}
print(f" {'ligand':12s}{'heavy':>6s}" + "".join(f"{t+' LE':>14s}" for t in TARGETS))
for lig, row in aff.items():
ha = Chem.MolFromSmiles(pubchem_smiles(lig)).GetNumHeavyAtoms()
cells = "".join(f"{row[t]/ha:>14.3f}" for t in TARGETS)
print(f" {lig:12s}{ha:>6d}{cells}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,137 @@
"""Experiment: composition-adjusted sickle signature, to fix specificity (option 1).
The v1 signature is confounded by cell composition (SS patients have very different WBC/RBC
than AA controls). GSE35007 *measured* those counts per sample, so we adjust the differential
expression for them directly (a measured-covariate alternative to estimated deconvolution):
expression ~ disease + WBC + RBC + MCV + age + sex (per gene, vectorized OLS)
We compare the composition-ADJUSTED signature against an UNADJUSTED single-study signature
(same samples, model without the covariates), score both with the v1.1 engine (full gene space
+ spec_z), and report the recovery test for each. Writes nothing to committed artifacts.
"""
from __future__ import annotations
import warnings
from pathlib import Path
import GEOparse
import numpy as np
import pandas as pd
from scipy.stats import false_discovery_control
from scipy.stats import t as tdist
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.disease import collapse_probes_to_symbols # noqa: E402
from src.scoring import tau_calibrate # noqa: E402
warnings.filterwarnings("ignore")
PROCESSED = Path("data/processed")
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
SYMBOL_COLS = ["Symbol", "ILMN_Gene", "Gene Symbol", "GeneSymbol"]
def cval(gsm, key):
for c in gsm.metadata.get("characteristics_ch1", []):
if c.lower().startswith(key.lower()):
return c.split(":", 1)[1].strip()
return None
def load_data():
gse = GEOparse.get_GEO(geo="GSE35007", destdir="data/raw/geo", silent=True)
meta = []
for gid, g in gse.gsms.items():
meta.append({"gsm": gid, "hb": cval(g, "hb phenotype"), "wbc": cval(g, "white blood cells"),
"rbc": cval(g, "red blood cells"), "mcv": cval(g, "mean corpuscular volume"),
"age": cval(g, "age"), "sex": cval(g, "Sex")})
meta = pd.DataFrame(meta).set_index("gsm")
for c in ["wbc", "rbc", "mcv", "age"]:
meta[c] = pd.to_numeric(meta[c], errors="coerce")
meta["disease"] = meta["hb"].map({"SS": 1.0, "AA": 0.0})
meta["sex_m"] = (meta["sex"] == "M").astype(float)
keep = meta[meta["hb"].isin(["SS", "AA"])].dropna(subset=["wbc", "rbc", "mcv", "age", "disease"])
expr = pd.DataFrame({gid: gse.gsms[gid].table.set_index("ID_REF")["VALUE"] for gid in keep.index})
expr = expr.apply(pd.to_numeric, errors="coerce").dropna(how="any")
if float(np.nanmax(expr.to_numpy())) > 50:
expr = np.log2(expr.clip(lower=0) + 1.0)
gpl = list(gse.gpls.values())[0]
col = next((c for c in SYMBOL_COLS if c in gpl.table.columns), None)
sym = gpl.table.set_index("ID")[col].astype(str).str.strip().replace({"": np.nan, "nan": np.nan})
return expr, keep, sym.dropna()
def ols_de(expr, design, disease_idx):
"""Vectorized per-gene OLS; return DE table (log_fc=disease coef, pvalue, qvalue)."""
X = design.to_numpy(dtype=float)
Y = expr.T.to_numpy(dtype=float) # samples x genes
n, p = X.shape
XtX_inv = np.linalg.pinv(X.T @ X)
B = XtX_inv @ X.T @ Y
resid = Y - X @ B
dof = n - p
sigma2 = (resid ** 2).sum(0) / dof
se = np.sqrt(sigma2 * XtX_inv[disease_idx, disease_idx])
t = B[disease_idx] / se
pval = 2 * tdist.sf(np.abs(t), dof)
out = pd.DataFrame({"log_fc": B[disease_idx], "pvalue": pval}, index=expr.index).dropna()
out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
return out
def make_signature(de, sym, expr, top_n=250):
de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
sig = de_sym[de_sym["qvalue"] < 0.05]
up = sig[sig["log_fc"] > 0].nsmallest(top_n, "qvalue").index.tolist()
down = sig[sig["log_fc"] < 0].nsmallest(top_n, "qvalue").index.tolist()
return up, down
def evaluate(label, up, down, lincs):
ranked = tau_calibrate(up, down, lincs, n_null=1000)
n = len(ranked)
top10, top25, half = int(n * .10), int(n * .25), n // 2
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
n_ov = len((set(up) | set(down)) & set(lincs.columns))
print(f"\n### {label}: {len(up)} up / {len(down)} down (query overlap {n_ov})")
print(f" hydroxyurea rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}")
print(f" L-glutamine rank {glut}/{n} (top {100*glut/n:.1f}%) top-25%? {glut <= top25}")
print(f" neg controls bottom-half: {n_bottom}/5 {negs}")
print(" top 8: " + ", ".join(
f"{name}[{r['inclusion_reason'][:3]}]" for name, r in ranked.nsmallest(8, "spec_z").iterrows()))
return ranked
def main():
expr, meta, sym = load_data()
print(f"loaded {expr.shape[1]} samples x {expr.shape[0]} probes; "
f"{int(meta.disease.sum())} SS / {int((meta.disease==0).sum())} AA")
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
base = pd.DataFrame({"intercept": 1.0, "disease": meta["disease"]}, index=meta.index)
adj = base.assign(wbc=meta["wbc"], rbc=meta["rbc"], mcv=meta["mcv"], age=meta["age"], sex_m=meta["sex_m"])
de_unadj = ols_de(expr, base, disease_idx=1)
de_adj = ols_de(expr, adj, disease_idx=1)
up_u, dn_u = make_signature(de_unadj, sym, expr)
up_a, dn_a = make_signature(de_adj, sym, expr)
# how much does adjustment change the gene set?
overlap = len((set(up_u) | set(dn_u)) & (set(up_a) | set(dn_a)))
print(f"\nsignature gene overlap unadjusted vs adjusted: {overlap}/{len(set(up_u)|set(dn_u))}")
evaluate("UNADJUSTED (GSE35007 only)", up_u, dn_u, lincs)
evaluate("COMPOSITION-ADJUSTED", up_a, dn_a, lincs)
if __name__ == "__main__":
main()

102
scripts/exp_genespace.py Normal file
View File

@@ -0,0 +1,102 @@
"""Experiment (v1.1): re-score on a larger LINCS gene space and re-run the recovery test.
v1 used only the 978 landmark genes (12% signature overlap). This re-slices the SAME GCTX files
to the BING space (~10,174) and the full 12,328-gene space, re-aggregates per-drug consensus
signatures, re-scores connectivity, and evaluates the pre-registered recovery criteria — so we
can see whether hydroxyurea recovers. Writes nothing to the committed v1 artifacts.
"""
from __future__ import annotations
import gzip
import io
import json
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import rank_drugs # noqa: E402
LINCS = Path("data/raw/lincs")
PROCESSED = Path("data/processed")
GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
def read_gz(name):
return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
def gene_ids_for_space(space: str):
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
if space == "bing":
g = g[g.pr_is_bing == 1]
# 'all' -> keep everything
ids = [str(x) for x in g.pr_gene_id]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
return ids, id_to_symbol
def extract(space, drug_names, gene_ids, id_to_symbol):
from cmapPy.pandasGEXpress.parse import parse
frames = []
for ph in (1, 2):
sig = read_gz(SIG_INFO[ph])
sig = sig[(sig.pert_type == "trt_cp") & (sig.pert_iname.isin(drug_names))]
if sig.empty:
continue
gct = parse(str(GCTX[ph]), rid=gene_ids, cid=sig.sig_id.tolist())
data = gct.data_df
s2d = dict(zip(sig.sig_id, sig.pert_iname))
frames.append(data.T.groupby(data.columns.map(s2d)).mean())
print(f" [{space}] phase {ph}: {sig.pert_iname.nunique()} drugs sliced", flush=True)
combined = pd.concat(frames).groupby(level=0).mean()
combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]
combined = combined.loc[:, ~combined.columns.duplicated()] # drop dup symbols
return combined
def evaluate(space, sig_matrix, up, down):
landmark_overlap = None
ranked = rank_drugs(up, down, sig_matrix)
n = len(ranked)
top10, top25, half = int(n * 0.10), int(n * 0.25), n // 2
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
glut_s = ranked.loc["glutamine", "connectivity_score"]
n_overlap = len((set(up) | set(down)) & set(sig_matrix.columns))
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
print(f"\n=== gene space: {space.upper()} ({sig_matrix.shape[1]} genes; query overlap {n_overlap}) ===")
print(f" hydroxyurea: rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}")
print(f" L-glutamine: rank {glut}/{n} (top {100*glut/n:.1f}%), WTCS={glut_s:.3f} top-25%? {glut <= top25}")
print(f" neg controls in bottom half: {n_bottom}/5 {negs}")
crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
print(f" OVERALL: {'PASS' if crit else 'FAIL'}")
print(" top 8:")
for name, r in ranked.nsmallest(8, "connectivity_score").iterrows():
print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} [{r['inclusion_reason']}]")
return ranked
def main():
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
up = [g["gene"] for g in sig["up_regulated"]]
down = [g["gene"] for g in sig["down_regulated"]]
drug_names = set(pd.read_csv(PROCESSED / "drug_set_v1.csv").pert_iname)
for space in ("bing", "all"):
print(f"\n>>> extracting {space} ...", flush=True)
ids, id2sym = gene_ids_for_space(space)
mat = extract(space, drug_names, ids, id2sym)
evaluate(space, mat, up, down)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,118 @@
"""Phase A: calibrate sickle connectivity against a REAL disease-signature reference population.
The v1.1 tau saturated because it used random gene-set nulls. Proper specificity calibration
asks: does a drug reverse SICKLE more than it reverses diseases in general? We download a library
of real disease signatures (Enrichr "Disease Signatures from GEO", up+down), compute each drug's
connectivity to every reference disease, and express its sickle connectivity as a z-score within
that per-drug reference distribution. Broad-effect drugs (reverse everything) -> z~0 -> down-ranked.
Re-runs the recovery test and compares to v1.1 (random-null). Writes nothing committed.
"""
from __future__ import annotations
from pathlib import Path
import numpy as np
import pandas as pd
import requests
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import _ks_connectivity # noqa: E402
PROCESSED = Path("data/processed")
RAW = Path("data/raw/disease_sigs")
ENRICHR = "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName={}"
LIBS = {"up": "Disease_Signatures_from_GEO_up_2014", "down": "Disease_Signatures_from_GEO_down_2014"}
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
def fetch_gmt(name: str) -> dict[str, list[str]]:
RAW.mkdir(parents=True, exist_ok=True)
path = RAW / f"{name}.gmt"
if not path.exists():
r = requests.get(ENRICHR.format(name), timeout=120)
r.raise_for_status()
path.write_text(r.text)
out = {}
for line in path.read_text().splitlines():
parts = line.split("\t")
if len(parts) < 3:
continue
term = parts[0]
genes = [g.split(",")[0].strip().upper() for g in parts[2:] if g.strip()]
out[term] = genes
print(f" {name}: {path.stat().st_size/1e6:.1f} MB, {len(out)} terms")
return out
def build_reference() -> list[dict]:
up = fetch_gmt(LIBS["up"])
down = fetch_gmt(LIBS["down"])
shared = set(up) & set(down)
refs = []
for term in shared:
if "sickle" in term.lower():
continue # exclude the target disease from its own reference population
refs.append({"name": term, "up": up[term], "down": down[term]})
print(f"reference disease signatures (paired up+down, sickle excluded): {len(refs)}")
return refs
def cols_for(genes, gene_to_col):
return np.array([gene_to_col[g] for g in set(genes) if g in gene_to_col], dtype=int)
def main():
import json
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
sk_up = [g["gene"] for g in sig["up_regulated"]]
sk_down = [g["gene"] for g in sig["down_regulated"]]
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
genes = list(lincs.columns)
gene_to_col = {g: i for i, g in enumerate(genes)}
n = len(genes)
R = lincs.rank(axis=1, ascending=False).to_numpy()
refs = build_reference()
# connectivity of every drug to every reference disease -> (n_drugs, n_refs)
C = np.empty((R.shape[0], len(refs)))
for j, d in enumerate(refs):
C[:, j] = _ks_connectivity(R, cols_for(d["up"], gene_to_col), cols_for(d["down"], gene_to_col), n)
# drop reference diseases with too few mapped genes (degenerate columns)
mapped = np.array([len(cols_for(d["up"], gene_to_col)) + len(cols_for(d["down"], gene_to_col)) for d in refs])
keep = mapped >= 10
C = C[:, keep]
print(f"usable reference diseases (>=10 mapped genes): {keep.sum()}")
real = _ks_connectivity(R, cols_for(sk_up, gene_to_col), cols_for(sk_down, gene_to_col), n)
ref_mean, ref_std = C.mean(axis=1), C.std(axis=1)
ref_std[ref_std == 0] = np.nan
spec_z = (real - ref_mean) / ref_std # negative = reverses sickle more than diseases-in-general
ranked = pd.DataFrame({"spec_z": spec_z, "connectivity": real}, index=lincs.index).sort_values("spec_z")
ranked.insert(0, "rank", range(1, len(ranked) + 1))
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
N = len(ranked)
top10, top25, half = int(N * .10), int(N * .25), N // 2
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
print("\n==== RECOVERY TEST (reference-calibrated tau) ====")
print(f" hydroxyurea rank {hu}/{N} (top {100*hu/N:.1f}%) top-10%? {hu <= top10} z={ranked.loc['hydroxyurea','spec_z']:.2f}")
print(f" L-glutamine rank {glut}/{N} (top {100*glut/N:.1f}%) top-25%? {glut <= top25}")
print(f" neg controls bottom-half: {n_bottom}/5 {negs}")
crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
print(f" OVERALL: {'PASS' if crit else 'FAIL'}")
print("\n top 12 by reference-calibrated z:")
for name, r in ranked.nsmallest(12, "spec_z").iterrows():
print(f" {int(r['rank']):2d} {name:20s} z={r['spec_z']:6.2f} [{r['inclusion_reason']}]")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,137 @@
"""Phase D: supervised cross-disease repurposing — can labels break the specificity ceiling?
Connectivity alone can't tell therapeutic from coincidental reversal. A supervised model trained
on known drug-disease pairs CAN learn that pattern — given features that expose drug "broadness"
(a drug that reverses everything is non-specific). We train on 839 GEO disease signatures with
Repurposing-Hub indications as labels, evaluate with disease-grouped CV, then apply to HELD-OUT
sickle and check whether the coincidental reversers (norethindrone, ciprofloxacin) finally drop.
Baseline = rank by raw connectivity (the single conn feature). Win = model down-ranks the
negative controls vs baseline while keeping hydroxyurea high.
"""
from __future__ import annotations
import json
import re
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.metrics import roc_auc_score
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import _ks_connectivity # noqa: E402
PROCESSED = Path("data/processed")
SIGS = Path("data/raw/disease_sigs")
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
ID_RE = re.compile(r"\s*(c\d{6,}|doid[-\s]?\d+|gse\d+|umls\S+)\s*", re.I)
def clean_disease(term: str) -> str:
return ID_RE.sub(" ", term).strip().lower()
def load_disease_sigs():
def parse(name):
out = {}
for line in (SIGS / name).read_text().splitlines():
p = line.split("\t")
if len(p) < 3:
continue
out[p[0]] = [g.split(",")[0].strip().upper() for g in p[2:] if g.strip()]
return out
up = parse("Disease_Perturbations_from_GEO_up.gmt")
down = parse("Disease_Perturbations_from_GEO_down.gmt")
terms = sorted(set(up) & set(down))
return [{"term": t, "disease": clean_disease(t), "up": up[t], "down": down[t]} for t in terms]
def cols_for(genes, g2c):
return np.array([g2c[g] for g in set(genes) if g in g2c], dtype=int)
def featurize(conn_col, C):
"""Per-(drug,disease) features for one disease column conn_col, given full matrix C."""
drug_mean, drug_std = C.mean(1), C.std(1)
drug_min = C.min(1)
broad = (C < -0.10).mean(1) # fraction of diseases this drug strongly reverses
dmean, dstd = conn_col.mean(), conn_col.std() or 1.0
return np.column_stack([
conn_col, drug_mean, drug_std, drug_min, broad,
np.full_like(conn_col, dmean),
(conn_col - drug_mean) / np.where(drug_std == 0, 1, drug_std), # specificity within drug
(conn_col - dmean) / dstd, # specificity within disease
])
FEATS = ["conn", "drug_mean", "drug_std", "drug_min", "broadness",
"disease_mean", "z_within_drug", "z_within_disease"]
def main():
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
drugs = list(lincs.index)
g2c = {g: i for i, g in enumerate(lincs.columns)}
n = len(lincs.columns)
R = lincs.rank(axis=1, ascending=False).to_numpy()
refs = [d for d in load_disease_sigs()
if len(cols_for(d["up"], g2c)) + len(cols_for(d["down"], g2c)) >= 10]
C = np.column_stack([_ks_connectivity(R, cols_for(d["up"], g2c), cols_for(d["down"], g2c), n) for d in refs])
print(f"{len(drugs)} drugs x {len(refs)} disease signatures; connectivity matrix {C.shape}")
# labels from Repurposing Hub indications
hub = pd.read_csv("data/raw/repurposing_drugs.txt", sep="\t", comment="!", low_memory=False)
ind = {r.pert_iname: [i.strip().lower() for i in re.split(r"[|,]", r.indication) if len(i.strip()) > 3]
for r in hub.itertuples() if isinstance(r.indication, str)}
drug_idx = {d: i for i, d in enumerate(drugs)}
X_rows, y, grp = [], [], []
for j, d in enumerate(refs):
feats = featurize(C[:, j], C)
dz = d["disease"]
for i, drug in enumerate(drugs):
inds = ind.get(drug, [])
label = int(any(dz == k or (len(dz) > 4 and dz in k) or (len(k) > 4 and k in dz) for k in inds))
X_rows.append(feats[i]); y.append(label); grp.append(j)
X = np.array(X_rows); y = np.array(y); grp = np.array(grp)
print(f"pairs: {len(y)}, positives: {y.sum()} ({100*y.mean():.2f}%)")
# disease-grouped CV (generalize to unseen diseases)
clf = GradientBoostingClassifier(random_state=0)
proba = cross_val_predict(clf, X, y, cv=GroupKFold(5), groups=grp, method="predict_proba")[:, 1]
print(f"disease-grouped CV AUC: {roc_auc_score(y, proba):.3f} (conn-only AUC: {roc_auc_score(y, X[:,0]*-1):.3f})")
clf.fit(X, y)
print("feature importances: " + ", ".join(f"{f}={imp:.2f}" for f, imp in
sorted(zip(FEATS, clf.feature_importances_), key=lambda t: -t[1])))
# apply to HELD-OUT sickle
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
sk = _ks_connectivity(R, cols_for([g["gene"] for g in sig["up_regulated"]], g2c),
cols_for([g["gene"] for g in sig["down_regulated"]], g2c), n)
Xs = featurize(sk, C)
p_sickle = clf.predict_proba(Xs)[:, 1]
res = pd.DataFrame({"model_p": p_sickle, "conn": sk}, index=drugs)
res["model_rank"] = res["model_p"].rank(ascending=False).astype(int)
res["conn_rank"] = res["conn"].rank(ascending=True).astype(int) # most negative = best
N = len(res)
print(f"\n{'drug':14s} {'model_rank':>10s} {'conn_rank(base)':>16s}")
for d in ["hydroxyurea", "glutamine"] + NEG5 + ["norethindrone", "ciprofloxacin"]:
if d in res.index:
r = res.loc[d]
print(f" {d:14s} {int(r['model_rank']):6d}/{N} {int(r['conn_rank']):6d}/{N}")
nb_model = sum(res.loc[d, "model_rank"] > N/2 for d in NEG5 if d in res.index)
nb_conn = sum(res.loc[d, "conn_rank"] > N/2 for d in NEG5 if d in res.index)
print(f"\nneg controls bottom-half: model {nb_model}/5 vs baseline {nb_conn}/5")
print("model top 10:", ", ".join(res.sort_values('model_p', ascending=False).head(10).index))
if __name__ == "__main__":
main()

113
scripts/pose_rmsd.py Normal file
View File

@@ -0,0 +1,113 @@
"""Pose-RMSD validation of the Boltz-2 HDAC2/vorinostat prediction (closes the §12.4 loop).
Co-folding predicts the whole complex de novo, so it lands in its own frame. To compare poses:
1. superpose the predicted protein (chain A) onto the crystal 4LXZ binding chain (Ca, gemmi),
2. apply that transform to the predicted vorinostat (chain L) and the predicted Zn (chain M),
3. symmetry-corrected in-place RMSD of the transformed ligand vs the crystal SHH (spyrmsd),
4. bonus: did the catalytic Zn land near the real one?
<2 A ligand RMSD = co-folding reproduces the geometry, not just the affinity ranking -- the test
classical Vina failed on this exact Zn-chelation case (7.9 A).
"""
from __future__ import annotations
import subprocess
from pathlib import Path
import gemmi
import numpy as np
from spyrmsd import io as spyio, rmsd as spyrmsd
PRED = Path("data/processed/binding/HDAC2_vorinostat_pred.pdb")
XTAL = Path("data/raw/structures/4LXZ.pdb")
LIG_RES = "SHH" # crystal vorinostat
WORK = Path("data/processed/binding")
def crystal_binding_chain(st) -> gemmi.Chain:
model = st[0]
lig = [a.pos for ch in model for r in ch if r.name == LIG_RES for a in r]
c = gemmi.Position(sum(p.x for p in lig) / len(lig), sum(p.y for p in lig) / len(lig),
sum(p.z for p in lig) / len(lig))
best, bestd = None, 1e9
for ch in model:
if len(ch.get_polymer()) < 20:
continue
d = min((a.pos.dist(c) for r in ch for a in r), default=1e9)
if d < bestd:
best, bestd = ch, d
return best
def hetatm_lines(pdb: Path, chain: str | None = None, resname: str | None = None) -> list[str]:
out = []
for ln in pdb.read_text().splitlines():
if not ln.startswith(("ATOM", "HETATM")):
continue
if chain and ln[21] != chain:
continue
if resname and ln[17:20].strip() != resname:
continue
out.append(ln)
return out
def transform_and_write(lines: list[str], T: gemmi.Transform, dest: Path) -> Path:
new = []
for ln in lines:
p = T.apply(gemmi.Position(float(ln[30:38]), float(ln[38:46]), float(ln[46:54])))
new.append(f"{ln[:30]}{p.x:8.3f}{p.y:8.3f}{p.z:8.3f}{ln[54:]}")
dest.write_text("\n".join(new) + "\nEND\n")
return dest
def to_sdf(pdb: Path) -> Path:
sdf = pdb.with_suffix(".sdf")
subprocess.run(["obabel", str(pdb), "-O", str(sdf)], capture_output=True)
return sdf
def main() -> None:
pred_st = gemmi.read_structure(str(PRED))
xtal_st = gemmi.read_structure(str(XTAL))
pred_chain = pred_st[0]["A"]
xtal_chain = crystal_binding_chain(xtal_st)
sup = gemmi.calculate_superposition(xtal_chain.get_polymer(), pred_chain.get_polymer(),
gemmi.PolymerType.PeptideL, gemmi.SupSelect.CaP)
T = sup.transform
print(f"protein superposition: Ca RMSD {sup.rmsd:.2f} A over {sup.count} residues")
# predicted ligand (chain L) and Zn (chain M) -> transform into crystal frame
pred_lig = transform_and_write(hetatm_lines(PRED, chain="L"), T, WORK / "pred_lig_aligned.pdb")
pred_zn_lines = hetatm_lines(PRED, chain="M")
# one copy only (4LXZ has 3 SHH copies); keep the first (chain, resseq) group.
shh = hetatm_lines(XTAL, resname=LIG_RES)
first_key = (shh[0][21], shh[0][22:26])
one_copy = [ln for ln in shh if (ln[21], ln[22:26]) == first_key]
crys_lig = WORK / "xtal_lig.pdb"
crys_lig.write_text("\n".join(one_copy) + "\nEND\n")
# ligand pose RMSD (in-place, symmetry-corrected)
ref, mol = spyio.loadmol(str(to_sdf(crys_lig))), spyio.loadmol(str(to_sdf(pred_lig)))
ref.strip(); mol.strip()
rmsd = float(spyrmsd.rmsdwrapper(ref, mol, symmetry=True, minimize=False)[0])
# zinc placement: transformed predicted Zn vs nearest crystal ZN
verdict = "PASS (<2A)" if rmsd < 2 else ("MARGINAL (<3A)" if rmsd < 3 else "FAIL")
print(f"\nvorinostat pose RMSD (co-folded vs crystal) = {rmsd:.2f} A {verdict}")
print("(classical Vina on this Zn-chelation case: 7.9 A)")
pred_zn = next((a.pos for ch in pred_st[0] if ch.name == "M" for r in ch for a in r), None)
xtal_zn = [a.pos for ch in xtal_st[0] for r in ch if r.name == "ZN" for a in r]
if pred_zn and xtal_zn:
q = T.apply(pred_zn)
dz = min(((q.x - z.x) ** 2 + (q.y - z.y) ** 2 + (q.z - z.z) ** 2) ** 0.5 for z in xtal_zn)
print(f"catalytic Zn placement error = {dz:.2f} A "
f"{'(zinc correctly placed)' if dz < 2 else ''}")
if __name__ == "__main__":
main()

139
scripts/week1_explore.py Normal file
View File

@@ -0,0 +1,139 @@
"""Week 1 exploratory run: build per-study DE + concordance for the whole-blood pair.
Read-only decision aid (NOT the final pipeline): downloads GSE35007 + GSE16728, runs DE on
each, collapses to HGNC symbols, computes concordance, and reports whether the sanity-gate
genes survive — so we can choose Tier A (concordance pair) vs Tier B (GSE35007 alone) with
real numbers. Intermediate DE tables are cached under data/raw/geo/.
"""
from __future__ import annotations
import sys
from pathlib import Path
import GEOparse
import numpy as np
import pandas as pd
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.disease import ( # noqa: E402
DISEASE_LABEL,
HEALTHY_LABEL,
collapse_probes_to_symbols,
compute_differential_expression,
concordance_filter,
)
GEO_DIR = Path("data/raw/geo")
GEO_DIR.mkdir(parents=True, exist_ok=True)
SANITY_GENES = ["HBG1", "HBG2", "ALAS2", "CA1", "AHSP", "SLC4A1", "HBB", "ERAF"]
SYMBOL_COL_CANDIDATES = [
"Symbol", "ILMN_Gene", "Gene Symbol", "GENE_SYMBOL", "Gene symbol",
"gene_assignment", "GeneSymbol",
]
def log(msg: str) -> None:
print(msg, flush=True)
def get_symbol_map(gpl) -> pd.Series:
tbl = gpl.table
col = next((c for c in SYMBOL_COL_CANDIDATES if c in tbl.columns), None)
if col is None:
raise RuntimeError(f"No symbol column in GPL {gpl.name}; columns={list(tbl.columns)[:15]}")
s = tbl.set_index("ID")[col].astype(str).str.strip()
# gene_assignment fields are '// SYMBOL // ...'; take the first real token.
if col == "gene_assignment":
s = s.str.split("//").str[1].str.strip()
s = s.replace({"": np.nan, "nan": np.nan, "---": np.nan})
return s.dropna()
def build_expression(gse, sample_ids: list[str]) -> pd.DataFrame:
cols = {}
for gsm_id in sample_ids:
tbl = gse.gsms[gsm_id].table
cols[gsm_id] = tbl.set_index("ID_REF")["VALUE"]
return pd.DataFrame(cols)
def char_value(gsm, key: str) -> str | None:
for c in gsm.metadata.get("characteristics_ch1", []):
if c.lower().startswith(key.lower()):
return c.split(":", 1)[1].strip()
return None
def run_study(gse, groups: dict[str, str], label: str) -> pd.DataFrame:
"""groups: gsm_id -> 'disease'/'healthy'. Returns symbol-level DE table."""
sample_ids = list(groups.keys())
log(f"[{label}] {len(sample_ids)} samples "
f"({sum(v==DISEASE_LABEL for v in groups.values())} disease / "
f"{sum(v==HEALTHY_LABEL for v in groups.values())} healthy)")
expr = build_expression(gse, sample_ids).apply(pd.to_numeric, errors="coerce").dropna(how="all")
sample_groups = pd.Series(groups)
de = compute_differential_expression(expr, sample_groups, method="welch")
gpl = list(gse.gpls.values())[0]
sym = get_symbol_map(gpl)
de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
log(f"[{label}] probes->symbols: {len(de)} probes -> {len(de_sym)} symbols; "
f"sig q<0.05: {(de_sym['qvalue']<0.05).sum()}")
de_sym.to_parquet(GEO_DIR / f"de_{label}.parquet")
return de_sym
def main() -> None:
# --- GSE35007: whole blood, hb phenotype SS (disease) vs AA (healthy) ---
log("downloading GSE35007 ...")
gse1 = GEOparse.get_GEO(geo="GSE35007", destdir=str(GEO_DIR), silent=True)
g1 = {}
for gsm_id, gsm in gse1.gsms.items():
hb = char_value(gsm, "hb phenotype")
if hb == "SS":
g1[gsm_id] = DISEASE_LABEL
elif hb == "AA":
g1[gsm_id] = HEALTHY_LABEL
de1 = run_study(gse1, g1, "GSE35007")
# --- GSE16728: whole blood only (PAXgene preps; exclude PBMC), patient vs control ---
log("downloading GSE16728 ...")
gse2 = GEOparse.get_GEO(geo="GSE16728", destdir=str(GEO_DIR), silent=True)
g2 = {}
for gsm_id, gsm in gse2.gsms.items():
prep = char_value(gsm, "rna prep") or ""
subj = char_value(gsm, "subject") or ""
if "PBMC" in prep:
continue # whole-blood only
if subj.lower().startswith("sickle"):
g2[gsm_id] = DISEASE_LABEL
elif subj.lower().startswith("control"):
g2[gsm_id] = HEALTHY_LABEL
de2 = run_study(gse2, g2, "GSE16728")
# --- Concordance ---
keep, summary = concordance_filter(de1, de2)
log("\n==== CONCORDANCE (whole-blood pair) ====")
log(f"genes tested in both: {summary.n_genes_tested}")
log(f"concordant (q<0.05 both + same sign): {summary.n_concordant} "
f"(up={summary.n_up}, down={summary.n_down})")
keep.to_parquet(GEO_DIR / "concordant_genes.parquet")
log("\n==== SANITY-GATE GENES ====")
for g in SANITY_GENES:
in1 = g in de1.index
in2 = g in de2.index
row1 = de1.loc[g] if in1 else None
row2 = de2.loc[g] if in2 else None
concord = "YES" if g in keep.index else "no"
d1 = f"lfc={row1['log_fc']:+.2f},q={row1['qvalue']:.1e}" if in1 else "absent"
d2 = f"lfc={row2['log_fc']:+.2f},q={row2['qvalue']:.1e}" if in2 else "absent"
log(f" {g:7s} | GSE35007 {d1:28s} | GSE16728 {d2:28s} | concordant={concord}")
log("\n==== TOP 15 CONCORDANT (by worst-case q) ====")
log(keep.head(15).to_string())
if __name__ == "__main__":
main()

121
scripts/week1_finalize.py Normal file
View File

@@ -0,0 +1,121 @@
"""Finalize the Week 1 sickle cell signature (Tier A, whole-blood concordance pair).
Reads the cached concordant gene table, maps symbols -> Entrez/Ensembl via mygene, attaches
two-study provenance + concordance summary, and persists
data/processed/sickle_cell_signature_v1.json.
Tier note: GSE16728 is exactly 10/group, which fails assign_tier()'s strict n>10. Tier A is
assigned explicitly here on the basis of cross-study, cross-platform, cross-population
concordance (the founder's decision), with the n=10 / prep-merge caveat documented in the
tier rationale and limitations.
"""
from __future__ import annotations
import sys
from pathlib import Path
import mygene
import pandas as pd
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.disease import ( # noqa: E402
ConcordanceSummary,
SignatureProvenance,
StudyProvenance,
build_signature,
persist_signature,
)
from src.provenance import ConfidenceTier # noqa: E402
CREATED_DATE = "2026-06-23"
GEO_DIR = Path("data/raw/geo")
def map_ids(symbols: list[str]) -> pd.DataFrame:
mg = mygene.MyGeneInfo()
res = mg.querymany(
symbols, scopes="symbol", fields="entrezgene,ensembl.gene",
species="human", as_dataframe=True, verbose=False,
)
res = res[~res.index.duplicated(keep="first")]
def _ensembl(v):
if isinstance(v, list) and v:
v = v[0]
if isinstance(v, dict):
return v.get("gene")
return v
id_map = pd.DataFrame(index=res.index)
id_map["entrez_id"] = res["entrezgene"] if "entrezgene" in res.columns else None
ens_col = "ensembl.gene" if "ensembl.gene" in res.columns else ("ensembl" if "ensembl" in res.columns else None)
id_map["ensembl_id"] = res[ens_col].map(_ensembl) if ens_col else None
return id_map
def main() -> None:
concordant = pd.read_parquet(GEO_DIR / "concordant_genes.parquet")
print(f"loaded {len(concordant)} concordant genes "
f"({(concordant['log_fc']>0).sum()} up / {(concordant['log_fc']<0).sum()} down)")
id_map = map_ids(list(concordant.index))
print(f"mygene mapped {id_map['entrez_id'].notna().sum()}/{len(id_map)} symbols to Entrez")
provenance = SignatureProvenance(
studies=[
StudyProvenance(
geo_accession="GSE35007", n_disease=190, n_healthy=12,
platform="Illumina HumanHT-12 V4 (GPL10558)",
tissue="whole blood", method="welch",
),
StudyProvenance(
geo_accession="GSE16728", n_disease=10, n_healthy=10,
platform="Affymetrix HG-U133 Plus 2.0 (GPL570)",
tissue="whole blood (PAXgene; globin-reduced + total RNA preps merged)",
method="welch",
),
],
concordance=ConcordanceSummary(
n_genes_tested=16208, n_concordant=671, n_up=444, n_down=227,
),
created_date=CREATED_DATE,
)
tier_rationale = (
"Measured RNA expression concordant across two independent peer-reviewed whole-blood "
"studies on different platforms (Illumina GPL10558, Affymetrix GPL570) and populations "
"(pediatric West-African GSE35007; adult GSE16728). Tier A rests on cross-study / "
"cross-platform concordance. Caveat: GSE16728 is exactly 10/group (two PAXgene preps "
"merged), which does not meet the strict n>10 per-study threshold on its own."
)
limitations = [
"Whole-blood expression is partly driven by cell-composition differences "
"(reticulocytosis in sickle patients), not pure disease-state regulation; the "
"signature is dominated by erythroid markers (CA1, AHSP, SLC4A1). v2 needs cell-type "
"deconvolution.",
"Fetal-hemoglobin genes (HBG1/HBG2) are NOT captured: flat in GSE35007 and removed by "
"GSE16728's globin-depleted RNA prep. The signature therefore does not directly encode "
"the HbF axis that hydroxyurea acts on, which may weaken the hydroxyurea recovery test.",
"GSE35007 is highly imbalanced (SS n=190 vs AA n=12); the healthy arm is small.",
"GSE16728 whole-blood arm is small (10/group) and merges two RNA-prep batches.",
"Cross-platform probe->symbol collapse (keep max-mean-expression probe) loses "
"isoform-level resolution and drops genes absent from either platform.",
]
signature = build_signature(
concordant, provenance,
tier=ConfidenceTier.A,
tier_rationale=tier_rationale,
limitations=limitations,
id_map=id_map,
top_n=250,
)
path = persist_signature(signature)
print(f"wrote {path}")
print(f"signature: {len(signature.up_regulated)} up / {len(signature.down_regulated)} down, "
f"tier {signature.confidence_tier.value}")
if __name__ == "__main__":
main()

91
scripts/week2_assemble.py Normal file
View File

@@ -0,0 +1,91 @@
"""Week 2, task 4: assemble drug_profiles_v1.parquet (PLAN §6).
Joins the curated drug set + ChEMBL enrichment + LINCS consensus signatures into one profile
table. Each drug carries a confidence tier: LINCS is a single source, so signature-backed drugs
are Tier B at best (assign_tier with single_source=True); drugs with no signature are Tier C and
marked not-scored (not dropped silently — PLAN §6 Week 3 task 2).
The 978-gene signature order is the column order of lincs_signatures_v1.parquet (landmark
symbols); each profile's `lincs_signature` is that vector (or null).
"""
from __future__ import annotations
import ast
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.drugs import persist_drug_profiles # noqa: E402
from src.provenance import ConfidenceTier, assign_tier # noqa: E402
PROCESSED = Path("data/processed")
DRUG_SET = PROCESSED / "drug_set_v1.csv"
CHEMBL = Path("data/raw/chembl/chembl_enrichment.parquet")
LINCS_SIG = PROCESSED / "lincs_signatures_v1.parquet"
def main() -> None:
drugs = pd.read_csv(DRUG_SET)
chembl = pd.read_parquet(CHEMBL)[
["pert_iname", "chembl_id", "pref_name", "smiles", "mechanism_of_action", "targets"]
]
sigs = pd.read_parquet(LINCS_SIG) # rows=pert_iname, cols=978 landmark symbols
gene_order = list(sigs.columns)
df = drugs.merge(chembl, on="pert_iname", how="left")
rows = []
for r in df.itertuples():
has_sig = r.pert_iname in sigs.index
vector = sigs.loc[r.pert_iname].tolist() if has_sig else None
# LINCS = single source => Tier B max when measured; no signature => Tier C.
tier = assign_tier(
is_measured=has_sig, n_per_group=None, peer_reviewed=True, single_source=True
) if has_sig else ConfidenceTier.C
targets = r.targets
if isinstance(targets, str):
try:
targets = ast.literal_eval(targets)
except (ValueError, SyntaxError):
targets = []
elif hasattr(targets, "tolist"): # numpy ndarray from parquet round-trip
targets = targets.tolist()
elif targets is None or (not isinstance(targets, (list, tuple))):
targets = []
rows.append({
"name": r.pert_iname,
"chembl_id": r.chembl_id if pd.notna(r.chembl_id) else None,
"pref_name": r.pref_name if pd.notna(r.pref_name) else None,
"inchikey": r.inchi_key if pd.notna(r.inchi_key) else None,
"smiles": r.smiles if pd.notna(r.smiles) else None,
"targets": list(targets),
"mechanism_of_action": r.mechanism_of_action if pd.notna(r.mechanism_of_action) else None,
"inclusion_reason": r.inclusion_reason,
"lincs_phase": r.phase,
"scored": has_sig,
"lincs_signature": vector,
"confidence_tier": tier.value,
})
profiles = pd.DataFrame(rows)
# Persist the gene order alongside, so Week-3 scoring can align the vectors.
(PROCESSED / "lincs_gene_order.txt").write_text("\n".join(gene_order))
path = persist_drug_profiles(profiles)
print(f"drug_profiles_v1: {len(profiles)} drugs")
print(f" scored (have LINCS signature): {profiles['scored'].sum()}")
print(f" not scored: {(~profiles['scored']).sum()}")
print(" by inclusion reason (scored rate):")
print(profiles.groupby("inclusion_reason")["scored"].agg(["sum", "count"]).to_string())
print(" tier split:", profiles["confidence_tier"].value_counts().to_dict())
for gt in ["hydroxyurea", "glutamine"]:
row = profiles[profiles["name"] == gt]
print(f" ground truth '{gt}': scored={bool(row['scored'].iloc[0]) if len(row) else 'ABSENT'}")
print(f"wrote {path}")
if __name__ == "__main__":
main()

99
scripts/week2_chembl.py Normal file
View File

@@ -0,0 +1,99 @@
"""Week 2, task 2: enrich the drug set with ChEMBL structure/target/mechanism data.
Drugs are matched to ChEMBL by the InChIKey already carried from LINCS pert_info (reliable),
then mechanism-of-action and target names are pulled. Compounds absent from ChEMBL (many
research/tool compounds in the random arm) keep null ChEMBL fields — they still have LINCS
signatures for scoring; only the Week-3 mechanistic prior won't apply. Output cached to
data/raw/chembl/chembl_enrichment.parquet.
"""
from __future__ import annotations
from pathlib import Path
import pandas as pd
from chembl_webresource_client.new_client import new_client
DRUG_SET = Path("data/processed/drug_set_v1.csv")
OUT = Path("data/raw/chembl/chembl_enrichment.parquet")
BATCH = 40
def chunks(seq, n):
for i in range(0, len(seq), n):
yield seq[i:i + n]
def main() -> None:
drugs = pd.read_csv(DRUG_SET)
inchikeys = sorted({k for k in drugs["inchi_key"].dropna() if isinstance(k, str) and len(k) > 10})
print(f"{len(drugs)} drugs; {len(inchikeys)} usable InChIKeys to resolve")
molecule = new_client.molecule
mechanism = new_client.mechanism
target = new_client.target
# 1) InChIKey -> ChEMBL molecule (id, name, smiles)
mol_rows = []
for i, batch in enumerate(chunks(inchikeys, BATCH)):
res = molecule.filter(molecule_structures__standard_inchi_key__in=batch).only(
["molecule_chembl_id", "pref_name", "molecule_structures"])
for m in res:
ms = m.get("molecule_structures") or {}
mol_rows.append({
"chembl_id": m["molecule_chembl_id"],
"pref_name": m.get("pref_name"),
"smiles": ms.get("canonical_smiles"),
"inchi_key": ms.get("standard_inchi_key"),
})
print(f" molecules batch {i+1}: cumulative {len(mol_rows)} hits", flush=True)
mols = pd.DataFrame(mol_rows).drop_duplicates("inchi_key")
chembl_ids = sorted(mols["chembl_id"].unique())
print(f"resolved {len(mols)} molecules -> {len(chembl_ids)} ChEMBL ids")
# 2) ChEMBL id -> mechanism of action + target ids
mech_rows = []
for batch in chunks(chembl_ids, BATCH):
for m in mechanism.filter(molecule_chembl_id__in=batch).only(
["molecule_chembl_id", "mechanism_of_action", "target_chembl_id"]):
mech_rows.append(m)
mech = pd.DataFrame(mech_rows)
print(f"mechanism records: {len(mech)}")
# 3) target id -> name
tgt_names = {}
if not mech.empty:
tids = sorted({t for t in mech["target_chembl_id"].dropna().unique()})
for batch in chunks(tids, BATCH):
for t in target.filter(target_chembl_id__in=batch).only(["target_chembl_id", "pref_name"]):
tgt_names[t["target_chembl_id"]] = t.get("pref_name")
# aggregate mechanism/targets per molecule
def agg(df):
moa = sorted({x for x in df["mechanism_of_action"].dropna()})
tns = sorted({tgt_names.get(t) for t in df["target_chembl_id"].dropna() if tgt_names.get(t)})
return pd.Series({"mechanism_of_action": "; ".join(moa) or None, "targets": tns})
if not mech.empty:
per_mol = mech.groupby("molecule_chembl_id").apply(agg, include_groups=False).reset_index()
per_mol = per_mol.rename(columns={"molecule_chembl_id": "chembl_id"})
mols = mols.merge(per_mol, on="chembl_id", how="left")
else:
mols["mechanism_of_action"] = None
mols["targets"] = None
# join back to the drug set on inchi_key
enriched = drugs.merge(mols, on="inchi_key", how="left", suffixes=("", "_chembl"))
OUT.parent.mkdir(parents=True, exist_ok=True)
enriched.to_parquet(OUT, index=False)
n_resolved = enriched["chembl_id"].notna().sum()
n_moa = enriched["mechanism_of_action"].notna().sum()
print(f"\nenriched {len(enriched)} drugs: {n_resolved} matched ChEMBL, {n_moa} have MoA")
print(f"by reason, ChEMBL match rate:")
print(enriched.assign(matched=enriched["chembl_id"].notna()).groupby("inclusion_reason")["matched"].mean().round(2).to_string())
print(f"wrote {OUT}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,131 @@
"""Week 2, task 1: curate the deliberately-composed ~300-drug set (PLAN §6).
Composition: 2 ground-truth + ~50 related-mechanism + ~50 negative controls + ~200 random.
The universe is restricted to compounds that actually have a LINCS Level-5 signature (in
Phase I and/or Phase II), so every curated drug is scorable. Output: drug_set_v1.csv.
"""
from __future__ import annotations
import gzip
import io
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src import RANDOM_SEED # noqa: E402
LINCS = Path("data/raw/lincs")
OUT = Path("data/processed/drug_set_v1.csv")
GROUND_TRUTH = ["hydroxyurea", "glutamine"] # glutamine == L-glutamine in LINCS
# Curated by mechanism (PLAN §6). Intersected with the LINCS catalog below, so misses are
# silently dropped — we keep whatever actually has a signature.
RELATED_MECHANISM = [
# HbF inducers / epigenetic
"decitabine", "azacitidine", "vorinostat", "panobinostat", "romidepsin", "entinostat",
"mocetinostat", "belinostat", "pomalidomide", "lenalidomide", "thalidomide", "apicidin",
"trichostatin-a", "scriptaid", "valproic-acid",
# NO / vascular
"sildenafil", "tadalafil", "nitroprusside",
# antioxidants
"n-acetyl-cysteine", "resveratrol", "curcumin", "quercetin", "sulforaphane",
# anti-inflammatory studied in SCD
"dexamethasone", "prednisolone", "hydrocortisone", "ibuprofen", "indomethacin",
"sulfasalazine", "montelukast", "aspirin",
# iron / heme / SCD-adjacent
"hemin", "deferoxamine", "deferasirox", "simvastatin", "atorvastatin", "ticagrelor",
]
NEGATIVE_CONTROL = [
# antifungals
"fluconazole", "ketoconazole", "itraconazole", "clotrimazole", "terbinafine", "miconazole",
# antihistamines
"loratadine", "cetirizine", "fexofenadine", "diphenhydramine", "chlorpheniramine",
"astemizole",
# antibiotics
"amoxicillin", "ciprofloxacin", "doxycycline", "trimethoprim", "azithromycin", "tetracycline",
"nitrofurantoin",
# hormones / contraceptives
"levonorgestrel", "ethinyl-estradiol", "norethindrone", "medroxyprogesterone-acetate",
# misc unrelated
"omeprazole", "ranitidine", "loperamide", "caffeine", "acetaminophen", "lidocaine",
]
# Fill the random sample so the total set is ~300 (the denominator the pre-registered
# recovery-test thresholds assume: "top 30 of 300"). Curated mechanism/control drugs are
# capped by what LINCS actually contains, so the random arm absorbs the remainder.
TARGET_TOTAL = 300
def load_catalog() -> pd.DataFrame:
"""Compounds with >=1 Level-5 signature, annotated with phase + inchi/smiles."""
def read_gz(fn, **kw):
return pd.read_csv(io.BytesIO(gzip.decompress(Path(fn).read_bytes())), sep="\t", **kw)
sig1 = read_gz(LINCS / "GSE92742_sig_info.txt.gz", low_memory=False)
sig2 = read_gz(LINCS / "GSE70138_sig_info.txt.gz", low_memory=False)
cp1 = set(sig1[sig1["pert_type"] == "trt_cp"]["pert_iname"])
cp2 = set(sig2[sig2["pert_type"] == "trt_cp"]["pert_iname"])
pert1 = read_gz(LINCS / "GSE92742_pert_info.txt.gz", low_memory=False)
pert2 = read_gz(LINCS / "GSE70138_pert_info.txt.gz", low_memory=False)
info = pd.concat([pert1, pert2], ignore_index=True)
info = info[info["pert_type"] == "trt_cp"].drop_duplicates("pert_iname", keep="first")
info = info.set_index("pert_iname")
names = cp1 | cp2
rows = []
for nm in names:
phase = "both" if nm in cp1 and nm in cp2 else ("P1" if nm in cp1 else "P2")
rec = info.loc[nm] if nm in info.index else None
rows.append({
"pert_iname": nm,
"phase": phase,
"pert_id": rec["pert_id"] if rec is not None else None,
"inchi_key": rec["inchi_key"] if rec is not None else None,
"canonical_smiles": rec["canonical_smiles"] if rec is not None else None,
})
return pd.DataFrame(rows).set_index("pert_iname")
def pick(catalog: pd.DataFrame, names: list[str], reason: str) -> pd.DataFrame:
present = [n for n in names if n in catalog.index]
missing = [n for n in names if n not in catalog.index]
if missing:
print(f" [{reason}] {len(present)}/{len(names)} in LINCS; dropped: {missing}")
out = catalog.loc[present].copy()
out["inclusion_reason"] = reason
return out
def main() -> None:
catalog = load_catalog()
print(f"LINCS scorable compound universe: {len(catalog)}")
gt = pick(catalog, GROUND_TRUTH, "ground_truth")
rel = pick(catalog, RELATED_MECHANISM, "related_mechanism")
neg = pick(catalog, NEGATIVE_CONTROL, "negative_control")
chosen = pd.concat([gt, rel, neg])
remaining = catalog.drop(index=chosen.index)
n_random = TARGET_TOTAL - len(chosen)
rand = remaining.sample(n=n_random, random_state=RANDOM_SEED).copy()
rand["inclusion_reason"] = "general_sample"
drug_set = pd.concat([gt, rel, neg, rand]).reset_index()
OUT.parent.mkdir(parents=True, exist_ok=True)
drug_set.to_csv(OUT, index=False)
print(f"\ndrug_set_v1.csv: {len(drug_set)} drugs")
print(drug_set["inclusion_reason"].value_counts().to_string())
print(f"phase split:\n{drug_set['phase'].value_counts().to_string()}")
print(f"wrote {OUT}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,107 @@
"""Week 2, task 3: extract per-drug LINCS L1000 consensus signatures.
For each drug in the set, slice its Level-5 MODZ signatures (978 landmark genes x its sig_ids)
out of the big GCTX via cmapPy, then aggregate to ONE consensus 978-vector per drug by mean
across its sig_ids (the "MODZ aggregation across cell lines/replicates" of PLAN §6). hydroxyurea
lives in Phase II, L-glutamine in Phase I, so both phases are processed and merged.
Output: data/processed/lincs_signatures_v1.parquet (rows = pert_iname, cols = 978 landmark
gene symbols, values = mean MODZ z-score).
Usage:
python scripts/week2_lincs_extract.py --phase 2 # test on Phase II (already downloaded)
python scripts/week2_lincs_extract.py # both phases (needs Phase I gunzipped)
"""
from __future__ import annotations
import argparse
import gzip
import io
from pathlib import Path
import pandas as pd
from cmapPy.pandasGEXpress.parse import parse
LINCS = Path("data/raw/lincs")
DRUG_SET = Path("data/processed/drug_set_v1.csv")
OUT = Path("data/processed/lincs_signatures_v1.parquet")
GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
def read_gz_tsv(name: str) -> pd.DataFrame:
return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
"""Gene row-ids + id->symbol map for the scored gene space.
v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks.
This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into
scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier).
"""
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
ids = [str(x) for x in g["pr_gene_id"]]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
return ids, id_to_symbol
def extract_phase(phase: int, drug_names: set[str], landmark_ids: list[str]) -> pd.DataFrame:
"""Return DataFrame: rows=pert_iname, cols=landmark gene_id (str), one mean vector per drug."""
sig = read_gz_tsv(SIG_INFO[phase])
sig = sig[(sig["pert_type"] == "trt_cp") & (sig["pert_iname"].isin(drug_names))]
if sig.empty:
return pd.DataFrame()
sig_ids = sig["sig_id"].tolist()
print(f"[phase {phase}] {sig['pert_iname'].nunique()} drugs, {len(sig_ids)} signatures to slice", flush=True)
gctoo = parse(str(GCTX[phase]), rid=landmark_ids, cid=sig_ids)
data = gctoo.data_df # rows=gene_id, cols=sig_id
sig_to_drug = dict(zip(sig["sig_id"], sig["pert_iname"]))
# mean across each drug's sig_ids -> one consensus vector per drug
per_drug = data.T.groupby(data.columns.map(sig_to_drug)).mean()
print(f"[phase {phase}] aggregated to {len(per_drug)} drug consensus vectors", flush=True)
return per_drug # rows=pert_iname, cols=gene_id
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--phase", type=int, choices=[1, 2], default=None, help="single phase (test)")
args = ap.parse_args()
drugs = pd.read_csv(DRUG_SET)
drug_names = set(drugs["pert_iname"])
landmark_ids, id_to_symbol = landmark_ids_and_symbols()
phases = [args.phase] if args.phase else [1, 2]
frames = []
for ph in phases:
if not GCTX[ph].exists():
print(f"[phase {ph}] {GCTX[ph]} missing — skipping")
continue
frames.append(extract_phase(ph, drug_names, landmark_ids))
frames = [f for f in frames if not f.empty]
if not frames:
print("no signatures extracted")
return
# A drug present in both phases: average the two phase consensus vectors.
combined = pd.concat(frames).groupby(level=0).mean()
combined.columns = [id_to_symbol.get(c, c) for c in combined.columns] # gene_id -> symbol
covered = sorted(set(combined.index))
missing = sorted(drug_names - set(covered))
print(f"\nsignatures extracted for {len(covered)}/{len(drug_names)} drugs")
for gt in ["hydroxyurea", "glutamine"]:
print(f" ground truth '{gt}': {'OK' if gt in covered else 'MISSING'}")
if args.phase is None:
combined.to_parquet(OUT)
print(f"wrote {OUT} ({combined.shape[0]} drugs x {combined.shape[1]} landmark genes)")
if missing:
print(f"{len(missing)} drugs without signature (will be marked not-scored in Week 3)")
if __name__ == "__main__":
main()

82
scripts/week3_scoring.py Normal file
View File

@@ -0,0 +1,82 @@
"""Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration.
Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of
random queries) — this calibrates out broad-effect drugs that connect to random signatures too,
the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column,
and a secondary ranking blends tau with the sickle-pathway mechanistic prior.
Output: data/results/ranked_candidates_v1.csv.
"""
from __future__ import annotations
import json
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import ( # noqa: E402
connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate,
)
PROCESSED = Path("data/processed")
N_NULL = 1000
PRIOR_LAMBDA = 0.5 # spec_z credit per matched sickle pathway, for the blended ranking
def main() -> None:
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
up = [g["gene"] for g in sig["up_regulated"]]
down = [g["gene"] for g in sig["down_regulated"]]
sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 12,328 genes
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
n_up = len(set(up) & set(sig_matrix.columns))
n_down = len(set(down) & set(sig_matrix.columns))
print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}")
# primary: tau calibration
print(f"computing tau over {N_NULL} random-query permutations ...", flush=True)
ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL)
# reference: weighted connectivity score (WTCS) + NCS
wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index},
name="connectivity_score")
ranked["connectivity_score"] = wtcs
ranked["normalized_score"] = normalize_scores(wtcs)
# metadata + mechanistic prior
ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
ranked["mechanistic_prior"] = ranked["targets"].apply(
lambda t: mechanistic_prior(list(t) if t is not None else []))
ranked["known_targets"] = ranked["targets"].apply(
lambda t: "; ".join(t) if t is not None and len(t) else "")
ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})
# secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z)
ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)
out = ranked.rename_axis("drug_name").reset_index()[[
"rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score",
"inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
]]
path = persist_ranking(out)
print(f"wrote {path} ({len(out)} drugs)")
print("\n--- sanity peek (spec_z ranking) ---")
for gt in ["hydroxyurea", "glutamine"]:
r = ranked.loc[gt]
print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), "
f"spec_z={r['spec_z']:.2f}")
print(" top 10 by spec_z:")
for name, r in ranked.nsmallest(10, "spec_z").iterrows():
print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] "
f"{str(r['known_targets'])[:38]}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,81 @@
"""Week 4: formal recovery test against the pre-registered criteria (PLAN §6).
Pre-registered criteria (committed in docs/recovery_test_report.md before this run):
- hydroxyurea in top 10% (top 30 of 300), AND
- L-glutamine in top 25% (top 75) OR documented unscorable due to missing LINCS signature, AND
- >=4 of 5 pre-specified negative controls in the bottom half.
The 5 negative controls are pre-specified here by a category rule (one per category, alphabetically
first available) so the choice does not peek at ranks. Primary ranking = raw connectivity.
"""
from __future__ import annotations
from pathlib import Path
import pandas as pd
RANKED = Path("data/results/ranked_candidates_v1.csv")
# One per unrelated category, alphabetical-first — chosen without looking at ranks.
NEG_CONTROL_CATEGORIES = {
"antifungal": ["clotrimazole", "fluconazole", "itraconazole", "ketoconazole", "miconazole", "terbinafine"],
"antihistamine": ["astemizole", "cetirizine", "diphenhydramine", "fexofenadine", "loratadine"],
"antibiotic": ["azithromycin", "ciprofloxacin", "doxycycline", "tetracycline", "trimethoprim"],
"hormone": ["ethinyl-estradiol", "levonorgestrel", "medroxyprogesterone-acetate", "norethindrone"],
"misc": ["caffeine", "lidocaine", "loperamide", "omeprazole", "ranitidine"],
}
def main() -> None:
df = pd.read_csv(RANKED).set_index("drug_name")
n = len(df)
top10_cut, top25_cut, half = int(n * 0.10), int(n * 0.25), n // 2
def rk(name):
return int(df.loc[name, "rank"]) if name in df.index else None
hu, glut = rk("hydroxyurea"), rk("glutamine")
glut_z = df.loc["glutamine", "spec_z"]
# pick negative controls present in the ranking
negs = {}
for cat, options in NEG_CONTROL_CATEGORIES.items():
pick = next((d for d in options if d in df.index), None)
if pick:
negs[pick] = (cat, rk(pick))
print("=" * 60)
print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}")
print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), spec_z={glut_z:+.2f} "
f"-> top-25%? {glut <= top25_cut} (positive z => does not reverse; has a signature)")
print("\nnegative controls (pre-specified, 1 per category):")
n_bottom = 0
for d, (cat, r) in negs.items():
in_bottom = r > half
n_bottom += in_bottom
print(f" {d:18s} [{cat:13s}] rank {r:3d} bottom-half? {in_bottom}")
print(f" -> {n_bottom}/5 in bottom half (need >=4)")
crit_hu = hu <= top10_cut
crit_glut = glut <= top25_cut
crit_neg = n_bottom >= 4
overall = crit_hu and crit_glut and crit_neg
print(f"\nCRITERIA: hydroxyurea={crit_hu}, L-glutamine={crit_glut}, neg-controls={crit_neg}")
print(f"OVERALL (raw ranking): {'PASS' if overall else 'FAIL'}")
# secondary prior-weighted view (reported, not the primary criterion)
hu_b = int(df.loc["hydroxyurea", "blended_rank"])
print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
f"(top {100*hu_b/n:.1f}%)")
print("\n--- TOP 10 (primary spec_z ranking) ---")
top10 = df.sort_values("rank").head(10)
for name, r in top10.iterrows():
print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:+.2f} "
f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}")
if __name__ == "__main__":
main()

89
src/binding.py Normal file
View File

@@ -0,0 +1,89 @@
"""Structure-based binding track (PLAN §12).
Two capabilities:
- ligand-based retrieval (RDKit, works now): find existing drugs near a query molecule in
chemical space — validated, and the engine behind §12.9 generative-guided retrieval.
- structure-based docking (§12.3): score whether a ligand binds a target pocket. Blocked on an
ARM-Mac docking toolchain (AutoDock Vina does not pip-install); see ``dock`` for options.
Caveat carried throughout: chemical similarity != activity, and docking != efficacy (§12.6).
"""
from __future__ import annotations
from pathlib import Path
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator
RDLogger.DisableLog("rdApp.*")
_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
STRUCT_DIR = Path("data/raw/structures")
# Known sickle small-molecule binders, by target (positive controls for the §12.4 recovery test).
KNOWN_BINDERS = {
"hemoglobin": "voxelotor",
"PKR": "mitapivat",
"DNMT1": "decitabine",
"HDAC": "vorinostat",
}
# Curated target structures (PLAN §12.2). Add PDB ids as the harness grows.
TARGET_PDB = {
"hemoglobin": "5E83", # hemoglobin + voxelotor (GBT440)
"PKR": "8XFD", # pyruvate kinase R + mitapivat
}
def morgan_fp(smiles: str):
"""Morgan (ECFP4) fingerprint, or None for invalid / missing SMILES ('-666', '')."""
if not isinstance(smiles, str) or smiles in ("-666", ""):
return None
mol = Chem.MolFromSmiles(smiles)
return _MORGAN.GetFingerprint(mol) if mol else None
def tanimoto(smiles_a: str, smiles_b: str) -> float | None:
fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
if fa is None or fb is None:
return None
return DataStructs.TanimotoSimilarity(fa, fb)
def retrieve_nearest(
query_smiles: str,
library: dict[str, str],
top_n: int = 5,
) -> list[tuple[float, str]]:
"""Rank a library of {name: smiles} by Tanimoto similarity to a query molecule.
This is the §12.9 retrieval step: the query may be a known binder (positive-control beacon)
or a generated idealised binder; the returned existing drugs are repurposing candidates that
STILL require docking/validation (similarity != activity).
"""
qfp = morgan_fp(query_smiles)
if qfp is None:
raise ValueError("invalid query SMILES")
sims = []
for name, smi in library.items():
fp = morgan_fp(smi)
if fp is not None:
sims.append((DataStructs.TanimotoSimilarity(qfp, fp), name))
return sorted(sims, reverse=True)[:top_n]
def dock(target: str, ligand_smiles: str) -> float:
"""Dock a ligand into a target pocket and return a binding score (PLAN §12.3).
Blocked: AutoDock Vina does not pip-install on ARM Mac and no docking binary is on PATH.
Resolve the toolchain first (one of):
- conda/micromamba: ``vina`` (conda-forge) or ``smina`` (bioconda), osx-arm64 builds
- an AF3-class co-folding model on GPU: Boltz-2 / Chai-1 / DiffDock (also predicts affinity)
- x86 Vina binary under Rosetta 2
Then: fetch TARGET_PDB[target], define the pocket box, prep the ligand (Meeko), score.
"""
raise NotImplementedError(
"Docking toolchain unresolved on ARM Mac (PLAN §12.6 pitfall 4 / §12.8). "
"See docstring for options."
)

View File

@@ -1,19 +1,27 @@
"""Disease signature construction. """Disease signature construction.
Week 1 (PLAN.md §6). Builds a Tier-A sickle cell signature from GEO expression data via Week 1 (PLAN.md §6). Builds a Tier-A sickle cell signature from GEO expression data via
differential expression, then persists it with full provenance to differential expression on TWO studies, keeping only genes that are concordant across both
``data/processed/sickle_cell_signature_v1.json``. (the multi-source evidence that earns Tier A — see project memory). Persists with full
provenance to ``data/processed/sickle_cell_signature_v1.json``.
This module defines the persisted schema (pydantic) and the construction stubs. The actual Pipeline (driven from ``notebooks/02_disease_signature.ipynb``):
data pull + differential expression is driven from ``notebooks/02_disease_signature.ipynb``. per study: compute_differential_expression -> collapse_probes_to_symbols
across: concordance_filter -> build_signature -> persist_signature
Conventions:
- log_fc > 0 => up-regulated in DISEASE vs HEALTHY (Week 1 refinement R1)
- cross-study join key is the HGNC gene symbol (R2)
""" """
from __future__ import annotations from __future__ import annotations
from pathlib import Path from pathlib import Path
import numpy as np
import pandas as pd import pandas as pd
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
from scipy.stats import false_discovery_control, ttest_ind
from . import PIPELINE_VERSION, PROCESSED_DIR from . import PIPELINE_VERSION, PROCESSED_DIR
from .provenance import ConfidenceTier from .provenance import ConfidenceTier
@@ -22,6 +30,10 @@ from .provenance import ConfidenceTier
TOP_N_PER_DIRECTION = 250 TOP_N_PER_DIRECTION = 250
QVALUE_CUTOFF = 0.05 QVALUE_CUTOFF = 0.05
# Group labels used throughout. The disease group is the "treatment" arm in DE contrasts.
DISEASE_LABEL = "disease"
HEALTHY_LABEL = "healthy"
class GeneEntry(BaseModel): class GeneEntry(BaseModel):
"""A single differentially expressed gene in the signature.""" """A single differentially expressed gene in the signature."""
@@ -33,15 +45,37 @@ class GeneEntry(BaseModel):
qvalue: float qvalue: float
class SignatureProvenance(BaseModel): class StudyProvenance(BaseModel):
"""Provenance block for a disease signature (PLAN.md §6 schema).""" """Provenance for one contributing GEO study (Week 1 refinement R6)."""
geo_accession: str geo_accession: str
n_disease: int n_disease: int
n_healthy: int n_healthy: int
platform: str platform: str
method: str = Field(..., description="Differential expression method, e.g. 'limma', 'deseq2'.") tissue: str
method: str = Field(..., description="DE method: 'welch' (microarray) or 'deseq2' (RNA-seq).")
class ConcordanceSummary(BaseModel):
"""How the two studies were reconciled into the signature (R6)."""
rule: str = Field(
default="q<0.05 in both studies AND same sign of log_fc in both",
description="The concordance rule applied across studies.",
)
n_genes_tested: int = Field(..., description="Genes present in both studies after symbol collapse.")
n_concordant: int
n_up: int
n_down: int
class SignatureProvenance(BaseModel):
"""Provenance block for a disease signature (revised for 2-study concordance, R6)."""
studies: list[StudyProvenance]
concordance: ConcordanceSummary
created_date: str created_date: str
pipeline_version: str = PIPELINE_VERSION
class DiseaseSignature(BaseModel): class DiseaseSignature(BaseModel):
@@ -58,44 +92,213 @@ class DiseaseSignature(BaseModel):
limitations: list[str] limitations: list[str]
def _looks_like_linear_intensity(expression: pd.DataFrame) -> bool:
"""Heuristic: microarray data still on a linear scale has a large dynamic range.
log2-transformed intensities are typically <~20; raw intensities run into the thousands.
"""
return float(np.nanmax(expression.to_numpy())) > 50.0
def _welch_de(expression: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
"""Per-gene Welch t-test + Benjamini-Hochberg, the limma-equivalent for microarray.
Assumes ``expression`` is genes (rows) x samples (columns), on (or coercible to) a log2
scale. ``log_fc`` is mean(disease) - mean(healthy) (R1 sign convention).
"""
expr = expression.copy()
if _looks_like_linear_intensity(expr):
# log2(x+1) guards against zeros/negatives while keeping the transform standard.
expr = np.log2(expr.clip(lower=0) + 1.0)
disease_cols = groups.index[groups == DISEASE_LABEL]
healthy_cols = groups.index[groups == HEALTHY_LABEL]
disease = expr[disease_cols].to_numpy()
healthy = expr[healthy_cols].to_numpy()
log_fc = disease.mean(axis=1) - healthy.mean(axis=1)
# Welch's t-test (unequal variance) per gene across samples.
_, pvalue = ttest_ind(disease, healthy, axis=1, equal_var=False, nan_policy="omit")
out = pd.DataFrame({"log_fc": log_fc, "pvalue": pvalue}, index=expr.index)
out = out.dropna(subset=["pvalue"])
out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
return out.sort_values("qvalue")
def _deseq2_de(counts: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
"""RNA-seq differential expression via pydeseq2.
``counts`` is genes (rows) x samples (columns) of raw integer counts. Returns the same
schema as ``_welch_de`` (log_fc = log2FC of disease vs healthy, plus pvalue/qvalue).
"""
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# pydeseq2 wants samples (rows) x genes (cols), with an aligned metadata frame.
counts_t = counts.T.round().astype(int)
metadata = pd.DataFrame({"condition": groups.reindex(counts_t.index).to_numpy()}, index=counts_t.index)
dds = DeseqDataSet(counts=counts_t, metadata=metadata, design="~condition", quiet=True)
dds.deseq2()
stats = DeseqStats(dds, contrast=["condition", DISEASE_LABEL, HEALTHY_LABEL], quiet=True)
stats.summary()
res = stats.results_df.rename(columns={"log2FoldChange": "log_fc", "pvalue": "pvalue", "padj": "qvalue"})
return res[["log_fc", "pvalue", "qvalue"]].dropna(subset=["qvalue"]).sort_values("qvalue")
def compute_differential_expression( def compute_differential_expression(
expression: pd.DataFrame, expression: pd.DataFrame,
sample_groups: pd.Series, sample_groups: pd.Series,
*, *,
method: str, method: str,
) -> pd.DataFrame: ) -> pd.DataFrame:
"""Compute gene-level log fold change and adjusted p-values. """Compute gene-level log fold change and adjusted p-values for one study.
For RNA-seq use ``pydeseq2``; for microarray log2-transform/normalize and use a
limma-equivalent (PLAN.md §6, Week 1 task 4).
Args: Args:
expression: Genes (rows) x samples (columns) expression matrix. expression: Genes (rows) x samples (columns). Microarray intensities or RNA-seq counts.
sample_groups: Per-sample group label ('disease' / 'healthy'), indexed by sample. sample_groups: Per-sample 'disease'/'healthy' label, indexed by sample id (column).
method: 'deseq2' (RNA-seq) or 'limma' (microarray). method: 'welch' (microarray, PLAN choice) or 'deseq2' (RNA-seq).
Returns: Returns:
A table indexed by gene with at least ``log_fc`` and ``qvalue`` columns. Table indexed by gene/probe with ``log_fc``, ``pvalue``, ``qvalue`` (R1 sign convention).
""" """
raise NotImplementedError("Differential expression: implement in Week 1 (notebook 02).") groups = sample_groups.reindex(expression.columns)
if groups.isna().any():
missing = list(groups.index[groups.isna()])
raise ValueError(f"sample_groups missing labels for samples: {missing[:5]}...")
if method == "welch":
return _welch_de(expression, groups)
if method == "deseq2":
return _deseq2_de(expression, groups)
raise ValueError(f"Unknown method {method!r}; expected 'welch' or 'deseq2'.")
def collapse_probes_to_symbols(
de_table: pd.DataFrame,
probe_to_symbol: pd.Series,
expression_for_ranking: pd.DataFrame | None = None,
) -> pd.DataFrame:
"""Collapse a probe-level DE table to one row per HGNC symbol (R2).
When multiple probes map to the same symbol, keep the probe with the highest mean
expression (the standard collapseRows heuristic) if ``expression_for_ranking`` is given;
otherwise keep the probe with the smallest qvalue.
Args:
de_table: DE table indexed by probe id (from ``compute_differential_expression``).
probe_to_symbol: Probe id -> HGNC symbol mapping.
expression_for_ranking: Optional genes x samples matrix to rank duplicate probes.
Returns:
DE table indexed by HGNC symbol.
"""
df = de_table.copy()
df["symbol"] = probe_to_symbol.reindex(df.index)
df = df.dropna(subset=["symbol"])
if expression_for_ranking is not None:
rank_key = expression_for_ranking.reindex(df.index).mean(axis=1)
df = df.assign(_rank=rank_key).sort_values("_rank", ascending=False)
else:
df = df.sort_values("qvalue", ascending=True)
collapsed = df[~df["symbol"].duplicated(keep="first")].set_index("symbol")
return collapsed.drop(columns=[c for c in ("_rank",) if c in collapsed.columns])
def concordance_filter(
de_a: pd.DataFrame,
de_b: pd.DataFrame,
*,
qvalue_cutoff: float = QVALUE_CUTOFF,
) -> tuple[pd.DataFrame, ConcordanceSummary]:
"""Keep only genes concordant across two symbol-level DE tables (R3).
A gene qualifies iff q<``qvalue_cutoff`` in BOTH studies AND log_fc has the same sign in
both. Reported ``log_fc`` = mean of the two; ``qvalue`` = max of the two (worst case).
Ranked by ``max(q_a, q_b)`` ascending.
Returns:
(concordant_table indexed by symbol with log_fc/qvalue, ConcordanceSummary).
"""
merged = de_a.join(de_b, lsuffix="_a", rsuffix="_b", how="inner")
n_tested = len(merged)
sig_both = (merged["qvalue_a"] < qvalue_cutoff) & (merged["qvalue_b"] < qvalue_cutoff)
same_sign = np.sign(merged["log_fc_a"]) == np.sign(merged["log_fc_b"])
keep = merged[sig_both & same_sign].copy()
keep["log_fc"] = (keep["log_fc_a"] + keep["log_fc_b"]) / 2.0
keep["qvalue"] = keep[["qvalue_a", "qvalue_b"]].max(axis=1)
keep = keep[["log_fc", "qvalue"]].sort_values("qvalue")
summary = ConcordanceSummary(
n_genes_tested=n_tested,
n_concordant=len(keep),
n_up=int((keep["log_fc"] > 0).sum()),
n_down=int((keep["log_fc"] < 0).sum()),
)
return keep, summary
def build_signature( def build_signature(
de_table: pd.DataFrame, concordant_table: pd.DataFrame,
provenance: SignatureProvenance, provenance: SignatureProvenance,
*, *,
tier: ConfidenceTier, tier: ConfidenceTier,
tier_rationale: str, tier_rationale: str,
limitations: list[str], limitations: list[str],
id_map: pd.DataFrame | None = None,
top_n: int = TOP_N_PER_DIRECTION, top_n: int = TOP_N_PER_DIRECTION,
qvalue_cutoff: float = QVALUE_CUTOFF,
) -> DiseaseSignature: ) -> DiseaseSignature:
"""Assemble a ``DiseaseSignature`` from a differential expression table. """Assemble a ``DiseaseSignature`` from the concordant gene table.
Takes the top ``top_n`` up- and down-regulated genes (by qvalue, cut at Takes the top ``top_n`` up- and down-regulated genes by qvalue per direction (R3). If
``qvalue_cutoff``) per PLAN.md §6, Week 1 task 5. fewer than ``top_n`` qualify in a direction, takes all available (the caller should have
logged the count). ``id_map`` optionally provides 'entrez_id'/'ensembl_id' columns indexed
by symbol (from ``mygene``).
Args:
concordant_table: Output of ``concordance_filter`` (indexed by symbol).
provenance: Fully populated ``SignatureProvenance`` (both studies + concordance).
tier: Confidence tier (Tier A only if genuinely multi-source).
tier_rationale: One-line justification of the tier.
limitations: Honest limitations list (R8).
id_map: Optional symbol -> {entrez_id, ensembl_id} table.
top_n: Max genes per direction.
Returns:
A populated ``DiseaseSignature``.
""" """
raise NotImplementedError("Signature assembly: implement in Week 1 (notebook 02).")
def _entries(rows: pd.DataFrame) -> list[GeneEntry]:
entries: list[GeneEntry] = []
for symbol, row in rows.iterrows():
ids = id_map.loc[symbol] if (id_map is not None and symbol in id_map.index) else None
entries.append(
GeneEntry(
gene=str(symbol),
entrez_id=(str(ids["entrez_id"]) if ids is not None and pd.notna(ids.get("entrez_id")) else None),
ensembl_id=(str(ids["ensembl_id"]) if ids is not None and pd.notna(ids.get("ensembl_id")) else None),
log_fc=float(row["log_fc"]),
qvalue=float(row["qvalue"]),
)
)
return entries
up = concordant_table[concordant_table["log_fc"] > 0].sort_values("qvalue").head(top_n)
down = concordant_table[concordant_table["log_fc"] < 0].sort_values("qvalue").head(top_n)
return DiseaseSignature(
up_regulated=_entries(up),
down_regulated=_entries(down),
provenance=provenance,
confidence_tier=tier,
tier_rationale=tier_rationale,
limitations=limitations,
)
def persist_signature(signature: DiseaseSignature, out_path: Path | None = None) -> Path: def persist_signature(signature: DiseaseSignature, out_path: Path | None = None) -> Path:

View File

@@ -1,33 +1,65 @@
"""CMap-style connectivity scoring — the matching engine. """CMap-style connectivity scoring — the matching engine (Week 3, PLAN §6).
Week 3 (PLAN.md §6). Scores each drug's LINCS signature against the disease signature using Scores each drug's LINCS consensus signature against the disease signature using the weighted
weighted Kolmogorov-Smirnov enrichment (Lamb 2006 / Subramanian 2017). Strongly *negative* Kolmogorov-Smirnov / GSEA enrichment statistic (Lamb 2006; Subramanian 2017). The query is the
connectivity = strong reversal of the disease signature = candidate match. disease up/down gene sets; the reference is each drug's 978 landmark genes ranked by z-score.
Uses ``cmapPy`` as the reference implementation. ``tests/test_scoring.py`` verifies the Sign convention (PLAN §6): strongly **negative** connectivity = strong **reversal** of the
implementation against a known reference. disease signature = candidate match. A drug that down-regulates the disease's up-genes and
up-regulates its down-genes scores negative.
""" """
from __future__ import annotations from __future__ import annotations
from pathlib import Path from pathlib import Path
import numpy as np
import pandas as pd import pandas as pd
from pydantic import BaseModel
from . import RESULTS_DIR from . import RANDOM_SEED, RESULTS_DIR
# Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
# Keys are pathway categories; values are substrings matched (case-insensitive) against a
# drug's ChEMBL target names.
SICKLE_PATHWAYS: dict[str, tuple[str, ...]] = {
"hbf_epigenetic": ("histone deacetylase", "hdac", "methyltransferase", "dnmt",
"ribonucleoside-diphosphate reductase", "ribonucleotide reductase"),
"hemoglobin": ("hemoglobin", "globin"),
"no_signaling": ("nitric oxide", "guanylate cyclase", "phosphodiesterase 5", "pde5"),
"inflammation": ("cyclooxygenase", "prostaglandin", "nf-kappa", "interleukin",
"leukotriene", "selectin", "tumor necrosis factor"),
"oxidative_stress": ("glutathione", "superoxide", "nadph oxidase", "thioredoxin", "nrf2"),
}
class ConnectivityResult(BaseModel): def _enrichment_score(drug_profile: pd.Series, gene_set: set[str], weight: float = 1.0) -> float:
"""Connectivity score for a single drug against the disease signature.""" """Weighted GSEA/KS enrichment score of ``gene_set`` in a drug's ranked profile.
chembl_id: str The profile is ranked by z-score (most up-regulated first). Hits increment a running sum in
drug_name: str proportion to ``|z|**weight``; misses decrement uniformly. ES is the max signed deviation
connectivity_score: float | None # None when the drug has no LINCS signature. from zero. ES>0 => set enriched among up-regulated genes; ES<0 => among down-regulated.
normalized_score: float | None = None """
p_value: float | None = None s = drug_profile.sort_values(ascending=False)
scored: bool # False => no signature available, not scored (do not skip silently). genes = s.index.to_numpy()
n_genes_overlap: int | None = None vals = s.to_numpy(dtype=float)
hit = np.fromiter((g in gene_set for g in genes), dtype=bool, count=len(genes))
n_hit = int(hit.sum())
n = len(genes)
if n_hit == 0 or n_hit == n:
return 0.0
w = (np.abs(vals) ** weight) * hit
sum_hit = w.sum()
if sum_hit == 0:
return 0.0
inc = w / sum_hit
dec = (~hit) / (n - n_hit)
running = np.cumsum(inc - dec)
hi, lo = running.max(), running.min()
return float(hi if abs(hi) >= abs(lo) else lo)
def connectivity_score( def connectivity_score(
@@ -35,46 +67,157 @@ def connectivity_score(
down_genes: list[str], down_genes: list[str],
drug_signature: pd.Series, drug_signature: pd.Series,
) -> float: ) -> float:
"""Weighted KS connectivity score for one drug vs the disease up/down gene sets. """Weighted connectivity score (WTCS) for one drug vs the disease up/down sets.
Only the intersection of disease-signature genes and LINCS landmark genes is scored; Only query genes present in the drug's profile index (the 978 landmarks) are used — callers
callers must record the overlap count (PLAN.md §6, Week 3 task 2). should record the overlap count (PLAN §6 Week 3 task 2). Returns the WTCS: if the two
enrichment scores share a sign the result is 0 (ambiguous), else ``(ES_up - ES_down)/2``.
Args: Negative => reversal => candidate.
up_genes: Disease up-regulated gene identifiers.
down_genes: Disease down-regulated gene identifiers.
drug_signature: Drug's expression vector indexed by gene identifier.
Returns:
Connectivity score in roughly [-1, 1]; strongly negative = strong reversal.
""" """
raise NotImplementedError("Connectivity scoring: implement in Week 3 (notebook 04).") profile_genes = set(drug_signature.index)
up = set(up_genes) & profile_genes
down = set(down_genes) & profile_genes
es_up = _enrichment_score(drug_signature, up)
es_down = _enrichment_score(drug_signature, down)
if np.sign(es_up) == np.sign(es_down):
return 0.0
return (es_up - es_down) / 2.0
def normalize_scores(scores: pd.Series) -> pd.Series:
"""Signed normalization (NCS, Subramanian 2017): divide by the mean magnitude of same-sign
scores, so positive and negative tails are separately scaled to a mean magnitude of 1."""
out = scores.astype(float).copy()
pos_mean = scores[scores > 0].mean()
neg_mean = scores[scores < 0].abs().mean()
if pos_mean and not np.isnan(pos_mean):
out[scores > 0] = scores[scores > 0] / pos_mean
if neg_mean and not np.isnan(neg_mean):
out[scores < 0] = scores[scores < 0] / neg_mean
return out
def rank_drugs( def rank_drugs(
signature_up: list[str], up_genes: list[str],
signature_down: list[str], down_genes: list[str],
drug_profiles: pd.DataFrame, signature_matrix: pd.DataFrame,
) -> pd.DataFrame: ) -> pd.DataFrame:
"""Score and rank all drugs against the disease signature. """Score and rank all drugs (rows of ``signature_matrix``: drug x landmark-gene z-scores).
Drugs without a LINCS signature are marked ``scored=False`` and excluded from the ranking Returns a table indexed by drug with ``rank`` (1 = strongest reversal = most negative),
rather than dropped silently (PLAN.md §6, Week 3 task 2). ``connectivity_score`` and ``normalized_score``. Drugs are expected to all have signatures
here; signature-less drugs are handled (marked not-scored) by the orchestration layer per
Returns a ranked table with the columns described in PLAN.md §6 (rank, drug_name, PLAN §6 Week 3 task 2.
chembl_id, connectivity_score, normalized_score, p_value, inclusion_reason,
known_targets, mechanism_summary).
""" """
raise NotImplementedError("Drug ranking: implement in Week 3 (notebook 04).") scores = pd.Series(
{drug: connectivity_score(up_genes, down_genes, signature_matrix.loc[drug])
for drug in signature_matrix.index},
name="connectivity_score",
)
df = pd.DataFrame({"connectivity_score": scores, "normalized_score": normalize_scores(scores)})
df = df.sort_values("connectivity_score") # most negative (reversal) first
df.insert(0, "rank", range(1, len(df) + 1))
return df
def mechanistic_prior(targets: list[str]) -> float: def mechanistic_prior(targets: list[str]) -> float:
"""Prior weight for a drug based on sickle-cell-relevant target pathways. """Count of sickle-cell-relevant pathway categories a drug's targets hit (PLAN §6 task 3).
Pathways of interest: HbF regulation, hemoglobin, NO signaling, inflammation, oxidative Higher = more mechanistically plausible. Used to build the secondary, prior-weighted ranking
stress (PLAN.md §6, Week 3 task 3). Used to build the secondary, prior-weighted ranking. alongside the raw connectivity ranking.
""" """
raise NotImplementedError("Mechanistic prior: implement in Week 3 (notebook 04).") if not targets:
return 0.0
text = " ; ".join(str(t) for t in targets).lower()
return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))
# --- KS connectivity + tau calibration (v1.1) -----------------------------------------------
# Unweighted Kolmogorov-Smirnov connectivity (Lamb 2006) is O(k) per query (depends only on the
# ranks of the query genes), which makes a permutation null over many random queries cheap. tau
# expresses each drug's real connectivity as a signed percentile within its own null — so
# broad-effect drugs that connect to *random* signatures too get down-weighted (specificity).
def _ks_es(rank_matrix: np.ndarray, query_cols: np.ndarray, n_genes: int) -> np.ndarray:
"""Vectorized unweighted KS enrichment score of a query gene set for every drug.
``rank_matrix`` is (n_drugs, n_genes) of 1..N rank positions (1 = most up-regulated).
Returns one ES per drug; ES>0 => query enriched among up-regulated genes.
"""
k = len(query_cols)
if k == 0:
return np.zeros(rank_matrix.shape[0])
p = np.sort(rank_matrix[:, query_cols], axis=1) # (n_drugs, k), positions ascending
j = np.arange(1, k + 1)
a = (j / k - p / n_genes).max(axis=1)
b = (p / n_genes - (j - 1) / k).max(axis=1)
return np.where(a >= b, a, -b)
def _ks_connectivity(rank_matrix: np.ndarray, up_cols: np.ndarray, down_cols: np.ndarray,
n_genes: int) -> np.ndarray:
"""KS connectivity per drug: (ES_up - ES_down)/2. Negative=reversal.
Note: unlike WTCS, this does NOT zero same-sign (ambiguous) connections — same-sign ES
partially cancel and land near 0 naturally. Hard-zeroing would collapse the random-query
null to a spike at 0 and make tau saturate, so the continuous form is required for calibration.
"""
es_up = _ks_es(rank_matrix, up_cols, n_genes)
es_down = _ks_es(rank_matrix, down_cols, n_genes)
return (es_up - es_down) / 2.0
def tau_calibrate(
up_genes: list[str],
down_genes: list[str],
signature_matrix: pd.DataFrame,
n_null: int = 1000,
seed: int = RANDOM_SEED,
) -> pd.DataFrame:
"""Rank drugs by tau: each drug's KS connectivity as a signed percentile vs a null of
random queries of the same size (PLAN §6; CMap tau, Subramanian 2017).
tau in [-100, 100]: -100 => reverses our signature more specifically than any random query
(strong, specific candidate); ~0 => connects to our signature no more than to random ones
(broad-effect / non-specific). Ranked by tau ascending (rank 1 = most specific reversal).
"""
genes = list(signature_matrix.columns)
gene_to_col = {g: i for i, g in enumerate(genes)}
n = len(genes)
rank_matrix = signature_matrix.rank(axis=1, ascending=False).to_numpy()
up_cols = np.array([gene_to_col[g] for g in set(up_genes) if g in gene_to_col], dtype=int)
down_cols = np.array([gene_to_col[g] for g in set(down_genes) if g in gene_to_col], dtype=int)
real = _ks_connectivity(rank_matrix, up_cols, down_cols, n)
rng = np.random.default_rng(seed)
k_up, k_down = len(up_cols), len(down_cols)
null = np.empty((rank_matrix.shape[0], n_null))
for m in range(n_null):
samp = rng.choice(n, size=k_up + k_down, replace=False)
null[:, m] = _ks_connectivity(rank_matrix, samp[:k_up], samp[k_up:], n)
null_mean = null.mean(axis=1)
null_std = null.std(axis=1)
null_std[null_std == 0] = np.nan
# Per-drug standardized connectivity: how many SDs the real reversal is below what random
# queries of the same size produce against THIS drug. Continuous (no percentile floor), so it
# discriminates within the strong-reversal tail. Negative = specific reversal.
spec_z = (real - null_mean) / null_std
frac = (null <= real[:, None]).mean(axis=1)
tau = 100.0 * (2.0 * frac - 1.0) # retained for reference; saturates at +/-100 in the tail
df = pd.DataFrame(
{"connectivity_ks": real, "null_mean": null_mean, "spec_z": spec_z, "tau": tau},
index=signature_matrix.index,
)
df = df.sort_values("spec_z") # most negative z = most specific reversal
df.insert(0, "rank", range(1, len(df) + 1))
return df
def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path: def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:

122
tests/test_disease.py Normal file
View File

@@ -0,0 +1,122 @@
"""Tests for the Week 1 signature-construction logic on synthetic data.
These verify the DE / probe-collapse / concordance math without touching the network, so the
pipeline is trustworthy before it is pointed at real GEO studies.
"""
from __future__ import annotations
import numpy as np
import pandas as pd
import pytest
from src.disease import (
DISEASE_LABEL,
HEALTHY_LABEL,
ConcordanceSummary,
SignatureProvenance,
StudyProvenance,
build_signature,
collapse_probes_to_symbols,
compute_differential_expression,
concordance_filter,
)
from src.provenance import ConfidenceTier
def _synthetic_study(seed: int, n_per_group: int = 12) -> tuple[pd.DataFrame, pd.Series]:
"""Genes x samples matrix where UP is higher and DOWN is lower in disease."""
rng = np.random.default_rng(seed)
genes = ["UP1", "UP2", "DOWN1", "DOWN2", "NOISE1", "NOISE2"]
samples = [f"d{i}" for i in range(n_per_group)] + [f"h{i}" for i in range(n_per_group)]
groups = pd.Series([DISEASE_LABEL] * n_per_group + [HEALTHY_LABEL] * n_per_group, index=samples)
base = rng.normal(8.0, 0.3, size=(len(genes), len(samples)))
df = pd.DataFrame(base, index=genes, columns=samples)
disease = groups == DISEASE_LABEL
df.loc["UP1", disease] += 3.0
df.loc["UP2", disease] += 2.0
df.loc["DOWN1", disease] -= 3.0
df.loc["DOWN2", disease] -= 2.0
return df, groups
def test_welch_de_recovers_direction_and_significance():
expr, groups = _synthetic_study(seed=1)
de = compute_differential_expression(expr, groups, method="welch")
assert de.loc["UP1", "log_fc"] > 0 and de.loc["UP1", "qvalue"] < 0.05
assert de.loc["DOWN1", "log_fc"] < 0 and de.loc["DOWN1", "qvalue"] < 0.05
# Pure-noise genes should not be significant.
assert de.loc["NOISE1", "qvalue"] > 0.05
def test_compute_de_rejects_unlabelled_samples():
expr, groups = _synthetic_study(seed=2)
with pytest.raises(ValueError):
compute_differential_expression(expr, groups.iloc[:-3], method="welch")
def test_collapse_probes_keeps_highest_mean_expression():
de = pd.DataFrame(
{"log_fc": [1.0, 2.0, -1.0], "pvalue": [0.1, 0.2, 0.3], "qvalue": [0.1, 0.2, 0.3]},
index=["probeA1", "probeA2", "probeB1"],
)
probe_to_symbol = pd.Series({"probeA1": "GENEA", "probeA2": "GENEA", "probeB1": "GENEB"})
expr = pd.DataFrame(
{"s1": [1.0, 100.0, 5.0], "s2": [1.0, 100.0, 5.0]},
index=["probeA1", "probeA2", "probeB1"],
)
collapsed = collapse_probes_to_symbols(de, probe_to_symbol, expression_for_ranking=expr)
assert set(collapsed.index) == {"GENEA", "GENEB"}
# probeA2 has the higher mean expression, so its log_fc (2.0) should win.
assert collapsed.loc["GENEA", "log_fc"] == 2.0
def test_concordance_filter_keeps_only_agreeing_genes():
de_a = pd.DataFrame(
{"log_fc": [2.0, -2.0, 1.5, 0.1], "qvalue": [0.001, 0.001, 0.2, 0.001]},
index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
)
de_b = pd.DataFrame(
{"log_fc": [1.8, -2.2, 1.4, -0.1], "qvalue": [0.002, 0.002, 0.2, 0.002]},
index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
)
keep, summary = concordance_filter(de_a, de_b)
assert set(keep.index) == {"UP1", "DOWN1"} # WEAK fails q-cut; DISAGREE flips sign
assert keep.loc["UP1", "log_fc"] == pytest.approx(1.9) # mean of the two
assert keep.loc["UP1", "qvalue"] == pytest.approx(0.002) # max of the two
assert isinstance(summary, ConcordanceSummary)
assert summary.n_genes_tested == 4 and summary.n_concordant == 2
assert summary.n_up == 1 and summary.n_down == 1
def test_build_signature_splits_directions_and_respects_top_n():
concordant = pd.DataFrame(
{
"log_fc": [3.0, 2.0, 1.0, -1.0, -2.0],
"qvalue": [0.001, 0.002, 0.003, 0.004, 0.005],
},
index=["UP1", "UP2", "UP3", "DOWN1", "DOWN2"],
)
prov = SignatureProvenance(
studies=[
StudyProvenance(geo_accession="GSE1", n_disease=12, n_healthy=12,
platform="P", tissue="whole blood", method="welch"),
StudyProvenance(geo_accession="GSE2", n_disease=15, n_healthy=11,
platform="P", tissue="whole blood", method="welch"),
],
concordance=ConcordanceSummary(n_genes_tested=100, n_concordant=5, n_up=3, n_down=2),
created_date="2026-06-23",
)
sig = build_signature(
concordant, prov, tier=ConfidenceTier.A,
tier_rationale="Two-study concordance", limitations=["cell-composition confound"],
top_n=2,
)
assert [g.gene for g in sig.up_regulated] == ["UP1", "UP2"] # top 2 up by qvalue
assert [g.gene for g in sig.down_regulated] == ["DOWN1", "DOWN2"]
assert sig.confidence_tier == ConfidenceTier.A

View File

@@ -1,14 +1,13 @@
"""Tests for the matching engine and provenance logic. """Tests for the matching engine and provenance logic.
The headline test (PLAN.md §6, Week 3 task 4) verifies connectivity scoring against a known Connectivity tests (PLAN.md §6, Week 3 task 4) pin the weighted-KS scorer against hand-built
reference within tolerance; it is marked xfail until the scorer is implemented in Week 3. reference profiles. The tier-assignment tests pin the rules from PLAN.md §3 so the most
The tier-assignment tests run today — they pin the rules from PLAN.md §3 so the most
commercially important design decision can't silently drift. commercially important design decision can't silently drift.
""" """
from __future__ import annotations from __future__ import annotations
import pandas as pd
import pytest import pytest
from src.provenance import ConfidenceTier, assign_tier from src.provenance import ConfidenceTier, assign_tier
@@ -52,14 +51,103 @@ class TestAssignTier:
assert assign_tier(**kwargs) == ConfidenceTier.B assert assign_tier(**kwargs) == ConfidenceTier.B
@pytest.mark.xfail(reason="Connectivity scoring implemented in Week 3 (notebook 04).", strict=True) class TestConnectivityScore:
def test_connectivity_score_matches_reference(): """Reference checks for the weighted-KS connectivity score (PLAN §6 Week 3 task 4).
"""Verify connectivity scoring against a CMap/cmapPy reference within tolerance.
PLAN.md §6, Week 3 task 4. Replace this body with a known reference example Query: up = {U1, U2}, down = {D1, D2}. We build drug profiles with a known relationship to
(disease up/down sets + drug signature -> expected score) once the scorer exists. the query and assert the sign/ordering the CMap convention requires.
""" """
from src.scoring import connectivity_score
score = connectivity_score(up_genes=[], down_genes=[], drug_signature=None) # noqa UP = ["U1", "U2"]
assert score == pytest.approx(0.0, abs=1e-6) DOWN = ["D1", "D2"]
@staticmethod
def _profile(values: dict[str, float]) -> pd.Series:
# 20 filler genes at ~0 so the query genes sit clearly at the extremes.
base = {f"N{i}": 0.01 * ((i % 5) - 2) for i in range(20)}
base.update(values)
return pd.Series(base)
def test_perfect_reversal_is_strongly_negative(self):
from src.scoring import connectivity_score
# Drug pushes disease-up genes DOWN (very negative) and disease-down genes UP (very
# positive) => reversal => negative connectivity.
prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
assert connectivity_score(self.UP, self.DOWN, prof) < -0.4
def test_perfect_mimic_is_strongly_positive(self):
from src.scoring import connectivity_score
prof = self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7})
assert connectivity_score(self.UP, self.DOWN, prof) > 0.4
def test_reversal_beats_mimic_and_null(self):
from src.scoring import connectivity_score
rev = connectivity_score(self.UP, self.DOWN, self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7}))
mimic = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7}))
null = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 0.2, "U2": -0.1, "D1": 0.1, "D2": -0.2}))
assert rev < null < mimic
assert abs(null) < abs(rev)
def test_same_sign_enrichment_returns_zero(self):
from src.scoring import connectivity_score
# Both up- and down-sets at the top => same-sign ES => ambiguous => 0 (WTCS rule).
prof = self._profile({"U1": 8, "U2": 7, "D1": 6, "D2": 5})
assert connectivity_score(self.UP, self.DOWN, prof) == 0.0
def test_genes_absent_from_profile_are_ignored(self):
from src.scoring import connectivity_score
prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
# Adding a query gene not in the profile must not change the score.
s1 = connectivity_score(self.UP, self.DOWN, prof)
s2 = connectivity_score(self.UP + ["NOT_IN_PROFILE"], self.DOWN, prof)
assert s1 == pytest.approx(s2)
class TestMechanisticPrior:
def test_counts_distinct_sickle_pathways(self):
from src.scoring import mechanistic_prior
# ribonucleotide reductase (hydroxyurea) -> hbf_epigenetic category.
assert mechanistic_prior(["Ribonucleoside-diphosphate reductase RR1"]) == 1.0
# DNMT (epigenetic) + hemoglobin -> two categories.
assert mechanistic_prior(["DNA (cytosine-5)-methyltransferase 1", "Hemoglobin subunit beta"]) == 2.0
assert mechanistic_prior([]) == 0.0
assert mechanistic_prior(["Some unrelated kinase"]) == 0.0
class TestTauCalibration:
"""tau should reward a SPECIFIC reverser and give a near-zero score to a noise drug."""
@staticmethod
def _matrix() -> pd.DataFrame:
genes = [f"U{i}" for i in range(5)] + [f"D{i}" for i in range(5)] + [f"G{i}" for i in range(40)]
rng_vals = {g: 0.01 * ((hash(g) % 7) - 3) for g in genes} # tiny deterministic noise
# specific reverser: query-up genes at the bottom, query-down at the top, rest ~0
specific = dict(rng_vals)
for i in range(5):
specific[f"U{i}"] = -8 - i
specific[f"D{i}"] = 8 + i
noise = dict(rng_vals)
return pd.DataFrame([specific, noise], index=["specific", "noise"])[genes]
def test_specific_reverser_has_strongly_negative_tau(self):
from src.scoring import tau_calibrate
up = [f"U{i}" for i in range(5)]
down = [f"D{i}" for i in range(5)]
out = tau_calibrate(up, down, self._matrix(), n_null=300, seed=0)
# Ranked by spec_z (continuous); the specific reverser is the most negative.
assert out.loc["specific", "spec_z"] < -2
assert out.loc["specific", "spec_z"] < out.loc["noise", "spec_z"]
assert out.loc["specific", "tau"] < -50 # tau also flags it (may saturate near -100)
assert out.loc["specific", "rank"] == 1
def test_rank_drugs_orders_by_reversal():
from src.scoring import rank_drugs
genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]
base = {g: 0.0 for g in genes}
reverser = {**base, "U1": -8, "U2": -7, "D1": 8, "D2": 7}
mimic = {**base, "U1": 8, "U2": 7, "D1": -8, "D2": -7}
matrix = pd.DataFrame([reverser, mimic], index=["reverser", "mimic"])
ranked = rank_drugs(["U1", "U2"], ["D1", "D2"], matrix)
assert ranked.loc["reverser", "rank"] == 1
assert ranked.loc["reverser", "connectivity_score"] < ranked.loc["mimic", "connectivity_score"]