Compare commits

...

7 Commits

Author SHA1 Message Date
817bcda7dc Structure-binding track: scaffold + ligand-retrieval baseline
Start the structure-based binding branch (PLAN §12), baseline-first.

- src/binding.py: validated RDKit ligand retrieval (morgan_fp, tanimoto,
  retrieve_nearest = the §12.9 engine) + dock() stub documenting the
  blocked ARM-Mac toolchain
- scripts/binding_ligand_baseline.py: 300 drugs vs known binders
- docs/structure_binding_notes.md: status, toolchain blocker, next steps
- pyproject: [structure] extra (rdkit); data/raw/structures/ for PDBs

Step-0 finding: retrieval engine VALIDATED on in-set classes
(decitabine->azacitidine 0.62; vorinostat->scriptaid/belinostat) but the
distinctive binders voxelotor/mitapivat have no analog in our 300-drug
set (Tanimoto ~0.2). Needs (a) bigger library, (b) real docking (§12.3),
which is blocked on the ARM-Mac docking toolchain (§12.6 pitfall 4).
Structures 5E83 (Hb+voxelotor) and 8XFD (PKR+mitapivat) fetched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:53:27 +02:00
6c2c71d73d PLAN §12.9: leave door open for generative-guided retrieval
Reframe de novo generation into the repurposing frame per the founder's
idea: use a pocket-conditioned generative model (TargetDiff/DiffSBDD/
Pocket2Mol) to propose an idealised binder as a SEARCH BEACON, then
retrieve the nearest EXISTING drugs by chemical similarity (Tanimoto/
embedding) as repurposing candidates — the generated molecule is never
synthesised.

Caveats kept honest: generated molecules used only as beacons (often
synthetically invalid); similarity != activity, so retrieved neighbours
still must be docked + pass the binding recovery test; open question
whether it beats brute-force docking the existing library. Explore only
after the §12.3-12.4 docking baseline is validated. §12.7 exclusion
reworded to point here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:43:25 +02:00
7449dbeefb Scope Phase 2 structure-based binding track into PLAN (§12)
Add a scoped (not committed) follow-on track pivoting modality from
expression-connectivity to structure-based drug-target binding, motivated
by the empirical finding that the expression modality is signal-dead for
this task (relational-only supervised AUC = 0.49, chance).

§12 covers: the evidence for the pivot, a sickle-specific druggable target
shortlist with known-binder positive controls (Hb/voxelotor, PKR/mitapivat,
DNMT1/decitabine, LSD1, HDAC, EHMT2, PDE9), method (classical docking
baseline -> AF3-class co-folding: Boltz-2/Chai-1/DiffDock), a pre-registered
binding recovery test, integration with the expression layer as the real
prize, honest pitfalls (binding != efficacy, BCL11A untractable, GPU breaks
the all-local assumption), and open decisions before committing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:40:18 +02:00
649f617019 Phase D: supervised cross-disease (0.925 AUC degree-bias mirage)
Train GradientBoosting on 300 drugs x 839 GEO disease signatures with
Repurposing-Hub indications as labels (432 positives), disease-grouped CV.

Finding: 0.925 CV AUC looks like a win but is a MIRAGE. Feature
importances are all drug-level (drug_std 0.33, drug_mean 0.30,
broadness 0.17); drug-disease connectivity importance = 0.01. The model
learned a drug-POPULARITY prior, not disease-specific matching. On
held-out sickle it ranks hydroxyurea 231/300 (worse than baseline) and
tops out with promiscuous drugs (dexamethasone, methotrexate). Classic
degree-bias trap. Connectivity also has ~chance AUC (0.51) for predicting
approved indications.

Both obvious approaches now fail instructively: unsupervised = specificity
ceiling; naive supervised = degree bias. Real progress needs degree-
debiased training + much larger clean labels (a research effort).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:31:32 +02:00
0ce688449d Phase A: reference-library tau (negative result on specificity)
Calibrate sickle connectivity against a real disease-signature
reference population (Enrichr Disease_Signatures_from_GEO, 141 diseases)
instead of random gene-set nulls — proper CMap tau.

Finding: hydroxyurea still recovers (rank 25, top 8%), but negative-
control specificity is UNCHANGED (2/5; norethindrone + ciprofloxacin
still top). The reference-calibrated ranking is nearly identical to the
random-null ranking. Third independent fix (after gene-space expansion
and composition adjustment) that recovers hydroxyurea but does NOT fix
specificity.

Conclusion: unsupervised connectivity has a specificity ceiling — it
cannot distinguish therapeutic reversal from coincidental transcriptional
anti-correlation. Breaking it needs external signal (supervised labels
or mechanistic filtering), not more calibration. Disease-signature
library cached at data/raw/disease_sigs/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:19:26 +02:00
b62048614d Experiment: composition-adjusted signature (negative result)
Test whether fixing the cell-composition confound rescues the recovery
test. GSE35007 measured WBC/RBC/MCV per sample, so adjust DE directly
for them (per-gene OLS: expression ~ disease + WBC + RBC + MCV + age +
sex) — a measured-covariate deconvolution, compared vs unadjusted.

Finding (negative, informative): adjustment HURT hydroxyurea (rank
20 -> 35, fails top-10%) — it was recovering partly via composition
genes — and did NOT fix negative-control specificity (still 2/5). Only
gain: top hits become more mechanistically coherent (resveratrol,
aspirin enter). Conclusion: cleaning the disease signature does not
rescue connectivity; the binding constraint is matching-side (needs a
reference-signature library for proper specificity calibration), which
is a multi-disease investment. v1.1 signature NOT replaced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:05:08 +02:00
3417f85eb1 v1.1: full gene space + specificity z-score; hydroxyurea recovers
Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
  lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
  (spec_z = SDs below a 1,000 random-query null). Primary ranking is
  now spec_z. (Textbook tau saturated at +/-100 for a coherent query —
  documented; needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming the v1 failure was the landmark bottleneck not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3 —
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:57:30 +02:00
18 changed files with 1105 additions and 150 deletions

140
PLAN.md
View File

@@ -426,3 +426,143 @@ This MVP exists in a broader strategic context that was built through ~7 expert
- **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product. - **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.
The MVP's job is to produce one credible result. Everything else follows from that. The MVP's job is to produce one credible result. Everything else follows from that.
---
## 12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
> **Status: scoped, not committed.** This is a follow-on track proposed *after* the MVP and its
> follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what
> those experiments empirically showed. Read §911 and the experiment commits first.
### 12.1 Why pivot modality (the evidence, not a hunch)
The expression-connectivity approach was built, validated, and pushed hard (gene-space
expansion, cell-composition deconvolution, reference-library tau, supervised learning). The
empirical verdict:
- Connectivity **recovers hydroxyurea** (top ~68%) but **cannot achieve specificity**
unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Unfixed by four
independent methods.
- A supervised model on indication labels hit **0.925 CV AUC** but it was a **degree-bias
mirage**: it learned drug popularity, not disease matching (it ranked hydroxyurea *231/300*).
- The decisive test: with drug-popularity features removed, the model trained on the actual
drugdisease connectivity scored **AUC 0.491 — pure chance**. **The expression-connectivity
modality contains essentially no disease-specific therapeutic signal for this task.**
This is a *signal* problem, not a *model* problem no amount of model sophistication (diffusion,
GNNs, etc.) extracts signal that isn't in the data. The response is to **change modality** to one
with a strong, physical, drug-specific signal: **does a molecule bind a sickle-relevant target?**
A drug that binds HbS is mechanistically specific by construction the opposite of a coincidental
expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is
itself a diffusion model) actually has traction, because structure has physical ground truth.
### 12.2 Targets (sickle-specific, druggable, structurally characterised)
Small molecules only 2). Curated shortlist with public structures and, crucially, **known
small-molecule binders to serve as positive controls**:
| Target | Mechanism in sickle | Known binder (positive control) |
|---|---|---|
| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | **voxelotor** (binds α-Val1) |
| PKR (PKLR, red-cell pyruvate kinase) | Activator 2,3-BPG O2 affinity | **mitapivat**, etavopivat |
| DNMT1 | HbF induction (de-repress γ-globin) | **decitabine**, azacitidine |
| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
| HDAC1/2 | HbF induction | vorinostat, panobinostat |
| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
| PDE9 | cGMP, anti-adhesion | PF-04447943 (sickle trial) |
Hard/excluded for v1: **BCL11A** (transcription factor, no classic pocket the γ-globin master
repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.
### 12.3 Method (baseline → generative co-folding)
1. **Prepare structures.** Pull target structures from the PDB; AF3/Boltz-predict any missing.
2. **Prepare ligands.** Reuse the existing ~300-drug set (we already have canonical SMILES from
ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
3. **Dock + score**, in increasing sophistication:
- **Baseline:** classical docking (AutoDock Vina / smina) fast, CPU, well-understood.
- **Generative co-folding:** an open AlphaFold3-class model **Boltz-2** (predicts the
proteinligand complex *and* a binding-affinity estimate, MIT-licensed), **Chai-1**, or
**DiffDock** (a diffusion model for docking the legitimate home for the "diffusion"
instinct). These predict the bound pose directly and tend to beat classical docking.
- Report both; the baseline keeps us honest about whether the ML model adds anything.
### 12.4 Validation (a real recovery test, like §6 Week 4)
Pre-register before scoring: **the known structure-based sickle drugs must rank as top binders to
their targets** voxelotorhemoglobin, mitapivatPKR, decitabineDNMT1. Negative controls
(unrelated drugs) must *not* bind these pockets. This is a cleaner recovery test than the
expression one, because binding is mechanistically specific it should not have the
coincidental-reverser problem that sank the connectivity approach.
### 12.5 The real prize — integrate, don't replace
The long-term value is **both modalities together**: a candidate that *reverses the disease
signature* (expression) **and** *binds a sickle-relevant target* (structure) is far more credible
than either alone. Structure supplies the specificity the expression layer lacks; expression
supplies the systems-level, target-agnostic view structure lacks. The platform thesis 11)
two databases + a matching engine extends naturally to a third (structures) feeding the same
confidence-tiered data layer.
### 12.6 Honest pitfalls (do not ignore)
1. **Binding ≠ efficacy.** A molecule can bind and do nothing therapeutic. Structure-based hits
are still hypotheses (cf. §9.7).
2. **Only covers the enzyme/pocket subset.** Sickle's biggest lever (γ-globin reactivation via
BCL11A) is largely transcriptional and not small-molecule-tractable structure-based screening
is blind to it. Be explicit about coverage.
3. **Docking/affinity accuracy is limited.** Pose prediction is decent; absolute affinity is hard.
Validate on known binders before trusting novel scores.
4. **Compute.** AF3-class models are GPU-heavy; the local Mac Studio 2) may not suffice this
track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
5. **Moat.** Structures and tools are public; the proprietary value is the curated target list,
the integration with the expression layer, and provenance/tiering not the docker.
### 12.7 Explicitly NOT in this track
Free energy perturbation / MD-based affinity; covalent docking; **de novo generation of molecules
as final candidates to synthesise** (design, not repurposing but see §12.9 for the
generate-then-retrieve hybrid, which *is* repurposing); BCL11A or any non-pocket target;
biologics; combination binding.
### 12.8 Open decisions before committing
- **Tooling:** classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend:
baseline first, for an honest reference the lesson of the whole expression arc.)
- **Compute:** secure a GPU environment (the "all local" §2 assumption breaks here).
- **Scope of v1:** the 7-target shortlist above, or start with just Hb + PKR (the two with the
cleanest positive controls) to de-risk the harness before scaling targets.
### 12.9 Door left open — generative-guided retrieval (generate → match existing)
A legitimate way to bring generative models *into the repurposing frame* (vs the design frame
excluded in §12.7): don't generate molecules to synthesise generate them as a **search beacon**.
**The idea.** Use a pocket-conditioned generative model (target-conditioned diffusion e.g.
TargetDiff, DiffSBDD, Pocket2Mol) to propose idealised binders for a sickle target. Then retrieve
the **nearest existing drugs** to those proposals by chemical similarity (Tanimoto over Morgan
fingerprints, or a learned molecular embedding). The retrieved neighbours real, already-approved
or clinical molecules are the repurposing candidates. The generated molecule is never made; it
only *defines a region of chemical space* worth searching. This is the user-proposed framing and
it is sound: "generate the ideal, then find what we already have nearby."
**Why it could add value.** It can point at scaffolds / regions a known-binder-seeded or
brute-force docking sweep would miss, and it makes the target's binding requirements explicit as
geometry rather than as a single reference ligand.
**Honest caveats (why it's a *door*, not a commitment).**
1. **Generated molecules are often synthetically unrealistic / invalid** which is exactly why
they must be used only as beacons, never as candidates.
2. **Similarity ≠ activity.** Activity cliffs mean a near-neighbour of a good binder can be inert.
So retrieved neighbours do **not** bypass validation they must still be docked/scored 12.3)
and clear the binding recovery test 12.4). The generative step *proposes*; it does not *prove*.
3. **Marginal-value question.** Directly docking the whole existing library 12.3) already covers
chemical space. Whether generate-then-retrieve beats that by efficiency or by surfacing
non-obvious scaffolds is an open empirical question and needs a head-to-head before it earns
real investment.
4. **Only as good as the pocket conditioning** of the generator, and the chemistry of the target.
**Status:** explore only *after* the §12.312.4 docking harness works and is validated on the
known binders same discipline as everywhere else: prove the baseline, then test whether the
fancier method adds anything.

View File

View File

@@ -61,9 +61,10 @@ Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action. 38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's - **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
not-single-source rule). not-single-source rule).
- **Signature↔landmark overlap:** only 56/477 (12%) of the disease signature genes are LINCS - **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978
landmarks, so connectivity scoring (Week 3) uses a 30-up/26-down query. The erythroid hallmark landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger
genes (CA1, AHSP, SLC4A1, HBG) are NOT landmarks. This is a key limitation for the recovery test. space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from
LINCS entirely and remain unscoreable.
- Reproduce: `week2_curate_drugset.py``week2_chembl.py` → download Level-5 GCTX → - Reproduce: `week2_curate_drugset.py``week2_chembl.py` → download Level-5 GCTX →
`week2_lincs_extract.py``week2_assemble.py`. `week2_lincs_extract.py``week2_assemble.py`.

View File

@@ -34,20 +34,28 @@ Source: PLAN.md §9.
7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses 7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
to test, not discoveries. Use careful language in any write-up. to test, not discoveries. Use careful language in any write-up.
8. **Only 12% of the signature is LINCS-scorable (56/477 genes).** The 978 landmark genes (from 8. **Gene-space bottleneck (v1 → fixed in v1.1).** v1 scored on only the 978 landmark genes (12%
cancer cell lines) miss the erythroid hallmark genes (CA1, AHSP, SLC4A1, HBG). Connectivity signature overlap) — the main driver of the v1 failure. v1.1 uses the full 12,328-gene space
scoring runs on a thin inflammation/metabolic slice — the single biggest driver of the (85% overlap) and recovers hydroxyurea. HBG1/HBG2 remain absent from LINCS entirely.
recovery-test failure. v2 fix: signature prediction or a mechanism graph to score the other 88%.
## Recovery test outcome (Week 4) 9. **No reference-signature library for tau.** Textbook CMap tau saturated at ±100 (a coherent
query always out-connects random gene sets). v1.1 substitutes a per-drug specificity z-score.
Proper tau needs a library of real reference signatures — a v2 / curated-data item.
The MVP **failed** all three pre-registered criteria on the primary raw ranking (hydroxyurea 10. **Negative-control criterion may be invalid for connectivity scoring.** Unrelated drugs
rank 40/top 13%; L-glutamine rank 100/WTCS=0; 1/5 negative controls in bottom half). The failure (norethindrone, ciprofloxacin) rank as top specific reversers — connectivity measures
is fully attributable to signature/assay data limitations above, not the matching algorithm. See expression reversal, not therapeutic relatedness.
## Recovery test outcome
Pre-registered test (**v1, confirmatory**): **FAILED** all three criteria (hydroxyurea rank
40/top 13%; L-glutamine rank 100; 1/5 negative controls bottom-half). Post-hoc (**v1.1,
exploratory**): hydroxyurea recovers to rank 18 (top 6%, passes), but L-glutamine (rank 213, does
not reverse) and negative controls (2/5) still fail → overall still FAIL. See
`recovery_test_report.md`. `recovery_test_report.md`.
| Drug | Issue | Handling | | Drug | Issue | v1.1 status |
|---|---|---| |---|---|---|
| hydroxyurea | HbF mechanism not in scorable gene space | scored (rank 40); recovered only by prior-weighted ranking | | hydroxyurea | needed the full gene space | rank 18 (top 6%) — recovered post-hoc |
| L-glutamine | signature present but WTCS ambiguous (=0) | scored (rank 100); no reversal signal | | L-glutamine | metabolite, no reversal signal (positive connectivity) | rank 213 — genuine negative |
| all 300 | had LINCS signatures | 0 marked "not scored" — coverage was not the issue; specificity was | | neg controls | reverse the generic inflammation signature | 2/5 bottom-half — criterion questionable |

View File

@@ -1,7 +1,7 @@
# Sickle Cell Repurposing — Recovery Test Report # Sickle Cell Repurposing — Recovery Test Report
> **Status: COMPLETE.** Reproduce with `scripts/week1_*` → `week2_*` → `week3_scoring.py` → > **Status: COMPLETE (v1 confirmatory + v1.1 exploratory).** Reproduce with `scripts/week1_*` →
> `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist. > `week2_*` → `week3_scoring.py` → `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.
## Pre-registered success criteria ## Pre-registered success criteria
@@ -12,118 +12,116 @@ The MVP passes if:
missing LINCS signature, **AND** missing LINCS signature, **AND**
- At least **4 of 5** negative-control drugs rank in the **bottom half**. - At least **4 of 5** negative-control drugs rank in the **bottom half**.
_Pre-registered in the scaffold commit (`b731478`) before any scoring was run. Primary ranking _Pre-registered in the scaffold commit (`b731478`) before any scoring. **Primary (confirmatory)
= raw connectivity. The 5 negative controls were pre-specified by category rule (one per analysis = v1**: 978 landmark genes, weighted connectivity score (WTCS). The 5 negative controls
category, alphabetically first available) without inspecting ranks._ were pre-specified by category rule without inspecting ranks._
--- ---
## Section 1 — Methodology ## Section 1 — Methodology
We built a sickle cell disease signature from **two independent whole-blood microarray studies** A sickle cell disease signature was built from **two whole-blood microarray studies** (GSE35007
(GSE35007, Illumina, SS vs AA; GSE16728, Affymetrix, patient vs control), keeping the **671 Illumina SS-vs-AA; GSE16728 Affymetrix patient-vs-control), keeping the **671 genes concordant**
genes concordant** (q<0.05, same direction) across both a cross-platform, cross-population across both (q<0.05, same direction) a cross-platform Tier-A signature (250 up / 227 down).
Tier-A signature (250 up / 227 down). We built profiles for **300 small molecules** (2 Profiles were built for **300 small molecules** (2 ground-truth; 32 related-mechanism; 26
ground-truth: hydroxyurea, L-glutamine; 32 related-mechanism; 26 negative controls; 240 random), negative controls; 240 random), each with a **LINCS L1000** consensus signature (mean Level-5
each with a consensus **LINCS L1000** signature (mean of Level-5 MODZ z-scores across cell MODZ across cell lines, both CMap phases). Drugs were ranked by **CMap connectivity scoring**
lines, 978 landmark genes, both CMap phases). We ranked drugs by **CMap connectivity scoring** (Kolmogorov-Smirnov, Lamb 2006 / Subramanian 2017): negative = reversal = candidate.
(weighted-KS, Lamb 2006 / Subramanian 2017): strongly negative = strong reversal of the disease
signature = candidate. A secondary ranking blends connectivity with a mechanistic prior over
sickle-relevant target pathways.
## Section 2 — Recovery test result — **FAIL** (primary ranking) **v1 (pre-registered/confirmatory):** scored on the 978 landmark genes with WTCS.
**v1.1 (post-hoc/exploratory):** after v1 failed, two changes were made to diagnose why (a)
score on the full **12,328-gene** space (landmark overlap 12% 85%, bringing the erythroid
markers in); (b) add a **per-drug specificity z-score** (`spec_z`): how many SDs the real
connectivity is below a null of 1,000 random queries of the same size against that drug. Because
these changes followed inspection of the v1 result, **v1.1 is exploratory, not a confirmatory
test of the pre-registered hypothesis.**
| Drug | Rank | Percentile | Pass? | ## Section 2 — Recovery test result
|---|---|---|---|
| Hydroxyurea | 40 / 300 | top 13.3% | (needs top 30) |
| L-glutamine | 100 / 300 | top 33.3% | (WTCS=0, ambiguous; has a signature so not "missing") |
Negative controls (pre-specified; expected: bottom half): | Criterion | v1 (confirmatory) | v1.1 (exploratory) |
|---|---|---|
| Hydroxyurea top-10% (≤30) | rank **40** (13.3%) | rank **18** (6.0%) |
| L-glutamine top-25% (≤75) | rank 100, WTCS=0 | rank 213, spec_z=+0.98 |
| 4/5 neg controls bottom-half | 1/5 | 2/5 |
| **Overall** | **FAIL** | **FAIL** (but hydroxyurea recovered) |
| Control | Category | Rank | Bottom half? | v1.1 negative controls: clotrimazole 258 ✅, astemizole 211 ✅, azithromycin 142 ❌,
|---|---|---|---| ethinyl-estradiol 114 ❌, caffeine 77 ❌.
| clotrimazole | antifungal | 89 | |
| astemizole | antihistamine | 291 | |
| azithromycin | antibiotic | 82 | |
| ethinyl-estradiol | hormone | 98 | |
| caffeine | misc | 84 | |
**Only 1/5 negative controls in the bottom half (need ≥4).** **Honest reading.** The **pre-registered test FAILED (v1).** The post-hoc v1.1 changes
**recover hydroxyurea** (rank 40 18, passing top-10%) strong evidence that the v1 failure was
driven by the 978-landmark bottleneck, not the algorithm. But two failures survive into v1.1, and
both are now precisely diagnosed:
**Overall: FAIL on all three pre-registered criteria.** This is reported as-is, without 1. **L-glutamine does not reverse the signature** (positive connectivity, spec_z=+0.98). This is
adjustment. For context only (not the pre-registered criterion): the secondary intrinsic to its LINCS data a metabolite with no reversal signal not a coverage gap. More
mechanistic-prior ranking places hydroxyurea at **rank 7 (top 2.3%)** but that ranking uses genes cannot fix it.
prior knowledge of the drug's target, so it cannot be claimed as a blind recovery. 2. **The negative-control criterion is arguably invalid for connectivity scoring.** Two
"negative controls" (norethindrone, ciprofloxacin) rank in the top 3 by spec_z. Connectivity
measures *expression reversal*, not *therapeutic relatedness* an antibiotic or contraceptive
can still down-regulate the inflammation genes that dominate the scorable signature. The test
design conflates the two.
**Why it failed — the honest diagnosis.** The disease signature is dominated by erythroid / A note on the calibration: textbook CMap **tau** (percentile vs a reference population) was
reticulocyte biology (CA1, AHSP, SLC4A1) and the HbF axis that hydroxyurea actually acts on implemented but **saturated at ±100** here, because a coherent real query always out-connects
(HBG1/HBG2) was lost (flat in GSE35007; removed by GSE16728's globin-depleted prep). Worse, random gene sets proper tau needs a library of *real* reference signatures, which this MVP
only **56 of 477 signature genes (12%) are LINCS landmark genes** and none of the erythroid lacks. The continuous `spec_z` is the workable substitute.
hallmark genes are. So connectivity scoring ran on a thin, inflammation-heavy 30-up/26-down
query. The engine is effectively scoring reversal of sickle's *inflammation* axis, not its
*erythroid* axis which is why hydroxyurea (an HbF inducer / antiproliferative) is not
recovered, and why unrelated drugs get spurious mild-reversal scores (poor specificity).
## Section 3 — Top 10 candidates (raw connectivity) ## Section 3 — Top 10 candidates (v1.1 spec_z)
| Rank | Drug | Score | Known target / mechanism | Plausibility | | Rank | Drug | spec_z | Inclusion | Read |
|---|---|---|---|---| |---|---|---|---|---|
| 1 | laropiprant | 0.417 | Prostaglandin D2 receptor antagonist | Anti-inflammatory coherent with inflammation-axis reversal | | 1 | reserpic-acid | 3.80 | random | reserpine metabolite; non-obvious |
| 2 | BRD-K62768824 | 0.396 | (tool compound, no annotation) | Likely broad-effect false positive | | 2 | norethindrone | 3.78 | **negative control** | false positive (see §2) |
| 3 | BRD-K71353154 | 0.393 | (tool compound) | Likely false positive | | 3 | ciprofloxacin | 3.61 | **negative control** | false positive |
| 4 | lisinopril | 0.358 | ACE inhibitor | **Non-obvious; see §4** | | 4 | resveratrol | 3.46 | related-mechanism | antioxidant studied in SCD coherent |
| 5 | BRD-K53443165 | 0.358 | (tool compound) | Likely false positive | | 5 | BRD-K57490754 | 3.37 | random | tool compound |
| 6 | talnetant | 0.347 | Neurokinin-3 (NK3) receptor antagonist | No obvious sickle rationale | | 6 | anastrozole | 3.27 | random | aromatase inhibitor |
| 7 | BRD-K46936109 | 0.342 | (tool compound) | Likely false positive | | 710 | BRD-* / palmitoylethanolamide | ~3.1 | random | mostly tool compounds |
| 8 | lawsone | 0.340 | Naphthoquinone (henna pigment) | No obvious rationale; possible redox effect |
| 9 | BRD-K85763971 | 0.338 | (tool compound) | Likely false positive |
| 10 | BRD-K36516410 | 0.323 | (tool compound) | Likely false positive |
As anticipated (PLAN §9.4), the raw top-10 is dominated by unannotated broad-effect tool That two negative controls outrank hydroxyurea is the single most informative result here see §4.
compounds these are **not** credible candidates and are not over-interpreted.
## Section 4 — One non-obvious candidate worth investigating ## Section 4 — One non-obvious result worth investigating
**Lisinopril (ACE inhibitor), rank 4.** This is the most interesting non-obvious hit: ACE The most useful finding is **not** a candidate drug but the **negative-control failure**:
inhibitors are already used clinically in sickle cell disease for **renal protection** unrelated drugs (norethindrone, ciprofloxacin) score as strong specific reversers. This is a
(reducing albuminuria / progression of sickle nephropathy), via mechanisms independent of the real, generalizable lesson for a signature whose *scorable* portion is generic
HbF pathway. Surfacing an agent with a genuine, mechanistically distinct sickle-cell rationale inflammation/metabolic genes, connectivity rewards any broad transcriptional perturbation that
from an inflammation/vascular-flavoured signature is a small but real signal that the matching touches those genes. The honest implication: **this signature is not specific enough to
approach can point at non-obvious biology. **This is a computational hypothesis, not a discriminate true repurposing candidates from incidental expression reversers.** Of the
discovery**, and the connectivity rationale here (inflammation-axis reversal) is not the same as plausibly-real hits, **resveratrol (rank 4)** an antioxidant with prior sickle cell literature
lisinopril's known renal mechanism, so the match should be treated as suggestive only. is the most defensible, but it is a hypothesis, not a discovery.
## Section 5 — Honest limitations ## Section 5 — Honest limitations
1. **Cell-composition confound** the whole-blood signature is dominated by reticulocyte/ 1. **Pre-registered test failed; the pass is post-hoc.** v1.1's hydroxyurea recovery is
erythroid markers (composition, not pure disease-state regulation). v2 needs deconvolution. exploratory and must be re-validated on a held-out disease before any claim is made.
2. **Missing HbF axis** HBG1/HBG2 absent (globin depletion + flat in GSE35007), so the 2. **Missing HbF axis** HBG1/HBG2 are absent from LINCS entirely (not just landmarks), so the
signature cannot encode the pathway hydroxyurea acts on. pathway hydroxyurea acts on can never be scored by this method.
3. **12% signature↔landmark overlap** only 56/477 genes are LINCS landmarks; the erythroid 3. **Signature specificity** scorable genes are inflammation/metabolic; negative controls
hallmark genes are not scorable. The query collapses to a generic inflammation/metabolic slice. reverse them too. Connectivity therapeutic relatedness.
4. **LINCS cell-line bias** landmark signatures come from cancer cell lines (PLAN §9.2); poorly 4. **Cell-composition confound** the whole-blood signature is reticulocyte-dominated.
suited to a blood disease. 5. **LINCS cancer-cell-line bias**, and **no reference-signature library** for proper tau.
5. **Poor negative-control specificity** unrelated drugs received mild reversal scores; the 6. **No mechanistic validation** all hits are computational hypotheses.
thin query yields a noisy connectivity distribution.
6. **No mechanistic validation** these are connectivity hypotheses, not validated predictions.
## Section 6 — What v2 would fix ## Section 6 — What v2 would fix
- **Cell-type deconvolution** of the disease signature to separate disease-state regulation from - **A reference-signature library** to make tau (proper specificity calibration) work the
composition, recovering specificity. single biggest fix to the negative-control problem, and a direct use of the curated-data moat.
- **A non-globin-depleted, RNA-seq whole-blood study** to retain the HbF axis. - **Cell-type deconvolution** + a non-globin-depleted RNA-seq study to recover a more specific,
- **Signature prediction** (DeepCE-style) or a mechanism/knowledge graph to score the ~88% of HbF-containing signature.
the signature that has no LINCS landmark the single biggest lever on this result. - **Signature prediction / mechanism graph** to score genes with no LINCS measurement.
- **A second disease** to test generalization (sickle results alone do not prove the platform - **A second disease** to test generalization and to honestly re-validate the v1.1 method
PLAN §9.5). (PLAN §9.5).
--- ---
### Bottom line ### Bottom line
The pipeline is reproducible end-to-end and the method is sound, but on this signature it **does The pre-registered recovery test **failed**. Post-hoc diagnosis shows the dominant cause was a
not recover the known sickle cell drugs**. The failure is fully explained by signature/assay fixable gene-space bottleneck correcting it **recovers hydroxyurea** but also surfaces a
data limitations (erythroid biology lost; 12% landmark overlap), not by a flaw in the matching deeper, genuine limitation: this whole-blood signature is **not specific enough** for
algorithm. The most valuable output of this MVP is therefore a precise, honest map of *what data connectivity scoring to separate real candidates from incidental reversers (negative controls
quality the method needs to work* which is exactly the de-risking the proof-of-concept was rank at the top). The MVP's real deliverable is a precise, honest map of *what it takes to make
meant to deliver. this method work*: a more specific (deconvolved, HbF-containing) signature and a reference library
for calibration exactly the curated-data investments the platform thesis is built on.

View File

@@ -0,0 +1,34 @@
# Structure-based binding track — working notes
Branch `structure-based-binding`. Implements PLAN §12. Baseline-first, start with the two cleanest
targets (Hemoglobin + PKR), de-risk the harness before scaling.
## Status (2026-06-23)
**Toolchain check (PLAN §12.6 pitfall 4, confirmed real):**
- ✅ RDKit installs on ARM Mac — ligand side ready.
- ❌ AutoDock Vina does NOT pip-install on ARM Mac; no docking binary available. Docking (§12.3)
is **blocked on toolchain** — must resolve via conda/micromamba (`vina`/`smina`), a GPU AF3-class
model (Boltz-2/Chai-1/DiffDock), or an x86 Vina binary under Rosetta.
**Structures obtained:** `5E83` (hemoglobin + voxelotor), `8XFD` (PKR + mitapivat) in
`data/raw/structures/`.
**Step 0 — ligand-based retrieval baseline (`scripts/binding_ligand_baseline.py`):**
RDKit Tanimoto of our 300 drugs vs known sickle binders.
- Engine VALIDATED on in-set classes: `decitabine`→azacitidine (0.62); `vorinostat`→scriptaid
(0.42), belinostat (0.28). Correctly clusters DNMT1 / HDAC HbF-inducers.
- But voxelotor / mitapivat have **no analog** in our set (max Tanimoto ~0.200.26). A 300-drug
library is too sparse to contain look-alikes of distinctive scaffolds.
**Takeaways:**
1. Ligand retrieval works but needs a **bigger drug library** to be useful for distinctive targets.
2. The targets without in-set analogs (Hb, PKR) need **actual docking** (§12.3) — which scores
binding directly, no look-alike required. That is the gating next step, and it needs the
toolchain solved.
## Next steps
- [ ] Resolve the docking toolchain (recommend: micromamba + smina/vina, CPU, no GPU needed for baseline).
- [ ] Dock the known binders (voxelotor→5E83, mitapivat→8XFD) as positive controls (§12.4 recovery test).
- [ ] Expand the ligand library (full ChEMBL/LINCS) for retrieval to have reach.
- [ ] Only then: AF3-class co-folding (Boltz-2/DiffDock) vs the docking baseline; and §12.9 generative beacon.

View File

@@ -33,6 +33,12 @@ dev = [
"pytest>=8.0", "pytest>=8.0",
"ruff>=0.5", "ruff>=0.5",
] ]
# Structure-based binding track (PLAN §12). Docking tool (vina/smina) is NOT pip-installable on
# ARM Mac — install via conda/micromamba or use a GPU AF3-class model; see docs/structure_binding_notes.md.
structure = [
"rdkit>=2024.3",
"requests>=2.31",
]
[tool.setuptools.packages.find] [tool.setuptools.packages.find]
where = ["."] where = ["."]

View File

@@ -0,0 +1,66 @@
"""Structure-based track, step 0: ligand-based retrieval baseline (PLAN §12.9 engine).
Docking (§12.3) needs a toolchain that doesn't pip-install on ARM Mac (AutoDock Vina) — that's the
next dependency to solve. Meanwhile this runs now with pure RDKit: do any of our 300 drugs sit near
the KNOWN sickle binders (voxelotor, mitapivat, decitabine) in chemical space? This is the
retrieval engine §12.9 would point a generative beacon at, and a sanity check on the ligand data.
NOT docking and NOT a binding claim — chemical similarity only. Similarity != activity (§12.9).
"""
from __future__ import annotations
import pandas as pd
import requests
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator
RDLogger.DisableLog("rdApp.*")
MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
# Known sickle binders = positive-control beacons (target in parens).
BINDERS = ["voxelotor", "mitapivat", "decitabine", "vorinostat"]
def pubchem_smiles(name: str) -> str | None:
for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
try:
u = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON"
d = requests.get(u, timeout=30).json()["PropertyTable"]["Properties"][0]
if prop in d:
return d[prop]
except Exception:
continue
return None
def fp(smi: str):
if not isinstance(smi, str) or smi in ("-666", ""):
return None
m = Chem.MolFromSmiles(smi)
return MORGAN.GetFingerprint(m) if m else None
def main() -> None:
binder_smi = {b: pubchem_smiles(b) for b in BINDERS}
print("known-binder SMILES:", {k: (v[:34] + "..." if v else "MISSING") for k, v in binder_smi.items()})
drugs = pd.read_csv("data/processed/drug_set_v1.csv")[["pert_iname", "canonical_smiles", "inclusion_reason"]]
reason = dict(zip(drugs.pert_iname, drugs.inclusion_reason))
drug_fp = {r.pert_iname: fp(r.canonical_smiles) for r in drugs.itertuples()}
drug_fp = {k: v for k, v in drug_fp.items() if v is not None}
print(f"fingerprinted {len(drug_fp)}/{len(drugs)} drugs\n")
for b, smi in binder_smi.items():
bfp = fp(smi)
if bfp is None:
print(f"{b}: no SMILES\n"); continue
sims = sorted(((DataStructs.TanimotoSimilarity(bfp, v), k) for k, v in drug_fp.items()), reverse=True)
print(f"nearest drugs to {b}:")
for s, k in sims[:5]:
print(f" {s:.3f} {k:22s} [{reason.get(k,'?')}]")
print()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,137 @@
"""Experiment: composition-adjusted sickle signature, to fix specificity (option 1).
The v1 signature is confounded by cell composition (SS patients have very different WBC/RBC
than AA controls). GSE35007 *measured* those counts per sample, so we adjust the differential
expression for them directly (a measured-covariate alternative to estimated deconvolution):
expression ~ disease + WBC + RBC + MCV + age + sex (per gene, vectorized OLS)
We compare the composition-ADJUSTED signature against an UNADJUSTED single-study signature
(same samples, model without the covariates), score both with the v1.1 engine (full gene space
+ spec_z), and report the recovery test for each. Writes nothing to committed artifacts.
"""
from __future__ import annotations
import warnings
from pathlib import Path
import GEOparse
import numpy as np
import pandas as pd
from scipy.stats import false_discovery_control
from scipy.stats import t as tdist
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.disease import collapse_probes_to_symbols # noqa: E402
from src.scoring import tau_calibrate # noqa: E402
warnings.filterwarnings("ignore")
PROCESSED = Path("data/processed")
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
SYMBOL_COLS = ["Symbol", "ILMN_Gene", "Gene Symbol", "GeneSymbol"]
def cval(gsm, key):
for c in gsm.metadata.get("characteristics_ch1", []):
if c.lower().startswith(key.lower()):
return c.split(":", 1)[1].strip()
return None
def load_data():
gse = GEOparse.get_GEO(geo="GSE35007", destdir="data/raw/geo", silent=True)
meta = []
for gid, g in gse.gsms.items():
meta.append({"gsm": gid, "hb": cval(g, "hb phenotype"), "wbc": cval(g, "white blood cells"),
"rbc": cval(g, "red blood cells"), "mcv": cval(g, "mean corpuscular volume"),
"age": cval(g, "age"), "sex": cval(g, "Sex")})
meta = pd.DataFrame(meta).set_index("gsm")
for c in ["wbc", "rbc", "mcv", "age"]:
meta[c] = pd.to_numeric(meta[c], errors="coerce")
meta["disease"] = meta["hb"].map({"SS": 1.0, "AA": 0.0})
meta["sex_m"] = (meta["sex"] == "M").astype(float)
keep = meta[meta["hb"].isin(["SS", "AA"])].dropna(subset=["wbc", "rbc", "mcv", "age", "disease"])
expr = pd.DataFrame({gid: gse.gsms[gid].table.set_index("ID_REF")["VALUE"] for gid in keep.index})
expr = expr.apply(pd.to_numeric, errors="coerce").dropna(how="any")
if float(np.nanmax(expr.to_numpy())) > 50:
expr = np.log2(expr.clip(lower=0) + 1.0)
gpl = list(gse.gpls.values())[0]
col = next((c for c in SYMBOL_COLS if c in gpl.table.columns), None)
sym = gpl.table.set_index("ID")[col].astype(str).str.strip().replace({"": np.nan, "nan": np.nan})
return expr, keep, sym.dropna()
def ols_de(expr, design, disease_idx):
"""Vectorized per-gene OLS; return DE table (log_fc=disease coef, pvalue, qvalue)."""
X = design.to_numpy(dtype=float)
Y = expr.T.to_numpy(dtype=float) # samples x genes
n, p = X.shape
XtX_inv = np.linalg.pinv(X.T @ X)
B = XtX_inv @ X.T @ Y
resid = Y - X @ B
dof = n - p
sigma2 = (resid ** 2).sum(0) / dof
se = np.sqrt(sigma2 * XtX_inv[disease_idx, disease_idx])
t = B[disease_idx] / se
pval = 2 * tdist.sf(np.abs(t), dof)
out = pd.DataFrame({"log_fc": B[disease_idx], "pvalue": pval}, index=expr.index).dropna()
out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
return out
def make_signature(de, sym, expr, top_n=250):
de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
sig = de_sym[de_sym["qvalue"] < 0.05]
up = sig[sig["log_fc"] > 0].nsmallest(top_n, "qvalue").index.tolist()
down = sig[sig["log_fc"] < 0].nsmallest(top_n, "qvalue").index.tolist()
return up, down
def evaluate(label, up, down, lincs):
ranked = tau_calibrate(up, down, lincs, n_null=1000)
n = len(ranked)
top10, top25, half = int(n * .10), int(n * .25), n // 2
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
n_ov = len((set(up) | set(down)) & set(lincs.columns))
print(f"\n### {label}: {len(up)} up / {len(down)} down (query overlap {n_ov})")
print(f" hydroxyurea rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}")
print(f" L-glutamine rank {glut}/{n} (top {100*glut/n:.1f}%) top-25%? {glut <= top25}")
print(f" neg controls bottom-half: {n_bottom}/5 {negs}")
print(" top 8: " + ", ".join(
f"{name}[{r['inclusion_reason'][:3]}]" for name, r in ranked.nsmallest(8, "spec_z").iterrows()))
return ranked
def main():
expr, meta, sym = load_data()
print(f"loaded {expr.shape[1]} samples x {expr.shape[0]} probes; "
f"{int(meta.disease.sum())} SS / {int((meta.disease==0).sum())} AA")
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
base = pd.DataFrame({"intercept": 1.0, "disease": meta["disease"]}, index=meta.index)
adj = base.assign(wbc=meta["wbc"], rbc=meta["rbc"], mcv=meta["mcv"], age=meta["age"], sex_m=meta["sex_m"])
de_unadj = ols_de(expr, base, disease_idx=1)
de_adj = ols_de(expr, adj, disease_idx=1)
up_u, dn_u = make_signature(de_unadj, sym, expr)
up_a, dn_a = make_signature(de_adj, sym, expr)
# how much does adjustment change the gene set?
overlap = len((set(up_u) | set(dn_u)) & (set(up_a) | set(dn_a)))
print(f"\nsignature gene overlap unadjusted vs adjusted: {overlap}/{len(set(up_u)|set(dn_u))}")
evaluate("UNADJUSTED (GSE35007 only)", up_u, dn_u, lincs)
evaluate("COMPOSITION-ADJUSTED", up_a, dn_a, lincs)
if __name__ == "__main__":
main()

102
scripts/exp_genespace.py Normal file
View File

@@ -0,0 +1,102 @@
"""Experiment (v1.1): re-score on a larger LINCS gene space and re-run the recovery test.
v1 used only the 978 landmark genes (12% signature overlap). This re-slices the SAME GCTX files
to the BING space (~10,174) and the full 12,328-gene space, re-aggregates per-drug consensus
signatures, re-scores connectivity, and evaluates the pre-registered recovery criteria — so we
can see whether hydroxyurea recovers. Writes nothing to the committed v1 artifacts.
"""
from __future__ import annotations
import gzip
import io
import json
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import rank_drugs # noqa: E402
LINCS = Path("data/raw/lincs")
PROCESSED = Path("data/processed")
GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
def read_gz(name):
return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
def gene_ids_for_space(space: str):
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
if space == "bing":
g = g[g.pr_is_bing == 1]
# 'all' -> keep everything
ids = [str(x) for x in g.pr_gene_id]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
return ids, id_to_symbol
def extract(space, drug_names, gene_ids, id_to_symbol):
from cmapPy.pandasGEXpress.parse import parse
frames = []
for ph in (1, 2):
sig = read_gz(SIG_INFO[ph])
sig = sig[(sig.pert_type == "trt_cp") & (sig.pert_iname.isin(drug_names))]
if sig.empty:
continue
gct = parse(str(GCTX[ph]), rid=gene_ids, cid=sig.sig_id.tolist())
data = gct.data_df
s2d = dict(zip(sig.sig_id, sig.pert_iname))
frames.append(data.T.groupby(data.columns.map(s2d)).mean())
print(f" [{space}] phase {ph}: {sig.pert_iname.nunique()} drugs sliced", flush=True)
combined = pd.concat(frames).groupby(level=0).mean()
combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]
combined = combined.loc[:, ~combined.columns.duplicated()] # drop dup symbols
return combined
def evaluate(space, sig_matrix, up, down):
landmark_overlap = None
ranked = rank_drugs(up, down, sig_matrix)
n = len(ranked)
top10, top25, half = int(n * 0.10), int(n * 0.25), n // 2
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
glut_s = ranked.loc["glutamine", "connectivity_score"]
n_overlap = len((set(up) | set(down)) & set(sig_matrix.columns))
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
print(f"\n=== gene space: {space.upper()} ({sig_matrix.shape[1]} genes; query overlap {n_overlap}) ===")
print(f" hydroxyurea: rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}")
print(f" L-glutamine: rank {glut}/{n} (top {100*glut/n:.1f}%), WTCS={glut_s:.3f} top-25%? {glut <= top25}")
print(f" neg controls in bottom half: {n_bottom}/5 {negs}")
crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
print(f" OVERALL: {'PASS' if crit else 'FAIL'}")
print(" top 8:")
for name, r in ranked.nsmallest(8, "connectivity_score").iterrows():
print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} [{r['inclusion_reason']}]")
return ranked
def main():
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
up = [g["gene"] for g in sig["up_regulated"]]
down = [g["gene"] for g in sig["down_regulated"]]
drug_names = set(pd.read_csv(PROCESSED / "drug_set_v1.csv").pert_iname)
for space in ("bing", "all"):
print(f"\n>>> extracting {space} ...", flush=True)
ids, id2sym = gene_ids_for_space(space)
mat = extract(space, drug_names, ids, id2sym)
evaluate(space, mat, up, down)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,118 @@
"""Phase A: calibrate sickle connectivity against a REAL disease-signature reference population.
The v1.1 tau saturated because it used random gene-set nulls. Proper specificity calibration
asks: does a drug reverse SICKLE more than it reverses diseases in general? We download a library
of real disease signatures (Enrichr "Disease Signatures from GEO", up+down), compute each drug's
connectivity to every reference disease, and express its sickle connectivity as a z-score within
that per-drug reference distribution. Broad-effect drugs (reverse everything) -> z~0 -> down-ranked.
Re-runs the recovery test and compares to v1.1 (random-null). Writes nothing committed.
"""
from __future__ import annotations
from pathlib import Path
import numpy as np
import pandas as pd
import requests
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import _ks_connectivity # noqa: E402
PROCESSED = Path("data/processed")
RAW = Path("data/raw/disease_sigs")
ENRICHR = "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName={}"
LIBS = {"up": "Disease_Signatures_from_GEO_up_2014", "down": "Disease_Signatures_from_GEO_down_2014"}
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
def fetch_gmt(name: str) -> dict[str, list[str]]:
RAW.mkdir(parents=True, exist_ok=True)
path = RAW / f"{name}.gmt"
if not path.exists():
r = requests.get(ENRICHR.format(name), timeout=120)
r.raise_for_status()
path.write_text(r.text)
out = {}
for line in path.read_text().splitlines():
parts = line.split("\t")
if len(parts) < 3:
continue
term = parts[0]
genes = [g.split(",")[0].strip().upper() for g in parts[2:] if g.strip()]
out[term] = genes
print(f" {name}: {path.stat().st_size/1e6:.1f} MB, {len(out)} terms")
return out
def build_reference() -> list[dict]:
up = fetch_gmt(LIBS["up"])
down = fetch_gmt(LIBS["down"])
shared = set(up) & set(down)
refs = []
for term in shared:
if "sickle" in term.lower():
continue # exclude the target disease from its own reference population
refs.append({"name": term, "up": up[term], "down": down[term]})
print(f"reference disease signatures (paired up+down, sickle excluded): {len(refs)}")
return refs
def cols_for(genes, gene_to_col):
return np.array([gene_to_col[g] for g in set(genes) if g in gene_to_col], dtype=int)
def main():
import json
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
sk_up = [g["gene"] for g in sig["up_regulated"]]
sk_down = [g["gene"] for g in sig["down_regulated"]]
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
genes = list(lincs.columns)
gene_to_col = {g: i for i, g in enumerate(genes)}
n = len(genes)
R = lincs.rank(axis=1, ascending=False).to_numpy()
refs = build_reference()
# connectivity of every drug to every reference disease -> (n_drugs, n_refs)
C = np.empty((R.shape[0], len(refs)))
for j, d in enumerate(refs):
C[:, j] = _ks_connectivity(R, cols_for(d["up"], gene_to_col), cols_for(d["down"], gene_to_col), n)
# drop reference diseases with too few mapped genes (degenerate columns)
mapped = np.array([len(cols_for(d["up"], gene_to_col)) + len(cols_for(d["down"], gene_to_col)) for d in refs])
keep = mapped >= 10
C = C[:, keep]
print(f"usable reference diseases (>=10 mapped genes): {keep.sum()}")
real = _ks_connectivity(R, cols_for(sk_up, gene_to_col), cols_for(sk_down, gene_to_col), n)
ref_mean, ref_std = C.mean(axis=1), C.std(axis=1)
ref_std[ref_std == 0] = np.nan
spec_z = (real - ref_mean) / ref_std # negative = reverses sickle more than diseases-in-general
ranked = pd.DataFrame({"spec_z": spec_z, "connectivity": real}, index=lincs.index).sort_values("spec_z")
ranked.insert(0, "rank", range(1, len(ranked) + 1))
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
N = len(ranked)
top10, top25, half = int(N * .10), int(N * .25), N // 2
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
print("\n==== RECOVERY TEST (reference-calibrated tau) ====")
print(f" hydroxyurea rank {hu}/{N} (top {100*hu/N:.1f}%) top-10%? {hu <= top10} z={ranked.loc['hydroxyurea','spec_z']:.2f}")
print(f" L-glutamine rank {glut}/{N} (top {100*glut/N:.1f}%) top-25%? {glut <= top25}")
print(f" neg controls bottom-half: {n_bottom}/5 {negs}")
crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
print(f" OVERALL: {'PASS' if crit else 'FAIL'}")
print("\n top 12 by reference-calibrated z:")
for name, r in ranked.nsmallest(12, "spec_z").iterrows():
print(f" {int(r['rank']):2d} {name:20s} z={r['spec_z']:6.2f} [{r['inclusion_reason']}]")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,137 @@
"""Phase D: supervised cross-disease repurposing — can labels break the specificity ceiling?
Connectivity alone can't tell therapeutic from coincidental reversal. A supervised model trained
on known drug-disease pairs CAN learn that pattern — given features that expose drug "broadness"
(a drug that reverses everything is non-specific). We train on 839 GEO disease signatures with
Repurposing-Hub indications as labels, evaluate with disease-grouped CV, then apply to HELD-OUT
sickle and check whether the coincidental reversers (norethindrone, ciprofloxacin) finally drop.
Baseline = rank by raw connectivity (the single conn feature). Win = model down-ranks the
negative controls vs baseline while keeping hydroxyurea high.
"""
from __future__ import annotations
import json
import re
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.metrics import roc_auc_score
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import _ks_connectivity # noqa: E402
PROCESSED = Path("data/processed")
SIGS = Path("data/raw/disease_sigs")
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
ID_RE = re.compile(r"\s*(c\d{6,}|doid[-\s]?\d+|gse\d+|umls\S+)\s*", re.I)
def clean_disease(term: str) -> str:
return ID_RE.sub(" ", term).strip().lower()
def load_disease_sigs():
def parse(name):
out = {}
for line in (SIGS / name).read_text().splitlines():
p = line.split("\t")
if len(p) < 3:
continue
out[p[0]] = [g.split(",")[0].strip().upper() for g in p[2:] if g.strip()]
return out
up = parse("Disease_Perturbations_from_GEO_up.gmt")
down = parse("Disease_Perturbations_from_GEO_down.gmt")
terms = sorted(set(up) & set(down))
return [{"term": t, "disease": clean_disease(t), "up": up[t], "down": down[t]} for t in terms]
def cols_for(genes, g2c):
return np.array([g2c[g] for g in set(genes) if g in g2c], dtype=int)
def featurize(conn_col, C):
"""Per-(drug,disease) features for one disease column conn_col, given full matrix C."""
drug_mean, drug_std = C.mean(1), C.std(1)
drug_min = C.min(1)
broad = (C < -0.10).mean(1) # fraction of diseases this drug strongly reverses
dmean, dstd = conn_col.mean(), conn_col.std() or 1.0
return np.column_stack([
conn_col, drug_mean, drug_std, drug_min, broad,
np.full_like(conn_col, dmean),
(conn_col - drug_mean) / np.where(drug_std == 0, 1, drug_std), # specificity within drug
(conn_col - dmean) / dstd, # specificity within disease
])
FEATS = ["conn", "drug_mean", "drug_std", "drug_min", "broadness",
"disease_mean", "z_within_drug", "z_within_disease"]
def main():
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
drugs = list(lincs.index)
g2c = {g: i for i, g in enumerate(lincs.columns)}
n = len(lincs.columns)
R = lincs.rank(axis=1, ascending=False).to_numpy()
refs = [d for d in load_disease_sigs()
if len(cols_for(d["up"], g2c)) + len(cols_for(d["down"], g2c)) >= 10]
C = np.column_stack([_ks_connectivity(R, cols_for(d["up"], g2c), cols_for(d["down"], g2c), n) for d in refs])
print(f"{len(drugs)} drugs x {len(refs)} disease signatures; connectivity matrix {C.shape}")
# labels from Repurposing Hub indications
hub = pd.read_csv("data/raw/repurposing_drugs.txt", sep="\t", comment="!", low_memory=False)
ind = {r.pert_iname: [i.strip().lower() for i in re.split(r"[|,]", r.indication) if len(i.strip()) > 3]
for r in hub.itertuples() if isinstance(r.indication, str)}
drug_idx = {d: i for i, d in enumerate(drugs)}
X_rows, y, grp = [], [], []
for j, d in enumerate(refs):
feats = featurize(C[:, j], C)
dz = d["disease"]
for i, drug in enumerate(drugs):
inds = ind.get(drug, [])
label = int(any(dz == k or (len(dz) > 4 and dz in k) or (len(k) > 4 and k in dz) for k in inds))
X_rows.append(feats[i]); y.append(label); grp.append(j)
X = np.array(X_rows); y = np.array(y); grp = np.array(grp)
print(f"pairs: {len(y)}, positives: {y.sum()} ({100*y.mean():.2f}%)")
# disease-grouped CV (generalize to unseen diseases)
clf = GradientBoostingClassifier(random_state=0)
proba = cross_val_predict(clf, X, y, cv=GroupKFold(5), groups=grp, method="predict_proba")[:, 1]
print(f"disease-grouped CV AUC: {roc_auc_score(y, proba):.3f} (conn-only AUC: {roc_auc_score(y, X[:,0]*-1):.3f})")
clf.fit(X, y)
print("feature importances: " + ", ".join(f"{f}={imp:.2f}" for f, imp in
sorted(zip(FEATS, clf.feature_importances_), key=lambda t: -t[1])))
# apply to HELD-OUT sickle
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
sk = _ks_connectivity(R, cols_for([g["gene"] for g in sig["up_regulated"]], g2c),
cols_for([g["gene"] for g in sig["down_regulated"]], g2c), n)
Xs = featurize(sk, C)
p_sickle = clf.predict_proba(Xs)[:, 1]
res = pd.DataFrame({"model_p": p_sickle, "conn": sk}, index=drugs)
res["model_rank"] = res["model_p"].rank(ascending=False).astype(int)
res["conn_rank"] = res["conn"].rank(ascending=True).astype(int) # most negative = best
N = len(res)
print(f"\n{'drug':14s} {'model_rank':>10s} {'conn_rank(base)':>16s}")
for d in ["hydroxyurea", "glutamine"] + NEG5 + ["norethindrone", "ciprofloxacin"]:
if d in res.index:
r = res.loc[d]
print(f" {d:14s} {int(r['model_rank']):6d}/{N} {int(r['conn_rank']):6d}/{N}")
nb_model = sum(res.loc[d, "model_rank"] > N/2 for d in NEG5 if d in res.index)
nb_conn = sum(res.loc[d, "conn_rank"] > N/2 for d in NEG5 if d in res.index)
print(f"\nneg controls bottom-half: model {nb_model}/5 vs baseline {nb_conn}/5")
print("model top 10:", ", ".join(res.sort_values('model_p', ascending=False).head(10).index))
if __name__ == "__main__":
main()

View File

@@ -36,9 +36,15 @@ def read_gz_tsv(name: str) -> pd.DataFrame:
def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]: def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
lm = pd.read_csv(LINCS / "landmark_genes.csv") """Gene row-ids + id->symbol map for the scored gene space.
ids = [str(x) for x in lm["pr_gene_id"]]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in lm.itertuples()} v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks.
This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into
scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier).
"""
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
ids = [str(x) for x in g["pr_gene_id"]]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
return ids, id_to_symbol return ids, id_to_symbol

View File

@@ -1,13 +1,11 @@
"""Week 3: run connectivity scoring over all drugs -> ranked_candidates_v1.csv (PLAN §6). """Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration.
Loads the disease signature + the 300 drug LINCS signatures, computes the weighted-KS Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of
connectivity score per drug, and produces two rankings: random queries) — this calibrates out broad-effect drugs that connect to random signatures too,
1. raw connectivity (most negative = strongest reversal = rank 1) the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column,
2. a secondary ranking blending connectivity with a mechanistic prior (sickle-relevant and a secondary ranking blends tau with the sickle-pathway mechanistic prior.
target pathways), to temper broad-effect drugs (HDAC/kinase) that dominate raw rankings.
The formal recovery test (ground-truth + negative-control evaluation against the pre-registered Output: data/results/ranked_candidates_v1.csv.
criteria) is Week 4; this script only prints a sanity peek.
""" """
from __future__ import annotations from __future__ import annotations
@@ -19,10 +17,13 @@ import pandas as pd
import sys import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import mechanistic_prior, persist_ranking, rank_drugs # noqa: E402 from src.scoring import ( # noqa: E402
connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate,
)
PROCESSED = Path("data/processed") PROCESSED = Path("data/processed")
PRIOR_LAMBDA = 0.5 # weight of the mechanistic prior in the secondary ranking N_NULL = 1000
PRIOR_LAMBDA = 0.5 # spec_z credit per matched sickle pathway, for the blended ranking
def main() -> None: def main() -> None:
@@ -30,52 +31,51 @@ def main() -> None:
up = [g["gene"] for g in sig["up_regulated"]] up = [g["gene"] for g in sig["up_regulated"]]
down = [g["gene"] for g in sig["down_regulated"]] down = [g["gene"] for g in sig["down_regulated"]]
sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 978 symbols sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 12,328 genes
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name") profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
landmark = set(sig_matrix.columns) n_up = len(set(up) & set(sig_matrix.columns))
n_up_ov = len(set(up) & landmark) n_down = len(set(down) & set(sig_matrix.columns))
n_down_ov = len(set(down) & landmark) print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}")
print(f"query overlap with 978 landmarks: {n_up_ov} up + {n_down_ov} down = {n_up_ov + n_down_ov}")
print(f"scoring {len(sig_matrix)} drugs (all scored; 0 without signature)")
ranked = rank_drugs(up, down, sig_matrix) # primary: tau calibration
print(f"computing tau over {N_NULL} random-query permutations ...", flush=True)
ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL)
# attach metadata + mechanistic prior # reference: weighted connectivity score (WTCS) + NCS
wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index},
name="connectivity_score")
ranked["connectivity_score"] = wtcs
ranked["normalized_score"] = normalize_scores(wtcs)
# metadata + mechanistic prior
ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]]) ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
ranked["mechanistic_prior"] = ranked["targets"].apply( ranked["mechanistic_prior"] = ranked["targets"].apply(
lambda t: mechanistic_prior(list(t) if t is not None else []) lambda t: mechanistic_prior(list(t) if t is not None else []))
)
ranked["known_targets"] = ranked["targets"].apply( ranked["known_targets"] = ranked["targets"].apply(
lambda t: "; ".join(t) if t is not None and len(t) else "" lambda t: "; ".join(t) if t is not None and len(t) else "")
)
ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"}) ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})
# secondary, prior-weighted ranking: relevant drugs pushed toward better (more negative) # secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z)
ranked["blended_score"] = ranked["normalized_score"] - PRIOR_LAMBDA * ranked["mechanistic_prior"] ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int) ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)
out = ranked.rename_axis("drug_name").reset_index()[[ out = ranked.rename_axis("drug_name").reset_index()[[
"rank", "drug_name", "chembl_id", "connectivity_score", "normalized_score", "rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score",
"inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary", "inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
]] ]]
path = persist_ranking(out) path = persist_ranking(out)
print(f"wrote {path} ({len(out)} drugs)") print(f"wrote {path} ({len(out)} drugs)")
# --- sanity peek (formal recovery test is Week 4) --- print("\n--- sanity peek (spec_z ranking) ---")
print("\n--- sanity peek (raw connectivity rank) ---")
for gt in ["hydroxyurea", "glutamine"]: for gt in ["hydroxyurea", "glutamine"]:
r = ranked.loc[gt] r = ranked.loc[gt]
pct = 100 * r["rank"] / len(ranked) print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), "
print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {pct:.0f}%), " f"spec_z={r['spec_z']:.2f}")
f"score={r['connectivity_score']:.3f}") print(" top 10 by spec_z:")
neg = ranked[ranked["inclusion_reason"] == "negative_control"] for name, r in ranked.nsmallest(10, "spec_z").iterrows():
print(f" negative controls in bottom half: " print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] "
f"{(neg['rank'] > len(ranked) / 2).sum()}/{len(neg)}") f"{str(r['known_targets'])[:38]}")
print("\n top 5 raw candidates:")
for name, r in ranked.nsmallest(5, "connectivity_score").iterrows():
print(f" {int(r['rank']):3d} {name:18s} {r['connectivity_score']:+.3f} "
f"[{r['inclusion_reason']}] {r['known_targets'][:50]}")
if __name__ == "__main__": if __name__ == "__main__":

View File

@@ -36,6 +36,7 @@ def main() -> None:
return int(df.loc[name, "rank"]) if name in df.index else None return int(df.loc[name, "rank"]) if name in df.index else None
hu, glut = rk("hydroxyurea"), rk("glutamine") hu, glut = rk("hydroxyurea"), rk("glutamine")
glut_z = df.loc["glutamine", "spec_z"]
# pick negative controls present in the ranking # pick negative controls present in the ranking
negs = {} negs = {}
@@ -47,9 +48,8 @@ def main() -> None:
print("=" * 60) print("=" * 60)
print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}") print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}") print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}")
glut_score = df.loc["glutamine", "connectivity_score"] print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), spec_z={glut_z:+.2f} "
print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), WTCS={glut_score:.3f} " f"-> top-25%? {glut <= top25_cut} (positive z => does not reverse; has a signature)")
f"-> top-25%? {glut <= top25_cut} (has signature, so NOT 'missing-signature unscorable')")
print("\nnegative controls (pre-specified, 1 per category):") print("\nnegative controls (pre-specified, 1 per category):")
n_bottom = 0 n_bottom = 0
for d, (cat, r) in negs.items(): for d, (cat, r) in negs.items():
@@ -70,10 +70,10 @@ def main() -> None:
print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} " print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
f"(top {100*hu_b/n:.1f}%)") f"(top {100*hu_b/n:.1f}%)")
print("\n--- TOP 10 (raw connectivity) ---") print("\n--- TOP 10 (primary spec_z ranking) ---")
top10 = df.nsmallest(10, "connectivity_score") top10 = df.sort_values("rank").head(10)
for name, r in top10.iterrows(): for name, r in top10.iterrows():
print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} " print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:+.2f} "
f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}") f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}")

89
src/binding.py Normal file
View File

@@ -0,0 +1,89 @@
"""Structure-based binding track (PLAN §12).
Two capabilities:
- ligand-based retrieval (RDKit, works now): find existing drugs near a query molecule in
chemical space — validated, and the engine behind §12.9 generative-guided retrieval.
- structure-based docking (§12.3): score whether a ligand binds a target pocket. Blocked on an
ARM-Mac docking toolchain (AutoDock Vina does not pip-install); see ``dock`` for options.
Caveat carried throughout: chemical similarity != activity, and docking != efficacy (§12.6).
"""
from __future__ import annotations
from pathlib import Path
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator
RDLogger.DisableLog("rdApp.*")
_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
STRUCT_DIR = Path("data/raw/structures")
# Known sickle small-molecule binders, by target (positive controls for the §12.4 recovery test).
KNOWN_BINDERS = {
"hemoglobin": "voxelotor",
"PKR": "mitapivat",
"DNMT1": "decitabine",
"HDAC": "vorinostat",
}
# Curated target structures (PLAN §12.2). Add PDB ids as the harness grows.
TARGET_PDB = {
"hemoglobin": "5E83", # hemoglobin + voxelotor (GBT440)
"PKR": "8XFD", # pyruvate kinase R + mitapivat
}
def morgan_fp(smiles: str):
"""Morgan (ECFP4) fingerprint, or None for invalid / missing SMILES ('-666', '')."""
if not isinstance(smiles, str) or smiles in ("-666", ""):
return None
mol = Chem.MolFromSmiles(smiles)
return _MORGAN.GetFingerprint(mol) if mol else None
def tanimoto(smiles_a: str, smiles_b: str) -> float | None:
fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
if fa is None or fb is None:
return None
return DataStructs.TanimotoSimilarity(fa, fb)
def retrieve_nearest(
query_smiles: str,
library: dict[str, str],
top_n: int = 5,
) -> list[tuple[float, str]]:
"""Rank a library of {name: smiles} by Tanimoto similarity to a query molecule.
This is the §12.9 retrieval step: the query may be a known binder (positive-control beacon)
or a generated idealised binder; the returned existing drugs are repurposing candidates that
STILL require docking/validation (similarity != activity).
"""
qfp = morgan_fp(query_smiles)
if qfp is None:
raise ValueError("invalid query SMILES")
sims = []
for name, smi in library.items():
fp = morgan_fp(smi)
if fp is not None:
sims.append((DataStructs.TanimotoSimilarity(qfp, fp), name))
return sorted(sims, reverse=True)[:top_n]
def dock(target: str, ligand_smiles: str) -> float:
"""Dock a ligand into a target pocket and return a binding score (PLAN §12.3).
Blocked: AutoDock Vina does not pip-install on ARM Mac and no docking binary is on PATH.
Resolve the toolchain first (one of):
- conda/micromamba: ``vina`` (conda-forge) or ``smina`` (bioconda), osx-arm64 builds
- an AF3-class co-folding model on GPU: Boltz-2 / Chai-1 / DiffDock (also predicts affinity)
- x86 Vina binary under Rosetta 2
Then: fetch TARGET_PDB[target], define the pocket box, prep the ligand (Meeko), score.
"""
raise NotImplementedError(
"Docking toolchain unresolved on ARM Mac (PLAN §12.6 pitfall 4 / §12.8). "
"See docstring for options."
)

View File

@@ -16,7 +16,7 @@ from pathlib import Path
import numpy as np import numpy as np
import pandas as pd import pandas as pd
from . import RESULTS_DIR from . import RANDOM_SEED, RESULTS_DIR
# Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3). # Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
# Keys are pathway categories; values are substrings matched (case-insensitive) against a # Keys are pathway categories; values are substrings matched (case-insensitive) against a
@@ -134,6 +134,92 @@ def mechanistic_prior(targets: list[str]) -> float:
return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values())) return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))
# --- KS connectivity + tau calibration (v1.1) -----------------------------------------------
# Unweighted Kolmogorov-Smirnov connectivity (Lamb 2006) is O(k) per query (depends only on the
# ranks of the query genes), which makes a permutation null over many random queries cheap. tau
# expresses each drug's real connectivity as a signed percentile within its own null — so
# broad-effect drugs that connect to *random* signatures too get down-weighted (specificity).
def _ks_es(rank_matrix: np.ndarray, query_cols: np.ndarray, n_genes: int) -> np.ndarray:
"""Vectorized unweighted KS enrichment score of a query gene set for every drug.
``rank_matrix`` is (n_drugs, n_genes) of 1..N rank positions (1 = most up-regulated).
Returns one ES per drug; ES>0 => query enriched among up-regulated genes.
"""
k = len(query_cols)
if k == 0:
return np.zeros(rank_matrix.shape[0])
p = np.sort(rank_matrix[:, query_cols], axis=1) # (n_drugs, k), positions ascending
j = np.arange(1, k + 1)
a = (j / k - p / n_genes).max(axis=1)
b = (p / n_genes - (j - 1) / k).max(axis=1)
return np.where(a >= b, a, -b)
def _ks_connectivity(rank_matrix: np.ndarray, up_cols: np.ndarray, down_cols: np.ndarray,
n_genes: int) -> np.ndarray:
"""KS connectivity per drug: (ES_up - ES_down)/2. Negative=reversal.
Note: unlike WTCS, this does NOT zero same-sign (ambiguous) connections — same-sign ES
partially cancel and land near 0 naturally. Hard-zeroing would collapse the random-query
null to a spike at 0 and make tau saturate, so the continuous form is required for calibration.
"""
es_up = _ks_es(rank_matrix, up_cols, n_genes)
es_down = _ks_es(rank_matrix, down_cols, n_genes)
return (es_up - es_down) / 2.0
def tau_calibrate(
up_genes: list[str],
down_genes: list[str],
signature_matrix: pd.DataFrame,
n_null: int = 1000,
seed: int = RANDOM_SEED,
) -> pd.DataFrame:
"""Rank drugs by tau: each drug's KS connectivity as a signed percentile vs a null of
random queries of the same size (PLAN §6; CMap tau, Subramanian 2017).
tau in [-100, 100]: -100 => reverses our signature more specifically than any random query
(strong, specific candidate); ~0 => connects to our signature no more than to random ones
(broad-effect / non-specific). Ranked by tau ascending (rank 1 = most specific reversal).
"""
genes = list(signature_matrix.columns)
gene_to_col = {g: i for i, g in enumerate(genes)}
n = len(genes)
rank_matrix = signature_matrix.rank(axis=1, ascending=False).to_numpy()
up_cols = np.array([gene_to_col[g] for g in set(up_genes) if g in gene_to_col], dtype=int)
down_cols = np.array([gene_to_col[g] for g in set(down_genes) if g in gene_to_col], dtype=int)
real = _ks_connectivity(rank_matrix, up_cols, down_cols, n)
rng = np.random.default_rng(seed)
k_up, k_down = len(up_cols), len(down_cols)
null = np.empty((rank_matrix.shape[0], n_null))
for m in range(n_null):
samp = rng.choice(n, size=k_up + k_down, replace=False)
null[:, m] = _ks_connectivity(rank_matrix, samp[:k_up], samp[k_up:], n)
null_mean = null.mean(axis=1)
null_std = null.std(axis=1)
null_std[null_std == 0] = np.nan
# Per-drug standardized connectivity: how many SDs the real reversal is below what random
# queries of the same size produce against THIS drug. Continuous (no percentile floor), so it
# discriminates within the strong-reversal tail. Negative = specific reversal.
spec_z = (real - null_mean) / null_std
frac = (null <= real[:, None]).mean(axis=1)
tau = 100.0 * (2.0 * frac - 1.0) # retained for reference; saturates at +/-100 in the tail
df = pd.DataFrame(
{"connectivity_ks": real, "null_mean": null_mean, "spec_z": spec_z, "tau": tau},
index=signature_matrix.index,
)
df = df.sort_values("spec_z") # most negative z = most specific reversal
df.insert(0, "rank", range(1, len(df) + 1))
return df
def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path: def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
"""Write the ranked candidate list to ``data/results/ranked_candidates_v1.csv``.""" """Write the ranked candidate list to ``data/results/ranked_candidates_v1.csv``."""
out_path = out_path or (RESULTS_DIR / "ranked_candidates_v1.csv") out_path = out_path or (RESULTS_DIR / "ranked_candidates_v1.csv")

View File

@@ -114,6 +114,33 @@ class TestMechanisticPrior:
assert mechanistic_prior(["Some unrelated kinase"]) == 0.0 assert mechanistic_prior(["Some unrelated kinase"]) == 0.0
class TestTauCalibration:
"""tau should reward a SPECIFIC reverser and give a near-zero score to a noise drug."""
@staticmethod
def _matrix() -> pd.DataFrame:
genes = [f"U{i}" for i in range(5)] + [f"D{i}" for i in range(5)] + [f"G{i}" for i in range(40)]
rng_vals = {g: 0.01 * ((hash(g) % 7) - 3) for g in genes} # tiny deterministic noise
# specific reverser: query-up genes at the bottom, query-down at the top, rest ~0
specific = dict(rng_vals)
for i in range(5):
specific[f"U{i}"] = -8 - i
specific[f"D{i}"] = 8 + i
noise = dict(rng_vals)
return pd.DataFrame([specific, noise], index=["specific", "noise"])[genes]
def test_specific_reverser_has_strongly_negative_tau(self):
from src.scoring import tau_calibrate
up = [f"U{i}" for i in range(5)]
down = [f"D{i}" for i in range(5)]
out = tau_calibrate(up, down, self._matrix(), n_null=300, seed=0)
# Ranked by spec_z (continuous); the specific reverser is the most negative.
assert out.loc["specific", "spec_z"] < -2
assert out.loc["specific", "spec_z"] < out.loc["noise", "spec_z"]
assert out.loc["specific", "tau"] < -50 # tau also flags it (may saturate near -100)
assert out.loc["specific", "rank"] == 1
def test_rank_drugs_orders_by_reversal(): def test_rank_drugs_orders_by_reversal():
from src.scoring import rank_drugs from src.scoring import rank_drugs
genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)] genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]