Compare commits

...

11 Commits

Author SHA1 Message Date
817bcda7dc Structure-binding track: scaffold + ligand-retrieval baseline
Start the structure-based binding branch (PLAN §12), baseline-first.

- src/binding.py: validated RDKit ligand retrieval (morgan_fp, tanimoto,
  retrieve_nearest = the §12.9 engine) + dock() stub documenting the
  blocked ARM-Mac toolchain
- scripts/binding_ligand_baseline.py: 300 drugs vs known binders
- docs/structure_binding_notes.md: status, toolchain blocker, next steps
- pyproject: [structure] extra (rdkit); data/raw/structures/ for PDBs

Step-0 finding: retrieval engine VALIDATED on in-set classes
(decitabine->azacitidine 0.62; vorinostat->scriptaid/belinostat) but the
distinctive binders voxelotor/mitapivat have no analog in our 300-drug
set (Tanimoto ~0.2). Needs (a) bigger library, (b) real docking (§12.3),
which is blocked on the ARM-Mac docking toolchain (§12.6 pitfall 4).
Structures 5E83 (Hb+voxelotor) and 8XFD (PKR+mitapivat) fetched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:53:27 +02:00
6c2c71d73d PLAN §12.9: leave door open for generative-guided retrieval
Reframe de novo generation into the repurposing frame per the founder's
idea: use a pocket-conditioned generative model (TargetDiff/DiffSBDD/
Pocket2Mol) to propose an idealised binder as a SEARCH BEACON, then
retrieve the nearest EXISTING drugs by chemical similarity (Tanimoto/
embedding) as repurposing candidates — the generated molecule is never
synthesised.

Caveats kept honest: generated molecules used only as beacons (often
synthetically invalid); similarity != activity, so retrieved neighbours
still must be docked + pass the binding recovery test; open question
whether it beats brute-force docking the existing library. Explore only
after the §12.3-12.4 docking baseline is validated. §12.7 exclusion
reworded to point here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:43:25 +02:00
7449dbeefb Scope Phase 2 structure-based binding track into PLAN (§12)
Add a scoped (not committed) follow-on track pivoting modality from
expression-connectivity to structure-based drug-target binding, motivated
by the empirical finding that the expression modality is signal-dead for
this task (relational-only supervised AUC = 0.49, chance).

§12 covers: the evidence for the pivot, a sickle-specific druggable target
shortlist with known-binder positive controls (Hb/voxelotor, PKR/mitapivat,
DNMT1/decitabine, LSD1, HDAC, EHMT2, PDE9), method (classical docking
baseline -> AF3-class co-folding: Boltz-2/Chai-1/DiffDock), a pre-registered
binding recovery test, integration with the expression layer as the real
prize, honest pitfalls (binding != efficacy, BCL11A untractable, GPU breaks
the all-local assumption), and open decisions before committing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:40:18 +02:00
649f617019 Phase D: supervised cross-disease (0.925 AUC degree-bias mirage)
Train GradientBoosting on 300 drugs x 839 GEO disease signatures with
Repurposing-Hub indications as labels (432 positives), disease-grouped CV.

Finding: 0.925 CV AUC looks like a win but is a MIRAGE. Feature
importances are all drug-level (drug_std 0.33, drug_mean 0.30,
broadness 0.17); drug-disease connectivity importance = 0.01. The model
learned a drug-POPULARITY prior, not disease-specific matching. On
held-out sickle it ranks hydroxyurea 231/300 (worse than baseline) and
tops out with promiscuous drugs (dexamethasone, methotrexate). Classic
degree-bias trap. Connectivity also has ~chance AUC (0.51) for predicting
approved indications.

Both obvious approaches now fail instructively: unsupervised = specificity
ceiling; naive supervised = degree bias. Real progress needs degree-
debiased training + much larger clean labels (a research effort).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:31:32 +02:00
0ce688449d Phase A: reference-library tau (negative result on specificity)
Calibrate sickle connectivity against a real disease-signature
reference population (Enrichr Disease_Signatures_from_GEO, 141 diseases)
instead of random gene-set nulls — proper CMap tau.

Finding: hydroxyurea still recovers (rank 25, top 8%), but negative-
control specificity is UNCHANGED (2/5; norethindrone + ciprofloxacin
still top). The reference-calibrated ranking is nearly identical to the
random-null ranking. Third independent fix (after gene-space expansion
and composition adjustment) that recovers hydroxyurea but does NOT fix
specificity.

Conclusion: unsupervised connectivity has a specificity ceiling — it
cannot distinguish therapeutic reversal from coincidental transcriptional
anti-correlation. Breaking it needs external signal (supervised labels
or mechanistic filtering), not more calibration. Disease-signature
library cached at data/raw/disease_sigs/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:19:26 +02:00
b62048614d Experiment: composition-adjusted signature (negative result)
Test whether fixing the cell-composition confound rescues the recovery
test. GSE35007 measured WBC/RBC/MCV per sample, so adjust DE directly
for them (per-gene OLS: expression ~ disease + WBC + RBC + MCV + age +
sex) — a measured-covariate deconvolution, compared vs unadjusted.

Finding (negative, informative): adjustment HURT hydroxyurea (rank
20 -> 35, fails top-10%) — it was recovering partly via composition
genes — and did NOT fix negative-control specificity (still 2/5). Only
gain: top hits become more mechanistically coherent (resveratrol,
aspirin enter). Conclusion: cleaning the disease signature does not
rescue connectivity; the binding constraint is matching-side (needs a
reference-signature library for proper specificity calibration), which
is a multi-disease investment. v1.1 signature NOT replaced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:05:08 +02:00
3417f85eb1 v1.1: full gene space + specificity z-score; hydroxyurea recovers
Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
  lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
  (spec_z = SDs below a 1,000 random-query null). Primary ranking is
  now spec_z. (Textbook tau saturated at +/-100 for a coherent query —
  documented; needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming the v1 failure was the landmark bottleneck not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3 —
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:57:30 +02:00
72f1a49de6 Week 4: recovery test (FAIL, reported honestly) + 2-page report
Run the formal recovery test against the pre-registered criteria and
write the deliverable report (PLAN §6 Week 4):
- week4_recovery_test.py: evaluate hydroxyurea/L-glutamine + 5
  pre-specified negative controls vs the committed criteria
- recovery_test_report.md: methodology, FAIL result with diagnosis,
  top-10, lisinopril as the non-obvious candidate, limitations, v2
- known_limitations.md: L-glutamine coverage resolved, 12%-overlap
  driver, recovery outcome table

Outcome: FAIL on all 3 criteria (hydroxyurea top 13%, L-glutamine
WTCS=0, 1/5 negative controls bottom-half). Root cause is signature/
assay data limitations (lost erythroid+HbF axis, 12% landmark overlap),
not the matching algorithm — reported straight per the project ethos.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:38:56 +02:00
fd4591949c Week 3: CMap connectivity scoring engine + ranked candidates
Implement the matching engine (PLAN §6 Week 3):
- src/scoring.py: weighted-KS/GSEA enrichment, weighted connectivity
  score (WTCS, Lamb 2006 / Subramanian 2017), signed NCS normalization,
  rank_drugs, and a sickle-pathway mechanistic prior
- tests/test_scoring.py: real reference tests for the scorer (perfect
  reversal<null<mimic, same-sign->0, absent-gene invariance) + prior
- week3_scoring.py: score 300 drugs -> ranked_candidates_v1.csv with a
  raw ranking and a secondary mechanistic-prior-weighted ranking

Preliminary (formal recovery test is Week 4): hydroxyurea raw rank
40/300 (top 13%, just misses pre-registered top-10%), blended rank 7;
L-glutamine WTCS=0 (ambiguous). Notably anti-inflammatory SCD drugs
cluster in the raw top tier — the engine reverses the inflammation axis,
not the erythroid axis, traceable to the 12% landmark-overlap caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:34:56 +02:00
47b0094079 Week 2: 300-drug profiles with LINCS signatures + ChEMBL
Build the drug profile dataset (PLAN §6 Week 2):
- week2_curate_drugset.py: 300-drug set (2 ground-truth + 32 related-
  mechanism + 26 negative-control + 240 random), restricted to
  LINCS-scorable compounds, seed=42
- week2_chembl.py: InChIKey->ChEMBL match (145/300), MoA + targets
- week2_lincs_extract.py: cmapPy-slice both Level-5 GCTX phases to 978
  landmark genes, mean-aggregate per drug to one consensus signature
- week2_assemble.py: join into drug_profiles_v1.parquet, Tier B (LINCS
  single-source), scored flag per PLAN §6 Week 3 task 2
- docs/data_sources.md: drug set composition + LINCS/ChEMBL provenance

Results (all gitignored data): 300/300 drugs scored, both ground-truth
drugs present (hydroxyurea Phase II = CHEMBL467, L-glutamine Phase I).
Key caveat recorded: only 56/477 (12%) of the disease signature genes
are LINCS landmarks, so Week-3 scoring uses a 30-up/26-down query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:25:00 +02:00
c7b6649d31 Week 1: Tier-A sickle cell signature via 2-study concordance
Implement and run the Week 1 disease-signature pipeline:
- src/disease.py: Welch t-test + BH DE (microarray), probe->symbol
  collapse, cross-study concordance filter, 2-study provenance schema
- scripts/week1_explore.py: download GSE35007 + GSE16728, DE + concordance
- scripts/week1_finalize.py: mygene ID mapping + persist signature
- tests/test_disease.py: synthetic-data tests for DE/collapse/concordance
- docs/data_sources.md: chosen datasets, group defs, reproduction steps

Result: sickle_cell_signature_v1.json (gitignored), Tier A, 250 up /
227 down genes from 671 concordant (GSE35007 Illumina whole blood SS/AA +
GSE16728 Affymetrix whole blood patient/control). Documented caveats:
missing HbF axis (globin depletion) and reticulocyte composition confound.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 20:43:54 +02:00
25 changed files with 2487 additions and 124 deletions

140
PLAN.md
View File

@@ -426,3 +426,143 @@ This MVP exists in a broader strategic context that was built through ~7 expert
- **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.
The MVP's job is to produce one credible result. Everything else follows from that.
---
## 12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
> **Status: scoped, not committed.** This is a follow-on track proposed *after* the MVP and its
> follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what
> those experiments empirically showed. Read §911 and the experiment commits first.
### 12.1 Why pivot modality (the evidence, not a hunch)
The expression-connectivity approach was built, validated, and pushed hard (gene-space
expansion, cell-composition deconvolution, reference-library tau, supervised learning). The
empirical verdict:
- Connectivity **recovers hydroxyurea** (top ~68%) but **cannot achieve specificity**
unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Unfixed by four
independent methods.
- A supervised model on indication labels hit **0.925 CV AUC** but it was a **degree-bias
mirage**: it learned drug popularity, not disease matching (it ranked hydroxyurea *231/300*).
- The decisive test: with drug-popularity features removed, the model trained on the actual
drugdisease connectivity scored **AUC 0.491 — pure chance**. **The expression-connectivity
modality contains essentially no disease-specific therapeutic signal for this task.**
This is a *signal* problem, not a *model* problem no amount of model sophistication (diffusion,
GNNs, etc.) extracts signal that isn't in the data. The response is to **change modality** to one
with a strong, physical, drug-specific signal: **does a molecule bind a sickle-relevant target?**
A drug that binds HbS is mechanistically specific by construction the opposite of a coincidental
expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is
itself a diffusion model) actually has traction, because structure has physical ground truth.
### 12.2 Targets (sickle-specific, druggable, structurally characterised)
Small molecules only 2). Curated shortlist with public structures and, crucially, **known
small-molecule binders to serve as positive controls**:
| Target | Mechanism in sickle | Known binder (positive control) |
|---|---|---|
| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | **voxelotor** (binds α-Val1) |
| PKR (PKLR, red-cell pyruvate kinase) | Activator 2,3-BPG O2 affinity | **mitapivat**, etavopivat |
| DNMT1 | HbF induction (de-repress γ-globin) | **decitabine**, azacitidine |
| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
| HDAC1/2 | HbF induction | vorinostat, panobinostat |
| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
| PDE9 | cGMP, anti-adhesion | PF-04447943 (sickle trial) |
Hard/excluded for v1: **BCL11A** (transcription factor, no classic pocket the γ-globin master
repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.
### 12.3 Method (baseline → generative co-folding)
1. **Prepare structures.** Pull target structures from the PDB; AF3/Boltz-predict any missing.
2. **Prepare ligands.** Reuse the existing ~300-drug set (we already have canonical SMILES from
ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
3. **Dock + score**, in increasing sophistication:
- **Baseline:** classical docking (AutoDock Vina / smina) fast, CPU, well-understood.
- **Generative co-folding:** an open AlphaFold3-class model **Boltz-2** (predicts the
proteinligand complex *and* a binding-affinity estimate, MIT-licensed), **Chai-1**, or
**DiffDock** (a diffusion model for docking the legitimate home for the "diffusion"
instinct). These predict the bound pose directly and tend to beat classical docking.
- Report both; the baseline keeps us honest about whether the ML model adds anything.
### 12.4 Validation (a real recovery test, like §6 Week 4)
Pre-register before scoring: **the known structure-based sickle drugs must rank as top binders to
their targets** voxelotorhemoglobin, mitapivatPKR, decitabineDNMT1. Negative controls
(unrelated drugs) must *not* bind these pockets. This is a cleaner recovery test than the
expression one, because binding is mechanistically specific it should not have the
coincidental-reverser problem that sank the connectivity approach.
### 12.5 The real prize — integrate, don't replace
The long-term value is **both modalities together**: a candidate that *reverses the disease
signature* (expression) **and** *binds a sickle-relevant target* (structure) is far more credible
than either alone. Structure supplies the specificity the expression layer lacks; expression
supplies the systems-level, target-agnostic view structure lacks. The platform thesis 11)
two databases + a matching engine extends naturally to a third (structures) feeding the same
confidence-tiered data layer.
### 12.6 Honest pitfalls (do not ignore)
1. **Binding ≠ efficacy.** A molecule can bind and do nothing therapeutic. Structure-based hits
are still hypotheses (cf. §9.7).
2. **Only covers the enzyme/pocket subset.** Sickle's biggest lever (γ-globin reactivation via
BCL11A) is largely transcriptional and not small-molecule-tractable structure-based screening
is blind to it. Be explicit about coverage.
3. **Docking/affinity accuracy is limited.** Pose prediction is decent; absolute affinity is hard.
Validate on known binders before trusting novel scores.
4. **Compute.** AF3-class models are GPU-heavy; the local Mac Studio 2) may not suffice this
track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
5. **Moat.** Structures and tools are public; the proprietary value is the curated target list,
the integration with the expression layer, and provenance/tiering not the docker.
### 12.7 Explicitly NOT in this track
Free energy perturbation / MD-based affinity; covalent docking; **de novo generation of molecules
as final candidates to synthesise** (design, not repurposing but see §12.9 for the
generate-then-retrieve hybrid, which *is* repurposing); BCL11A or any non-pocket target;
biologics; combination binding.
### 12.8 Open decisions before committing
- **Tooling:** classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend:
baseline first, for an honest reference the lesson of the whole expression arc.)
- **Compute:** secure a GPU environment (the "all local" §2 assumption breaks here).
- **Scope of v1:** the 7-target shortlist above, or start with just Hb + PKR (the two with the
cleanest positive controls) to de-risk the harness before scaling targets.
### 12.9 Door left open — generative-guided retrieval (generate → match existing)
A legitimate way to bring generative models *into the repurposing frame* (vs the design frame
excluded in §12.7): don't generate molecules to synthesise generate them as a **search beacon**.
**The idea.** Use a pocket-conditioned generative model (target-conditioned diffusion e.g.
TargetDiff, DiffSBDD, Pocket2Mol) to propose idealised binders for a sickle target. Then retrieve
the **nearest existing drugs** to those proposals by chemical similarity (Tanimoto over Morgan
fingerprints, or a learned molecular embedding). The retrieved neighbours real, already-approved
or clinical molecules are the repurposing candidates. The generated molecule is never made; it
only *defines a region of chemical space* worth searching. This is the user-proposed framing and
it is sound: "generate the ideal, then find what we already have nearby."
**Why it could add value.** It can point at scaffolds / regions a known-binder-seeded or
brute-force docking sweep would miss, and it makes the target's binding requirements explicit as
geometry rather than as a single reference ligand.
**Honest caveats (why it's a *door*, not a commitment).**
1. **Generated molecules are often synthetically unrealistic / invalid** which is exactly why
they must be used only as beacons, never as candidates.
2. **Similarity ≠ activity.** Activity cliffs mean a near-neighbour of a good binder can be inert.
So retrieved neighbours do **not** bypass validation they must still be docked/scored 12.3)
and clear the binding recovery test 12.4). The generative step *proposes*; it does not *prove*.
3. **Marginal-value question.** Directly docking the whole existing library 12.3) already covers
chemical space. Whether generate-then-retrieve beats that by efficiency or by surfacing
non-obvious scaffolds is an open empirical question and needs a head-to-head before it earns
real investment.
4. **Only as good as the pocket conditioning** of the generator, and the chemistry of the target.
**Status:** explore only *after* the §12.312.4 docking harness works and is validated on the
known binders same discipline as everywhere else: prove the baseline, then test whether the
fancier method adds anything.

View File

View File

@@ -10,17 +10,63 @@
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data | TBD | TBD |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
| ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
| LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
| LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
## Chosen GEO dataset
## Chosen GEO datasets (disease signature, Tier A via 2-study concordance)
_Document the chosen study fully: accession, platform, n per group, publication, why it was
selected over the alternatives (GSE53441, GSE35007, …)._
The signature is the cross-study concordance of two independent whole-blood studies (genes
significant at q<0.05 in **both** with the same direction). Whole-blood tissue was required so
concordance is meaningful; the two differ by platform and population, which strengthens
robustness.
| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
|---|---|---|---|---|---|
| **GSE35007** | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
| **GSE16728** | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |
- DE method: per-gene Welch t-test + BenjaminiHochberg (microarray, pure Python).
- Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
- Result: 16,208 genes tested in both **671 concordant** (444 up / 227 down). Signature =
top 250 up + all 227 down by worst-case q-value.
- **Rejected candidates:** GSE53441 (PBMC tissue mismatch with the whole-blood anchor);
GSE84633/GSE84634 (PBMC, no healthy controls).
- **Tier caveat:** GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict
n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.
Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
`scripts/week1_finalize.py` (mygene mapping + persist).
## Drug profiles (Week 2)
300-drug set (`drug_set_v1.csv`), composed and restricted to LINCS-scorable compounds:
| Inclusion reason | n | Notes |
|---|---|---|
| ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
| related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
| negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
| general_sample | 240 | random from LINCS catalog, seed=42 |
- **LINCS signatures:** per-drug consensus = mean of Level-5 MODZ z-scores across the drug's
sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH
phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
- **ChEMBL:** matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds
38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
not-single-source rule).
- **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978
landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger
space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from
LINCS entirely and remain unscoreable.
- Reproduce: `week2_curate_drugset.py``week2_chembl.py` → download Level-5 GCTX →
`week2_lincs_extract.py``week2_assemble.py`.
## Licensing note for LINCS

View File

@@ -12,9 +12,11 @@ Source: PLAN.md §9.
cell lines (MCF7, A375, PC3, …). Signatures for non-oncology diseases may be noisy. A
field-wide limitation, not unique to Reverso.
3. **L-glutamine probably has no LINCS signature.** Amino acids and metabolites weren't LINCS
priorities. If true, the ground-truth test effectively rests on hydroxyurea alone, which is
weaker. _Status: TBD — record the actual finding here once LINCS is pulled (Week 2)._
3. **L-glutamine LINCS coverage — RESOLVED, opposite of expected.** L-glutamine DOES have a
Phase I signature (hydroxyurea is Phase-II-only) — both ground-truth drugs are scorable. But
L-glutamine's connectivity is **ambiguous (WTCS=0)**: its up- and down-set enrichments share
a sign, so it shows no reversal. It ranks 100/300. So the ground-truth test effectively rests
on hydroxyurea, which itself only reaches top 13% (raw) — see the recovery test report.
4. **Connectivity scoring surfaces broad-effect drugs as false positives.** HDAC inhibitors and
broad kinase inhibitors often top connectivity rankings simply because they perturb many
@@ -32,8 +34,28 @@ Source: PLAN.md §9.
7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
to test, not discoveries. Use careful language in any write-up.
## Drug-specific gaps (fill in during Week 23)
8. **Gene-space bottleneck (v1 → fixed in v1.1).** v1 scored on only the 978 landmark genes (12%
signature overlap) — the main driver of the v1 failure. v1.1 uses the full 12,328-gene space
(85% overlap) and recovers hydroxyurea. HBG1/HBG2 remain absent from LINCS entirely.
| Drug | Issue | Handling |
9. **No reference-signature library for tau.** Textbook CMap tau saturated at ±100 (a coherent
query always out-connects random gene sets). v1.1 substitutes a per-drug specificity z-score.
Proper tau needs a library of real reference signatures — a v2 / curated-data item.
10. **Negative-control criterion may be invalid for connectivity scoring.** Unrelated drugs
(norethindrone, ciprofloxacin) rank as top specific reversers — connectivity measures
expression reversal, not therapeutic relatedness.
## Recovery test outcome
Pre-registered test (**v1, confirmatory**): **FAILED** all three criteria (hydroxyurea rank
40/top 13%; L-glutamine rank 100; 1/5 negative controls bottom-half). Post-hoc (**v1.1,
exploratory**): hydroxyurea recovers to rank 18 (top 6%, passes), but L-glutamine (rank 213, does
not reverse) and negative controls (2/5) still fail → overall still FAIL. See
`recovery_test_report.md`.
| Drug | Issue | v1.1 status |
|---|---|---|
| TBD | e.g. no LINCS signature | flagged "not scored, no signature available" |
| hydroxyurea | needed the full gene space | rank 18 (top 6%) — recovered post-hoc |
| L-glutamine | metabolite, no reversal signal (positive connectivity) | rank 213 — genuine negative |
| neg controls | reverse the generic inflammation signature | 2/5 bottom-half — criterion questionable |

View File

@@ -1,13 +1,10 @@
# Sickle Cell Repurposing — Recovery Test Report
> **Status: DRAFT SCAFFOLD — not yet run.** Filled in during Week 4 from
> `notebooks/05_recovery_test.ipynb`. Target length: ~2 pages, readable by a sceptical
> pharma scientist in 5 minutes.
> **Status: COMPLETE (v1 confirmatory + v1.1 exploratory).** Reproduce with `scripts/week1_*` →
> `week2_*` → `week3_scoring.py` → `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.
## Pre-registered success criteria
> ⚠️ **Commit this section to git _before_ running the recovery test** (PLAN.md §8, §10).
The MVP passes if:
- Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND**
@@ -15,54 +12,116 @@ The MVP passes if:
missing LINCS signature, **AND**
- At least **4 of 5** negative-control drugs rank in the **bottom half**.
_Pre-registered on: TBD (date of commit)_
_Pre-registered in the scaffold commit (`b731478`) before any scoring. **Primary (confirmatory)
analysis = v1**: 978 landmark genes, weighted connectivity score (WTCS). The 5 negative controls
were pre-specified by category rule without inspecting ranks._
---
## Section 1 — Methodology
_56 sentences: what was built, the GEO dataset used, the drug-set composition, and the
scoring method (CMap connectivity, Lamb 2006 / Subramanian 2017)._
A sickle cell disease signature was built from **two whole-blood microarray studies** (GSE35007
Illumina SS-vs-AA; GSE16728 Affymetrix patient-vs-control), keeping the **671 genes concordant**
across both (q<0.05, same direction) a cross-platform Tier-A signature (250 up / 227 down).
Profiles were built for **300 small molecules** (2 ground-truth; 32 related-mechanism; 26
negative controls; 240 random), each with a **LINCS L1000** consensus signature (mean Level-5
MODZ across cell lines, both CMap phases). Drugs were ranked by **CMap connectivity scoring**
(Kolmogorov-Smirnov, Lamb 2006 / Subramanian 2017): negative = reversal = candidate.
**v1 (pre-registered/confirmatory):** scored on the 978 landmark genes with WTCS.
**v1.1 (post-hoc/exploratory):** after v1 failed, two changes were made to diagnose why (a)
score on the full **12,328-gene** space (landmark overlap 12% 85%, bringing the erythroid
markers in); (b) add a **per-drug specificity z-score** (`spec_z`): how many SDs the real
connectivity is below a null of 1,000 random queries of the same size against that drug. Because
these changes followed inspection of the v1 result, **v1.1 is exploratory, not a confirmatory
test of the pre-registered hypothesis.**
## Section 2 — Recovery test result
| Drug | Rank | Percentile | Pass? |
|---|---|---|---|
| Hydroxyurea | TBD | TBD | TBD |
| L-glutamine | TBD | TBD | TBD |
Negative controls (expected: bottom half):
| Control drug | Rank | Bottom half? |
| Criterion | v1 (confirmatory) | v1.1 (exploratory) |
|---|---|---|
| TBD | TBD | TBD |
| Hydroxyurea top-10% (≤30) | rank **40** (13.3%) | rank **18** (6.0%) |
| L-glutamine top-25% (≤75) | rank 100, WTCS=0 | rank 213, spec_z=+0.98 |
| 4/5 neg controls bottom-half | 1/5 | 2/5 |
| **Overall** | **FAIL** | **FAIL** (but hydroxyurea recovered) |
**Overall: PASS / FAIL against pre-registered criteria — TBD**
v1.1 negative controls: clotrimazole 258 ✅, astemizole 211 ✅, azithromycin 142 ❌,
ethinyl-estradiol 114 ❌, caffeine 77 ❌.
## Section 3 — Top 10 candidates
**Honest reading.** The **pre-registered test FAILED (v1).** The post-hoc v1.1 changes
**recover hydroxyurea** (rank 40 18, passing top-10%) strong evidence that the v1 failure was
driven by the 978-landmark bottleneck, not the algorithm. But two failures survive into v1.1, and
both are now precisely diagnosed:
| Rank | Drug | Score | Known mechanism | Biological plausibility |
1. **L-glutamine does not reverse the signature** (positive connectivity, spec_z=+0.98). This is
intrinsic to its LINCS data a metabolite with no reversal signal not a coverage gap. More
genes cannot fix it.
2. **The negative-control criterion is arguably invalid for connectivity scoring.** Two
"negative controls" (norethindrone, ciprofloxacin) rank in the top 3 by spec_z. Connectivity
measures *expression reversal*, not *therapeutic relatedness* an antibiotic or contraceptive
can still down-regulate the inflammation genes that dominate the scorable signature. The test
design conflates the two.
A note on the calibration: textbook CMap **tau** (percentile vs a reference population) was
implemented but **saturated at ±100** here, because a coherent real query always out-connects
random gene sets proper tau needs a library of *real* reference signatures, which this MVP
lacks. The continuous `spec_z` is the workable substitute.
## Section 3 — Top 10 candidates (v1.1 spec_z)
| Rank | Drug | spec_z | Inclusion | Read |
|---|---|---|---|---|
| 1 | TBD | TBD | TBD | TBD |
| 1 | reserpic-acid | 3.80 | random | reserpine metabolite; non-obvious |
| 2 | norethindrone | 3.78 | **negative control** | false positive (see §2) |
| 3 | ciprofloxacin | 3.61 | **negative control** | false positive |
| 4 | resveratrol | 3.46 | related-mechanism | antioxidant studied in SCD coherent |
| 5 | BRD-K57490754 | 3.37 | random | tool compound |
| 6 | anastrozole | 3.27 | random | aromatase inhibitor |
| 710 | BRD-* / palmitoylethanolamide | ~3.1 | random | mostly tool compounds |
_Note: HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due
to widespread expression effects — flag these honestly (PLAN.md §9.4)._
That two negative controls outrank hydroxyurea is the single most informative result here see §4.
## Section 4 — One non-obvious candidate worth investigating
## Section 4 — One non-obvious result worth investigating
_A single paragraph on the most interesting result. Language must be careful: this is a
computational hypothesis to test, not a discovery (PLAN.md §9.7)._
The most useful finding is **not** a candidate drug but the **negative-control failure**:
unrelated drugs (norethindrone, ciprofloxacin) score as strong specific reversers. This is a
real, generalizable lesson for a signature whose *scorable* portion is generic
inflammation/metabolic genes, connectivity rewards any broad transcriptional perturbation that
touches those genes. The honest implication: **this signature is not specific enough to
discriminate true repurposing candidates from incidental expression reversers.** Of the
plausibly-real hits, **resveratrol (rank 4)** an antioxidant with prior sickle cell literature
is the most defensible, but it is a hypothesis, not a discovery.
## Section 5 — Honest limitations
- Cell-composition confound in whole-blood expression (PLAN.md §9.1)
- LINCS L1000 cell-line limitations — landmark genes measured mostly in cancer lines (§9.2)
- Missing signatures (e.g. L-glutamine) (§9.3)
- No mechanistic validation layer — discovery hypothesis generation, not validated prediction (§9.6)
1. **Pre-registered test failed; the pass is post-hoc.** v1.1's hydroxyurea recovery is
exploratory and must be re-validated on a held-out disease before any claim is made.
2. **Missing HbF axis** HBG1/HBG2 are absent from LINCS entirely (not just landmarks), so the
pathway hydroxyurea acts on can never be scored by this method.
3. **Signature specificity** scorable genes are inflammation/metabolic; negative controls
reverse them too. Connectivity therapeutic relatedness.
4. **Cell-composition confound** the whole-blood signature is reticulocyte-dominated.
5. **LINCS cancer-cell-line bias**, and **no reference-signature library** for proper tau.
6. **No mechanistic validation** all hits are computational hypotheses.
## Section 6 — What v2 would fix
- Cell-type deconvolution of the disease signature
- Knowledge graph fallback for missing-signature drugs
- A second disease to test generalization (the real test — sickle cell results do not prove
the platform generalizes, §9.5)
- **A reference-signature library** to make tau (proper specificity calibration) work the
single biggest fix to the negative-control problem, and a direct use of the curated-data moat.
- **Cell-type deconvolution** + a non-globin-depleted RNA-seq study to recover a more specific,
HbF-containing signature.
- **Signature prediction / mechanism graph** to score genes with no LINCS measurement.
- **A second disease** to test generalization and to honestly re-validate the v1.1 method
(PLAN §9.5).
---
### Bottom line
The pre-registered recovery test **failed**. Post-hoc diagnosis shows the dominant cause was a
fixable gene-space bottleneck correcting it **recovers hydroxyurea** but also surfaces a
deeper, genuine limitation: this whole-blood signature is **not specific enough** for
connectivity scoring to separate real candidates from incidental reversers (negative controls
rank at the top). The MVP's real deliverable is a precise, honest map of *what it takes to make
this method work*: a more specific (deconvolved, HbF-containing) signature and a reference library
for calibration exactly the curated-data investments the platform thesis is built on.

View File

@@ -0,0 +1,34 @@
# Structure-based binding track — working notes
Branch `structure-based-binding`. Implements PLAN §12. Baseline-first, start with the two cleanest
targets (Hemoglobin + PKR), de-risk the harness before scaling.
## Status (2026-06-23)
**Toolchain check (PLAN §12.6 pitfall 4, confirmed real):**
- ✅ RDKit installs on ARM Mac — ligand side ready.
- ❌ AutoDock Vina does NOT pip-install on ARM Mac; no docking binary available. Docking (§12.3)
is **blocked on toolchain** — must resolve via conda/micromamba (`vina`/`smina`), a GPU AF3-class
model (Boltz-2/Chai-1/DiffDock), or an x86 Vina binary under Rosetta.
**Structures obtained:** `5E83` (hemoglobin + voxelotor), `8XFD` (PKR + mitapivat) in
`data/raw/structures/`.
**Step 0 — ligand-based retrieval baseline (`scripts/binding_ligand_baseline.py`):**
RDKit Tanimoto of our 300 drugs vs known sickle binders.
- Engine VALIDATED on in-set classes: `decitabine`→azacitidine (0.62); `vorinostat`→scriptaid
(0.42), belinostat (0.28). Correctly clusters DNMT1 / HDAC HbF-inducers.
- But voxelotor / mitapivat have **no analog** in our set (max Tanimoto ~0.200.26). A 300-drug
library is too sparse to contain look-alikes of distinctive scaffolds.
**Takeaways:**
1. Ligand retrieval works but needs a **bigger drug library** to be useful for distinctive targets.
2. The targets without in-set analogs (Hb, PKR) need **actual docking** (§12.3) — which scores
binding directly, no look-alike required. That is the gating next step, and it needs the
toolchain solved.
## Next steps
- [ ] Resolve the docking toolchain (recommend: micromamba + smina/vina, CPU, no GPU needed for baseline).
- [ ] Dock the known binders (voxelotor→5E83, mitapivat→8XFD) as positive controls (§12.4 recovery test).
- [ ] Expand the ligand library (full ChEMBL/LINCS) for retrieval to have reach.
- [ ] Only then: AF3-class co-folding (Boltz-2/DiffDock) vs the docking baseline; and §12.9 generative beacon.

View File

@@ -33,6 +33,12 @@ dev = [
"pytest>=8.0",
"ruff>=0.5",
]
# Structure-based binding track (PLAN §12). Docking tool (vina/smina) is NOT pip-installable on
# ARM Mac — install via conda/micromamba or use a GPU AF3-class model; see docs/structure_binding_notes.md.
structure = [
"rdkit>=2024.3",
"requests>=2.31",
]
[tool.setuptools.packages.find]
where = ["."]

View File

@@ -0,0 +1,66 @@
"""Structure-based track, step 0: ligand-based retrieval baseline (PLAN §12.9 engine).
Docking (§12.3) needs a toolchain that doesn't pip-install on ARM Mac (AutoDock Vina) — that's the
next dependency to solve. Meanwhile this runs now with pure RDKit: do any of our 300 drugs sit near
the KNOWN sickle binders (voxelotor, mitapivat, decitabine) in chemical space? This is the
retrieval engine §12.9 would point a generative beacon at, and a sanity check on the ligand data.
NOT docking and NOT a binding claim — chemical similarity only. Similarity != activity (§12.9).
"""
from __future__ import annotations
import pandas as pd
import requests
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator
RDLogger.DisableLog("rdApp.*")
MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
# Known sickle binders = positive-control beacons (target in parens).
BINDERS = ["voxelotor", "mitapivat", "decitabine", "vorinostat"]
def pubchem_smiles(name: str) -> str | None:
for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
try:
u = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON"
d = requests.get(u, timeout=30).json()["PropertyTable"]["Properties"][0]
if prop in d:
return d[prop]
except Exception:
continue
return None
def fp(smi: str):
if not isinstance(smi, str) or smi in ("-666", ""):
return None
m = Chem.MolFromSmiles(smi)
return MORGAN.GetFingerprint(m) if m else None
def main() -> None:
binder_smi = {b: pubchem_smiles(b) for b in BINDERS}
print("known-binder SMILES:", {k: (v[:34] + "..." if v else "MISSING") for k, v in binder_smi.items()})
drugs = pd.read_csv("data/processed/drug_set_v1.csv")[["pert_iname", "canonical_smiles", "inclusion_reason"]]
reason = dict(zip(drugs.pert_iname, drugs.inclusion_reason))
drug_fp = {r.pert_iname: fp(r.canonical_smiles) for r in drugs.itertuples()}
drug_fp = {k: v for k, v in drug_fp.items() if v is not None}
print(f"fingerprinted {len(drug_fp)}/{len(drugs)} drugs\n")
for b, smi in binder_smi.items():
bfp = fp(smi)
if bfp is None:
print(f"{b}: no SMILES\n"); continue
sims = sorted(((DataStructs.TanimotoSimilarity(bfp, v), k) for k, v in drug_fp.items()), reverse=True)
print(f"nearest drugs to {b}:")
for s, k in sims[:5]:
print(f" {s:.3f} {k:22s} [{reason.get(k,'?')}]")
print()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,137 @@
"""Experiment: composition-adjusted sickle signature, to fix specificity (option 1).
The v1 signature is confounded by cell composition (SS patients have very different WBC/RBC
than AA controls). GSE35007 *measured* those counts per sample, so we adjust the differential
expression for them directly (a measured-covariate alternative to estimated deconvolution):
expression ~ disease + WBC + RBC + MCV + age + sex (per gene, vectorized OLS)
We compare the composition-ADJUSTED signature against an UNADJUSTED single-study signature
(same samples, model without the covariates), score both with the v1.1 engine (full gene space
+ spec_z), and report the recovery test for each. Writes nothing to committed artifacts.
"""
from __future__ import annotations
import warnings
from pathlib import Path
import GEOparse
import numpy as np
import pandas as pd
from scipy.stats import false_discovery_control
from scipy.stats import t as tdist
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.disease import collapse_probes_to_symbols # noqa: E402
from src.scoring import tau_calibrate # noqa: E402
warnings.filterwarnings("ignore")
PROCESSED = Path("data/processed")
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
SYMBOL_COLS = ["Symbol", "ILMN_Gene", "Gene Symbol", "GeneSymbol"]
def cval(gsm, key):
for c in gsm.metadata.get("characteristics_ch1", []):
if c.lower().startswith(key.lower()):
return c.split(":", 1)[1].strip()
return None
def load_data():
gse = GEOparse.get_GEO(geo="GSE35007", destdir="data/raw/geo", silent=True)
meta = []
for gid, g in gse.gsms.items():
meta.append({"gsm": gid, "hb": cval(g, "hb phenotype"), "wbc": cval(g, "white blood cells"),
"rbc": cval(g, "red blood cells"), "mcv": cval(g, "mean corpuscular volume"),
"age": cval(g, "age"), "sex": cval(g, "Sex")})
meta = pd.DataFrame(meta).set_index("gsm")
for c in ["wbc", "rbc", "mcv", "age"]:
meta[c] = pd.to_numeric(meta[c], errors="coerce")
meta["disease"] = meta["hb"].map({"SS": 1.0, "AA": 0.0})
meta["sex_m"] = (meta["sex"] == "M").astype(float)
keep = meta[meta["hb"].isin(["SS", "AA"])].dropna(subset=["wbc", "rbc", "mcv", "age", "disease"])
expr = pd.DataFrame({gid: gse.gsms[gid].table.set_index("ID_REF")["VALUE"] for gid in keep.index})
expr = expr.apply(pd.to_numeric, errors="coerce").dropna(how="any")
if float(np.nanmax(expr.to_numpy())) > 50:
expr = np.log2(expr.clip(lower=0) + 1.0)
gpl = list(gse.gpls.values())[0]
col = next((c for c in SYMBOL_COLS if c in gpl.table.columns), None)
sym = gpl.table.set_index("ID")[col].astype(str).str.strip().replace({"": np.nan, "nan": np.nan})
return expr, keep, sym.dropna()
def ols_de(expr, design, disease_idx):
"""Vectorized per-gene OLS; return DE table (log_fc=disease coef, pvalue, qvalue)."""
X = design.to_numpy(dtype=float)
Y = expr.T.to_numpy(dtype=float) # samples x genes
n, p = X.shape
XtX_inv = np.linalg.pinv(X.T @ X)
B = XtX_inv @ X.T @ Y
resid = Y - X @ B
dof = n - p
sigma2 = (resid ** 2).sum(0) / dof
se = np.sqrt(sigma2 * XtX_inv[disease_idx, disease_idx])
t = B[disease_idx] / se
pval = 2 * tdist.sf(np.abs(t), dof)
out = pd.DataFrame({"log_fc": B[disease_idx], "pvalue": pval}, index=expr.index).dropna()
out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
return out
def make_signature(de, sym, expr, top_n=250):
de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
sig = de_sym[de_sym["qvalue"] < 0.05]
up = sig[sig["log_fc"] > 0].nsmallest(top_n, "qvalue").index.tolist()
down = sig[sig["log_fc"] < 0].nsmallest(top_n, "qvalue").index.tolist()
return up, down
def evaluate(label, up, down, lincs):
ranked = tau_calibrate(up, down, lincs, n_null=1000)
n = len(ranked)
top10, top25, half = int(n * .10), int(n * .25), n // 2
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
n_ov = len((set(up) | set(down)) & set(lincs.columns))
print(f"\n### {label}: {len(up)} up / {len(down)} down (query overlap {n_ov})")
print(f" hydroxyurea rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}")
print(f" L-glutamine rank {glut}/{n} (top {100*glut/n:.1f}%) top-25%? {glut <= top25}")
print(f" neg controls bottom-half: {n_bottom}/5 {negs}")
print(" top 8: " + ", ".join(
f"{name}[{r['inclusion_reason'][:3]}]" for name, r in ranked.nsmallest(8, "spec_z").iterrows()))
return ranked
def main():
expr, meta, sym = load_data()
print(f"loaded {expr.shape[1]} samples x {expr.shape[0]} probes; "
f"{int(meta.disease.sum())} SS / {int((meta.disease==0).sum())} AA")
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
base = pd.DataFrame({"intercept": 1.0, "disease": meta["disease"]}, index=meta.index)
adj = base.assign(wbc=meta["wbc"], rbc=meta["rbc"], mcv=meta["mcv"], age=meta["age"], sex_m=meta["sex_m"])
de_unadj = ols_de(expr, base, disease_idx=1)
de_adj = ols_de(expr, adj, disease_idx=1)
up_u, dn_u = make_signature(de_unadj, sym, expr)
up_a, dn_a = make_signature(de_adj, sym, expr)
# how much does adjustment change the gene set?
overlap = len((set(up_u) | set(dn_u)) & (set(up_a) | set(dn_a)))
print(f"\nsignature gene overlap unadjusted vs adjusted: {overlap}/{len(set(up_u)|set(dn_u))}")
evaluate("UNADJUSTED (GSE35007 only)", up_u, dn_u, lincs)
evaluate("COMPOSITION-ADJUSTED", up_a, dn_a, lincs)
if __name__ == "__main__":
main()

102
scripts/exp_genespace.py Normal file
View File

@@ -0,0 +1,102 @@
"""Experiment (v1.1): re-score on a larger LINCS gene space and re-run the recovery test.
v1 used only the 978 landmark genes (12% signature overlap). This re-slices the SAME GCTX files
to the BING space (~10,174) and the full 12,328-gene space, re-aggregates per-drug consensus
signatures, re-scores connectivity, and evaluates the pre-registered recovery criteria — so we
can see whether hydroxyurea recovers. Writes nothing to the committed v1 artifacts.
"""
from __future__ import annotations
import gzip
import io
import json
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import rank_drugs # noqa: E402
LINCS = Path("data/raw/lincs")
PROCESSED = Path("data/processed")
GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
def read_gz(name):
return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
def gene_ids_for_space(space: str):
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
if space == "bing":
g = g[g.pr_is_bing == 1]
# 'all' -> keep everything
ids = [str(x) for x in g.pr_gene_id]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
return ids, id_to_symbol
def extract(space, drug_names, gene_ids, id_to_symbol):
from cmapPy.pandasGEXpress.parse import parse
frames = []
for ph in (1, 2):
sig = read_gz(SIG_INFO[ph])
sig = sig[(sig.pert_type == "trt_cp") & (sig.pert_iname.isin(drug_names))]
if sig.empty:
continue
gct = parse(str(GCTX[ph]), rid=gene_ids, cid=sig.sig_id.tolist())
data = gct.data_df
s2d = dict(zip(sig.sig_id, sig.pert_iname))
frames.append(data.T.groupby(data.columns.map(s2d)).mean())
print(f" [{space}] phase {ph}: {sig.pert_iname.nunique()} drugs sliced", flush=True)
combined = pd.concat(frames).groupby(level=0).mean()
combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]
combined = combined.loc[:, ~combined.columns.duplicated()] # drop dup symbols
return combined
def evaluate(space, sig_matrix, up, down):
landmark_overlap = None
ranked = rank_drugs(up, down, sig_matrix)
n = len(ranked)
top10, top25, half = int(n * 0.10), int(n * 0.25), n // 2
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
glut_s = ranked.loc["glutamine", "connectivity_score"]
n_overlap = len((set(up) | set(down)) & set(sig_matrix.columns))
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
print(f"\n=== gene space: {space.upper()} ({sig_matrix.shape[1]} genes; query overlap {n_overlap}) ===")
print(f" hydroxyurea: rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}")
print(f" L-glutamine: rank {glut}/{n} (top {100*glut/n:.1f}%), WTCS={glut_s:.3f} top-25%? {glut <= top25}")
print(f" neg controls in bottom half: {n_bottom}/5 {negs}")
crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
print(f" OVERALL: {'PASS' if crit else 'FAIL'}")
print(" top 8:")
for name, r in ranked.nsmallest(8, "connectivity_score").iterrows():
print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} [{r['inclusion_reason']}]")
return ranked
def main():
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
up = [g["gene"] for g in sig["up_regulated"]]
down = [g["gene"] for g in sig["down_regulated"]]
drug_names = set(pd.read_csv(PROCESSED / "drug_set_v1.csv").pert_iname)
for space in ("bing", "all"):
print(f"\n>>> extracting {space} ...", flush=True)
ids, id2sym = gene_ids_for_space(space)
mat = extract(space, drug_names, ids, id2sym)
evaluate(space, mat, up, down)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,118 @@
"""Phase A: calibrate sickle connectivity against a REAL disease-signature reference population.
The v1.1 tau saturated because it used random gene-set nulls. Proper specificity calibration
asks: does a drug reverse SICKLE more than it reverses diseases in general? We download a library
of real disease signatures (Enrichr "Disease Signatures from GEO", up+down), compute each drug's
connectivity to every reference disease, and express its sickle connectivity as a z-score within
that per-drug reference distribution. Broad-effect drugs (reverse everything) -> z~0 -> down-ranked.
Re-runs the recovery test and compares to v1.1 (random-null). Writes nothing committed.
"""
from __future__ import annotations
from pathlib import Path
import numpy as np
import pandas as pd
import requests
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import _ks_connectivity # noqa: E402
PROCESSED = Path("data/processed")
RAW = Path("data/raw/disease_sigs")
ENRICHR = "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName={}"
LIBS = {"up": "Disease_Signatures_from_GEO_up_2014", "down": "Disease_Signatures_from_GEO_down_2014"}
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
def fetch_gmt(name: str) -> dict[str, list[str]]:
RAW.mkdir(parents=True, exist_ok=True)
path = RAW / f"{name}.gmt"
if not path.exists():
r = requests.get(ENRICHR.format(name), timeout=120)
r.raise_for_status()
path.write_text(r.text)
out = {}
for line in path.read_text().splitlines():
parts = line.split("\t")
if len(parts) < 3:
continue
term = parts[0]
genes = [g.split(",")[0].strip().upper() for g in parts[2:] if g.strip()]
out[term] = genes
print(f" {name}: {path.stat().st_size/1e6:.1f} MB, {len(out)} terms")
return out
def build_reference() -> list[dict]:
up = fetch_gmt(LIBS["up"])
down = fetch_gmt(LIBS["down"])
shared = set(up) & set(down)
refs = []
for term in shared:
if "sickle" in term.lower():
continue # exclude the target disease from its own reference population
refs.append({"name": term, "up": up[term], "down": down[term]})
print(f"reference disease signatures (paired up+down, sickle excluded): {len(refs)}")
return refs
def cols_for(genes, gene_to_col):
return np.array([gene_to_col[g] for g in set(genes) if g in gene_to_col], dtype=int)
def main():
import json
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
sk_up = [g["gene"] for g in sig["up_regulated"]]
sk_down = [g["gene"] for g in sig["down_regulated"]]
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
genes = list(lincs.columns)
gene_to_col = {g: i for i, g in enumerate(genes)}
n = len(genes)
R = lincs.rank(axis=1, ascending=False).to_numpy()
refs = build_reference()
# connectivity of every drug to every reference disease -> (n_drugs, n_refs)
C = np.empty((R.shape[0], len(refs)))
for j, d in enumerate(refs):
C[:, j] = _ks_connectivity(R, cols_for(d["up"], gene_to_col), cols_for(d["down"], gene_to_col), n)
# drop reference diseases with too few mapped genes (degenerate columns)
mapped = np.array([len(cols_for(d["up"], gene_to_col)) + len(cols_for(d["down"], gene_to_col)) for d in refs])
keep = mapped >= 10
C = C[:, keep]
print(f"usable reference diseases (>=10 mapped genes): {keep.sum()}")
real = _ks_connectivity(R, cols_for(sk_up, gene_to_col), cols_for(sk_down, gene_to_col), n)
ref_mean, ref_std = C.mean(axis=1), C.std(axis=1)
ref_std[ref_std == 0] = np.nan
spec_z = (real - ref_mean) / ref_std # negative = reverses sickle more than diseases-in-general
ranked = pd.DataFrame({"spec_z": spec_z, "connectivity": real}, index=lincs.index).sort_values("spec_z")
ranked.insert(0, "rank", range(1, len(ranked) + 1))
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
ranked = ranked.join(profiles[["inclusion_reason"]])
N = len(ranked)
top10, top25, half = int(N * .10), int(N * .25), N // 2
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
n_bottom = sum(r > half for r in negs.values())
print("\n==== RECOVERY TEST (reference-calibrated tau) ====")
print(f" hydroxyurea rank {hu}/{N} (top {100*hu/N:.1f}%) top-10%? {hu <= top10} z={ranked.loc['hydroxyurea','spec_z']:.2f}")
print(f" L-glutamine rank {glut}/{N} (top {100*glut/N:.1f}%) top-25%? {glut <= top25}")
print(f" neg controls bottom-half: {n_bottom}/5 {negs}")
crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
print(f" OVERALL: {'PASS' if crit else 'FAIL'}")
print("\n top 12 by reference-calibrated z:")
for name, r in ranked.nsmallest(12, "spec_z").iterrows():
print(f" {int(r['rank']):2d} {name:20s} z={r['spec_z']:6.2f} [{r['inclusion_reason']}]")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,137 @@
"""Phase D: supervised cross-disease repurposing — can labels break the specificity ceiling?
Connectivity alone can't tell therapeutic from coincidental reversal. A supervised model trained
on known drug-disease pairs CAN learn that pattern — given features that expose drug "broadness"
(a drug that reverses everything is non-specific). We train on 839 GEO disease signatures with
Repurposing-Hub indications as labels, evaluate with disease-grouped CV, then apply to HELD-OUT
sickle and check whether the coincidental reversers (norethindrone, ciprofloxacin) finally drop.
Baseline = rank by raw connectivity (the single conn feature). Win = model down-ranks the
negative controls vs baseline while keeping hydroxyurea high.
"""
from __future__ import annotations
import json
import re
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.metrics import roc_auc_score
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import _ks_connectivity # noqa: E402
PROCESSED = Path("data/processed")
SIGS = Path("data/raw/disease_sigs")
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
ID_RE = re.compile(r"\s*(c\d{6,}|doid[-\s]?\d+|gse\d+|umls\S+)\s*", re.I)
def clean_disease(term: str) -> str:
return ID_RE.sub(" ", term).strip().lower()
def load_disease_sigs():
def parse(name):
out = {}
for line in (SIGS / name).read_text().splitlines():
p = line.split("\t")
if len(p) < 3:
continue
out[p[0]] = [g.split(",")[0].strip().upper() for g in p[2:] if g.strip()]
return out
up = parse("Disease_Perturbations_from_GEO_up.gmt")
down = parse("Disease_Perturbations_from_GEO_down.gmt")
terms = sorted(set(up) & set(down))
return [{"term": t, "disease": clean_disease(t), "up": up[t], "down": down[t]} for t in terms]
def cols_for(genes, g2c):
return np.array([g2c[g] for g in set(genes) if g in g2c], dtype=int)
def featurize(conn_col, C):
"""Per-(drug,disease) features for one disease column conn_col, given full matrix C."""
drug_mean, drug_std = C.mean(1), C.std(1)
drug_min = C.min(1)
broad = (C < -0.10).mean(1) # fraction of diseases this drug strongly reverses
dmean, dstd = conn_col.mean(), conn_col.std() or 1.0
return np.column_stack([
conn_col, drug_mean, drug_std, drug_min, broad,
np.full_like(conn_col, dmean),
(conn_col - drug_mean) / np.where(drug_std == 0, 1, drug_std), # specificity within drug
(conn_col - dmean) / dstd, # specificity within disease
])
FEATS = ["conn", "drug_mean", "drug_std", "drug_min", "broadness",
"disease_mean", "z_within_drug", "z_within_disease"]
def main():
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
drugs = list(lincs.index)
g2c = {g: i for i, g in enumerate(lincs.columns)}
n = len(lincs.columns)
R = lincs.rank(axis=1, ascending=False).to_numpy()
refs = [d for d in load_disease_sigs()
if len(cols_for(d["up"], g2c)) + len(cols_for(d["down"], g2c)) >= 10]
C = np.column_stack([_ks_connectivity(R, cols_for(d["up"], g2c), cols_for(d["down"], g2c), n) for d in refs])
print(f"{len(drugs)} drugs x {len(refs)} disease signatures; connectivity matrix {C.shape}")
# labels from Repurposing Hub indications
hub = pd.read_csv("data/raw/repurposing_drugs.txt", sep="\t", comment="!", low_memory=False)
ind = {r.pert_iname: [i.strip().lower() for i in re.split(r"[|,]", r.indication) if len(i.strip()) > 3]
for r in hub.itertuples() if isinstance(r.indication, str)}
drug_idx = {d: i for i, d in enumerate(drugs)}
X_rows, y, grp = [], [], []
for j, d in enumerate(refs):
feats = featurize(C[:, j], C)
dz = d["disease"]
for i, drug in enumerate(drugs):
inds = ind.get(drug, [])
label = int(any(dz == k or (len(dz) > 4 and dz in k) or (len(k) > 4 and k in dz) for k in inds))
X_rows.append(feats[i]); y.append(label); grp.append(j)
X = np.array(X_rows); y = np.array(y); grp = np.array(grp)
print(f"pairs: {len(y)}, positives: {y.sum()} ({100*y.mean():.2f}%)")
# disease-grouped CV (generalize to unseen diseases)
clf = GradientBoostingClassifier(random_state=0)
proba = cross_val_predict(clf, X, y, cv=GroupKFold(5), groups=grp, method="predict_proba")[:, 1]
print(f"disease-grouped CV AUC: {roc_auc_score(y, proba):.3f} (conn-only AUC: {roc_auc_score(y, X[:,0]*-1):.3f})")
clf.fit(X, y)
print("feature importances: " + ", ".join(f"{f}={imp:.2f}" for f, imp in
sorted(zip(FEATS, clf.feature_importances_), key=lambda t: -t[1])))
# apply to HELD-OUT sickle
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
sk = _ks_connectivity(R, cols_for([g["gene"] for g in sig["up_regulated"]], g2c),
cols_for([g["gene"] for g in sig["down_regulated"]], g2c), n)
Xs = featurize(sk, C)
p_sickle = clf.predict_proba(Xs)[:, 1]
res = pd.DataFrame({"model_p": p_sickle, "conn": sk}, index=drugs)
res["model_rank"] = res["model_p"].rank(ascending=False).astype(int)
res["conn_rank"] = res["conn"].rank(ascending=True).astype(int) # most negative = best
N = len(res)
print(f"\n{'drug':14s} {'model_rank':>10s} {'conn_rank(base)':>16s}")
for d in ["hydroxyurea", "glutamine"] + NEG5 + ["norethindrone", "ciprofloxacin"]:
if d in res.index:
r = res.loc[d]
print(f" {d:14s} {int(r['model_rank']):6d}/{N} {int(r['conn_rank']):6d}/{N}")
nb_model = sum(res.loc[d, "model_rank"] > N/2 for d in NEG5 if d in res.index)
nb_conn = sum(res.loc[d, "conn_rank"] > N/2 for d in NEG5 if d in res.index)
print(f"\nneg controls bottom-half: model {nb_model}/5 vs baseline {nb_conn}/5")
print("model top 10:", ", ".join(res.sort_values('model_p', ascending=False).head(10).index))
if __name__ == "__main__":
main()

139
scripts/week1_explore.py Normal file
View File

@@ -0,0 +1,139 @@
"""Week 1 exploratory run: build per-study DE + concordance for the whole-blood pair.
Read-only decision aid (NOT the final pipeline): downloads GSE35007 + GSE16728, runs DE on
each, collapses to HGNC symbols, computes concordance, and reports whether the sanity-gate
genes survive — so we can choose Tier A (concordance pair) vs Tier B (GSE35007 alone) with
real numbers. Intermediate DE tables are cached under data/raw/geo/.
"""
from __future__ import annotations
import sys
from pathlib import Path
import GEOparse
import numpy as np
import pandas as pd
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.disease import ( # noqa: E402
DISEASE_LABEL,
HEALTHY_LABEL,
collapse_probes_to_symbols,
compute_differential_expression,
concordance_filter,
)
GEO_DIR = Path("data/raw/geo")
GEO_DIR.mkdir(parents=True, exist_ok=True)
SANITY_GENES = ["HBG1", "HBG2", "ALAS2", "CA1", "AHSP", "SLC4A1", "HBB", "ERAF"]
SYMBOL_COL_CANDIDATES = [
"Symbol", "ILMN_Gene", "Gene Symbol", "GENE_SYMBOL", "Gene symbol",
"gene_assignment", "GeneSymbol",
]
def log(msg: str) -> None:
print(msg, flush=True)
def get_symbol_map(gpl) -> pd.Series:
tbl = gpl.table
col = next((c for c in SYMBOL_COL_CANDIDATES if c in tbl.columns), None)
if col is None:
raise RuntimeError(f"No symbol column in GPL {gpl.name}; columns={list(tbl.columns)[:15]}")
s = tbl.set_index("ID")[col].astype(str).str.strip()
# gene_assignment fields are '// SYMBOL // ...'; take the first real token.
if col == "gene_assignment":
s = s.str.split("//").str[1].str.strip()
s = s.replace({"": np.nan, "nan": np.nan, "---": np.nan})
return s.dropna()
def build_expression(gse, sample_ids: list[str]) -> pd.DataFrame:
cols = {}
for gsm_id in sample_ids:
tbl = gse.gsms[gsm_id].table
cols[gsm_id] = tbl.set_index("ID_REF")["VALUE"]
return pd.DataFrame(cols)
def char_value(gsm, key: str) -> str | None:
for c in gsm.metadata.get("characteristics_ch1", []):
if c.lower().startswith(key.lower()):
return c.split(":", 1)[1].strip()
return None
def run_study(gse, groups: dict[str, str], label: str) -> pd.DataFrame:
"""groups: gsm_id -> 'disease'/'healthy'. Returns symbol-level DE table."""
sample_ids = list(groups.keys())
log(f"[{label}] {len(sample_ids)} samples "
f"({sum(v==DISEASE_LABEL for v in groups.values())} disease / "
f"{sum(v==HEALTHY_LABEL for v in groups.values())} healthy)")
expr = build_expression(gse, sample_ids).apply(pd.to_numeric, errors="coerce").dropna(how="all")
sample_groups = pd.Series(groups)
de = compute_differential_expression(expr, sample_groups, method="welch")
gpl = list(gse.gpls.values())[0]
sym = get_symbol_map(gpl)
de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
log(f"[{label}] probes->symbols: {len(de)} probes -> {len(de_sym)} symbols; "
f"sig q<0.05: {(de_sym['qvalue']<0.05).sum()}")
de_sym.to_parquet(GEO_DIR / f"de_{label}.parquet")
return de_sym
def main() -> None:
# --- GSE35007: whole blood, hb phenotype SS (disease) vs AA (healthy) ---
log("downloading GSE35007 ...")
gse1 = GEOparse.get_GEO(geo="GSE35007", destdir=str(GEO_DIR), silent=True)
g1 = {}
for gsm_id, gsm in gse1.gsms.items():
hb = char_value(gsm, "hb phenotype")
if hb == "SS":
g1[gsm_id] = DISEASE_LABEL
elif hb == "AA":
g1[gsm_id] = HEALTHY_LABEL
de1 = run_study(gse1, g1, "GSE35007")
# --- GSE16728: whole blood only (PAXgene preps; exclude PBMC), patient vs control ---
log("downloading GSE16728 ...")
gse2 = GEOparse.get_GEO(geo="GSE16728", destdir=str(GEO_DIR), silent=True)
g2 = {}
for gsm_id, gsm in gse2.gsms.items():
prep = char_value(gsm, "rna prep") or ""
subj = char_value(gsm, "subject") or ""
if "PBMC" in prep:
continue # whole-blood only
if subj.lower().startswith("sickle"):
g2[gsm_id] = DISEASE_LABEL
elif subj.lower().startswith("control"):
g2[gsm_id] = HEALTHY_LABEL
de2 = run_study(gse2, g2, "GSE16728")
# --- Concordance ---
keep, summary = concordance_filter(de1, de2)
log("\n==== CONCORDANCE (whole-blood pair) ====")
log(f"genes tested in both: {summary.n_genes_tested}")
log(f"concordant (q<0.05 both + same sign): {summary.n_concordant} "
f"(up={summary.n_up}, down={summary.n_down})")
keep.to_parquet(GEO_DIR / "concordant_genes.parquet")
log("\n==== SANITY-GATE GENES ====")
for g in SANITY_GENES:
in1 = g in de1.index
in2 = g in de2.index
row1 = de1.loc[g] if in1 else None
row2 = de2.loc[g] if in2 else None
concord = "YES" if g in keep.index else "no"
d1 = f"lfc={row1['log_fc']:+.2f},q={row1['qvalue']:.1e}" if in1 else "absent"
d2 = f"lfc={row2['log_fc']:+.2f},q={row2['qvalue']:.1e}" if in2 else "absent"
log(f" {g:7s} | GSE35007 {d1:28s} | GSE16728 {d2:28s} | concordant={concord}")
log("\n==== TOP 15 CONCORDANT (by worst-case q) ====")
log(keep.head(15).to_string())
if __name__ == "__main__":
main()

121
scripts/week1_finalize.py Normal file
View File

@@ -0,0 +1,121 @@
"""Finalize the Week 1 sickle cell signature (Tier A, whole-blood concordance pair).
Reads the cached concordant gene table, maps symbols -> Entrez/Ensembl via mygene, attaches
two-study provenance + concordance summary, and persists
data/processed/sickle_cell_signature_v1.json.
Tier note: GSE16728 is exactly 10/group, which fails assign_tier()'s strict n>10. Tier A is
assigned explicitly here on the basis of cross-study, cross-platform, cross-population
concordance (the founder's decision), with the n=10 / prep-merge caveat documented in the
tier rationale and limitations.
"""
from __future__ import annotations
import sys
from pathlib import Path
import mygene
import pandas as pd
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.disease import ( # noqa: E402
ConcordanceSummary,
SignatureProvenance,
StudyProvenance,
build_signature,
persist_signature,
)
from src.provenance import ConfidenceTier # noqa: E402
CREATED_DATE = "2026-06-23"
GEO_DIR = Path("data/raw/geo")
def map_ids(symbols: list[str]) -> pd.DataFrame:
mg = mygene.MyGeneInfo()
res = mg.querymany(
symbols, scopes="symbol", fields="entrezgene,ensembl.gene",
species="human", as_dataframe=True, verbose=False,
)
res = res[~res.index.duplicated(keep="first")]
def _ensembl(v):
if isinstance(v, list) and v:
v = v[0]
if isinstance(v, dict):
return v.get("gene")
return v
id_map = pd.DataFrame(index=res.index)
id_map["entrez_id"] = res["entrezgene"] if "entrezgene" in res.columns else None
ens_col = "ensembl.gene" if "ensembl.gene" in res.columns else ("ensembl" if "ensembl" in res.columns else None)
id_map["ensembl_id"] = res[ens_col].map(_ensembl) if ens_col else None
return id_map
def main() -> None:
concordant = pd.read_parquet(GEO_DIR / "concordant_genes.parquet")
print(f"loaded {len(concordant)} concordant genes "
f"({(concordant['log_fc']>0).sum()} up / {(concordant['log_fc']<0).sum()} down)")
id_map = map_ids(list(concordant.index))
print(f"mygene mapped {id_map['entrez_id'].notna().sum()}/{len(id_map)} symbols to Entrez")
provenance = SignatureProvenance(
studies=[
StudyProvenance(
geo_accession="GSE35007", n_disease=190, n_healthy=12,
platform="Illumina HumanHT-12 V4 (GPL10558)",
tissue="whole blood", method="welch",
),
StudyProvenance(
geo_accession="GSE16728", n_disease=10, n_healthy=10,
platform="Affymetrix HG-U133 Plus 2.0 (GPL570)",
tissue="whole blood (PAXgene; globin-reduced + total RNA preps merged)",
method="welch",
),
],
concordance=ConcordanceSummary(
n_genes_tested=16208, n_concordant=671, n_up=444, n_down=227,
),
created_date=CREATED_DATE,
)
tier_rationale = (
"Measured RNA expression concordant across two independent peer-reviewed whole-blood "
"studies on different platforms (Illumina GPL10558, Affymetrix GPL570) and populations "
"(pediatric West-African GSE35007; adult GSE16728). Tier A rests on cross-study / "
"cross-platform concordance. Caveat: GSE16728 is exactly 10/group (two PAXgene preps "
"merged), which does not meet the strict n>10 per-study threshold on its own."
)
limitations = [
"Whole-blood expression is partly driven by cell-composition differences "
"(reticulocytosis in sickle patients), not pure disease-state regulation; the "
"signature is dominated by erythroid markers (CA1, AHSP, SLC4A1). v2 needs cell-type "
"deconvolution.",
"Fetal-hemoglobin genes (HBG1/HBG2) are NOT captured: flat in GSE35007 and removed by "
"GSE16728's globin-depleted RNA prep. The signature therefore does not directly encode "
"the HbF axis that hydroxyurea acts on, which may weaken the hydroxyurea recovery test.",
"GSE35007 is highly imbalanced (SS n=190 vs AA n=12); the healthy arm is small.",
"GSE16728 whole-blood arm is small (10/group) and merges two RNA-prep batches.",
"Cross-platform probe->symbol collapse (keep max-mean-expression probe) loses "
"isoform-level resolution and drops genes absent from either platform.",
]
signature = build_signature(
concordant, provenance,
tier=ConfidenceTier.A,
tier_rationale=tier_rationale,
limitations=limitations,
id_map=id_map,
top_n=250,
)
path = persist_signature(signature)
print(f"wrote {path}")
print(f"signature: {len(signature.up_regulated)} up / {len(signature.down_regulated)} down, "
f"tier {signature.confidence_tier.value}")
if __name__ == "__main__":
main()

91
scripts/week2_assemble.py Normal file
View File

@@ -0,0 +1,91 @@
"""Week 2, task 4: assemble drug_profiles_v1.parquet (PLAN §6).
Joins the curated drug set + ChEMBL enrichment + LINCS consensus signatures into one profile
table. Each drug carries a confidence tier: LINCS is a single source, so signature-backed drugs
are Tier B at best (assign_tier with single_source=True); drugs with no signature are Tier C and
marked not-scored (not dropped silently — PLAN §6 Week 3 task 2).
The 978-gene signature order is the column order of lincs_signatures_v1.parquet (landmark
symbols); each profile's `lincs_signature` is that vector (or null).
"""
from __future__ import annotations
import ast
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.drugs import persist_drug_profiles # noqa: E402
from src.provenance import ConfidenceTier, assign_tier # noqa: E402
PROCESSED = Path("data/processed")
DRUG_SET = PROCESSED / "drug_set_v1.csv"
CHEMBL = Path("data/raw/chembl/chembl_enrichment.parquet")
LINCS_SIG = PROCESSED / "lincs_signatures_v1.parquet"
def main() -> None:
drugs = pd.read_csv(DRUG_SET)
chembl = pd.read_parquet(CHEMBL)[
["pert_iname", "chembl_id", "pref_name", "smiles", "mechanism_of_action", "targets"]
]
sigs = pd.read_parquet(LINCS_SIG) # rows=pert_iname, cols=978 landmark symbols
gene_order = list(sigs.columns)
df = drugs.merge(chembl, on="pert_iname", how="left")
rows = []
for r in df.itertuples():
has_sig = r.pert_iname in sigs.index
vector = sigs.loc[r.pert_iname].tolist() if has_sig else None
# LINCS = single source => Tier B max when measured; no signature => Tier C.
tier = assign_tier(
is_measured=has_sig, n_per_group=None, peer_reviewed=True, single_source=True
) if has_sig else ConfidenceTier.C
targets = r.targets
if isinstance(targets, str):
try:
targets = ast.literal_eval(targets)
except (ValueError, SyntaxError):
targets = []
elif hasattr(targets, "tolist"): # numpy ndarray from parquet round-trip
targets = targets.tolist()
elif targets is None or (not isinstance(targets, (list, tuple))):
targets = []
rows.append({
"name": r.pert_iname,
"chembl_id": r.chembl_id if pd.notna(r.chembl_id) else None,
"pref_name": r.pref_name if pd.notna(r.pref_name) else None,
"inchikey": r.inchi_key if pd.notna(r.inchi_key) else None,
"smiles": r.smiles if pd.notna(r.smiles) else None,
"targets": list(targets),
"mechanism_of_action": r.mechanism_of_action if pd.notna(r.mechanism_of_action) else None,
"inclusion_reason": r.inclusion_reason,
"lincs_phase": r.phase,
"scored": has_sig,
"lincs_signature": vector,
"confidence_tier": tier.value,
})
profiles = pd.DataFrame(rows)
# Persist the gene order alongside, so Week-3 scoring can align the vectors.
(PROCESSED / "lincs_gene_order.txt").write_text("\n".join(gene_order))
path = persist_drug_profiles(profiles)
print(f"drug_profiles_v1: {len(profiles)} drugs")
print(f" scored (have LINCS signature): {profiles['scored'].sum()}")
print(f" not scored: {(~profiles['scored']).sum()}")
print(" by inclusion reason (scored rate):")
print(profiles.groupby("inclusion_reason")["scored"].agg(["sum", "count"]).to_string())
print(" tier split:", profiles["confidence_tier"].value_counts().to_dict())
for gt in ["hydroxyurea", "glutamine"]:
row = profiles[profiles["name"] == gt]
print(f" ground truth '{gt}': scored={bool(row['scored'].iloc[0]) if len(row) else 'ABSENT'}")
print(f"wrote {path}")
if __name__ == "__main__":
main()

99
scripts/week2_chembl.py Normal file
View File

@@ -0,0 +1,99 @@
"""Week 2, task 2: enrich the drug set with ChEMBL structure/target/mechanism data.
Drugs are matched to ChEMBL by the InChIKey already carried from LINCS pert_info (reliable),
then mechanism-of-action and target names are pulled. Compounds absent from ChEMBL (many
research/tool compounds in the random arm) keep null ChEMBL fields — they still have LINCS
signatures for scoring; only the Week-3 mechanistic prior won't apply. Output cached to
data/raw/chembl/chembl_enrichment.parquet.
"""
from __future__ import annotations
from pathlib import Path
import pandas as pd
from chembl_webresource_client.new_client import new_client
DRUG_SET = Path("data/processed/drug_set_v1.csv")
OUT = Path("data/raw/chembl/chembl_enrichment.parquet")
BATCH = 40
def chunks(seq, n):
for i in range(0, len(seq), n):
yield seq[i:i + n]
def main() -> None:
drugs = pd.read_csv(DRUG_SET)
inchikeys = sorted({k for k in drugs["inchi_key"].dropna() if isinstance(k, str) and len(k) > 10})
print(f"{len(drugs)} drugs; {len(inchikeys)} usable InChIKeys to resolve")
molecule = new_client.molecule
mechanism = new_client.mechanism
target = new_client.target
# 1) InChIKey -> ChEMBL molecule (id, name, smiles)
mol_rows = []
for i, batch in enumerate(chunks(inchikeys, BATCH)):
res = molecule.filter(molecule_structures__standard_inchi_key__in=batch).only(
["molecule_chembl_id", "pref_name", "molecule_structures"])
for m in res:
ms = m.get("molecule_structures") or {}
mol_rows.append({
"chembl_id": m["molecule_chembl_id"],
"pref_name": m.get("pref_name"),
"smiles": ms.get("canonical_smiles"),
"inchi_key": ms.get("standard_inchi_key"),
})
print(f" molecules batch {i+1}: cumulative {len(mol_rows)} hits", flush=True)
mols = pd.DataFrame(mol_rows).drop_duplicates("inchi_key")
chembl_ids = sorted(mols["chembl_id"].unique())
print(f"resolved {len(mols)} molecules -> {len(chembl_ids)} ChEMBL ids")
# 2) ChEMBL id -> mechanism of action + target ids
mech_rows = []
for batch in chunks(chembl_ids, BATCH):
for m in mechanism.filter(molecule_chembl_id__in=batch).only(
["molecule_chembl_id", "mechanism_of_action", "target_chembl_id"]):
mech_rows.append(m)
mech = pd.DataFrame(mech_rows)
print(f"mechanism records: {len(mech)}")
# 3) target id -> name
tgt_names = {}
if not mech.empty:
tids = sorted({t for t in mech["target_chembl_id"].dropna().unique()})
for batch in chunks(tids, BATCH):
for t in target.filter(target_chembl_id__in=batch).only(["target_chembl_id", "pref_name"]):
tgt_names[t["target_chembl_id"]] = t.get("pref_name")
# aggregate mechanism/targets per molecule
def agg(df):
moa = sorted({x for x in df["mechanism_of_action"].dropna()})
tns = sorted({tgt_names.get(t) for t in df["target_chembl_id"].dropna() if tgt_names.get(t)})
return pd.Series({"mechanism_of_action": "; ".join(moa) or None, "targets": tns})
if not mech.empty:
per_mol = mech.groupby("molecule_chembl_id").apply(agg, include_groups=False).reset_index()
per_mol = per_mol.rename(columns={"molecule_chembl_id": "chembl_id"})
mols = mols.merge(per_mol, on="chembl_id", how="left")
else:
mols["mechanism_of_action"] = None
mols["targets"] = None
# join back to the drug set on inchi_key
enriched = drugs.merge(mols, on="inchi_key", how="left", suffixes=("", "_chembl"))
OUT.parent.mkdir(parents=True, exist_ok=True)
enriched.to_parquet(OUT, index=False)
n_resolved = enriched["chembl_id"].notna().sum()
n_moa = enriched["mechanism_of_action"].notna().sum()
print(f"\nenriched {len(enriched)} drugs: {n_resolved} matched ChEMBL, {n_moa} have MoA")
print(f"by reason, ChEMBL match rate:")
print(enriched.assign(matched=enriched["chembl_id"].notna()).groupby("inclusion_reason")["matched"].mean().round(2).to_string())
print(f"wrote {OUT}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,131 @@
"""Week 2, task 1: curate the deliberately-composed ~300-drug set (PLAN §6).
Composition: 2 ground-truth + ~50 related-mechanism + ~50 negative controls + ~200 random.
The universe is restricted to compounds that actually have a LINCS Level-5 signature (in
Phase I and/or Phase II), so every curated drug is scorable. Output: drug_set_v1.csv.
"""
from __future__ import annotations
import gzip
import io
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src import RANDOM_SEED # noqa: E402
LINCS = Path("data/raw/lincs")
OUT = Path("data/processed/drug_set_v1.csv")
GROUND_TRUTH = ["hydroxyurea", "glutamine"] # glutamine == L-glutamine in LINCS
# Curated by mechanism (PLAN §6). Intersected with the LINCS catalog below, so misses are
# silently dropped — we keep whatever actually has a signature.
RELATED_MECHANISM = [
# HbF inducers / epigenetic
"decitabine", "azacitidine", "vorinostat", "panobinostat", "romidepsin", "entinostat",
"mocetinostat", "belinostat", "pomalidomide", "lenalidomide", "thalidomide", "apicidin",
"trichostatin-a", "scriptaid", "valproic-acid",
# NO / vascular
"sildenafil", "tadalafil", "nitroprusside",
# antioxidants
"n-acetyl-cysteine", "resveratrol", "curcumin", "quercetin", "sulforaphane",
# anti-inflammatory studied in SCD
"dexamethasone", "prednisolone", "hydrocortisone", "ibuprofen", "indomethacin",
"sulfasalazine", "montelukast", "aspirin",
# iron / heme / SCD-adjacent
"hemin", "deferoxamine", "deferasirox", "simvastatin", "atorvastatin", "ticagrelor",
]
NEGATIVE_CONTROL = [
# antifungals
"fluconazole", "ketoconazole", "itraconazole", "clotrimazole", "terbinafine", "miconazole",
# antihistamines
"loratadine", "cetirizine", "fexofenadine", "diphenhydramine", "chlorpheniramine",
"astemizole",
# antibiotics
"amoxicillin", "ciprofloxacin", "doxycycline", "trimethoprim", "azithromycin", "tetracycline",
"nitrofurantoin",
# hormones / contraceptives
"levonorgestrel", "ethinyl-estradiol", "norethindrone", "medroxyprogesterone-acetate",
# misc unrelated
"omeprazole", "ranitidine", "loperamide", "caffeine", "acetaminophen", "lidocaine",
]
# Fill the random sample so the total set is ~300 (the denominator the pre-registered
# recovery-test thresholds assume: "top 30 of 300"). Curated mechanism/control drugs are
# capped by what LINCS actually contains, so the random arm absorbs the remainder.
TARGET_TOTAL = 300
def load_catalog() -> pd.DataFrame:
"""Compounds with >=1 Level-5 signature, annotated with phase + inchi/smiles."""
def read_gz(fn, **kw):
return pd.read_csv(io.BytesIO(gzip.decompress(Path(fn).read_bytes())), sep="\t", **kw)
sig1 = read_gz(LINCS / "GSE92742_sig_info.txt.gz", low_memory=False)
sig2 = read_gz(LINCS / "GSE70138_sig_info.txt.gz", low_memory=False)
cp1 = set(sig1[sig1["pert_type"] == "trt_cp"]["pert_iname"])
cp2 = set(sig2[sig2["pert_type"] == "trt_cp"]["pert_iname"])
pert1 = read_gz(LINCS / "GSE92742_pert_info.txt.gz", low_memory=False)
pert2 = read_gz(LINCS / "GSE70138_pert_info.txt.gz", low_memory=False)
info = pd.concat([pert1, pert2], ignore_index=True)
info = info[info["pert_type"] == "trt_cp"].drop_duplicates("pert_iname", keep="first")
info = info.set_index("pert_iname")
names = cp1 | cp2
rows = []
for nm in names:
phase = "both" if nm in cp1 and nm in cp2 else ("P1" if nm in cp1 else "P2")
rec = info.loc[nm] if nm in info.index else None
rows.append({
"pert_iname": nm,
"phase": phase,
"pert_id": rec["pert_id"] if rec is not None else None,
"inchi_key": rec["inchi_key"] if rec is not None else None,
"canonical_smiles": rec["canonical_smiles"] if rec is not None else None,
})
return pd.DataFrame(rows).set_index("pert_iname")
def pick(catalog: pd.DataFrame, names: list[str], reason: str) -> pd.DataFrame:
present = [n for n in names if n in catalog.index]
missing = [n for n in names if n not in catalog.index]
if missing:
print(f" [{reason}] {len(present)}/{len(names)} in LINCS; dropped: {missing}")
out = catalog.loc[present].copy()
out["inclusion_reason"] = reason
return out
def main() -> None:
catalog = load_catalog()
print(f"LINCS scorable compound universe: {len(catalog)}")
gt = pick(catalog, GROUND_TRUTH, "ground_truth")
rel = pick(catalog, RELATED_MECHANISM, "related_mechanism")
neg = pick(catalog, NEGATIVE_CONTROL, "negative_control")
chosen = pd.concat([gt, rel, neg])
remaining = catalog.drop(index=chosen.index)
n_random = TARGET_TOTAL - len(chosen)
rand = remaining.sample(n=n_random, random_state=RANDOM_SEED).copy()
rand["inclusion_reason"] = "general_sample"
drug_set = pd.concat([gt, rel, neg, rand]).reset_index()
OUT.parent.mkdir(parents=True, exist_ok=True)
drug_set.to_csv(OUT, index=False)
print(f"\ndrug_set_v1.csv: {len(drug_set)} drugs")
print(drug_set["inclusion_reason"].value_counts().to_string())
print(f"phase split:\n{drug_set['phase'].value_counts().to_string()}")
print(f"wrote {OUT}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,107 @@
"""Week 2, task 3: extract per-drug LINCS L1000 consensus signatures.
For each drug in the set, slice its Level-5 MODZ signatures (978 landmark genes x its sig_ids)
out of the big GCTX via cmapPy, then aggregate to ONE consensus 978-vector per drug by mean
across its sig_ids (the "MODZ aggregation across cell lines/replicates" of PLAN §6). hydroxyurea
lives in Phase II, L-glutamine in Phase I, so both phases are processed and merged.
Output: data/processed/lincs_signatures_v1.parquet (rows = pert_iname, cols = 978 landmark
gene symbols, values = mean MODZ z-score).
Usage:
python scripts/week2_lincs_extract.py --phase 2 # test on Phase II (already downloaded)
python scripts/week2_lincs_extract.py # both phases (needs Phase I gunzipped)
"""
from __future__ import annotations
import argparse
import gzip
import io
from pathlib import Path
import pandas as pd
from cmapPy.pandasGEXpress.parse import parse
LINCS = Path("data/raw/lincs")
DRUG_SET = Path("data/processed/drug_set_v1.csv")
OUT = Path("data/processed/lincs_signatures_v1.parquet")
GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
def read_gz_tsv(name: str) -> pd.DataFrame:
return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
"""Gene row-ids + id->symbol map for the scored gene space.
v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks.
This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into
scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier).
"""
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
ids = [str(x) for x in g["pr_gene_id"]]
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
return ids, id_to_symbol
def extract_phase(phase: int, drug_names: set[str], landmark_ids: list[str]) -> pd.DataFrame:
"""Return DataFrame: rows=pert_iname, cols=landmark gene_id (str), one mean vector per drug."""
sig = read_gz_tsv(SIG_INFO[phase])
sig = sig[(sig["pert_type"] == "trt_cp") & (sig["pert_iname"].isin(drug_names))]
if sig.empty:
return pd.DataFrame()
sig_ids = sig["sig_id"].tolist()
print(f"[phase {phase}] {sig['pert_iname'].nunique()} drugs, {len(sig_ids)} signatures to slice", flush=True)
gctoo = parse(str(GCTX[phase]), rid=landmark_ids, cid=sig_ids)
data = gctoo.data_df # rows=gene_id, cols=sig_id
sig_to_drug = dict(zip(sig["sig_id"], sig["pert_iname"]))
# mean across each drug's sig_ids -> one consensus vector per drug
per_drug = data.T.groupby(data.columns.map(sig_to_drug)).mean()
print(f"[phase {phase}] aggregated to {len(per_drug)} drug consensus vectors", flush=True)
return per_drug # rows=pert_iname, cols=gene_id
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--phase", type=int, choices=[1, 2], default=None, help="single phase (test)")
args = ap.parse_args()
drugs = pd.read_csv(DRUG_SET)
drug_names = set(drugs["pert_iname"])
landmark_ids, id_to_symbol = landmark_ids_and_symbols()
phases = [args.phase] if args.phase else [1, 2]
frames = []
for ph in phases:
if not GCTX[ph].exists():
print(f"[phase {ph}] {GCTX[ph]} missing — skipping")
continue
frames.append(extract_phase(ph, drug_names, landmark_ids))
frames = [f for f in frames if not f.empty]
if not frames:
print("no signatures extracted")
return
# A drug present in both phases: average the two phase consensus vectors.
combined = pd.concat(frames).groupby(level=0).mean()
combined.columns = [id_to_symbol.get(c, c) for c in combined.columns] # gene_id -> symbol
covered = sorted(set(combined.index))
missing = sorted(drug_names - set(covered))
print(f"\nsignatures extracted for {len(covered)}/{len(drug_names)} drugs")
for gt in ["hydroxyurea", "glutamine"]:
print(f" ground truth '{gt}': {'OK' if gt in covered else 'MISSING'}")
if args.phase is None:
combined.to_parquet(OUT)
print(f"wrote {OUT} ({combined.shape[0]} drugs x {combined.shape[1]} landmark genes)")
if missing:
print(f"{len(missing)} drugs without signature (will be marked not-scored in Week 3)")
if __name__ == "__main__":
main()

82
scripts/week3_scoring.py Normal file
View File

@@ -0,0 +1,82 @@
"""Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration.
Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of
random queries) — this calibrates out broad-effect drugs that connect to random signatures too,
the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column,
and a secondary ranking blends tau with the sickle-pathway mechanistic prior.
Output: data/results/ranked_candidates_v1.csv.
"""
from __future__ import annotations
import json
from pathlib import Path
import pandas as pd
import sys
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from src.scoring import ( # noqa: E402
connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate,
)
PROCESSED = Path("data/processed")
N_NULL = 1000
PRIOR_LAMBDA = 0.5 # spec_z credit per matched sickle pathway, for the blended ranking
def main() -> None:
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
up = [g["gene"] for g in sig["up_regulated"]]
down = [g["gene"] for g in sig["down_regulated"]]
sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 12,328 genes
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
n_up = len(set(up) & set(sig_matrix.columns))
n_down = len(set(down) & set(sig_matrix.columns))
print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}")
# primary: tau calibration
print(f"computing tau over {N_NULL} random-query permutations ...", flush=True)
ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL)
# reference: weighted connectivity score (WTCS) + NCS
wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index},
name="connectivity_score")
ranked["connectivity_score"] = wtcs
ranked["normalized_score"] = normalize_scores(wtcs)
# metadata + mechanistic prior
ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
ranked["mechanistic_prior"] = ranked["targets"].apply(
lambda t: mechanistic_prior(list(t) if t is not None else []))
ranked["known_targets"] = ranked["targets"].apply(
lambda t: "; ".join(t) if t is not None and len(t) else "")
ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})
# secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z)
ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)
out = ranked.rename_axis("drug_name").reset_index()[[
"rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score",
"inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
]]
path = persist_ranking(out)
print(f"wrote {path} ({len(out)} drugs)")
print("\n--- sanity peek (spec_z ranking) ---")
for gt in ["hydroxyurea", "glutamine"]:
r = ranked.loc[gt]
print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), "
f"spec_z={r['spec_z']:.2f}")
print(" top 10 by spec_z:")
for name, r in ranked.nsmallest(10, "spec_z").iterrows():
print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] "
f"{str(r['known_targets'])[:38]}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,81 @@
"""Week 4: formal recovery test against the pre-registered criteria (PLAN §6).
Pre-registered criteria (committed in docs/recovery_test_report.md before this run):
- hydroxyurea in top 10% (top 30 of 300), AND
- L-glutamine in top 25% (top 75) OR documented unscorable due to missing LINCS signature, AND
- >=4 of 5 pre-specified negative controls in the bottom half.
The 5 negative controls are pre-specified here by a category rule (one per category, alphabetically
first available) so the choice does not peek at ranks. Primary ranking = raw connectivity.
"""
from __future__ import annotations
from pathlib import Path
import pandas as pd
RANKED = Path("data/results/ranked_candidates_v1.csv")
# One per unrelated category, alphabetical-first — chosen without looking at ranks.
NEG_CONTROL_CATEGORIES = {
"antifungal": ["clotrimazole", "fluconazole", "itraconazole", "ketoconazole", "miconazole", "terbinafine"],
"antihistamine": ["astemizole", "cetirizine", "diphenhydramine", "fexofenadine", "loratadine"],
"antibiotic": ["azithromycin", "ciprofloxacin", "doxycycline", "tetracycline", "trimethoprim"],
"hormone": ["ethinyl-estradiol", "levonorgestrel", "medroxyprogesterone-acetate", "norethindrone"],
"misc": ["caffeine", "lidocaine", "loperamide", "omeprazole", "ranitidine"],
}
def main() -> None:
df = pd.read_csv(RANKED).set_index("drug_name")
n = len(df)
top10_cut, top25_cut, half = int(n * 0.10), int(n * 0.25), n // 2
def rk(name):
return int(df.loc[name, "rank"]) if name in df.index else None
hu, glut = rk("hydroxyurea"), rk("glutamine")
glut_z = df.loc["glutamine", "spec_z"]
# pick negative controls present in the ranking
negs = {}
for cat, options in NEG_CONTROL_CATEGORIES.items():
pick = next((d for d in options if d in df.index), None)
if pick:
negs[pick] = (cat, rk(pick))
print("=" * 60)
print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}")
print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), spec_z={glut_z:+.2f} "
f"-> top-25%? {glut <= top25_cut} (positive z => does not reverse; has a signature)")
print("\nnegative controls (pre-specified, 1 per category):")
n_bottom = 0
for d, (cat, r) in negs.items():
in_bottom = r > half
n_bottom += in_bottom
print(f" {d:18s} [{cat:13s}] rank {r:3d} bottom-half? {in_bottom}")
print(f" -> {n_bottom}/5 in bottom half (need >=4)")
crit_hu = hu <= top10_cut
crit_glut = glut <= top25_cut
crit_neg = n_bottom >= 4
overall = crit_hu and crit_glut and crit_neg
print(f"\nCRITERIA: hydroxyurea={crit_hu}, L-glutamine={crit_glut}, neg-controls={crit_neg}")
print(f"OVERALL (raw ranking): {'PASS' if overall else 'FAIL'}")
# secondary prior-weighted view (reported, not the primary criterion)
hu_b = int(df.loc["hydroxyurea", "blended_rank"])
print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
f"(top {100*hu_b/n:.1f}%)")
print("\n--- TOP 10 (primary spec_z ranking) ---")
top10 = df.sort_values("rank").head(10)
for name, r in top10.iterrows():
print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:+.2f} "
f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}")
if __name__ == "__main__":
main()

89
src/binding.py Normal file
View File

@@ -0,0 +1,89 @@
"""Structure-based binding track (PLAN §12).
Two capabilities:
- ligand-based retrieval (RDKit, works now): find existing drugs near a query molecule in
chemical space — validated, and the engine behind §12.9 generative-guided retrieval.
- structure-based docking (§12.3): score whether a ligand binds a target pocket. Blocked on an
ARM-Mac docking toolchain (AutoDock Vina does not pip-install); see ``dock`` for options.
Caveat carried throughout: chemical similarity != activity, and docking != efficacy (§12.6).
"""
from __future__ import annotations
from pathlib import Path
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator
RDLogger.DisableLog("rdApp.*")
_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
STRUCT_DIR = Path("data/raw/structures")
# Known sickle small-molecule binders, by target (positive controls for the §12.4 recovery test).
KNOWN_BINDERS = {
"hemoglobin": "voxelotor",
"PKR": "mitapivat",
"DNMT1": "decitabine",
"HDAC": "vorinostat",
}
# Curated target structures (PLAN §12.2). Add PDB ids as the harness grows.
TARGET_PDB = {
"hemoglobin": "5E83", # hemoglobin + voxelotor (GBT440)
"PKR": "8XFD", # pyruvate kinase R + mitapivat
}
def morgan_fp(smiles: str):
"""Morgan (ECFP4) fingerprint, or None for invalid / missing SMILES ('-666', '')."""
if not isinstance(smiles, str) or smiles in ("-666", ""):
return None
mol = Chem.MolFromSmiles(smiles)
return _MORGAN.GetFingerprint(mol) if mol else None
def tanimoto(smiles_a: str, smiles_b: str) -> float | None:
fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
if fa is None or fb is None:
return None
return DataStructs.TanimotoSimilarity(fa, fb)
def retrieve_nearest(
query_smiles: str,
library: dict[str, str],
top_n: int = 5,
) -> list[tuple[float, str]]:
"""Rank a library of {name: smiles} by Tanimoto similarity to a query molecule.
This is the §12.9 retrieval step: the query may be a known binder (positive-control beacon)
or a generated idealised binder; the returned existing drugs are repurposing candidates that
STILL require docking/validation (similarity != activity).
"""
qfp = morgan_fp(query_smiles)
if qfp is None:
raise ValueError("invalid query SMILES")
sims = []
for name, smi in library.items():
fp = morgan_fp(smi)
if fp is not None:
sims.append((DataStructs.TanimotoSimilarity(qfp, fp), name))
return sorted(sims, reverse=True)[:top_n]
def dock(target: str, ligand_smiles: str) -> float:
"""Dock a ligand into a target pocket and return a binding score (PLAN §12.3).
Blocked: AutoDock Vina does not pip-install on ARM Mac and no docking binary is on PATH.
Resolve the toolchain first (one of):
- conda/micromamba: ``vina`` (conda-forge) or ``smina`` (bioconda), osx-arm64 builds
- an AF3-class co-folding model on GPU: Boltz-2 / Chai-1 / DiffDock (also predicts affinity)
- x86 Vina binary under Rosetta 2
Then: fetch TARGET_PDB[target], define the pocket box, prep the ligand (Meeko), score.
"""
raise NotImplementedError(
"Docking toolchain unresolved on ARM Mac (PLAN §12.6 pitfall 4 / §12.8). "
"See docstring for options."
)

View File

@@ -1,19 +1,27 @@
"""Disease signature construction.
Week 1 (PLAN.md §6). Builds a Tier-A sickle cell signature from GEO expression data via
differential expression, then persists it with full provenance to
``data/processed/sickle_cell_signature_v1.json``.
differential expression on TWO studies, keeping only genes that are concordant across both
(the multi-source evidence that earns Tier A — see project memory). Persists with full
provenance to ``data/processed/sickle_cell_signature_v1.json``.
This module defines the persisted schema (pydantic) and the construction stubs. The actual
data pull + differential expression is driven from ``notebooks/02_disease_signature.ipynb``.
Pipeline (driven from ``notebooks/02_disease_signature.ipynb``):
per study: compute_differential_expression -> collapse_probes_to_symbols
across: concordance_filter -> build_signature -> persist_signature
Conventions:
- log_fc > 0 => up-regulated in DISEASE vs HEALTHY (Week 1 refinement R1)
- cross-study join key is the HGNC gene symbol (R2)
"""
from __future__ import annotations
from pathlib import Path
import numpy as np
import pandas as pd
from pydantic import BaseModel, Field
from scipy.stats import false_discovery_control, ttest_ind
from . import PIPELINE_VERSION, PROCESSED_DIR
from .provenance import ConfidenceTier
@@ -22,6 +30,10 @@ from .provenance import ConfidenceTier
TOP_N_PER_DIRECTION = 250
QVALUE_CUTOFF = 0.05
# Group labels used throughout. The disease group is the "treatment" arm in DE contrasts.
DISEASE_LABEL = "disease"
HEALTHY_LABEL = "healthy"
class GeneEntry(BaseModel):
"""A single differentially expressed gene in the signature."""
@@ -33,15 +45,37 @@ class GeneEntry(BaseModel):
qvalue: float
class SignatureProvenance(BaseModel):
"""Provenance block for a disease signature (PLAN.md §6 schema)."""
class StudyProvenance(BaseModel):
"""Provenance for one contributing GEO study (Week 1 refinement R6)."""
geo_accession: str
n_disease: int
n_healthy: int
platform: str
method: str = Field(..., description="Differential expression method, e.g. 'limma', 'deseq2'.")
tissue: str
method: str = Field(..., description="DE method: 'welch' (microarray) or 'deseq2' (RNA-seq).")
class ConcordanceSummary(BaseModel):
"""How the two studies were reconciled into the signature (R6)."""
rule: str = Field(
default="q<0.05 in both studies AND same sign of log_fc in both",
description="The concordance rule applied across studies.",
)
n_genes_tested: int = Field(..., description="Genes present in both studies after symbol collapse.")
n_concordant: int
n_up: int
n_down: int
class SignatureProvenance(BaseModel):
"""Provenance block for a disease signature (revised for 2-study concordance, R6)."""
studies: list[StudyProvenance]
concordance: ConcordanceSummary
created_date: str
pipeline_version: str = PIPELINE_VERSION
class DiseaseSignature(BaseModel):
@@ -58,44 +92,213 @@ class DiseaseSignature(BaseModel):
limitations: list[str]
def _looks_like_linear_intensity(expression: pd.DataFrame) -> bool:
"""Heuristic: microarray data still on a linear scale has a large dynamic range.
log2-transformed intensities are typically <~20; raw intensities run into the thousands.
"""
return float(np.nanmax(expression.to_numpy())) > 50.0
def _welch_de(expression: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
"""Per-gene Welch t-test + Benjamini-Hochberg, the limma-equivalent for microarray.
Assumes ``expression`` is genes (rows) x samples (columns), on (or coercible to) a log2
scale. ``log_fc`` is mean(disease) - mean(healthy) (R1 sign convention).
"""
expr = expression.copy()
if _looks_like_linear_intensity(expr):
# log2(x+1) guards against zeros/negatives while keeping the transform standard.
expr = np.log2(expr.clip(lower=0) + 1.0)
disease_cols = groups.index[groups == DISEASE_LABEL]
healthy_cols = groups.index[groups == HEALTHY_LABEL]
disease = expr[disease_cols].to_numpy()
healthy = expr[healthy_cols].to_numpy()
log_fc = disease.mean(axis=1) - healthy.mean(axis=1)
# Welch's t-test (unequal variance) per gene across samples.
_, pvalue = ttest_ind(disease, healthy, axis=1, equal_var=False, nan_policy="omit")
out = pd.DataFrame({"log_fc": log_fc, "pvalue": pvalue}, index=expr.index)
out = out.dropna(subset=["pvalue"])
out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
return out.sort_values("qvalue")
def _deseq2_de(counts: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
"""RNA-seq differential expression via pydeseq2.
``counts`` is genes (rows) x samples (columns) of raw integer counts. Returns the same
schema as ``_welch_de`` (log_fc = log2FC of disease vs healthy, plus pvalue/qvalue).
"""
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# pydeseq2 wants samples (rows) x genes (cols), with an aligned metadata frame.
counts_t = counts.T.round().astype(int)
metadata = pd.DataFrame({"condition": groups.reindex(counts_t.index).to_numpy()}, index=counts_t.index)
dds = DeseqDataSet(counts=counts_t, metadata=metadata, design="~condition", quiet=True)
dds.deseq2()
stats = DeseqStats(dds, contrast=["condition", DISEASE_LABEL, HEALTHY_LABEL], quiet=True)
stats.summary()
res = stats.results_df.rename(columns={"log2FoldChange": "log_fc", "pvalue": "pvalue", "padj": "qvalue"})
return res[["log_fc", "pvalue", "qvalue"]].dropna(subset=["qvalue"]).sort_values("qvalue")
def compute_differential_expression(
expression: pd.DataFrame,
sample_groups: pd.Series,
*,
method: str,
) -> pd.DataFrame:
"""Compute gene-level log fold change and adjusted p-values.
For RNA-seq use ``pydeseq2``; for microarray log2-transform/normalize and use a
limma-equivalent (PLAN.md §6, Week 1 task 4).
"""Compute gene-level log fold change and adjusted p-values for one study.
Args:
expression: Genes (rows) x samples (columns) expression matrix.
sample_groups: Per-sample group label ('disease' / 'healthy'), indexed by sample.
method: 'deseq2' (RNA-seq) or 'limma' (microarray).
expression: Genes (rows) x samples (columns). Microarray intensities or RNA-seq counts.
sample_groups: Per-sample 'disease'/'healthy' label, indexed by sample id (column).
method: 'welch' (microarray, PLAN choice) or 'deseq2' (RNA-seq).
Returns:
A table indexed by gene with at least ``log_fc`` and ``qvalue`` columns.
Table indexed by gene/probe with ``log_fc``, ``pvalue``, ``qvalue`` (R1 sign convention).
"""
raise NotImplementedError("Differential expression: implement in Week 1 (notebook 02).")
groups = sample_groups.reindex(expression.columns)
if groups.isna().any():
missing = list(groups.index[groups.isna()])
raise ValueError(f"sample_groups missing labels for samples: {missing[:5]}...")
if method == "welch":
return _welch_de(expression, groups)
if method == "deseq2":
return _deseq2_de(expression, groups)
raise ValueError(f"Unknown method {method!r}; expected 'welch' or 'deseq2'.")
def collapse_probes_to_symbols(
de_table: pd.DataFrame,
probe_to_symbol: pd.Series,
expression_for_ranking: pd.DataFrame | None = None,
) -> pd.DataFrame:
"""Collapse a probe-level DE table to one row per HGNC symbol (R2).
When multiple probes map to the same symbol, keep the probe with the highest mean
expression (the standard collapseRows heuristic) if ``expression_for_ranking`` is given;
otherwise keep the probe with the smallest qvalue.
Args:
de_table: DE table indexed by probe id (from ``compute_differential_expression``).
probe_to_symbol: Probe id -> HGNC symbol mapping.
expression_for_ranking: Optional genes x samples matrix to rank duplicate probes.
Returns:
DE table indexed by HGNC symbol.
"""
df = de_table.copy()
df["symbol"] = probe_to_symbol.reindex(df.index)
df = df.dropna(subset=["symbol"])
if expression_for_ranking is not None:
rank_key = expression_for_ranking.reindex(df.index).mean(axis=1)
df = df.assign(_rank=rank_key).sort_values("_rank", ascending=False)
else:
df = df.sort_values("qvalue", ascending=True)
collapsed = df[~df["symbol"].duplicated(keep="first")].set_index("symbol")
return collapsed.drop(columns=[c for c in ("_rank",) if c in collapsed.columns])
def concordance_filter(
de_a: pd.DataFrame,
de_b: pd.DataFrame,
*,
qvalue_cutoff: float = QVALUE_CUTOFF,
) -> tuple[pd.DataFrame, ConcordanceSummary]:
"""Keep only genes concordant across two symbol-level DE tables (R3).
A gene qualifies iff q<``qvalue_cutoff`` in BOTH studies AND log_fc has the same sign in
both. Reported ``log_fc`` = mean of the two; ``qvalue`` = max of the two (worst case).
Ranked by ``max(q_a, q_b)`` ascending.
Returns:
(concordant_table indexed by symbol with log_fc/qvalue, ConcordanceSummary).
"""
merged = de_a.join(de_b, lsuffix="_a", rsuffix="_b", how="inner")
n_tested = len(merged)
sig_both = (merged["qvalue_a"] < qvalue_cutoff) & (merged["qvalue_b"] < qvalue_cutoff)
same_sign = np.sign(merged["log_fc_a"]) == np.sign(merged["log_fc_b"])
keep = merged[sig_both & same_sign].copy()
keep["log_fc"] = (keep["log_fc_a"] + keep["log_fc_b"]) / 2.0
keep["qvalue"] = keep[["qvalue_a", "qvalue_b"]].max(axis=1)
keep = keep[["log_fc", "qvalue"]].sort_values("qvalue")
summary = ConcordanceSummary(
n_genes_tested=n_tested,
n_concordant=len(keep),
n_up=int((keep["log_fc"] > 0).sum()),
n_down=int((keep["log_fc"] < 0).sum()),
)
return keep, summary
def build_signature(
de_table: pd.DataFrame,
concordant_table: pd.DataFrame,
provenance: SignatureProvenance,
*,
tier: ConfidenceTier,
tier_rationale: str,
limitations: list[str],
id_map: pd.DataFrame | None = None,
top_n: int = TOP_N_PER_DIRECTION,
qvalue_cutoff: float = QVALUE_CUTOFF,
) -> DiseaseSignature:
"""Assemble a ``DiseaseSignature`` from a differential expression table.
"""Assemble a ``DiseaseSignature`` from the concordant gene table.
Takes the top ``top_n`` up- and down-regulated genes (by qvalue, cut at
``qvalue_cutoff``) per PLAN.md §6, Week 1 task 5.
Takes the top ``top_n`` up- and down-regulated genes by qvalue per direction (R3). If
fewer than ``top_n`` qualify in a direction, takes all available (the caller should have
logged the count). ``id_map`` optionally provides 'entrez_id'/'ensembl_id' columns indexed
by symbol (from ``mygene``).
Args:
concordant_table: Output of ``concordance_filter`` (indexed by symbol).
provenance: Fully populated ``SignatureProvenance`` (both studies + concordance).
tier: Confidence tier (Tier A only if genuinely multi-source).
tier_rationale: One-line justification of the tier.
limitations: Honest limitations list (R8).
id_map: Optional symbol -> {entrez_id, ensembl_id} table.
top_n: Max genes per direction.
Returns:
A populated ``DiseaseSignature``.
"""
raise NotImplementedError("Signature assembly: implement in Week 1 (notebook 02).")
def _entries(rows: pd.DataFrame) -> list[GeneEntry]:
entries: list[GeneEntry] = []
for symbol, row in rows.iterrows():
ids = id_map.loc[symbol] if (id_map is not None and symbol in id_map.index) else None
entries.append(
GeneEntry(
gene=str(symbol),
entrez_id=(str(ids["entrez_id"]) if ids is not None and pd.notna(ids.get("entrez_id")) else None),
ensembl_id=(str(ids["ensembl_id"]) if ids is not None and pd.notna(ids.get("ensembl_id")) else None),
log_fc=float(row["log_fc"]),
qvalue=float(row["qvalue"]),
)
)
return entries
up = concordant_table[concordant_table["log_fc"] > 0].sort_values("qvalue").head(top_n)
down = concordant_table[concordant_table["log_fc"] < 0].sort_values("qvalue").head(top_n)
return DiseaseSignature(
up_regulated=_entries(up),
down_regulated=_entries(down),
provenance=provenance,
confidence_tier=tier,
tier_rationale=tier_rationale,
limitations=limitations,
)
def persist_signature(signature: DiseaseSignature, out_path: Path | None = None) -> Path:

View File

@@ -1,33 +1,65 @@
"""CMap-style connectivity scoring — the matching engine.
"""CMap-style connectivity scoring — the matching engine (Week 3, PLAN §6).
Week 3 (PLAN.md §6). Scores each drug's LINCS signature against the disease signature using
weighted Kolmogorov-Smirnov enrichment (Lamb 2006 / Subramanian 2017). Strongly *negative*
connectivity = strong reversal of the disease signature = candidate match.
Scores each drug's LINCS consensus signature against the disease signature using the weighted
Kolmogorov-Smirnov / GSEA enrichment statistic (Lamb 2006; Subramanian 2017). The query is the
disease up/down gene sets; the reference is each drug's 978 landmark genes ranked by z-score.
Uses ``cmapPy`` as the reference implementation. ``tests/test_scoring.py`` verifies the
implementation against a known reference.
Sign convention (PLAN §6): strongly **negative** connectivity = strong **reversal** of the
disease signature = candidate match. A drug that down-regulates the disease's up-genes and
up-regulates its down-genes scores negative.
"""
from __future__ import annotations
from pathlib import Path
import numpy as np
import pandas as pd
from pydantic import BaseModel
from . import RESULTS_DIR
from . import RANDOM_SEED, RESULTS_DIR
# Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
# Keys are pathway categories; values are substrings matched (case-insensitive) against a
# drug's ChEMBL target names.
SICKLE_PATHWAYS: dict[str, tuple[str, ...]] = {
"hbf_epigenetic": ("histone deacetylase", "hdac", "methyltransferase", "dnmt",
"ribonucleoside-diphosphate reductase", "ribonucleotide reductase"),
"hemoglobin": ("hemoglobin", "globin"),
"no_signaling": ("nitric oxide", "guanylate cyclase", "phosphodiesterase 5", "pde5"),
"inflammation": ("cyclooxygenase", "prostaglandin", "nf-kappa", "interleukin",
"leukotriene", "selectin", "tumor necrosis factor"),
"oxidative_stress": ("glutathione", "superoxide", "nadph oxidase", "thioredoxin", "nrf2"),
}
class ConnectivityResult(BaseModel):
"""Connectivity score for a single drug against the disease signature."""
def _enrichment_score(drug_profile: pd.Series, gene_set: set[str], weight: float = 1.0) -> float:
"""Weighted GSEA/KS enrichment score of ``gene_set`` in a drug's ranked profile.
chembl_id: str
drug_name: str
connectivity_score: float | None # None when the drug has no LINCS signature.
normalized_score: float | None = None
p_value: float | None = None
scored: bool # False => no signature available, not scored (do not skip silently).
n_genes_overlap: int | None = None
The profile is ranked by z-score (most up-regulated first). Hits increment a running sum in
proportion to ``|z|**weight``; misses decrement uniformly. ES is the max signed deviation
from zero. ES>0 => set enriched among up-regulated genes; ES<0 => among down-regulated.
"""
s = drug_profile.sort_values(ascending=False)
genes = s.index.to_numpy()
vals = s.to_numpy(dtype=float)
hit = np.fromiter((g in gene_set for g in genes), dtype=bool, count=len(genes))
n_hit = int(hit.sum())
n = len(genes)
if n_hit == 0 or n_hit == n:
return 0.0
w = (np.abs(vals) ** weight) * hit
sum_hit = w.sum()
if sum_hit == 0:
return 0.0
inc = w / sum_hit
dec = (~hit) / (n - n_hit)
running = np.cumsum(inc - dec)
hi, lo = running.max(), running.min()
return float(hi if abs(hi) >= abs(lo) else lo)
def connectivity_score(
@@ -35,46 +67,157 @@ def connectivity_score(
down_genes: list[str],
drug_signature: pd.Series,
) -> float:
"""Weighted KS connectivity score for one drug vs the disease up/down gene sets.
"""Weighted connectivity score (WTCS) for one drug vs the disease up/down sets.
Only the intersection of disease-signature genes and LINCS landmark genes is scored;
callers must record the overlap count (PLAN.md §6, Week 3 task 2).
Args:
up_genes: Disease up-regulated gene identifiers.
down_genes: Disease down-regulated gene identifiers.
drug_signature: Drug's expression vector indexed by gene identifier.
Returns:
Connectivity score in roughly [-1, 1]; strongly negative = strong reversal.
Only query genes present in the drug's profile index (the 978 landmarks) are used — callers
should record the overlap count (PLAN §6 Week 3 task 2). Returns the WTCS: if the two
enrichment scores share a sign the result is 0 (ambiguous), else ``(ES_up - ES_down)/2``.
Negative => reversal => candidate.
"""
raise NotImplementedError("Connectivity scoring: implement in Week 3 (notebook 04).")
profile_genes = set(drug_signature.index)
up = set(up_genes) & profile_genes
down = set(down_genes) & profile_genes
es_up = _enrichment_score(drug_signature, up)
es_down = _enrichment_score(drug_signature, down)
if np.sign(es_up) == np.sign(es_down):
return 0.0
return (es_up - es_down) / 2.0
def normalize_scores(scores: pd.Series) -> pd.Series:
"""Signed normalization (NCS, Subramanian 2017): divide by the mean magnitude of same-sign
scores, so positive and negative tails are separately scaled to a mean magnitude of 1."""
out = scores.astype(float).copy()
pos_mean = scores[scores > 0].mean()
neg_mean = scores[scores < 0].abs().mean()
if pos_mean and not np.isnan(pos_mean):
out[scores > 0] = scores[scores > 0] / pos_mean
if neg_mean and not np.isnan(neg_mean):
out[scores < 0] = scores[scores < 0] / neg_mean
return out
def rank_drugs(
signature_up: list[str],
signature_down: list[str],
drug_profiles: pd.DataFrame,
up_genes: list[str],
down_genes: list[str],
signature_matrix: pd.DataFrame,
) -> pd.DataFrame:
"""Score and rank all drugs against the disease signature.
"""Score and rank all drugs (rows of ``signature_matrix``: drug x landmark-gene z-scores).
Drugs without a LINCS signature are marked ``scored=False`` and excluded from the ranking
rather than dropped silently (PLAN.md §6, Week 3 task 2).
Returns a ranked table with the columns described in PLAN.md §6 (rank, drug_name,
chembl_id, connectivity_score, normalized_score, p_value, inclusion_reason,
known_targets, mechanism_summary).
Returns a table indexed by drug with ``rank`` (1 = strongest reversal = most negative),
``connectivity_score`` and ``normalized_score``. Drugs are expected to all have signatures
here; signature-less drugs are handled (marked not-scored) by the orchestration layer per
PLAN §6 Week 3 task 2.
"""
raise NotImplementedError("Drug ranking: implement in Week 3 (notebook 04).")
scores = pd.Series(
{drug: connectivity_score(up_genes, down_genes, signature_matrix.loc[drug])
for drug in signature_matrix.index},
name="connectivity_score",
)
df = pd.DataFrame({"connectivity_score": scores, "normalized_score": normalize_scores(scores)})
df = df.sort_values("connectivity_score") # most negative (reversal) first
df.insert(0, "rank", range(1, len(df) + 1))
return df
def mechanistic_prior(targets: list[str]) -> float:
"""Prior weight for a drug based on sickle-cell-relevant target pathways.
"""Count of sickle-cell-relevant pathway categories a drug's targets hit (PLAN §6 task 3).
Pathways of interest: HbF regulation, hemoglobin, NO signaling, inflammation, oxidative
stress (PLAN.md §6, Week 3 task 3). Used to build the secondary, prior-weighted ranking.
Higher = more mechanistically plausible. Used to build the secondary, prior-weighted ranking
alongside the raw connectivity ranking.
"""
raise NotImplementedError("Mechanistic prior: implement in Week 3 (notebook 04).")
if not targets:
return 0.0
text = " ; ".join(str(t) for t in targets).lower()
return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))
# --- KS connectivity + tau calibration (v1.1) -----------------------------------------------
# Unweighted Kolmogorov-Smirnov connectivity (Lamb 2006) is O(k) per query (depends only on the
# ranks of the query genes), which makes a permutation null over many random queries cheap. tau
# expresses each drug's real connectivity as a signed percentile within its own null — so
# broad-effect drugs that connect to *random* signatures too get down-weighted (specificity).
def _ks_es(rank_matrix: np.ndarray, query_cols: np.ndarray, n_genes: int) -> np.ndarray:
"""Vectorized unweighted KS enrichment score of a query gene set for every drug.
``rank_matrix`` is (n_drugs, n_genes) of 1..N rank positions (1 = most up-regulated).
Returns one ES per drug; ES>0 => query enriched among up-regulated genes.
"""
k = len(query_cols)
if k == 0:
return np.zeros(rank_matrix.shape[0])
p = np.sort(rank_matrix[:, query_cols], axis=1) # (n_drugs, k), positions ascending
j = np.arange(1, k + 1)
a = (j / k - p / n_genes).max(axis=1)
b = (p / n_genes - (j - 1) / k).max(axis=1)
return np.where(a >= b, a, -b)
def _ks_connectivity(rank_matrix: np.ndarray, up_cols: np.ndarray, down_cols: np.ndarray,
n_genes: int) -> np.ndarray:
"""KS connectivity per drug: (ES_up - ES_down)/2. Negative=reversal.
Note: unlike WTCS, this does NOT zero same-sign (ambiguous) connections — same-sign ES
partially cancel and land near 0 naturally. Hard-zeroing would collapse the random-query
null to a spike at 0 and make tau saturate, so the continuous form is required for calibration.
"""
es_up = _ks_es(rank_matrix, up_cols, n_genes)
es_down = _ks_es(rank_matrix, down_cols, n_genes)
return (es_up - es_down) / 2.0
def tau_calibrate(
up_genes: list[str],
down_genes: list[str],
signature_matrix: pd.DataFrame,
n_null: int = 1000,
seed: int = RANDOM_SEED,
) -> pd.DataFrame:
"""Rank drugs by tau: each drug's KS connectivity as a signed percentile vs a null of
random queries of the same size (PLAN §6; CMap tau, Subramanian 2017).
tau in [-100, 100]: -100 => reverses our signature more specifically than any random query
(strong, specific candidate); ~0 => connects to our signature no more than to random ones
(broad-effect / non-specific). Ranked by tau ascending (rank 1 = most specific reversal).
"""
genes = list(signature_matrix.columns)
gene_to_col = {g: i for i, g in enumerate(genes)}
n = len(genes)
rank_matrix = signature_matrix.rank(axis=1, ascending=False).to_numpy()
up_cols = np.array([gene_to_col[g] for g in set(up_genes) if g in gene_to_col], dtype=int)
down_cols = np.array([gene_to_col[g] for g in set(down_genes) if g in gene_to_col], dtype=int)
real = _ks_connectivity(rank_matrix, up_cols, down_cols, n)
rng = np.random.default_rng(seed)
k_up, k_down = len(up_cols), len(down_cols)
null = np.empty((rank_matrix.shape[0], n_null))
for m in range(n_null):
samp = rng.choice(n, size=k_up + k_down, replace=False)
null[:, m] = _ks_connectivity(rank_matrix, samp[:k_up], samp[k_up:], n)
null_mean = null.mean(axis=1)
null_std = null.std(axis=1)
null_std[null_std == 0] = np.nan
# Per-drug standardized connectivity: how many SDs the real reversal is below what random
# queries of the same size produce against THIS drug. Continuous (no percentile floor), so it
# discriminates within the strong-reversal tail. Negative = specific reversal.
spec_z = (real - null_mean) / null_std
frac = (null <= real[:, None]).mean(axis=1)
tau = 100.0 * (2.0 * frac - 1.0) # retained for reference; saturates at +/-100 in the tail
df = pd.DataFrame(
{"connectivity_ks": real, "null_mean": null_mean, "spec_z": spec_z, "tau": tau},
index=signature_matrix.index,
)
df = df.sort_values("spec_z") # most negative z = most specific reversal
df.insert(0, "rank", range(1, len(df) + 1))
return df
def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:

122
tests/test_disease.py Normal file
View File

@@ -0,0 +1,122 @@
"""Tests for the Week 1 signature-construction logic on synthetic data.
These verify the DE / probe-collapse / concordance math without touching the network, so the
pipeline is trustworthy before it is pointed at real GEO studies.
"""
from __future__ import annotations
import numpy as np
import pandas as pd
import pytest
from src.disease import (
DISEASE_LABEL,
HEALTHY_LABEL,
ConcordanceSummary,
SignatureProvenance,
StudyProvenance,
build_signature,
collapse_probes_to_symbols,
compute_differential_expression,
concordance_filter,
)
from src.provenance import ConfidenceTier
def _synthetic_study(seed: int, n_per_group: int = 12) -> tuple[pd.DataFrame, pd.Series]:
"""Genes x samples matrix where UP is higher and DOWN is lower in disease."""
rng = np.random.default_rng(seed)
genes = ["UP1", "UP2", "DOWN1", "DOWN2", "NOISE1", "NOISE2"]
samples = [f"d{i}" for i in range(n_per_group)] + [f"h{i}" for i in range(n_per_group)]
groups = pd.Series([DISEASE_LABEL] * n_per_group + [HEALTHY_LABEL] * n_per_group, index=samples)
base = rng.normal(8.0, 0.3, size=(len(genes), len(samples)))
df = pd.DataFrame(base, index=genes, columns=samples)
disease = groups == DISEASE_LABEL
df.loc["UP1", disease] += 3.0
df.loc["UP2", disease] += 2.0
df.loc["DOWN1", disease] -= 3.0
df.loc["DOWN2", disease] -= 2.0
return df, groups
def test_welch_de_recovers_direction_and_significance():
expr, groups = _synthetic_study(seed=1)
de = compute_differential_expression(expr, groups, method="welch")
assert de.loc["UP1", "log_fc"] > 0 and de.loc["UP1", "qvalue"] < 0.05
assert de.loc["DOWN1", "log_fc"] < 0 and de.loc["DOWN1", "qvalue"] < 0.05
# Pure-noise genes should not be significant.
assert de.loc["NOISE1", "qvalue"] > 0.05
def test_compute_de_rejects_unlabelled_samples():
expr, groups = _synthetic_study(seed=2)
with pytest.raises(ValueError):
compute_differential_expression(expr, groups.iloc[:-3], method="welch")
def test_collapse_probes_keeps_highest_mean_expression():
de = pd.DataFrame(
{"log_fc": [1.0, 2.0, -1.0], "pvalue": [0.1, 0.2, 0.3], "qvalue": [0.1, 0.2, 0.3]},
index=["probeA1", "probeA2", "probeB1"],
)
probe_to_symbol = pd.Series({"probeA1": "GENEA", "probeA2": "GENEA", "probeB1": "GENEB"})
expr = pd.DataFrame(
{"s1": [1.0, 100.0, 5.0], "s2": [1.0, 100.0, 5.0]},
index=["probeA1", "probeA2", "probeB1"],
)
collapsed = collapse_probes_to_symbols(de, probe_to_symbol, expression_for_ranking=expr)
assert set(collapsed.index) == {"GENEA", "GENEB"}
# probeA2 has the higher mean expression, so its log_fc (2.0) should win.
assert collapsed.loc["GENEA", "log_fc"] == 2.0
def test_concordance_filter_keeps_only_agreeing_genes():
de_a = pd.DataFrame(
{"log_fc": [2.0, -2.0, 1.5, 0.1], "qvalue": [0.001, 0.001, 0.2, 0.001]},
index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
)
de_b = pd.DataFrame(
{"log_fc": [1.8, -2.2, 1.4, -0.1], "qvalue": [0.002, 0.002, 0.2, 0.002]},
index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
)
keep, summary = concordance_filter(de_a, de_b)
assert set(keep.index) == {"UP1", "DOWN1"} # WEAK fails q-cut; DISAGREE flips sign
assert keep.loc["UP1", "log_fc"] == pytest.approx(1.9) # mean of the two
assert keep.loc["UP1", "qvalue"] == pytest.approx(0.002) # max of the two
assert isinstance(summary, ConcordanceSummary)
assert summary.n_genes_tested == 4 and summary.n_concordant == 2
assert summary.n_up == 1 and summary.n_down == 1
def test_build_signature_splits_directions_and_respects_top_n():
concordant = pd.DataFrame(
{
"log_fc": [3.0, 2.0, 1.0, -1.0, -2.0],
"qvalue": [0.001, 0.002, 0.003, 0.004, 0.005],
},
index=["UP1", "UP2", "UP3", "DOWN1", "DOWN2"],
)
prov = SignatureProvenance(
studies=[
StudyProvenance(geo_accession="GSE1", n_disease=12, n_healthy=12,
platform="P", tissue="whole blood", method="welch"),
StudyProvenance(geo_accession="GSE2", n_disease=15, n_healthy=11,
platform="P", tissue="whole blood", method="welch"),
],
concordance=ConcordanceSummary(n_genes_tested=100, n_concordant=5, n_up=3, n_down=2),
created_date="2026-06-23",
)
sig = build_signature(
concordant, prov, tier=ConfidenceTier.A,
tier_rationale="Two-study concordance", limitations=["cell-composition confound"],
top_n=2,
)
assert [g.gene for g in sig.up_regulated] == ["UP1", "UP2"] # top 2 up by qvalue
assert [g.gene for g in sig.down_regulated] == ["DOWN1", "DOWN2"]
assert sig.confidence_tier == ConfidenceTier.A

View File

@@ -1,14 +1,13 @@
"""Tests for the matching engine and provenance logic.
The headline test (PLAN.md §6, Week 3 task 4) verifies connectivity scoring against a known
reference within tolerance; it is marked xfail until the scorer is implemented in Week 3.
The tier-assignment tests run today — they pin the rules from PLAN.md §3 so the most
Connectivity tests (PLAN.md §6, Week 3 task 4) pin the weighted-KS scorer against hand-built
reference profiles. The tier-assignment tests pin the rules from PLAN.md §3 so the most
commercially important design decision can't silently drift.
"""
from __future__ import annotations
import pandas as pd
import pytest
from src.provenance import ConfidenceTier, assign_tier
@@ -52,14 +51,103 @@ class TestAssignTier:
assert assign_tier(**kwargs) == ConfidenceTier.B
@pytest.mark.xfail(reason="Connectivity scoring implemented in Week 3 (notebook 04).", strict=True)
def test_connectivity_score_matches_reference():
"""Verify connectivity scoring against a CMap/cmapPy reference within tolerance.
class TestConnectivityScore:
"""Reference checks for the weighted-KS connectivity score (PLAN §6 Week 3 task 4).
PLAN.md §6, Week 3 task 4. Replace this body with a known reference example
(disease up/down sets + drug signature -> expected score) once the scorer exists.
Query: up = {U1, U2}, down = {D1, D2}. We build drug profiles with a known relationship to
the query and assert the sign/ordering the CMap convention requires.
"""
from src.scoring import connectivity_score
score = connectivity_score(up_genes=[], down_genes=[], drug_signature=None) # noqa
assert score == pytest.approx(0.0, abs=1e-6)
UP = ["U1", "U2"]
DOWN = ["D1", "D2"]
@staticmethod
def _profile(values: dict[str, float]) -> pd.Series:
# 20 filler genes at ~0 so the query genes sit clearly at the extremes.
base = {f"N{i}": 0.01 * ((i % 5) - 2) for i in range(20)}
base.update(values)
return pd.Series(base)
def test_perfect_reversal_is_strongly_negative(self):
from src.scoring import connectivity_score
# Drug pushes disease-up genes DOWN (very negative) and disease-down genes UP (very
# positive) => reversal => negative connectivity.
prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
assert connectivity_score(self.UP, self.DOWN, prof) < -0.4
def test_perfect_mimic_is_strongly_positive(self):
from src.scoring import connectivity_score
prof = self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7})
assert connectivity_score(self.UP, self.DOWN, prof) > 0.4
def test_reversal_beats_mimic_and_null(self):
from src.scoring import connectivity_score
rev = connectivity_score(self.UP, self.DOWN, self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7}))
mimic = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7}))
null = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 0.2, "U2": -0.1, "D1": 0.1, "D2": -0.2}))
assert rev < null < mimic
assert abs(null) < abs(rev)
def test_same_sign_enrichment_returns_zero(self):
from src.scoring import connectivity_score
# Both up- and down-sets at the top => same-sign ES => ambiguous => 0 (WTCS rule).
prof = self._profile({"U1": 8, "U2": 7, "D1": 6, "D2": 5})
assert connectivity_score(self.UP, self.DOWN, prof) == 0.0
def test_genes_absent_from_profile_are_ignored(self):
from src.scoring import connectivity_score
prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
# Adding a query gene not in the profile must not change the score.
s1 = connectivity_score(self.UP, self.DOWN, prof)
s2 = connectivity_score(self.UP + ["NOT_IN_PROFILE"], self.DOWN, prof)
assert s1 == pytest.approx(s2)
class TestMechanisticPrior:
def test_counts_distinct_sickle_pathways(self):
from src.scoring import mechanistic_prior
# ribonucleotide reductase (hydroxyurea) -> hbf_epigenetic category.
assert mechanistic_prior(["Ribonucleoside-diphosphate reductase RR1"]) == 1.0
# DNMT (epigenetic) + hemoglobin -> two categories.
assert mechanistic_prior(["DNA (cytosine-5)-methyltransferase 1", "Hemoglobin subunit beta"]) == 2.0
assert mechanistic_prior([]) == 0.0
assert mechanistic_prior(["Some unrelated kinase"]) == 0.0
class TestTauCalibration:
"""tau should reward a SPECIFIC reverser and give a near-zero score to a noise drug."""
@staticmethod
def _matrix() -> pd.DataFrame:
genes = [f"U{i}" for i in range(5)] + [f"D{i}" for i in range(5)] + [f"G{i}" for i in range(40)]
rng_vals = {g: 0.01 * ((hash(g) % 7) - 3) for g in genes} # tiny deterministic noise
# specific reverser: query-up genes at the bottom, query-down at the top, rest ~0
specific = dict(rng_vals)
for i in range(5):
specific[f"U{i}"] = -8 - i
specific[f"D{i}"] = 8 + i
noise = dict(rng_vals)
return pd.DataFrame([specific, noise], index=["specific", "noise"])[genes]
def test_specific_reverser_has_strongly_negative_tau(self):
from src.scoring import tau_calibrate
up = [f"U{i}" for i in range(5)]
down = [f"D{i}" for i in range(5)]
out = tau_calibrate(up, down, self._matrix(), n_null=300, seed=0)
# Ranked by spec_z (continuous); the specific reverser is the most negative.
assert out.loc["specific", "spec_z"] < -2
assert out.loc["specific", "spec_z"] < out.loc["noise", "spec_z"]
assert out.loc["specific", "tau"] < -50 # tau also flags it (may saturate near -100)
assert out.loc["specific", "rank"] == 1
def test_rank_drugs_orders_by_reversal():
from src.scoring import rank_drugs
genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]
base = {g: 0.0 for g in genes}
reverser = {**base, "U1": -8, "U2": -7, "D1": 8, "D2": 7}
mimic = {**base, "U1": 8, "U2": 7, "D1": -8, "D2": -7}
matrix = pd.DataFrame([reverser, mimic], index=["reverser", "mimic"])
ranked = rank_drugs(["U1", "U2"], ["D1", "D2"], matrix)
assert ranked.loc["reverser", "rank"] == 1
assert ranked.loc["reverser", "connectivity_score"] < ranked.loc["mimic", "connectivity_score"]