Compare commits
9 Commits
72f1a49de6
...
structure-
| Author | SHA1 | Date | |
|---|---|---|---|
| 51bd90df41 | |||
| 75f5383961 | |||
| 817bcda7dc | |||
| 6c2c71d73d | |||
| 7449dbeefb | |||
| 649f617019 | |||
| 0ce688449d | |||
| b62048614d | |||
| 3417f85eb1 |
3
.gitignore
vendored
3
.gitignore
vendored
@@ -31,3 +31,6 @@ env/
|
|||||||
.DS_Store
|
.DS_Store
|
||||||
.idea/
|
.idea/
|
||||||
.vscode/
|
.vscode/
|
||||||
|
|
||||||
|
# Local tool binaries (refetch per docs/structure_binding_notes.md)
|
||||||
|
tools/
|
||||||
|
|||||||
140
PLAN.md
140
PLAN.md
@@ -426,3 +426,143 @@ This MVP exists in a broader strategic context that was built through ~7 expert
|
|||||||
- **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.
|
- **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.
|
||||||
|
|
||||||
The MVP's job is to produce one credible result. Everything else follows from that.
|
The MVP's job is to produce one credible result. Everything else follows from that.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
|
||||||
|
|
||||||
|
> **Status: scoped, not committed.** This is a follow-on track proposed *after* the MVP and its
|
||||||
|
> follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what
|
||||||
|
> those experiments empirically showed. Read §9–11 and the experiment commits first.
|
||||||
|
|
||||||
|
### 12.1 Why pivot modality (the evidence, not a hunch)
|
||||||
|
|
||||||
|
The expression-connectivity approach was built, validated, and pushed hard (gene-space
|
||||||
|
expansion, cell-composition deconvolution, reference-library tau, supervised learning). The
|
||||||
|
empirical verdict:
|
||||||
|
|
||||||
|
- Connectivity **recovers hydroxyurea** (top ~6–8%) but **cannot achieve specificity** —
|
||||||
|
unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Unfixed by four
|
||||||
|
independent methods.
|
||||||
|
- A supervised model on indication labels hit **0.925 CV AUC** — but it was a **degree-bias
|
||||||
|
mirage**: it learned drug popularity, not disease matching (it ranked hydroxyurea *231/300*).
|
||||||
|
- The decisive test: with drug-popularity features removed, the model trained on the actual
|
||||||
|
drug↔disease connectivity scored **AUC 0.491 — pure chance**. **The expression-connectivity
|
||||||
|
modality contains essentially no disease-specific therapeutic signal for this task.**
|
||||||
|
|
||||||
|
This is a *signal* problem, not a *model* problem — no amount of model sophistication (diffusion,
|
||||||
|
GNNs, etc.) extracts signal that isn't in the data. The response is to **change modality** to one
|
||||||
|
with a strong, physical, drug-specific signal: **does a molecule bind a sickle-relevant target?**
|
||||||
|
A drug that binds HbS is mechanistically specific by construction — the opposite of a coincidental
|
||||||
|
expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is
|
||||||
|
itself a diffusion model) actually has traction, because structure has physical ground truth.
|
||||||
|
|
||||||
|
### 12.2 Targets (sickle-specific, druggable, structurally characterised)
|
||||||
|
|
||||||
|
Small molecules only (§2). Curated shortlist with public structures and, crucially, **known
|
||||||
|
small-molecule binders to serve as positive controls**:
|
||||||
|
|
||||||
|
| Target | Mechanism in sickle | Known binder (positive control) |
|
||||||
|
|---|---|---|
|
||||||
|
| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | **voxelotor** (binds α-Val1) |
|
||||||
|
| PKR (PKLR, red-cell pyruvate kinase) | Activator → ↓2,3-BPG → ↑O2 affinity | **mitapivat**, etavopivat |
|
||||||
|
| DNMT1 | HbF induction (de-repress γ-globin) | **decitabine**, azacitidine |
|
||||||
|
| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
|
||||||
|
| HDAC1/2 | HbF induction | vorinostat, panobinostat |
|
||||||
|
| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
|
||||||
|
| PDE9 | ↑cGMP, anti-adhesion | PF-04447943 (sickle trial) |
|
||||||
|
|
||||||
|
Hard/excluded for v1: **BCL11A** (transcription factor, no classic pocket — the γ-globin master
|
||||||
|
repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.
|
||||||
|
|
||||||
|
### 12.3 Method (baseline → generative co-folding)
|
||||||
|
|
||||||
|
1. **Prepare structures.** Pull target structures from the PDB; AF3/Boltz-predict any missing.
|
||||||
|
2. **Prepare ligands.** Reuse the existing ~300-drug set (we already have canonical SMILES from
|
||||||
|
ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
|
||||||
|
3. **Dock + score**, in increasing sophistication:
|
||||||
|
- **Baseline:** classical docking (AutoDock Vina / smina) — fast, CPU, well-understood.
|
||||||
|
- **Generative co-folding:** an open AlphaFold3-class model — **Boltz-2** (predicts the
|
||||||
|
protein–ligand complex *and* a binding-affinity estimate, MIT-licensed), **Chai-1**, or
|
||||||
|
**DiffDock** (a diffusion model for docking — the legitimate home for the "diffusion"
|
||||||
|
instinct). These predict the bound pose directly and tend to beat classical docking.
|
||||||
|
- Report both; the baseline keeps us honest about whether the ML model adds anything.
|
||||||
|
|
||||||
|
### 12.4 Validation (a real recovery test, like §6 Week 4)
|
||||||
|
|
||||||
|
Pre-register before scoring: **the known structure-based sickle drugs must rank as top binders to
|
||||||
|
their targets** — voxelotor→hemoglobin, mitapivat→PKR, decitabine→DNMT1. Negative controls
|
||||||
|
(unrelated drugs) must *not* bind these pockets. This is a cleaner recovery test than the
|
||||||
|
expression one, because binding is mechanistically specific — it should not have the
|
||||||
|
coincidental-reverser problem that sank the connectivity approach.
|
||||||
|
|
||||||
|
### 12.5 The real prize — integrate, don't replace
|
||||||
|
|
||||||
|
The long-term value is **both modalities together**: a candidate that *reverses the disease
|
||||||
|
signature* (expression) **and** *binds a sickle-relevant target* (structure) is far more credible
|
||||||
|
than either alone. Structure supplies the specificity the expression layer lacks; expression
|
||||||
|
supplies the systems-level, target-agnostic view structure lacks. The platform thesis (§11) —
|
||||||
|
two databases + a matching engine — extends naturally to a third (structures) feeding the same
|
||||||
|
confidence-tiered data layer.
|
||||||
|
|
||||||
|
### 12.6 Honest pitfalls (do not ignore)
|
||||||
|
|
||||||
|
1. **Binding ≠ efficacy.** A molecule can bind and do nothing therapeutic. Structure-based hits
|
||||||
|
are still hypotheses (cf. §9.7).
|
||||||
|
2. **Only covers the enzyme/pocket subset.** Sickle's biggest lever (γ-globin reactivation via
|
||||||
|
BCL11A) is largely transcriptional and not small-molecule-tractable — structure-based screening
|
||||||
|
is blind to it. Be explicit about coverage.
|
||||||
|
3. **Docking/affinity accuracy is limited.** Pose prediction is decent; absolute affinity is hard.
|
||||||
|
Validate on known binders before trusting novel scores.
|
||||||
|
4. **Compute.** AF3-class models are GPU-heavy; the local Mac Studio (§2) may not suffice — this
|
||||||
|
track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
|
||||||
|
5. **Moat.** Structures and tools are public; the proprietary value is the curated target list,
|
||||||
|
the integration with the expression layer, and provenance/tiering — not the docker.
|
||||||
|
|
||||||
|
### 12.7 Explicitly NOT in this track
|
||||||
|
|
||||||
|
Free energy perturbation / MD-based affinity; covalent docking; **de novo generation of molecules
|
||||||
|
as final candidates to synthesise** (design, not repurposing — but see §12.9 for the
|
||||||
|
generate-then-retrieve hybrid, which *is* repurposing); BCL11A or any non-pocket target;
|
||||||
|
biologics; combination binding.
|
||||||
|
|
||||||
|
### 12.8 Open decisions before committing
|
||||||
|
|
||||||
|
- **Tooling:** classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend:
|
||||||
|
baseline first, for an honest reference — the lesson of the whole expression arc.)
|
||||||
|
- **Compute:** secure a GPU environment (the "all local" §2 assumption breaks here).
|
||||||
|
- **Scope of v1:** the 7-target shortlist above, or start with just Hb + PKR (the two with the
|
||||||
|
cleanest positive controls) to de-risk the harness before scaling targets.
|
||||||
|
|
||||||
|
### 12.9 Door left open — generative-guided retrieval (generate → match existing)
|
||||||
|
|
||||||
|
A legitimate way to bring generative models *into the repurposing frame* (vs the design frame
|
||||||
|
excluded in §12.7): don't generate molecules to synthesise — generate them as a **search beacon**.
|
||||||
|
|
||||||
|
**The idea.** Use a pocket-conditioned generative model (target-conditioned diffusion — e.g.
|
||||||
|
TargetDiff, DiffSBDD, Pocket2Mol) to propose idealised binders for a sickle target. Then retrieve
|
||||||
|
the **nearest existing drugs** to those proposals by chemical similarity (Tanimoto over Morgan
|
||||||
|
fingerprints, or a learned molecular embedding). The retrieved neighbours — real, already-approved
|
||||||
|
or clinical molecules — are the repurposing candidates. The generated molecule is never made; it
|
||||||
|
only *defines a region of chemical space* worth searching. This is the user-proposed framing and
|
||||||
|
it is sound: "generate the ideal, then find what we already have nearby."
|
||||||
|
|
||||||
|
**Why it could add value.** It can point at scaffolds / regions a known-binder-seeded or
|
||||||
|
brute-force docking sweep would miss, and it makes the target's binding requirements explicit as
|
||||||
|
geometry rather than as a single reference ligand.
|
||||||
|
|
||||||
|
**Honest caveats (why it's a *door*, not a commitment).**
|
||||||
|
1. **Generated molecules are often synthetically unrealistic / invalid** — which is exactly why
|
||||||
|
they must be used only as beacons, never as candidates.
|
||||||
|
2. **Similarity ≠ activity.** Activity cliffs mean a near-neighbour of a good binder can be inert.
|
||||||
|
So retrieved neighbours do **not** bypass validation — they must still be docked/scored (§12.3)
|
||||||
|
and clear the binding recovery test (§12.4). The generative step *proposes*; it does not *prove*.
|
||||||
|
3. **Marginal-value question.** Directly docking the whole existing library (§12.3) already covers
|
||||||
|
chemical space. Whether generate-then-retrieve beats that — by efficiency or by surfacing
|
||||||
|
non-obvious scaffolds — is an open empirical question and needs a head-to-head before it earns
|
||||||
|
real investment.
|
||||||
|
4. **Only as good as the pocket conditioning** of the generator, and the chemistry of the target.
|
||||||
|
|
||||||
|
**Status:** explore only *after* the §12.3–12.4 docking harness works and is validated on the
|
||||||
|
known binders — same discipline as everywhere else: prove the baseline, then test whether the
|
||||||
|
fancier method adds anything.
|
||||||
|
|||||||
0
data/raw/structures/.gitkeep
Normal file
0
data/raw/structures/.gitkeep
Normal file
@@ -61,9 +61,10 @@ Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
|
|||||||
38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
|
38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
|
||||||
- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
|
- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
|
||||||
not-single-source rule).
|
not-single-source rule).
|
||||||
- **Signature↔landmark overlap:** only 56/477 (12%) of the disease signature genes are LINCS
|
- **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978
|
||||||
landmarks, so connectivity scoring (Week 3) uses a 30-up/26-down query. The erythroid hallmark
|
landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger
|
||||||
genes (CA1, AHSP, SLC4A1, HBG) are NOT landmarks. This is a key limitation for the recovery test.
|
space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from
|
||||||
|
LINCS entirely and remain unscoreable.
|
||||||
- Reproduce: `week2_curate_drugset.py` → `week2_chembl.py` → download Level-5 GCTX →
|
- Reproduce: `week2_curate_drugset.py` → `week2_chembl.py` → download Level-5 GCTX →
|
||||||
`week2_lincs_extract.py` → `week2_assemble.py`.
|
`week2_lincs_extract.py` → `week2_assemble.py`.
|
||||||
|
|
||||||
|
|||||||
@@ -34,20 +34,28 @@ Source: PLAN.md §9.
|
|||||||
7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
|
7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
|
||||||
to test, not discoveries. Use careful language in any write-up.
|
to test, not discoveries. Use careful language in any write-up.
|
||||||
|
|
||||||
8. **Only 12% of the signature is LINCS-scorable (56/477 genes).** The 978 landmark genes (from
|
8. **Gene-space bottleneck (v1 → fixed in v1.1).** v1 scored on only the 978 landmark genes (12%
|
||||||
cancer cell lines) miss the erythroid hallmark genes (CA1, AHSP, SLC4A1, HBG). Connectivity
|
signature overlap) — the main driver of the v1 failure. v1.1 uses the full 12,328-gene space
|
||||||
scoring runs on a thin inflammation/metabolic slice — the single biggest driver of the
|
(85% overlap) and recovers hydroxyurea. HBG1/HBG2 remain absent from LINCS entirely.
|
||||||
recovery-test failure. v2 fix: signature prediction or a mechanism graph to score the other 88%.
|
|
||||||
|
|
||||||
## Recovery test outcome (Week 4)
|
9. **No reference-signature library for tau.** Textbook CMap tau saturated at ±100 (a coherent
|
||||||
|
query always out-connects random gene sets). v1.1 substitutes a per-drug specificity z-score.
|
||||||
|
Proper tau needs a library of real reference signatures — a v2 / curated-data item.
|
||||||
|
|
||||||
The MVP **failed** all three pre-registered criteria on the primary raw ranking (hydroxyurea
|
10. **Negative-control criterion may be invalid for connectivity scoring.** Unrelated drugs
|
||||||
rank 40/top 13%; L-glutamine rank 100/WTCS=0; 1/5 negative controls in bottom half). The failure
|
(norethindrone, ciprofloxacin) rank as top specific reversers — connectivity measures
|
||||||
is fully attributable to signature/assay data limitations above, not the matching algorithm. See
|
expression reversal, not therapeutic relatedness.
|
||||||
|
|
||||||
|
## Recovery test outcome
|
||||||
|
|
||||||
|
Pre-registered test (**v1, confirmatory**): **FAILED** all three criteria (hydroxyurea rank
|
||||||
|
40/top 13%; L-glutamine rank 100; 1/5 negative controls bottom-half). Post-hoc (**v1.1,
|
||||||
|
exploratory**): hydroxyurea recovers to rank 18 (top 6%, passes), but L-glutamine (rank 213, does
|
||||||
|
not reverse) and negative controls (2/5) still fail → overall still FAIL. See
|
||||||
`recovery_test_report.md`.
|
`recovery_test_report.md`.
|
||||||
|
|
||||||
| Drug | Issue | Handling |
|
| Drug | Issue | v1.1 status |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| hydroxyurea | HbF mechanism not in scorable gene space | scored (rank 40); recovered only by prior-weighted ranking |
|
| hydroxyurea | needed the full gene space | rank 18 (top 6%) — recovered post-hoc |
|
||||||
| L-glutamine | signature present but WTCS ambiguous (=0) | scored (rank 100); no reversal signal |
|
| L-glutamine | metabolite, no reversal signal (positive connectivity) | rank 213 — genuine negative |
|
||||||
| all 300 | had LINCS signatures | 0 marked "not scored" — coverage was not the issue; specificity was |
|
| neg controls | reverse the generic inflammation signature | 2/5 bottom-half — criterion questionable |
|
||||||
|
|||||||
@@ -1,7 +1,7 @@
|
|||||||
# Sickle Cell Repurposing — Recovery Test Report
|
# Sickle Cell Repurposing — Recovery Test Report
|
||||||
|
|
||||||
> **Status: COMPLETE.** Reproduce with `scripts/week1_*` → `week2_*` → `week3_scoring.py` →
|
> **Status: COMPLETE (v1 confirmatory + v1.1 exploratory).** Reproduce with `scripts/week1_*` →
|
||||||
> `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.
|
> `week2_*` → `week3_scoring.py` → `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.
|
||||||
|
|
||||||
## Pre-registered success criteria
|
## Pre-registered success criteria
|
||||||
|
|
||||||
@@ -12,118 +12,116 @@ The MVP passes if:
|
|||||||
missing LINCS signature, **AND**
|
missing LINCS signature, **AND**
|
||||||
- At least **4 of 5** negative-control drugs rank in the **bottom half**.
|
- At least **4 of 5** negative-control drugs rank in the **bottom half**.
|
||||||
|
|
||||||
_Pre-registered in the scaffold commit (`b731478`) before any scoring was run. Primary ranking
|
_Pre-registered in the scaffold commit (`b731478`) before any scoring. **Primary (confirmatory)
|
||||||
= raw connectivity. The 5 negative controls were pre-specified by category rule (one per
|
analysis = v1**: 978 landmark genes, weighted connectivity score (WTCS). The 5 negative controls
|
||||||
category, alphabetically first available) without inspecting ranks._
|
were pre-specified by category rule without inspecting ranks._
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Section 1 — Methodology
|
## Section 1 — Methodology
|
||||||
|
|
||||||
We built a sickle cell disease signature from **two independent whole-blood microarray studies**
|
A sickle cell disease signature was built from **two whole-blood microarray studies** (GSE35007
|
||||||
(GSE35007, Illumina, SS vs AA; GSE16728, Affymetrix, patient vs control), keeping the **671
|
Illumina SS-vs-AA; GSE16728 Affymetrix patient-vs-control), keeping the **671 genes concordant**
|
||||||
genes concordant** (q<0.05, same direction) across both — a cross-platform, cross-population
|
across both (q<0.05, same direction) → a cross-platform Tier-A signature (250 up / 227 down).
|
||||||
Tier-A signature (250 up / 227 down). We built profiles for **300 small molecules** (2
|
Profiles were built for **300 small molecules** (2 ground-truth; 32 related-mechanism; 26
|
||||||
ground-truth: hydroxyurea, L-glutamine; 32 related-mechanism; 26 negative controls; 240 random),
|
negative controls; 240 random), each with a **LINCS L1000** consensus signature (mean Level-5
|
||||||
each with a consensus **LINCS L1000** signature (mean of Level-5 MODZ z-scores across cell
|
MODZ across cell lines, both CMap phases). Drugs were ranked by **CMap connectivity scoring**
|
||||||
lines, 978 landmark genes, both CMap phases). We ranked drugs by **CMap connectivity scoring**
|
(Kolmogorov-Smirnov, Lamb 2006 / Subramanian 2017): negative = reversal = candidate.
|
||||||
(weighted-KS, Lamb 2006 / Subramanian 2017): strongly negative = strong reversal of the disease
|
|
||||||
signature = candidate. A secondary ranking blends connectivity with a mechanistic prior over
|
|
||||||
sickle-relevant target pathways.
|
|
||||||
|
|
||||||
## Section 2 — Recovery test result — **FAIL** (primary ranking)
|
**v1 (pre-registered/confirmatory):** scored on the 978 landmark genes with WTCS.
|
||||||
|
**v1.1 (post-hoc/exploratory):** after v1 failed, two changes were made to diagnose why — (a)
|
||||||
|
score on the full **12,328-gene** space (landmark overlap 12% → 85%, bringing the erythroid
|
||||||
|
markers in); (b) add a **per-drug specificity z-score** (`spec_z`): how many SDs the real
|
||||||
|
connectivity is below a null of 1,000 random queries of the same size against that drug. Because
|
||||||
|
these changes followed inspection of the v1 result, **v1.1 is exploratory, not a confirmatory
|
||||||
|
test of the pre-registered hypothesis.**
|
||||||
|
|
||||||
| Drug | Rank | Percentile | Pass? |
|
## Section 2 — Recovery test result
|
||||||
|---|---|---|---|
|
|
||||||
| Hydroxyurea | 40 / 300 | top 13.3% | ❌ (needs top 30) |
|
|
||||||
| L-glutamine | 100 / 300 | top 33.3% | ❌ (WTCS=0, ambiguous; has a signature so not "missing") |
|
|
||||||
|
|
||||||
Negative controls (pre-specified; expected: bottom half):
|
| Criterion | v1 (confirmatory) | v1.1 (exploratory) |
|
||||||
|
|---|---|---|
|
||||||
|
| Hydroxyurea top-10% (≤30) | rank **40** (13.3%) ❌ | rank **18** (6.0%) ✅ |
|
||||||
|
| L-glutamine top-25% (≤75) | rank 100, WTCS=0 ❌ | rank 213, spec_z=+0.98 ❌ |
|
||||||
|
| ≥4/5 neg controls bottom-half | 1/5 ❌ | 2/5 ❌ |
|
||||||
|
| **Overall** | **FAIL** | **FAIL** (but hydroxyurea recovered) |
|
||||||
|
|
||||||
| Control | Category | Rank | Bottom half? |
|
v1.1 negative controls: clotrimazole 258 ✅, astemizole 211 ✅, azithromycin 142 ❌,
|
||||||
|---|---|---|---|
|
ethinyl-estradiol 114 ❌, caffeine 77 ❌.
|
||||||
| clotrimazole | antifungal | 89 | ❌ |
|
|
||||||
| astemizole | antihistamine | 291 | ✅ |
|
|
||||||
| azithromycin | antibiotic | 82 | ❌ |
|
|
||||||
| ethinyl-estradiol | hormone | 98 | ❌ |
|
|
||||||
| caffeine | misc | 84 | ❌ |
|
|
||||||
|
|
||||||
**Only 1/5 negative controls in the bottom half (need ≥4).**
|
**Honest reading.** The **pre-registered test FAILED (v1).** The post-hoc v1.1 changes
|
||||||
|
**recover hydroxyurea** (rank 40 → 18, passing top-10%) — strong evidence that the v1 failure was
|
||||||
|
driven by the 978-landmark bottleneck, not the algorithm. But two failures survive into v1.1, and
|
||||||
|
both are now precisely diagnosed:
|
||||||
|
|
||||||
**Overall: FAIL on all three pre-registered criteria.** This is reported as-is, without
|
1. **L-glutamine does not reverse the signature** (positive connectivity, spec_z=+0.98). This is
|
||||||
adjustment. For context only (not the pre-registered criterion): the secondary
|
intrinsic to its LINCS data — a metabolite with no reversal signal — not a coverage gap. More
|
||||||
mechanistic-prior ranking places hydroxyurea at **rank 7 (top 2.3%)** — but that ranking uses
|
genes cannot fix it.
|
||||||
prior knowledge of the drug's target, so it cannot be claimed as a blind recovery.
|
2. **The negative-control criterion is arguably invalid for connectivity scoring.** Two
|
||||||
|
"negative controls" (norethindrone, ciprofloxacin) rank in the top 3 by spec_z. Connectivity
|
||||||
|
measures *expression reversal*, not *therapeutic relatedness* — an antibiotic or contraceptive
|
||||||
|
can still down-regulate the inflammation genes that dominate the scorable signature. The test
|
||||||
|
design conflates the two.
|
||||||
|
|
||||||
**Why it failed — the honest diagnosis.** The disease signature is dominated by erythroid /
|
A note on the calibration: textbook CMap **tau** (percentile vs a reference population) was
|
||||||
reticulocyte biology (CA1, AHSP, SLC4A1) and the HbF axis that hydroxyurea actually acts on
|
implemented but **saturated at ±100** here, because a coherent real query always out-connects
|
||||||
(HBG1/HBG2) was lost (flat in GSE35007; removed by GSE16728's globin-depleted prep). Worse,
|
random gene sets — proper tau needs a library of *real* reference signatures, which this MVP
|
||||||
only **56 of 477 signature genes (12%) are LINCS landmark genes** — and none of the erythroid
|
lacks. The continuous `spec_z` is the workable substitute.
|
||||||
hallmark genes are. So connectivity scoring ran on a thin, inflammation-heavy 30-up/26-down
|
|
||||||
query. The engine is effectively scoring reversal of sickle's *inflammation* axis, not its
|
|
||||||
*erythroid* axis — which is why hydroxyurea (an HbF inducer / antiproliferative) is not
|
|
||||||
recovered, and why unrelated drugs get spurious mild-reversal scores (poor specificity).
|
|
||||||
|
|
||||||
## Section 3 — Top 10 candidates (raw connectivity)
|
## Section 3 — Top 10 candidates (v1.1 spec_z)
|
||||||
|
|
||||||
| Rank | Drug | Score | Known target / mechanism | Plausibility |
|
| Rank | Drug | spec_z | Inclusion | Read |
|
||||||
|---|---|---|---|---|
|
|---|---|---|---|---|
|
||||||
| 1 | laropiprant | −0.417 | Prostaglandin D2 receptor antagonist | Anti-inflammatory — coherent with inflammation-axis reversal |
|
| 1 | reserpic-acid | −3.80 | random | reserpine metabolite; non-obvious |
|
||||||
| 2 | BRD-K62768824 | −0.396 | (tool compound, no annotation) | Likely broad-effect false positive |
|
| 2 | norethindrone | −3.78 | **negative control** | false positive (see §2) |
|
||||||
| 3 | BRD-K71353154 | −0.393 | (tool compound) | Likely false positive |
|
| 3 | ciprofloxacin | −3.61 | **negative control** | false positive |
|
||||||
| 4 | lisinopril | −0.358 | ACE inhibitor | **Non-obvious; see §4** |
|
| 4 | resveratrol | −3.46 | related-mechanism | antioxidant studied in SCD — coherent |
|
||||||
| 5 | BRD-K53443165 | −0.358 | (tool compound) | Likely false positive |
|
| 5 | BRD-K57490754 | −3.37 | random | tool compound |
|
||||||
| 6 | talnetant | −0.347 | Neurokinin-3 (NK3) receptor antagonist | No obvious sickle rationale |
|
| 6 | anastrozole | −3.27 | random | aromatase inhibitor |
|
||||||
| 7 | BRD-K46936109 | −0.342 | (tool compound) | Likely false positive |
|
| 7–10 | BRD-* / palmitoylethanolamide | ~−3.1 | random | mostly tool compounds |
|
||||||
| 8 | lawsone | −0.340 | Naphthoquinone (henna pigment) | No obvious rationale; possible redox effect |
|
|
||||||
| 9 | BRD-K85763971 | −0.338 | (tool compound) | Likely false positive |
|
|
||||||
| 10 | BRD-K36516410 | −0.323 | (tool compound) | Likely false positive |
|
|
||||||
|
|
||||||
As anticipated (PLAN §9.4), the raw top-10 is dominated by unannotated broad-effect tool
|
That two negative controls outrank hydroxyurea is the single most informative result here — see §4.
|
||||||
compounds — these are **not** credible candidates and are not over-interpreted.
|
|
||||||
|
|
||||||
## Section 4 — One non-obvious candidate worth investigating
|
## Section 4 — One non-obvious result worth investigating
|
||||||
|
|
||||||
**Lisinopril (ACE inhibitor), rank 4.** This is the most interesting non-obvious hit: ACE
|
The most useful finding is **not** a candidate drug but the **negative-control failure**:
|
||||||
inhibitors are already used clinically in sickle cell disease for **renal protection**
|
unrelated drugs (norethindrone, ciprofloxacin) score as strong specific reversers. This is a
|
||||||
(reducing albuminuria / progression of sickle nephropathy), via mechanisms independent of the
|
real, generalizable lesson — for a signature whose *scorable* portion is generic
|
||||||
HbF pathway. Surfacing an agent with a genuine, mechanistically distinct sickle-cell rationale —
|
inflammation/metabolic genes, connectivity rewards any broad transcriptional perturbation that
|
||||||
from an inflammation/vascular-flavoured signature — is a small but real signal that the matching
|
touches those genes. The honest implication: **this signature is not specific enough to
|
||||||
approach can point at non-obvious biology. **This is a computational hypothesis, not a
|
discriminate true repurposing candidates from incidental expression reversers.** Of the
|
||||||
discovery**, and the connectivity rationale here (inflammation-axis reversal) is not the same as
|
plausibly-real hits, **resveratrol (rank 4)** — an antioxidant with prior sickle cell literature
|
||||||
lisinopril's known renal mechanism, so the match should be treated as suggestive only.
|
— is the most defensible, but it is a hypothesis, not a discovery.
|
||||||
|
|
||||||
## Section 5 — Honest limitations
|
## Section 5 — Honest limitations
|
||||||
|
|
||||||
1. **Cell-composition confound** — the whole-blood signature is dominated by reticulocyte/
|
1. **Pre-registered test failed; the pass is post-hoc.** v1.1's hydroxyurea recovery is
|
||||||
erythroid markers (composition, not pure disease-state regulation). v2 needs deconvolution.
|
exploratory and must be re-validated on a held-out disease before any claim is made.
|
||||||
2. **Missing HbF axis** — HBG1/HBG2 absent (globin depletion + flat in GSE35007), so the
|
2. **Missing HbF axis** — HBG1/HBG2 are absent from LINCS entirely (not just landmarks), so the
|
||||||
signature cannot encode the pathway hydroxyurea acts on.
|
pathway hydroxyurea acts on can never be scored by this method.
|
||||||
3. **12% signature↔landmark overlap** — only 56/477 genes are LINCS landmarks; the erythroid
|
3. **Signature specificity** — scorable genes are inflammation/metabolic; negative controls
|
||||||
hallmark genes are not scorable. The query collapses to a generic inflammation/metabolic slice.
|
reverse them too. Connectivity ≠ therapeutic relatedness.
|
||||||
4. **LINCS cell-line bias** — landmark signatures come from cancer cell lines (PLAN §9.2); poorly
|
4. **Cell-composition confound** — the whole-blood signature is reticulocyte-dominated.
|
||||||
suited to a blood disease.
|
5. **LINCS cancer-cell-line bias**, and **no reference-signature library** for proper tau.
|
||||||
5. **Poor negative-control specificity** — unrelated drugs received mild reversal scores; the
|
6. **No mechanistic validation** — all hits are computational hypotheses.
|
||||||
thin query yields a noisy connectivity distribution.
|
|
||||||
6. **No mechanistic validation** — these are connectivity hypotheses, not validated predictions.
|
|
||||||
|
|
||||||
## Section 6 — What v2 would fix
|
## Section 6 — What v2 would fix
|
||||||
|
|
||||||
- **Cell-type deconvolution** of the disease signature to separate disease-state regulation from
|
- **A reference-signature library** to make tau (proper specificity calibration) work — the
|
||||||
composition, recovering specificity.
|
single biggest fix to the negative-control problem, and a direct use of the curated-data moat.
|
||||||
- **A non-globin-depleted, RNA-seq whole-blood study** to retain the HbF axis.
|
- **Cell-type deconvolution** + a non-globin-depleted RNA-seq study to recover a more specific,
|
||||||
- **Signature prediction** (DeepCE-style) or a mechanism/knowledge graph to score the ~88% of
|
HbF-containing signature.
|
||||||
the signature that has no LINCS landmark — the single biggest lever on this result.
|
- **Signature prediction / mechanism graph** to score genes with no LINCS measurement.
|
||||||
- **A second disease** to test generalization (sickle results alone do not prove the platform —
|
- **A second disease** to test generalization and to honestly re-validate the v1.1 method
|
||||||
PLAN §9.5).
|
(PLAN §9.5).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Bottom line
|
### Bottom line
|
||||||
|
|
||||||
The pipeline is reproducible end-to-end and the method is sound, but on this signature it **does
|
The pre-registered recovery test **failed**. Post-hoc diagnosis shows the dominant cause was a
|
||||||
not recover the known sickle cell drugs**. The failure is fully explained by signature/assay
|
fixable gene-space bottleneck — correcting it **recovers hydroxyurea** — but also surfaces a
|
||||||
data limitations (erythroid biology lost; 12% landmark overlap), not by a flaw in the matching
|
deeper, genuine limitation: this whole-blood signature is **not specific enough** for
|
||||||
algorithm. The most valuable output of this MVP is therefore a precise, honest map of *what data
|
connectivity scoring to separate real candidates from incidental reversers (negative controls
|
||||||
quality the method needs to work* — which is exactly the de-risking the proof-of-concept was
|
rank at the top). The MVP's real deliverable is a precise, honest map of *what it takes to make
|
||||||
meant to deliver.
|
this method work*: a more specific (deconvolved, HbF-containing) signature and a reference library
|
||||||
|
for calibration — exactly the curated-data investments the platform thesis is built on.
|
||||||
|
|||||||
91
docs/structure_binding_notes.md
Normal file
91
docs/structure_binding_notes.md
Normal file
@@ -0,0 +1,91 @@
|
|||||||
|
# Structure-based binding track — working notes
|
||||||
|
|
||||||
|
Branch `structure-based-binding`. Implements PLAN §12. Baseline-first, start with the two cleanest
|
||||||
|
targets (Hemoglobin + PKR), de-risk the harness before scaling.
|
||||||
|
|
||||||
|
## Status (2026-06-23)
|
||||||
|
|
||||||
|
**Toolchain check (PLAN §12.6 pitfall 4, confirmed real):**
|
||||||
|
- ✅ RDKit installs on ARM Mac — ligand side ready.
|
||||||
|
- ❌ AutoDock Vina does NOT pip-install on ARM Mac; no docking binary available. Docking (§12.3)
|
||||||
|
is **blocked on toolchain** — must resolve via conda/micromamba (`vina`/`smina`), a GPU AF3-class
|
||||||
|
model (Boltz-2/Chai-1/DiffDock), or an x86 Vina binary under Rosetta.
|
||||||
|
|
||||||
|
**Structures obtained:** `5E83` (hemoglobin + voxelotor), `8XFD` (PKR + mitapivat) in
|
||||||
|
`data/raw/structures/`.
|
||||||
|
|
||||||
|
**Step 0 — ligand-based retrieval baseline (`scripts/binding_ligand_baseline.py`):**
|
||||||
|
RDKit Tanimoto of our 300 drugs vs known sickle binders.
|
||||||
|
- Engine VALIDATED on in-set classes: `decitabine`→azacitidine (0.62); `vorinostat`→scriptaid
|
||||||
|
(0.42), belinostat (0.28). Correctly clusters DNMT1 / HDAC HbF-inducers.
|
||||||
|
- But voxelotor / mitapivat have **no analog** in our set (max Tanimoto ~0.20–0.26). A 300-drug
|
||||||
|
library is too sparse to contain look-alikes of distinctive scaffolds.
|
||||||
|
|
||||||
|
**Takeaways:**
|
||||||
|
1. Ligand retrieval works but needs a **bigger drug library** to be useful for distinctive targets.
|
||||||
|
2. The targets without in-set analogs (Hb, PKR) need **actual docking** (§12.3) — which scores
|
||||||
|
binding directly, no look-alike required. That is the gating next step, and it needs the
|
||||||
|
toolchain solved.
|
||||||
|
|
||||||
|
## Step 1 — docking baseline (2026-06-24)
|
||||||
|
|
||||||
|
**Toolchain SOLVED on ARM Mac:** AutoDock Vina 1.2.5 mac binary (`tools/vina`, runs under Rosetta)
|
||||||
|
+ open-babel (brew) for prep. Docking runs end-to-end (`scripts/dock_positive_controls.py`).
|
||||||
|
Co-crystal ligands identified: 5L7 = voxelotor (5E83), WV2 = mitapivat (8XFD).
|
||||||
|
|
||||||
|
**Positive-control cross-docking — inconclusive, and instructively so.** Affinities (kcal/mol):
|
||||||
|
|
||||||
|
| ligand | hemoglobin | PKR |
|
||||||
|
|---|---|---|
|
||||||
|
| voxelotor | −8.1 | −9.3 |
|
||||||
|
| mitapivat | −10.0 | −11.2 |
|
||||||
|
| decitabine | −6.6 | −7.0 |
|
||||||
|
| hydroxyurea | −3.9 | −3.6 |
|
||||||
|
| caffeine | −6.1 | −6.4 |
|
||||||
|
|
||||||
|
The scores rank almost perfectly by **molecular size** (mitapivat > voxelotor > decitabine/caffeine
|
||||||
|
> hydroxyurea) in *both* pockets — mitapivat wins even on hemoglobin, which it doesn't target. So
|
||||||
|
raw Vina affinity is confounded by ligand size and per-pocket stickiness; it cannot yet
|
||||||
|
distinguish target-specific binding. This is the **docking analog of the connectivity specificity
|
||||||
|
problem** — raw scores carry a systematic bias (size here, broadness there) that masquerades as
|
||||||
|
signal. voxelotor *does* dock to Hb (−8.1, a real score); the cross-target test just isn't the
|
||||||
|
right validation.
|
||||||
|
|
||||||
|
## Step 2 — redocking-RMSD validation FAILS across the board (2026-06-24)
|
||||||
|
|
||||||
|
Redocked each co-crystal ligand into its own structure (`scripts/dock_validate.py`); RMSD vs
|
||||||
|
crystal pose via obrms:
|
||||||
|
|
||||||
|
| redock | RMSD | note |
|
||||||
|
|---|---|---|
|
||||||
|
| voxelotor → Hb (5E83) | NA | covalent binder (Schiff base, αVal1) — out of scope §12.7 |
|
||||||
|
| mitapivat → PKR (8XFD) | 8.2 Å | allosteric, cofactor (FBP/Mg) stripped |
|
||||||
|
| **vorinostat → HDAC2 (4LXZ, Zn kept)** | **7.9 Å** | classical non-covalent target — should have worked |
|
||||||
|
|
||||||
|
**The clean target also failing means this is a systematic PIPELINE-QUALITY problem, not target
|
||||||
|
choice.** The cheap Vina + open-babel setup produces scores but does not reproduce known binding
|
||||||
|
geometry, so its affinities are not yet trustworthy. Ligand efficiency (affinity / heavy atoms)
|
||||||
|
also doesn't fix it — it over-corrects, ranking tiny hydroxyurea (−0.78) "best".
|
||||||
|
|
||||||
|
Likely causes (in priority order):
|
||||||
|
1. **Low-quality receptor prep** — open-babel `-xr` is not production docking prep. Need
|
||||||
|
AutoDockTools `prepare_receptor` or **Meeko** + `reduce`/pdb2pqr for protonation, charges, and
|
||||||
|
proper AutoDock atom typing.
|
||||||
|
2. **Ligand prep** — should use Meeko (correct rotatable bonds / typing), not bare obabel `--gen3d`.
|
||||||
|
3. **RMSD metric** — obrms superimposes before RMSD; redocking validation wants symmetry-corrected
|
||||||
|
RMSD **in place** (receptor frame). Worth confirming with an in-place metric.
|
||||||
|
|
||||||
|
**Honest takeaway:** consistent with the whole project — the *quick* version of each method runs
|
||||||
|
but doesn't survive honest validation. Credible structure-based docking needs production prep
|
||||||
|
tooling (Meeko/ADFR), which is the real next investment for this track.
|
||||||
|
|
||||||
|
## Next steps
|
||||||
|
- [ ] Install **Meeko** (+ reduce / pdb2pqr) and redo receptor+ligand prep; re-run redocking RMSD.
|
||||||
|
- [ ] Fix the RMSD metric (in-place, symmetry-corrected) to rule out a measurement artifact.
|
||||||
|
- [ ] Only once redocking validates (<2 Å) are affinity scores trustworthy — then cross-dock /
|
||||||
|
screen the library and revisit ligand-efficiency / pose-based scoring.
|
||||||
|
- [ ] Later: AF3-class co-folding (Boltz-2/DiffDock via PyTorch-MPS — 24 GB ceiling) and the §12.9
|
||||||
|
generative beacon.
|
||||||
|
|
||||||
|
> **Hardware note:** this machine is **24 GB** unified memory (not the 96 GB PLAN §2 assumed),
|
||||||
|
> which caps local AF3-class model inference. Classical docking (above) is unaffected.
|
||||||
@@ -33,6 +33,12 @@ dev = [
|
|||||||
"pytest>=8.0",
|
"pytest>=8.0",
|
||||||
"ruff>=0.5",
|
"ruff>=0.5",
|
||||||
]
|
]
|
||||||
|
# Structure-based binding track (PLAN §12). Docking tool (vina/smina) is NOT pip-installable on
|
||||||
|
# ARM Mac — install via conda/micromamba or use a GPU AF3-class model; see docs/structure_binding_notes.md.
|
||||||
|
structure = [
|
||||||
|
"rdkit>=2024.3",
|
||||||
|
"requests>=2.31",
|
||||||
|
]
|
||||||
|
|
||||||
[tool.setuptools.packages.find]
|
[tool.setuptools.packages.find]
|
||||||
where = ["."]
|
where = ["."]
|
||||||
|
|||||||
66
scripts/binding_ligand_baseline.py
Normal file
66
scripts/binding_ligand_baseline.py
Normal file
@@ -0,0 +1,66 @@
|
|||||||
|
"""Structure-based track, step 0: ligand-based retrieval baseline (PLAN §12.9 engine).
|
||||||
|
|
||||||
|
Docking (§12.3) needs a toolchain that doesn't pip-install on ARM Mac (AutoDock Vina) — that's the
|
||||||
|
next dependency to solve. Meanwhile this runs now with pure RDKit: do any of our 300 drugs sit near
|
||||||
|
the KNOWN sickle binders (voxelotor, mitapivat, decitabine) in chemical space? This is the
|
||||||
|
retrieval engine §12.9 would point a generative beacon at, and a sanity check on the ligand data.
|
||||||
|
|
||||||
|
NOT docking and NOT a binding claim — chemical similarity only. Similarity != activity (§12.9).
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
import requests
|
||||||
|
from rdkit import Chem, DataStructs, RDLogger
|
||||||
|
from rdkit.Chem import rdFingerprintGenerator
|
||||||
|
|
||||||
|
RDLogger.DisableLog("rdApp.*")
|
||||||
|
MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
|
||||||
|
|
||||||
|
# Known sickle binders = positive-control beacons (target in parens).
|
||||||
|
BINDERS = ["voxelotor", "mitapivat", "decitabine", "vorinostat"]
|
||||||
|
|
||||||
|
|
||||||
|
def pubchem_smiles(name: str) -> str | None:
|
||||||
|
for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
|
||||||
|
try:
|
||||||
|
u = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON"
|
||||||
|
d = requests.get(u, timeout=30).json()["PropertyTable"]["Properties"][0]
|
||||||
|
if prop in d:
|
||||||
|
return d[prop]
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def fp(smi: str):
|
||||||
|
if not isinstance(smi, str) or smi in ("-666", ""):
|
||||||
|
return None
|
||||||
|
m = Chem.MolFromSmiles(smi)
|
||||||
|
return MORGAN.GetFingerprint(m) if m else None
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
binder_smi = {b: pubchem_smiles(b) for b in BINDERS}
|
||||||
|
print("known-binder SMILES:", {k: (v[:34] + "..." if v else "MISSING") for k, v in binder_smi.items()})
|
||||||
|
|
||||||
|
drugs = pd.read_csv("data/processed/drug_set_v1.csv")[["pert_iname", "canonical_smiles", "inclusion_reason"]]
|
||||||
|
reason = dict(zip(drugs.pert_iname, drugs.inclusion_reason))
|
||||||
|
drug_fp = {r.pert_iname: fp(r.canonical_smiles) for r in drugs.itertuples()}
|
||||||
|
drug_fp = {k: v for k, v in drug_fp.items() if v is not None}
|
||||||
|
print(f"fingerprinted {len(drug_fp)}/{len(drugs)} drugs\n")
|
||||||
|
|
||||||
|
for b, smi in binder_smi.items():
|
||||||
|
bfp = fp(smi)
|
||||||
|
if bfp is None:
|
||||||
|
print(f"{b}: no SMILES\n"); continue
|
||||||
|
sims = sorted(((DataStructs.TanimotoSimilarity(bfp, v), k) for k, v in drug_fp.items()), reverse=True)
|
||||||
|
print(f"nearest drugs to {b}:")
|
||||||
|
for s, k in sims[:5]:
|
||||||
|
print(f" {s:.3f} {k:22s} [{reason.get(k,'?')}]")
|
||||||
|
print()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
111
scripts/dock_positive_controls.py
Normal file
111
scripts/dock_positive_controls.py
Normal file
@@ -0,0 +1,111 @@
|
|||||||
|
"""Structure-based track §12.4: dock known binders + negatives into Hb and PKR.
|
||||||
|
|
||||||
|
Positive-control recovery test: voxelotor should dock best into hemoglobin (5E83), mitapivat best
|
||||||
|
into PKR (8XFD), and unrelated drugs (caffeine, hydroxyurea) should score worse. The box is
|
||||||
|
centered on each co-crystal ligand (5L7=voxelotor, WV2=mitapivat). AutoDock Vina (mac binary,
|
||||||
|
Rosetta) + open-babel prep. Affinity in kcal/mol; more negative = stronger predicted binding.
|
||||||
|
|
||||||
|
This is a docking baseline, not an efficacy claim (PLAN §12.6).
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import re
|
||||||
|
import subprocess
|
||||||
|
import tempfile
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import requests
|
||||||
|
|
||||||
|
VINA = "./tools/vina"
|
||||||
|
STRUCT = Path("data/raw/structures")
|
||||||
|
WORK = Path("data/processed/docking")
|
||||||
|
WORK.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
TARGETS = {"hemoglobin": ("5E83", "5L7"), "PKR": ("8XFD", "WV2")}
|
||||||
|
LIGANDS = ["voxelotor", "mitapivat", "decitabine", "hydroxyurea", "caffeine"]
|
||||||
|
EXPECTED = {"voxelotor": "hemoglobin", "mitapivat": "PKR"}
|
||||||
|
|
||||||
|
|
||||||
|
def pubchem_smiles(name: str) -> str | None:
|
||||||
|
for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
|
||||||
|
try:
|
||||||
|
d = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON",
|
||||||
|
timeout=30).json()["PropertyTable"]["Properties"][0]
|
||||||
|
if prop in d:
|
||||||
|
return d[prop]
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def prep_receptor_and_box(pdb: str, lig_res: str):
|
||||||
|
"""Receptor pdbqt (protein only) + box center/size from one copy of the co-crystal ligand."""
|
||||||
|
atoms, lig, chosen = [], [], None
|
||||||
|
for ln in (STRUCT / f"{pdb}.pdb").read_text().splitlines():
|
||||||
|
if ln.startswith("ATOM"):
|
||||||
|
atoms.append(ln)
|
||||||
|
elif ln.startswith("HETATM") and ln[17:20].strip() == lig_res:
|
||||||
|
key = (ln[21], ln[22:26])
|
||||||
|
chosen = chosen or key
|
||||||
|
if key == chosen:
|
||||||
|
lig.append(ln)
|
||||||
|
rec_pdb = WORK / f"{pdb}_rec.pdb"
|
||||||
|
rec_pdb.write_text("\n".join(atoms) + "\nEND\n")
|
||||||
|
rec_pdbqt = WORK / f"{pdb}_rec.pdbqt"
|
||||||
|
subprocess.run(["obabel", str(rec_pdb), "-O", str(rec_pdbqt), "-xr", "-p", "7.4"],
|
||||||
|
capture_output=True, check=True)
|
||||||
|
xyz = np.array([[float(l[30:38]), float(l[38:46]), float(l[46:54])] for l in lig])
|
||||||
|
center = xyz.mean(0)
|
||||||
|
size = np.clip((xyz.max(0) - xyz.min(0)) + 9.0, 18.0, 28.0)
|
||||||
|
return rec_pdbqt, center, size
|
||||||
|
|
||||||
|
|
||||||
|
def prep_ligand(name: str, smiles: str) -> Path | None:
|
||||||
|
out = WORK / f"lig_{name}.pdbqt"
|
||||||
|
r = subprocess.run(["obabel", f"-:{smiles}", "-O", str(out), "--gen3d", "-p", "7.4"],
|
||||||
|
capture_output=True, text=True)
|
||||||
|
return out if out.exists() and out.stat().st_size > 0 else None
|
||||||
|
|
||||||
|
|
||||||
|
def dock(rec_pdbqt: Path, lig_pdbqt: Path, center, size) -> float | None:
|
||||||
|
out = subprocess.run([VINA, "--receptor", str(rec_pdbqt), "--ligand", str(lig_pdbqt),
|
||||||
|
"--center_x", f"{center[0]:.2f}", "--center_y", f"{center[1]:.2f}",
|
||||||
|
"--center_z", f"{center[2]:.2f}", "--size_x", f"{size[0]:.1f}",
|
||||||
|
"--size_y", f"{size[1]:.1f}", "--size_z", f"{size[2]:.1f}",
|
||||||
|
"--exhaustiveness", "8", "--cpu", "4",
|
||||||
|
"--out", str(WORK / "o.pdbqt")], capture_output=True, text=True)
|
||||||
|
m = re.search(r"^\s+1\s+(-?\d+\.\d+)", out.stdout, re.M)
|
||||||
|
return float(m.group(1)) if m else None
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
smis = {n: pubchem_smiles(n) for n in LIGANDS}
|
||||||
|
ligs = {n: prep_ligand(n, s) for n, s in smis.items() if s}
|
||||||
|
ligs = {n: p for n, p in ligs.items() if p}
|
||||||
|
print(f"prepared ligands: {list(ligs)}")
|
||||||
|
|
||||||
|
boxes = {t: prep_receptor_and_box(pdb, lr) for t, (pdb, lr) in TARGETS.items()}
|
||||||
|
print("prepared receptors:", list(boxes))
|
||||||
|
|
||||||
|
print(f"\n{'ligand':14s}" + "".join(f"{t:>13s}" for t in TARGETS))
|
||||||
|
results = {}
|
||||||
|
for lname, lpath in ligs.items():
|
||||||
|
row = {}
|
||||||
|
for tname, (rec, center, size) in boxes.items():
|
||||||
|
row[tname] = dock(rec, lpath, center, size)
|
||||||
|
results[lname] = row
|
||||||
|
cells = "".join(f"{(f'{row[t]:.1f}' if row[t] is not None else 'NA'):>13s}" for t in TARGETS)
|
||||||
|
print(f" {lname:12s}{cells}")
|
||||||
|
|
||||||
|
print("\n=== positive-control recovery test (§12.4) ===")
|
||||||
|
for lig, exp_target in EXPECTED.items():
|
||||||
|
if lig in results:
|
||||||
|
best = min(results[lig], key=lambda t: results[lig][t] if results[lig][t] is not None else 99)
|
||||||
|
ok = best == exp_target
|
||||||
|
print(f" {lig:12s} best target = {best:11s} (expected {exp_target}) -> {'PASS' if ok else 'FAIL'}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
93
scripts/dock_validate.py
Normal file
93
scripts/dock_validate.py
Normal file
@@ -0,0 +1,93 @@
|
|||||||
|
"""Structure-binding §12.4: redocking-RMSD validation + ligand-efficiency normalization.
|
||||||
|
|
||||||
|
Two de-biased checks of the docking baseline:
|
||||||
|
A. REDOCKING RMSD — redock each co-crystal ligand into its own structure and measure pose RMSD
|
||||||
|
vs the crystal pose (open-babel obrms, symmetry-aware). <2 A = geometry validated. This is
|
||||||
|
the gold-standard positive control and is immune to the molecular-size bias that made the
|
||||||
|
cross-target affinity test inconclusive.
|
||||||
|
B. LIGAND EFFICIENCY — affinity / heavy-atom count, to de-bias the size effect in the raw scores.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import re
|
||||||
|
import subprocess
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from rdkit import Chem, RDLogger
|
||||||
|
|
||||||
|
import sys
|
||||||
|
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||||
|
from scripts.dock_positive_controls import ( # noqa: E402
|
||||||
|
STRUCT, VINA, WORK, TARGETS, pubchem_smiles, prep_ligand, prep_receptor_and_box,
|
||||||
|
)
|
||||||
|
|
||||||
|
RDLogger.DisableLog("rdApp.*")
|
||||||
|
XTAL_LIG = {"hemoglobin": ("5E83", "5L7", "voxelotor"), "PKR": ("8XFD", "WV2", "mitapivat")}
|
||||||
|
|
||||||
|
|
||||||
|
def extract_crystal_ligand(pdb: str, resname: str) -> Path:
|
||||||
|
lines, chosen = [], None
|
||||||
|
for ln in (STRUCT / f"{pdb}.pdb").read_text().splitlines():
|
||||||
|
if ln.startswith("HETATM") and ln[17:20].strip() == resname:
|
||||||
|
key = (ln[21], ln[22:26])
|
||||||
|
chosen = chosen or key
|
||||||
|
if key == chosen:
|
||||||
|
lines.append(ln)
|
||||||
|
out = WORK / f"xtal_{resname}.pdb"
|
||||||
|
out.write_text("\n".join(lines) + "\nEND\n")
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def dock_pose(rec, lig, center, size) -> Path | None:
|
||||||
|
pose = WORK / "redock_out.pdbqt"
|
||||||
|
subprocess.run([VINA, "--receptor", str(rec), "--ligand", str(lig),
|
||||||
|
"--center_x", f"{center[0]:.2f}", "--center_y", f"{center[1]:.2f}",
|
||||||
|
"--center_z", f"{center[2]:.2f}", "--size_x", f"{size[0]:.1f}",
|
||||||
|
"--size_y", f"{size[1]:.1f}", "--size_z", f"{size[2]:.1f}",
|
||||||
|
"--exhaustiveness", "16", "--out", str(pose)], capture_output=True, text=True)
|
||||||
|
if not pose.exists():
|
||||||
|
return None
|
||||||
|
top = WORK / "redock_top.pdb" # first model only
|
||||||
|
subprocess.run(["obabel", str(pose), "-O", str(top), "-f", "1", "-l", "1"], capture_output=True)
|
||||||
|
return top
|
||||||
|
|
||||||
|
|
||||||
|
def obrms(ref: Path, test: Path) -> float | None:
|
||||||
|
# strip H to compare heavy-atom poses; obrms is automorphism-aware
|
||||||
|
r2, t2 = WORK / "ref_noH.sdf", WORK / "test_noH.sdf"
|
||||||
|
subprocess.run(["obabel", str(ref), "-d", "-O", str(r2)], capture_output=True)
|
||||||
|
subprocess.run(["obabel", str(test), "-d", "-O", str(t2)], capture_output=True)
|
||||||
|
out = subprocess.run(["obrms", str(r2), str(t2)], capture_output=True, text=True).stdout
|
||||||
|
m = re.search(r"(-?\d+\.\d+)", out)
|
||||||
|
return float(m.group(1)) if m else None
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
print("=== A. Redocking-RMSD validation ===")
|
||||||
|
for target, (pdb, resname, drug) in XTAL_LIG.items():
|
||||||
|
rec, center, size = prep_receptor_and_box(pdb, resname)
|
||||||
|
smi = pubchem_smiles(drug)
|
||||||
|
lig = prep_ligand(drug, smi)
|
||||||
|
xtal = extract_crystal_ligand(pdb, resname)
|
||||||
|
pose = dock_pose(rec, lig, center, size)
|
||||||
|
rmsd = obrms(xtal, pose) if pose else None
|
||||||
|
verdict = "PASS (<2A)" if (rmsd is not None and rmsd < 2.0) else \
|
||||||
|
("MARGINAL (<3A)" if (rmsd is not None and rmsd < 3.0) else "FAIL")
|
||||||
|
rstr = f"{rmsd:.2f} A" if rmsd is not None else "NA"
|
||||||
|
print(f" {drug:12s} -> {target:11s} ({pdb}/{resname}): redock RMSD {rstr} {verdict}")
|
||||||
|
|
||||||
|
print("\n=== B. Ligand efficiency (affinity / heavy atoms), de-biasing size ===")
|
||||||
|
# raw affinities from the cross-dock baseline (scripts/dock_positive_controls.py)
|
||||||
|
aff = {"voxelotor": {"hemoglobin": -8.1, "PKR": -9.3}, "mitapivat": {"hemoglobin": -10.0, "PKR": -11.2},
|
||||||
|
"decitabine": {"hemoglobin": -6.6, "PKR": -7.0}, "hydroxyurea": {"hemoglobin": -3.9, "PKR": -3.6},
|
||||||
|
"caffeine": {"hemoglobin": -6.1, "PKR": -6.4}}
|
||||||
|
print(f" {'ligand':12s}{'heavy':>6s}" + "".join(f"{t+' LE':>14s}" for t in TARGETS))
|
||||||
|
for lig, row in aff.items():
|
||||||
|
ha = Chem.MolFromSmiles(pubchem_smiles(lig)).GetNumHeavyAtoms()
|
||||||
|
cells = "".join(f"{row[t]/ha:>14.3f}" for t in TARGETS)
|
||||||
|
print(f" {lig:12s}{ha:>6d}{cells}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
137
scripts/exp_deconv_signature.py
Normal file
137
scripts/exp_deconv_signature.py
Normal file
@@ -0,0 +1,137 @@
|
|||||||
|
"""Experiment: composition-adjusted sickle signature, to fix specificity (option 1).
|
||||||
|
|
||||||
|
The v1 signature is confounded by cell composition (SS patients have very different WBC/RBC
|
||||||
|
than AA controls). GSE35007 *measured* those counts per sample, so we adjust the differential
|
||||||
|
expression for them directly (a measured-covariate alternative to estimated deconvolution):
|
||||||
|
|
||||||
|
expression ~ disease + WBC + RBC + MCV + age + sex (per gene, vectorized OLS)
|
||||||
|
|
||||||
|
We compare the composition-ADJUSTED signature against an UNADJUSTED single-study signature
|
||||||
|
(same samples, model without the covariates), score both with the v1.1 engine (full gene space
|
||||||
|
+ spec_z), and report the recovery test for each. Writes nothing to committed artifacts.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import warnings
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import GEOparse
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from scipy.stats import false_discovery_control
|
||||||
|
from scipy.stats import t as tdist
|
||||||
|
|
||||||
|
import sys
|
||||||
|
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||||
|
from src.disease import collapse_probes_to_symbols # noqa: E402
|
||||||
|
from src.scoring import tau_calibrate # noqa: E402
|
||||||
|
|
||||||
|
warnings.filterwarnings("ignore")
|
||||||
|
PROCESSED = Path("data/processed")
|
||||||
|
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
|
||||||
|
SYMBOL_COLS = ["Symbol", "ILMN_Gene", "Gene Symbol", "GeneSymbol"]
|
||||||
|
|
||||||
|
|
||||||
|
def cval(gsm, key):
|
||||||
|
for c in gsm.metadata.get("characteristics_ch1", []):
|
||||||
|
if c.lower().startswith(key.lower()):
|
||||||
|
return c.split(":", 1)[1].strip()
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def load_data():
|
||||||
|
gse = GEOparse.get_GEO(geo="GSE35007", destdir="data/raw/geo", silent=True)
|
||||||
|
meta = []
|
||||||
|
for gid, g in gse.gsms.items():
|
||||||
|
meta.append({"gsm": gid, "hb": cval(g, "hb phenotype"), "wbc": cval(g, "white blood cells"),
|
||||||
|
"rbc": cval(g, "red blood cells"), "mcv": cval(g, "mean corpuscular volume"),
|
||||||
|
"age": cval(g, "age"), "sex": cval(g, "Sex")})
|
||||||
|
meta = pd.DataFrame(meta).set_index("gsm")
|
||||||
|
for c in ["wbc", "rbc", "mcv", "age"]:
|
||||||
|
meta[c] = pd.to_numeric(meta[c], errors="coerce")
|
||||||
|
meta["disease"] = meta["hb"].map({"SS": 1.0, "AA": 0.0})
|
||||||
|
meta["sex_m"] = (meta["sex"] == "M").astype(float)
|
||||||
|
keep = meta[meta["hb"].isin(["SS", "AA"])].dropna(subset=["wbc", "rbc", "mcv", "age", "disease"])
|
||||||
|
|
||||||
|
expr = pd.DataFrame({gid: gse.gsms[gid].table.set_index("ID_REF")["VALUE"] for gid in keep.index})
|
||||||
|
expr = expr.apply(pd.to_numeric, errors="coerce").dropna(how="any")
|
||||||
|
if float(np.nanmax(expr.to_numpy())) > 50:
|
||||||
|
expr = np.log2(expr.clip(lower=0) + 1.0)
|
||||||
|
|
||||||
|
gpl = list(gse.gpls.values())[0]
|
||||||
|
col = next((c for c in SYMBOL_COLS if c in gpl.table.columns), None)
|
||||||
|
sym = gpl.table.set_index("ID")[col].astype(str).str.strip().replace({"": np.nan, "nan": np.nan})
|
||||||
|
return expr, keep, sym.dropna()
|
||||||
|
|
||||||
|
|
||||||
|
def ols_de(expr, design, disease_idx):
|
||||||
|
"""Vectorized per-gene OLS; return DE table (log_fc=disease coef, pvalue, qvalue)."""
|
||||||
|
X = design.to_numpy(dtype=float)
|
||||||
|
Y = expr.T.to_numpy(dtype=float) # samples x genes
|
||||||
|
n, p = X.shape
|
||||||
|
XtX_inv = np.linalg.pinv(X.T @ X)
|
||||||
|
B = XtX_inv @ X.T @ Y
|
||||||
|
resid = Y - X @ B
|
||||||
|
dof = n - p
|
||||||
|
sigma2 = (resid ** 2).sum(0) / dof
|
||||||
|
se = np.sqrt(sigma2 * XtX_inv[disease_idx, disease_idx])
|
||||||
|
t = B[disease_idx] / se
|
||||||
|
pval = 2 * tdist.sf(np.abs(t), dof)
|
||||||
|
out = pd.DataFrame({"log_fc": B[disease_idx], "pvalue": pval}, index=expr.index).dropna()
|
||||||
|
out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def make_signature(de, sym, expr, top_n=250):
|
||||||
|
de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
|
||||||
|
sig = de_sym[de_sym["qvalue"] < 0.05]
|
||||||
|
up = sig[sig["log_fc"] > 0].nsmallest(top_n, "qvalue").index.tolist()
|
||||||
|
down = sig[sig["log_fc"] < 0].nsmallest(top_n, "qvalue").index.tolist()
|
||||||
|
return up, down
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate(label, up, down, lincs):
|
||||||
|
ranked = tau_calibrate(up, down, lincs, n_null=1000)
|
||||||
|
n = len(ranked)
|
||||||
|
top10, top25, half = int(n * .10), int(n * .25), n // 2
|
||||||
|
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
|
||||||
|
ranked = ranked.join(profiles[["inclusion_reason"]])
|
||||||
|
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
|
||||||
|
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
|
||||||
|
n_bottom = sum(r > half for r in negs.values())
|
||||||
|
n_ov = len((set(up) | set(down)) & set(lincs.columns))
|
||||||
|
print(f"\n### {label}: {len(up)} up / {len(down)} down (query overlap {n_ov})")
|
||||||
|
print(f" hydroxyurea rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}")
|
||||||
|
print(f" L-glutamine rank {glut}/{n} (top {100*glut/n:.1f}%) top-25%? {glut <= top25}")
|
||||||
|
print(f" neg controls bottom-half: {n_bottom}/5 {negs}")
|
||||||
|
print(" top 8: " + ", ".join(
|
||||||
|
f"{name}[{r['inclusion_reason'][:3]}]" for name, r in ranked.nsmallest(8, "spec_z").iterrows()))
|
||||||
|
return ranked
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
expr, meta, sym = load_data()
|
||||||
|
print(f"loaded {expr.shape[1]} samples x {expr.shape[0]} probes; "
|
||||||
|
f"{int(meta.disease.sum())} SS / {int((meta.disease==0).sum())} AA")
|
||||||
|
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
|
||||||
|
|
||||||
|
base = pd.DataFrame({"intercept": 1.0, "disease": meta["disease"]}, index=meta.index)
|
||||||
|
adj = base.assign(wbc=meta["wbc"], rbc=meta["rbc"], mcv=meta["mcv"], age=meta["age"], sex_m=meta["sex_m"])
|
||||||
|
|
||||||
|
de_unadj = ols_de(expr, base, disease_idx=1)
|
||||||
|
de_adj = ols_de(expr, adj, disease_idx=1)
|
||||||
|
|
||||||
|
up_u, dn_u = make_signature(de_unadj, sym, expr)
|
||||||
|
up_a, dn_a = make_signature(de_adj, sym, expr)
|
||||||
|
|
||||||
|
# how much does adjustment change the gene set?
|
||||||
|
overlap = len((set(up_u) | set(dn_u)) & (set(up_a) | set(dn_a)))
|
||||||
|
print(f"\nsignature gene overlap unadjusted vs adjusted: {overlap}/{len(set(up_u)|set(dn_u))}")
|
||||||
|
|
||||||
|
evaluate("UNADJUSTED (GSE35007 only)", up_u, dn_u, lincs)
|
||||||
|
evaluate("COMPOSITION-ADJUSTED", up_a, dn_a, lincs)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
102
scripts/exp_genespace.py
Normal file
102
scripts/exp_genespace.py
Normal file
@@ -0,0 +1,102 @@
|
|||||||
|
"""Experiment (v1.1): re-score on a larger LINCS gene space and re-run the recovery test.
|
||||||
|
|
||||||
|
v1 used only the 978 landmark genes (12% signature overlap). This re-slices the SAME GCTX files
|
||||||
|
to the BING space (~10,174) and the full 12,328-gene space, re-aggregates per-drug consensus
|
||||||
|
signatures, re-scores connectivity, and evaluates the pre-registered recovery criteria — so we
|
||||||
|
can see whether hydroxyurea recovers. Writes nothing to the committed v1 artifacts.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import gzip
|
||||||
|
import io
|
||||||
|
import json
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
import sys
|
||||||
|
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||||
|
from src.scoring import rank_drugs # noqa: E402
|
||||||
|
|
||||||
|
LINCS = Path("data/raw/lincs")
|
||||||
|
PROCESSED = Path("data/processed")
|
||||||
|
GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
|
||||||
|
SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
|
||||||
|
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
|
||||||
|
|
||||||
|
|
||||||
|
def read_gz(name):
|
||||||
|
return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
|
||||||
|
|
||||||
|
|
||||||
|
def gene_ids_for_space(space: str):
|
||||||
|
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
|
||||||
|
if space == "bing":
|
||||||
|
g = g[g.pr_is_bing == 1]
|
||||||
|
# 'all' -> keep everything
|
||||||
|
ids = [str(x) for x in g.pr_gene_id]
|
||||||
|
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
|
||||||
|
return ids, id_to_symbol
|
||||||
|
|
||||||
|
|
||||||
|
def extract(space, drug_names, gene_ids, id_to_symbol):
|
||||||
|
from cmapPy.pandasGEXpress.parse import parse
|
||||||
|
frames = []
|
||||||
|
for ph in (1, 2):
|
||||||
|
sig = read_gz(SIG_INFO[ph])
|
||||||
|
sig = sig[(sig.pert_type == "trt_cp") & (sig.pert_iname.isin(drug_names))]
|
||||||
|
if sig.empty:
|
||||||
|
continue
|
||||||
|
gct = parse(str(GCTX[ph]), rid=gene_ids, cid=sig.sig_id.tolist())
|
||||||
|
data = gct.data_df
|
||||||
|
s2d = dict(zip(sig.sig_id, sig.pert_iname))
|
||||||
|
frames.append(data.T.groupby(data.columns.map(s2d)).mean())
|
||||||
|
print(f" [{space}] phase {ph}: {sig.pert_iname.nunique()} drugs sliced", flush=True)
|
||||||
|
combined = pd.concat(frames).groupby(level=0).mean()
|
||||||
|
combined.columns = [id_to_symbol.get(c, c) for c in combined.columns]
|
||||||
|
combined = combined.loc[:, ~combined.columns.duplicated()] # drop dup symbols
|
||||||
|
return combined
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate(space, sig_matrix, up, down):
|
||||||
|
landmark_overlap = None
|
||||||
|
ranked = rank_drugs(up, down, sig_matrix)
|
||||||
|
n = len(ranked)
|
||||||
|
top10, top25, half = int(n * 0.10), int(n * 0.25), n // 2
|
||||||
|
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
|
||||||
|
ranked = ranked.join(profiles[["inclusion_reason"]])
|
||||||
|
|
||||||
|
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
|
||||||
|
glut_s = ranked.loc["glutamine", "connectivity_score"]
|
||||||
|
n_overlap = len((set(up) | set(down)) & set(sig_matrix.columns))
|
||||||
|
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
|
||||||
|
n_bottom = sum(r > half for r in negs.values())
|
||||||
|
|
||||||
|
print(f"\n=== gene space: {space.upper()} ({sig_matrix.shape[1]} genes; query overlap {n_overlap}) ===")
|
||||||
|
print(f" hydroxyurea: rank {hu}/{n} (top {100*hu/n:.1f}%) top-10%? {hu <= top10}")
|
||||||
|
print(f" L-glutamine: rank {glut}/{n} (top {100*glut/n:.1f}%), WTCS={glut_s:.3f} top-25%? {glut <= top25}")
|
||||||
|
print(f" neg controls in bottom half: {n_bottom}/5 {negs}")
|
||||||
|
crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
|
||||||
|
print(f" OVERALL: {'PASS' if crit else 'FAIL'}")
|
||||||
|
print(" top 8:")
|
||||||
|
for name, r in ranked.nsmallest(8, "connectivity_score").iterrows():
|
||||||
|
print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} [{r['inclusion_reason']}]")
|
||||||
|
return ranked
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
|
||||||
|
up = [g["gene"] for g in sig["up_regulated"]]
|
||||||
|
down = [g["gene"] for g in sig["down_regulated"]]
|
||||||
|
drug_names = set(pd.read_csv(PROCESSED / "drug_set_v1.csv").pert_iname)
|
||||||
|
|
||||||
|
for space in ("bing", "all"):
|
||||||
|
print(f"\n>>> extracting {space} ...", flush=True)
|
||||||
|
ids, id2sym = gene_ids_for_space(space)
|
||||||
|
mat = extract(space, drug_names, ids, id2sym)
|
||||||
|
evaluate(space, mat, up, down)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
118
scripts/phaseA_reference_tau.py
Normal file
118
scripts/phaseA_reference_tau.py
Normal file
@@ -0,0 +1,118 @@
|
|||||||
|
"""Phase A: calibrate sickle connectivity against a REAL disease-signature reference population.
|
||||||
|
|
||||||
|
The v1.1 tau saturated because it used random gene-set nulls. Proper specificity calibration
|
||||||
|
asks: does a drug reverse SICKLE more than it reverses diseases in general? We download a library
|
||||||
|
of real disease signatures (Enrichr "Disease Signatures from GEO", up+down), compute each drug's
|
||||||
|
connectivity to every reference disease, and express its sickle connectivity as a z-score within
|
||||||
|
that per-drug reference distribution. Broad-effect drugs (reverse everything) -> z~0 -> down-ranked.
|
||||||
|
|
||||||
|
Re-runs the recovery test and compares to v1.1 (random-null). Writes nothing committed.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
import requests
|
||||||
|
|
||||||
|
import sys
|
||||||
|
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||||
|
from src.scoring import _ks_connectivity # noqa: E402
|
||||||
|
|
||||||
|
PROCESSED = Path("data/processed")
|
||||||
|
RAW = Path("data/raw/disease_sigs")
|
||||||
|
ENRICHR = "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName={}"
|
||||||
|
LIBS = {"up": "Disease_Signatures_from_GEO_up_2014", "down": "Disease_Signatures_from_GEO_down_2014"}
|
||||||
|
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_gmt(name: str) -> dict[str, list[str]]:
|
||||||
|
RAW.mkdir(parents=True, exist_ok=True)
|
||||||
|
path = RAW / f"{name}.gmt"
|
||||||
|
if not path.exists():
|
||||||
|
r = requests.get(ENRICHR.format(name), timeout=120)
|
||||||
|
r.raise_for_status()
|
||||||
|
path.write_text(r.text)
|
||||||
|
out = {}
|
||||||
|
for line in path.read_text().splitlines():
|
||||||
|
parts = line.split("\t")
|
||||||
|
if len(parts) < 3:
|
||||||
|
continue
|
||||||
|
term = parts[0]
|
||||||
|
genes = [g.split(",")[0].strip().upper() for g in parts[2:] if g.strip()]
|
||||||
|
out[term] = genes
|
||||||
|
print(f" {name}: {path.stat().st_size/1e6:.1f} MB, {len(out)} terms")
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def build_reference() -> list[dict]:
|
||||||
|
up = fetch_gmt(LIBS["up"])
|
||||||
|
down = fetch_gmt(LIBS["down"])
|
||||||
|
shared = set(up) & set(down)
|
||||||
|
refs = []
|
||||||
|
for term in shared:
|
||||||
|
if "sickle" in term.lower():
|
||||||
|
continue # exclude the target disease from its own reference population
|
||||||
|
refs.append({"name": term, "up": up[term], "down": down[term]})
|
||||||
|
print(f"reference disease signatures (paired up+down, sickle excluded): {len(refs)}")
|
||||||
|
return refs
|
||||||
|
|
||||||
|
|
||||||
|
def cols_for(genes, gene_to_col):
|
||||||
|
return np.array([gene_to_col[g] for g in set(genes) if g in gene_to_col], dtype=int)
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
import json
|
||||||
|
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
|
||||||
|
sk_up = [g["gene"] for g in sig["up_regulated"]]
|
||||||
|
sk_down = [g["gene"] for g in sig["down_regulated"]]
|
||||||
|
|
||||||
|
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
|
||||||
|
genes = list(lincs.columns)
|
||||||
|
gene_to_col = {g: i for i, g in enumerate(genes)}
|
||||||
|
n = len(genes)
|
||||||
|
R = lincs.rank(axis=1, ascending=False).to_numpy()
|
||||||
|
|
||||||
|
refs = build_reference()
|
||||||
|
# connectivity of every drug to every reference disease -> (n_drugs, n_refs)
|
||||||
|
C = np.empty((R.shape[0], len(refs)))
|
||||||
|
for j, d in enumerate(refs):
|
||||||
|
C[:, j] = _ks_connectivity(R, cols_for(d["up"], gene_to_col), cols_for(d["down"], gene_to_col), n)
|
||||||
|
# drop reference diseases with too few mapped genes (degenerate columns)
|
||||||
|
mapped = np.array([len(cols_for(d["up"], gene_to_col)) + len(cols_for(d["down"], gene_to_col)) for d in refs])
|
||||||
|
keep = mapped >= 10
|
||||||
|
C = C[:, keep]
|
||||||
|
print(f"usable reference diseases (>=10 mapped genes): {keep.sum()}")
|
||||||
|
|
||||||
|
real = _ks_connectivity(R, cols_for(sk_up, gene_to_col), cols_for(sk_down, gene_to_col), n)
|
||||||
|
ref_mean, ref_std = C.mean(axis=1), C.std(axis=1)
|
||||||
|
ref_std[ref_std == 0] = np.nan
|
||||||
|
spec_z = (real - ref_mean) / ref_std # negative = reverses sickle more than diseases-in-general
|
||||||
|
|
||||||
|
ranked = pd.DataFrame({"spec_z": spec_z, "connectivity": real}, index=lincs.index).sort_values("spec_z")
|
||||||
|
ranked.insert(0, "rank", range(1, len(ranked) + 1))
|
||||||
|
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
|
||||||
|
ranked = ranked.join(profiles[["inclusion_reason"]])
|
||||||
|
|
||||||
|
N = len(ranked)
|
||||||
|
top10, top25, half = int(N * .10), int(N * .25), N // 2
|
||||||
|
hu, glut = int(ranked.loc["hydroxyurea", "rank"]), int(ranked.loc["glutamine", "rank"])
|
||||||
|
negs = {d: int(ranked.loc[d, "rank"]) for d in NEG5 if d in ranked.index}
|
||||||
|
n_bottom = sum(r > half for r in negs.values())
|
||||||
|
|
||||||
|
print("\n==== RECOVERY TEST (reference-calibrated tau) ====")
|
||||||
|
print(f" hydroxyurea rank {hu}/{N} (top {100*hu/N:.1f}%) top-10%? {hu <= top10} z={ranked.loc['hydroxyurea','spec_z']:.2f}")
|
||||||
|
print(f" L-glutamine rank {glut}/{N} (top {100*glut/N:.1f}%) top-25%? {glut <= top25}")
|
||||||
|
print(f" neg controls bottom-half: {n_bottom}/5 {negs}")
|
||||||
|
crit = (hu <= top10) and (glut <= top25) and (n_bottom >= 4)
|
||||||
|
print(f" OVERALL: {'PASS' if crit else 'FAIL'}")
|
||||||
|
print("\n top 12 by reference-calibrated z:")
|
||||||
|
for name, r in ranked.nsmallest(12, "spec_z").iterrows():
|
||||||
|
print(f" {int(r['rank']):2d} {name:20s} z={r['spec_z']:6.2f} [{r['inclusion_reason']}]")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
137
scripts/phaseD_supervised.py
Normal file
137
scripts/phaseD_supervised.py
Normal file
@@ -0,0 +1,137 @@
|
|||||||
|
"""Phase D: supervised cross-disease repurposing — can labels break the specificity ceiling?
|
||||||
|
|
||||||
|
Connectivity alone can't tell therapeutic from coincidental reversal. A supervised model trained
|
||||||
|
on known drug-disease pairs CAN learn that pattern — given features that expose drug "broadness"
|
||||||
|
(a drug that reverses everything is non-specific). We train on 839 GEO disease signatures with
|
||||||
|
Repurposing-Hub indications as labels, evaluate with disease-grouped CV, then apply to HELD-OUT
|
||||||
|
sickle and check whether the coincidental reversers (norethindrone, ciprofloxacin) finally drop.
|
||||||
|
|
||||||
|
Baseline = rank by raw connectivity (the single conn feature). Win = model down-ranks the
|
||||||
|
negative controls vs baseline while keeping hydroxyurea high.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from sklearn.ensemble import GradientBoostingClassifier
|
||||||
|
from sklearn.model_selection import GroupKFold, cross_val_predict
|
||||||
|
from sklearn.metrics import roc_auc_score
|
||||||
|
|
||||||
|
import sys
|
||||||
|
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||||
|
from src.scoring import _ks_connectivity # noqa: E402
|
||||||
|
|
||||||
|
PROCESSED = Path("data/processed")
|
||||||
|
SIGS = Path("data/raw/disease_sigs")
|
||||||
|
NEG5 = ["clotrimazole", "astemizole", "azithromycin", "ethinyl-estradiol", "caffeine"]
|
||||||
|
ID_RE = re.compile(r"\s*(c\d{6,}|doid[-\s]?\d+|gse\d+|umls\S+)\s*", re.I)
|
||||||
|
|
||||||
|
|
||||||
|
def clean_disease(term: str) -> str:
|
||||||
|
return ID_RE.sub(" ", term).strip().lower()
|
||||||
|
|
||||||
|
|
||||||
|
def load_disease_sigs():
|
||||||
|
def parse(name):
|
||||||
|
out = {}
|
||||||
|
for line in (SIGS / name).read_text().splitlines():
|
||||||
|
p = line.split("\t")
|
||||||
|
if len(p) < 3:
|
||||||
|
continue
|
||||||
|
out[p[0]] = [g.split(",")[0].strip().upper() for g in p[2:] if g.strip()]
|
||||||
|
return out
|
||||||
|
up = parse("Disease_Perturbations_from_GEO_up.gmt")
|
||||||
|
down = parse("Disease_Perturbations_from_GEO_down.gmt")
|
||||||
|
terms = sorted(set(up) & set(down))
|
||||||
|
return [{"term": t, "disease": clean_disease(t), "up": up[t], "down": down[t]} for t in terms]
|
||||||
|
|
||||||
|
|
||||||
|
def cols_for(genes, g2c):
|
||||||
|
return np.array([g2c[g] for g in set(genes) if g in g2c], dtype=int)
|
||||||
|
|
||||||
|
|
||||||
|
def featurize(conn_col, C):
|
||||||
|
"""Per-(drug,disease) features for one disease column conn_col, given full matrix C."""
|
||||||
|
drug_mean, drug_std = C.mean(1), C.std(1)
|
||||||
|
drug_min = C.min(1)
|
||||||
|
broad = (C < -0.10).mean(1) # fraction of diseases this drug strongly reverses
|
||||||
|
dmean, dstd = conn_col.mean(), conn_col.std() or 1.0
|
||||||
|
return np.column_stack([
|
||||||
|
conn_col, drug_mean, drug_std, drug_min, broad,
|
||||||
|
np.full_like(conn_col, dmean),
|
||||||
|
(conn_col - drug_mean) / np.where(drug_std == 0, 1, drug_std), # specificity within drug
|
||||||
|
(conn_col - dmean) / dstd, # specificity within disease
|
||||||
|
])
|
||||||
|
|
||||||
|
|
||||||
|
FEATS = ["conn", "drug_mean", "drug_std", "drug_min", "broadness",
|
||||||
|
"disease_mean", "z_within_drug", "z_within_disease"]
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
lincs = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet")
|
||||||
|
drugs = list(lincs.index)
|
||||||
|
g2c = {g: i for i, g in enumerate(lincs.columns)}
|
||||||
|
n = len(lincs.columns)
|
||||||
|
R = lincs.rank(axis=1, ascending=False).to_numpy()
|
||||||
|
|
||||||
|
refs = [d for d in load_disease_sigs()
|
||||||
|
if len(cols_for(d["up"], g2c)) + len(cols_for(d["down"], g2c)) >= 10]
|
||||||
|
C = np.column_stack([_ks_connectivity(R, cols_for(d["up"], g2c), cols_for(d["down"], g2c), n) for d in refs])
|
||||||
|
print(f"{len(drugs)} drugs x {len(refs)} disease signatures; connectivity matrix {C.shape}")
|
||||||
|
|
||||||
|
# labels from Repurposing Hub indications
|
||||||
|
hub = pd.read_csv("data/raw/repurposing_drugs.txt", sep="\t", comment="!", low_memory=False)
|
||||||
|
ind = {r.pert_iname: [i.strip().lower() for i in re.split(r"[|,]", r.indication) if len(i.strip()) > 3]
|
||||||
|
for r in hub.itertuples() if isinstance(r.indication, str)}
|
||||||
|
drug_idx = {d: i for i, d in enumerate(drugs)}
|
||||||
|
|
||||||
|
X_rows, y, grp = [], [], []
|
||||||
|
for j, d in enumerate(refs):
|
||||||
|
feats = featurize(C[:, j], C)
|
||||||
|
dz = d["disease"]
|
||||||
|
for i, drug in enumerate(drugs):
|
||||||
|
inds = ind.get(drug, [])
|
||||||
|
label = int(any(dz == k or (len(dz) > 4 and dz in k) or (len(k) > 4 and k in dz) for k in inds))
|
||||||
|
X_rows.append(feats[i]); y.append(label); grp.append(j)
|
||||||
|
X = np.array(X_rows); y = np.array(y); grp = np.array(grp)
|
||||||
|
print(f"pairs: {len(y)}, positives: {y.sum()} ({100*y.mean():.2f}%)")
|
||||||
|
|
||||||
|
# disease-grouped CV (generalize to unseen diseases)
|
||||||
|
clf = GradientBoostingClassifier(random_state=0)
|
||||||
|
proba = cross_val_predict(clf, X, y, cv=GroupKFold(5), groups=grp, method="predict_proba")[:, 1]
|
||||||
|
print(f"disease-grouped CV AUC: {roc_auc_score(y, proba):.3f} (conn-only AUC: {roc_auc_score(y, X[:,0]*-1):.3f})")
|
||||||
|
|
||||||
|
clf.fit(X, y)
|
||||||
|
print("feature importances: " + ", ".join(f"{f}={imp:.2f}" for f, imp in
|
||||||
|
sorted(zip(FEATS, clf.feature_importances_), key=lambda t: -t[1])))
|
||||||
|
|
||||||
|
# apply to HELD-OUT sickle
|
||||||
|
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
|
||||||
|
sk = _ks_connectivity(R, cols_for([g["gene"] for g in sig["up_regulated"]], g2c),
|
||||||
|
cols_for([g["gene"] for g in sig["down_regulated"]], g2c), n)
|
||||||
|
Xs = featurize(sk, C)
|
||||||
|
p_sickle = clf.predict_proba(Xs)[:, 1]
|
||||||
|
|
||||||
|
res = pd.DataFrame({"model_p": p_sickle, "conn": sk}, index=drugs)
|
||||||
|
res["model_rank"] = res["model_p"].rank(ascending=False).astype(int)
|
||||||
|
res["conn_rank"] = res["conn"].rank(ascending=True).astype(int) # most negative = best
|
||||||
|
N = len(res)
|
||||||
|
print(f"\n{'drug':14s} {'model_rank':>10s} {'conn_rank(base)':>16s}")
|
||||||
|
for d in ["hydroxyurea", "glutamine"] + NEG5 + ["norethindrone", "ciprofloxacin"]:
|
||||||
|
if d in res.index:
|
||||||
|
r = res.loc[d]
|
||||||
|
print(f" {d:14s} {int(r['model_rank']):6d}/{N} {int(r['conn_rank']):6d}/{N}")
|
||||||
|
nb_model = sum(res.loc[d, "model_rank"] > N/2 for d in NEG5 if d in res.index)
|
||||||
|
nb_conn = sum(res.loc[d, "conn_rank"] > N/2 for d in NEG5 if d in res.index)
|
||||||
|
print(f"\nneg controls bottom-half: model {nb_model}/5 vs baseline {nb_conn}/5")
|
||||||
|
print("model top 10:", ", ".join(res.sort_values('model_p', ascending=False).head(10).index))
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
@@ -36,9 +36,15 @@ def read_gz_tsv(name: str) -> pd.DataFrame:
|
|||||||
|
|
||||||
|
|
||||||
def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
|
def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
|
||||||
lm = pd.read_csv(LINCS / "landmark_genes.csv")
|
"""Gene row-ids + id->symbol map for the scored gene space.
|
||||||
ids = [str(x) for x in lm["pr_gene_id"]]
|
|
||||||
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in lm.itertuples()}
|
v1.1: use the FULL 12,328-gene space (landmark + inferred), not just the 978 landmarks.
|
||||||
|
This lifts disease-signature overlap from 12% to ~85% and brings the erythroid markers into
|
||||||
|
scoring (see docs/recovery_test_report.md). Inferred genes are model-predicted (noisier).
|
||||||
|
"""
|
||||||
|
g = pd.read_csv(LINCS / "GSE92742_gene_info.txt.gz", sep="\t")
|
||||||
|
ids = [str(x) for x in g["pr_gene_id"]]
|
||||||
|
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in g.itertuples()}
|
||||||
return ids, id_to_symbol
|
return ids, id_to_symbol
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -1,13 +1,11 @@
|
|||||||
"""Week 3: run connectivity scoring over all drugs -> ranked_candidates_v1.csv (PLAN §6).
|
"""Week 3 (v1.1): connectivity scoring over the full gene space with tau calibration.
|
||||||
|
|
||||||
Loads the disease signature + the 300 drug LINCS signatures, computes the weighted-KS
|
Primary ranking is now **tau** (KS connectivity expressed as a signed percentile vs a null of
|
||||||
connectivity score per drug, and produces two rankings:
|
random queries) — this calibrates out broad-effect drugs that connect to random signatures too,
|
||||||
1. raw connectivity (most negative = strongest reversal = rank 1)
|
the specificity fix. The weighted connectivity score (WTCS) is retained as a reference column,
|
||||||
2. a secondary ranking blending connectivity with a mechanistic prior (sickle-relevant
|
and a secondary ranking blends tau with the sickle-pathway mechanistic prior.
|
||||||
target pathways), to temper broad-effect drugs (HDAC/kinase) that dominate raw rankings.
|
|
||||||
|
|
||||||
The formal recovery test (ground-truth + negative-control evaluation against the pre-registered
|
Output: data/results/ranked_candidates_v1.csv.
|
||||||
criteria) is Week 4; this script only prints a sanity peek.
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
@@ -19,10 +17,13 @@ import pandas as pd
|
|||||||
|
|
||||||
import sys
|
import sys
|
||||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||||
from src.scoring import mechanistic_prior, persist_ranking, rank_drugs # noqa: E402
|
from src.scoring import ( # noqa: E402
|
||||||
|
connectivity_score, mechanistic_prior, normalize_scores, persist_ranking, tau_calibrate,
|
||||||
|
)
|
||||||
|
|
||||||
PROCESSED = Path("data/processed")
|
PROCESSED = Path("data/processed")
|
||||||
PRIOR_LAMBDA = 0.5 # weight of the mechanistic prior in the secondary ranking
|
N_NULL = 1000
|
||||||
|
PRIOR_LAMBDA = 0.5 # spec_z credit per matched sickle pathway, for the blended ranking
|
||||||
|
|
||||||
|
|
||||||
def main() -> None:
|
def main() -> None:
|
||||||
@@ -30,52 +31,51 @@ def main() -> None:
|
|||||||
up = [g["gene"] for g in sig["up_regulated"]]
|
up = [g["gene"] for g in sig["up_regulated"]]
|
||||||
down = [g["gene"] for g in sig["down_regulated"]]
|
down = [g["gene"] for g in sig["down_regulated"]]
|
||||||
|
|
||||||
sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 978 symbols
|
sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 12,328 genes
|
||||||
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
|
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
|
||||||
|
|
||||||
landmark = set(sig_matrix.columns)
|
n_up = len(set(up) & set(sig_matrix.columns))
|
||||||
n_up_ov = len(set(up) & landmark)
|
n_down = len(set(down) & set(sig_matrix.columns))
|
||||||
n_down_ov = len(set(down) & landmark)
|
print(f"gene space: {sig_matrix.shape[1]} genes; query overlap {n_up} up + {n_down} down = {n_up + n_down}")
|
||||||
print(f"query overlap with 978 landmarks: {n_up_ov} up + {n_down_ov} down = {n_up_ov + n_down_ov}")
|
|
||||||
print(f"scoring {len(sig_matrix)} drugs (all scored; 0 without signature)")
|
|
||||||
|
|
||||||
ranked = rank_drugs(up, down, sig_matrix)
|
# primary: tau calibration
|
||||||
|
print(f"computing tau over {N_NULL} random-query permutations ...", flush=True)
|
||||||
|
ranked = tau_calibrate(up, down, sig_matrix, n_null=N_NULL)
|
||||||
|
|
||||||
# attach metadata + mechanistic prior
|
# reference: weighted connectivity score (WTCS) + NCS
|
||||||
|
wtcs = pd.Series({d: connectivity_score(up, down, sig_matrix.loc[d]) for d in sig_matrix.index},
|
||||||
|
name="connectivity_score")
|
||||||
|
ranked["connectivity_score"] = wtcs
|
||||||
|
ranked["normalized_score"] = normalize_scores(wtcs)
|
||||||
|
|
||||||
|
# metadata + mechanistic prior
|
||||||
ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
|
ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
|
||||||
ranked["mechanistic_prior"] = ranked["targets"].apply(
|
ranked["mechanistic_prior"] = ranked["targets"].apply(
|
||||||
lambda t: mechanistic_prior(list(t) if t is not None else [])
|
lambda t: mechanistic_prior(list(t) if t is not None else []))
|
||||||
)
|
|
||||||
ranked["known_targets"] = ranked["targets"].apply(
|
ranked["known_targets"] = ranked["targets"].apply(
|
||||||
lambda t: "; ".join(t) if t is not None and len(t) else ""
|
lambda t: "; ".join(t) if t is not None and len(t) else "")
|
||||||
)
|
|
||||||
ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})
|
ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})
|
||||||
|
|
||||||
# secondary, prior-weighted ranking: relevant drugs pushed toward better (more negative)
|
# secondary, prior-weighted ranking (relevant drugs pushed toward more-negative spec_z)
|
||||||
ranked["blended_score"] = ranked["normalized_score"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
|
ranked["blended_score"] = ranked["spec_z"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
|
||||||
ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)
|
ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)
|
||||||
|
|
||||||
out = ranked.rename_axis("drug_name").reset_index()[[
|
out = ranked.rename_axis("drug_name").reset_index()[[
|
||||||
"rank", "drug_name", "chembl_id", "connectivity_score", "normalized_score",
|
"rank", "drug_name", "chembl_id", "spec_z", "tau", "connectivity_ks", "connectivity_score",
|
||||||
"inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
|
"inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
|
||||||
]]
|
]]
|
||||||
path = persist_ranking(out)
|
path = persist_ranking(out)
|
||||||
print(f"wrote {path} ({len(out)} drugs)")
|
print(f"wrote {path} ({len(out)} drugs)")
|
||||||
|
|
||||||
# --- sanity peek (formal recovery test is Week 4) ---
|
print("\n--- sanity peek (spec_z ranking) ---")
|
||||||
print("\n--- sanity peek (raw connectivity rank) ---")
|
|
||||||
for gt in ["hydroxyurea", "glutamine"]:
|
for gt in ["hydroxyurea", "glutamine"]:
|
||||||
r = ranked.loc[gt]
|
r = ranked.loc[gt]
|
||||||
pct = 100 * r["rank"] / len(ranked)
|
print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {100*r['rank']/len(ranked):.0f}%), "
|
||||||
print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {pct:.0f}%), "
|
f"spec_z={r['spec_z']:.2f}")
|
||||||
f"score={r['connectivity_score']:.3f}")
|
print(" top 10 by spec_z:")
|
||||||
neg = ranked[ranked["inclusion_reason"] == "negative_control"]
|
for name, r in ranked.nsmallest(10, "spec_z").iterrows():
|
||||||
print(f" negative controls in bottom half: "
|
print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:6.2f} [{r['inclusion_reason']:16s}] "
|
||||||
f"{(neg['rank'] > len(ranked) / 2).sum()}/{len(neg)}")
|
f"{str(r['known_targets'])[:38]}")
|
||||||
print("\n top 5 raw candidates:")
|
|
||||||
for name, r in ranked.nsmallest(5, "connectivity_score").iterrows():
|
|
||||||
print(f" {int(r['rank']):3d} {name:18s} {r['connectivity_score']:+.3f} "
|
|
||||||
f"[{r['inclusion_reason']}] {r['known_targets'][:50]}")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
|||||||
@@ -36,6 +36,7 @@ def main() -> None:
|
|||||||
return int(df.loc[name, "rank"]) if name in df.index else None
|
return int(df.loc[name, "rank"]) if name in df.index else None
|
||||||
|
|
||||||
hu, glut = rk("hydroxyurea"), rk("glutamine")
|
hu, glut = rk("hydroxyurea"), rk("glutamine")
|
||||||
|
glut_z = df.loc["glutamine", "spec_z"]
|
||||||
|
|
||||||
# pick negative controls present in the ranking
|
# pick negative controls present in the ranking
|
||||||
negs = {}
|
negs = {}
|
||||||
@@ -47,9 +48,8 @@ def main() -> None:
|
|||||||
print("=" * 60)
|
print("=" * 60)
|
||||||
print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
|
print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
|
||||||
print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}")
|
print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}")
|
||||||
glut_score = df.loc["glutamine", "connectivity_score"]
|
print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), spec_z={glut_z:+.2f} "
|
||||||
print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), WTCS={glut_score:.3f} "
|
f"-> top-25%? {glut <= top25_cut} (positive z => does not reverse; has a signature)")
|
||||||
f"-> top-25%? {glut <= top25_cut} (has signature, so NOT 'missing-signature unscorable')")
|
|
||||||
print("\nnegative controls (pre-specified, 1 per category):")
|
print("\nnegative controls (pre-specified, 1 per category):")
|
||||||
n_bottom = 0
|
n_bottom = 0
|
||||||
for d, (cat, r) in negs.items():
|
for d, (cat, r) in negs.items():
|
||||||
@@ -70,10 +70,10 @@ def main() -> None:
|
|||||||
print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
|
print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
|
||||||
f"(top {100*hu_b/n:.1f}%)")
|
f"(top {100*hu_b/n:.1f}%)")
|
||||||
|
|
||||||
print("\n--- TOP 10 (raw connectivity) ---")
|
print("\n--- TOP 10 (primary spec_z ranking) ---")
|
||||||
top10 = df.nsmallest(10, "connectivity_score")
|
top10 = df.sort_values("rank").head(10)
|
||||||
for name, r in top10.iterrows():
|
for name, r in top10.iterrows():
|
||||||
print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} "
|
print(f" {int(r['rank']):2d} {name:18s} z={r['spec_z']:+.2f} "
|
||||||
f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}")
|
f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}")
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
89
src/binding.py
Normal file
89
src/binding.py
Normal file
@@ -0,0 +1,89 @@
|
|||||||
|
"""Structure-based binding track (PLAN §12).
|
||||||
|
|
||||||
|
Two capabilities:
|
||||||
|
- ligand-based retrieval (RDKit, works now): find existing drugs near a query molecule in
|
||||||
|
chemical space — validated, and the engine behind §12.9 generative-guided retrieval.
|
||||||
|
- structure-based docking (§12.3): score whether a ligand binds a target pocket. Blocked on an
|
||||||
|
ARM-Mac docking toolchain (AutoDock Vina does not pip-install); see ``dock`` for options.
|
||||||
|
|
||||||
|
Caveat carried throughout: chemical similarity != activity, and docking != efficacy (§12.6).
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from rdkit import Chem, DataStructs, RDLogger
|
||||||
|
from rdkit.Chem import rdFingerprintGenerator
|
||||||
|
|
||||||
|
RDLogger.DisableLog("rdApp.*")
|
||||||
|
_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
|
||||||
|
|
||||||
|
STRUCT_DIR = Path("data/raw/structures")
|
||||||
|
|
||||||
|
# Known sickle small-molecule binders, by target (positive controls for the §12.4 recovery test).
|
||||||
|
KNOWN_BINDERS = {
|
||||||
|
"hemoglobin": "voxelotor",
|
||||||
|
"PKR": "mitapivat",
|
||||||
|
"DNMT1": "decitabine",
|
||||||
|
"HDAC": "vorinostat",
|
||||||
|
}
|
||||||
|
|
||||||
|
# Curated target structures (PLAN §12.2). Add PDB ids as the harness grows.
|
||||||
|
TARGET_PDB = {
|
||||||
|
"hemoglobin": "5E83", # hemoglobin + voxelotor (GBT440)
|
||||||
|
"PKR": "8XFD", # pyruvate kinase R + mitapivat
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def morgan_fp(smiles: str):
|
||||||
|
"""Morgan (ECFP4) fingerprint, or None for invalid / missing SMILES ('-666', '')."""
|
||||||
|
if not isinstance(smiles, str) or smiles in ("-666", ""):
|
||||||
|
return None
|
||||||
|
mol = Chem.MolFromSmiles(smiles)
|
||||||
|
return _MORGAN.GetFingerprint(mol) if mol else None
|
||||||
|
|
||||||
|
|
||||||
|
def tanimoto(smiles_a: str, smiles_b: str) -> float | None:
|
||||||
|
fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
|
||||||
|
if fa is None or fb is None:
|
||||||
|
return None
|
||||||
|
return DataStructs.TanimotoSimilarity(fa, fb)
|
||||||
|
|
||||||
|
|
||||||
|
def retrieve_nearest(
|
||||||
|
query_smiles: str,
|
||||||
|
library: dict[str, str],
|
||||||
|
top_n: int = 5,
|
||||||
|
) -> list[tuple[float, str]]:
|
||||||
|
"""Rank a library of {name: smiles} by Tanimoto similarity to a query molecule.
|
||||||
|
|
||||||
|
This is the §12.9 retrieval step: the query may be a known binder (positive-control beacon)
|
||||||
|
or a generated idealised binder; the returned existing drugs are repurposing candidates that
|
||||||
|
STILL require docking/validation (similarity != activity).
|
||||||
|
"""
|
||||||
|
qfp = morgan_fp(query_smiles)
|
||||||
|
if qfp is None:
|
||||||
|
raise ValueError("invalid query SMILES")
|
||||||
|
sims = []
|
||||||
|
for name, smi in library.items():
|
||||||
|
fp = morgan_fp(smi)
|
||||||
|
if fp is not None:
|
||||||
|
sims.append((DataStructs.TanimotoSimilarity(qfp, fp), name))
|
||||||
|
return sorted(sims, reverse=True)[:top_n]
|
||||||
|
|
||||||
|
|
||||||
|
def dock(target: str, ligand_smiles: str) -> float:
|
||||||
|
"""Dock a ligand into a target pocket and return a binding score (PLAN §12.3).
|
||||||
|
|
||||||
|
Blocked: AutoDock Vina does not pip-install on ARM Mac and no docking binary is on PATH.
|
||||||
|
Resolve the toolchain first (one of):
|
||||||
|
- conda/micromamba: ``vina`` (conda-forge) or ``smina`` (bioconda), osx-arm64 builds
|
||||||
|
- an AF3-class co-folding model on GPU: Boltz-2 / Chai-1 / DiffDock (also predicts affinity)
|
||||||
|
- x86 Vina binary under Rosetta 2
|
||||||
|
Then: fetch TARGET_PDB[target], define the pocket box, prep the ligand (Meeko), score.
|
||||||
|
"""
|
||||||
|
raise NotImplementedError(
|
||||||
|
"Docking toolchain unresolved on ARM Mac (PLAN §12.6 pitfall 4 / §12.8). "
|
||||||
|
"See docstring for options."
|
||||||
|
)
|
||||||
@@ -16,7 +16,7 @@ from pathlib import Path
|
|||||||
import numpy as np
|
import numpy as np
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
|
|
||||||
from . import RESULTS_DIR
|
from . import RANDOM_SEED, RESULTS_DIR
|
||||||
|
|
||||||
# Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
|
# Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
|
||||||
# Keys are pathway categories; values are substrings matched (case-insensitive) against a
|
# Keys are pathway categories; values are substrings matched (case-insensitive) against a
|
||||||
@@ -134,6 +134,92 @@ def mechanistic_prior(targets: list[str]) -> float:
|
|||||||
return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))
|
return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))
|
||||||
|
|
||||||
|
|
||||||
|
# --- KS connectivity + tau calibration (v1.1) -----------------------------------------------
|
||||||
|
# Unweighted Kolmogorov-Smirnov connectivity (Lamb 2006) is O(k) per query (depends only on the
|
||||||
|
# ranks of the query genes), which makes a permutation null over many random queries cheap. tau
|
||||||
|
# expresses each drug's real connectivity as a signed percentile within its own null — so
|
||||||
|
# broad-effect drugs that connect to *random* signatures too get down-weighted (specificity).
|
||||||
|
|
||||||
|
|
||||||
|
def _ks_es(rank_matrix: np.ndarray, query_cols: np.ndarray, n_genes: int) -> np.ndarray:
|
||||||
|
"""Vectorized unweighted KS enrichment score of a query gene set for every drug.
|
||||||
|
|
||||||
|
``rank_matrix`` is (n_drugs, n_genes) of 1..N rank positions (1 = most up-regulated).
|
||||||
|
Returns one ES per drug; ES>0 => query enriched among up-regulated genes.
|
||||||
|
"""
|
||||||
|
k = len(query_cols)
|
||||||
|
if k == 0:
|
||||||
|
return np.zeros(rank_matrix.shape[0])
|
||||||
|
p = np.sort(rank_matrix[:, query_cols], axis=1) # (n_drugs, k), positions ascending
|
||||||
|
j = np.arange(1, k + 1)
|
||||||
|
a = (j / k - p / n_genes).max(axis=1)
|
||||||
|
b = (p / n_genes - (j - 1) / k).max(axis=1)
|
||||||
|
return np.where(a >= b, a, -b)
|
||||||
|
|
||||||
|
|
||||||
|
def _ks_connectivity(rank_matrix: np.ndarray, up_cols: np.ndarray, down_cols: np.ndarray,
|
||||||
|
n_genes: int) -> np.ndarray:
|
||||||
|
"""KS connectivity per drug: (ES_up - ES_down)/2. Negative=reversal.
|
||||||
|
|
||||||
|
Note: unlike WTCS, this does NOT zero same-sign (ambiguous) connections — same-sign ES
|
||||||
|
partially cancel and land near 0 naturally. Hard-zeroing would collapse the random-query
|
||||||
|
null to a spike at 0 and make tau saturate, so the continuous form is required for calibration.
|
||||||
|
"""
|
||||||
|
es_up = _ks_es(rank_matrix, up_cols, n_genes)
|
||||||
|
es_down = _ks_es(rank_matrix, down_cols, n_genes)
|
||||||
|
return (es_up - es_down) / 2.0
|
||||||
|
|
||||||
|
|
||||||
|
def tau_calibrate(
|
||||||
|
up_genes: list[str],
|
||||||
|
down_genes: list[str],
|
||||||
|
signature_matrix: pd.DataFrame,
|
||||||
|
n_null: int = 1000,
|
||||||
|
seed: int = RANDOM_SEED,
|
||||||
|
) -> pd.DataFrame:
|
||||||
|
"""Rank drugs by tau: each drug's KS connectivity as a signed percentile vs a null of
|
||||||
|
random queries of the same size (PLAN §6; CMap tau, Subramanian 2017).
|
||||||
|
|
||||||
|
tau in [-100, 100]: -100 => reverses our signature more specifically than any random query
|
||||||
|
(strong, specific candidate); ~0 => connects to our signature no more than to random ones
|
||||||
|
(broad-effect / non-specific). Ranked by tau ascending (rank 1 = most specific reversal).
|
||||||
|
"""
|
||||||
|
genes = list(signature_matrix.columns)
|
||||||
|
gene_to_col = {g: i for i, g in enumerate(genes)}
|
||||||
|
n = len(genes)
|
||||||
|
rank_matrix = signature_matrix.rank(axis=1, ascending=False).to_numpy()
|
||||||
|
|
||||||
|
up_cols = np.array([gene_to_col[g] for g in set(up_genes) if g in gene_to_col], dtype=int)
|
||||||
|
down_cols = np.array([gene_to_col[g] for g in set(down_genes) if g in gene_to_col], dtype=int)
|
||||||
|
real = _ks_connectivity(rank_matrix, up_cols, down_cols, n)
|
||||||
|
|
||||||
|
rng = np.random.default_rng(seed)
|
||||||
|
k_up, k_down = len(up_cols), len(down_cols)
|
||||||
|
null = np.empty((rank_matrix.shape[0], n_null))
|
||||||
|
for m in range(n_null):
|
||||||
|
samp = rng.choice(n, size=k_up + k_down, replace=False)
|
||||||
|
null[:, m] = _ks_connectivity(rank_matrix, samp[:k_up], samp[k_up:], n)
|
||||||
|
|
||||||
|
null_mean = null.mean(axis=1)
|
||||||
|
null_std = null.std(axis=1)
|
||||||
|
null_std[null_std == 0] = np.nan
|
||||||
|
# Per-drug standardized connectivity: how many SDs the real reversal is below what random
|
||||||
|
# queries of the same size produce against THIS drug. Continuous (no percentile floor), so it
|
||||||
|
# discriminates within the strong-reversal tail. Negative = specific reversal.
|
||||||
|
spec_z = (real - null_mean) / null_std
|
||||||
|
|
||||||
|
frac = (null <= real[:, None]).mean(axis=1)
|
||||||
|
tau = 100.0 * (2.0 * frac - 1.0) # retained for reference; saturates at +/-100 in the tail
|
||||||
|
|
||||||
|
df = pd.DataFrame(
|
||||||
|
{"connectivity_ks": real, "null_mean": null_mean, "spec_z": spec_z, "tau": tau},
|
||||||
|
index=signature_matrix.index,
|
||||||
|
)
|
||||||
|
df = df.sort_values("spec_z") # most negative z = most specific reversal
|
||||||
|
df.insert(0, "rank", range(1, len(df) + 1))
|
||||||
|
return df
|
||||||
|
|
||||||
|
|
||||||
def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
|
def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
|
||||||
"""Write the ranked candidate list to ``data/results/ranked_candidates_v1.csv``."""
|
"""Write the ranked candidate list to ``data/results/ranked_candidates_v1.csv``."""
|
||||||
out_path = out_path or (RESULTS_DIR / "ranked_candidates_v1.csv")
|
out_path = out_path or (RESULTS_DIR / "ranked_candidates_v1.csv")
|
||||||
|
|||||||
@@ -114,6 +114,33 @@ class TestMechanisticPrior:
|
|||||||
assert mechanistic_prior(["Some unrelated kinase"]) == 0.0
|
assert mechanistic_prior(["Some unrelated kinase"]) == 0.0
|
||||||
|
|
||||||
|
|
||||||
|
class TestTauCalibration:
|
||||||
|
"""tau should reward a SPECIFIC reverser and give a near-zero score to a noise drug."""
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def _matrix() -> pd.DataFrame:
|
||||||
|
genes = [f"U{i}" for i in range(5)] + [f"D{i}" for i in range(5)] + [f"G{i}" for i in range(40)]
|
||||||
|
rng_vals = {g: 0.01 * ((hash(g) % 7) - 3) for g in genes} # tiny deterministic noise
|
||||||
|
# specific reverser: query-up genes at the bottom, query-down at the top, rest ~0
|
||||||
|
specific = dict(rng_vals)
|
||||||
|
for i in range(5):
|
||||||
|
specific[f"U{i}"] = -8 - i
|
||||||
|
specific[f"D{i}"] = 8 + i
|
||||||
|
noise = dict(rng_vals)
|
||||||
|
return pd.DataFrame([specific, noise], index=["specific", "noise"])[genes]
|
||||||
|
|
||||||
|
def test_specific_reverser_has_strongly_negative_tau(self):
|
||||||
|
from src.scoring import tau_calibrate
|
||||||
|
up = [f"U{i}" for i in range(5)]
|
||||||
|
down = [f"D{i}" for i in range(5)]
|
||||||
|
out = tau_calibrate(up, down, self._matrix(), n_null=300, seed=0)
|
||||||
|
# Ranked by spec_z (continuous); the specific reverser is the most negative.
|
||||||
|
assert out.loc["specific", "spec_z"] < -2
|
||||||
|
assert out.loc["specific", "spec_z"] < out.loc["noise", "spec_z"]
|
||||||
|
assert out.loc["specific", "tau"] < -50 # tau also flags it (may saturate near -100)
|
||||||
|
assert out.loc["specific", "rank"] == 1
|
||||||
|
|
||||||
|
|
||||||
def test_rank_drugs_orders_by_reversal():
|
def test_rank_drugs_orders_by_reversal():
|
||||||
from src.scoring import rank_drugs
|
from src.scoring import rank_drugs
|
||||||
genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]
|
genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]
|
||||||
|
|||||||
Reference in New Issue
Block a user