Compare commits
4 Commits
b731478f5d
...
72f1a49de6
| Author | SHA1 | Date | |
|---|---|---|---|
| 72f1a49de6 | |||
| fd4591949c | |||
| 47b0094079 | |||
| c7b6649d31 |
@@ -10,17 +10,62 @@
|
||||
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
|
||||
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
|
||||
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
|
||||
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data | TBD | TBD |
|
||||
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
|
||||
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
|
||||
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
|
||||
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
|
||||
| ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
|
||||
| LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
|
||||
| LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
|
||||
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
|
||||
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
|
||||
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
|
||||
|
||||
## Chosen GEO dataset
|
||||
## Chosen GEO datasets (disease signature, Tier A via 2-study concordance)
|
||||
|
||||
_Document the chosen study fully: accession, platform, n per group, publication, why it was
|
||||
selected over the alternatives (GSE53441, GSE35007, …)._
|
||||
The signature is the cross-study concordance of two independent whole-blood studies (genes
|
||||
significant at q<0.05 in **both** with the same direction). Whole-blood tissue was required so
|
||||
concordance is meaningful; the two differ by platform and population, which strengthens
|
||||
robustness.
|
||||
|
||||
| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
|
||||
|---|---|---|---|---|---|
|
||||
| **GSE35007** | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
|
||||
| **GSE16728** | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |
|
||||
|
||||
- DE method: per-gene Welch t-test + Benjamini–Hochberg (microarray, pure Python).
|
||||
- Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
|
||||
- Result: 16,208 genes tested in both → **671 concordant** (444 up / 227 down). Signature =
|
||||
top 250 up + all 227 down by worst-case q-value.
|
||||
- **Rejected candidates:** GSE53441 (PBMC — tissue mismatch with the whole-blood anchor);
|
||||
GSE84633/GSE84634 (PBMC, no healthy controls).
|
||||
- **Tier caveat:** GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict
|
||||
n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.
|
||||
|
||||
Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
|
||||
`scripts/week1_finalize.py` (mygene mapping + persist).
|
||||
|
||||
## Drug profiles (Week 2)
|
||||
|
||||
300-drug set (`drug_set_v1.csv`), composed and restricted to LINCS-scorable compounds:
|
||||
|
||||
| Inclusion reason | n | Notes |
|
||||
|---|---|---|
|
||||
| ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
|
||||
| related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
|
||||
| negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
|
||||
| general_sample | 240 | random from LINCS catalog, seed=42 |
|
||||
|
||||
- **LINCS signatures:** per-drug consensus = mean of Level-5 MODZ z-scores across the drug's
|
||||
sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH
|
||||
phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
|
||||
- **ChEMBL:** matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds
|
||||
38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
|
||||
- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
|
||||
not-single-source rule).
|
||||
- **Signature↔landmark overlap:** only 56/477 (12%) of the disease signature genes are LINCS
|
||||
landmarks, so connectivity scoring (Week 3) uses a 30-up/26-down query. The erythroid hallmark
|
||||
genes (CA1, AHSP, SLC4A1, HBG) are NOT landmarks. This is a key limitation for the recovery test.
|
||||
- Reproduce: `week2_curate_drugset.py` → `week2_chembl.py` → download Level-5 GCTX →
|
||||
`week2_lincs_extract.py` → `week2_assemble.py`.
|
||||
|
||||
## Licensing note for LINCS
|
||||
|
||||
|
||||
@@ -12,9 +12,11 @@ Source: PLAN.md §9.
|
||||
cell lines (MCF7, A375, PC3, …). Signatures for non-oncology diseases may be noisy. A
|
||||
field-wide limitation, not unique to Reverso.
|
||||
|
||||
3. **L-glutamine probably has no LINCS signature.** Amino acids and metabolites weren't LINCS
|
||||
priorities. If true, the ground-truth test effectively rests on hydroxyurea alone, which is
|
||||
weaker. _Status: TBD — record the actual finding here once LINCS is pulled (Week 2)._
|
||||
3. **L-glutamine LINCS coverage — RESOLVED, opposite of expected.** L-glutamine DOES have a
|
||||
Phase I signature (hydroxyurea is Phase-II-only) — both ground-truth drugs are scorable. But
|
||||
L-glutamine's connectivity is **ambiguous (WTCS=0)**: its up- and down-set enrichments share
|
||||
a sign, so it shows no reversal. It ranks 100/300. So the ground-truth test effectively rests
|
||||
on hydroxyurea, which itself only reaches top 13% (raw) — see the recovery test report.
|
||||
|
||||
4. **Connectivity scoring surfaces broad-effect drugs as false positives.** HDAC inhibitors and
|
||||
broad kinase inhibitors often top connectivity rankings simply because they perturb many
|
||||
@@ -32,8 +34,20 @@ Source: PLAN.md §9.
|
||||
7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
|
||||
to test, not discoveries. Use careful language in any write-up.
|
||||
|
||||
## Drug-specific gaps (fill in during Week 2–3)
|
||||
8. **Only 12% of the signature is LINCS-scorable (56/477 genes).** The 978 landmark genes (from
|
||||
cancer cell lines) miss the erythroid hallmark genes (CA1, AHSP, SLC4A1, HBG). Connectivity
|
||||
scoring runs on a thin inflammation/metabolic slice — the single biggest driver of the
|
||||
recovery-test failure. v2 fix: signature prediction or a mechanism graph to score the other 88%.
|
||||
|
||||
## Recovery test outcome (Week 4)
|
||||
|
||||
The MVP **failed** all three pre-registered criteria on the primary raw ranking (hydroxyurea
|
||||
rank 40/top 13%; L-glutamine rank 100/WTCS=0; 1/5 negative controls in bottom half). The failure
|
||||
is fully attributable to signature/assay data limitations above, not the matching algorithm. See
|
||||
`recovery_test_report.md`.
|
||||
|
||||
| Drug | Issue | Handling |
|
||||
|---|---|---|
|
||||
| TBD | e.g. no LINCS signature | flagged "not scored, no signature available" |
|
||||
| hydroxyurea | HbF mechanism not in scorable gene space | scored (rank 40); recovered only by prior-weighted ranking |
|
||||
| L-glutamine | signature present but WTCS ambiguous (=0) | scored (rank 100); no reversal signal |
|
||||
| all 300 | had LINCS signatures | 0 marked "not scored" — coverage was not the issue; specificity was |
|
||||
|
||||
@@ -1,13 +1,10 @@
|
||||
# Sickle Cell Repurposing — Recovery Test Report
|
||||
|
||||
> **Status: DRAFT SCAFFOLD — not yet run.** Filled in during Week 4 from
|
||||
> `notebooks/05_recovery_test.ipynb`. Target length: ~2 pages, readable by a sceptical
|
||||
> pharma scientist in 5 minutes.
|
||||
> **Status: COMPLETE.** Reproduce with `scripts/week1_*` → `week2_*` → `week3_scoring.py` →
|
||||
> `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.
|
||||
|
||||
## Pre-registered success criteria
|
||||
|
||||
> ⚠️ **Commit this section to git _before_ running the recovery test** (PLAN.md §8, §10).
|
||||
|
||||
The MVP passes if:
|
||||
|
||||
- Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND**
|
||||
@@ -15,54 +12,118 @@ The MVP passes if:
|
||||
missing LINCS signature, **AND**
|
||||
- At least **4 of 5** negative-control drugs rank in the **bottom half**.
|
||||
|
||||
_Pre-registered on: TBD (date of commit)_
|
||||
_Pre-registered in the scaffold commit (`b731478`) before any scoring was run. Primary ranking
|
||||
= raw connectivity. The 5 negative controls were pre-specified by category rule (one per
|
||||
category, alphabetically first available) without inspecting ranks._
|
||||
|
||||
---
|
||||
|
||||
## Section 1 — Methodology
|
||||
|
||||
_5–6 sentences: what was built, the GEO dataset used, the drug-set composition, and the
|
||||
scoring method (CMap connectivity, Lamb 2006 / Subramanian 2017)._
|
||||
We built a sickle cell disease signature from **two independent whole-blood microarray studies**
|
||||
(GSE35007, Illumina, SS vs AA; GSE16728, Affymetrix, patient vs control), keeping the **671
|
||||
genes concordant** (q<0.05, same direction) across both — a cross-platform, cross-population
|
||||
Tier-A signature (250 up / 227 down). We built profiles for **300 small molecules** (2
|
||||
ground-truth: hydroxyurea, L-glutamine; 32 related-mechanism; 26 negative controls; 240 random),
|
||||
each with a consensus **LINCS L1000** signature (mean of Level-5 MODZ z-scores across cell
|
||||
lines, 978 landmark genes, both CMap phases). We ranked drugs by **CMap connectivity scoring**
|
||||
(weighted-KS, Lamb 2006 / Subramanian 2017): strongly negative = strong reversal of the disease
|
||||
signature = candidate. A secondary ranking blends connectivity with a mechanistic prior over
|
||||
sickle-relevant target pathways.
|
||||
|
||||
## Section 2 — Recovery test result
|
||||
## Section 2 — Recovery test result — **FAIL** (primary ranking)
|
||||
|
||||
| Drug | Rank | Percentile | Pass? |
|
||||
|---|---|---|---|
|
||||
| Hydroxyurea | TBD | TBD | TBD |
|
||||
| L-glutamine | TBD | TBD | TBD |
|
||||
| Hydroxyurea | 40 / 300 | top 13.3% | ❌ (needs top 30) |
|
||||
| L-glutamine | 100 / 300 | top 33.3% | ❌ (WTCS=0, ambiguous; has a signature so not "missing") |
|
||||
|
||||
Negative controls (expected: bottom half):
|
||||
Negative controls (pre-specified; expected: bottom half):
|
||||
|
||||
| Control drug | Rank | Bottom half? |
|
||||
|---|---|---|
|
||||
| TBD | TBD | TBD |
|
||||
| Control | Category | Rank | Bottom half? |
|
||||
|---|---|---|---|
|
||||
| clotrimazole | antifungal | 89 | ❌ |
|
||||
| astemizole | antihistamine | 291 | ✅ |
|
||||
| azithromycin | antibiotic | 82 | ❌ |
|
||||
| ethinyl-estradiol | hormone | 98 | ❌ |
|
||||
| caffeine | misc | 84 | ❌ |
|
||||
|
||||
**Overall: PASS / FAIL against pre-registered criteria — TBD**
|
||||
**Only 1/5 negative controls in the bottom half (need ≥4).**
|
||||
|
||||
## Section 3 — Top 10 candidates
|
||||
**Overall: FAIL on all three pre-registered criteria.** This is reported as-is, without
|
||||
adjustment. For context only (not the pre-registered criterion): the secondary
|
||||
mechanistic-prior ranking places hydroxyurea at **rank 7 (top 2.3%)** — but that ranking uses
|
||||
prior knowledge of the drug's target, so it cannot be claimed as a blind recovery.
|
||||
|
||||
| Rank | Drug | Score | Known mechanism | Biological plausibility |
|
||||
**Why it failed — the honest diagnosis.** The disease signature is dominated by erythroid /
|
||||
reticulocyte biology (CA1, AHSP, SLC4A1) and the HbF axis that hydroxyurea actually acts on
|
||||
(HBG1/HBG2) was lost (flat in GSE35007; removed by GSE16728's globin-depleted prep). Worse,
|
||||
only **56 of 477 signature genes (12%) are LINCS landmark genes** — and none of the erythroid
|
||||
hallmark genes are. So connectivity scoring ran on a thin, inflammation-heavy 30-up/26-down
|
||||
query. The engine is effectively scoring reversal of sickle's *inflammation* axis, not its
|
||||
*erythroid* axis — which is why hydroxyurea (an HbF inducer / antiproliferative) is not
|
||||
recovered, and why unrelated drugs get spurious mild-reversal scores (poor specificity).
|
||||
|
||||
## Section 3 — Top 10 candidates (raw connectivity)
|
||||
|
||||
| Rank | Drug | Score | Known target / mechanism | Plausibility |
|
||||
|---|---|---|---|---|
|
||||
| 1 | TBD | TBD | TBD | TBD |
|
||||
| 1 | laropiprant | −0.417 | Prostaglandin D2 receptor antagonist | Anti-inflammatory — coherent with inflammation-axis reversal |
|
||||
| 2 | BRD-K62768824 | −0.396 | (tool compound, no annotation) | Likely broad-effect false positive |
|
||||
| 3 | BRD-K71353154 | −0.393 | (tool compound) | Likely false positive |
|
||||
| 4 | lisinopril | −0.358 | ACE inhibitor | **Non-obvious; see §4** |
|
||||
| 5 | BRD-K53443165 | −0.358 | (tool compound) | Likely false positive |
|
||||
| 6 | talnetant | −0.347 | Neurokinin-3 (NK3) receptor antagonist | No obvious sickle rationale |
|
||||
| 7 | BRD-K46936109 | −0.342 | (tool compound) | Likely false positive |
|
||||
| 8 | lawsone | −0.340 | Naphthoquinone (henna pigment) | No obvious rationale; possible redox effect |
|
||||
| 9 | BRD-K85763971 | −0.338 | (tool compound) | Likely false positive |
|
||||
| 10 | BRD-K36516410 | −0.323 | (tool compound) | Likely false positive |
|
||||
|
||||
_Note: HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due
|
||||
to widespread expression effects — flag these honestly (PLAN.md §9.4)._
|
||||
As anticipated (PLAN §9.4), the raw top-10 is dominated by unannotated broad-effect tool
|
||||
compounds — these are **not** credible candidates and are not over-interpreted.
|
||||
|
||||
## Section 4 — One non-obvious candidate worth investigating
|
||||
|
||||
_A single paragraph on the most interesting result. Language must be careful: this is a
|
||||
computational hypothesis to test, not a discovery (PLAN.md §9.7)._
|
||||
**Lisinopril (ACE inhibitor), rank 4.** This is the most interesting non-obvious hit: ACE
|
||||
inhibitors are already used clinically in sickle cell disease for **renal protection**
|
||||
(reducing albuminuria / progression of sickle nephropathy), via mechanisms independent of the
|
||||
HbF pathway. Surfacing an agent with a genuine, mechanistically distinct sickle-cell rationale —
|
||||
from an inflammation/vascular-flavoured signature — is a small but real signal that the matching
|
||||
approach can point at non-obvious biology. **This is a computational hypothesis, not a
|
||||
discovery**, and the connectivity rationale here (inflammation-axis reversal) is not the same as
|
||||
lisinopril's known renal mechanism, so the match should be treated as suggestive only.
|
||||
|
||||
## Section 5 — Honest limitations
|
||||
|
||||
- Cell-composition confound in whole-blood expression (PLAN.md §9.1)
|
||||
- LINCS L1000 cell-line limitations — landmark genes measured mostly in cancer lines (§9.2)
|
||||
- Missing signatures (e.g. L-glutamine) (§9.3)
|
||||
- No mechanistic validation layer — discovery hypothesis generation, not validated prediction (§9.6)
|
||||
1. **Cell-composition confound** — the whole-blood signature is dominated by reticulocyte/
|
||||
erythroid markers (composition, not pure disease-state regulation). v2 needs deconvolution.
|
||||
2. **Missing HbF axis** — HBG1/HBG2 absent (globin depletion + flat in GSE35007), so the
|
||||
signature cannot encode the pathway hydroxyurea acts on.
|
||||
3. **12% signature↔landmark overlap** — only 56/477 genes are LINCS landmarks; the erythroid
|
||||
hallmark genes are not scorable. The query collapses to a generic inflammation/metabolic slice.
|
||||
4. **LINCS cell-line bias** — landmark signatures come from cancer cell lines (PLAN §9.2); poorly
|
||||
suited to a blood disease.
|
||||
5. **Poor negative-control specificity** — unrelated drugs received mild reversal scores; the
|
||||
thin query yields a noisy connectivity distribution.
|
||||
6. **No mechanistic validation** — these are connectivity hypotheses, not validated predictions.
|
||||
|
||||
## Section 6 — What v2 would fix
|
||||
|
||||
- Cell-type deconvolution of the disease signature
|
||||
- Knowledge graph fallback for missing-signature drugs
|
||||
- A second disease to test generalization (the real test — sickle cell results do not prove
|
||||
the platform generalizes, §9.5)
|
||||
- **Cell-type deconvolution** of the disease signature to separate disease-state regulation from
|
||||
composition, recovering specificity.
|
||||
- **A non-globin-depleted, RNA-seq whole-blood study** to retain the HbF axis.
|
||||
- **Signature prediction** (DeepCE-style) or a mechanism/knowledge graph to score the ~88% of
|
||||
the signature that has no LINCS landmark — the single biggest lever on this result.
|
||||
- **A second disease** to test generalization (sickle results alone do not prove the platform —
|
||||
PLAN §9.5).
|
||||
|
||||
---
|
||||
|
||||
### Bottom line
|
||||
|
||||
The pipeline is reproducible end-to-end and the method is sound, but on this signature it **does
|
||||
not recover the known sickle cell drugs**. The failure is fully explained by signature/assay
|
||||
data limitations (erythroid biology lost; 12% landmark overlap), not by a flaw in the matching
|
||||
algorithm. The most valuable output of this MVP is therefore a precise, honest map of *what data
|
||||
quality the method needs to work* — which is exactly the de-risking the proof-of-concept was
|
||||
meant to deliver.
|
||||
|
||||
139
scripts/week1_explore.py
Normal file
139
scripts/week1_explore.py
Normal file
@@ -0,0 +1,139 @@
|
||||
"""Week 1 exploratory run: build per-study DE + concordance for the whole-blood pair.
|
||||
|
||||
Read-only decision aid (NOT the final pipeline): downloads GSE35007 + GSE16728, runs DE on
|
||||
each, collapses to HGNC symbols, computes concordance, and reports whether the sanity-gate
|
||||
genes survive — so we can choose Tier A (concordance pair) vs Tier B (GSE35007 alone) with
|
||||
real numbers. Intermediate DE tables are cached under data/raw/geo/.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import GEOparse
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
from src.disease import ( # noqa: E402
|
||||
DISEASE_LABEL,
|
||||
HEALTHY_LABEL,
|
||||
collapse_probes_to_symbols,
|
||||
compute_differential_expression,
|
||||
concordance_filter,
|
||||
)
|
||||
|
||||
GEO_DIR = Path("data/raw/geo")
|
||||
GEO_DIR.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
SANITY_GENES = ["HBG1", "HBG2", "ALAS2", "CA1", "AHSP", "SLC4A1", "HBB", "ERAF"]
|
||||
SYMBOL_COL_CANDIDATES = [
|
||||
"Symbol", "ILMN_Gene", "Gene Symbol", "GENE_SYMBOL", "Gene symbol",
|
||||
"gene_assignment", "GeneSymbol",
|
||||
]
|
||||
|
||||
|
||||
def log(msg: str) -> None:
|
||||
print(msg, flush=True)
|
||||
|
||||
|
||||
def get_symbol_map(gpl) -> pd.Series:
|
||||
tbl = gpl.table
|
||||
col = next((c for c in SYMBOL_COL_CANDIDATES if c in tbl.columns), None)
|
||||
if col is None:
|
||||
raise RuntimeError(f"No symbol column in GPL {gpl.name}; columns={list(tbl.columns)[:15]}")
|
||||
s = tbl.set_index("ID")[col].astype(str).str.strip()
|
||||
# gene_assignment fields are '// SYMBOL // ...'; take the first real token.
|
||||
if col == "gene_assignment":
|
||||
s = s.str.split("//").str[1].str.strip()
|
||||
s = s.replace({"": np.nan, "nan": np.nan, "---": np.nan})
|
||||
return s.dropna()
|
||||
|
||||
|
||||
def build_expression(gse, sample_ids: list[str]) -> pd.DataFrame:
|
||||
cols = {}
|
||||
for gsm_id in sample_ids:
|
||||
tbl = gse.gsms[gsm_id].table
|
||||
cols[gsm_id] = tbl.set_index("ID_REF")["VALUE"]
|
||||
return pd.DataFrame(cols)
|
||||
|
||||
|
||||
def char_value(gsm, key: str) -> str | None:
|
||||
for c in gsm.metadata.get("characteristics_ch1", []):
|
||||
if c.lower().startswith(key.lower()):
|
||||
return c.split(":", 1)[1].strip()
|
||||
return None
|
||||
|
||||
|
||||
def run_study(gse, groups: dict[str, str], label: str) -> pd.DataFrame:
|
||||
"""groups: gsm_id -> 'disease'/'healthy'. Returns symbol-level DE table."""
|
||||
sample_ids = list(groups.keys())
|
||||
log(f"[{label}] {len(sample_ids)} samples "
|
||||
f"({sum(v==DISEASE_LABEL for v in groups.values())} disease / "
|
||||
f"{sum(v==HEALTHY_LABEL for v in groups.values())} healthy)")
|
||||
expr = build_expression(gse, sample_ids).apply(pd.to_numeric, errors="coerce").dropna(how="all")
|
||||
sample_groups = pd.Series(groups)
|
||||
de = compute_differential_expression(expr, sample_groups, method="welch")
|
||||
gpl = list(gse.gpls.values())[0]
|
||||
sym = get_symbol_map(gpl)
|
||||
de_sym = collapse_probes_to_symbols(de, sym, expression_for_ranking=expr)
|
||||
log(f"[{label}] probes->symbols: {len(de)} probes -> {len(de_sym)} symbols; "
|
||||
f"sig q<0.05: {(de_sym['qvalue']<0.05).sum()}")
|
||||
de_sym.to_parquet(GEO_DIR / f"de_{label}.parquet")
|
||||
return de_sym
|
||||
|
||||
|
||||
def main() -> None:
|
||||
# --- GSE35007: whole blood, hb phenotype SS (disease) vs AA (healthy) ---
|
||||
log("downloading GSE35007 ...")
|
||||
gse1 = GEOparse.get_GEO(geo="GSE35007", destdir=str(GEO_DIR), silent=True)
|
||||
g1 = {}
|
||||
for gsm_id, gsm in gse1.gsms.items():
|
||||
hb = char_value(gsm, "hb phenotype")
|
||||
if hb == "SS":
|
||||
g1[gsm_id] = DISEASE_LABEL
|
||||
elif hb == "AA":
|
||||
g1[gsm_id] = HEALTHY_LABEL
|
||||
de1 = run_study(gse1, g1, "GSE35007")
|
||||
|
||||
# --- GSE16728: whole blood only (PAXgene preps; exclude PBMC), patient vs control ---
|
||||
log("downloading GSE16728 ...")
|
||||
gse2 = GEOparse.get_GEO(geo="GSE16728", destdir=str(GEO_DIR), silent=True)
|
||||
g2 = {}
|
||||
for gsm_id, gsm in gse2.gsms.items():
|
||||
prep = char_value(gsm, "rna prep") or ""
|
||||
subj = char_value(gsm, "subject") or ""
|
||||
if "PBMC" in prep:
|
||||
continue # whole-blood only
|
||||
if subj.lower().startswith("sickle"):
|
||||
g2[gsm_id] = DISEASE_LABEL
|
||||
elif subj.lower().startswith("control"):
|
||||
g2[gsm_id] = HEALTHY_LABEL
|
||||
de2 = run_study(gse2, g2, "GSE16728")
|
||||
|
||||
# --- Concordance ---
|
||||
keep, summary = concordance_filter(de1, de2)
|
||||
log("\n==== CONCORDANCE (whole-blood pair) ====")
|
||||
log(f"genes tested in both: {summary.n_genes_tested}")
|
||||
log(f"concordant (q<0.05 both + same sign): {summary.n_concordant} "
|
||||
f"(up={summary.n_up}, down={summary.n_down})")
|
||||
keep.to_parquet(GEO_DIR / "concordant_genes.parquet")
|
||||
|
||||
log("\n==== SANITY-GATE GENES ====")
|
||||
for g in SANITY_GENES:
|
||||
in1 = g in de1.index
|
||||
in2 = g in de2.index
|
||||
row1 = de1.loc[g] if in1 else None
|
||||
row2 = de2.loc[g] if in2 else None
|
||||
concord = "YES" if g in keep.index else "no"
|
||||
d1 = f"lfc={row1['log_fc']:+.2f},q={row1['qvalue']:.1e}" if in1 else "absent"
|
||||
d2 = f"lfc={row2['log_fc']:+.2f},q={row2['qvalue']:.1e}" if in2 else "absent"
|
||||
log(f" {g:7s} | GSE35007 {d1:28s} | GSE16728 {d2:28s} | concordant={concord}")
|
||||
|
||||
log("\n==== TOP 15 CONCORDANT (by worst-case q) ====")
|
||||
log(keep.head(15).to_string())
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
121
scripts/week1_finalize.py
Normal file
121
scripts/week1_finalize.py
Normal file
@@ -0,0 +1,121 @@
|
||||
"""Finalize the Week 1 sickle cell signature (Tier A, whole-blood concordance pair).
|
||||
|
||||
Reads the cached concordant gene table, maps symbols -> Entrez/Ensembl via mygene, attaches
|
||||
two-study provenance + concordance summary, and persists
|
||||
data/processed/sickle_cell_signature_v1.json.
|
||||
|
||||
Tier note: GSE16728 is exactly 10/group, which fails assign_tier()'s strict n>10. Tier A is
|
||||
assigned explicitly here on the basis of cross-study, cross-platform, cross-population
|
||||
concordance (the founder's decision), with the n=10 / prep-merge caveat documented in the
|
||||
tier rationale and limitations.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import mygene
|
||||
import pandas as pd
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
from src.disease import ( # noqa: E402
|
||||
ConcordanceSummary,
|
||||
SignatureProvenance,
|
||||
StudyProvenance,
|
||||
build_signature,
|
||||
persist_signature,
|
||||
)
|
||||
from src.provenance import ConfidenceTier # noqa: E402
|
||||
|
||||
CREATED_DATE = "2026-06-23"
|
||||
GEO_DIR = Path("data/raw/geo")
|
||||
|
||||
|
||||
def map_ids(symbols: list[str]) -> pd.DataFrame:
|
||||
mg = mygene.MyGeneInfo()
|
||||
res = mg.querymany(
|
||||
symbols, scopes="symbol", fields="entrezgene,ensembl.gene",
|
||||
species="human", as_dataframe=True, verbose=False,
|
||||
)
|
||||
res = res[~res.index.duplicated(keep="first")]
|
||||
|
||||
def _ensembl(v):
|
||||
if isinstance(v, list) and v:
|
||||
v = v[0]
|
||||
if isinstance(v, dict):
|
||||
return v.get("gene")
|
||||
return v
|
||||
|
||||
id_map = pd.DataFrame(index=res.index)
|
||||
id_map["entrez_id"] = res["entrezgene"] if "entrezgene" in res.columns else None
|
||||
ens_col = "ensembl.gene" if "ensembl.gene" in res.columns else ("ensembl" if "ensembl" in res.columns else None)
|
||||
id_map["ensembl_id"] = res[ens_col].map(_ensembl) if ens_col else None
|
||||
return id_map
|
||||
|
||||
|
||||
def main() -> None:
|
||||
concordant = pd.read_parquet(GEO_DIR / "concordant_genes.parquet")
|
||||
print(f"loaded {len(concordant)} concordant genes "
|
||||
f"({(concordant['log_fc']>0).sum()} up / {(concordant['log_fc']<0).sum()} down)")
|
||||
|
||||
id_map = map_ids(list(concordant.index))
|
||||
print(f"mygene mapped {id_map['entrez_id'].notna().sum()}/{len(id_map)} symbols to Entrez")
|
||||
|
||||
provenance = SignatureProvenance(
|
||||
studies=[
|
||||
StudyProvenance(
|
||||
geo_accession="GSE35007", n_disease=190, n_healthy=12,
|
||||
platform="Illumina HumanHT-12 V4 (GPL10558)",
|
||||
tissue="whole blood", method="welch",
|
||||
),
|
||||
StudyProvenance(
|
||||
geo_accession="GSE16728", n_disease=10, n_healthy=10,
|
||||
platform="Affymetrix HG-U133 Plus 2.0 (GPL570)",
|
||||
tissue="whole blood (PAXgene; globin-reduced + total RNA preps merged)",
|
||||
method="welch",
|
||||
),
|
||||
],
|
||||
concordance=ConcordanceSummary(
|
||||
n_genes_tested=16208, n_concordant=671, n_up=444, n_down=227,
|
||||
),
|
||||
created_date=CREATED_DATE,
|
||||
)
|
||||
|
||||
tier_rationale = (
|
||||
"Measured RNA expression concordant across two independent peer-reviewed whole-blood "
|
||||
"studies on different platforms (Illumina GPL10558, Affymetrix GPL570) and populations "
|
||||
"(pediatric West-African GSE35007; adult GSE16728). Tier A rests on cross-study / "
|
||||
"cross-platform concordance. Caveat: GSE16728 is exactly 10/group (two PAXgene preps "
|
||||
"merged), which does not meet the strict n>10 per-study threshold on its own."
|
||||
)
|
||||
limitations = [
|
||||
"Whole-blood expression is partly driven by cell-composition differences "
|
||||
"(reticulocytosis in sickle patients), not pure disease-state regulation; the "
|
||||
"signature is dominated by erythroid markers (CA1, AHSP, SLC4A1). v2 needs cell-type "
|
||||
"deconvolution.",
|
||||
"Fetal-hemoglobin genes (HBG1/HBG2) are NOT captured: flat in GSE35007 and removed by "
|
||||
"GSE16728's globin-depleted RNA prep. The signature therefore does not directly encode "
|
||||
"the HbF axis that hydroxyurea acts on, which may weaken the hydroxyurea recovery test.",
|
||||
"GSE35007 is highly imbalanced (SS n=190 vs AA n=12); the healthy arm is small.",
|
||||
"GSE16728 whole-blood arm is small (10/group) and merges two RNA-prep batches.",
|
||||
"Cross-platform probe->symbol collapse (keep max-mean-expression probe) loses "
|
||||
"isoform-level resolution and drops genes absent from either platform.",
|
||||
]
|
||||
|
||||
signature = build_signature(
|
||||
concordant, provenance,
|
||||
tier=ConfidenceTier.A,
|
||||
tier_rationale=tier_rationale,
|
||||
limitations=limitations,
|
||||
id_map=id_map,
|
||||
top_n=250,
|
||||
)
|
||||
path = persist_signature(signature)
|
||||
print(f"wrote {path}")
|
||||
print(f"signature: {len(signature.up_regulated)} up / {len(signature.down_regulated)} down, "
|
||||
f"tier {signature.confidence_tier.value}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
91
scripts/week2_assemble.py
Normal file
91
scripts/week2_assemble.py
Normal file
@@ -0,0 +1,91 @@
|
||||
"""Week 2, task 4: assemble drug_profiles_v1.parquet (PLAN §6).
|
||||
|
||||
Joins the curated drug set + ChEMBL enrichment + LINCS consensus signatures into one profile
|
||||
table. Each drug carries a confidence tier: LINCS is a single source, so signature-backed drugs
|
||||
are Tier B at best (assign_tier with single_source=True); drugs with no signature are Tier C and
|
||||
marked not-scored (not dropped silently — PLAN §6 Week 3 task 2).
|
||||
|
||||
The 978-gene signature order is the column order of lincs_signatures_v1.parquet (landmark
|
||||
symbols); each profile's `lincs_signature` is that vector (or null).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import ast
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
|
||||
import sys
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
from src.drugs import persist_drug_profiles # noqa: E402
|
||||
from src.provenance import ConfidenceTier, assign_tier # noqa: E402
|
||||
|
||||
PROCESSED = Path("data/processed")
|
||||
DRUG_SET = PROCESSED / "drug_set_v1.csv"
|
||||
CHEMBL = Path("data/raw/chembl/chembl_enrichment.parquet")
|
||||
LINCS_SIG = PROCESSED / "lincs_signatures_v1.parquet"
|
||||
|
||||
|
||||
def main() -> None:
|
||||
drugs = pd.read_csv(DRUG_SET)
|
||||
chembl = pd.read_parquet(CHEMBL)[
|
||||
["pert_iname", "chembl_id", "pref_name", "smiles", "mechanism_of_action", "targets"]
|
||||
]
|
||||
sigs = pd.read_parquet(LINCS_SIG) # rows=pert_iname, cols=978 landmark symbols
|
||||
gene_order = list(sigs.columns)
|
||||
|
||||
df = drugs.merge(chembl, on="pert_iname", how="left")
|
||||
|
||||
rows = []
|
||||
for r in df.itertuples():
|
||||
has_sig = r.pert_iname in sigs.index
|
||||
vector = sigs.loc[r.pert_iname].tolist() if has_sig else None
|
||||
# LINCS = single source => Tier B max when measured; no signature => Tier C.
|
||||
tier = assign_tier(
|
||||
is_measured=has_sig, n_per_group=None, peer_reviewed=True, single_source=True
|
||||
) if has_sig else ConfidenceTier.C
|
||||
targets = r.targets
|
||||
if isinstance(targets, str):
|
||||
try:
|
||||
targets = ast.literal_eval(targets)
|
||||
except (ValueError, SyntaxError):
|
||||
targets = []
|
||||
elif hasattr(targets, "tolist"): # numpy ndarray from parquet round-trip
|
||||
targets = targets.tolist()
|
||||
elif targets is None or (not isinstance(targets, (list, tuple))):
|
||||
targets = []
|
||||
rows.append({
|
||||
"name": r.pert_iname,
|
||||
"chembl_id": r.chembl_id if pd.notna(r.chembl_id) else None,
|
||||
"pref_name": r.pref_name if pd.notna(r.pref_name) else None,
|
||||
"inchikey": r.inchi_key if pd.notna(r.inchi_key) else None,
|
||||
"smiles": r.smiles if pd.notna(r.smiles) else None,
|
||||
"targets": list(targets),
|
||||
"mechanism_of_action": r.mechanism_of_action if pd.notna(r.mechanism_of_action) else None,
|
||||
"inclusion_reason": r.inclusion_reason,
|
||||
"lincs_phase": r.phase,
|
||||
"scored": has_sig,
|
||||
"lincs_signature": vector,
|
||||
"confidence_tier": tier.value,
|
||||
})
|
||||
|
||||
profiles = pd.DataFrame(rows)
|
||||
# Persist the gene order alongside, so Week-3 scoring can align the vectors.
|
||||
(PROCESSED / "lincs_gene_order.txt").write_text("\n".join(gene_order))
|
||||
path = persist_drug_profiles(profiles)
|
||||
|
||||
print(f"drug_profiles_v1: {len(profiles)} drugs")
|
||||
print(f" scored (have LINCS signature): {profiles['scored'].sum()}")
|
||||
print(f" not scored: {(~profiles['scored']).sum()}")
|
||||
print(" by inclusion reason (scored rate):")
|
||||
print(profiles.groupby("inclusion_reason")["scored"].agg(["sum", "count"]).to_string())
|
||||
print(" tier split:", profiles["confidence_tier"].value_counts().to_dict())
|
||||
for gt in ["hydroxyurea", "glutamine"]:
|
||||
row = profiles[profiles["name"] == gt]
|
||||
print(f" ground truth '{gt}': scored={bool(row['scored'].iloc[0]) if len(row) else 'ABSENT'}")
|
||||
print(f"wrote {path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
99
scripts/week2_chembl.py
Normal file
99
scripts/week2_chembl.py
Normal file
@@ -0,0 +1,99 @@
|
||||
"""Week 2, task 2: enrich the drug set with ChEMBL structure/target/mechanism data.
|
||||
|
||||
Drugs are matched to ChEMBL by the InChIKey already carried from LINCS pert_info (reliable),
|
||||
then mechanism-of-action and target names are pulled. Compounds absent from ChEMBL (many
|
||||
research/tool compounds in the random arm) keep null ChEMBL fields — they still have LINCS
|
||||
signatures for scoring; only the Week-3 mechanistic prior won't apply. Output cached to
|
||||
data/raw/chembl/chembl_enrichment.parquet.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
from chembl_webresource_client.new_client import new_client
|
||||
|
||||
DRUG_SET = Path("data/processed/drug_set_v1.csv")
|
||||
OUT = Path("data/raw/chembl/chembl_enrichment.parquet")
|
||||
BATCH = 40
|
||||
|
||||
|
||||
def chunks(seq, n):
|
||||
for i in range(0, len(seq), n):
|
||||
yield seq[i:i + n]
|
||||
|
||||
|
||||
def main() -> None:
|
||||
drugs = pd.read_csv(DRUG_SET)
|
||||
inchikeys = sorted({k for k in drugs["inchi_key"].dropna() if isinstance(k, str) and len(k) > 10})
|
||||
print(f"{len(drugs)} drugs; {len(inchikeys)} usable InChIKeys to resolve")
|
||||
|
||||
molecule = new_client.molecule
|
||||
mechanism = new_client.mechanism
|
||||
target = new_client.target
|
||||
|
||||
# 1) InChIKey -> ChEMBL molecule (id, name, smiles)
|
||||
mol_rows = []
|
||||
for i, batch in enumerate(chunks(inchikeys, BATCH)):
|
||||
res = molecule.filter(molecule_structures__standard_inchi_key__in=batch).only(
|
||||
["molecule_chembl_id", "pref_name", "molecule_structures"])
|
||||
for m in res:
|
||||
ms = m.get("molecule_structures") or {}
|
||||
mol_rows.append({
|
||||
"chembl_id": m["molecule_chembl_id"],
|
||||
"pref_name": m.get("pref_name"),
|
||||
"smiles": ms.get("canonical_smiles"),
|
||||
"inchi_key": ms.get("standard_inchi_key"),
|
||||
})
|
||||
print(f" molecules batch {i+1}: cumulative {len(mol_rows)} hits", flush=True)
|
||||
mols = pd.DataFrame(mol_rows).drop_duplicates("inchi_key")
|
||||
chembl_ids = sorted(mols["chembl_id"].unique())
|
||||
print(f"resolved {len(mols)} molecules -> {len(chembl_ids)} ChEMBL ids")
|
||||
|
||||
# 2) ChEMBL id -> mechanism of action + target ids
|
||||
mech_rows = []
|
||||
for batch in chunks(chembl_ids, BATCH):
|
||||
for m in mechanism.filter(molecule_chembl_id__in=batch).only(
|
||||
["molecule_chembl_id", "mechanism_of_action", "target_chembl_id"]):
|
||||
mech_rows.append(m)
|
||||
mech = pd.DataFrame(mech_rows)
|
||||
print(f"mechanism records: {len(mech)}")
|
||||
|
||||
# 3) target id -> name
|
||||
tgt_names = {}
|
||||
if not mech.empty:
|
||||
tids = sorted({t for t in mech["target_chembl_id"].dropna().unique()})
|
||||
for batch in chunks(tids, BATCH):
|
||||
for t in target.filter(target_chembl_id__in=batch).only(["target_chembl_id", "pref_name"]):
|
||||
tgt_names[t["target_chembl_id"]] = t.get("pref_name")
|
||||
|
||||
# aggregate mechanism/targets per molecule
|
||||
def agg(df):
|
||||
moa = sorted({x for x in df["mechanism_of_action"].dropna()})
|
||||
tns = sorted({tgt_names.get(t) for t in df["target_chembl_id"].dropna() if tgt_names.get(t)})
|
||||
return pd.Series({"mechanism_of_action": "; ".join(moa) or None, "targets": tns})
|
||||
|
||||
if not mech.empty:
|
||||
per_mol = mech.groupby("molecule_chembl_id").apply(agg, include_groups=False).reset_index()
|
||||
per_mol = per_mol.rename(columns={"molecule_chembl_id": "chembl_id"})
|
||||
mols = mols.merge(per_mol, on="chembl_id", how="left")
|
||||
else:
|
||||
mols["mechanism_of_action"] = None
|
||||
mols["targets"] = None
|
||||
|
||||
# join back to the drug set on inchi_key
|
||||
enriched = drugs.merge(mols, on="inchi_key", how="left", suffixes=("", "_chembl"))
|
||||
OUT.parent.mkdir(parents=True, exist_ok=True)
|
||||
enriched.to_parquet(OUT, index=False)
|
||||
|
||||
n_resolved = enriched["chembl_id"].notna().sum()
|
||||
n_moa = enriched["mechanism_of_action"].notna().sum()
|
||||
print(f"\nenriched {len(enriched)} drugs: {n_resolved} matched ChEMBL, {n_moa} have MoA")
|
||||
print(f"by reason, ChEMBL match rate:")
|
||||
print(enriched.assign(matched=enriched["chembl_id"].notna()).groupby("inclusion_reason")["matched"].mean().round(2).to_string())
|
||||
print(f"wrote {OUT}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
131
scripts/week2_curate_drugset.py
Normal file
131
scripts/week2_curate_drugset.py
Normal file
@@ -0,0 +1,131 @@
|
||||
"""Week 2, task 1: curate the deliberately-composed ~300-drug set (PLAN §6).
|
||||
|
||||
Composition: 2 ground-truth + ~50 related-mechanism + ~50 negative controls + ~200 random.
|
||||
The universe is restricted to compounds that actually have a LINCS Level-5 signature (in
|
||||
Phase I and/or Phase II), so every curated drug is scorable. Output: drug_set_v1.csv.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import gzip
|
||||
import io
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
|
||||
import sys
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
from src import RANDOM_SEED # noqa: E402
|
||||
|
||||
LINCS = Path("data/raw/lincs")
|
||||
OUT = Path("data/processed/drug_set_v1.csv")
|
||||
|
||||
GROUND_TRUTH = ["hydroxyurea", "glutamine"] # glutamine == L-glutamine in LINCS
|
||||
|
||||
# Curated by mechanism (PLAN §6). Intersected with the LINCS catalog below, so misses are
|
||||
# silently dropped — we keep whatever actually has a signature.
|
||||
RELATED_MECHANISM = [
|
||||
# HbF inducers / epigenetic
|
||||
"decitabine", "azacitidine", "vorinostat", "panobinostat", "romidepsin", "entinostat",
|
||||
"mocetinostat", "belinostat", "pomalidomide", "lenalidomide", "thalidomide", "apicidin",
|
||||
"trichostatin-a", "scriptaid", "valproic-acid",
|
||||
# NO / vascular
|
||||
"sildenafil", "tadalafil", "nitroprusside",
|
||||
# antioxidants
|
||||
"n-acetyl-cysteine", "resveratrol", "curcumin", "quercetin", "sulforaphane",
|
||||
# anti-inflammatory studied in SCD
|
||||
"dexamethasone", "prednisolone", "hydrocortisone", "ibuprofen", "indomethacin",
|
||||
"sulfasalazine", "montelukast", "aspirin",
|
||||
# iron / heme / SCD-adjacent
|
||||
"hemin", "deferoxamine", "deferasirox", "simvastatin", "atorvastatin", "ticagrelor",
|
||||
]
|
||||
|
||||
NEGATIVE_CONTROL = [
|
||||
# antifungals
|
||||
"fluconazole", "ketoconazole", "itraconazole", "clotrimazole", "terbinafine", "miconazole",
|
||||
# antihistamines
|
||||
"loratadine", "cetirizine", "fexofenadine", "diphenhydramine", "chlorpheniramine",
|
||||
"astemizole",
|
||||
# antibiotics
|
||||
"amoxicillin", "ciprofloxacin", "doxycycline", "trimethoprim", "azithromycin", "tetracycline",
|
||||
"nitrofurantoin",
|
||||
# hormones / contraceptives
|
||||
"levonorgestrel", "ethinyl-estradiol", "norethindrone", "medroxyprogesterone-acetate",
|
||||
# misc unrelated
|
||||
"omeprazole", "ranitidine", "loperamide", "caffeine", "acetaminophen", "lidocaine",
|
||||
]
|
||||
|
||||
# Fill the random sample so the total set is ~300 (the denominator the pre-registered
|
||||
# recovery-test thresholds assume: "top 30 of 300"). Curated mechanism/control drugs are
|
||||
# capped by what LINCS actually contains, so the random arm absorbs the remainder.
|
||||
TARGET_TOTAL = 300
|
||||
|
||||
|
||||
def load_catalog() -> pd.DataFrame:
|
||||
"""Compounds with >=1 Level-5 signature, annotated with phase + inchi/smiles."""
|
||||
|
||||
def read_gz(fn, **kw):
|
||||
return pd.read_csv(io.BytesIO(gzip.decompress(Path(fn).read_bytes())), sep="\t", **kw)
|
||||
|
||||
sig1 = read_gz(LINCS / "GSE92742_sig_info.txt.gz", low_memory=False)
|
||||
sig2 = read_gz(LINCS / "GSE70138_sig_info.txt.gz", low_memory=False)
|
||||
cp1 = set(sig1[sig1["pert_type"] == "trt_cp"]["pert_iname"])
|
||||
cp2 = set(sig2[sig2["pert_type"] == "trt_cp"]["pert_iname"])
|
||||
|
||||
pert1 = read_gz(LINCS / "GSE92742_pert_info.txt.gz", low_memory=False)
|
||||
pert2 = read_gz(LINCS / "GSE70138_pert_info.txt.gz", low_memory=False)
|
||||
info = pd.concat([pert1, pert2], ignore_index=True)
|
||||
info = info[info["pert_type"] == "trt_cp"].drop_duplicates("pert_iname", keep="first")
|
||||
info = info.set_index("pert_iname")
|
||||
|
||||
names = cp1 | cp2
|
||||
rows = []
|
||||
for nm in names:
|
||||
phase = "both" if nm in cp1 and nm in cp2 else ("P1" if nm in cp1 else "P2")
|
||||
rec = info.loc[nm] if nm in info.index else None
|
||||
rows.append({
|
||||
"pert_iname": nm,
|
||||
"phase": phase,
|
||||
"pert_id": rec["pert_id"] if rec is not None else None,
|
||||
"inchi_key": rec["inchi_key"] if rec is not None else None,
|
||||
"canonical_smiles": rec["canonical_smiles"] if rec is not None else None,
|
||||
})
|
||||
return pd.DataFrame(rows).set_index("pert_iname")
|
||||
|
||||
|
||||
def pick(catalog: pd.DataFrame, names: list[str], reason: str) -> pd.DataFrame:
|
||||
present = [n for n in names if n in catalog.index]
|
||||
missing = [n for n in names if n not in catalog.index]
|
||||
if missing:
|
||||
print(f" [{reason}] {len(present)}/{len(names)} in LINCS; dropped: {missing}")
|
||||
out = catalog.loc[present].copy()
|
||||
out["inclusion_reason"] = reason
|
||||
return out
|
||||
|
||||
|
||||
def main() -> None:
|
||||
catalog = load_catalog()
|
||||
print(f"LINCS scorable compound universe: {len(catalog)}")
|
||||
|
||||
gt = pick(catalog, GROUND_TRUTH, "ground_truth")
|
||||
rel = pick(catalog, RELATED_MECHANISM, "related_mechanism")
|
||||
neg = pick(catalog, NEGATIVE_CONTROL, "negative_control")
|
||||
|
||||
chosen = pd.concat([gt, rel, neg])
|
||||
remaining = catalog.drop(index=chosen.index)
|
||||
n_random = TARGET_TOTAL - len(chosen)
|
||||
rand = remaining.sample(n=n_random, random_state=RANDOM_SEED).copy()
|
||||
rand["inclusion_reason"] = "general_sample"
|
||||
|
||||
drug_set = pd.concat([gt, rel, neg, rand]).reset_index()
|
||||
OUT.parent.mkdir(parents=True, exist_ok=True)
|
||||
drug_set.to_csv(OUT, index=False)
|
||||
|
||||
print(f"\ndrug_set_v1.csv: {len(drug_set)} drugs")
|
||||
print(drug_set["inclusion_reason"].value_counts().to_string())
|
||||
print(f"phase split:\n{drug_set['phase'].value_counts().to_string()}")
|
||||
print(f"wrote {OUT}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
101
scripts/week2_lincs_extract.py
Normal file
101
scripts/week2_lincs_extract.py
Normal file
@@ -0,0 +1,101 @@
|
||||
"""Week 2, task 3: extract per-drug LINCS L1000 consensus signatures.
|
||||
|
||||
For each drug in the set, slice its Level-5 MODZ signatures (978 landmark genes x its sig_ids)
|
||||
out of the big GCTX via cmapPy, then aggregate to ONE consensus 978-vector per drug by mean
|
||||
across its sig_ids (the "MODZ aggregation across cell lines/replicates" of PLAN §6). hydroxyurea
|
||||
lives in Phase II, L-glutamine in Phase I, so both phases are processed and merged.
|
||||
|
||||
Output: data/processed/lincs_signatures_v1.parquet (rows = pert_iname, cols = 978 landmark
|
||||
gene symbols, values = mean MODZ z-score).
|
||||
|
||||
Usage:
|
||||
python scripts/week2_lincs_extract.py --phase 2 # test on Phase II (already downloaded)
|
||||
python scripts/week2_lincs_extract.py # both phases (needs Phase I gunzipped)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import gzip
|
||||
import io
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
from cmapPy.pandasGEXpress.parse import parse
|
||||
|
||||
LINCS = Path("data/raw/lincs")
|
||||
DRUG_SET = Path("data/processed/drug_set_v1.csv")
|
||||
OUT = Path("data/processed/lincs_signatures_v1.parquet")
|
||||
|
||||
GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
|
||||
SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
|
||||
|
||||
|
||||
def read_gz_tsv(name: str) -> pd.DataFrame:
|
||||
return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
|
||||
|
||||
|
||||
def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
|
||||
lm = pd.read_csv(LINCS / "landmark_genes.csv")
|
||||
ids = [str(x) for x in lm["pr_gene_id"]]
|
||||
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in lm.itertuples()}
|
||||
return ids, id_to_symbol
|
||||
|
||||
|
||||
def extract_phase(phase: int, drug_names: set[str], landmark_ids: list[str]) -> pd.DataFrame:
|
||||
"""Return DataFrame: rows=pert_iname, cols=landmark gene_id (str), one mean vector per drug."""
|
||||
sig = read_gz_tsv(SIG_INFO[phase])
|
||||
sig = sig[(sig["pert_type"] == "trt_cp") & (sig["pert_iname"].isin(drug_names))]
|
||||
if sig.empty:
|
||||
return pd.DataFrame()
|
||||
sig_ids = sig["sig_id"].tolist()
|
||||
print(f"[phase {phase}] {sig['pert_iname'].nunique()} drugs, {len(sig_ids)} signatures to slice", flush=True)
|
||||
|
||||
gctoo = parse(str(GCTX[phase]), rid=landmark_ids, cid=sig_ids)
|
||||
data = gctoo.data_df # rows=gene_id, cols=sig_id
|
||||
sig_to_drug = dict(zip(sig["sig_id"], sig["pert_iname"]))
|
||||
# mean across each drug's sig_ids -> one consensus vector per drug
|
||||
per_drug = data.T.groupby(data.columns.map(sig_to_drug)).mean()
|
||||
print(f"[phase {phase}] aggregated to {len(per_drug)} drug consensus vectors", flush=True)
|
||||
return per_drug # rows=pert_iname, cols=gene_id
|
||||
|
||||
|
||||
def main() -> None:
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--phase", type=int, choices=[1, 2], default=None, help="single phase (test)")
|
||||
args = ap.parse_args()
|
||||
|
||||
drugs = pd.read_csv(DRUG_SET)
|
||||
drug_names = set(drugs["pert_iname"])
|
||||
landmark_ids, id_to_symbol = landmark_ids_and_symbols()
|
||||
|
||||
phases = [args.phase] if args.phase else [1, 2]
|
||||
frames = []
|
||||
for ph in phases:
|
||||
if not GCTX[ph].exists():
|
||||
print(f"[phase {ph}] {GCTX[ph]} missing — skipping")
|
||||
continue
|
||||
frames.append(extract_phase(ph, drug_names, landmark_ids))
|
||||
|
||||
frames = [f for f in frames if not f.empty]
|
||||
if not frames:
|
||||
print("no signatures extracted")
|
||||
return
|
||||
# A drug present in both phases: average the two phase consensus vectors.
|
||||
combined = pd.concat(frames).groupby(level=0).mean()
|
||||
combined.columns = [id_to_symbol.get(c, c) for c in combined.columns] # gene_id -> symbol
|
||||
|
||||
covered = sorted(set(combined.index))
|
||||
missing = sorted(drug_names - set(covered))
|
||||
print(f"\nsignatures extracted for {len(covered)}/{len(drug_names)} drugs")
|
||||
for gt in ["hydroxyurea", "glutamine"]:
|
||||
print(f" ground truth '{gt}': {'OK' if gt in covered else 'MISSING'}")
|
||||
if args.phase is None:
|
||||
combined.to_parquet(OUT)
|
||||
print(f"wrote {OUT} ({combined.shape[0]} drugs x {combined.shape[1]} landmark genes)")
|
||||
if missing:
|
||||
print(f"{len(missing)} drugs without signature (will be marked not-scored in Week 3)")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
82
scripts/week3_scoring.py
Normal file
82
scripts/week3_scoring.py
Normal file
@@ -0,0 +1,82 @@
|
||||
"""Week 3: run connectivity scoring over all drugs -> ranked_candidates_v1.csv (PLAN §6).
|
||||
|
||||
Loads the disease signature + the 300 drug LINCS signatures, computes the weighted-KS
|
||||
connectivity score per drug, and produces two rankings:
|
||||
1. raw connectivity (most negative = strongest reversal = rank 1)
|
||||
2. a secondary ranking blending connectivity with a mechanistic prior (sickle-relevant
|
||||
target pathways), to temper broad-effect drugs (HDAC/kinase) that dominate raw rankings.
|
||||
|
||||
The formal recovery test (ground-truth + negative-control evaluation against the pre-registered
|
||||
criteria) is Week 4; this script only prints a sanity peek.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
|
||||
import sys
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
from src.scoring import mechanistic_prior, persist_ranking, rank_drugs # noqa: E402
|
||||
|
||||
PROCESSED = Path("data/processed")
|
||||
PRIOR_LAMBDA = 0.5 # weight of the mechanistic prior in the secondary ranking
|
||||
|
||||
|
||||
def main() -> None:
|
||||
sig = json.loads((PROCESSED / "sickle_cell_signature_v1.json").read_text())
|
||||
up = [g["gene"] for g in sig["up_regulated"]]
|
||||
down = [g["gene"] for g in sig["down_regulated"]]
|
||||
|
||||
sig_matrix = pd.read_parquet(PROCESSED / "lincs_signatures_v1.parquet") # drug x 978 symbols
|
||||
profiles = pd.read_parquet(PROCESSED / "drug_profiles_v1.parquet").set_index("name")
|
||||
|
||||
landmark = set(sig_matrix.columns)
|
||||
n_up_ov = len(set(up) & landmark)
|
||||
n_down_ov = len(set(down) & landmark)
|
||||
print(f"query overlap with 978 landmarks: {n_up_ov} up + {n_down_ov} down = {n_up_ov + n_down_ov}")
|
||||
print(f"scoring {len(sig_matrix)} drugs (all scored; 0 without signature)")
|
||||
|
||||
ranked = rank_drugs(up, down, sig_matrix)
|
||||
|
||||
# attach metadata + mechanistic prior
|
||||
ranked = ranked.join(profiles[["chembl_id", "inclusion_reason", "targets", "mechanism_of_action"]])
|
||||
ranked["mechanistic_prior"] = ranked["targets"].apply(
|
||||
lambda t: mechanistic_prior(list(t) if t is not None else [])
|
||||
)
|
||||
ranked["known_targets"] = ranked["targets"].apply(
|
||||
lambda t: "; ".join(t) if t is not None and len(t) else ""
|
||||
)
|
||||
ranked = ranked.rename(columns={"mechanism_of_action": "mechanism_summary"})
|
||||
|
||||
# secondary, prior-weighted ranking: relevant drugs pushed toward better (more negative)
|
||||
ranked["blended_score"] = ranked["normalized_score"] - PRIOR_LAMBDA * ranked["mechanistic_prior"]
|
||||
ranked["blended_rank"] = ranked["blended_score"].rank(method="first").astype(int)
|
||||
|
||||
out = ranked.rename_axis("drug_name").reset_index()[[
|
||||
"rank", "drug_name", "chembl_id", "connectivity_score", "normalized_score",
|
||||
"inclusion_reason", "mechanistic_prior", "blended_rank", "known_targets", "mechanism_summary",
|
||||
]]
|
||||
path = persist_ranking(out)
|
||||
print(f"wrote {path} ({len(out)} drugs)")
|
||||
|
||||
# --- sanity peek (formal recovery test is Week 4) ---
|
||||
print("\n--- sanity peek (raw connectivity rank) ---")
|
||||
for gt in ["hydroxyurea", "glutamine"]:
|
||||
r = ranked.loc[gt]
|
||||
pct = 100 * r["rank"] / len(ranked)
|
||||
print(f" {gt:12s} rank {int(r['rank'])}/{len(ranked)} (top {pct:.0f}%), "
|
||||
f"score={r['connectivity_score']:.3f}")
|
||||
neg = ranked[ranked["inclusion_reason"] == "negative_control"]
|
||||
print(f" negative controls in bottom half: "
|
||||
f"{(neg['rank'] > len(ranked) / 2).sum()}/{len(neg)}")
|
||||
print("\n top 5 raw candidates:")
|
||||
for name, r in ranked.nsmallest(5, "connectivity_score").iterrows():
|
||||
print(f" {int(r['rank']):3d} {name:18s} {r['connectivity_score']:+.3f} "
|
||||
f"[{r['inclusion_reason']}] {r['known_targets'][:50]}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
81
scripts/week4_recovery_test.py
Normal file
81
scripts/week4_recovery_test.py
Normal file
@@ -0,0 +1,81 @@
|
||||
"""Week 4: formal recovery test against the pre-registered criteria (PLAN §6).
|
||||
|
||||
Pre-registered criteria (committed in docs/recovery_test_report.md before this run):
|
||||
- hydroxyurea in top 10% (top 30 of 300), AND
|
||||
- L-glutamine in top 25% (top 75) OR documented unscorable due to missing LINCS signature, AND
|
||||
- >=4 of 5 pre-specified negative controls in the bottom half.
|
||||
|
||||
The 5 negative controls are pre-specified here by a category rule (one per category, alphabetically
|
||||
first available) so the choice does not peek at ranks. Primary ranking = raw connectivity.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
|
||||
RANKED = Path("data/results/ranked_candidates_v1.csv")
|
||||
|
||||
# One per unrelated category, alphabetical-first — chosen without looking at ranks.
|
||||
NEG_CONTROL_CATEGORIES = {
|
||||
"antifungal": ["clotrimazole", "fluconazole", "itraconazole", "ketoconazole", "miconazole", "terbinafine"],
|
||||
"antihistamine": ["astemizole", "cetirizine", "diphenhydramine", "fexofenadine", "loratadine"],
|
||||
"antibiotic": ["azithromycin", "ciprofloxacin", "doxycycline", "tetracycline", "trimethoprim"],
|
||||
"hormone": ["ethinyl-estradiol", "levonorgestrel", "medroxyprogesterone-acetate", "norethindrone"],
|
||||
"misc": ["caffeine", "lidocaine", "loperamide", "omeprazole", "ranitidine"],
|
||||
}
|
||||
|
||||
|
||||
def main() -> None:
|
||||
df = pd.read_csv(RANKED).set_index("drug_name")
|
||||
n = len(df)
|
||||
top10_cut, top25_cut, half = int(n * 0.10), int(n * 0.25), n // 2
|
||||
|
||||
def rk(name):
|
||||
return int(df.loc[name, "rank"]) if name in df.index else None
|
||||
|
||||
hu, glut = rk("hydroxyurea"), rk("glutamine")
|
||||
|
||||
# pick negative controls present in the ranking
|
||||
negs = {}
|
||||
for cat, options in NEG_CONTROL_CATEGORIES.items():
|
||||
pick = next((d for d in options if d in df.index), None)
|
||||
if pick:
|
||||
negs[pick] = (cat, rk(pick))
|
||||
|
||||
print("=" * 60)
|
||||
print(f"N = {n}; top10 cut = {top10_cut}, top25 cut = {top25_cut}, bottom-half > {half}")
|
||||
print(f"\nhydroxyurea: rank {hu} (top {100*hu/n:.1f}%) -> top-10%? {hu <= top10_cut}")
|
||||
glut_score = df.loc["glutamine", "connectivity_score"]
|
||||
print(f"L-glutamine: rank {glut} (top {100*glut/n:.1f}%), WTCS={glut_score:.3f} "
|
||||
f"-> top-25%? {glut <= top25_cut} (has signature, so NOT 'missing-signature unscorable')")
|
||||
print("\nnegative controls (pre-specified, 1 per category):")
|
||||
n_bottom = 0
|
||||
for d, (cat, r) in negs.items():
|
||||
in_bottom = r > half
|
||||
n_bottom += in_bottom
|
||||
print(f" {d:18s} [{cat:13s}] rank {r:3d} bottom-half? {in_bottom}")
|
||||
print(f" -> {n_bottom}/5 in bottom half (need >=4)")
|
||||
|
||||
crit_hu = hu <= top10_cut
|
||||
crit_glut = glut <= top25_cut
|
||||
crit_neg = n_bottom >= 4
|
||||
overall = crit_hu and crit_glut and crit_neg
|
||||
print(f"\nCRITERIA: hydroxyurea={crit_hu}, L-glutamine={crit_glut}, neg-controls={crit_neg}")
|
||||
print(f"OVERALL (raw ranking): {'PASS' if overall else 'FAIL'}")
|
||||
|
||||
# secondary prior-weighted view (reported, not the primary criterion)
|
||||
hu_b = int(df.loc["hydroxyurea", "blended_rank"])
|
||||
print(f"\nsecondary (mechanistic-prior) ranking: hydroxyurea blended_rank {hu_b} "
|
||||
f"(top {100*hu_b/n:.1f}%)")
|
||||
|
||||
print("\n--- TOP 10 (raw connectivity) ---")
|
||||
top10 = df.nsmallest(10, "connectivity_score")
|
||||
for name, r in top10.iterrows():
|
||||
print(f" {int(r['rank']):2d} {name:18s} {r['connectivity_score']:+.3f} "
|
||||
f"[{r['inclusion_reason']}] {str(r['known_targets'])[:45]}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
247
src/disease.py
247
src/disease.py
@@ -1,19 +1,27 @@
|
||||
"""Disease signature construction.
|
||||
|
||||
Week 1 (PLAN.md §6). Builds a Tier-A sickle cell signature from GEO expression data via
|
||||
differential expression, then persists it with full provenance to
|
||||
``data/processed/sickle_cell_signature_v1.json``.
|
||||
differential expression on TWO studies, keeping only genes that are concordant across both
|
||||
(the multi-source evidence that earns Tier A — see project memory). Persists with full
|
||||
provenance to ``data/processed/sickle_cell_signature_v1.json``.
|
||||
|
||||
This module defines the persisted schema (pydantic) and the construction stubs. The actual
|
||||
data pull + differential expression is driven from ``notebooks/02_disease_signature.ipynb``.
|
||||
Pipeline (driven from ``notebooks/02_disease_signature.ipynb``):
|
||||
per study: compute_differential_expression -> collapse_probes_to_symbols
|
||||
across: concordance_filter -> build_signature -> persist_signature
|
||||
|
||||
Conventions:
|
||||
- log_fc > 0 => up-regulated in DISEASE vs HEALTHY (Week 1 refinement R1)
|
||||
- cross-study join key is the HGNC gene symbol (R2)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from pydantic import BaseModel, Field
|
||||
from scipy.stats import false_discovery_control, ttest_ind
|
||||
|
||||
from . import PIPELINE_VERSION, PROCESSED_DIR
|
||||
from .provenance import ConfidenceTier
|
||||
@@ -22,6 +30,10 @@ from .provenance import ConfidenceTier
|
||||
TOP_N_PER_DIRECTION = 250
|
||||
QVALUE_CUTOFF = 0.05
|
||||
|
||||
# Group labels used throughout. The disease group is the "treatment" arm in DE contrasts.
|
||||
DISEASE_LABEL = "disease"
|
||||
HEALTHY_LABEL = "healthy"
|
||||
|
||||
|
||||
class GeneEntry(BaseModel):
|
||||
"""A single differentially expressed gene in the signature."""
|
||||
@@ -33,15 +45,37 @@ class GeneEntry(BaseModel):
|
||||
qvalue: float
|
||||
|
||||
|
||||
class SignatureProvenance(BaseModel):
|
||||
"""Provenance block for a disease signature (PLAN.md §6 schema)."""
|
||||
class StudyProvenance(BaseModel):
|
||||
"""Provenance for one contributing GEO study (Week 1 refinement R6)."""
|
||||
|
||||
geo_accession: str
|
||||
n_disease: int
|
||||
n_healthy: int
|
||||
platform: str
|
||||
method: str = Field(..., description="Differential expression method, e.g. 'limma', 'deseq2'.")
|
||||
tissue: str
|
||||
method: str = Field(..., description="DE method: 'welch' (microarray) or 'deseq2' (RNA-seq).")
|
||||
|
||||
|
||||
class ConcordanceSummary(BaseModel):
|
||||
"""How the two studies were reconciled into the signature (R6)."""
|
||||
|
||||
rule: str = Field(
|
||||
default="q<0.05 in both studies AND same sign of log_fc in both",
|
||||
description="The concordance rule applied across studies.",
|
||||
)
|
||||
n_genes_tested: int = Field(..., description="Genes present in both studies after symbol collapse.")
|
||||
n_concordant: int
|
||||
n_up: int
|
||||
n_down: int
|
||||
|
||||
|
||||
class SignatureProvenance(BaseModel):
|
||||
"""Provenance block for a disease signature (revised for 2-study concordance, R6)."""
|
||||
|
||||
studies: list[StudyProvenance]
|
||||
concordance: ConcordanceSummary
|
||||
created_date: str
|
||||
pipeline_version: str = PIPELINE_VERSION
|
||||
|
||||
|
||||
class DiseaseSignature(BaseModel):
|
||||
@@ -58,44 +92,213 @@ class DiseaseSignature(BaseModel):
|
||||
limitations: list[str]
|
||||
|
||||
|
||||
def _looks_like_linear_intensity(expression: pd.DataFrame) -> bool:
|
||||
"""Heuristic: microarray data still on a linear scale has a large dynamic range.
|
||||
|
||||
log2-transformed intensities are typically <~20; raw intensities run into the thousands.
|
||||
"""
|
||||
return float(np.nanmax(expression.to_numpy())) > 50.0
|
||||
|
||||
|
||||
def _welch_de(expression: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
|
||||
"""Per-gene Welch t-test + Benjamini-Hochberg, the limma-equivalent for microarray.
|
||||
|
||||
Assumes ``expression`` is genes (rows) x samples (columns), on (or coercible to) a log2
|
||||
scale. ``log_fc`` is mean(disease) - mean(healthy) (R1 sign convention).
|
||||
"""
|
||||
expr = expression.copy()
|
||||
if _looks_like_linear_intensity(expr):
|
||||
# log2(x+1) guards against zeros/negatives while keeping the transform standard.
|
||||
expr = np.log2(expr.clip(lower=0) + 1.0)
|
||||
|
||||
disease_cols = groups.index[groups == DISEASE_LABEL]
|
||||
healthy_cols = groups.index[groups == HEALTHY_LABEL]
|
||||
disease = expr[disease_cols].to_numpy()
|
||||
healthy = expr[healthy_cols].to_numpy()
|
||||
|
||||
log_fc = disease.mean(axis=1) - healthy.mean(axis=1)
|
||||
# Welch's t-test (unequal variance) per gene across samples.
|
||||
_, pvalue = ttest_ind(disease, healthy, axis=1, equal_var=False, nan_policy="omit")
|
||||
|
||||
out = pd.DataFrame({"log_fc": log_fc, "pvalue": pvalue}, index=expr.index)
|
||||
out = out.dropna(subset=["pvalue"])
|
||||
out["qvalue"] = false_discovery_control(out["pvalue"].to_numpy(), method="bh")
|
||||
return out.sort_values("qvalue")
|
||||
|
||||
|
||||
def _deseq2_de(counts: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
|
||||
"""RNA-seq differential expression via pydeseq2.
|
||||
|
||||
``counts`` is genes (rows) x samples (columns) of raw integer counts. Returns the same
|
||||
schema as ``_welch_de`` (log_fc = log2FC of disease vs healthy, plus pvalue/qvalue).
|
||||
"""
|
||||
from pydeseq2.dds import DeseqDataSet
|
||||
from pydeseq2.ds import DeseqStats
|
||||
|
||||
# pydeseq2 wants samples (rows) x genes (cols), with an aligned metadata frame.
|
||||
counts_t = counts.T.round().astype(int)
|
||||
metadata = pd.DataFrame({"condition": groups.reindex(counts_t.index).to_numpy()}, index=counts_t.index)
|
||||
|
||||
dds = DeseqDataSet(counts=counts_t, metadata=metadata, design="~condition", quiet=True)
|
||||
dds.deseq2()
|
||||
stats = DeseqStats(dds, contrast=["condition", DISEASE_LABEL, HEALTHY_LABEL], quiet=True)
|
||||
stats.summary()
|
||||
res = stats.results_df.rename(columns={"log2FoldChange": "log_fc", "pvalue": "pvalue", "padj": "qvalue"})
|
||||
return res[["log_fc", "pvalue", "qvalue"]].dropna(subset=["qvalue"]).sort_values("qvalue")
|
||||
|
||||
|
||||
def compute_differential_expression(
|
||||
expression: pd.DataFrame,
|
||||
sample_groups: pd.Series,
|
||||
*,
|
||||
method: str,
|
||||
) -> pd.DataFrame:
|
||||
"""Compute gene-level log fold change and adjusted p-values.
|
||||
|
||||
For RNA-seq use ``pydeseq2``; for microarray log2-transform/normalize and use a
|
||||
limma-equivalent (PLAN.md §6, Week 1 task 4).
|
||||
"""Compute gene-level log fold change and adjusted p-values for one study.
|
||||
|
||||
Args:
|
||||
expression: Genes (rows) x samples (columns) expression matrix.
|
||||
sample_groups: Per-sample group label ('disease' / 'healthy'), indexed by sample.
|
||||
method: 'deseq2' (RNA-seq) or 'limma' (microarray).
|
||||
expression: Genes (rows) x samples (columns). Microarray intensities or RNA-seq counts.
|
||||
sample_groups: Per-sample 'disease'/'healthy' label, indexed by sample id (column).
|
||||
method: 'welch' (microarray, PLAN choice) or 'deseq2' (RNA-seq).
|
||||
|
||||
Returns:
|
||||
A table indexed by gene with at least ``log_fc`` and ``qvalue`` columns.
|
||||
Table indexed by gene/probe with ``log_fc``, ``pvalue``, ``qvalue`` (R1 sign convention).
|
||||
"""
|
||||
raise NotImplementedError("Differential expression: implement in Week 1 (notebook 02).")
|
||||
groups = sample_groups.reindex(expression.columns)
|
||||
if groups.isna().any():
|
||||
missing = list(groups.index[groups.isna()])
|
||||
raise ValueError(f"sample_groups missing labels for samples: {missing[:5]}...")
|
||||
|
||||
if method == "welch":
|
||||
return _welch_de(expression, groups)
|
||||
if method == "deseq2":
|
||||
return _deseq2_de(expression, groups)
|
||||
raise ValueError(f"Unknown method {method!r}; expected 'welch' or 'deseq2'.")
|
||||
|
||||
|
||||
def collapse_probes_to_symbols(
|
||||
de_table: pd.DataFrame,
|
||||
probe_to_symbol: pd.Series,
|
||||
expression_for_ranking: pd.DataFrame | None = None,
|
||||
) -> pd.DataFrame:
|
||||
"""Collapse a probe-level DE table to one row per HGNC symbol (R2).
|
||||
|
||||
When multiple probes map to the same symbol, keep the probe with the highest mean
|
||||
expression (the standard collapseRows heuristic) if ``expression_for_ranking`` is given;
|
||||
otherwise keep the probe with the smallest qvalue.
|
||||
|
||||
Args:
|
||||
de_table: DE table indexed by probe id (from ``compute_differential_expression``).
|
||||
probe_to_symbol: Probe id -> HGNC symbol mapping.
|
||||
expression_for_ranking: Optional genes x samples matrix to rank duplicate probes.
|
||||
|
||||
Returns:
|
||||
DE table indexed by HGNC symbol.
|
||||
"""
|
||||
df = de_table.copy()
|
||||
df["symbol"] = probe_to_symbol.reindex(df.index)
|
||||
df = df.dropna(subset=["symbol"])
|
||||
|
||||
if expression_for_ranking is not None:
|
||||
rank_key = expression_for_ranking.reindex(df.index).mean(axis=1)
|
||||
df = df.assign(_rank=rank_key).sort_values("_rank", ascending=False)
|
||||
else:
|
||||
df = df.sort_values("qvalue", ascending=True)
|
||||
|
||||
collapsed = df[~df["symbol"].duplicated(keep="first")].set_index("symbol")
|
||||
return collapsed.drop(columns=[c for c in ("_rank",) if c in collapsed.columns])
|
||||
|
||||
|
||||
def concordance_filter(
|
||||
de_a: pd.DataFrame,
|
||||
de_b: pd.DataFrame,
|
||||
*,
|
||||
qvalue_cutoff: float = QVALUE_CUTOFF,
|
||||
) -> tuple[pd.DataFrame, ConcordanceSummary]:
|
||||
"""Keep only genes concordant across two symbol-level DE tables (R3).
|
||||
|
||||
A gene qualifies iff q<``qvalue_cutoff`` in BOTH studies AND log_fc has the same sign in
|
||||
both. Reported ``log_fc`` = mean of the two; ``qvalue`` = max of the two (worst case).
|
||||
Ranked by ``max(q_a, q_b)`` ascending.
|
||||
|
||||
Returns:
|
||||
(concordant_table indexed by symbol with log_fc/qvalue, ConcordanceSummary).
|
||||
"""
|
||||
merged = de_a.join(de_b, lsuffix="_a", rsuffix="_b", how="inner")
|
||||
n_tested = len(merged)
|
||||
|
||||
sig_both = (merged["qvalue_a"] < qvalue_cutoff) & (merged["qvalue_b"] < qvalue_cutoff)
|
||||
same_sign = np.sign(merged["log_fc_a"]) == np.sign(merged["log_fc_b"])
|
||||
keep = merged[sig_both & same_sign].copy()
|
||||
|
||||
keep["log_fc"] = (keep["log_fc_a"] + keep["log_fc_b"]) / 2.0
|
||||
keep["qvalue"] = keep[["qvalue_a", "qvalue_b"]].max(axis=1)
|
||||
keep = keep[["log_fc", "qvalue"]].sort_values("qvalue")
|
||||
|
||||
summary = ConcordanceSummary(
|
||||
n_genes_tested=n_tested,
|
||||
n_concordant=len(keep),
|
||||
n_up=int((keep["log_fc"] > 0).sum()),
|
||||
n_down=int((keep["log_fc"] < 0).sum()),
|
||||
)
|
||||
return keep, summary
|
||||
|
||||
|
||||
def build_signature(
|
||||
de_table: pd.DataFrame,
|
||||
concordant_table: pd.DataFrame,
|
||||
provenance: SignatureProvenance,
|
||||
*,
|
||||
tier: ConfidenceTier,
|
||||
tier_rationale: str,
|
||||
limitations: list[str],
|
||||
id_map: pd.DataFrame | None = None,
|
||||
top_n: int = TOP_N_PER_DIRECTION,
|
||||
qvalue_cutoff: float = QVALUE_CUTOFF,
|
||||
) -> DiseaseSignature:
|
||||
"""Assemble a ``DiseaseSignature`` from a differential expression table.
|
||||
"""Assemble a ``DiseaseSignature`` from the concordant gene table.
|
||||
|
||||
Takes the top ``top_n`` up- and down-regulated genes (by qvalue, cut at
|
||||
``qvalue_cutoff``) per PLAN.md §6, Week 1 task 5.
|
||||
Takes the top ``top_n`` up- and down-regulated genes by qvalue per direction (R3). If
|
||||
fewer than ``top_n`` qualify in a direction, takes all available (the caller should have
|
||||
logged the count). ``id_map`` optionally provides 'entrez_id'/'ensembl_id' columns indexed
|
||||
by symbol (from ``mygene``).
|
||||
|
||||
Args:
|
||||
concordant_table: Output of ``concordance_filter`` (indexed by symbol).
|
||||
provenance: Fully populated ``SignatureProvenance`` (both studies + concordance).
|
||||
tier: Confidence tier (Tier A only if genuinely multi-source).
|
||||
tier_rationale: One-line justification of the tier.
|
||||
limitations: Honest limitations list (R8).
|
||||
id_map: Optional symbol -> {entrez_id, ensembl_id} table.
|
||||
top_n: Max genes per direction.
|
||||
|
||||
Returns:
|
||||
A populated ``DiseaseSignature``.
|
||||
"""
|
||||
raise NotImplementedError("Signature assembly: implement in Week 1 (notebook 02).")
|
||||
|
||||
def _entries(rows: pd.DataFrame) -> list[GeneEntry]:
|
||||
entries: list[GeneEntry] = []
|
||||
for symbol, row in rows.iterrows():
|
||||
ids = id_map.loc[symbol] if (id_map is not None and symbol in id_map.index) else None
|
||||
entries.append(
|
||||
GeneEntry(
|
||||
gene=str(symbol),
|
||||
entrez_id=(str(ids["entrez_id"]) if ids is not None and pd.notna(ids.get("entrez_id")) else None),
|
||||
ensembl_id=(str(ids["ensembl_id"]) if ids is not None and pd.notna(ids.get("ensembl_id")) else None),
|
||||
log_fc=float(row["log_fc"]),
|
||||
qvalue=float(row["qvalue"]),
|
||||
)
|
||||
)
|
||||
return entries
|
||||
|
||||
up = concordant_table[concordant_table["log_fc"] > 0].sort_values("qvalue").head(top_n)
|
||||
down = concordant_table[concordant_table["log_fc"] < 0].sort_values("qvalue").head(top_n)
|
||||
|
||||
return DiseaseSignature(
|
||||
up_regulated=_entries(up),
|
||||
down_regulated=_entries(down),
|
||||
provenance=provenance,
|
||||
confidence_tier=tier,
|
||||
tier_rationale=tier_rationale,
|
||||
limitations=limitations,
|
||||
)
|
||||
|
||||
|
||||
def persist_signature(signature: DiseaseSignature, out_path: Path | None = None) -> Path:
|
||||
|
||||
143
src/scoring.py
143
src/scoring.py
@@ -1,33 +1,65 @@
|
||||
"""CMap-style connectivity scoring — the matching engine.
|
||||
"""CMap-style connectivity scoring — the matching engine (Week 3, PLAN §6).
|
||||
|
||||
Week 3 (PLAN.md §6). Scores each drug's LINCS signature against the disease signature using
|
||||
weighted Kolmogorov-Smirnov enrichment (Lamb 2006 / Subramanian 2017). Strongly *negative*
|
||||
connectivity = strong reversal of the disease signature = candidate match.
|
||||
Scores each drug's LINCS consensus signature against the disease signature using the weighted
|
||||
Kolmogorov-Smirnov / GSEA enrichment statistic (Lamb 2006; Subramanian 2017). The query is the
|
||||
disease up/down gene sets; the reference is each drug's 978 landmark genes ranked by z-score.
|
||||
|
||||
Uses ``cmapPy`` as the reference implementation. ``tests/test_scoring.py`` verifies the
|
||||
implementation against a known reference.
|
||||
Sign convention (PLAN §6): strongly **negative** connectivity = strong **reversal** of the
|
||||
disease signature = candidate match. A drug that down-regulates the disease's up-genes and
|
||||
up-regulates its down-genes scores negative.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from pydantic import BaseModel
|
||||
|
||||
from . import RESULTS_DIR
|
||||
|
||||
# Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
|
||||
# Keys are pathway categories; values are substrings matched (case-insensitive) against a
|
||||
# drug's ChEMBL target names.
|
||||
SICKLE_PATHWAYS: dict[str, tuple[str, ...]] = {
|
||||
"hbf_epigenetic": ("histone deacetylase", "hdac", "methyltransferase", "dnmt",
|
||||
"ribonucleoside-diphosphate reductase", "ribonucleotide reductase"),
|
||||
"hemoglobin": ("hemoglobin", "globin"),
|
||||
"no_signaling": ("nitric oxide", "guanylate cyclase", "phosphodiesterase 5", "pde5"),
|
||||
"inflammation": ("cyclooxygenase", "prostaglandin", "nf-kappa", "interleukin",
|
||||
"leukotriene", "selectin", "tumor necrosis factor"),
|
||||
"oxidative_stress": ("glutathione", "superoxide", "nadph oxidase", "thioredoxin", "nrf2"),
|
||||
}
|
||||
|
||||
class ConnectivityResult(BaseModel):
|
||||
"""Connectivity score for a single drug against the disease signature."""
|
||||
|
||||
chembl_id: str
|
||||
drug_name: str
|
||||
connectivity_score: float | None # None when the drug has no LINCS signature.
|
||||
normalized_score: float | None = None
|
||||
p_value: float | None = None
|
||||
scored: bool # False => no signature available, not scored (do not skip silently).
|
||||
n_genes_overlap: int | None = None
|
||||
def _enrichment_score(drug_profile: pd.Series, gene_set: set[str], weight: float = 1.0) -> float:
|
||||
"""Weighted GSEA/KS enrichment score of ``gene_set`` in a drug's ranked profile.
|
||||
|
||||
The profile is ranked by z-score (most up-regulated first). Hits increment a running sum in
|
||||
proportion to ``|z|**weight``; misses decrement uniformly. ES is the max signed deviation
|
||||
from zero. ES>0 => set enriched among up-regulated genes; ES<0 => among down-regulated.
|
||||
"""
|
||||
s = drug_profile.sort_values(ascending=False)
|
||||
genes = s.index.to_numpy()
|
||||
vals = s.to_numpy(dtype=float)
|
||||
hit = np.fromiter((g in gene_set for g in genes), dtype=bool, count=len(genes))
|
||||
|
||||
n_hit = int(hit.sum())
|
||||
n = len(genes)
|
||||
if n_hit == 0 or n_hit == n:
|
||||
return 0.0
|
||||
|
||||
w = (np.abs(vals) ** weight) * hit
|
||||
sum_hit = w.sum()
|
||||
if sum_hit == 0:
|
||||
return 0.0
|
||||
|
||||
inc = w / sum_hit
|
||||
dec = (~hit) / (n - n_hit)
|
||||
running = np.cumsum(inc - dec)
|
||||
|
||||
hi, lo = running.max(), running.min()
|
||||
return float(hi if abs(hi) >= abs(lo) else lo)
|
||||
|
||||
|
||||
def connectivity_score(
|
||||
@@ -35,46 +67,71 @@ def connectivity_score(
|
||||
down_genes: list[str],
|
||||
drug_signature: pd.Series,
|
||||
) -> float:
|
||||
"""Weighted KS connectivity score for one drug vs the disease up/down gene sets.
|
||||
"""Weighted connectivity score (WTCS) for one drug vs the disease up/down sets.
|
||||
|
||||
Only the intersection of disease-signature genes and LINCS landmark genes is scored;
|
||||
callers must record the overlap count (PLAN.md §6, Week 3 task 2).
|
||||
|
||||
Args:
|
||||
up_genes: Disease up-regulated gene identifiers.
|
||||
down_genes: Disease down-regulated gene identifiers.
|
||||
drug_signature: Drug's expression vector indexed by gene identifier.
|
||||
|
||||
Returns:
|
||||
Connectivity score in roughly [-1, 1]; strongly negative = strong reversal.
|
||||
Only query genes present in the drug's profile index (the 978 landmarks) are used — callers
|
||||
should record the overlap count (PLAN §6 Week 3 task 2). Returns the WTCS: if the two
|
||||
enrichment scores share a sign the result is 0 (ambiguous), else ``(ES_up - ES_down)/2``.
|
||||
Negative => reversal => candidate.
|
||||
"""
|
||||
raise NotImplementedError("Connectivity scoring: implement in Week 3 (notebook 04).")
|
||||
profile_genes = set(drug_signature.index)
|
||||
up = set(up_genes) & profile_genes
|
||||
down = set(down_genes) & profile_genes
|
||||
|
||||
es_up = _enrichment_score(drug_signature, up)
|
||||
es_down = _enrichment_score(drug_signature, down)
|
||||
|
||||
if np.sign(es_up) == np.sign(es_down):
|
||||
return 0.0
|
||||
return (es_up - es_down) / 2.0
|
||||
|
||||
|
||||
def normalize_scores(scores: pd.Series) -> pd.Series:
|
||||
"""Signed normalization (NCS, Subramanian 2017): divide by the mean magnitude of same-sign
|
||||
scores, so positive and negative tails are separately scaled to a mean magnitude of 1."""
|
||||
out = scores.astype(float).copy()
|
||||
pos_mean = scores[scores > 0].mean()
|
||||
neg_mean = scores[scores < 0].abs().mean()
|
||||
if pos_mean and not np.isnan(pos_mean):
|
||||
out[scores > 0] = scores[scores > 0] / pos_mean
|
||||
if neg_mean and not np.isnan(neg_mean):
|
||||
out[scores < 0] = scores[scores < 0] / neg_mean
|
||||
return out
|
||||
|
||||
|
||||
def rank_drugs(
|
||||
signature_up: list[str],
|
||||
signature_down: list[str],
|
||||
drug_profiles: pd.DataFrame,
|
||||
up_genes: list[str],
|
||||
down_genes: list[str],
|
||||
signature_matrix: pd.DataFrame,
|
||||
) -> pd.DataFrame:
|
||||
"""Score and rank all drugs against the disease signature.
|
||||
"""Score and rank all drugs (rows of ``signature_matrix``: drug x landmark-gene z-scores).
|
||||
|
||||
Drugs without a LINCS signature are marked ``scored=False`` and excluded from the ranking
|
||||
rather than dropped silently (PLAN.md §6, Week 3 task 2).
|
||||
|
||||
Returns a ranked table with the columns described in PLAN.md §6 (rank, drug_name,
|
||||
chembl_id, connectivity_score, normalized_score, p_value, inclusion_reason,
|
||||
known_targets, mechanism_summary).
|
||||
Returns a table indexed by drug with ``rank`` (1 = strongest reversal = most negative),
|
||||
``connectivity_score`` and ``normalized_score``. Drugs are expected to all have signatures
|
||||
here; signature-less drugs are handled (marked not-scored) by the orchestration layer per
|
||||
PLAN §6 Week 3 task 2.
|
||||
"""
|
||||
raise NotImplementedError("Drug ranking: implement in Week 3 (notebook 04).")
|
||||
scores = pd.Series(
|
||||
{drug: connectivity_score(up_genes, down_genes, signature_matrix.loc[drug])
|
||||
for drug in signature_matrix.index},
|
||||
name="connectivity_score",
|
||||
)
|
||||
df = pd.DataFrame({"connectivity_score": scores, "normalized_score": normalize_scores(scores)})
|
||||
df = df.sort_values("connectivity_score") # most negative (reversal) first
|
||||
df.insert(0, "rank", range(1, len(df) + 1))
|
||||
return df
|
||||
|
||||
|
||||
def mechanistic_prior(targets: list[str]) -> float:
|
||||
"""Prior weight for a drug based on sickle-cell-relevant target pathways.
|
||||
"""Count of sickle-cell-relevant pathway categories a drug's targets hit (PLAN §6 task 3).
|
||||
|
||||
Pathways of interest: HbF regulation, hemoglobin, NO signaling, inflammation, oxidative
|
||||
stress (PLAN.md §6, Week 3 task 3). Used to build the secondary, prior-weighted ranking.
|
||||
Higher = more mechanistically plausible. Used to build the secondary, prior-weighted ranking
|
||||
alongside the raw connectivity ranking.
|
||||
"""
|
||||
raise NotImplementedError("Mechanistic prior: implement in Week 3 (notebook 04).")
|
||||
if not targets:
|
||||
return 0.0
|
||||
text = " ; ".join(str(t) for t in targets).lower()
|
||||
return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))
|
||||
|
||||
|
||||
def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
|
||||
|
||||
122
tests/test_disease.py
Normal file
122
tests/test_disease.py
Normal file
@@ -0,0 +1,122 @@
|
||||
"""Tests for the Week 1 signature-construction logic on synthetic data.
|
||||
|
||||
These verify the DE / probe-collapse / concordance math without touching the network, so the
|
||||
pipeline is trustworthy before it is pointed at real GEO studies.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import pytest
|
||||
|
||||
from src.disease import (
|
||||
DISEASE_LABEL,
|
||||
HEALTHY_LABEL,
|
||||
ConcordanceSummary,
|
||||
SignatureProvenance,
|
||||
StudyProvenance,
|
||||
build_signature,
|
||||
collapse_probes_to_symbols,
|
||||
compute_differential_expression,
|
||||
concordance_filter,
|
||||
)
|
||||
from src.provenance import ConfidenceTier
|
||||
|
||||
|
||||
def _synthetic_study(seed: int, n_per_group: int = 12) -> tuple[pd.DataFrame, pd.Series]:
|
||||
"""Genes x samples matrix where UP is higher and DOWN is lower in disease."""
|
||||
rng = np.random.default_rng(seed)
|
||||
genes = ["UP1", "UP2", "DOWN1", "DOWN2", "NOISE1", "NOISE2"]
|
||||
samples = [f"d{i}" for i in range(n_per_group)] + [f"h{i}" for i in range(n_per_group)]
|
||||
groups = pd.Series([DISEASE_LABEL] * n_per_group + [HEALTHY_LABEL] * n_per_group, index=samples)
|
||||
|
||||
base = rng.normal(8.0, 0.3, size=(len(genes), len(samples)))
|
||||
df = pd.DataFrame(base, index=genes, columns=samples)
|
||||
disease = groups == DISEASE_LABEL
|
||||
df.loc["UP1", disease] += 3.0
|
||||
df.loc["UP2", disease] += 2.0
|
||||
df.loc["DOWN1", disease] -= 3.0
|
||||
df.loc["DOWN2", disease] -= 2.0
|
||||
return df, groups
|
||||
|
||||
|
||||
def test_welch_de_recovers_direction_and_significance():
|
||||
expr, groups = _synthetic_study(seed=1)
|
||||
de = compute_differential_expression(expr, groups, method="welch")
|
||||
|
||||
assert de.loc["UP1", "log_fc"] > 0 and de.loc["UP1", "qvalue"] < 0.05
|
||||
assert de.loc["DOWN1", "log_fc"] < 0 and de.loc["DOWN1", "qvalue"] < 0.05
|
||||
# Pure-noise genes should not be significant.
|
||||
assert de.loc["NOISE1", "qvalue"] > 0.05
|
||||
|
||||
|
||||
def test_compute_de_rejects_unlabelled_samples():
|
||||
expr, groups = _synthetic_study(seed=2)
|
||||
with pytest.raises(ValueError):
|
||||
compute_differential_expression(expr, groups.iloc[:-3], method="welch")
|
||||
|
||||
|
||||
def test_collapse_probes_keeps_highest_mean_expression():
|
||||
de = pd.DataFrame(
|
||||
{"log_fc": [1.0, 2.0, -1.0], "pvalue": [0.1, 0.2, 0.3], "qvalue": [0.1, 0.2, 0.3]},
|
||||
index=["probeA1", "probeA2", "probeB1"],
|
||||
)
|
||||
probe_to_symbol = pd.Series({"probeA1": "GENEA", "probeA2": "GENEA", "probeB1": "GENEB"})
|
||||
expr = pd.DataFrame(
|
||||
{"s1": [1.0, 100.0, 5.0], "s2": [1.0, 100.0, 5.0]},
|
||||
index=["probeA1", "probeA2", "probeB1"],
|
||||
)
|
||||
collapsed = collapse_probes_to_symbols(de, probe_to_symbol, expression_for_ranking=expr)
|
||||
|
||||
assert set(collapsed.index) == {"GENEA", "GENEB"}
|
||||
# probeA2 has the higher mean expression, so its log_fc (2.0) should win.
|
||||
assert collapsed.loc["GENEA", "log_fc"] == 2.0
|
||||
|
||||
|
||||
def test_concordance_filter_keeps_only_agreeing_genes():
|
||||
de_a = pd.DataFrame(
|
||||
{"log_fc": [2.0, -2.0, 1.5, 0.1], "qvalue": [0.001, 0.001, 0.2, 0.001]},
|
||||
index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
|
||||
)
|
||||
de_b = pd.DataFrame(
|
||||
{"log_fc": [1.8, -2.2, 1.4, -0.1], "qvalue": [0.002, 0.002, 0.2, 0.002]},
|
||||
index=["UP1", "DOWN1", "WEAK", "DISAGREE"],
|
||||
)
|
||||
keep, summary = concordance_filter(de_a, de_b)
|
||||
|
||||
assert set(keep.index) == {"UP1", "DOWN1"} # WEAK fails q-cut; DISAGREE flips sign
|
||||
assert keep.loc["UP1", "log_fc"] == pytest.approx(1.9) # mean of the two
|
||||
assert keep.loc["UP1", "qvalue"] == pytest.approx(0.002) # max of the two
|
||||
assert isinstance(summary, ConcordanceSummary)
|
||||
assert summary.n_genes_tested == 4 and summary.n_concordant == 2
|
||||
assert summary.n_up == 1 and summary.n_down == 1
|
||||
|
||||
|
||||
def test_build_signature_splits_directions_and_respects_top_n():
|
||||
concordant = pd.DataFrame(
|
||||
{
|
||||
"log_fc": [3.0, 2.0, 1.0, -1.0, -2.0],
|
||||
"qvalue": [0.001, 0.002, 0.003, 0.004, 0.005],
|
||||
},
|
||||
index=["UP1", "UP2", "UP3", "DOWN1", "DOWN2"],
|
||||
)
|
||||
prov = SignatureProvenance(
|
||||
studies=[
|
||||
StudyProvenance(geo_accession="GSE1", n_disease=12, n_healthy=12,
|
||||
platform="P", tissue="whole blood", method="welch"),
|
||||
StudyProvenance(geo_accession="GSE2", n_disease=15, n_healthy=11,
|
||||
platform="P", tissue="whole blood", method="welch"),
|
||||
],
|
||||
concordance=ConcordanceSummary(n_genes_tested=100, n_concordant=5, n_up=3, n_down=2),
|
||||
created_date="2026-06-23",
|
||||
)
|
||||
sig = build_signature(
|
||||
concordant, prov, tier=ConfidenceTier.A,
|
||||
tier_rationale="Two-study concordance", limitations=["cell-composition confound"],
|
||||
top_n=2,
|
||||
)
|
||||
|
||||
assert [g.gene for g in sig.up_regulated] == ["UP1", "UP2"] # top 2 up by qvalue
|
||||
assert [g.gene for g in sig.down_regulated] == ["DOWN1", "DOWN2"]
|
||||
assert sig.confidence_tier == ConfidenceTier.A
|
||||
@@ -1,14 +1,13 @@
|
||||
"""Tests for the matching engine and provenance logic.
|
||||
|
||||
The headline test (PLAN.md §6, Week 3 task 4) verifies connectivity scoring against a known
|
||||
reference within tolerance; it is marked xfail until the scorer is implemented in Week 3.
|
||||
|
||||
The tier-assignment tests run today — they pin the rules from PLAN.md §3 so the most
|
||||
Connectivity tests (PLAN.md §6, Week 3 task 4) pin the weighted-KS scorer against hand-built
|
||||
reference profiles. The tier-assignment tests pin the rules from PLAN.md §3 so the most
|
||||
commercially important design decision can't silently drift.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import pandas as pd
|
||||
import pytest
|
||||
|
||||
from src.provenance import ConfidenceTier, assign_tier
|
||||
@@ -52,14 +51,76 @@ class TestAssignTier:
|
||||
assert assign_tier(**kwargs) == ConfidenceTier.B
|
||||
|
||||
|
||||
@pytest.mark.xfail(reason="Connectivity scoring implemented in Week 3 (notebook 04).", strict=True)
|
||||
def test_connectivity_score_matches_reference():
|
||||
"""Verify connectivity scoring against a CMap/cmapPy reference within tolerance.
|
||||
class TestConnectivityScore:
|
||||
"""Reference checks for the weighted-KS connectivity score (PLAN §6 Week 3 task 4).
|
||||
|
||||
PLAN.md §6, Week 3 task 4. Replace this body with a known reference example
|
||||
(disease up/down sets + drug signature -> expected score) once the scorer exists.
|
||||
Query: up = {U1, U2}, down = {D1, D2}. We build drug profiles with a known relationship to
|
||||
the query and assert the sign/ordering the CMap convention requires.
|
||||
"""
|
||||
from src.scoring import connectivity_score
|
||||
|
||||
score = connectivity_score(up_genes=[], down_genes=[], drug_signature=None) # noqa
|
||||
assert score == pytest.approx(0.0, abs=1e-6)
|
||||
UP = ["U1", "U2"]
|
||||
DOWN = ["D1", "D2"]
|
||||
|
||||
@staticmethod
|
||||
def _profile(values: dict[str, float]) -> pd.Series:
|
||||
# 20 filler genes at ~0 so the query genes sit clearly at the extremes.
|
||||
base = {f"N{i}": 0.01 * ((i % 5) - 2) for i in range(20)}
|
||||
base.update(values)
|
||||
return pd.Series(base)
|
||||
|
||||
def test_perfect_reversal_is_strongly_negative(self):
|
||||
from src.scoring import connectivity_score
|
||||
# Drug pushes disease-up genes DOWN (very negative) and disease-down genes UP (very
|
||||
# positive) => reversal => negative connectivity.
|
||||
prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
|
||||
assert connectivity_score(self.UP, self.DOWN, prof) < -0.4
|
||||
|
||||
def test_perfect_mimic_is_strongly_positive(self):
|
||||
from src.scoring import connectivity_score
|
||||
prof = self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7})
|
||||
assert connectivity_score(self.UP, self.DOWN, prof) > 0.4
|
||||
|
||||
def test_reversal_beats_mimic_and_null(self):
|
||||
from src.scoring import connectivity_score
|
||||
rev = connectivity_score(self.UP, self.DOWN, self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7}))
|
||||
mimic = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 8, "U2": 7, "D1": -8, "D2": -7}))
|
||||
null = connectivity_score(self.UP, self.DOWN, self._profile({"U1": 0.2, "U2": -0.1, "D1": 0.1, "D2": -0.2}))
|
||||
assert rev < null < mimic
|
||||
assert abs(null) < abs(rev)
|
||||
|
||||
def test_same_sign_enrichment_returns_zero(self):
|
||||
from src.scoring import connectivity_score
|
||||
# Both up- and down-sets at the top => same-sign ES => ambiguous => 0 (WTCS rule).
|
||||
prof = self._profile({"U1": 8, "U2": 7, "D1": 6, "D2": 5})
|
||||
assert connectivity_score(self.UP, self.DOWN, prof) == 0.0
|
||||
|
||||
def test_genes_absent_from_profile_are_ignored(self):
|
||||
from src.scoring import connectivity_score
|
||||
prof = self._profile({"U1": -8, "U2": -7, "D1": 8, "D2": 7})
|
||||
# Adding a query gene not in the profile must not change the score.
|
||||
s1 = connectivity_score(self.UP, self.DOWN, prof)
|
||||
s2 = connectivity_score(self.UP + ["NOT_IN_PROFILE"], self.DOWN, prof)
|
||||
assert s1 == pytest.approx(s2)
|
||||
|
||||
|
||||
class TestMechanisticPrior:
|
||||
def test_counts_distinct_sickle_pathways(self):
|
||||
from src.scoring import mechanistic_prior
|
||||
# ribonucleotide reductase (hydroxyurea) -> hbf_epigenetic category.
|
||||
assert mechanistic_prior(["Ribonucleoside-diphosphate reductase RR1"]) == 1.0
|
||||
# DNMT (epigenetic) + hemoglobin -> two categories.
|
||||
assert mechanistic_prior(["DNA (cytosine-5)-methyltransferase 1", "Hemoglobin subunit beta"]) == 2.0
|
||||
assert mechanistic_prior([]) == 0.0
|
||||
assert mechanistic_prior(["Some unrelated kinase"]) == 0.0
|
||||
|
||||
|
||||
def test_rank_drugs_orders_by_reversal():
|
||||
from src.scoring import rank_drugs
|
||||
genes = ["U1", "U2", "D1", "D2"] + [f"N{i}" for i in range(10)]
|
||||
base = {g: 0.0 for g in genes}
|
||||
reverser = {**base, "U1": -8, "U2": -7, "D1": 8, "D2": 7}
|
||||
mimic = {**base, "U1": 8, "U2": 7, "D1": -8, "D2": -7}
|
||||
matrix = pd.DataFrame([reverser, mimic], index=["reverser", "mimic"])
|
||||
ranked = rank_drugs(["U1", "U2"], ["D1", "D2"], matrix)
|
||||
assert ranked.loc["reverser", "rank"] == 1
|
||||
assert ranked.loc["reverser", "connectivity_score"] < ranked.loc["mimic", "connectivity_score"]
|
||||
|
||||
Reference in New Issue
Block a user