Week 3: CMap connectivity scoring engine + ranked candidates

Implement the matching engine (PLAN §6 Week 3):
- src/scoring.py: weighted-KS/GSEA enrichment, weighted connectivity score (WTCS, Lamb 2006 / Subramanian 2017), signed NCS normalization, rank_drugs, and a sickle-pathway mechanistic prior
- tests/test_scoring.py: real reference tests for the scorer (perfect reversal < null < mimic, same-sign -> 0, absent-gene invariance) + prior
- week3_scoring.py: score 300 drugs -> ranked_candidates_v1.csv with a raw ranking and a secondary mechanistic-prior-weighted ranking

Preliminary (formal recovery test is Week 4): hydroxyurea raw rank 40/300 (top 13%, just misses pre-registered top-10%), blended rank 7; L-glutamine WTCS=0 (ambiguous). Notably, anti-inflammatory SCD drugs cluster in the raw top tier: the engine reverses the inflammation axis, not the erythroid axis, traceable to the 12% landmark-overlap caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
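The rank figures quoted in the preliminary results can be checked with a few lines of arithmetic. This is only a sketch: the helper `within_top_percent` is hypothetical, and reading the pre-registered criterion as "rank / N at or below 10%" is an assumption, since the commit does not show how the threshold is evaluated.

```python
def within_top_percent(rank: int, n_drugs: int, percent: int) -> bool:
    """True if a 1-based rank falls inside the top `percent` of `n_drugs`.

    Hypothetical helper; integer math avoids floating-point edge cases.
    """
    return rank * 100 <= n_drugs * percent


N_DRUGS = 300                      # drugs scored, per the commit message
raw_rank, blended_rank = 40, 7     # hydroxyurea ranks, per the commit message

cutoff_rank = N_DRUGS * 10 // 100            # last rank still inside the top 10%
raw_percentile = 100 * raw_rank / N_DRUGS    # 40/300 as a percentage

print(cutoff_rank, round(raw_percentile, 1))          # prints: 30 13.3
print(within_top_percent(raw_rank, N_DRUGS, 10))      # prints: False
print(within_top_percent(blended_rank, N_DRUGS, 10))  # prints: True
```

Under that reading, the raw rank of 40 falls ten places short of the cutoff at rank 30, while the blended rank of 7 is well inside it.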
143
src/scoring.py
@@ -1,33 +1,65 @@
-"""CMap-style connectivity scoring — the matching engine.
+"""CMap-style connectivity scoring — the matching engine (Week 3, PLAN §6).

-Week 3 (PLAN.md §6). Scores each drug's LINCS signature against the disease signature using
-weighted Kolmogorov-Smirnov enrichment (Lamb 2006 / Subramanian 2017). Strongly *negative*
-connectivity = strong reversal of the disease signature = candidate match.
+Scores each drug's LINCS consensus signature against the disease signature using the weighted
+Kolmogorov-Smirnov / GSEA enrichment statistic (Lamb 2006; Subramanian 2017). The query is the
+disease up/down gene sets; the reference is each drug's 978 landmark genes ranked by z-score.

-Uses ``cmapPy`` as the reference implementation. ``tests/test_scoring.py`` verifies the
-implementation against a known reference.
+Sign convention (PLAN §6): strongly **negative** connectivity = strong **reversal** of the
+disease signature = candidate match. A drug that down-regulates the disease's up-genes and
+up-regulates its down-genes scores negative.
 """

 from __future__ import annotations

 from pathlib import Path

 import numpy as np
 import pandas as pd
 from pydantic import BaseModel

 from . import RESULTS_DIR

+# Sickle-cell-relevant target pathways for the mechanistic prior (PLAN §6 Week 3 task 3).
+# Keys are pathway categories; values are substrings matched (case-insensitive) against a
+# drug's ChEMBL target names.
+SICKLE_PATHWAYS: dict[str, tuple[str, ...]] = {
+    "hbf_epigenetic": ("histone deacetylase", "hdac", "methyltransferase", "dnmt",
+                       "ribonucleoside-diphosphate reductase", "ribonucleotide reductase"),
+    "hemoglobin": ("hemoglobin", "globin"),
+    "no_signaling": ("nitric oxide", "guanylate cyclase", "phosphodiesterase 5", "pde5"),
+    "inflammation": ("cyclooxygenase", "prostaglandin", "nf-kappa", "interleukin",
+                     "leukotriene", "selectin", "tumor necrosis factor"),
+    "oxidative_stress": ("glutathione", "superoxide", "nadph oxidase", "thioredoxin", "nrf2"),
+}

 class ConnectivityResult(BaseModel):
     """Connectivity score for a single drug against the disease signature."""

     chembl_id: str
     drug_name: str
     connectivity_score: float | None  # None when the drug has no LINCS signature.
     normalized_score: float | None = None
     p_value: float | None = None
     scored: bool  # False => no signature available, not scored (do not skip silently).
     n_genes_overlap: int | None = None
+def _enrichment_score(drug_profile: pd.Series, gene_set: set[str], weight: float = 1.0) -> float:
+    """Weighted GSEA/KS enrichment score of ``gene_set`` in a drug's ranked profile.
+
+    The profile is ranked by z-score (most up-regulated first). Hits increment a running sum in
+    proportion to ``|z|**weight``; misses decrement uniformly. ES is the max signed deviation
+    from zero. ES>0 => set enriched among up-regulated genes; ES<0 => among down-regulated.
+    """
+    s = drug_profile.sort_values(ascending=False)
+    genes = s.index.to_numpy()
+    vals = s.to_numpy(dtype=float)
+    hit = np.fromiter((g in gene_set for g in genes), dtype=bool, count=len(genes))
+
+    n_hit = int(hit.sum())
+    n = len(genes)
+    if n_hit == 0 or n_hit == n:
+        return 0.0
+
+    w = (np.abs(vals) ** weight) * hit
+    sum_hit = w.sum()
+    if sum_hit == 0:
+        return 0.0
+
+    inc = w / sum_hit
+    dec = (~hit) / (n - n_hit)
+    running = np.cumsum(inc - dec)
+
+    hi, lo = running.max(), running.min()
+    return float(hi if abs(hi) >= abs(lo) else lo)


 def connectivity_score(
@@ -35,46 +67,71 @@ def connectivity_score(
     down_genes: list[str],
     drug_signature: pd.Series,
 ) -> float:
-    """Weighted KS connectivity score for one drug vs the disease up/down gene sets.
+    """Weighted connectivity score (WTCS) for one drug vs the disease up/down sets.

-    Only the intersection of disease-signature genes and LINCS landmark genes is scored;
-    callers must record the overlap count (PLAN.md §6, Week 3 task 2).
-
-    Args:
-        up_genes: Disease up-regulated gene identifiers.
-        down_genes: Disease down-regulated gene identifiers.
-        drug_signature: Drug's expression vector indexed by gene identifier.
-
-    Returns:
-        Connectivity score in roughly [-1, 1]; strongly negative = strong reversal.
+    Only query genes present in the drug's profile index (the 978 landmarks) are used — callers
+    should record the overlap count (PLAN §6 Week 3 task 2). Returns the WTCS: if the two
+    enrichment scores share a sign the result is 0 (ambiguous), else ``(ES_up - ES_down)/2``.
+    Negative => reversal => candidate.
     """
-    raise NotImplementedError("Connectivity scoring: implement in Week 3 (notebook 04).")
+    profile_genes = set(drug_signature.index)
+    up = set(up_genes) & profile_genes
+    down = set(down_genes) & profile_genes
+
+    es_up = _enrichment_score(drug_signature, up)
+    es_down = _enrichment_score(drug_signature, down)
+
+    if np.sign(es_up) == np.sign(es_down):
+        return 0.0
+    return (es_up - es_down) / 2.0
+
+
+def normalize_scores(scores: pd.Series) -> pd.Series:
+    """Signed normalization (NCS, Subramanian 2017): divide by the mean magnitude of same-sign
+    scores, so positive and negative tails are separately scaled to a mean magnitude of 1."""
+    out = scores.astype(float).copy()
+    pos_mean = scores[scores > 0].mean()
+    neg_mean = scores[scores < 0].abs().mean()
+    if pos_mean and not np.isnan(pos_mean):
+        out[scores > 0] = scores[scores > 0] / pos_mean
+    if neg_mean and not np.isnan(neg_mean):
+        out[scores < 0] = scores[scores < 0] / neg_mean
+    return out


 def rank_drugs(
-    signature_up: list[str],
-    signature_down: list[str],
-    drug_profiles: pd.DataFrame,
+    up_genes: list[str],
+    down_genes: list[str],
+    signature_matrix: pd.DataFrame,
 ) -> pd.DataFrame:
-    """Score and rank all drugs against the disease signature.
+    """Score and rank all drugs (rows of ``signature_matrix``: drug x landmark-gene z-scores).

-    Drugs without a LINCS signature are marked ``scored=False`` and excluded from the ranking
-    rather than dropped silently (PLAN.md §6, Week 3 task 2).
-
-    Returns a ranked table with the columns described in PLAN.md §6 (rank, drug_name,
-    chembl_id, connectivity_score, normalized_score, p_value, inclusion_reason,
-    known_targets, mechanism_summary).
+    Returns a table indexed by drug with ``rank`` (1 = strongest reversal = most negative),
+    ``connectivity_score`` and ``normalized_score``. Drugs are expected to all have signatures
+    here; signature-less drugs are handled (marked not-scored) by the orchestration layer per
+    PLAN §6 Week 3 task 2.
     """
-    raise NotImplementedError("Drug ranking: implement in Week 3 (notebook 04).")
+    scores = pd.Series(
+        {drug: connectivity_score(up_genes, down_genes, signature_matrix.loc[drug])
+         for drug in signature_matrix.index},
+        name="connectivity_score",
+    )
+    df = pd.DataFrame({"connectivity_score": scores, "normalized_score": normalize_scores(scores)})
+    df = df.sort_values("connectivity_score")  # most negative (reversal) first
+    df.insert(0, "rank", range(1, len(df) + 1))
+    return df


 def mechanistic_prior(targets: list[str]) -> float:
-    """Prior weight for a drug based on sickle-cell-relevant target pathways.
+    """Count of sickle-cell-relevant pathway categories a drug's targets hit (PLAN §6 task 3).

-    Pathways of interest: HbF regulation, hemoglobin, NO signaling, inflammation, oxidative
-    stress (PLAN.md §6, Week 3 task 3). Used to build the secondary, prior-weighted ranking.
+    Higher = more mechanistically plausible. Used to build the secondary, prior-weighted ranking
+    alongside the raw connectivity ranking.
     """
-    raise NotImplementedError("Mechanistic prior: implement in Week 3 (notebook 04).")
+    if not targets:
+        return 0.0
+    text = " ; ".join(str(t) for t in targets).lower()
+    return float(sum(any(kw in text for kw in kws) for kws in SICKLE_PATHWAYS.values()))


 def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
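
The commit message lists the scorer properties that tests/test_scoring.py is meant to cover (perfect reversal < null < mimic, same-sign -> 0, absent-gene invariance), but the test file is not part of this diff. The sketch below illustrates those properties on a toy 10-gene profile. It makes several assumptions: `enrichment_score` and `wtcs` are local stand-ins that copy the logic of `_enrichment_score` and `connectivity_score` from the diff above, and the gene names and z-scores are invented for illustration, not taken from the project's data or tests.

```python
from __future__ import annotations

import numpy as np
import pandas as pd


def enrichment_score(profile: pd.Series, gene_set: set[str], weight: float = 1.0) -> float:
    """Weighted KS/GSEA enrichment score (stand-in mirroring the diff's logic)."""
    s = profile.sort_values(ascending=False)          # most up-regulated first
    vals = s.to_numpy(dtype=float)
    hit = np.array([g in gene_set for g in s.index], dtype=bool)
    n, n_hit = len(s), int(hit.sum())
    if n_hit == 0 or n_hit == n:
        return 0.0
    w = (np.abs(vals) ** weight) * hit                # hits weighted by |z|**weight
    if w.sum() == 0:
        return 0.0
    running = np.cumsum(w / w.sum() - (~hit) / (n - n_hit))
    hi, lo = running.max(), running.min()
    return float(hi if abs(hi) >= abs(lo) else lo)    # max signed deviation from zero


def wtcs(up: list[str], down: list[str], profile: pd.Series) -> float:
    """Weighted connectivity score: 0 if both ES share a sign, else (ES_up - ES_down)/2."""
    present = set(profile.index)
    es_up = enrichment_score(profile, set(up) & present)
    es_down = enrichment_score(profile, set(down) & present)
    if np.sign(es_up) == np.sign(es_down):
        return 0.0
    return (es_up - es_down) / 2.0


# Toy data (invented): 10 genes, disease up = g0, g1; disease down = g8, g9.
GENES = [f"g{i}" for i in range(10)]
UP, DOWN = ["g0", "g1"], ["g8", "g9"]

mimic = pd.Series([3.0, 2.5, 0.5, 0.3, 0.1, -0.1, -0.3, -0.5, -2.5, -3.0], index=GENES)
reversal = -mimic                                      # flips every z-score
null = pd.Series([0.0, 0.0, 1.0, 0.5, 0.2, -0.2, -0.5, -1.0, 0.0, 0.0], index=GENES)
same_sign = pd.Series([3.0, 2.5, -0.5, -0.6, -0.7, -0.8, -0.9, -1.0, 2.8, 2.6], index=GENES)

s_rev, s_null, s_mim = wtcs(UP, DOWN, reversal), wtcs(UP, DOWN, null), wtcs(UP, DOWN, mimic)
s_same = wtcs(UP, DOWN, same_sign)
s_extra = wtcs(UP + ["NOT_A_LANDMARK"], DOWN, reversal)  # query gene absent from profile

print(s_rev, s_null, s_mim, s_same, s_extra)
```

On this toy input the reversal profile scores close to -1, the mimic close to +1, and the null and same-sign profiles score exactly 0. The absent gene leaves the score unchanged because it is dropped by the set intersection before any enrichment score is computed.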
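
The signed normalization described in the `normalize_scores` docstring can be illustrated on a handful of made-up scores. This is a minimal sketch, not the project's code: `normalize_signed` is a hypothetical restatement of the same idea, and the drug names and values are invented.

```python
import pandas as pd


def normalize_signed(scores: pd.Series) -> pd.Series:
    """Scale positive and negative scores separately to a mean magnitude of 1."""
    out = scores.astype(float).copy()
    pos = scores > 0
    neg = scores < 0
    if pos.any():
        out[pos] = scores[pos] / scores[pos].mean()
    if neg.any():
        out[neg] = scores[neg] / scores[neg].abs().mean()
    return out                                  # zeros are left untouched


# Invented raw connectivity scores for five hypothetical drugs.
raw = pd.Series({"drug_a": -0.8, "drug_b": -0.4, "drug_c": 0.0, "drug_d": 0.2, "drug_e": 0.6})
ncs = normalize_signed(raw)
print(ncs.round(3).to_dict())
```

Here the negative tail has mean magnitude 0.6 and the positive tail 0.4, so each tail is divided by a different constant. Within a tail the ordering of drugs is preserved, which is why ranking by the raw score and ranking by the normalized score agree for same-sign drugs.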