Week 2: 300-drug profiles with LINCS signatures + ChEMBL
Build the drug profile dataset (PLAN §6 Week 2): - week2_curate_drugset.py: 300-drug set (2 ground-truth + 32 related- mechanism + 26 negative-control + 240 random), restricted to LINCS-scorable compounds, seed=42 - week2_chembl.py: InChIKey->ChEMBL match (145/300), MoA + targets - week2_lincs_extract.py: cmapPy-slice both Level-5 GCTX phases to 978 landmark genes, mean-aggregate per drug to one consensus signature - week2_assemble.py: join into drug_profiles_v1.parquet, Tier B (LINCS single-source), scored flag per PLAN §6 Week 3 task 2 - docs/data_sources.md: drug set composition + LINCS/ChEMBL provenance Results (all gitignored data): 300/300 drugs scored, both ground-truth drugs present (hydroxyurea Phase II = CHEMBL467, L-glutamine Phase I). Key caveat recorded: only 56/477 (12%) of the disease signature genes are LINCS landmarks, so Week-3 scoring uses a 30-up/26-down query. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -12,8 +12,9 @@
|
||||
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
|
||||
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
|
||||
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
|
||||
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
|
||||
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
|
||||
| ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
|
||||
| LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
|
||||
| LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
|
||||
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
|
||||
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
|
||||
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
|
||||
@@ -42,6 +43,30 @@ robustness.
|
||||
Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
|
||||
`scripts/week1_finalize.py` (mygene mapping + persist).
|
||||
|
||||
## Drug profiles (Week 2)
|
||||
|
||||
300-drug set (`drug_set_v1.csv`), composed and restricted to LINCS-scorable compounds:
|
||||
|
||||
| Inclusion reason | n | Notes |
|
||||
|---|---|---|
|
||||
| ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
|
||||
| related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
|
||||
| negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
|
||||
| general_sample | 240 | random from LINCS catalog, seed=42 |
|
||||
|
||||
- **LINCS signatures:** per-drug consensus = mean of Level-5 MODZ z-scores across the drug's
|
||||
sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH
|
||||
phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
|
||||
- **ChEMBL:** matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds
|
||||
38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
|
||||
- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
|
||||
not-single-source rule).
|
||||
- **Signature↔landmark overlap:** only 56/477 (12%) of the disease signature genes are LINCS
|
||||
landmarks, so connectivity scoring (Week 3) uses a 30-up/26-down query. The erythroid hallmark
|
||||
genes (CA1, AHSP, SLC4A1, HBG) are NOT landmarks. This is a key limitation for the recovery test.
|
||||
- Reproduce: `week2_curate_drugset.py` → `week2_chembl.py` → download Level-5 GCTX →
|
||||
`week2_lincs_extract.py` → `week2_assemble.py`.
|
||||
|
||||
## Licensing note for LINCS
|
||||
|
||||
Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept)
|
||||
|
||||
91
scripts/week2_assemble.py
Normal file
91
scripts/week2_assemble.py
Normal file
@@ -0,0 +1,91 @@
|
||||
"""Week 2, task 4: assemble drug_profiles_v1.parquet (PLAN §6).
|
||||
|
||||
Joins the curated drug set + ChEMBL enrichment + LINCS consensus signatures into one profile
|
||||
table. Each drug carries a confidence tier: LINCS is a single source, so signature-backed drugs
|
||||
are Tier B at best (assign_tier with single_source=True); drugs with no signature are Tier C and
|
||||
marked not-scored (not dropped silently — PLAN §6 Week 3 task 2).
|
||||
|
||||
The 978-gene signature order is the column order of lincs_signatures_v1.parquet (landmark
|
||||
symbols); each profile's `lincs_signature` is that vector (or null).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import ast
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
|
||||
import sys
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
from src.drugs import persist_drug_profiles # noqa: E402
|
||||
from src.provenance import ConfidenceTier, assign_tier # noqa: E402
|
||||
|
||||
PROCESSED = Path("data/processed")
|
||||
DRUG_SET = PROCESSED / "drug_set_v1.csv"
|
||||
CHEMBL = Path("data/raw/chembl/chembl_enrichment.parquet")
|
||||
LINCS_SIG = PROCESSED / "lincs_signatures_v1.parquet"
|
||||
|
||||
|
||||
def main() -> None:
|
||||
drugs = pd.read_csv(DRUG_SET)
|
||||
chembl = pd.read_parquet(CHEMBL)[
|
||||
["pert_iname", "chembl_id", "pref_name", "smiles", "mechanism_of_action", "targets"]
|
||||
]
|
||||
sigs = pd.read_parquet(LINCS_SIG) # rows=pert_iname, cols=978 landmark symbols
|
||||
gene_order = list(sigs.columns)
|
||||
|
||||
df = drugs.merge(chembl, on="pert_iname", how="left")
|
||||
|
||||
rows = []
|
||||
for r in df.itertuples():
|
||||
has_sig = r.pert_iname in sigs.index
|
||||
vector = sigs.loc[r.pert_iname].tolist() if has_sig else None
|
||||
# LINCS = single source => Tier B max when measured; no signature => Tier C.
|
||||
tier = assign_tier(
|
||||
is_measured=has_sig, n_per_group=None, peer_reviewed=True, single_source=True
|
||||
) if has_sig else ConfidenceTier.C
|
||||
targets = r.targets
|
||||
if isinstance(targets, str):
|
||||
try:
|
||||
targets = ast.literal_eval(targets)
|
||||
except (ValueError, SyntaxError):
|
||||
targets = []
|
||||
elif hasattr(targets, "tolist"): # numpy ndarray from parquet round-trip
|
||||
targets = targets.tolist()
|
||||
elif targets is None or (not isinstance(targets, (list, tuple))):
|
||||
targets = []
|
||||
rows.append({
|
||||
"name": r.pert_iname,
|
||||
"chembl_id": r.chembl_id if pd.notna(r.chembl_id) else None,
|
||||
"pref_name": r.pref_name if pd.notna(r.pref_name) else None,
|
||||
"inchikey": r.inchi_key if pd.notna(r.inchi_key) else None,
|
||||
"smiles": r.smiles if pd.notna(r.smiles) else None,
|
||||
"targets": list(targets),
|
||||
"mechanism_of_action": r.mechanism_of_action if pd.notna(r.mechanism_of_action) else None,
|
||||
"inclusion_reason": r.inclusion_reason,
|
||||
"lincs_phase": r.phase,
|
||||
"scored": has_sig,
|
||||
"lincs_signature": vector,
|
||||
"confidence_tier": tier.value,
|
||||
})
|
||||
|
||||
profiles = pd.DataFrame(rows)
|
||||
# Persist the gene order alongside, so Week-3 scoring can align the vectors.
|
||||
(PROCESSED / "lincs_gene_order.txt").write_text("\n".join(gene_order))
|
||||
path = persist_drug_profiles(profiles)
|
||||
|
||||
print(f"drug_profiles_v1: {len(profiles)} drugs")
|
||||
print(f" scored (have LINCS signature): {profiles['scored'].sum()}")
|
||||
print(f" not scored: {(~profiles['scored']).sum()}")
|
||||
print(" by inclusion reason (scored rate):")
|
||||
print(profiles.groupby("inclusion_reason")["scored"].agg(["sum", "count"]).to_string())
|
||||
print(" tier split:", profiles["confidence_tier"].value_counts().to_dict())
|
||||
for gt in ["hydroxyurea", "glutamine"]:
|
||||
row = profiles[profiles["name"] == gt]
|
||||
print(f" ground truth '{gt}': scored={bool(row['scored'].iloc[0]) if len(row) else 'ABSENT'}")
|
||||
print(f"wrote {path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
99
scripts/week2_chembl.py
Normal file
99
scripts/week2_chembl.py
Normal file
@@ -0,0 +1,99 @@
|
||||
"""Week 2, task 2: enrich the drug set with ChEMBL structure/target/mechanism data.
|
||||
|
||||
Drugs are matched to ChEMBL by the InChIKey already carried from LINCS pert_info (reliable),
|
||||
then mechanism-of-action and target names are pulled. Compounds absent from ChEMBL (many
|
||||
research/tool compounds in the random arm) keep null ChEMBL fields — they still have LINCS
|
||||
signatures for scoring; only the Week-3 mechanistic prior won't apply. Output cached to
|
||||
data/raw/chembl/chembl_enrichment.parquet.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
from chembl_webresource_client.new_client import new_client
|
||||
|
||||
DRUG_SET = Path("data/processed/drug_set_v1.csv")
|
||||
OUT = Path("data/raw/chembl/chembl_enrichment.parquet")
|
||||
BATCH = 40
|
||||
|
||||
|
||||
def chunks(seq, n):
|
||||
for i in range(0, len(seq), n):
|
||||
yield seq[i:i + n]
|
||||
|
||||
|
||||
def main() -> None:
|
||||
drugs = pd.read_csv(DRUG_SET)
|
||||
inchikeys = sorted({k for k in drugs["inchi_key"].dropna() if isinstance(k, str) and len(k) > 10})
|
||||
print(f"{len(drugs)} drugs; {len(inchikeys)} usable InChIKeys to resolve")
|
||||
|
||||
molecule = new_client.molecule
|
||||
mechanism = new_client.mechanism
|
||||
target = new_client.target
|
||||
|
||||
# 1) InChIKey -> ChEMBL molecule (id, name, smiles)
|
||||
mol_rows = []
|
||||
for i, batch in enumerate(chunks(inchikeys, BATCH)):
|
||||
res = molecule.filter(molecule_structures__standard_inchi_key__in=batch).only(
|
||||
["molecule_chembl_id", "pref_name", "molecule_structures"])
|
||||
for m in res:
|
||||
ms = m.get("molecule_structures") or {}
|
||||
mol_rows.append({
|
||||
"chembl_id": m["molecule_chembl_id"],
|
||||
"pref_name": m.get("pref_name"),
|
||||
"smiles": ms.get("canonical_smiles"),
|
||||
"inchi_key": ms.get("standard_inchi_key"),
|
||||
})
|
||||
print(f" molecules batch {i+1}: cumulative {len(mol_rows)} hits", flush=True)
|
||||
mols = pd.DataFrame(mol_rows).drop_duplicates("inchi_key")
|
||||
chembl_ids = sorted(mols["chembl_id"].unique())
|
||||
print(f"resolved {len(mols)} molecules -> {len(chembl_ids)} ChEMBL ids")
|
||||
|
||||
# 2) ChEMBL id -> mechanism of action + target ids
|
||||
mech_rows = []
|
||||
for batch in chunks(chembl_ids, BATCH):
|
||||
for m in mechanism.filter(molecule_chembl_id__in=batch).only(
|
||||
["molecule_chembl_id", "mechanism_of_action", "target_chembl_id"]):
|
||||
mech_rows.append(m)
|
||||
mech = pd.DataFrame(mech_rows)
|
||||
print(f"mechanism records: {len(mech)}")
|
||||
|
||||
# 3) target id -> name
|
||||
tgt_names = {}
|
||||
if not mech.empty:
|
||||
tids = sorted({t for t in mech["target_chembl_id"].dropna().unique()})
|
||||
for batch in chunks(tids, BATCH):
|
||||
for t in target.filter(target_chembl_id__in=batch).only(["target_chembl_id", "pref_name"]):
|
||||
tgt_names[t["target_chembl_id"]] = t.get("pref_name")
|
||||
|
||||
# aggregate mechanism/targets per molecule
|
||||
def agg(df):
|
||||
moa = sorted({x for x in df["mechanism_of_action"].dropna()})
|
||||
tns = sorted({tgt_names.get(t) for t in df["target_chembl_id"].dropna() if tgt_names.get(t)})
|
||||
return pd.Series({"mechanism_of_action": "; ".join(moa) or None, "targets": tns})
|
||||
|
||||
if not mech.empty:
|
||||
per_mol = mech.groupby("molecule_chembl_id").apply(agg, include_groups=False).reset_index()
|
||||
per_mol = per_mol.rename(columns={"molecule_chembl_id": "chembl_id"})
|
||||
mols = mols.merge(per_mol, on="chembl_id", how="left")
|
||||
else:
|
||||
mols["mechanism_of_action"] = None
|
||||
mols["targets"] = None
|
||||
|
||||
# join back to the drug set on inchi_key
|
||||
enriched = drugs.merge(mols, on="inchi_key", how="left", suffixes=("", "_chembl"))
|
||||
OUT.parent.mkdir(parents=True, exist_ok=True)
|
||||
enriched.to_parquet(OUT, index=False)
|
||||
|
||||
n_resolved = enriched["chembl_id"].notna().sum()
|
||||
n_moa = enriched["mechanism_of_action"].notna().sum()
|
||||
print(f"\nenriched {len(enriched)} drugs: {n_resolved} matched ChEMBL, {n_moa} have MoA")
|
||||
print(f"by reason, ChEMBL match rate:")
|
||||
print(enriched.assign(matched=enriched["chembl_id"].notna()).groupby("inclusion_reason")["matched"].mean().round(2).to_string())
|
||||
print(f"wrote {OUT}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
131
scripts/week2_curate_drugset.py
Normal file
131
scripts/week2_curate_drugset.py
Normal file
@@ -0,0 +1,131 @@
|
||||
"""Week 2, task 1: curate the deliberately-composed ~300-drug set (PLAN §6).
|
||||
|
||||
Composition: 2 ground-truth + ~50 related-mechanism + ~50 negative controls + ~200 random.
|
||||
The universe is restricted to compounds that actually have a LINCS Level-5 signature (in
|
||||
Phase I and/or Phase II), so every curated drug is scorable. Output: drug_set_v1.csv.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import gzip
|
||||
import io
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
|
||||
import sys
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
from src import RANDOM_SEED # noqa: E402
|
||||
|
||||
LINCS = Path("data/raw/lincs")
|
||||
OUT = Path("data/processed/drug_set_v1.csv")
|
||||
|
||||
GROUND_TRUTH = ["hydroxyurea", "glutamine"] # glutamine == L-glutamine in LINCS
|
||||
|
||||
# Curated by mechanism (PLAN §6). Intersected with the LINCS catalog below, so misses are
|
||||
# silently dropped — we keep whatever actually has a signature.
|
||||
RELATED_MECHANISM = [
|
||||
# HbF inducers / epigenetic
|
||||
"decitabine", "azacitidine", "vorinostat", "panobinostat", "romidepsin", "entinostat",
|
||||
"mocetinostat", "belinostat", "pomalidomide", "lenalidomide", "thalidomide", "apicidin",
|
||||
"trichostatin-a", "scriptaid", "valproic-acid",
|
||||
# NO / vascular
|
||||
"sildenafil", "tadalafil", "nitroprusside",
|
||||
# antioxidants
|
||||
"n-acetyl-cysteine", "resveratrol", "curcumin", "quercetin", "sulforaphane",
|
||||
# anti-inflammatory studied in SCD
|
||||
"dexamethasone", "prednisolone", "hydrocortisone", "ibuprofen", "indomethacin",
|
||||
"sulfasalazine", "montelukast", "aspirin",
|
||||
# iron / heme / SCD-adjacent
|
||||
"hemin", "deferoxamine", "deferasirox", "simvastatin", "atorvastatin", "ticagrelor",
|
||||
]
|
||||
|
||||
NEGATIVE_CONTROL = [
|
||||
# antifungals
|
||||
"fluconazole", "ketoconazole", "itraconazole", "clotrimazole", "terbinafine", "miconazole",
|
||||
# antihistamines
|
||||
"loratadine", "cetirizine", "fexofenadine", "diphenhydramine", "chlorpheniramine",
|
||||
"astemizole",
|
||||
# antibiotics
|
||||
"amoxicillin", "ciprofloxacin", "doxycycline", "trimethoprim", "azithromycin", "tetracycline",
|
||||
"nitrofurantoin",
|
||||
# hormones / contraceptives
|
||||
"levonorgestrel", "ethinyl-estradiol", "norethindrone", "medroxyprogesterone-acetate",
|
||||
# misc unrelated
|
||||
"omeprazole", "ranitidine", "loperamide", "caffeine", "acetaminophen", "lidocaine",
|
||||
]
|
||||
|
||||
# Fill the random sample so the total set is ~300 (the denominator the pre-registered
|
||||
# recovery-test thresholds assume: "top 30 of 300"). Curated mechanism/control drugs are
|
||||
# capped by what LINCS actually contains, so the random arm absorbs the remainder.
|
||||
TARGET_TOTAL = 300
|
||||
|
||||
|
||||
def load_catalog() -> pd.DataFrame:
|
||||
"""Compounds with >=1 Level-5 signature, annotated with phase + inchi/smiles."""
|
||||
|
||||
def read_gz(fn, **kw):
|
||||
return pd.read_csv(io.BytesIO(gzip.decompress(Path(fn).read_bytes())), sep="\t", **kw)
|
||||
|
||||
sig1 = read_gz(LINCS / "GSE92742_sig_info.txt.gz", low_memory=False)
|
||||
sig2 = read_gz(LINCS / "GSE70138_sig_info.txt.gz", low_memory=False)
|
||||
cp1 = set(sig1[sig1["pert_type"] == "trt_cp"]["pert_iname"])
|
||||
cp2 = set(sig2[sig2["pert_type"] == "trt_cp"]["pert_iname"])
|
||||
|
||||
pert1 = read_gz(LINCS / "GSE92742_pert_info.txt.gz", low_memory=False)
|
||||
pert2 = read_gz(LINCS / "GSE70138_pert_info.txt.gz", low_memory=False)
|
||||
info = pd.concat([pert1, pert2], ignore_index=True)
|
||||
info = info[info["pert_type"] == "trt_cp"].drop_duplicates("pert_iname", keep="first")
|
||||
info = info.set_index("pert_iname")
|
||||
|
||||
names = cp1 | cp2
|
||||
rows = []
|
||||
for nm in names:
|
||||
phase = "both" if nm in cp1 and nm in cp2 else ("P1" if nm in cp1 else "P2")
|
||||
rec = info.loc[nm] if nm in info.index else None
|
||||
rows.append({
|
||||
"pert_iname": nm,
|
||||
"phase": phase,
|
||||
"pert_id": rec["pert_id"] if rec is not None else None,
|
||||
"inchi_key": rec["inchi_key"] if rec is not None else None,
|
||||
"canonical_smiles": rec["canonical_smiles"] if rec is not None else None,
|
||||
})
|
||||
return pd.DataFrame(rows).set_index("pert_iname")
|
||||
|
||||
|
||||
def pick(catalog: pd.DataFrame, names: list[str], reason: str) -> pd.DataFrame:
|
||||
present = [n for n in names if n in catalog.index]
|
||||
missing = [n for n in names if n not in catalog.index]
|
||||
if missing:
|
||||
print(f" [{reason}] {len(present)}/{len(names)} in LINCS; dropped: {missing}")
|
||||
out = catalog.loc[present].copy()
|
||||
out["inclusion_reason"] = reason
|
||||
return out
|
||||
|
||||
|
||||
def main() -> None:
|
||||
catalog = load_catalog()
|
||||
print(f"LINCS scorable compound universe: {len(catalog)}")
|
||||
|
||||
gt = pick(catalog, GROUND_TRUTH, "ground_truth")
|
||||
rel = pick(catalog, RELATED_MECHANISM, "related_mechanism")
|
||||
neg = pick(catalog, NEGATIVE_CONTROL, "negative_control")
|
||||
|
||||
chosen = pd.concat([gt, rel, neg])
|
||||
remaining = catalog.drop(index=chosen.index)
|
||||
n_random = TARGET_TOTAL - len(chosen)
|
||||
rand = remaining.sample(n=n_random, random_state=RANDOM_SEED).copy()
|
||||
rand["inclusion_reason"] = "general_sample"
|
||||
|
||||
drug_set = pd.concat([gt, rel, neg, rand]).reset_index()
|
||||
OUT.parent.mkdir(parents=True, exist_ok=True)
|
||||
drug_set.to_csv(OUT, index=False)
|
||||
|
||||
print(f"\ndrug_set_v1.csv: {len(drug_set)} drugs")
|
||||
print(drug_set["inclusion_reason"].value_counts().to_string())
|
||||
print(f"phase split:\n{drug_set['phase'].value_counts().to_string()}")
|
||||
print(f"wrote {OUT}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
101
scripts/week2_lincs_extract.py
Normal file
101
scripts/week2_lincs_extract.py
Normal file
@@ -0,0 +1,101 @@
|
||||
"""Week 2, task 3: extract per-drug LINCS L1000 consensus signatures.
|
||||
|
||||
For each drug in the set, slice its Level-5 MODZ signatures (978 landmark genes x its sig_ids)
|
||||
out of the big GCTX via cmapPy, then aggregate to ONE consensus 978-vector per drug by mean
|
||||
across its sig_ids (the "MODZ aggregation across cell lines/replicates" of PLAN §6). hydroxyurea
|
||||
lives in Phase II, L-glutamine in Phase I, so both phases are processed and merged.
|
||||
|
||||
Output: data/processed/lincs_signatures_v1.parquet (rows = pert_iname, cols = 978 landmark
|
||||
gene symbols, values = mean MODZ z-score).
|
||||
|
||||
Usage:
|
||||
python scripts/week2_lincs_extract.py --phase 2 # test on Phase II (already downloaded)
|
||||
python scripts/week2_lincs_extract.py # both phases (needs Phase I gunzipped)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import gzip
|
||||
import io
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
from cmapPy.pandasGEXpress.parse import parse
|
||||
|
||||
LINCS = Path("data/raw/lincs")
|
||||
DRUG_SET = Path("data/processed/drug_set_v1.csv")
|
||||
OUT = Path("data/processed/lincs_signatures_v1.parquet")
|
||||
|
||||
GCTX = {1: LINCS / "phase1_level5.gctx", 2: LINCS / "phase2_level5.gctx"}
|
||||
SIG_INFO = {1: "GSE92742_sig_info.txt.gz", 2: "GSE70138_sig_info.txt.gz"}
|
||||
|
||||
|
||||
def read_gz_tsv(name: str) -> pd.DataFrame:
|
||||
return pd.read_csv(io.BytesIO(gzip.decompress((LINCS / name).read_bytes())), sep="\t", low_memory=False)
|
||||
|
||||
|
||||
def landmark_ids_and_symbols() -> tuple[list[str], dict[str, str]]:
|
||||
lm = pd.read_csv(LINCS / "landmark_genes.csv")
|
||||
ids = [str(x) for x in lm["pr_gene_id"]]
|
||||
id_to_symbol = {str(r.pr_gene_id): r.pr_gene_symbol for r in lm.itertuples()}
|
||||
return ids, id_to_symbol
|
||||
|
||||
|
||||
def extract_phase(phase: int, drug_names: set[str], landmark_ids: list[str]) -> pd.DataFrame:
|
||||
"""Return DataFrame: rows=pert_iname, cols=landmark gene_id (str), one mean vector per drug."""
|
||||
sig = read_gz_tsv(SIG_INFO[phase])
|
||||
sig = sig[(sig["pert_type"] == "trt_cp") & (sig["pert_iname"].isin(drug_names))]
|
||||
if sig.empty:
|
||||
return pd.DataFrame()
|
||||
sig_ids = sig["sig_id"].tolist()
|
||||
print(f"[phase {phase}] {sig['pert_iname'].nunique()} drugs, {len(sig_ids)} signatures to slice", flush=True)
|
||||
|
||||
gctoo = parse(str(GCTX[phase]), rid=landmark_ids, cid=sig_ids)
|
||||
data = gctoo.data_df # rows=gene_id, cols=sig_id
|
||||
sig_to_drug = dict(zip(sig["sig_id"], sig["pert_iname"]))
|
||||
# mean across each drug's sig_ids -> one consensus vector per drug
|
||||
per_drug = data.T.groupby(data.columns.map(sig_to_drug)).mean()
|
||||
print(f"[phase {phase}] aggregated to {len(per_drug)} drug consensus vectors", flush=True)
|
||||
return per_drug # rows=pert_iname, cols=gene_id
|
||||
|
||||
|
||||
def main() -> None:
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--phase", type=int, choices=[1, 2], default=None, help="single phase (test)")
|
||||
args = ap.parse_args()
|
||||
|
||||
drugs = pd.read_csv(DRUG_SET)
|
||||
drug_names = set(drugs["pert_iname"])
|
||||
landmark_ids, id_to_symbol = landmark_ids_and_symbols()
|
||||
|
||||
phases = [args.phase] if args.phase else [1, 2]
|
||||
frames = []
|
||||
for ph in phases:
|
||||
if not GCTX[ph].exists():
|
||||
print(f"[phase {ph}] {GCTX[ph]} missing — skipping")
|
||||
continue
|
||||
frames.append(extract_phase(ph, drug_names, landmark_ids))
|
||||
|
||||
frames = [f for f in frames if not f.empty]
|
||||
if not frames:
|
||||
print("no signatures extracted")
|
||||
return
|
||||
# A drug present in both phases: average the two phase consensus vectors.
|
||||
combined = pd.concat(frames).groupby(level=0).mean()
|
||||
combined.columns = [id_to_symbol.get(c, c) for c in combined.columns] # gene_id -> symbol
|
||||
|
||||
covered = sorted(set(combined.index))
|
||||
missing = sorted(drug_names - set(covered))
|
||||
print(f"\nsignatures extracted for {len(covered)}/{len(drug_names)} drugs")
|
||||
for gt in ["hydroxyurea", "glutamine"]:
|
||||
print(f" ground truth '{gt}': {'OK' if gt in covered else 'MISSING'}")
|
||||
if args.phase is None:
|
||||
combined.to_parquet(OUT)
|
||||
print(f"wrote {OUT} ({combined.shape[0]} drugs x {combined.shape[1]} landmark genes)")
|
||||
if missing:
|
||||
print(f"{len(missing)} drugs without signature (will be marked not-scored in Week 3)")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Reference in New Issue
Block a user