
Junior B. 47b0094079 Week 2: 300-drug profiles with LINCS signatures + ChEMBL

Build the drug profile dataset (PLAN §6 Week 2):
- week2_curate_drugset.py: 300-drug set (2 ground-truth + 32 related-
  mechanism + 26 negative-control + 240 random), restricted to
  LINCS-scorable compounds, seed=42
- week2_chembl.py: InChIKey->ChEMBL match (145/300), MoA + targets
- week2_lincs_extract.py: cmapPy-slice both Level-5 GCTX phases to 978
  landmark genes, mean-aggregate per drug to one consensus signature
- week2_assemble.py: join into drug_profiles_v1.parquet, Tier B (LINCS
  single-source), scored flag per PLAN §6 Week 3 task 2
- docs/data_sources.md: drug set composition + LINCS/ChEMBL provenance

Results (all gitignored data): 300/300 drugs scored, both ground-truth
drugs present (hydroxyurea Phase II = CHEMBL467, L-glutamine Phase I).
Key caveat recorded: only 56/477 (12%) of the disease signature genes
are LINCS landmarks, so Week-3 scoring uses a 30-up/26-down query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-23 22:25:00 +02:00


Data Sources

Fill in the version and download date for every source actually used. This file is the artifact that proves reproducibility (PLAN.md §6, Week 4 task 4).

Source	URL	Access	License	Use in MVP	Version	Download date
Open Targets	https://platform.opentargets.org	API, bulk Parquet	CC0	Target-disease graph	TBD	TBD
MONDO	http://www.obofoundry.org/ontology/mondo.html	OBO file	CC BY 4.0	Disease ID	TBD	TBD
Orphanet	https://www.orpha.net	Bulk XML	CC BY 4.0	Rare disease metadata	TBD	TBD
OMIM	https://omim.org	Free for academic	License for commercial	Disease genetics	TBD	TBD
GEO (GSE35007)	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007	GEOparse, FTP	Public domain	Disease signature (study 1)	GPL10558	2026-06-23
GEO (GSE16728)	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728	GEOparse, FTP	Public domain	Disease signature (study 2)	GPL570	2026-06-23
ChEMBL	https://www.ebi.ac.uk/chembl	chembl_webresource_client	CC BY-SA 3.0	Drug structures, MoA, targets	API (live)	2026-06-23
LINCS L1000 Phase I	GSE92742 (GEO)	GEOparse/FTP + cmapPy	CC0 (GEO)	Drug signatures (incl. L-glutamine)	GSE92742	2026-06-23
LINCS L1000 Phase II	GSE70138 (GEO)	GEOparse/FTP + cmapPy	CC0 (GEO)	Drug signatures (incl. hydroxyurea)	GSE70138	2026-06-23
ClinicalTrials.gov	https://clinicaltrials.gov	API	Public domain	Trial history	TBD	TBD
FDA DailyMed	https://dailymed.nlm.nih.gov	API	Public domain	Approved labels	TBD	TBD
Reactome	https://reactome.org	API, bulk	CC0	Pathway data (Week 3 prior)	TBD	TBD

Chosen GEO datasets (disease signature, Tier A via 2-study concordance)

The signature is the cross-study concordance of two independent whole-blood studies: genes significant at q<0.05 in both studies with the same direction of change. Both studies had to be whole blood for the concordance to be meaningful; they differ in platform and population, which strengthens robustness.

Study	Platform	Tissue	Disease group	Healthy group	n disease / healthy
GSE35007	Illumina HumanHT-12 V4 (GPL10558)	whole blood	hb phenotype = SS	hb phenotype = AA	190 / 12
GSE16728	Affymetrix HG-U133 Plus 2.0 (GPL570)	whole blood (PAXgene)	sickle-cell patient	control	10 / 10

DE method: per-gene Welch t-test + Benjamini–Hochberg (microarray, pure Python).
Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
Result: 16,208 genes tested in both → 671 concordant (444 up / 227 down). Signature = top 250 up + all 227 down by worst-case q-value.
Rejected candidates: GSE53441 (PBMC — tissue mismatch with the whole-blood anchor); GSE84633/GSE84634 (PBMC, no healthy controls).
Tier caveat: GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.

Reproduce with scripts/week1_explore.py (download + DE + concordance) then scripts/week1_finalize.py (mygene mapping + persist).
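
The multiple-testing and concordance steps above can be sketched in pure Python as follows. This is a minimal illustration, not the code in scripts/week1_explore.py: the function names, the `(direction, q)` tuple layout, and the assumption that per-gene p-values from the Welch t-test are already available are all hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def concordant_genes(study_a, study_b, alpha=0.05):
    """Keep genes significant in both studies with the same direction.

    study_a / study_b: dict gene -> (direction, q), direction is +1 or -1.
    Returns dict gene -> (direction, worst_case_q).
    """
    kept = {}
    for gene in study_a.keys() & study_b.keys():
        dir_a, q_a = study_a[gene]
        dir_b, q_b = study_b[gene]
        if q_a < alpha and q_b < alpha and dir_a == dir_b:
            kept[gene] = (dir_a, max(q_a, q_b))
    return kept


def select_signature(concordant, n_up=250):
    """Top n_up up-regulated genes plus all down-regulated genes,
    each ranked by worst-case q-value (smallest first)."""
    by_q = sorted(concordant, key=lambda g: concordant[g][1])
    up = [g for g in by_q if concordant[g][0] > 0][:n_up]
    down = [g for g in by_q if concordant[g][0] < 0]
    return up, down
```

Ranking by the larger of the two q-values means a gene is only as strong as its weaker study, which matches the "worst-case q-value" wording above.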

Drug profiles (Week 2)

300-drug set (drug_set_v1.csv), restricted to LINCS-scorable compounds and composed as follows:

Inclusion reason	n	Notes
ground_truth	2	hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I)
related_mechanism	32	HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories
negative_control	26	antifungals, antihistamines, antibiotics, hormones
general_sample	240	random from LINCS catalog, seed=42
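
A reproducible draw for the general_sample row might look like the sketch below. It is an assumption about how a seeded sample could be taken, not the logic of week2_curate_drugset.py; `sample_general` and its arguments are hypothetical names.

```python
import random


def sample_general(catalog, curated, n=240, seed=42):
    """Draw n compounds from the catalog, excluding curated ones.

    The pool is sorted first so the result depends only on the seed,
    not on set iteration order.
    """
    pool = sorted(set(catalog) - set(curated))
    return random.Random(seed).sample(pool, n)
```

Using a private `random.Random(seed)` instance keeps the draw independent of any other code that touches the global random state.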

LINCS signatures: per-drug consensus = mean of Level-5 MODZ z-scores across the drug's sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
ChEMBL: matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds 38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
Tier: all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's not-single-source rule).
Signature↔landmark overlap: only 56/477 (12%) of the disease signature genes are LINCS landmarks, so connectivity scoring (Week 3) uses a 30-up/26-down query. The erythroid hallmark genes (CA1, AHSP, SLC4A1, HBG) are NOT landmarks. This is a key limitation for the recovery test.
Reproduce: week2_curate_drugset.py → week2_chembl.py → download Level-5 GCTX → week2_lincs_extract.py → week2_assemble.py.
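
The per-drug consensus and the landmark-restricted query described above can be illustrated with a small stdlib-only sketch. It assumes each signature is already loaded as a dict of gene to z-score; the actual pipeline reads Level-5 GCTX files with cmapPy, which is not shown here, and the function names are hypothetical.

```python
from statistics import fmean


def consensus_signature(signatures):
    """Mean z-score per gene across all of one drug's signatures.

    signatures: list of dict gene -> z-score, one dict per sig_id.
    Only genes present in every signature are kept.
    """
    genes = set(signatures[0])
    for sig in signatures[1:]:
        genes &= set(sig)
    return {g: fmean(sig[g] for sig in signatures) for g in genes}


def landmark_query(up_genes, down_genes, landmarks):
    """Restrict a disease signature to genes measured as landmarks."""
    up = [g for g in up_genes if g in landmarks]
    down = [g for g in down_genes if g in landmarks]
    return up, down
```

Applied to the 477-gene disease signature and the 978 landmark genes, a filter of this kind is what reduces the query to the 30-up/26-down set noted above.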

Licensing note for LINCS

Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept) the terms are permissive. For productization this needs legal review.
