Reverso/docs/data_sources.md
Junior B. 47b0094079 Week 2: 300-drug profiles with LINCS signatures + ChEMBL
Build the drug profile dataset (PLAN §6 Week 2):
- week2_curate_drugset.py: 300-drug set (2 ground-truth + 32 related-
  mechanism + 26 negative-control + 240 random), restricted to
  LINCS-scorable compounds, seed=42
- week2_chembl.py: InChIKey->ChEMBL match (145/300), MoA + targets
- week2_lincs_extract.py: cmapPy-slice both Level-5 GCTX phases to 978
  landmark genes, mean-aggregate per drug to one consensus signature
- week2_assemble.py: join into drug_profiles_v1.parquet, Tier B (LINCS
  single-source), scored flag per PLAN §6 Week 3 task 2
- docs/data_sources.md: drug set composition + LINCS/ChEMBL provenance

Results (all gitignored data): 300/300 drugs scored, both ground-truth
drugs present (hydroxyurea Phase II = CHEMBL467, L-glutamine Phase I).
Key caveat recorded: only 56/477 (12%) of the disease signature genes
are LINCS landmarks, so Week-3 scoring uses a 30-up/26-down query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:25:00 +02:00


Data Sources

Fill in the version and download date for every source actually used. This file is the artifact that proves reproducibility (PLAN.md §6, Week 4 task 4).

| Source | URL | Access | License | Use in MVP | Version | Download date |
| --- | --- | --- | --- | --- | --- | --- |
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph | TBD | TBD |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
| ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
| LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
| LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |

Chosen GEO datasets (disease signature, Tier A via 2-study concordance)

The signature is the cross-study concordance of two independent whole-blood studies: genes significant at q<0.05 in both, with the same direction of change. Both studies had to be whole blood for concordance to be meaningful; they differ in platform and population, so agreement between them is less likely to be a platform or cohort artifact.

| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
| --- | --- | --- | --- | --- | --- |
| GSE35007 | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
| GSE16728 | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |
  • DE method: per-gene Welch t-test + Benjamini-Hochberg (microarray, pure Python).
  • Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
  • Result: 16,208 genes tested in both → 671 concordant (444 up / 227 down). Signature = top 250 up + all 227 down by worst-case q-value.
  • Rejected candidates: GSE53441 (PBMC — tissue mismatch with the whole-blood anchor); GSE84633/GSE84634 (PBMC, no healthy controls).
  • Tier caveat: GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.
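
The correction and concordance steps above can be sketched in plain Python. This is a minimal illustration and not the code in scripts/week1_explore.py: the function names `bh_adjust` and `concordant_genes`, the `(t statistic, q-value)` input layout, and the toy gene names are all assumptions, and the Welch p-values are taken as already computed.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: gene symbol -> (Welch t statistic, BH q-value) for one study.
Study = Dict[str, Tuple[float, float]]


def bh_adjust(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg q-values, returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def concordant_genes(a: Study, b: Study, alpha: float = 0.05):
    """Genes with q < alpha in both studies and the same sign of change.

    Returns (up, down), each sorted by worst-case q-value (the larger of the two).
    """
    up, down = [], []
    for gene in set(a) & set(b):
        (t_a, q_a), (t_b, q_b) = a[gene], b[gene]
        if q_a >= alpha or q_b >= alpha:
            continue
        worst_q = max(q_a, q_b)
        if t_a > 0 and t_b > 0:
            up.append((worst_q, gene))
        elif t_a < 0 and t_b < 0:
            down.append((worst_q, gene))
    up.sort()
    down.sort()
    return [g for _, g in up], [g for _, g in down]


if __name__ == "__main__":
    # Toy inputs only; real inputs would cover the 16,208 shared genes.
    study_1 = {"G1": (3.0, 0.01), "G2": (-2.5, 0.02), "G3": (2.0, 0.20), "G4": (4.0, 0.001)}
    study_2 = {"G1": (2.0, 0.03), "G2": (-3.0, 0.01), "G3": (2.5, 0.01), "G4": (-1.0, 0.04)}
    up, down = concordant_genes(study_1, study_2)
    signature = up[:250] + down  # top 250 up plus all down, as in the result above
    print(signature)
```

Ranking by the larger of the two q-values means a gene ranks highly only if it is strongly supported in both studies, which is one reading of "worst-case q-value".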

Reproduce with scripts/week1_explore.py (download + DE + concordance) then scripts/week1_finalize.py (mygene mapping + persist).

Drug profiles (Week 2)

300-drug set (drug_set_v1.csv), restricted to LINCS-scorable compounds and composed as follows:

| Inclusion reason | n | Notes |
| --- | --- | --- |
| ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
| related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
| negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
| general_sample | 240 | random from LINCS catalog, seed=42 |
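
A composition like the one in this table could be assembled as in the sketch below. It shows one way to top up a curated list with a seeded random sample, and is not a copy of week2_curate_drugset.py: `build_drug_set`, its parameters, and the placeholder compound names are hypothetical, and the real script may sample differently.

```python
import random
from typing import Dict, List


def build_drug_set(
    curated: Dict[str, str],
    catalog: List[str],
    n_total: int = 300,
    seed: int = 42,
) -> Dict[str, str]:
    """Map each drug to its inclusion reason.

    `curated` holds the hand-picked drugs (ground_truth, related_mechanism,
    negative_control); `catalog` is the list of LINCS-scorable compounds.
    The remainder is filled with a seeded random sample from the catalog.
    """
    scorable = set(catalog)
    missing = [d for d in curated if d not in scorable]
    if missing:
        raise ValueError(f"curated drugs not LINCS-scorable: {missing}")

    # Sort the pool so the sample depends only on the seed, not on set ordering.
    pool = sorted(scorable - set(curated))
    rng = random.Random(seed)
    sampled = rng.sample(pool, n_total - len(curated))

    drug_set = dict(curated)
    drug_set.update({d: "general_sample" for d in sampled})
    return drug_set


if __name__ == "__main__":
    # Placeholder names; the real catalog would come from the LINCS metadata.
    catalog = [f"cmpd_{i:03d}" for i in range(500)]
    curated = {f"cmpd_{i:03d}": "related_mechanism" for i in range(60)}
    drug_set = build_drug_set(curated, catalog)
    print(len(drug_set))
```

Sorting the pool before sampling keeps the result reproducible across runs, because the order of a Python set is not guaranteed.
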
  • LINCS signatures: per-drug consensus = mean of Level-5 MODZ z-scores across the drug's sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
  • ChEMBL: matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds 38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
  • Tier: all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's not-single-source rule).
  • Signature↔landmark overlap: only 56/477 (12%) of the disease signature genes are LINCS landmarks, so connectivity scoring (Week 3) uses a 30-up/26-down query. The erythroid hallmark genes (CA1, AHSP, SLC4A1, HBG) are NOT landmarks. This is a key limitation for the recovery test.
  • Reproduce: week2_curate_drugset.py → week2_chembl.py → download Level-5 GCTX → week2_lincs_extract.py → week2_assemble.py.
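
The consensus aggregation and the landmark-restricted query described above can be illustrated with plain dictionaries. This is a sketch under assumptions: the real pipeline slices GCTX files with cmapPy, whereas here the z-scores are assumed to be already loaded, and `consensus_signatures`, `landmark_query`, and the toy identifiers are hypothetical names.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List, Set, Tuple

# Hypothetical layout: sig_id -> {gene symbol -> Level-5 z-score}.
Profiles = Dict[str, Dict[str, float]]


def consensus_signatures(
    sig_to_drug: Dict[str, str],
    sig_zscores: Profiles,
    landmarks: Set[str],
) -> Dict[str, Dict[str, float]]:
    """Per-drug mean z-score over all of the drug's sig_ids, landmark genes only."""
    by_drug = defaultdict(list)
    for sig_id, drug in sig_to_drug.items():
        by_drug[drug].append(sig_zscores[sig_id])

    genes = sorted(landmarks)
    return {
        drug: {g: fmean(p[g] for p in profiles) for g in genes}
        for drug, profiles in by_drug.items()
    }


def landmark_query(
    up: List[str], down: List[str], landmarks: Set[str]
) -> Tuple[List[str], List[str]]:
    """Keep only the signature genes that are measured landmarks."""
    return (
        [g for g in up if g in landmarks],
        [g for g in down if g in landmarks],
    )


if __name__ == "__main__":
    # Toy data: gene "X" is not a landmark and is dropped from the consensus.
    landmarks = {"A", "B"}
    sig_zscores = {
        "s1": {"A": 1.0, "B": -2.0, "X": 5.0},
        "s2": {"A": 3.0, "B": 0.0, "X": 1.0},
        "s3": {"A": -1.0, "B": 4.0, "X": 0.0},
    }
    sig_to_drug = {"s1": "drug_a", "s2": "drug_a", "s3": "drug_b"}
    print(consensus_signatures(sig_to_drug, sig_zscores, landmarks))
    print(landmark_query(["A", "X"], ["B"], landmarks))
```

Applied to the real 477-gene signature, a filter like `landmark_query` is what would reduce the query to the 30-up/26-down set noted above.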

Licensing note for LINCS

Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept) the terms are permissive. For productization this needs legal review.