Reverso/docs/data_sources.md
Junior B. c7b6649d31 Week 1: Tier-A sickle cell signature via 2-study concordance
Implement and run the Week 1 disease-signature pipeline:
- src/disease.py: Welch t-test + BH DE (microarray), probe->symbol
  collapse, cross-study concordance filter, 2-study provenance schema
- scripts/week1_explore.py: download GSE35007 + GSE16728, DE + concordance
- scripts/week1_finalize.py: mygene ID mapping + persist signature
- tests/test_disease.py: synthetic-data tests for DE/collapse/concordance
- docs/data_sources.md: chosen datasets, group defs, reproduction steps

Result: sickle_cell_signature_v1.json (gitignored), Tier A, 250 up /
227 down genes from 671 concordant (GSE35007 Illumina whole blood SS/AA +
GSE16728 Affymetrix whole blood patient/control). Documented caveats:
missing HbF axis (globin depletion) and reticulocyte composition confound.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 20:43:54 +02:00


Data Sources

Fill in the version and download date for every source actually used, at the time of download. This file is the artifact that proves reproducibility (PLAN.md §6, Week 4 task 4).

| Source | URL | Access | License | Use in MVP | Version | Download date |
| --- | --- | --- | --- | --- | --- | --- |
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph | TBD | TBD |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |

Chosen GEO datasets (disease signature, Tier A via 2-study concordance)

The signature is the cross-study concordance of two independent whole-blood studies: a gene is kept only if it is significant at q<0.05 in both studies and changes in the same direction in both. Both studies had to use whole blood so that the comparison is like with like. They differ in platform and population, so a gene that passes the filter is less likely to be a platform or cohort artifact.
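
The concordance rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code in src/disease.py: the `DEResult` container, the `concordant_genes` function, the use of a signed log fold change for direction, and the toy gene names are all assumptions made for the example.

```python
from typing import Dict, NamedTuple, Tuple


class DEResult(NamedTuple):
    """Per-gene DE result from one study (hypothetical container)."""
    log_fc: float  # disease minus healthy; only the sign is used here
    q: float       # multiple-testing-adjusted p-value


def concordant_genes(
    study_a: Dict[str, DEResult],
    study_b: Dict[str, DEResult],
    q_max: float = 0.05,
) -> Dict[str, Tuple[str, float]]:
    """Keep genes with q < q_max in both studies and the same sign of change.

    Returns {gene: (direction, worst_q)}, where worst_q is the larger
    of the two q-values.
    """
    kept = {}
    # Only genes tested in both studies can be concordant.
    for gene in study_a.keys() & study_b.keys():
        a, b = study_a[gene], study_b[gene]
        if a.q >= q_max or b.q >= q_max:
            continue  # not significant in at least one study
        if a.log_fc == 0 or b.log_fc == 0:
            continue  # no direction to compare
        if (a.log_fc > 0) != (b.log_fc > 0):
            continue  # significant in both, but opposite directions
        direction = "up" if a.log_fc > 0 else "down"
        kept[gene] = (direction, max(a.q, b.q))
    return kept


if __name__ == "__main__":
    # Toy values, not real results.
    a = {"GENE_A": DEResult(1.2, 0.01), "GENE_B": DEResult(-0.8, 0.02)}
    b = {"GENE_A": DEResult(0.7, 0.03), "GENE_B": DEResult(0.4, 0.01)}
    print(concordant_genes(a, b))  # {'GENE_A': ('up', 0.03)}
```

In the toy data, GENE_B is dropped because it is significant in both studies but moves in opposite directions.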

| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
| --- | --- | --- | --- | --- | --- |
| GSE35007 | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
| GSE16728 | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |

  • DE method: per-gene Welch t-test + Benjamini-Hochberg correction (microarray, pure Python).
  • Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
  • Result: 16,208 genes tested in both → 671 concordant (444 up / 227 down). Signature = top 250 up + all 227 down by worst-case q-value.
  • Rejected candidates: GSE53441 (PBMC — tissue mismatch with the whole-blood anchor); GSE84633/GSE84634 (PBMC, no healthy controls).
  • Tier caveat: GSE16728 has exactly 10 samples per group (two PAXgene preps merged), which does not satisfy the strict n>10 rule. Tier A is therefore assigned on the basis of cross-study concordance, and this is documented in the signature JSON.
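
The probe-collapse and signature-selection bullets above can be sketched as follows. This is an illustrative sketch only. The function names and input shapes are hypothetical, and the reading that "top 250 up + all 227 down" means a cap of 250 genes per direction, ranked by the larger of the two studies' q-values, is an assumption.

```python
from statistics import mean
from typing import Dict, List, Tuple


def collapse_probes(
    expr: Dict[str, List[float]],
    probe_to_symbol: Dict[str, str],
) -> Dict[str, List[float]]:
    """Collapse probe-level rows to one row per gene symbol.

    For each symbol, keep the probe with the highest mean expression
    across samples. Probes without a symbol are dropped.
    """
    best: Dict[str, Tuple[float, str]] = {}
    for probe, values in expr.items():
        symbol = probe_to_symbol.get(probe)
        if not symbol:
            continue
        m = mean(values)
        if symbol not in best or m > best[symbol][0]:
            best[symbol] = (m, probe)
    return {symbol: expr[probe] for symbol, (_, probe) in best.items()}


def select_signature(
    concordant: Dict[str, Tuple[str, float]],
    n_up: int = 250,
    n_down: int = 250,
) -> Tuple[List[str], List[str]]:
    """Pick up to n_up / n_down genes per direction, smallest worst-case q first.

    `concordant` maps gene -> (direction, worst_q), with direction "up" or "down".
    Ties on q are broken alphabetically so the output is deterministic.
    """
    def ranked(direction: str) -> List[str]:
        rows = [(q, g) for g, (d, q) in concordant.items() if d == direction]
        return [g for _, g in sorted(rows)]

    return ranked("up")[:n_up], ranked("down")[:n_down]


if __name__ == "__main__":
    # Toy probe IDs and values, not real data.
    expr = {"p1": [1.0, 2.0, 3.0], "p2": [5.0, 6.0, 7.0], "p3": [4.0, 4.0, 4.0]}
    mapping = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
    print(collapse_probes(expr, mapping))
```

With a cap of 250 per direction, 444 concordant up genes would be cut to 250 while all 227 down genes would pass, which matches the counts reported above.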

Reproduce with scripts/week1_explore.py (download + DE + concordance) then scripts/week1_finalize.py (mygene mapping + persist).
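
For orientation, the DE step named above (Welch t-test followed by Benjamini-Hochberg) might look like the sketch below. It is not the code in src/disease.py, and the function names are hypothetical. It computes the Welch t statistic, the Welch-Satterthwaite degrees of freedom, and BH-adjusted q-values from a list of p-values. Turning (t, df) into a two-sided p-value needs the Student t distribution, which the Python standard library does not provide, so that step is left out here.

```python
from math import sqrt
from statistics import mean, variance
from typing import List, Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's unequal-variance t statistic and Welch-Satterthwaite df.

    Each group needs at least two values, and the groups must not both
    have zero variance (that would divide by zero).
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bh_adjust(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    # Toy expression values for one gene: disease vs healthy.
    t, df = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
    print(round(t, 4), round(df, 4))
    print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs a Welch test and returns the p-value directly. Whether the project uses it is not stated here; the bullet above describes the implementation as pure Python.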

Licensing note for LINCS

Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept) the terms are permissive. For productization this needs legal review.