# Data Sources

> Fill in version + download date for every source actually used. This file is the artifact
> that proves reproducibility (PLAN.md §6, Week 4 task 4). Record date and version for **all**
> downloads.

| Source | URL | Access | License | Use in MVP | Version | Download date |
|---|---|---|---|---|---|---|
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph | TBD | TBD |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |

## Chosen GEO datasets (disease signature, Tier A via 2-study concordance)

The signature is the cross-study concordance of two independent whole-blood studies (genes
significant at q<0.05 in **both** with the same direction). Whole-blood tissue was required so
concordance is meaningful; the two differ by platform and population, which strengthens
robustness.

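As a rough illustration of the concordance rule (q<0.05 in both studies, same direction), here is a minimal pure-Python sketch. It is not the code in `scripts/week1_explore.py`; the `GeneStat` type, the `concordant_genes` helper, and the toy gene names and values are all hypothetical, and the sketch assumes each study has already been reduced to one signed effect and one BH-adjusted q-value per gene symbol.

```python
from typing import Dict, NamedTuple, Tuple


class GeneStat(NamedTuple):
    """Per-gene DE summary for one study (hypothetical structure)."""
    effect: float  # signed effect, e.g. mean(disease) - mean(healthy)
    q: float       # BH-adjusted p-value


def concordant_genes(
    study_a: Dict[str, GeneStat],
    study_b: Dict[str, GeneStat],
    q_cutoff: float = 0.05,
) -> Dict[str, Tuple[str, float]]:
    """Return {gene: (direction, worst_case_q)} for concordant genes.

    A gene is kept only if it was tested in both studies, has
    q < q_cutoff in both, and the effect has the same sign in both.
    """
    result: Dict[str, Tuple[str, float]] = {}
    for gene in study_a.keys() & study_b.keys():
        a, b = study_a[gene], study_b[gene]
        if a.q >= q_cutoff or b.q >= q_cutoff:
            continue  # not significant in at least one study
        if a.effect == 0 or b.effect == 0:
            continue  # no direction to compare
        if (a.effect > 0) != (b.effect > 0):
            continue  # opposite directions
        direction = "up" if a.effect > 0 else "down"
        # Worst-case q is the larger (less significant) of the two.
        result[gene] = (direction, max(a.q, b.q))
    return result


# Toy inputs (made-up gene names and numbers, for illustration only).
toy_a = {
    "GENE1": GeneStat(1.2, 0.01),
    "GENE2": GeneStat(-0.8, 0.03),
    "GENE3": GeneStat(0.5, 0.20),   # fails q cutoff in study A
    "GENE4": GeneStat(0.9, 0.01),   # direction flips in study B
}
toy_b = {
    "GENE1": GeneStat(0.7, 0.04),
    "GENE2": GeneStat(-1.1, 0.001),
    "GENE3": GeneStat(0.6, 0.01),
    "GENE4": GeneStat(-0.4, 0.02),
    "GENE5": GeneStat(2.0, 0.001),  # not tested in study A
}

hits = concordant_genes(toy_a, toy_b)
print(sorted(hits.items()))
```

Carrying the larger of the two q-values per gene gives a conservative ranking key, since a gene then ranks only as well as its weaker study supports.
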
| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
|---|---|---|---|---|---|
| **GSE35007** | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
| **GSE16728** | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |

- DE method: per-gene Welch t-test + Benjamini–Hochberg (microarray, pure Python).
- Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
- Result: 16,208 genes tested in both → **671 concordant** (444 up / 227 down).
  Signature = top 250 up + all 227 down by worst-case q-value.
- **Rejected candidates:** GSE53441 (PBMC; tissue mismatch with the whole-blood anchor);
  GSE84633/GSE84634 (PBMC, no healthy controls).
- **Tier caveat:** GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict
  n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.

Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
`scripts/week1_finalize.py` (mygene mapping + persist).

## Licensing note for LINCS

Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept)
the terms are permissive. For productization this needs legal review.