Week 1: Tier-A sickle cell signature via 2-study concordance

Implement and run the Week 1 disease-signature pipeline:
- src/disease.py: Welch t-test + BH DE (microarray), probe->symbol
  collapse, cross-study concordance filter, 2-study provenance schema
- scripts/week1_explore.py: download GSE35007 + GSE16728, DE + concordance
- scripts/week1_finalize.py: mygene ID mapping + persist signature
- tests/test_disease.py: synthetic-data tests for DE/collapse/concordance
- docs/data_sources.md: chosen datasets, group defs, reproduction steps

Result: sickle_cell_signature_v1.json (gitignored), Tier A, 250 up /
227 down genes from 671 concordant (GSE35007 Illumina whole blood SS/AA +
GSE16728 Affymetrix whole blood patient/control). Documented caveats:
missing HbF axis (globin depletion) and reticulocyte composition confound.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-23 20:43:23 +02:00
parent b731478f5d
commit c7b6649d31
5 changed files with 631 additions and 26 deletions

View File

@@ -10,17 +10,37 @@
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data | TBD | TBD |
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
## Chosen GEO dataset
## Chosen GEO datasets (disease signature, Tier A via 2-study concordance)
_Document the chosen study fully: accession, platform, n per group, publication, why it was
selected over the alternatives (GSE53441, GSE35007, …)._
The signature is the cross-study concordance of two independent whole-blood studies (genes
significant at q<0.05 in **both** with the same direction). Whole-blood tissue was required so
concordance is meaningful; the two differ by platform and population, which strengthens
robustness.
| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
|---|---|---|---|---|---|
| **GSE35007** | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
| **GSE16728** | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |
- DE method: per-gene Welch t-test + BenjaminiHochberg (microarray, pure Python).
- Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
- Result: 16,208 genes tested in both **671 concordant** (444 up / 227 down). Signature =
top 250 up + all 227 down by worst-case q-value.
- **Rejected candidates:** GSE53441 (PBMC tissue mismatch with the whole-blood anchor);
GSE84633/GSE84634 (PBMC, no healthy controls).
- **Tier caveat:** GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict
n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.
Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
`scripts/week1_finalize.py` (mygene mapping + persist).
## Licensing note for LINCS