Post-hoc improvement after the pre-registered v1 recovery test failed. Two changes, diagnosing v1's failure: - score on the full 12,328-gene LINCS space (week2_lincs_extract.py), lifting signature overlap from 12% to 85% (brings erythroid markers in) - src/scoring.py: KS connectivity + per-drug specificity z-score (spec_z = SDs below a 1,000 random-query null). Primary ranking is now spec_z. (Textbook tau saturated at +/-100 for a coherent query — documented; needs a reference-signature library, a v2 item.) - week3_scoring.py: spec_z primary + WTCS reference + prior-blended - tests: tau/spec_z calibration test; 19 passing - scripts/exp_genespace.py: the BING vs all-12,328 comparison Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%), confirming the v1 failure was the landmark bottleneck not the algorithm. Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite), and negative controls (norethindrone, ciprofloxacin) rank top-3 — connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory, not a confirmatory test; reported as such in recovery_test_report.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5.0 KiB
Data Sources
Fill in version + download date for every source actually used. This file is the artifact that proves reproducibility (PLAN.md §6, Week 4 task 4). Record date and version for all downloads.
| Source | URL | Access | License | Use in MVP | Version | Download date |
|---|---|---|---|---|---|---|
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph | TBD | TBD |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
| ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
| LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
| LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
Chosen GEO datasets (disease signature, Tier A via 2-study concordance)
The signature is the cross-study concordance of two independent whole-blood studies (genes significant at q<0.05 in both with the same direction). Whole-blood tissue was required so concordance is meaningful; the two differ by platform and population, which strengthens robustness.
| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
|---|---|---|---|---|---|
| GSE35007 | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
| GSE16728 | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |
- DE method: per-gene Welch t-test + Benjamini–Hochberg (microarray, pure Python).
- Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
- Result: 16,208 genes tested in both → 671 concordant (444 up / 227 down). Signature = top 250 up + all 227 down by worst-case q-value.
- Rejected candidates: GSE53441 (PBMC — tissue mismatch with the whole-blood anchor); GSE84633/GSE84634 (PBMC, no healthy controls).
- Tier caveat: GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.
Reproduce with scripts/week1_explore.py (download + DE + concordance) then
scripts/week1_finalize.py (mygene mapping + persist).
Drug profiles (Week 2)
300-drug set (drug_set_v1.csv), composed and restricted to LINCS-scorable compounds:
| Inclusion reason | n | Notes |
|---|---|---|
| ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
| related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
| negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
| general_sample | 240 | random from LINCS catalog, seed=42 |
- LINCS signatures: per-drug consensus = mean of Level-5 MODZ z-scores across the drug's sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
- ChEMBL: matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds 38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
- Tier: all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's not-single-source rule).
- Gene space (v1.1): scoring uses the full 12,328-gene LINCS space, not just the 978 landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from LINCS entirely and remain unscoreable.
- Reproduce:
week2_curate_drugset.py→week2_chembl.py→ download Level-5 GCTX →week2_lincs_extract.py→week2_assemble.py.
Licensing note for LINCS
Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept) the terms are permissive. For productization this needs legal review.