Reverso/docs/data_sources.md
Junior B. 3417f85eb1 v1.1: full gene space + specificity z-score; hydroxyurea recovers
Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
  lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
  (spec_z = SDs below a 1,000 random-query null). Primary ranking is
  now spec_z. (Textbook tau saturated at +/-100 for a coherent query —
  documented; needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming the v1 failure was the landmark bottleneck not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3 —
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:57:30 +02:00


Data Sources

Fill in the version and download date for every source actually used. This file is the artifact that proves reproducibility (PLAN.md §6, Week 4 task 4).

| Source | URL | Access | License | Use in MVP | Version | Download date |
|---|---|---|---|---|---|---|
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph | TBD | TBD |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
| ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
| LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
| LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |

Chosen GEO datasets (disease signature, Tier A via 2-study concordance)

The signature is the cross-study concordance of two independent whole-blood studies (genes significant at q<0.05 in both with the same direction). Whole-blood tissue was required so concordance is meaningful; the two differ by platform and population, which strengthens robustness.

| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
|---|---|---|---|---|---|
| GSE35007 | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
| GSE16728 | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |

  • DE method: per-gene Welch t-test + Benjamini-Hochberg correction (microarray, pure Python).
  • Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
  • Result: 16,208 genes tested in both → 671 concordant (444 up / 227 down). Signature = top 250 up + all 227 down by worst-case q-value.
  • Rejected candidates: GSE53441 (PBMC — tissue mismatch with the whole-blood anchor); GSE84633/GSE84634 (PBMC, no healthy controls).
  • Tier caveat: GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.
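
The Benjamini-Hochberg, concordance, and signature-selection steps listed above can be sketched in pure Python as follows. This is a minimal illustration, not the code in scripts/week1_explore.py: the function names and the `gene -> (effect, q)` input layout are assumptions, and the per-gene Welch p-values are taken as given.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone in rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def concordant_genes(study_a, study_b, alpha=0.05):
    """Keep genes significant in both studies with the same sign of effect.

    study_a / study_b: dict gene -> (effect, q)   (hypothetical layout).
    Returns dict gene -> (direction, worst_case_q).
    """
    out = {}
    for gene in study_a.keys() & study_b.keys():
        eff_a, q_a = study_a[gene]
        eff_b, q_b = study_b[gene]
        same_direction = (eff_a > 0 and eff_b > 0) or (eff_a < 0 and eff_b < 0)
        if q_a < alpha and q_b < alpha and same_direction:
            direction = "up" if eff_a > 0 else "down"
            out[gene] = (direction, max(q_a, q_b))  # worst-case q across studies
    return out


def select_signature(concordant, n_up=250, n_down=227):
    """Rank concordant genes by worst-case q and cap each direction."""
    ranked = sorted(concordant.items(), key=lambda kv: kv[1][1])
    up = [g for g, (d, _) in ranked if d == "up"][:n_up]
    down = [g for g, (d, _) in ranked if d == "down"][:n_down]
    return up, down
```

With the counts reported above (444 up / 227 down concordant), `select_signature` with its defaults would keep 250 up and all 227 down genes, for a 477-gene signature.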

Reproduce with scripts/week1_explore.py (download + DE + concordance) then scripts/week1_finalize.py (mygene mapping + persist).

Drug profiles (Week 2)

300-drug set (drug_set_v1.csv), composed and restricted to LINCS-scorable compounds:

| Inclusion reason | n | Notes |
|---|---|---|
| ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
| related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
| negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
| general_sample | 240 | random from LINCS catalog, seed=42 |

  • LINCS signatures: per-drug consensus = mean of Level-5 MODZ z-scores across the drug's sig_ids (cell lines/doses/times). v1 restricted scoring to the 978 landmark genes; v1.1 uses the full 12,328-gene space (see the gene-space item below). Drawn from BOTH phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
  • ChEMBL: matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds 38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
  • Tier: all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's not-single-source rule).
  • Gene space (v1.1): scoring uses the full 12,328-gene LINCS space, not just the 978 landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only — the larger space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from LINCS entirely and remain unscoreable.
  • Reproduce: week2_curate_drugset.py → week2_chembl.py → download Level-5 GCTX → week2_lincs_extract.py → week2_assemble.py.
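
The per-drug consensus and the signature-overlap figures above can be illustrated with a short sketch. It is not taken from week2_lincs_extract.py: the function names and the dict-per-sig_id layout are hypothetical, and real Level-5 data would be read from the GCTX files rather than built by hand.

```python
from statistics import fmean


def consensus_signature(profiles):
    """Average z-scores per gene across a drug's signatures.

    profiles: non-empty list of dict gene -> z-score, one per sig_id
    (hypothetical layout). Only genes present in every profile are kept.
    """
    genes = set(profiles[0])
    for profile in profiles[1:]:
        genes &= set(profile)
    return {g: fmean(p[g] for p in profiles) for g in genes}


def signature_overlap(signature_genes, gene_space):
    """Return (n_covered, n_signature, fraction) for a disease signature."""
    sig = set(signature_genes)
    if not sig:
        raise ValueError("signature is empty")
    covered = sig & set(gene_space)
    return len(covered), len(sig), len(covered) / len(sig)


# Toy example: two signatures for one drug, then overlap with a gene space.
profiles = [{"A": 1.0, "B": -2.0}, {"A": 3.0, "B": 0.0}]
consensus = consensus_signature(profiles)            # A -> 2.0, B -> -1.0
n_cov, n_sig, frac = signature_overlap(["A", "B", "C", "D"], ["A", "B", "C", "X"])
```

Applied to the numbers reported above, the same fraction is 406/477 ≈ 0.85 for the full gene space and 56/477 ≈ 0.12 for the landmark-only space.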

Licensing note for LINCS

Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept) the terms are permissive. For productization this needs legal review.