Reverso/docs/data_sources.md
Junior B. 3417f85eb1 v1.1: full gene space + specificity z-score; hydroxyurea recovers
Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
  lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
  (spec_z = SDs below a 1,000 random-query null). Primary ranking is
  now spec_z. (Textbook tau saturated at +/-100 for a coherent query —
  documented; needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming the v1 failure was the landmark bottleneck not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3 —
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 22:57:30 +02:00

# Data Sources
> Fill in version + download date for every source actually used. This file is the artifact
> that proves reproducibility (PLAN.md §6, Week 4 task 4). Record date and version for **all**
> downloads.

| Source | URL | Access | License | Use in MVP | Version | Download date |
|---|---|---|---|---|---|---|
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph | TBD | TBD |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic use | License required for commercial use | Disease genetics | TBD | TBD |
| GEO (GSE35007) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007 | GEOparse, FTP | Public domain | Disease signature (study 1) | GPL10558 | 2026-06-23 |
| GEO (GSE16728) | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728 | GEOparse, FTP | Public domain | Disease signature (study 2) | GPL570 | 2026-06-23 |
| ChEMBL | https://www.ebi.ac.uk/chembl | chembl_webresource_client | CC BY-SA 3.0 | Drug structures, MoA, targets | API (live) | 2026-06-23 |
| LINCS L1000 Phase I | GSE92742 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. L-glutamine) | GSE92742 | 2026-06-23 |
| LINCS L1000 Phase II | GSE70138 (GEO) | GEOparse/FTP + cmapPy | CC0 (GEO) | Drug signatures (incl. hydroxyurea) | GSE70138 | 2026-06-23 |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
## Chosen GEO datasets (disease signature, Tier A via 2-study concordance)
The signature is the cross-study concordance of two independent whole-blood studies (genes
significant at q<0.05 in **both** with the same direction). Whole-blood tissue was required so
concordance is meaningful; the two differ by platform and population, which strengthens
robustness.

| Study | Platform | Tissue | Disease group | Healthy group | n disease / healthy |
|---|---|---|---|---|---|
| **GSE35007** | Illumina HumanHT-12 V4 (GPL10558) | whole blood | hb phenotype = SS | hb phenotype = AA | 190 / 12 |
| **GSE16728** | Affymetrix HG-U133 Plus 2.0 (GPL570) | whole blood (PAXgene) | sickle-cell patient | control | 10 / 10 |
- DE method: per-gene Welch t-test + Benjamini-Hochberg (microarray, pure Python).
- Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
- Result: 16,208 genes tested in both → **671 concordant** (444 up / 227 down). Signature =
top 250 up + all 227 down by worst-case q-value.
- **Rejected candidates:** GSE53441 (PBMC, a tissue mismatch with the whole-blood anchor);
GSE84633/GSE84634 (PBMC, no healthy controls).
- **Tier caveat:** GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict
n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.

Reproduce with `scripts/week1_explore.py` (download + DE + concordance) then
`scripts/week1_finalize.py` (mygene mapping + persist).
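
As a rough illustration of the steps listed above (probe collapse, Benjamini-Hochberg correction, two-study concordance, signature selection), here is a minimal pure-Python sketch. It is not the code in `scripts/week1_explore.py`: the function names, the input shapes, and the 250-per-direction cap (inferred from "top 250 up + all 227 down") are assumptions made for illustration. The Welch t-test itself is omitted, so p-values and t-statistics are taken as given.

```python
from statistics import mean


def collapse_probes(probe_rows):
    """Collapse probes to gene symbols, keeping the max-mean-expression probe.

    probe_rows: iterable of (probe_id, symbol, expression_values).
    Returns {symbol: (probe_id, expression_values)}.
    """
    best = {}
    for probe_id, symbol, values in probe_rows:
        if not symbol:
            continue  # probe has no gene symbol; skip it
        if symbol not in best or mean(values) > mean(best[symbol][1]):
            best[symbol] = (probe_id, values)
    return best


def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def concordant_genes(study_a, study_b, alpha=0.05):
    """Keep genes significant in BOTH studies with the same direction.

    study_a, study_b: {symbol: (q_value, t_statistic)} (hypothetical layout).
    Returns {symbol: (direction, worst_case_q)}.
    """
    result = {}
    for gene in study_a.keys() & study_b.keys():
        q_a, t_a = study_a[gene]
        q_b, t_b = study_b[gene]
        if q_a < alpha and q_b < alpha and t_a * t_b > 0:
            direction = "up" if t_a > 0 else "down"
            result[gene] = (direction, max(q_a, q_b))
    return result


def select_signature(concordant, n_up=250, n_down=250):
    """Rank by worst-case q and cap each direction (cap sizes are assumed)."""
    ranked = sorted(concordant.items(), key=lambda item: item[1][1])
    up = [gene for gene, (d, _) in ranked if d == "up"][:n_up]
    down = [gene for gene, (d, _) in ranked if d == "down"][:n_down]
    return up, down
```

Using the worst-case (larger) of the two q-values as the ranking key means a gene only ranks highly when it is strongly supported in both studies, which matches the "worst-case q-value" wording in the result bullet.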
## Drug profiles (Week 2)
The 300-drug set (`drug_set_v1.csv`) is restricted to LINCS-scorable compounds and composed as follows:

| Inclusion reason | n | Notes |
|---|---|---|
| ground_truth | 2 | hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I) |
| related_mechanism | 32 | HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories |
| negative_control | 26 | antifungals, antihistamines, antibiotics, hormones |
| general_sample | 240 | random from LINCS catalog, seed=42 |
- **LINCS signatures:** per-drug consensus = mean of Level-5 MODZ z-scores across the drug's
sig_ids (cell lines/doses/times), restricted to the 978 landmark genes. Drawn from BOTH
phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
- **ChEMBL:** matched by InChIKey — 145/300 (curated drugs ~90%, random research compounds
38%, as expected). 43 drugs carry target annotations; 46 carry mechanism-of-action.
- **Tier:** all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's
not-single-source rule).
- **Gene space (v1.1):** scoring uses the full **12,328-gene** LINCS space, not just the 978
landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only; the larger
space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from
LINCS entirely and remain unscorable.
- Reproduce: `week2_curate_drugset.py` → `week2_chembl.py` → download Level-5 GCTX →
`week2_lincs_extract.py` → `week2_assemble.py`.
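
The per-drug consensus and the signature-overlap figures above can be sketched as follows. This is a hedged illustration, not the logic of `week2_lincs_extract.py`: the function names and dict-based inputs are hypothetical, and it assumes signature genes and LINCS genes are already in the same identifier namespace.

```python
def consensus_signature(profiles):
    """Average z-scores gene-wise across one drug's signatures.

    profiles: list of {gene: z_score} dicts, one per sig_id
    (cell line / dose / time). A gene missing from some profiles is
    averaged over the profiles that do contain it (an assumption).
    """
    sums, counts = {}, {}
    for profile in profiles:
        for gene, z in profile.items():
            sums[gene] = sums.get(gene, 0.0) + z
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: sums[gene] / counts[gene] for gene in sums}


def signature_overlap(signature_genes, gene_space):
    """Return (shared, total, fraction) of signature genes present in gene_space."""
    signature = set(signature_genes)
    shared = signature & set(gene_space)
    fraction = len(shared) / len(signature) if signature else 0.0
    return len(shared), len(signature), fraction


if __name__ == "__main__":
    # Toy data only; real inputs would come from the Level-5 GCTX files.
    drug_profiles = [{"A": 1.0, "B": -2.0}, {"A": 3.0, "B": 0.0}]
    print(consensus_signature(drug_profiles))
    print(signature_overlap(["A", "B", "C", "D"], ["A", "B", "X"]))
```

Running `signature_overlap` once against the landmark gene list and once against the full gene list is one way to reproduce the 56/477 vs 406/477 comparison reported in the gene-space bullet.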
## Licensing note for LINCS
Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept)
the terms are permissive. For productization this needs legal review.