Junior B. 3417f85eb1 v1.1: full gene space + specificity z-score; hydroxyurea recovers

Post-hoc improvement after the pre-registered v1 recovery test failed.
Two changes, diagnosing v1's failure:
- score on the full 12,328-gene LINCS space (week2_lincs_extract.py),
  lifting signature overlap from 12% to 85% (brings erythroid markers in)
- src/scoring.py: KS connectivity + per-drug specificity z-score
  (spec_z = SDs below a 1,000 random-query null). Primary ranking is
  now spec_z. (Textbook tau saturated at +/-100 for a coherent query —
  documented; needs a reference-signature library, a v2 item.)
- week3_scoring.py: spec_z primary + WTCS reference + prior-blended
- tests: tau/spec_z calibration test; 19 passing
- scripts/exp_genespace.py: the BING vs all-12,328 comparison

Result: hydroxyurea recovers (rank 40 -> 18, top 6%, passes top-10%),
confirming the v1 failure was the landmark bottleneck not the algorithm.
Overall STILL FAILS: L-glutamine does not reverse (rank 213, metabolite),
and negative controls (norethindrone, ciprofloxacin) rank top-3 —
connectivity != therapeutic relatedness. v1.1 is post-hoc/exploratory,
not a confirmatory test; reported as such in recovery_test_report.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-23 22:57:30 +02:00

Data Sources

Record the version and download date for every source actually used. This file is the artifact that proves reproducibility (PLAN.md §6, Week 4 task 4).

Source	URL	Access	License	Use in MVP	Version	Download date
Open Targets	https://platform.opentargets.org	API, bulk Parquet	CC0	Target-disease graph	TBD	TBD
MONDO	http://www.obofoundry.org/ontology/mondo.html	OBO file	CC BY 4.0	Disease ID	TBD	TBD
Orphanet	https://www.orpha.net	Bulk XML	CC BY 4.0	Rare disease metadata	TBD	TBD
OMIM	https://omim.org	Free for academic use	Commercial use requires a license	Disease genetics	TBD	TBD
GEO (GSE35007)	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35007	GEOparse, FTP	Public domain	Disease signature (study 1)	GPL10558	2026-06-23
GEO (GSE16728)	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16728	GEOparse, FTP	Public domain	Disease signature (study 2)	GPL570	2026-06-23
ChEMBL	https://www.ebi.ac.uk/chembl	chembl_webresource_client	CC BY-SA 3.0	Drug structures, MoA, targets	API (live)	2026-06-23
LINCS L1000 Phase I	GSE92742 (GEO)	GEOparse/FTP + cmapPy	CC0 (GEO)	Drug signatures (incl. L-glutamine)	GSE92742	2026-06-23
LINCS L1000 Phase II	GSE70138 (GEO)	GEOparse/FTP + cmapPy	CC0 (GEO)	Drug signatures (incl. hydroxyurea)	GSE70138	2026-06-23
ClinicalTrials.gov	https://clinicaltrials.gov	API	Public domain	Trial history	TBD	TBD
FDA DailyMed	https://dailymed.nlm.nih.gov	API	Public domain	Approved labels	TBD	TBD
Reactome	https://reactome.org	API, bulk	CC0	Pathway data (Week 3 prior)	TBD	TBD

Chosen GEO datasets (disease signature, Tier A via 2-study concordance)

The signature is the cross-study concordance of two independent whole-blood studies: genes significant at q<0.05 in both studies, with the same direction of change. Both studies had to be whole blood for the concordance to be meaningful; they differ in platform and population, which makes the shared signal more robust.

Study	Platform	Tissue	Disease group	Healthy group	n disease / healthy
GSE35007	Illumina HumanHT-12 V4 (GPL10558)	whole blood	hb phenotype = SS	hb phenotype = AA	190 / 12
GSE16728	Affymetrix HG-U133 Plus 2.0 (GPL570)	whole blood (PAXgene)	sickle-cell patient	control	10 / 10

DE method: per-gene Welch t-test + Benjamini–Hochberg (microarray, pure Python).
Probes collapsed to HGNC symbol (keep max-mean-expression probe) before concordance.
Result: 16,208 genes tested in both → 671 concordant (444 up / 227 down). Signature = top 250 up + all 227 down by worst-case q-value.
Rejected candidates: GSE53441 (PBMC — tissue mismatch with the whole-blood anchor); GSE84633/GSE84634 (PBMC, no healthy controls).
Tier caveat: GSE16728 is exactly 10/group (two PAXgene preps merged), below the strict n>10 rule; Tier A is assigned on cross-study concordance, documented in the signature JSON.

Reproduce with scripts/week1_explore.py (download + DE + concordance) then scripts/week1_finalize.py (mygene mapping + persist).
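As a rough illustration of the concordance step described above, the sketch below applies Benjamini-Hochberg adjustment to a list of p-values, keeps genes that pass q<0.05 in both studies with the same direction, and ranks them by worst-case q-value. It is not the code in scripts/week1_explore.py: the function names, the `(q_value, direction)` input layout, and the toy gene names are assumptions made for this example, and the Welch t-test that produces the p-values is left out.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def concordant_genes(study_a, study_b, alpha=0.05):
    """Each study maps gene -> (q_value, direction), direction being +1 or -1.

    Returns gene -> (direction, worst_case_q) for genes significant in both
    studies with the same direction.
    """
    out = {}
    for gene in study_a.keys() & study_b.keys():
        q_a, d_a = study_a[gene]
        q_b, d_b = study_b[gene]
        if q_a < alpha and q_b < alpha and d_a == d_b:
            out[gene] = (d_a, max(q_a, q_b))
    return out


def build_signature(concordant, n_up=250, n_down=None):
    """Rank by worst-case q-value; n_down=None keeps every down gene."""
    ranked = sorted(concordant.items(), key=lambda kv: kv[1][1])
    up = [g for g, (d, _) in ranked if d > 0][:n_up]
    down = [g for g, (d, _) in ranked if d < 0][:n_down]
    return up, down


# Toy input (hypothetical gene names, not real results).
toy_a = {"G1": (0.01, 1), "G2": (0.02, -1), "G3": (0.20, 1), "G4": (0.01, 1)}
toy_b = {"G1": (0.03, 1), "G2": (0.001, -1), "G3": (0.01, 1), "G4": (0.01, -1)}
toy_up, toy_down = build_signature(concordant_genes(toy_a, toy_b))
```

In the toy input, G3 is dropped because it is not significant in the first study, and G4 is dropped because the two studies disagree on direction.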

Drug profiles (Week 2)

300-drug set (drug_set_v1.csv), composed and restricted to LINCS-scorable compounds:

Inclusion reason	n	Notes
ground_truth	2	hydroxyurea (Phase II), L-glutamine = "glutamine" (Phase I)
related_mechanism	32	HbF inducers (decitabine, azacitidine, vorinostat, panobinostat, romidepsin…), NO donors, antioxidants, anti-inflammatories
negative_control	26	antifungals, antihistamines, antibiotics, hormones
general_sample	240	random from LINCS catalog, seed=42
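The seeded draw for the general_sample row could look roughly like the sketch below. This is an assumption about the approach, not the logic in week2_curate_drugset.py: the function name `sample_general`, its parameters, and the compound identifiers are hypothetical. Sorting the pool before sampling keeps the result independent of set iteration order, so the same seed gives the same 240 compounds.

```python
import random


def sample_general(catalog, curated, n=240, seed=42):
    """Draw n compounds from the catalog, excluding curated ones, reproducibly."""
    # Sort so the draw depends only on the seed, not on set ordering.
    pool = sorted(set(catalog) - set(curated))
    if n > len(pool):
        raise ValueError(f"requested {n} compounds but only {len(pool)} are available")
    return random.Random(seed).sample(pool, n)


# Toy catalog with hypothetical compound IDs.
toy_catalog = [f"cpd_{i}" for i in range(1000)]
toy_curated = toy_catalog[:60]
toy_sample = sample_general(toy_catalog, toy_curated)
```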

LINCS signatures: per-drug consensus = mean of Level-5 MODZ z-scores across the drug's sig_ids (cell lines/doses/times). In v1 the consensus was restricted to the 978 landmark genes; v1.1 scores on the full gene space (see Gene space below). Signatures are drawn from both phases (hydroxyurea is Phase-II-only; L-glutamine is Phase-I-only). All 300 drugs scored.
ChEMBL: matched by InChIKey, 145/300 drugs (about 90% of curated drugs and 38% of random research compounds, as expected). 43 drugs carry target annotations; 46 carry a mechanism-of-action annotation.
Tier: all signature-backed drugs are Tier B (LINCS is a single source → fails Tier A's not-single-source rule).
Gene space (v1.1): scoring uses the full 12,328-gene LINCS space, not just the 978 landmarks. Signature overlap is 406/477 (85%) vs 56/477 (12%) for landmark-only; the larger space is what recovers hydroxyurea (see recovery_test_report.md). HBG1/HBG2 are absent from LINCS entirely and remain unscorable.
Reproduce: week2_curate_drugset.py → week2_chembl.py → download Level-5 GCTX → week2_lincs_extract.py → week2_assemble.py.
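The per-drug consensus and the signature-overlap figures above can be sketched in pure Python as follows. This is a minimal illustration, not the code in week2_lincs_extract.py: it assumes each sig_id profile is already loaded as a gene -> z-score dict, and the function names and toy values are hypothetical.

```python
from statistics import fmean


def consensus_signature(profiles):
    """Average z-scores gene by gene across one drug's sig_id profiles.

    profiles: list of dicts mapping gene -> z-score, one dict per sig_id.
    Only genes present in every profile are kept.
    """
    genes = set(profiles[0])
    for profile in profiles[1:]:
        genes &= set(profile)
    return {g: fmean(p[g] for p in profiles) for g in sorted(genes)}


def signature_overlap(signature_genes, gene_space):
    """Return (n_hits, fraction) of signature genes found in the gene space."""
    hits = [g for g in signature_genes if g in gene_space]
    return len(hits), len(hits) / len(signature_genes)


# Toy data (hypothetical genes and z-scores).
toy_profiles = [{"A": 1.0, "B": -2.0}, {"A": 3.0, "B": 0.0}]
toy_consensus = consensus_signature(toy_profiles)
toy_hits, toy_fraction = signature_overlap(["A", "B", "C", "D"], {"A", "B", "C"})
```

With the real inputs, the same overlap calculation is what yields 406/477 for the full space against 56/477 for the landmark-only space.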

Licensing note for LINCS

Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept) the terms are permissive. For productization this needs legal review.
