Reverso/docs/data_sources.md
Junior B. b731478f5d Scaffold Reverso MVP pipeline structure
Set up the project skeleton per PLAN.md §4:
- src/ package: identifiers, disease, drugs, scoring, provenance
  with pydantic schemas and confidence-tier logic (working);
  data-pull/compute functions stubbed per their build week
- 5 starter notebooks (01-05) with PLAN-referenced steps
- tests/test_scoring.py: tier-assignment tests pass; scoring
  reference test xfail until Week 3
- docs/: recovery_test_report, data_sources, known_limitations skeletons
- pyproject.toml (requires-python >=3.11,<3.14), .gitignore, README
- data/ tree preserved via .gitkeep; raw/processed/results gitignored

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 20:20:09 +02:00

Data Sources

Fill in the version and download date for every source actually used. This file is the artifact that proves reproducibility (PLAN.md §6, Week 4 task 4).

| Source | URL | Access | License | Use in MVP | Version | Download date |
|---|---|---|---|---|---|---|
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph | TBD | TBD |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data | TBD | TBD |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
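
One possible way to capture the Version and Download date cells at download time is sketched below. This is a minimal illustration, not part of the existing src/ package: `SourceRecord`, `sha256_of`, and `record_download` are hypothetical names, and storing a SHA-256 checksum next to the version is an assumption that goes beyond what this file currently asks for.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass(frozen=True)
class SourceRecord:
    """One row of the data-sources table (hypothetical helper)."""

    source: str           # name as listed in the Source column
    version: str          # release identifier reported by the provider
    download_date: date   # day the file was fetched
    sha256: str           # checksum of the downloaded file (assumed extra field)

    def as_table_cells(self) -> tuple[str, str]:
        """Return the (Version, Download date) cells for the table."""
        return self.version, self.download_date.isoformat()


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bulk downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_download(
    source: str, version: str, path: Path, today: date | None = None
) -> SourceRecord:
    """Build a record for a file that has just been downloaded to `path`."""
    return SourceRecord(source, version, today or date.today(), sha256_of(path))
```

Calling `record_download(...)` right after each pull and copying `as_table_cells()` into the table would keep the TBD cells from being filled in from memory later.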

Chosen GEO dataset

Document the chosen study fully: accession, platform, n per group, publication, and why it was selected over the alternatives (GSE53441, GSE35007, …).
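
The fields listed above could be held in a small structured record so that a missing entry is easy to spot before Week 4. The sketch below is only an illustration under assumptions: `GeoStudyRecord` and `missing_fields` are hypothetical names, and treating the string "TBD" as "not yet filled in" simply mirrors the placeholder used in the table.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GeoStudyRecord:
    """Documentation record for the chosen GEO study (hypothetical helper)."""

    accession: str                # GEO series accession of the chosen study
    platform: str                 # platform identifier of the assay
    n_per_group: dict[str, int]   # group label -> number of samples
    publication: str              # citation or identifier of the paper
    selection_rationale: str      # why this study beat the alternatives


def missing_fields(record: GeoStudyRecord) -> list[str]:
    """Return the names of fields that are empty or still marked 'TBD'."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if not value or value == "TBD":
            missing.append(f.name)
    return missing


# A record that has not been filled in yet: every field is reported as missing.
placeholder = GeoStudyRecord("TBD", "TBD", {}, "TBD", "TBD")
print(missing_fields(placeholder))
```

A check like `missing_fields(record) == []` could then serve as a simple gate before this section is considered complete.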

Licensing note for LINCS

Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept) the terms are permissive. For productization this needs legal review.