Scaffold Reverso MVP pipeline structure
Set up the project skeleton per PLAN.md §4:

- src/ package: identifiers, disease, drugs, scoring, provenance with pydantic schemas and confidence-tier logic (working); data-pull/compute functions stubbed per their build week
- 5 starter notebooks (01-05) with PLAN-referenced steps
- tests/test_scoring.py: tier-assignment tests pass; scoring reference test xfail until Week 3
- docs/: recovery_test_report, data_sources, known_limitations skeletons
- pyproject.toml (requires-python >=3.11,<3.14), .gitignore, README
- data/ tree preserved via .gitkeep; raw/processed/results gitignored

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
28
docs/data_sources.md
Normal file
@@ -0,0 +1,28 @@
# Data Sources

> Fill in version + download date for every source actually used. This file is the artifact
> that proves reproducibility (PLAN.md §6, Week 4 task 4). Record date and version for **all**
> downloads.

| Source | URL | Access | License | Use in MVP | Version | Download date |
|---|---|---|---|---|---|---|
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph | TBD | TBD |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
| OMIM | https://omim.org | Free for academic use | Commercial use requires a license | Disease genetics | TBD | TBD |
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data | TBD | TBD |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |


## Chosen GEO dataset

_Document the chosen study fully: accession, platform, n per group, publication, and why it was
selected over the alternatives (GSE53441, GSE35007, …)._

## Licensing note for LINCS

Read the LINCS data use terms before any commercial use. For the MVP (research / proof-of-concept)
the terms are permissive; productization needs legal review.
39
docs/known_limitations.md
Normal file
@@ -0,0 +1,39 @@
# Known Limitations

The honest list of what would break this MVP at scale or in a different disease. Useful for the
next pharma conversation: "yes, we know these are limitations, here's how v2 addresses them."
Source: PLAN.md §9.

1. **Cell-composition confound in sickle cell expression data.** Whole-blood differential
   expression partly reflects shifts in blood cell-type proportions rather than disease biology.
   v1 acknowledges this; v2 should deconvolve cell types.

2. **LINCS L1000 cell-line limitations.** The 978 landmark genes were measured mostly in cancer
   cell lines (MCF7, A375, PC3, …). Signatures for non-oncology diseases may be noisy. This is a
   field-wide limitation, not unique to Reverso.

3. **L-glutamine probably has no LINCS signature.** Amino acids and metabolites weren't LINCS
   priorities. If true, the ground-truth test effectively rests on hydroxyurea alone, which is
   weaker. _Status: TBD. Record the actual finding here once LINCS is pulled (Week 2)._

4. **Connectivity scoring surfaces broad-effect drugs as false positives.** HDAC inhibitors and
   broad kinase inhibitors often top connectivity rankings simply because they perturb many
   genes. The mechanistic prior (Week 3) helps filter these out but does not eliminate them.

5. **Hydroxyurea will probably pass the recovery test by construction.** Sickle cell +
   hydroxyurea is a well-studied pair. Passing is necessary but not sufficient to claim the
   platform generalizes. The next disease is the real test; do not sell sickle cell results as
   proving the platform.

6. **No mechanistic validation layer.** Pure ML matching is not sufficient for extrapolation
   (flagged by multiple experts). The MVP knowingly omits the mechanistic layer; it is a phase-2
   addition. Position the MVP as "discovery hypothesis generation," not "validated prediction."

7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
   to test, not discoveries. Use careful language in any write-up.

## Drug-specific gaps (fill in during Week 2–3)

| Drug | Issue | Handling |
|---|---|---|
| TBD | e.g. no LINCS signature | flagged "not scored, no signature available" |

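
The handling in the table above could be implemented as a small partitioning step before scoring. This is a minimal sketch under assumptions: the function name `flag_unscorable`, the shape of `signatures` (a mapping from drug name to signature), and the toy drug names are all hypothetical, and the signature values are placeholders rather than LINCS data.

```python
NOT_SCORED = "not scored, no signature available"


def flag_unscorable(drugs, signatures):
    """Split candidate drugs into scorable ones and flagged gaps.

    `signatures` maps drug name -> signature. A drug without an entry is
    flagged with NOT_SCORED rather than silently dropped from the ranking.
    """
    scorable = [d for d in drugs if d in signatures]
    gaps = {d: NOT_SCORED for d in drugs if d not in signatures}
    return scorable, gaps


# Toy input with invented names; only drug_a has a (placeholder) signature.
drugs = ["drug_a", "drug_b"]
signatures = {"drug_a": [0.1, -0.4, 0.7]}
scorable, gaps = flag_unscorable(drugs, signatures)
print(scorable)
print(gaps)
```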
68
docs/recovery_test_report.md
Normal file
@@ -0,0 +1,68 @@
# Sickle Cell Repurposing: Recovery Test Report

> **Status: DRAFT SCAFFOLD, not yet run.** To be filled in during Week 4 from
> `notebooks/05_recovery_test.ipynb`. Target length: ~2 pages, readable by a sceptical
> pharma scientist in 5 minutes.

## Pre-registered success criteria

> ⚠️ **Commit this section to git _before_ running the recovery test** (PLAN.md §8, §10).

The MVP passes if:

- Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND**
- L-glutamine ranks in the **top 25%** (top 75 of 300) **OR** is documented as unscorable due to a
  missing LINCS signature, **AND**
- At least **4 of 5** negative-control drugs rank in the **bottom half**.

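
The three criteria above can be written down as a single check, which makes it harder to reinterpret them after the results are in. The sketch below is one possible encoding, not the notebook's implementation. It assumes 1-based ranks where 1 is the best score, treats "bottom half" as a rank strictly greater than half the drug count, and uses a hypothetical function name; the ranks in the example call are invented placeholders, not results.

```python
from typing import Optional, Sequence


def passes_recovery_test(
    n_drugs: int,
    hydroxyurea_rank: int,
    l_glutamine_rank: Optional[int],
    negative_control_ranks: Sequence[int],
) -> bool:
    """Evaluate the pre-registered criteria (ranks are 1-based, 1 = best).

    Pass l_glutamine_rank=None when L-glutamine is documented as unscorable
    because it has no LINCS signature.
    """
    # Integer arithmetic avoids floating-point edge cases at the cut-offs.
    hydroxyurea_ok = hydroxyurea_rank * 10 <= n_drugs            # top 10%
    l_glutamine_ok = (
        l_glutamine_rank is None or l_glutamine_rank * 4 <= n_drugs  # top 25%
    )
    in_bottom_half = [rank * 2 > n_drugs for rank in negative_control_ranks]
    controls_ok = sum(in_bottom_half) >= 4                       # 4 of 5 controls
    return hydroxyurea_ok and l_glutamine_ok and controls_ok


# Invented ranks for illustration only.
result = passes_recovery_test(
    n_drugs=300,
    hydroxyurea_rank=12,
    l_glutamine_rank=None,
    negative_control_ranks=[180, 220, 95, 260, 299],
)
print(result)
```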

_Pre-registered on: TBD (date of commit)_

---

## Section 1 — Methodology

_5–6 sentences: what was built, the GEO dataset used, the drug-set composition, and the
scoring method (CMap connectivity, Lamb 2006 / Subramanian 2017)._

## Section 2 — Recovery test result

| Drug | Rank | Percentile | Pass? |
|---|---|---|---|
| Hydroxyurea | TBD | TBD | TBD |
| L-glutamine | TBD | TBD | TBD |

Negative controls (expected: bottom half):

| Control drug | Rank | Bottom half? |
|---|---|---|
| TBD | TBD | TBD |

**Overall: PASS / FAIL against pre-registered criteria — TBD**

## Section 3 — Top 10 candidates

| Rank | Drug | Score | Known mechanism | Biological plausibility |
|---|---|---|---|---|
| 1 | TBD | TBD | TBD | TBD |

_Note: HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due
to widespread expression effects; flag these honestly (PLAN.md §9.4)._

## Section 4 — One non-obvious candidate worth investigating

_A single paragraph on the most interesting result. Language must be careful: this is a
computational hypothesis to test, not a discovery (PLAN.md §9.7)._

## Section 5 — Honest limitations

- Cell-composition confound in whole-blood expression (PLAN.md §9.1)
- LINCS L1000 cell-line limitations: landmark genes measured mostly in cancer lines (§9.2)
- Missing signatures (e.g. L-glutamine) (§9.3)
- No mechanistic validation layer: the MVP is discovery hypothesis generation, not validated prediction (§9.6)

## Section 6 — What v2 would fix

- Cell-type deconvolution of the disease signature
- Knowledge-graph fallback for missing-signature drugs
- A second disease to test generalization (the real test; sickle cell results do not prove
  the platform generalizes, §9.5)