Scaffold Reverso MVP pipeline structure

Set up the project skeleton per PLAN.md §4:

- src/ package: identifiers, disease, drugs, scoring, provenance with pydantic schemas and confidence-tier logic (working); data-pull/compute functions stubbed per their build week
- 5 starter notebooks (01-05) with PLAN-referenced steps
- tests/test_scoring.py: tier-assignment tests pass; scoring reference test xfail until Week 3
- docs/: recovery_test_report, data_sources, known_limitations skeletons
- pyproject.toml (requires-python >=3.11,<3.14), .gitignore, README
- data/ tree preserved via .gitkeep; raw/processed/results gitignored

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 20:19:38 +02:00
parent e717cf40ed
commit b731478f5d
25 changed files with 1038 additions and 4 deletions
--- a/docs/recovery_test_report.md
+++ b/docs/recovery_test_report.md
@@ -0,0 +1,68 @@
+# Sickle Cell Repurposing — Recovery Test Report
+
+> **Status: DRAFT SCAFFOLD (not yet run).** To be filled in during Week 4 from
+> `notebooks/05_recovery_test.ipynb`. Target length: ~2 pages, readable by a sceptical
+> pharma scientist in 5 minutes.
+
+## Pre-registered success criteria
+
+> ⚠️ **Commit this section to git _before_ running the recovery test** (PLAN.md §8, §10).
+
+The MVP passes if:
+
+- Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND**
+- L-glutamine ranks in the **top 25%** (top 75) **OR** is documented as unscorable due to a
+  missing LINCS signature, **AND**
+- At least **4 of 5** negative-control drugs rank in the **bottom half**.
+
+_Pre-registered on: TBD (date of commit)_
+
+---
+
+## Section 1 — Methodology
+
+_5–6 sentences: what was built, the GEO dataset used, the drug-set composition, and the
+scoring method (CMap connectivity, Lamb 2006 / Subramanian 2017)._
+
+## Section 2 — Recovery test result
+
+| Drug | Rank | Percentile | Pass? |
+|---|---|---|---|
+| Hydroxyurea | TBD | TBD | TBD |
+| L-glutamine | TBD | TBD | TBD |
+
+Negative controls (expected: bottom half):
+
+| Control drug | Rank | Bottom half? |
+|---|---|---|
+| TBD | TBD | TBD |
+
+**Overall: PASS / FAIL against pre-registered criteria — TBD**
+
+## Section 3 — Top 10 candidates
+
+| Rank | Drug | Score | Known mechanism | Biological plausibility |
+|---|---|---|---|---|
+| 1 | TBD | TBD | TBD | TBD |
+
+_Note: HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due
+to widespread expression effects — flag these honestly (PLAN.md §9.4)._
+
+## Section 4 — One non-obvious candidate worth investigating
+
+_A single paragraph on the most interesting result. Language must be careful: this is a
+computational hypothesis to test, not a discovery (PLAN.md §9.7)._
+
+## Section 5 — Honest limitations
+
+- Cell-composition confound in whole-blood expression (PLAN.md §9.1)
+- LINCS L1000 cell-line limitations: landmark-gene signatures were measured mostly in cancer cell lines (§9.2)
+- Missing signatures (e.g. L-glutamine) (§9.3)
+- No mechanistic validation layer: the output is hypothesis generation, not validated prediction (§9.6)
+
+## Section 6 — What v2 would fix
+
+- Cell-type deconvolution of the disease signature
+- Knowledge graph fallback for missing-signature drugs
+- A second disease to test generalization (the real test — sickle cell results do not prove
+  the platform generalizes, §9.5)
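
The pre-registered success criteria in `docs/recovery_test_report.md` combine three rank thresholds with AND/OR logic, which is easy to misread. The sketch below shows one way that check could be expressed. It is not part of this commit: the function name, its parameters, and the conventions that ranks are 1-based (1 = strongest candidate) and that `None` encodes "documented as unscorable" are all assumptions made for illustration.

```python
from typing import Optional, Sequence

N_DRUGS = 300  # size of the ranked drug set stated in the criteria


def passes_recovery_test(
    hydroxyurea_rank: int,
    l_glutamine_rank: Optional[int],
    negative_control_ranks: Sequence[int],
    n_drugs: int = N_DRUGS,
) -> bool:
    """Hypothetical helper: evaluate the pre-registered criteria.

    Assumes 1-based ranks where 1 is the best-scoring drug, and that
    l_glutamine_rank is None when the drug is documented as unscorable
    (missing LINCS signature).
    """
    top_10_cutoff = n_drugs * 10 // 100   # 30 when n_drugs == 300
    top_25_cutoff = n_drugs * 25 // 100   # 75 when n_drugs == 300
    half = n_drugs // 2                   # ranks above this are in the bottom half

    hydroxyurea_ok = hydroxyurea_rank <= top_10_cutoff
    l_glutamine_ok = l_glutamine_rank is None or l_glutamine_rank <= top_25_cutoff
    controls_in_bottom_half = sum(rank > half for rank in negative_control_ranks)
    controls_ok = controls_in_bottom_half >= 4  # "at least 4 of 5"

    return hydroxyurea_ok and l_glutamine_ok and controls_ok
```

Integer arithmetic is used for the cutoffs so that boundary ranks (30, 75, 150) are not affected by floating-point rounding. The bottom-half boundary (rank 151 and beyond for 300 drugs) is one reading of the criterion, and the report should state the intended boundary explicitly before the test is run.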