Junior B. b731478f5d Scaffold Reverso MVP pipeline structure

Set up the project skeleton per PLAN.md §4:
- src/ package: identifiers, disease, drugs, scoring, provenance
  with pydantic schemas and confidence-tier logic (working);
  data-pull/compute functions stubbed per their build week
- 5 starter notebooks (01-05) with PLAN-referenced steps
- tests/test_scoring.py: tier-assignment tests pass; scoring
  reference test xfail until Week 3
- docs/: recovery_test_report, data_sources, known_limitations skeletons
- pyproject.toml (requires-python >=3.11,<3.14), .gitignore, README
- data/ tree preserved via .gitkeep; raw/processed/results gitignored

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-23 20:20:09 +02:00

Sickle Cell Repurposing — Recovery Test Report

Status: DRAFT SCAFFOLD — not yet run. Filled in during Week 4 from notebooks/05_recovery_test.ipynb. Target length: ~2 pages, readable by a sceptical pharma scientist in 5 minutes.

Pre-registered success criteria

⚠️ Commit this section to git before running the recovery test (PLAN.md §8, §10).

The MVP passes if:

Hydroxyurea ranks in the top 10% (top 30 of 300), AND
L-glutamine either ranks in the top 25% (top 75 of 300) or is documented as unscorable due to a missing LINCS signature, AND
At least 4 of 5 negative-control drugs rank in the bottom half.

Pre-registered on: TBD (date of commit)
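
The three criteria above can be written down as an executable check, which makes the pass/fail decision mechanical once ranks exist. The following is a minimal sketch, not the project's code: `RecoveryOutcome`, `evaluate_recovery`, and all parameter names are hypothetical, and it assumes rank 1 is the best (most strongly reversing) candidate.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryOutcome:
    """Per-criterion results (hypothetical container, not part of src/)."""
    hydroxyurea_pass: bool
    l_glutamine_pass: bool
    negative_controls_pass: bool

    @property
    def overall_pass(self) -> bool:
        # All three pre-registered criteria are joined by AND.
        return (
            self.hydroxyurea_pass
            and self.l_glutamine_pass
            and self.negative_controls_pass
        )


def evaluate_recovery(
    n_drugs: int,
    hydroxyurea_rank: int,
    l_glutamine_rank: Optional[int],
    negative_control_ranks: Sequence[int],
    l_glutamine_unscorable_documented: bool = False,
    min_controls_in_bottom_half: int = 4,
) -> RecoveryOutcome:
    """Apply the pre-registered criteria. Assumes rank 1 = best candidate."""
    # Integer arithmetic avoids floating-point edge cases at the cut-offs.
    hydroxyurea_pass = hydroxyurea_rank * 10 <= n_drugs  # top 10%

    if l_glutamine_rank is None:
        # No LINCS signature: passes only if the gap is documented.
        l_glutamine_pass = l_glutamine_unscorable_documented
    else:
        l_glutamine_pass = l_glutamine_rank * 4 <= n_drugs  # top 25%

    # A control is in the bottom half if its rank is past the midpoint.
    in_bottom_half = sum(1 for r in negative_control_ranks if r * 2 > n_drugs)
    negative_controls_pass = in_bottom_half >= min_controls_in_bottom_half

    return RecoveryOutcome(hydroxyurea_pass, l_glutamine_pass, negative_controls_pass)
```

With 300 drugs, the cut-offs fall at rank 30 (top 10%), rank 75 (top 25%), and rank 151 onwards (bottom half). Calling `evaluate_recovery(300, 30, None, [151, 200, 250, 299, 40], l_glutamine_unscorable_documented=True).overall_pass` returns `True`: hydroxyurea sits exactly on the cut-off, L-glutamine is documented as unscorable, and four of five controls are in the bottom half.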

Section 1 — Methodology

5–6 sentences: what was built, the GEO dataset used, the drug-set composition, and the scoring method (CMap connectivity, Lamb 2006 / Subramanian 2017).
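
For readers unfamiliar with connectivity scoring, the sketch below illustrates the Kolmogorov-Smirnov-style statistic in the spirit of Lamb 2006. It is a simplified illustration under stated assumptions, not the pipeline's scoring code: the function names are hypothetical, the drug signature is assumed to be a gene list ordered from most up-regulated to most down-regulated, and the cross-instance normalization step and the weighted variant of Subramanian 2017 are both omitted.

```python
from typing import Sequence, Set


def ks_enrichment(ranked_genes: Sequence[str], tag_genes: Set[str]) -> float:
    """Signed KS-style enrichment of tag_genes within a ranked gene list.

    Positive when the tags cluster near the top of the list,
    negative when they cluster near the bottom.
    """
    n = len(ranked_genes)
    # 1-based positions of the tag genes in the ranked list, ascending.
    positions = sorted(i + 1 for i, g in enumerate(ranked_genes) if g in tag_genes)
    t = len(positions)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v / n for j, v in enumerate(positions))
    b = max(v / n - j / t for j, v in enumerate(positions))
    return a if a > b else -b


def connectivity_score(
    ranked_genes: Sequence[str],
    disease_up: Set[str],
    disease_down: Set[str],
) -> float:
    """Raw (un-normalized) connectivity score for one drug signature."""
    ks_up = ks_enrichment(ranked_genes, disease_up)
    ks_down = ks_enrichment(ranked_genes, disease_down)
    # If both tag lists are enriched in the same direction, the signal is
    # ambiguous and the score is set to zero.
    if ks_up * ks_down > 0:
        return 0.0
    return ks_up - ks_down
```

Under these assumptions a negative score indicates reversal: genes up-regulated in the disease signature sit near the bottom of the drug's ranked list and down-regulated genes near the top. For a ten-gene list `g1` to `g10`, `connectivity_score(genes, {"g9", "g10"}, {"g1", "g2"})` is negative, and swapping the two tag sets makes it positive.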

Section 2 — Recovery test result

Drug	Rank	Percentile	Pass?
Hydroxyurea	TBD	TBD	TBD
L-glutamine	TBD	TBD	TBD

Negative controls (expected: bottom half):

Control drug	Rank	Bottom half?
TBD	TBD	TBD

Overall: PASS / FAIL against pre-registered criteria — TBD

Section 3 — Top 10 candidates

Rank	Drug	Score	Known mechanism	Biological plausibility
1	TBD	TBD	TBD	TBD

Note: HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due to widespread expression effects — flag these honestly (PLAN.md §9.4).
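
One way to make this flagging systematic is a keyword check on the "Known mechanism" column. This is a rough sketch only: `PROMISCUOUS_CLASSES` and `flag_promiscuous` are hypothetical names, and plain substring matching is over-inclusive because it flags every kinase inhibitor, not just broad-spectrum ones, so flagged rows still need manual review.

```python
# Mechanism classes the note above singles out (lower-case for matching).
PROMISCUOUS_CLASSES = ("hdac inhibitor", "kinase inhibitor")


def flag_promiscuous(mechanism: str) -> bool:
    """Return True if the mechanism text mentions a flagged class."""
    text = mechanism.lower()
    return any(label in text for label in PROMISCUOUS_CLASSES)
```

For example, `flag_promiscuous("HDAC inhibitor")` returns `True`, while `flag_promiscuous("Ribonucleotide reductase inhibitor")` returns `False`.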

Section 4 — One non-obvious candidate worth investigating

A single paragraph on the most interesting result. Language must be careful: this is a computational hypothesis to test, not a discovery (PLAN.md §9.7).

Section 5 — Honest limitations

Cell-composition confound in whole-blood expression (PLAN.md §9.1)
LINCS L1000 cell-line limitations — landmark genes measured mostly in cancer lines (§9.2)
Missing signatures (e.g. L-glutamine) (§9.3)
No mechanistic validation layer: the output is hypothesis generation, not validated prediction (§9.6)

Section 6 — What v2 would fix

Cell-type deconvolution of the disease signature
Knowledge graph fallback for missing-signature drugs
A second disease to test generalization (the real test — sickle cell results do not prove the platform generalizes, §9.5)
