Junior B. b731478f5d Scaffold Reverso MVP pipeline structure

Set up the project skeleton per PLAN.md §4:
- src/ package: identifiers, disease, drugs, scoring, provenance
  with pydantic schemas and confidence-tier logic (working);
  data-pull/compute functions stubbed per their build week
- 5 starter notebooks (01-05) with PLAN-referenced steps
- tests/test_scoring.py: tier-assignment tests pass; scoring
  reference test xfail until Week 3
- docs/: recovery_test_report, data_sources, known_limitations skeletons
- pyproject.toml (requires-python >=3.11,<3.14), .gitignore, README
- data/ tree preserved via .gitkeep; raw/processed/results gitignored

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-23 20:20:09 +02:00


Known Limitations

The honest list of what would break this MVP at scale or in a different disease. Useful for the next pharma conversation: "yes, we know these are limitations, and here's how v2 addresses them." Source: PLAN.md §9.

Cell-composition confound in sickle cell expression data. Whole-blood differential expression partly reflects different blood cell ratios, not disease biology. v1 acknowledges this; v2 should deconvolve cell types.
LINCS L1000 cell-line limitations. The 978 landmark genes were measured mostly in cancer cell lines (MCF7, A375, PC3, …). Signatures for non-oncology diseases may be noisy. A field-wide limitation, not unique to Reverso.
L-glutamine probably has no LINCS signature. Amino acids and metabolites weren't LINCS priorities. If so, the ground-truth test effectively rests on hydroxyurea alone, which makes it a weaker test. Status: TBD. Record the actual finding here once LINCS is pulled (Week 2).
Connectivity scoring surfaces broad-effect drugs as false positives. HDAC inhibitors and broad kinase inhibitors often top connectivity rankings simply because they perturb many genes. The mechanistic prior (Week 3) helps filter, but does not eliminate this.
Hydroxyurea will probably pass the recovery test by construction. Sickle cell + hydroxyurea is a well-studied pair. Passing is necessary but not sufficient to claim the platform generalizes. The next disease is the real test — do not sell sickle cell results as proving the platform.
No mechanistic validation layer. Pure ML matching is not sufficient for extrapolation (flagged by multiple experts). The MVP knowingly omits the mechanistic layer; it is a phase-2 addition. Position the MVP as "discovery hypothesis generation," not "validated prediction."
Top-ranked novel candidates are not wet-lab validated. They are computational hypotheses to test, not discoveries. Use careful language in any write-up.
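
To make the broad-effect caveat concrete, one option is to flag drugs whose signatures move a large share of the measured genes before trusting their rank. The sketch below is an illustration under assumptions, not Reverso's scoring code: the function names, the |z| cutoff of 2.0, and the 30% breadth threshold are all hypothetical, and the toy z-scores are made up.

```python
from typing import Dict, List

# Hypothetical cutoffs, chosen only for illustration; tune against real data.
Z_CUTOFF = 2.0           # |z| above this counts as "perturbed"
BREADTH_THRESHOLD = 0.3  # flag drugs that perturb more than 30% of genes


def perturbation_breadth(z_scores: List[float], z_cutoff: float = Z_CUTOFF) -> float:
    """Return the fraction of genes whose |z-score| exceeds z_cutoff."""
    if not z_scores:
        raise ValueError("signature has no genes")
    perturbed = sum(1 for z in z_scores if abs(z) > z_cutoff)
    return perturbed / len(z_scores)


def flag_broad_effect(
    signatures: Dict[str, List[float]],
    threshold: float = BREADTH_THRESHOLD,
) -> Dict[str, bool]:
    """Map each drug to True if its signature perturbs an unusually large gene share."""
    return {
        drug: perturbation_breadth(z_scores) > threshold
        for drug, z_scores in signatures.items()
    }


# Toy signatures (made-up numbers, 10 genes each; placeholder drug names).
toy = {
    "broad_drug": [3.1, -2.8, 2.5, -3.3, 0.4, 2.2, -2.6, 0.1, 2.9, -0.2],
    "narrow_drug": [0.3, -0.1, 2.7, 0.2, -0.4, 0.0, 0.5, -2.3, 0.1, 0.2],
}
flags = flag_broad_effect(toy)
```

A flag like this is only a triage aid: it marks rankings that deserve a second look and does not replace the mechanistic prior.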

Drug-specific gaps (fill in during Week 2–3)

Drug	Issue	Handling
TBD	e.g. no LINCS signature	flagged "not scored, no signature available"
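
The "not scored" handling in the table could be carried through the results as an explicit status rather than by silently dropping the drug. This is a minimal sketch, assuming a simple per-drug result record; `DrugResult` and `score_or_flag` are hypothetical names, and it uses stdlib dataclasses to stay self-contained even though the repo's schemas are pydantic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

NOT_SCORED = "not scored, no signature available"  # wording from the table above


@dataclass(frozen=True)
class DrugResult:
    drug: str
    score: Optional[float]  # None when the drug could not be scored
    handling: str           # "scored" or a gap label such as NOT_SCORED


def score_or_flag(drugs: List[str], scores: Dict[str, float]) -> List[DrugResult]:
    """Keep every drug in the output; flag those that have no score."""
    results = []
    for drug in drugs:
        if drug in scores:
            results.append(DrugResult(drug, scores[drug], "scored"))
        else:
            results.append(DrugResult(drug, None, NOT_SCORED))
    return results


# Placeholder names and a made-up score, not real findings.
results = score_or_flag(["drug_a", "drug_b"], {"drug_a": -0.42})
gaps = [r.drug for r in results if r.handling == NOT_SCORED]
```

Keeping unscored drugs in the output means the gap table can be generated from the results instead of being maintained by hand.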
