
Junior B. 51bd90df41 Redocking-RMSD validation fails 3/3: pipeline-quality issue

§12.4 de-biased validation (scripts/dock_validate.py).
Redock each co-crystal ligand into its own structure, RMSD vs crystal:
- voxelotor->Hb: NA (covalent binder, out of scope §12.7)
- mitapivat->PKR: 8.2 Å (allosteric, cofactors stripped)
- vorinostat->HDAC2 (4LXZ, zinc kept): 7.9 Å -- a CLASSICAL target that
  should have worked

The clean target also failing => systematic pipeline-quality problem,
not target choice. Cheap Vina + open-babel prep gives scores but doesn't
reproduce known geometry, so affinities aren't trustworthy. Ligand
efficiency over-corrects (ranks tiny hydroxyurea best). Fix needs
production prep (Meeko/AutoDockTools prepare_receptor + reduce) and an
in-place RMSD metric. Consistent with the project theme: the quick
version of every method runs but fails honest validation.
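
A minimal sketch of what an in-place RMSD metric could look like, assuming the docked and crystal poses are already in the same receptor frame and list the same atoms in the same order. The function name `in_place_rmsd` is hypothetical and is not taken from scripts/dock_validate.py. No superposition is applied, so a pose that has the right shape but sits in the wrong place still scores badly, and symmetry-equivalent atoms are not handled.

```python
import math


def in_place_rmsd(docked, crystal):
    """RMSD between matched atom coordinates, with no alignment step.

    Both arguments are sequences of (x, y, z) tuples in the same frame
    and the same atom order (an assumption of this sketch).
    """
    if len(docked) != len(crystal) or not docked:
        raise ValueError("coordinate lists must be non-empty and equal length")
    # Sum of squared per-atom distances.
    squared = sum(
        (a - b) ** 2
        for p, q in zip(docked, crystal)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(docked))


# Toy coordinates, not taken from any real structure.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted = [(x + 2.0, y, z) for x, y, z in crystal]

print(in_place_rmsd(crystal, crystal))  # 0.0
print(in_place_rmsd(shifted, crystal))  # 2.0
```

A rigid 2 Å translation gives an RMSD of exactly 2 Å here, which is the behaviour an aligned RMSD would hide.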

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-24 07:28:47 +02:00

Name	Last commit	Date
data	Structure-binding track: scaffold + ligand-retrieval baseline	2026-06-23 23:53:27 +02:00
docs	Redocking-RMSD validation fails 3/3: pipeline-quality issue	2026-06-24 07:28:47 +02:00
notebooks	Scaffold Reverso MVP pipeline structure	2026-06-23 20:20:09 +02:00
scripts	Redocking-RMSD validation fails 3/3: pipeline-quality issue	2026-06-24 07:28:47 +02:00
src	Structure-binding track: scaffold + ligand-retrieval baseline	2026-06-23 23:53:27 +02:00
tests	v1.1: full gene space + specificity z-score; hydroxyurea recovers	2026-06-23 22:57:30 +02:00
.gitignore	Docking baseline: toolchain solved, raw affinity is size-biased	2026-06-24 00:03:00 +02:00
PLAN.md	PLAN §12.9: leave door open for generative-guided retrieval	2026-06-23 23:43:25 +02:00
pyproject.toml	Structure-binding track: scaffold + ligand-retrieval baseline	2026-06-23 23:53:27 +02:00
README.md	Scaffold Reverso MVP pipeline structure	2026-06-23 20:20:09 +02:00

README.md

Reverso MVP — Sickle Cell Repurposing Pipeline

A minimum viable drug-repurposing pipeline for sickle cell disease: build a disease signature from public transcriptomic data, build drug profiles for ~300 small molecules, and rank them by CMap-style connectivity scoring. The pipeline is validated by a recovery test: do the two known sickle cell drugs (hydroxyurea, L-glutamine) rank near the top?

See PLAN.md for the full specification, locked decisions, and week-by-week build plan.
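
The recovery test described above can be sketched as a rank lookup. This is an illustration only: the function name `recovery_ranks`, the case-insensitive name matching, and the toy ranking are assumptions, and the actual pass criterion is whatever PLAN.md specifies.

```python
def recovery_ranks(ranked_drugs, known_drugs):
    """Return {drug: (rank, top_fraction)} for each known drug.

    ranked_drugs is ordered best-first and assumed to hold unique names.
    rank is 1-based; both values are None if the drug is not in the list.
    """
    position = {name.lower(): i + 1 for i, name in enumerate(ranked_drugs)}
    total = len(ranked_drugs)
    result = {}
    for drug in known_drugs:
        rank = position.get(drug.lower())
        result[drug] = (rank, rank / total if rank else None)
    return result


# Toy ranking with placeholder names, not real pipeline output.
ranked = ["drug_a", "hydroxyurea", "drug_b", "L-glutamine", "drug_c"]
print(recovery_ranks(ranked, ["hydroxyurea", "L-glutamine"]))
# {'hydroxyurea': (2, 0.4), 'L-glutamine': (4, 0.8)}
```

Reporting the top fraction alongside the raw rank keeps the result comparable if the number of profiled drugs changes between versions.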

Quickstart

# Requires Python >=3.11,<3.14 (see note below)
pip install -e .            # or: pip install -e ".[dev]" for test/lint tooling
pytest                      # run unit tests

Python version note: use Python 3.11–3.13 (python3.13 -m venv .venv). Python 3.14 is not yet supported by all pipeline dependencies (pydeseq2, cmapPy).

Project layout

data/         raw (downloaded, never edited) / processed / results — gitignored
notebooks/    01..05, run end-to-end in order
src/          identifiers, disease, drugs, scoring, provenance
tests/        scoring unit tests
docs/         recovery_test_report.md, data_sources.md, known_limitations.md

The deliverable

When complete, the deliverable consists of three artifacts:

docs/recovery_test_report.md — the 2-page write-up
data/results/ranked_candidates_v1.csv — the ranked drug list
The signature and drug profile files (sickle_cell_signature_v1.json, drug_profiles_v1.parquet), each with provenance

Pipeline

Notebook	Stage	Output
`01_setup_identifiers.ipynb`	Pin disease/gene IDs	`data/processed/identifiers.json`
`02_disease_signature.ipynb`	GEO + differential expression	`sickle_cell_signature_v1.json`
`03_drug_profiles.ipynb`	ChEMBL + LINCS	`drug_profiles_v1.parquet`
`04_connectivity_scoring.ipynb`	CMap scoring	`ranked_candidates_v1.csv`
`05_recovery_test.ipynb`	Validation	`docs/recovery_test_report.md`
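
To make the scoring stage concrete, here is a deliberately simplified reversal score. It is not the scoring used in `04_connectivity_scoring.ipynb`: classic CMap connectivity uses a rank-based enrichment statistic, whereas this sketch swaps in a plain difference of means. The function name, gene names, and drug profiles are all hypothetical.

```python
from statistics import mean


def reversal_score(up_genes, down_genes, drug_profile):
    """Mean drug effect on disease-up genes minus mean effect on disease-down genes.

    drug_profile maps gene -> signed expression change (e.g. a z-score).
    A negative score means the drug pushes expression against the disease
    signature. Returns None if either gene set has no overlap with the profile.
    """
    up = [drug_profile[g] for g in up_genes if g in drug_profile]
    down = [drug_profile[g] for g in down_genes if g in drug_profile]
    if not up or not down:
        return None
    return mean(up) - mean(down)


# Placeholder signature and profiles, not real data.
up_genes, down_genes = ["GENE_1", "GENE_2"], ["GENE_3", "GENE_4"]
profiles = {
    "reverser": {"GENE_1": -2.0, "GENE_2": -1.0, "GENE_3": 1.5, "GENE_4": 0.5},
    "mimic": {"GENE_1": 1.0, "GENE_2": 2.0, "GENE_3": -1.0, "GENE_4": -1.0},
    "neutral": {"GENE_1": 0.5, "GENE_2": -0.5, "GENE_3": 0.0, "GENE_4": 0.0},
}

scores = {d: reversal_score(up_genes, down_genes, p) for d, p in profiles.items()}
ranking = sorted(scores, key=scores.get)  # most negative (strongest reversal) first
print(scores)   # {'reverser': -2.5, 'mimic': 2.5, 'neutral': 0.0}
print(ranking)  # ['reverser', 'neutral', 'mimic']
```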

Every persisted artifact carries a confidence tier (A/B/C) and provenance. See PLAN.md §3.
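
One way to attach a tier and provenance to an artifact is a small wrapper like the one below. This is a sketch under assumptions: the field names (`source`, `retrieved`, `tier`), the helper `wrap_artifact`, and the example values are hypothetical, and the meaning of tiers A/B/C is defined in PLAN.md §3, not here.

```python
import json
from dataclasses import asdict, dataclass

TIERS = ("A", "B", "C")


@dataclass(frozen=True)
class Provenance:
    source: str     # where the data came from (hypothetical field)
    retrieved: str  # ISO date of retrieval (hypothetical field)
    tier: str       # confidence tier, one of A/B/C

    def __post_init__(self):
        # Reject anything outside the three documented tiers.
        if self.tier not in TIERS:
            raise ValueError(f"tier must be one of {TIERS}, got {self.tier!r}")


def wrap_artifact(payload, provenance):
    """Bundle a JSON-serialisable payload with its provenance record."""
    return {"provenance": asdict(provenance), "payload": payload}


# Placeholder values for illustration only.
record = wrap_artifact(
    {"up_genes": ["GENE_1"], "down_genes": ["GENE_2"]},
    Provenance(source="example-source", retrieved="2026-06-23", tier="B"),
)
print(json.dumps(record, indent=2))
```

Validating the tier at construction time means an artifact cannot be written with a missing or misspelled tier.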
