Scaffold Reverso MVP pipeline structure

Set up the project skeleton per PLAN.md §4:

- src/ package: identifiers, disease, drugs, scoring, provenance with pydantic schemas and confidence-tier logic (working); data-pull/compute functions stubbed per their build week
- 5 starter notebooks (01-05) with PLAN-referenced steps
- tests/test_scoring.py: tier-assignment tests pass; scoring reference test xfail until Week 3
- docs/: recovery_test_report, data_sources, known_limitations skeletons
- pyproject.toml (requires-python >=3.11,<3.14), .gitignore, README
- data/ tree preserved via .gitkeep; raw/processed/results gitignored

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 20:19:38 +02:00
parent e717cf40ed
commit b731478f5d
25 changed files with 1038 additions and 4 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,33 @@
+# Data — never commit raw or processed data; keep directory structure via .gitkeep.
+# Re-include directories first (!data/**/), else .gitkeep inside an excluded dir
+# cannot be un-ignored.
+data/raw/**
+data/processed/**
+data/results/**
+!data/**/
+!data/**/.gitkeep
+
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+.eggs/
+build/
+dist/
+.venv/
+venv/
+env/
+
+# Jupyter
+.ipynb_checkpoints/
+*/.ipynb_checkpoints/
+
+# Tooling caches
+.pytest_cache/
+.ruff_cache/
+.mypy_cache/
+
+# OS / editor
+.DS_Store
+.idea/
+.vscode/
--- a/PLAN.md
+++ b/PLAN.md
@@ -1,4 +1,4 @@
-# QPharma MVP — Sickle Cell Repurposing Pipeline
+# Reverso MVP — Sickle Cell Repurposing Pipeline

 > **For Claude Code:** This is the project specification. Read this entire document before suggesting actions or writing code. The decisions in section "Locked decisions" have already been made by the founder after extensive expert consultation; do not re-litigate them. Where the plan calls for a choice, propose options but default to the spec.

@@ -121,7 +121,7 @@ This is the most commercially important design decision in the whole pipeline. S
 ## 4. Directory structure

 ```
-qpharma-mvp/
+reverso-mvp/
 ├── PLAN.md                          # This file
 ├── README.md                        # Short project description
 ├── pyproject.toml                   # Dependencies (or requirements.txt)
@@ -375,7 +375,7 @@ These are real risks documented during planning. They are not paranoia.

 1. **Cell-composition confound in sickle cell expression data.** Whole-blood differential expression in sickle cell partly reflects different blood cell ratios, not disease biology. v1 acknowledges this; v2 should deconvolve.

-2. **LINCS L1000 cell-line limitations.** The 978 landmark genes were measured mostly in cancer cell lines (MCF7, A375, PC3, etc.). Signatures for non-oncology diseases may be noisy. This is a known field-wide limitation, not unique to QPharma.
+2. **LINCS L1000 cell-line limitations.** The 978 landmark genes were measured mostly in cancer cell lines (MCF7, A375, PC3, etc.). Signatures for non-oncology diseases may be noisy. This is a known field-wide limitation, not unique to Reverso.

 3. **L-glutamine probably has no LINCS signature.** Amino acids and metabolites weren't LINCS priorities. If true, the ground-truth test only has hydroxyurea, which is weaker. Document honestly.

--- a/README.md
+++ b/README.md
@@ -1 +1,48 @@
-# Reverso
+# Reverso MVP — Sickle Cell Repurposing Pipeline
+
+A minimum viable drug repurposing pipeline for **sickle cell disease**: build a disease
+signature from public transcriptomic data, build drug profiles for ~300 small molecules,
+and rank them by CMap-style connectivity scoring. Validated by a recovery test — do the two
+known sickle cell drugs (hydroxyurea, L-glutamine) rank near the top?
+
+See [`PLAN.md`](PLAN.md) for the full specification, locked decisions, and week-by-week build plan.
+
+## Quickstart
+
+```bash
+# Requires Python >=3.11,<3.14 (see note below)
+pip install -e .            # or: pip install -e ".[dev]" for test/lint tooling
+pytest                      # run unit tests
+```
+
+> **Python version note:** use Python 3.11–3.13 (`python3.13 -m venv .venv`). Python 3.14 is
+> not yet supported by all pipeline dependencies (`pydeseq2`, `cmapPy`).
+
+## Project layout
+
+```
+data/         raw (downloaded, never edited) / processed / results — gitignored
+notebooks/    01..05, run end-to-end in order
+src/          identifiers, disease, drugs, scoring, provenance
+tests/        scoring unit tests
+docs/         recovery_test_report.md, data_sources.md, known_limitations.md
+```
+
+## The deliverable
+
+When complete, the artifact to share is three files:
+1. `docs/recovery_test_report.md` — the 2-page write-up
+2. `data/results/ranked_candidates_v1.csv` — the ranked drug list
+3. The signature + drug profile files with provenance
+
+## Pipeline
+
+| Notebook | Stage | Output |
+|---|---|---|
+| `01_setup_identifiers.ipynb` | Pin disease/gene IDs | `data/processed/identifiers.json` |
+| `02_disease_signature.ipynb` | GEO + differential expression | `sickle_cell_signature_v1.json` |
+| `03_drug_profiles.ipynb` | ChEMBL + LINCS | `drug_profiles_v1.parquet` |
+| `04_connectivity_scoring.ipynb` | CMap scoring | `ranked_candidates_v1.csv` |
+| `05_recovery_test.ipynb` | Validation | `docs/recovery_test_report.md` |
+
+Every persisted artifact carries a **confidence tier** (A/B/C) and provenance. See `PLAN.md` §3.
--- a/data/processed/.gitkeep
+++ b/data/processed/.gitkeep
--- a/data/raw/chembl/.gitkeep
+++ b/data/raw/chembl/.gitkeep
--- a/data/raw/geo/.gitkeep
+++ b/data/raw/geo/.gitkeep
--- a/data/raw/lincs/.gitkeep
+++ b/data/raw/lincs/.gitkeep
--- a/data/raw/open_targets/.gitkeep
+++ b/data/raw/open_targets/.gitkeep
--- a/data/results/.gitkeep
+++ b/data/results/.gitkeep
--- a/docs/data_sources.md
+++ b/docs/data_sources.md
@@ -0,0 +1,28 @@
+# Data Sources
+
+> Fill in version + download date for every source actually used. This file is the artifact
+> that proves reproducibility (PLAN.md §6, Week 4 task 4). Record date and version for **all**
+> downloads.
+
+| Source | URL | Access | License | Use in MVP | Version | Download date |
+|---|---|---|---|---|---|---|
+| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph | TBD | TBD |
+| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID | TBD | TBD |
+| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata | TBD | TBD |
+| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics | TBD | TBD |
+| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data | TBD | TBD |
+| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets | TBD | TBD |
+| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures | TBD | TBD |
+| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history | TBD | TBD |
+| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels | TBD | TBD |
+| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) | TBD | TBD |
+
+## Chosen GEO dataset
+
+_Document the chosen study fully: accession, platform, n per group, publication, why it was
+selected over the alternatives (GSE53441, GSE35007, …)._
+
+## Licensing note for LINCS
+
+Read the LINCS data use terms before commercial use. For the MVP (research / proof-of-concept)
+the terms are permissive. For productization this needs legal review.
--- a/docs/known_limitations.md
+++ b/docs/known_limitations.md
@@ -0,0 +1,39 @@
+# Known Limitations
+
+The honest list of what would break this MVP at scale or in a different disease. Useful for the
+next pharma conversation: "yes, we know these are limitations, here's how v2 addresses them."
+Source: PLAN.md §9.
+
+1. **Cell-composition confound in sickle cell expression data.** Whole-blood differential
+   expression partly reflects different blood cell ratios, not disease biology. v1 acknowledges
+   this; v2 should deconvolve cell types.
+
+2. **LINCS L1000 cell-line limitations.** The 978 landmark genes were measured mostly in cancer
+   cell lines (MCF7, A375, PC3, …). Signatures for non-oncology diseases may be noisy. A
+   field-wide limitation, not unique to Reverso.
+
+3. **L-glutamine probably has no LINCS signature.** Amino acids and metabolites weren't LINCS
+   priorities. If true, the ground-truth test effectively rests on hydroxyurea alone, which is
+   weaker. _Status: TBD — record the actual finding here once LINCS is pulled (Week 2)._
+
+4. **Connectivity scoring surfaces broad-effect drugs as false positives.** HDAC inhibitors and
+   broad kinase inhibitors often top connectivity rankings simply because they perturb many
+   genes. The mechanistic prior (Week 3) helps filter, but does not eliminate this.
+
+5. **Hydroxyurea will probably pass the recovery test by construction.** Sickle cell +
+   hydroxyurea is a well-studied pair. Passing is necessary but not sufficient to claim the
+   platform generalizes. The next disease is the real test — do not sell sickle cell results as
+   proving the platform.
+
+6. **No mechanistic validation layer.** Pure ML matching is not sufficient for extrapolation
+   (flagged by multiple experts). The MVP knowingly omits the mechanistic layer; it is a phase-2
+   addition. Position the MVP as "discovery hypothesis generation," not "validated prediction."
+
+7. **Top-ranked novel candidates are not wet-lab validated.** They are computational hypotheses
+   to test, not discoveries. Use careful language in any write-up.
+
+## Drug-specific gaps (fill in during Week 2–3)
+
+| Drug | Issue | Handling |
+|---|---|---|
+| TBD | e.g. no LINCS signature | flagged "not scored, no signature available" |
--- a/docs/recovery_test_report.md
+++ b/docs/recovery_test_report.md
@@ -0,0 +1,68 @@
+# Sickle Cell Repurposing — Recovery Test Report
+
+> **Status: DRAFT SCAFFOLD — not yet run.** Filled in during Week 4 from
+> `notebooks/05_recovery_test.ipynb`. Target length: ~2 pages, readable by a sceptical
+> pharma scientist in 5 minutes.
+
+## Pre-registered success criteria
+
+> ⚠️ **Commit this section to git _before_ running the recovery test** (PLAN.md §8, §10).
+
+The MVP passes if:
+
+- Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND**
+- L-glutamine ranks in the **top 25%** (top 75) **OR** is documented as unscorable due to a
+  missing LINCS signature, **AND**
+- At least **4 of 5** negative-control drugs rank in the **bottom half**.
+
+_Pre-registered on: TBD (date of commit)_
+
+---
+
+## Section 1 — Methodology
+
+_5–6 sentences: what was built, the GEO dataset used, the drug-set composition, and the
+scoring method (CMap connectivity, Lamb 2006 / Subramanian 2017)._
+
+## Section 2 — Recovery test result
+
+| Drug | Rank | Percentile | Pass? |
+|---|---|---|---|
+| Hydroxyurea | TBD | TBD | TBD |
+| L-glutamine | TBD | TBD | TBD |
+
+Negative controls (expected: bottom half):
+
+| Control drug | Rank | Bottom half? |
+|---|---|---|
+| TBD | TBD | TBD |
+
+**Overall: PASS / FAIL against pre-registered criteria — TBD**
+
+## Section 3 — Top 10 candidates
+
+| Rank | Drug | Score | Known mechanism | Biological plausibility |
+|---|---|---|---|---|
+| 1 | TBD | TBD | TBD | TBD |
+
+_Note: HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due
+to widespread expression effects — flag these honestly (PLAN.md §9.4)._
+
+## Section 4 — One non-obvious candidate worth investigating
+
+_A single paragraph on the most interesting result. Language must be careful: this is a
+computational hypothesis to test, not a discovery (PLAN.md §9.7)._
+
+## Section 5 — Honest limitations
+
+- Cell-composition confound in whole-blood expression (PLAN.md §9.1)
+- LINCS L1000 cell-line limitations — landmark genes measured mostly in cancer lines (§9.2)
+- Missing signatures (e.g. L-glutamine) (§9.3)
+- No mechanistic validation layer — discovery hypothesis generation, not validated prediction (§9.6)
+
+## Section 6 — What v2 would fix
+
+- Cell-type deconvolution of the disease signature
+- Knowledge graph fallback for missing-signature drugs
+- A second disease to test generalization (the real test — sickle cell results do not prove
+  the platform generalizes, §9.5)
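The pre-registered criteria in `docs/recovery_test_report.md` can be written as a small pure function, which makes the pass/fail decision mechanical once ranks exist. This is a minimal sketch, not the project's implementation: the function name `evaluate_recovery`, the 1-based rank convention, the use of `None` for "documented as unscorable", and the reading of "bottom half" as `rank > n_drugs / 2` are all assumptions made for illustration. The ranks in the usage lines are invented.

```python
from __future__ import annotations


def evaluate_recovery(
    hydroxyurea_rank: int,
    glutamine_rank: int | None,
    control_ranks: list[int],
    n_drugs: int = 300,
    min_controls_in_bottom_half: int = 4,
) -> bool:
    """Return True if all three pre-registered criteria hold.

    Ranks are assumed 1-based (1 = strongest predicted reversal).
    glutamine_rank=None stands for "unscorable, no LINCS signature".
    """
    # Criterion 1: hydroxyurea in the top 10% (top 30 of 300).
    hydroxyurea_ok = hydroxyurea_rank <= 0.10 * n_drugs
    # Criterion 2: L-glutamine in the top 25%, or documented as unscorable.
    glutamine_ok = glutamine_rank is None or glutamine_rank <= 0.25 * n_drugs
    # Criterion 3: enough negative controls land in the bottom half.
    n_bottom_half = sum(rank > n_drugs / 2 for rank in control_ranks)
    controls_ok = n_bottom_half >= min_controls_in_bottom_half
    return hydroxyurea_ok and glutamine_ok and controls_ok


# Invented ranks, for illustration only.
passing = evaluate_recovery(12, None, [160, 201, 155, 299, 40])
failing = evaluate_recovery(45, 60, [160, 201, 155, 299, 280])
print(passing, failing)
```

Keeping the thresholds as parameters with the pre-registered values as defaults means the committed criteria and the evaluation code can be compared line by line.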
--- a/notebooks/01_setup_identifiers.ipynb
+++ b/notebooks/01_setup_identifiers.ipynb
@@ -0,0 +1,57 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 01 \u2014 Setup identifiers\n",
+    "\n",
+    "Week 1, task 1 (PLAN.md \u00a76). Pin the disease/gene/ground-truth identifiers and persist them to `data/processed/identifiers.json`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "sys.path.insert(0, '..')  # import the src package from notebooks/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from src.identifiers import build_identifier_set, persist_identifiers\n",
+    "\n",
+    "ids = build_identifier_set()\n",
+    "ids.model_dump()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "path = persist_identifiers()\n",
+    "print(f'wrote {path}')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/notebooks/02_disease_signature.ipynb
+++ b/notebooks/02_disease_signature.ipynb
@@ -0,0 +1,50 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 02 \u2014 Disease signature\n",
+    "\n",
+    "Week 1 (PLAN.md \u00a76). Pull Open Targets + a GEO expression study, run differential expression, and build `sickle_cell_signature_v1.json` (Tier A) with full provenance.\n\nSteps: (1) Open Targets associations, (2) choose + download GEO dataset, (3) differential expression, (4) build + persist signature.\n\n**Pitfall to document:** whole-blood expression is partly driven by cell-composition differences, not disease state (PLAN.md \u00a79.1)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "sys.path.insert(0, '..')  # import the src package from notebooks/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from src import disease\n",
+    "from src.provenance import ConfidenceTier\n",
+    "\n",
+    "# Step 1: Open Targets associations for MONDO:0011382  -> data/raw/open_targets/\n",
+    "# Step 2: choose + download GEO study (GSE53441 / GSE35007 / newer) -> data/raw/geo/\n",
+    "# Step 3: disease.compute_differential_expression(...)\n",
+    "# Step 4: disease.build_signature(...) then disease.persist_signature(...)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/notebooks/03_drug_profiles.ipynb
+++ b/notebooks/03_drug_profiles.ipynb
@@ -0,0 +1,50 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 03 \u2014 Drug profiles\n",
+    "\n",
+    "Week 2 (PLAN.md \u00a76). Curate the ~300-drug set, pull ChEMBL + LINCS L1000 data, and assemble `drug_profiles_v1.parquet`.\n\nDrug set: 2 ground-truth + ~50 related-mechanism + ~50 negative controls + ~200 random (fixed seed). Document any missing LINCS signatures in `docs/known_limitations.md`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "sys.path.insert(0, '..')  # import the src package from notebooks/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from src import drugs\n",
+    "from src import RANDOM_SEED\n",
+    "\n",
+    "# Step 1: drugs.curate_drug_set(seed=RANDOM_SEED) -> data/processed/drug_set_v1.csv\n",
+    "# Step 2: drugs.fetch_chembl_profile(...) for each drug -> data/raw/chembl/\n",
+    "# Step 3: drugs.fetch_lincs_signature(...) -> data/raw/lincs/\n",
+    "# Step 4: drugs.persist_drug_profiles(...) -> drug_profiles_v1.parquet"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/notebooks/04_connectivity_scoring.ipynb
+++ b/notebooks/04_connectivity_scoring.ipynb
@@ -0,0 +1,48 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 04 \u2014 Connectivity scoring\n",
+    "\n",
+    "Week 3 (PLAN.md \u00a76). CMap-style connectivity scoring of every drug against the sickle cell signature. Strongly negative connectivity = strong reversal = candidate.\n\nOutputs `data/results/ranked_candidates_v1.csv`. Also build the secondary mechanistically-weighted ranking. Document the gene-overlap count; mark signature-less drugs as 'not scored' rather than dropping them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "sys.path.insert(0, '..')  # import the src package from notebooks/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from src import scoring\n",
+    "\n",
+    "# Load signature + drug profiles, then:\n",
+    "# ranking = scoring.rank_drugs(up, down, drug_profiles)\n",
+    "# scoring.persist_ranking(ranking) -> data/results/ranked_candidates_v1.csv"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/notebooks/05_recovery_test.ipynb
+++ b/notebooks/05_recovery_test.ipynb
@@ -0,0 +1,48 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 05 \u2014 Recovery test\n",
+    "\n",
+    "Week 4 (PLAN.md \u00a76). **Commit the pre-registered success criteria to git BEFORE running this notebook** (PLAN.md \u00a78, \u00a710).\n\nPull hydroxyurea + L-glutamine ranks and 5 negative-control ranks, compute pass/fail, examine the top 10, and fill in `docs/recovery_test_report.md`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "sys.path.insert(0, '..')  # import the src package from notebooks/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "from src import RESULTS_DIR\n",
+    "\n",
+    "ranking = pd.read_csv(RESULTS_DIR / 'ranked_candidates_v1.csv')\n",
+    "# Pull ground-truth + negative-control ranks; evaluate pre-registered criteria."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,46 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "reverso-mvp"
+version = "0.1.0"
+description = "Sickle cell drug repurposing MVP — disease signature + drug profile matching via CMap-style connectivity scoring"
+readme = "README.md"
+requires-python = ">=3.11,<3.14"
+license = { text = "Proprietary" }
+authors = [{ name = "Reverso" }]
+
+dependencies = [
+    "pandas>=2.0",
+    "numpy>=1.24",
+    "scipy>=1.11",
+    "requests>=2.31",
+    "chembl_webresource_client>=0.10",   # ChEMBL API client
+    "GEOparse>=2.0",                      # GEO dataset access
+    "pydeseq2>=0.4",                      # Differential expression in Python
+    "cmapPy>=4.0",                        # Reference CMap connectivity implementation
+    "pyarrow>=14.0",                      # Parquet I/O
+    "jupyter>=1.0",
+    "matplotlib>=3.7",                    # Sanity-check plots
+    "seaborn>=0.13",
+    "pydantic>=2.0",                      # Schema validation for signatures/profiles
+    "mygene>=3.2",                        # Gene symbol -> Entrez/Ensembl mapping
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0",
+    "ruff>=0.5",
+]
+
+[tool.setuptools.packages.find]
+where = ["."]
+include = ["src*"]
+
+[tool.ruff]
+line-length = 100
+target-version = "py311"
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
--- a/src/__init__.py
+++ b/src/__init__.py
@@ -0,0 +1,36 @@
+"""Reverso MVP — sickle cell drug repurposing pipeline.
+
+A disease-signature + drug-profile matching pipeline using CMap-style connectivity
+scoring. See PLAN.md for the full specification.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+__version__ = "0.1.0"
+PIPELINE_VERSION = "v1"
+
+# Single source of truth for reproducibility (PLAN.md §8).
+# All randomness in the pipeline must derive from this seed.
+RANDOM_SEED = 42
+
+# Canonical project paths, resolved relative to the repo root.
+REPO_ROOT = Path(__file__).resolve().parent.parent
+DATA_DIR = REPO_ROOT / "data"
+RAW_DIR = DATA_DIR / "raw"
+PROCESSED_DIR = DATA_DIR / "processed"
+RESULTS_DIR = DATA_DIR / "results"
+DOCS_DIR = REPO_ROOT / "docs"
+
+__all__ = [
+    "__version__",
+    "PIPELINE_VERSION",
+    "RANDOM_SEED",
+    "REPO_ROOT",
+    "DATA_DIR",
+    "RAW_DIR",
+    "PROCESSED_DIR",
+    "RESULTS_DIR",
+    "DOCS_DIR",
+]
--- a/src/disease.py
+++ b/src/disease.py
@@ -0,0 +1,106 @@
+"""Disease signature construction.
+
+Week 1 (PLAN.md §6). Builds a Tier-A sickle cell signature from GEO expression data via
+differential expression, then persists it with full provenance to
+``data/processed/sickle_cell_signature_v1.json``.
+
+This module defines the persisted schema (pydantic) and the construction stubs. The actual
+data pull + differential expression is driven from ``notebooks/02_disease_signature.ipynb``.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pandas as pd
+from pydantic import BaseModel, Field
+
+from . import PIPELINE_VERSION, PROCESSED_DIR
+from .provenance import ConfidenceTier
+
+# Number of genes to take per direction (PLAN.md §6, Week 1 task 5).
+TOP_N_PER_DIRECTION = 250
+QVALUE_CUTOFF = 0.05
+
+
+class GeneEntry(BaseModel):
+    """A single differentially expressed gene in the signature."""
+
+    gene: str = Field(..., description="HGNC gene symbol, e.g. 'HBG2'.")
+    entrez_id: str | None = None
+    ensembl_id: str | None = None
+    log_fc: float
+    qvalue: float
+
+
+class SignatureProvenance(BaseModel):
+    """Provenance block for a disease signature (PLAN.md §6 schema)."""
+
+    geo_accession: str
+    n_disease: int
+    n_healthy: int
+    platform: str
+    method: str = Field(..., description="Differential expression method, e.g. 'limma', 'deseq2'.")
+    created_date: str
+
+
+class DiseaseSignature(BaseModel):
+    """The persisted sickle cell disease signature (PLAN.md §6 schema)."""
+
+    signature_id: str = "sickle_cell_v1"
+    disease_mondo_id: str = "MONDO:0011382"
+    pipeline_version: str = PIPELINE_VERSION
+    up_regulated: list[GeneEntry]
+    down_regulated: list[GeneEntry]
+    provenance: SignatureProvenance
+    confidence_tier: ConfidenceTier
+    tier_rationale: str
+    limitations: list[str]
+
+
+def compute_differential_expression(
+    expression: pd.DataFrame,
+    sample_groups: pd.Series,
+    *,
+    method: str,
+) -> pd.DataFrame:
+    """Compute gene-level log fold change and adjusted p-values.
+
+    For RNA-seq, use ``pydeseq2``. For microarray data, log2-transform and normalize, then use
+    a limma equivalent (PLAN.md §6, Week 1 task 4).
+
+    Args:
+        expression: Genes (rows) x samples (columns) expression matrix.
+        sample_groups: Per-sample group label ('disease' / 'healthy'), indexed by sample.
+        method: 'deseq2' (RNA-seq) or 'limma' (microarray).
+
+    Returns:
+        A table indexed by gene with at least ``log_fc`` and ``qvalue`` columns.
+    """
+    raise NotImplementedError("Differential expression: implement in Week 1 (notebook 02).")
+
+
+def build_signature(
+    de_table: pd.DataFrame,
+    provenance: SignatureProvenance,
+    *,
+    tier: ConfidenceTier,
+    tier_rationale: str,
+    limitations: list[str],
+    top_n: int = TOP_N_PER_DIRECTION,
+    qvalue_cutoff: float = QVALUE_CUTOFF,
+) -> DiseaseSignature:
+    """Assemble a ``DiseaseSignature`` from a differential expression table.
+
+    Takes the top ``top_n`` up- and down-regulated genes (by qvalue, cut at
+    ``qvalue_cutoff``) per PLAN.md §6, Week 1 task 5.
+    """
+    raise NotImplementedError("Signature assembly: implement in Week 1 (notebook 02).")
+
+
+def persist_signature(signature: DiseaseSignature, out_path: Path | None = None) -> Path:
+    """Write a signature to ``data/processed/sickle_cell_signature_v1.json``."""
+    out_path = out_path or (PROCESSED_DIR / "sickle_cell_signature_v1.json")
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    out_path.write_text(signature.model_dump_json(indent=2))
+    return out_path
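The selection rule described in the `build_signature` docstring (top `top_n` genes per direction, ordered by q-value, cut at `qvalue_cutoff`) could look roughly like the sketch below. It uses plain dicts instead of a `pandas` table to stay dependency-free; the helper name `select_signature_genes`, the strict `<` comparison at the cutoff, and the placeholder gene names and values are assumptions for illustration, not the stubbed function's eventual behavior.

```python
def select_signature_genes(de_rows, top_n=250, qvalue_cutoff=0.05):
    """Split significant genes by direction; keep the top_n per direction by q-value."""
    # Drop genes that do not pass the significance cutoff.
    significant = [row for row in de_rows if row["qvalue"] < qvalue_cutoff]
    # Positive log fold change = up in disease; negative = down in disease.
    up = sorted((r for r in significant if r["log_fc"] > 0), key=lambda r: r["qvalue"])
    down = sorted((r for r in significant if r["log_fc"] < 0), key=lambda r: r["qvalue"])
    return up[:top_n], down[:top_n]


# Placeholder rows; gene names and numbers are invented.
rows = [
    {"gene": "GENE_A", "log_fc": 2.1, "qvalue": 0.001},
    {"gene": "GENE_B", "log_fc": -1.4, "qvalue": 0.01},
    {"gene": "GENE_C", "log_fc": 0.9, "qvalue": 0.20},  # fails the cutoff
    {"gene": "GENE_D", "log_fc": 1.2, "qvalue": 0.0005},
]
up, down = select_signature_genes(rows, top_n=1)
print([r["gene"] for r in up], [r["gene"] for r in down])
```

With `top_n=1` only the most significant gene per direction survives, so `GENE_D` is kept over `GENE_A` and `GENE_C` is excluded regardless of its fold change.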
--- a/src/drugs.py
+++ b/src/drugs.py
@@ -0,0 +1,85 @@
+"""Drug profile construction.
+
+Week 2 (PLAN.md §6). Curates the ~300-drug set, pulls ChEMBL structure/target data and LINCS
+L1000 signatures, and assembles ``data/processed/drug_profiles_v1.parquet``.
+
+The drug set is deliberately composed (PLAN.md §6, Week 2 task 1):
+    - ground truth (n=2): hydroxyurea, L-glutamine
+    - related-mechanism (n~50)
+    - negative controls (n~50)
+    - general random sample (n~200), fixed seed
+"""
+
+from __future__ import annotations
+
+from enum import Enum
+from pathlib import Path
+
+import pandas as pd
+from pydantic import BaseModel, Field
+
+from . import PROCESSED_DIR, RANDOM_SEED
+from .provenance import ConfidenceTier, Provenance
+
+# LINCS L1000 landmark gene count (PLAN.md §6, Week 2 task 3).
+LINCS_LANDMARK_GENES = 978
+
+
+class InclusionReason(str, Enum):
+    """Why a drug is in the curated set (PLAN.md §6, Week 2 task 1)."""
+
+    GROUND_TRUTH = "ground_truth"
+    RELATED_MECHANISM = "related_mechanism"
+    NEGATIVE_CONTROL = "negative_control"
+    GENERAL_SAMPLE = "general_sample"
+
+
+class DrugProfile(BaseModel):
+    """A single drug profile row (PLAN.md §6, Week 2 task 4)."""
+
+    chembl_id: str
+    name: str
+    inchikey: str | None = None
+    smiles: str | None = None
+    targets: list[str] = Field(default_factory=list)
+    mechanism_of_action: str | None = None
+    # 978-length LINCS landmark z-score vector, or None if no signature is available.
+    lincs_signature: list[float] | None = None
+    inclusion_reason: InclusionReason
+    provenance: list[Provenance] = Field(default_factory=list)
+    confidence_tier: ConfidenceTier
+
+
+def curate_drug_set(seed: int = RANDOM_SEED) -> pd.DataFrame:
+    """Build the deliberately-composed ~300-drug set.
+
+    Returns a table with at least ``chembl_id``, ``name`` and ``inclusion_reason`` columns,
+    written by the notebook to ``data/processed/drug_set_v1.csv``. Random sampling uses
+    ``seed`` for reproducibility (PLAN.md §8).
+    """
+    raise NotImplementedError("Drug-set curation: implement in Week 2 (notebook 03).")
+
+
+def fetch_chembl_profile(chembl_id: str) -> dict:
+    """Fetch structure, targets and mechanism for one drug from ChEMBL.
+
+    Uses ``chembl_webresource_client`` (PLAN.md §6, Week 2 task 2).
+    """
+    raise NotImplementedError("ChEMBL fetch: implement in Week 2 (notebook 03).")
+
+
+def fetch_lincs_signature(chembl_id: str) -> list[float] | None:
+    """Fetch the LINCS L1000 Level-5 consensus (MODZ) signature for a drug.
+
+    Returns a 978-length z-score vector, or ``None`` if no signature is available (e.g.
+    L-glutamine — document such gaps in docs/known_limitations.md). PLAN.md §6, Week 2 task 3.
+    """
+    raise NotImplementedError("LINCS fetch: implement in Week 2 (notebook 03).")
+
+
+def persist_drug_profiles(profiles: pd.DataFrame, out_path: Path | None = None) -> Path:
+    """Write the assembled drug profiles to ``data/processed/drug_profiles_v1.parquet``."""
+    out_path = out_path or (PROCESSED_DIR / "drug_profiles_v1.parquet")
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    profiles.to_parquet(out_path, index=False)
+    return out_path
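For the "general random sample, fixed seed" part of the drug set, one way to make the draw reproducible is a locally seeded generator over a sorted candidate pool. This is a sketch under stated assumptions: `sample_general_drugs` is a hypothetical helper, the pool IDs are placeholders rather than real ChEMBL IDs, and sorting before sampling is a design choice added here so that the result does not depend on the order in which candidates were fetched.

```python
import random

RANDOM_SEED = 42  # same value as the seed pinned in the src package


def sample_general_drugs(candidate_ids, n, seed=RANDOM_SEED):
    """Draw n IDs reproducibly from an iterable of candidate IDs."""
    # A local Random instance leaves the global RNG state untouched.
    rng = random.Random(seed)
    # Sorting first makes the draw independent of input order.
    return rng.sample(sorted(candidate_ids), n)


pool = [f"DRUG_{i:04d}" for i in range(1000)]  # placeholder IDs
first = sample_general_drugs(pool, 200)
second = sample_general_drugs(reversed(pool), 200)
print(first == second, len(set(first)))
```

Passing `seed` explicitly, as `curate_drug_set(seed=RANDOM_SEED)` does, keeps the dependency on the pinned seed visible at the call site.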
--- a/src/identifiers.py
+++ b/src/identifiers.py
@@ -0,0 +1,71 @@
+"""Canonical identifier resolution and the pinned identifiers for the MVP.
+
+Week 1, task 1 (PLAN.md §6). The disease and causal gene identifiers are pinned constants
+so the whole pipeline resolves to the same canonical IDs. ``persist_identifiers`` writes them
+to ``data/processed/identifiers.json``.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from pydantic import BaseModel
+
+from . import PROCESSED_DIR
+
+# --- Pinned identifiers (PLAN.md §6, Week 1 task 1) -------------------------------------
+
+SICKLE_CELL_IDS: dict[str, str] = {
+    "mondo": "MONDO:0011382",
+    "orphanet": "Orphanet:232",
+    "omim": "OMIM:603903",
+}
+
+HBB_GENE_IDS: dict[str, str] = {
+    "symbol": "HBB",
+    "ensembl": "ENSG00000244734",
+    "hgnc": "HGNC:4827",
+}
+
+# Ground-truth drugs for the recovery test (PLAN.md §6, Week 2 task 1).
+GROUND_TRUTH_DRUGS: dict[str, str] = {
+    "hydroxyurea": "CHEMBL467",
+    "l-glutamine": "CHEMBL930",
+}
+
+
+class IdentifierSet(BaseModel):
+    """The pinned identifier set persisted at the start of the pipeline."""
+
+    disease: dict[str, str]
+    causal_gene: dict[str, str]
+    ground_truth_drugs: dict[str, str]
+
+
+def build_identifier_set() -> IdentifierSet:
+    """Return the pinned identifier set for the MVP."""
+    return IdentifierSet(
+        disease=SICKLE_CELL_IDS,
+        causal_gene=HBB_GENE_IDS,
+        ground_truth_drugs=GROUND_TRUTH_DRUGS,
+    )
+
+
+def persist_identifiers(out_path: Path | None = None) -> Path:
+    """Write the pinned identifier set to ``data/processed/identifiers.json``.
+
+    Returns the path written.
+    """
+    out_path = out_path or (PROCESSED_DIR / "identifiers.json")
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    out_path.write_text(build_identifier_set().model_dump_json(indent=2))
+    return out_path
+
+
+def resolve_drug_to_chembl(name_or_alias: str) -> str:
+    """Resolve a drug name/alias to a canonical ChEMBL ID.
+
+    Uses ``chembl_webresource_client``. Implemented in Week 2 (PLAN.md §6, task 2).
+    """
+    raise NotImplementedError("Drug -> ChEMBL resolution: implement in Week 2 (notebook 03).")
--- a/src/provenance.py
+++ b/src/provenance.py
@@ -0,0 +1,72 @@
+"""Provenance and confidence-tier tracking.
+
+The confidence tier is the most commercially important design decision in the pipeline
+(PLAN.md §3). *Every* persisted artifact — signatures and drug profiles alike — must carry
+a tier and the provenance needed to justify it.
+
+    Tier A — measured data, peer-reviewed source, n>10 per group, recent
+    Tier B — measured but small-n, older, or single-source
+    Tier C — inferred / extrapolated / hypothesis-only
+"""
+
+from __future__ import annotations
+
+from datetime import date
+from enum import Enum
+
+from pydantic import BaseModel, Field
+
+
+class ConfidenceTier(str, Enum):
+    """Confidence tier for a persisted artifact. See module docstring."""
+
+    A = "A"
+    B = "B"
+    C = "C"
+
+
+class Provenance(BaseModel):
+    """Where a record came from and when. Attached to every persisted artifact."""
+
+    source: str = Field(..., description="Human-readable source name, e.g. 'GEO', 'ChEMBL'.")
+    source_id: str | None = Field(
+        None, description="Accession / identifier within the source, e.g. 'GSE53441'."
+    )
+    source_url: str | None = None
+    source_version: str | None = Field(
+        None, description="Dataset/release version where the source is versioned."
+    )
+    download_date: date | None = Field(
+        None, description="Date the underlying data was downloaded (reproducibility)."
+    )
+    license: str | None = None
+    notes: str | None = None
+
+
+def assign_tier(
+    *,
+    is_measured: bool,
+    n_per_group: int | None,
+    peer_reviewed: bool,
+    single_source: bool,
+) -> ConfidenceTier:
+    """Assign a confidence tier from the evidence characteristics.
+
+    This encodes the tier rules from PLAN.md §3 so tier assignment is consistent and
+    auditable rather than ad-hoc per notebook.
+
+    Args:
+        is_measured: True if the value is directly measured (vs inferred/extrapolated).
+        n_per_group: Sample size per group, if applicable (None when not meaningful).
+        peer_reviewed: Whether the source is peer-reviewed.
+        single_source: Whether the evidence rests on a single source.
+
+    Returns:
+        The assigned ``ConfidenceTier``.
+    """
+    if not is_measured:
+        return ConfidenceTier.C
+    if peer_reviewed and (n_per_group is not None and n_per_group > 10) and not single_source:
+        return ConfidenceTier.A
+    # Measured but small-n, not peer-reviewed, or single-source: Tier B. Recency is not checked.
+    return ConfidenceTier.B
--- a/src/scoring.py
+++ b/src/scoring.py
@@ -0,0 +1,85 @@
+"""CMap-style connectivity scoring — the matching engine.
+
+Week 3 (PLAN.md §6). Scores each drug's LINCS signature against the disease signature using
+weighted Kolmogorov-Smirnov enrichment (Lamb 2006 / Subramanian 2017). Strongly *negative*
+connectivity = strong reversal of the disease signature = candidate match.
+
+Uses ``cmapPy`` as the reference implementation. ``tests/test_scoring.py`` verifies the
+implementation against a known reference.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pandas as pd
+from pydantic import BaseModel
+
+from . import RESULTS_DIR
+
+
+class ConnectivityResult(BaseModel):
+    """Connectivity score for a single drug against the disease signature."""
+
+    chembl_id: str
+    drug_name: str
+    connectivity_score: float | None  # None when the drug has no LINCS signature.
+    normalized_score: float | None = None
+    p_value: float | None = None
+    scored: bool  # False => no signature available, not scored (do not skip silently).
+    n_genes_overlap: int | None = None
+
+
+def connectivity_score(
+    up_genes: list[str],
+    down_genes: list[str],
+    drug_signature: pd.Series,
+) -> float:
+    """Weighted KS connectivity score for one drug vs the disease up/down gene sets.
+
+    Only the intersection of disease-signature genes and LINCS landmark genes is scored;
+    callers must record the overlap count (PLAN.md §6, Week 3 task 2).
+
+    Args:
+        up_genes: Disease up-regulated gene identifiers.
+        down_genes: Disease down-regulated gene identifiers.
+        drug_signature: Drug's expression vector indexed by gene identifier.
+
+    Returns:
+        Connectivity score in roughly [-1, 1]; strongly negative = strong reversal.
+    """
+    raise NotImplementedError("Connectivity scoring: implement in Week 3 (notebook 04).")
+
+
+def rank_drugs(
+    signature_up: list[str],
+    signature_down: list[str],
+    drug_profiles: pd.DataFrame,
+) -> pd.DataFrame:
+    """Score and rank all drugs against the disease signature.
+
+    Drugs without a LINCS signature are marked ``scored=False`` and excluded from the ranking
+    rather than dropped silently (PLAN.md §6, Week 3 task 2).
+
+    Returns a ranked table with the columns described in PLAN.md §6 (rank, drug_name,
+    chembl_id, connectivity_score, normalized_score, p_value, inclusion_reason,
+    known_targets, mechanism_summary).
+    """
+    raise NotImplementedError("Drug ranking: implement in Week 3 (notebook 04).")
+
+
+def mechanistic_prior(targets: list[str]) -> float:
+    """Prior weight for a drug based on sickle-cell-relevant target pathways.
+
+    Pathways of interest: HbF regulation, hemoglobin, NO signaling, inflammation, oxidative
+    stress (PLAN.md §6, Week 3 task 3). Used to build the secondary, prior-weighted ranking.
+    """
+    raise NotImplementedError("Mechanistic prior: implement in Week 3 (notebook 04).")
+
+
+def persist_ranking(ranking: pd.DataFrame, out_path: Path | None = None) -> Path:
+    """Write the ranked candidate list to ``data/results/ranked_candidates_v1.csv``."""
+    out_path = out_path or (RESULTS_DIR / "ranked_candidates_v1.csv")
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    ranking.to_csv(out_path, index=False)
+    return out_path
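Reviewer sketch (not part of this commit): the `connectivity_score` stub above describes a weighted Kolmogorov-Smirnov score where strongly negative means reversal. One possible shape for it is shown below, under stated assumptions. The function names `enrichment_score` and `weighted_connectivity`, the `weight` parameter, and the use of a plain `dict` in place of `pd.Series` are all hypothetical. This is neither the cmapPy API nor the Week 3 implementation. The combination rule (zero when both enrichment scores share a sign, otherwise half their difference) follows the weighted connectivity score as commonly described for CMap.

```python
from __future__ import annotations


def enrichment_score(gene_set: list[str], signature: dict[str, float], weight: float = 1.0) -> float:
    """Signed weighted KS enrichment of gene_set in a signature ranked high to low.

    Hypothetical helper. Genes absent from the signature are ignored, so only the
    overlap is scored.
    """
    ranked = sorted(signature.items(), key=lambda kv: kv[1], reverse=True)
    hits = {g for g in gene_set if g in signature}
    n_miss = len(ranked) - len(hits)
    if not hits or n_miss == 0:
        return 0.0
    hit_norm = sum(abs(signature[g]) ** weight for g in hits)
    if hit_norm == 0:
        return 0.0
    miss_step = 1.0 / n_miss

    running = 0.0
    best = 0.0  # running-sum value with the largest absolute deviation from zero
    for gene, value in ranked:
        if gene in hits:
            running += abs(value) ** weight / hit_norm
        else:
            running -= miss_step
        if abs(running) > abs(best):
            best = running
    return best


def weighted_connectivity(
    up_genes: list[str],
    down_genes: list[str],
    signature: dict[str, float],
    weight: float = 1.0,
) -> float:
    """Combine up/down enrichment scores; negative values indicate reversal."""
    es_up = enrichment_score(up_genes, signature, weight)
    es_down = enrichment_score(down_genes, signature, weight)
    if es_up * es_down > 0:  # both sets enriched at the same end: no coherent signal
        return 0.0
    return (es_up - es_down) / 2.0


# Toy drug signature (made-up values): genes A..F ranked from most up to most down.
toy_signature = {"A": 3.0, "B": 2.0, "C": 1.0, "D": -1.0, "E": -2.0, "F": -3.0}

# Disease-up genes sit at the bottom of the drug ranking and disease-down genes at
# the top, so the drug reverses the disease signature.
reversal = weighted_connectivity(["E", "F"], ["A", "B"], toy_signature)
mimic = weighted_connectivity(["A", "B"], ["E", "F"], toy_signature)
same_end = weighted_connectivity(["A"], ["B"], toy_signature)
```

On this toy input the reversal case scores at the negative extreme, the mimic case at the positive extreme, and the same-end case is zeroed out. A `pd.Series` signature could be passed via `drug_signature.to_dict()`. The size of `hits` in `enrichment_score` is the overlap count that the `connectivity_score` docstring asks callers to record.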
--- a/tests/test_scoring.py
+++ b/tests/test_scoring.py
@@ -0,0 +1,65 @@
+"""Tests for the matching engine and provenance logic.
+
+The headline test (PLAN.md §6, Week 3 task 4) verifies connectivity scoring against a known
+reference within tolerance; it is marked xfail until the scorer is implemented in Week 3.
+
+The tier-assignment tests run today — they pin the rules from PLAN.md §3 so the most
+commercially important design decision can't silently drift.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from src.provenance import ConfidenceTier, assign_tier
+
+
+class TestAssignTier:
+    """Tier rules from PLAN.md §3."""
+
+    def test_measured_large_n_peer_reviewed_multi_source_is_tier_a(self):
+        assert (
+            assign_tier(
+                is_measured=True,
+                n_per_group=27,
+                peer_reviewed=True,
+                single_source=False,
+            )
+            == ConfidenceTier.A
+        )
+
+    def test_inferred_is_always_tier_c(self):
+        assert (
+            assign_tier(
+                is_measured=False,
+                n_per_group=1000,
+                peer_reviewed=True,
+                single_source=False,
+            )
+            == ConfidenceTier.C
+        )
+
+    @pytest.mark.parametrize(
+        "kwargs",
+        [
+            dict(is_measured=True, n_per_group=6, peer_reviewed=True, single_source=False),
+            dict(is_measured=True, n_per_group=27, peer_reviewed=False, single_source=False),
+            dict(is_measured=True, n_per_group=27, peer_reviewed=True, single_source=True),
+            dict(is_measured=True, n_per_group=None, peer_reviewed=True, single_source=False),
+        ],
+    )
+    def test_measured_but_weak_evidence_is_tier_b(self, kwargs):
+        assert assign_tier(**kwargs) == ConfidenceTier.B
+
+
+@pytest.mark.xfail(reason="Connectivity scoring implemented in Week 3 (notebook 04).", strict=True)
+def test_connectivity_score_matches_reference():
+    """Verify connectivity scoring against a CMap/cmapPy reference within tolerance.
+
+    PLAN.md §6, Week 3 task 4. Replace this body with a known reference example
+    (disease up/down sets + drug signature -> expected score) once the scorer exists.
+    """
+    from src.scoring import connectivity_score
+
+    score = connectivity_score(up_genes=[], down_genes=[], drug_signature=None)  # noqa
+    assert score == pytest.approx(0.0, abs=1e-6)