Scaffold Reverso MVP pipeline structure
Set up the project skeleton per PLAN.md §4:

- src/ package: identifiers, disease, drugs, scoring, provenance with pydantic schemas and confidence-tier logic (working); data-pull/compute functions stubbed per their build week
- 5 starter notebooks (01-05) with PLAN-referenced steps
- tests/test_scoring.py: tier-assignment tests pass; scoring reference test xfail until Week 3
- docs/: recovery_test_report, data_sources, known_limitations skeletons
- pyproject.toml (requires-python >=3.11,<3.14), .gitignore, README
- data/ tree preserved via .gitkeep; raw/processed/results gitignored

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
57
notebooks/01_setup_identifiers.ipynb
Normal file
@@ -0,0 +1,57 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 01 \u2014 Setup identifiers\n",
    "\n",
    "Week 1, task 1 (PLAN.md \u00a76). Pin the disease/gene/ground-truth identifiers and persist them to `data/processed/identifiers.json`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(0, '..') # import the src package from notebooks/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.identifiers import build_identifier_set, persist_identifiers\n",
    "\n",
    "ids = build_identifier_set()\n",
    "ids.model_dump()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = persist_identifiers()\n",
    "print(f'wrote {path}')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
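Note on notebook 01: the persistence step it describes (pin identifiers, write them to `data/processed/identifiers.json`) could look roughly like the sketch below. This is only an illustration under stated assumptions: a plain dataclass stands in for the pydantic model in `src/identifiers.py`, whose actual fields are not shown in this commit. `IdentifierSet`, its field names, and `persist` are hypothetical; the disease ID and drug names are the ones mentioned in notebooks 02 and 05.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class IdentifierSet:
    """Hypothetical stand-in for the pydantic model in src/identifiers.py."""

    disease_id: str = "MONDO:0011382"  # ID referenced in notebook 02
    ground_truth_drugs: tuple = ("hydroxyurea", "L-glutamine")  # named in notebook 05


def persist(ids: IdentifierSet, path: Path) -> Path:
    """Write the identifier set as sorted, indented JSON and return the path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(ids), indent=2, sort_keys=True))
    return path


# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    out = persist(IdentifierSet(), Path(tmp) / "data" / "processed" / "identifiers.json")
    reloaded = json.loads(out.read_text())
```

Sorting keys and fixing the indent keeps the persisted file stable across runs, which makes later diffs of `identifiers.json` meaningful.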
50
notebooks/02_disease_signature.ipynb
Normal file
@@ -0,0 +1,50 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 02 \u2014 Disease signature\n",
    "\n",
    "Week 1 (PLAN.md \u00a76). Pull Open Targets + a GEO expression study, run differential expression, and build `sickle_cell_signature_v1.json` (Tier A) with full provenance.\n\nSteps: (1) Open Targets associations, (2) choose + download GEO dataset, (3) differential expression, (4) build + persist signature.\n\n**Pitfall to document:** whole-blood expression is partly driven by cell-composition differences, not disease state (PLAN.md \u00a79.1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(0, '..') # import the src package from notebooks/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src import disease\n",
    "from src.provenance import ConfidenceTier\n",
    "\n",
    "# Step 1: Open Targets associations for MONDO:0011382 -> data/raw/open_targets/\n",
    "# Step 2: choose + download GEO study (GSE53441 / GSE35007 / newer) -> data/raw/geo/\n",
    "# Step 3: disease.compute_differential_expression(...)\n",
    "# Step 4: disease.build_signature(...) then disease.persist_signature(...)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
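Note on notebook 02: steps 3 and 4 (differential expression, then building the signature) are still stubs in this commit. The sketch below shows one very simple way the idea could work: a log2 fold change per gene followed by a threshold split into up and down lists. It is not the project's method. The function names, the pseudocount, the threshold, and the gene names are all assumptions, and the expression values are toy numbers. A real analysis would presumably use a proper statistical test and account for the cell-composition pitfall noted in the notebook.

```python
import math
from statistics import mean


def log2_fold_changes(disease, control, pseudocount=1.0):
    """Per-gene log2(mean disease / mean control); the pseudocount avoids log(0)."""
    return {
        gene: math.log2(
            (mean(disease[gene]) + pseudocount) / (mean(control[gene]) + pseudocount)
        )
        for gene in disease.keys() & control.keys()  # only genes measured in both groups
    }


def build_signature(lfc, threshold=1.0):
    """Split genes into up/down lists using an assumed |log2 FC| threshold."""
    up = sorted(g for g, v in lfc.items() if v >= threshold)
    down = sorted(g for g, v in lfc.items() if v <= -threshold)
    return {"up": up, "down": down}


# Toy expression values for placeholder genes; not real data.
disease_expr = {"GENE_A": [30.0, 34.0], "GENE_B": [2.0, 4.0], "GENE_C": [10.0, 12.0]}
control_expr = {"GENE_A": [7.0, 7.0], "GENE_B": [15.0, 15.0], "GENE_C": [11.0, 11.0]}

signature = build_signature(log2_fold_changes(disease_expr, control_expr))
```

With these toy values, `GENE_A` lands in the up list, `GENE_B` in the down list, and `GENE_C` in neither.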
50
notebooks/03_drug_profiles.ipynb
Normal file
@@ -0,0 +1,50 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 03 \u2014 Drug profiles\n",
    "\n",
    "Week 2 (PLAN.md \u00a76). Curate the ~300-drug set, pull ChEMBL + LINCS L1000 data, and assemble `drug_profiles_v1.parquet`.\n\nDrug set: 2 ground-truth + ~50 related-mechanism + ~50 negative controls + ~200 random (fixed seed). Document any missing LINCS signatures in `docs/known_limitations.md`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(0, '..') # import the src package from notebooks/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src import drugs\n",
    "from src import RANDOM_SEED\n",
    "\n",
    "# Step 1: drugs.curate_drug_set(seed=RANDOM_SEED) -> data/processed/drug_set_v1.csv\n",
    "# Step 2: drugs.fetch_chembl_profile(...) for each drug -> data/raw/chembl/\n",
    "# Step 3: drugs.fetch_lincs_signature(...) -> data/raw/lincs/\n",
    "# Step 4: drugs.persist_drug_profiles(...) -> drug_profiles_v1.parquet"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
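Note on notebook 03: the drug-set composition (2 ground-truth + ~50 related-mechanism + ~50 negative controls + ~200 random with a fixed seed) lends itself to a small reproducibility sketch. The version below is an assumption about how such a draw might be made deterministic, not the stubbed `drugs.curate_drug_set` itself. The signature, the placeholder drug names, and the seed value 42 are hypothetical; the actual `RANDOM_SEED` is not shown in this commit.

```python
import random


def curate_drug_set(ground_truth, related, negatives, pool, n_random, seed):
    """Combine the fixed categories with a seeded random draw from the remaining pool."""
    fixed = list(ground_truth) + list(related) + list(negatives)
    taken = set(fixed)
    # Sort the candidates so the draw does not depend on the pool's input order.
    candidates = sorted(d for d in pool if d not in taken)
    rng = random.Random(seed)  # local RNG; leaves the global random state untouched
    return fixed + rng.sample(candidates, n_random)


pool = [f"drug_{i:03d}" for i in range(500)]  # placeholder names, not real compounds
ground_truth = ["hydroxyurea", "L-glutamine"]

first = curate_drug_set(ground_truth, pool[:50], pool[50:100], pool, 200, seed=42)
second = curate_drug_set(ground_truth, pool[:50], pool[50:100], pool, 200, seed=42)
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed(...)` means other code that touches the global generator cannot change which drugs are drawn.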
48
notebooks/04_connectivity_scoring.ipynb
Normal file
@@ -0,0 +1,48 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 04 \u2014 Connectivity scoring\n",
    "\n",
    "Week 3 (PLAN.md \u00a76). CMap-style connectivity scoring of every drug against the sickle cell signature. Strongly negative connectivity = strong reversal = candidate.\n\nOutputs `data/results/ranked_candidates_v1.csv`. Also build the secondary mechanistically-weighted ranking. Document the gene-overlap count; mark signature-less drugs as 'not scored' rather than dropping them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(0, '..') # import the src package from notebooks/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src import scoring\n",
    "\n",
    "# Load signature + drug profiles, then:\n",
    "# ranking = scoring.rank_drugs(up, down, drug_profiles)\n",
    "# scoring.persist_ranking(ranking) -> data/results/ranked_candidates_v1.csv"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
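Note on notebook 04: the scoring logic is stubbed until Week 3, so the sketch below only illustrates the sign convention and the "not scored" handling the notebook asks for. It deliberately swaps in a simplified score (mean drug effect on disease-up genes minus mean effect on disease-down genes) in place of the enrichment statistic a CMap-style method would use. The function bodies, row fields, and toy profiles are assumptions; only the `rank_drugs(up, down, drug_profiles)` call shape comes from the notebook comment.

```python
from statistics import mean


def reversal_score(up, down, profile):
    """Simplified stand-in score: mean(up-gene effects) - mean(down-gene effects).

    Returns (score, n_overlap). The score is None unless both the up and the
    down gene sets overlap the drug profile.
    """
    up_vals = [profile[g] for g in up if g in profile]
    down_vals = [profile[g] for g in down if g in profile]
    n_overlap = len(up_vals) + len(down_vals)
    if not up_vals or not down_vals:
        return None, n_overlap
    return mean(up_vals) - mean(down_vals), n_overlap


def rank_drugs(up, down, drug_profiles):
    """Score every drug; keep signature-less drugs as 'not scored' instead of dropping them."""
    rows = []
    for drug, profile in drug_profiles.items():
        score, n_overlap = (None, 0) if profile is None else reversal_score(up, down, profile)
        status = "scored" if score is not None else "not scored"
        rows.append({"drug": drug, "score": score, "n_overlap": n_overlap, "status": status})
    # Most negative score (strongest reversal) first; unscored drugs go last.
    rows.sort(key=lambda r: (r["score"] is None, r["score"] if r["score"] is not None else 0.0))
    return rows


# Toy profiles with placeholder genes and drug names; not real data.
up, down = ["GENE_A"], ["GENE_B"]
drug_profiles = {
    "mimic": {"GENE_A": 1.0, "GENE_B": -1.0},
    "reverser": {"GENE_A": -2.0, "GENE_B": 1.5},
    "no_signature": None,
}
ranking = rank_drugs(up, down, drug_profiles)
```

Recording `n_overlap` per drug gives the gene-overlap count the notebook says to document, and the explicit `status` field keeps unscored drugs visible in the output.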
48
notebooks/05_recovery_test.ipynb
Normal file
@@ -0,0 +1,48 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 05 \u2014 Recovery test\n",
    "\n",
    "Week 4 (PLAN.md \u00a76). **Commit the pre-registered success criteria to git BEFORE running this notebook** (PLAN.md \u00a78, \u00a710).\n\nPull hydroxyurea + L-glutamine ranks and 5 negative-control ranks, compute pass/fail, examine the top 10, and fill in `docs/recovery_test_report.md`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(0, '..') # import the src package from notebooks/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from src import RESULTS_DIR\n",
    "\n",
    "ranking = pd.read_csv(RESULTS_DIR / 'ranked_candidates_v1.csv')\n",
    "# Pull ground-truth + negative-control ranks; evaluate pre-registered criteria."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
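Note on notebook 05: the pass/fail step could be sketched as below. The actual pre-registered criteria live in PLAN.md §8 and are not shown in this commit, so the rule used here (every ground-truth drug inside the top N, every negative control outside it) and the cutoff `top_n=2` are purely hypothetical placeholders. The function name, the result fields, and the placeholder drug names are likewise assumptions; a plain list stands in for the ranked CSV.

```python
def evaluate_recovery(ranking, ground_truth, negative_controls, top_n):
    """Evaluate an assumed recovery criterion on a best-first list of drug names."""
    rank_of = {drug: i + 1 for i, drug in enumerate(ranking)}  # 1-based ranks
    gt_ranks = {d: rank_of.get(d) for d in ground_truth}
    nc_ranks = {d: rank_of.get(d) for d in negative_controls}
    # Hypothetical rule: ground truth must be ranked and inside the top N ...
    gt_ok = all(r is not None and r <= top_n for r in gt_ranks.values())
    # ... while negative controls must be unranked or outside the top N.
    nc_ok = all(r is None or r > top_n for r in nc_ranks.values())
    return {
        "ground_truth_ranks": gt_ranks,
        "negative_control_ranks": nc_ranks,
        "passed": gt_ok and nc_ok,
    }


# Toy ranking with placeholder names for everything except the two ground-truth drugs.
negative_controls = [f"negative_{i}" for i in range(5)]
ranking = ["hydroxyurea", "L-glutamine"] + [f"drug_{i}" for i in range(13)] + negative_controls

result = evaluate_recovery(ranking, ["hydroxyurea", "L-glutamine"], negative_controls, top_n=2)
```

Keeping the criterion in a single pure function makes it easy to commit the thresholds before the ranking exists, which is the ordering the notebook insists on.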