first commit

This commit is contained in:
2026-06-23 19:57:44 +02:00
commit e717cf40ed
2 changed files with 429 additions and 0 deletions

PLAN.md

@@ -0,0 +1,428 @@
# QPharma MVP — Sickle Cell Repurposing Pipeline
> **For Claude Code:** This is the project specification. Read this entire document before suggesting actions or writing code. The decisions in section "Locked decisions" have already been made by the founder after extensive expert consultation; do not re-litigate them. Where the plan calls for a choice, propose options but default to the spec.
---
## 1. Project context
### What we're building
A minimum viable drug repurposing pipeline that:
1. Pulls public biomedical data for **sickle cell disease** (one disease, deliberately scoped)
2. Builds a **disease signature** (transcriptomic gene expression vector)
3. Builds **drug profiles** for ~300 small-molecule compounds
4. Runs **CMap-style connectivity scoring** to rank drugs by their potential to reverse the disease signature
5. Validates via a **recovery test**: do the two known sickle cell drugs (hydroxyurea, L-glutamine) rank in the top of the list?
This is a proof-of-concept for a broader platform. The platform thesis is "AI-driven drug repurposing using disease signature + drug profile matching, with knowledge graphs as the long-term moat and quantum-inspired optimization as a phase-3 multiplier." None of that broader vision is in scope for this MVP. The MVP exists to produce one credible, reproducible result that proves the matching method works.
### Why sickle cell
- **Monogenic** (HBB gene, single point mutation) — disease biology is unusually clean for a rare disease
- **Rich public data** — GEO has multiple expression studies, Open Targets has well-curated associations
- **Two known repurposings/expansions** as ground truth: **hydroxyurea** (originally for chronic myeloid leukemia, now standard sickle cell care) and **L-glutamine** (approved 2017). If the engine doesn't rank these highly, the engine is wrong.
- Strong unmet need narrative for investor conversations
### What success looks like
A reproducible Jupyter notebook (or notebook set) that produces:
1. A versioned sickle cell disease signature with provenance
2. A drug profile dataset for ~300 compounds
3. A ranked CSV of all ~300 drugs by connectivity score
4. A **2-page write-up** in `docs/recovery_test_report.md` containing: methodology, the ranks of hydroxyurea and L-glutamine (the recovery test), sanity check results for negative-control drugs, the top-10 candidates with brief mechanistic rationale, and honest limitations
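If hydroxyurea ranks in the top 10% and L-glutamine ranks in the top 25% (or is documented as unscorable because it has no LINCS signature), and the negative controls rank low, the MVP passes. Pre-register this threshold before looking at the data; the exact wording is in the Week 4 plan.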
---
## 2. Locked decisions (do not re-litigate)
| Decision | Choice | Why |
|---|---|---|
| Target disease | Sickle cell disease (MONDO:0011382) | Monogenic, public data, two known ground-truth drugs |
| Drug modality | Small molecules only | Biologics are a fundamentally different problem |
| Matching method | CMap-style connectivity scoring (Lamb 2006 / Subramanian 2017) | Well-established, no training data needed, published method with open tooling (`cmapPy` for LINCS file I/O; confirm whether it also ships scoring code) |
| Drug set size | ~300 compounds | Large enough to be meaningful, small enough to curate carefully |
| Patient stratification | None (one signature per disease) | Repurposing is disease-level; patient-level is the trial-design problem, out of scope |
| Quantum / quantum-inspired layer | Not in scope | Phase 3 multiplier, not relevant until classical baseline is proven |
| Knowledge graph / LLM extraction | Not in scope | Phase 2, after classical signature matching is validated |
| Build environment | Local notebook (Mac Studio, 96GB RAM) | All data fits locally; no cloud needed for MVP |
| Timeline | 2-4 weeks part-time | Cleaner than a hackathon, faster than "production" |
| Language | Python primary | Cheminformatics and bioinformatics ecosystems are mature in Python |
**Things that are explicitly NOT in this MVP:**
- Combination (2+ drug) matching
- Inverse/reverse matching (outcome → drugs)
- Multiple diseases
- Subtype-level signatures
- Patient/demographic stratification
- Knowledge graph construction
- LLM-based literature extraction
- Quantum / quantum-inspired optimization
- Mechanistic / ODE-based modelling
- Private pharma data ingestion
- API or productization
- Web UI
- Agentic orchestration
If a future Claude Code session is tempted to add any of these "while we're at it", don't: each one delays the proof point that de-risks everything else. Build only what's in the spec.
---
## 3. Architecture overview
```
RAW DATA SOURCES
├── Open Targets (target-disease associations)
├── Orphanet / OMIM / MONDO (disease identifiers and definitions)
├── GEO (transcriptomic expression data — disease vs healthy)
├── ChEMBL (drug structures, targets, bioactivities)
├── LINCS L1000 (drug-induced expression signatures)
├── ClinicalTrials.gov (trial history, shelved compound discovery)
└── FDA labels / DailyMed (approved indications, safety)
HARMONIZATION LAYER (Week 1-2)
├── Disease identifier resolution → canonical MONDO ID
├── Drug identity resolution → canonical InChIKey
└── Provenance + confidence tier attached to every record
FEATURE LAYER (Week 1-2)
├── sickle_cell_signature_v1.json — disease signature vector with provenance
└── drug_profiles_v1.parquet — ~300 drug profiles with LINCS signatures
MATCHING ENGINE (Week 3)
└── CMap connectivity scoring → ranked drug list
VALIDATION + WRITE-UP (Week 4)
├── Recovery test: hydroxyurea + L-glutamine ranks
├── Sanity checks: negative controls rank low
└── 2-page report
```
### Confidence tiers (critical design decision)
Every signature and drug profile carries a **confidence tier**:
- **Tier A** — measured data, peer-reviewed source, n>10 per group, recent
- **Tier B** — measured but small-n, older, or single-source
- **Tier C** — inferred / extrapolated / hypothesis-only
This is the most commercially important design decision in the whole pipeline. Sarborg's "1,700 rare disease signatures" are mostly Tier C (inferred). The platform's honesty about this is a differentiator, not a weakness. **Every persisted artifact must include its tier.**
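The tier rules above can be sketched as a small function. This is a minimal illustration, not the project's `src/provenance.py`: the `Evidence` fields, the `assign_tier` name, and the `RECENT_SINCE_YEAR` cutoff are all assumptions (the spec says "recent" without defining it, and it does not say whether a single-source dataset can still be Tier A, so source count is left out here).

```python
from dataclasses import dataclass
from enum import Enum


class Tier(str, Enum):
    A = "A"  # measured, peer-reviewed, n>10 per group, recent
    B = "B"  # measured but small-n, older, or single-source
    C = "C"  # inferred / extrapolated / hypothesis-only


@dataclass(frozen=True)
class Evidence:
    """Hypothetical summary of what backs a signature or drug profile."""
    measured: bool
    peer_reviewed: bool
    min_group_size: int
    publication_year: int


# Assumption: "recent" is not defined in the spec; this cutoff is a placeholder.
RECENT_SINCE_YEAR = 2010


def assign_tier(ev: Evidence) -> Tier:
    # Anything not directly measured is Tier C, whatever else is true.
    if not ev.measured:
        return Tier.C
    is_tier_a = (
        ev.peer_reviewed
        and ev.min_group_size > 10
        and ev.publication_year >= RECENT_SINCE_YEAR
    )
    # Measured data that misses any Tier A condition falls back to Tier B.
    return Tier.A if is_tier_a else Tier.B
```

Keeping the rule in one function means every persisted artifact gets its tier from the same logic, and the rationale string can be generated next to it.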
---
## 4. Directory structure
```
qpharma-mvp/
├── PLAN.md # This file
├── README.md # Short project description
├── pyproject.toml # Dependencies (or requirements.txt)
├── .gitignore # Exclude data/ and notebook checkpoints
├── data/
│   ├── raw/ # Downloaded data, never edited
│   │   ├── open_targets/
│   │   ├── geo/
│   │   ├── chembl/
│   │   └── lincs/
│   ├── processed/ # Cleaned, harmonized data
│   │   ├── sickle_cell_signature_v1.json
│   │   └── drug_profiles_v1.parquet
│   └── results/
│       └── ranked_candidates_v1.csv
├── notebooks/
│   ├── 01_setup_identifiers.ipynb
│   ├── 02_disease_signature.ipynb
│   ├── 03_drug_profiles.ipynb
│   ├── 04_connectivity_scoring.ipynb
│   └── 05_recovery_test.ipynb
├── src/
│   ├── __init__.py
│   ├── identifiers.py # MONDO, ChEMBL ID resolution
│   ├── disease.py # Signature construction
│   ├── drugs.py # Drug profile construction
│   ├── scoring.py # CMap connectivity scoring
│   └── provenance.py # Tier assignment, source tracking
├── tests/
│   └── test_scoring.py # Verify scoring against known reference
└── docs/
    ├── recovery_test_report.md # Final 2-page write-up
    ├── data_sources.md # Detailed list of where data came from
    └── known_limitations.md # Honest pitfalls documented
```
---
## 5. Dependencies
```toml
# pyproject.toml core dependencies
python = ">=3.11,<3.13"
pandas = ">=2.0"
numpy = ">=1.24"
scipy = ">=1.11"
requests = ">=2.31"
chembl_webresource_client = ">=0.10" # ChEMBL API client
GEOparse = ">=2.0" # GEO dataset access
pydeseq2 = ">=0.4" # Differential expression in Python
cmapPy = ">=4.0" # Reference CMap connectivity implementation
pyarrow = ">=14.0" # Parquet I/O
jupyter = ">=1.0"
matplotlib = ">=3.7" # Sanity-check plots
seaborn = ">=0.13"
pydantic = ">=2.0" # Schema validation for signatures/profiles
```
Install with `pip install -e .` after creating `pyproject.toml`. All free, all open-source. No licensed data sources are used in the MVP.
---
## 6. Week-by-week build plan
### Week 1 — Disease signature
**Goal:** One Tier-A signature vector for sickle cell with full provenance.
**Tasks:**
1. **Pin identifiers** (`src/identifiers.py`, `notebooks/01_setup_identifiers.ipynb`)
- Sickle cell disease: MONDO:0011382, Orphanet:232, OMIM:603903
- Causal gene: HBB (Ensembl ENSG00000244734, HGNC:4827)
- Persist to `data/processed/identifiers.json`
2. **Pull Open Targets data** (`notebooks/02_disease_signature.ipynb` step 1)
- Use the Open Targets Platform API or bulk Parquet download (https://platform.opentargets.org/downloads)
- Pull: target-disease associations for MONDO:0011382, evidence sources, associated targets
- Store raw in `data/raw/open_targets/`
- This gives the "known biology" subgraph as a sanity reference
3. **Identify and pull a GEO dataset** (`notebooks/02_disease_signature.ipynb` step 2)
- Search GEO for sickle cell expression studies with healthy controls
- Criteria: n>10 per group (the Tier A threshold), RNA-seq or microarray, peer-reviewed publication
- Candidates to evaluate (verify each is still available and choose the strongest):
- GSE53441 (sickle cell vs healthy whole blood)
- GSE35007 (sickle cell pediatric)
- More recent studies preferred — search "sickle cell" with "Homo sapiens" filter
- Use `GEOparse` to download
- Document the chosen dataset's metadata fully in the signature provenance
4. **Differential expression** (`notebooks/02_disease_signature.ipynb` step 3)
- For microarray data: log2-transform, normalize, use `limma`-equivalent in Python (or call R via `rpy2` if needed — but try to stay in Python)
- For RNA-seq: use `pydeseq2`
- Output: gene-level log fold change and adjusted p-value table
5. **Build the signature** (`src/disease.py`)
- Take top ~250 up-regulated and top ~250 down-regulated genes by adjusted p-value (cut at q<0.05)
- Map gene symbols to Entrez IDs and Ensembl IDs (use the `mygene` package or `pyensembl`; neither is in the section 5 dependency list yet, so add whichever is chosen)
- Persist to `data/processed/sickle_cell_signature_v1.json` with schema:
```json
{
"signature_id": "sickle_cell_v1",
"disease_mondo_id": "MONDO:0011382",
"up_regulated": [{"gene": "HBG2", "entrez_id": "3048", "log_fc": 2.1, "qvalue": 1e-8}, ...],
"down_regulated": [...],
"provenance": {
"geo_accession": "GSE53441",
"n_disease": 27,
"n_healthy": 12,
"platform": "Affymetrix HG-U133 Plus 2.0",
"method": "limma",
"created_date": "2026-..."
},
"confidence_tier": "A",
"tier_rationale": "Measured RNA expression, n>10/group, peer-reviewed dataset",
"limitations": ["Whole-blood expression confounded by cell composition differences", ...]
}
```
**Honest pitfall to document:** Sickle cell whole-blood expression is partly driven by cell composition differences (different RBC/WBC ratios in patients vs controls), not just disease state. Note this in the signature's `limitations` field. A v2 would do cell-type deconvolution. v1 does not.
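The selection rule in task 5 (filter at q<0.05, then take the top ~250 genes in each direction by adjusted p-value) can be sketched as below. This is an illustrative sketch, not the contents of `src/disease.py`: the function name `build_signature` and the input column names (`gene`, `log_fc`, `qvalue`, mirroring the JSON schema above) are assumptions, and ID mapping, provenance, and tier fields are left out.

```python
import pandas as pd


def _records(df: pd.DataFrame) -> list[dict]:
    # Cast to plain Python types so the result is JSON-serializable.
    return [
        {"gene": str(g), "log_fc": float(lfc), "qvalue": float(q)}
        for g, lfc, q in zip(df["gene"], df["log_fc"], df["qvalue"])
    ]


def build_signature(de: pd.DataFrame, n_top: int = 250, q_cut: float = 0.05) -> dict:
    """Pick up/down gene sets from a differential expression table.

    `de` is assumed to have one row per gene with columns
    gene, log_fc (disease vs healthy), and qvalue (adjusted p-value).
    """
    significant = de[de["qvalue"] < q_cut]
    # Within each direction, keep the n_top genes with the smallest q-values.
    up = significant[significant["log_fc"] > 0].nsmallest(n_top, "qvalue")
    down = significant[significant["log_fc"] < 0].nsmallest(n_top, "qvalue")
    return {"up_regulated": _records(up), "down_regulated": _records(down)}
```

If fewer than 250 genes pass the q-value cut in one direction, the set is simply shorter; that count is worth recording in the signature's `limitations` field.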
### Week 2 — Drug profiles
**Goal:** ~300 drug profiles with structure, targets, and LINCS expression signatures where available.
**Tasks:**
1. **Curate the drug set deliberately** (`notebooks/03_drug_profiles.ipynb` step 1)
- **Ground truth (n=2, non-negotiable):** hydroxyurea (ChEMBL: CHEMBL467), L-glutamine (ChEMBL: CHEMBL930)
- **Related-mechanism drugs (n~50):** HbF inducers, anti-inflammatory drugs studied in sickle cell, NO donors, antioxidants, drugs in current sickle cell clinical trials (search ClinicalTrials.gov for "sickle cell" interventional trials)
- **Negative controls (n~50):** Drugs from unrelated therapeutic areas: antifungals, contraceptives, antihistamines, antibiotics. These should rank low; if they don't, the method has a bias to diagnose.
- **General sample (n~200):** Randomly sampled drugs from the LINCS L1000 catalog (roughly 20,000 small-molecule perturbagens in the Subramanian 2017 release; confirm the count for the release actually downloaded). Use a fixed random seed for reproducibility.
- Store the curated list with the reason for inclusion in `data/processed/drug_set_v1.csv`
2. **Pull ChEMBL data for each drug** (`notebooks/03_drug_profiles.ipynb` step 2, `src/drugs.py`)
- Use `chembl_webresource_client` to fetch: ChEMBL ID, preferred name, InChIKey, canonical SMILES, known mechanisms of action, target list with bioactivity
- Resolve drug name aliases to canonical ChEMBL IDs
- Store raw responses in `data/raw/chembl/`
3. **Pull LINCS L1000 signatures** (`notebooks/03_drug_profiles.ipynb` step 3)
- LINCS data portal: https://clue.io/data/CMap2020
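- Use the Level 5 signatures (replicate-collapsed with MODZ). Each Level 5 signature is specific to one compound, cell line, dose, and time point, so producing a single vector per drug needs an explicit aggregation step across cell lines; document the rule chosen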
- Format: 978 landmark genes × drugs, with z-scored expression changes
- **Critical honest note:** For amino acids and metabolites like L-glutamine, LINCS coverage may be missing. If a ground-truth drug lacks a signature, document this in `docs/known_limitations.md`. The fallback for that specific drug is to flag it as "no signature available, would require mechanism-graph fallback in v2."
- Store in `data/raw/lincs/`
4. **Assemble the drug profile table** (`src/drugs.py`)
- One row per drug
- Columns: chembl_id, name, inchikey, smiles, targets (list), mechanism_of_action, lincs_signature (978-vector or null), source_provenance, confidence_tier
- Persist to `data/processed/drug_profiles_v1.parquet`
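The seeded "general sample" draw from task 1 can be sketched as follows. This is a minimal illustration: `sample_general_set`, `assemble_drug_set`, the `drug_id` and `inclusion_reason` column names, and the default seed value are assumptions, not names fixed by this plan.

```python
import numpy as np
import pandas as pd


def sample_general_set(catalog_ids, curated_ids, n=200, seed=20260623):
    """Draw n catalog compounds not already in the curated groups."""
    # Sort the pool so the draw does not depend on input ordering.
    pool = sorted(set(catalog_ids) - set(curated_ids))
    rng = np.random.default_rng(seed)
    chosen = rng.choice(pool, size=min(n, len(pool)), replace=False)
    return sorted(chosen.tolist())


def assemble_drug_set(groups: dict[str, list[str]]) -> pd.DataFrame:
    """One row per drug, tagged with the reason it was included."""
    rows = [
        {"drug_id": drug_id, "inclusion_reason": reason}
        for reason, ids in groups.items()
        for drug_id in ids
    ]
    df = pd.DataFrame(rows, columns=["drug_id", "inclusion_reason"])
    # A drug listed in two groups would be scored and counted twice.
    if df["drug_id"].duplicated().any():
        raise ValueError("a drug appears in more than one inclusion group")
    return df
```

Excluding the curated groups before sampling keeps the ground-truth and control drugs from also landing in the random sample.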
### Week 3 — Connectivity scoring (the matching engine)
**Goal:** A ranked CSV of all ~300 drugs by their connectivity score against the sickle cell signature.
**Tasks:**
1. **Implement CMap connectivity scoring** (`src/scoring.py`, `notebooks/04_connectivity_scoring.ipynb`)
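- Use `cmapPy` to read the LINCS GCT/GCTX files. Before treating it as the scoring reference, confirm that it exposes a connectivity scoring routine (its documented core is file I/O and math helpers); if it does not, implement the score in `src/scoring.py` from the method description in Subramanian et al. 2017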
- Method: weighted Kolmogorov-Smirnov-based enrichment. For each drug, the score answers: how strongly does this drug's expression signature *reverse* the disease's up- and down-regulated gene sets?
- Strongly negative connectivity scores = strong reversal = candidate match
- Reference: Lamb et al. 2006 (Science), Subramanian et al. 2017 (Cell) — the L1000 paper
2. **Compute scores for all drugs** (`notebooks/04_connectivity_scoring.ipynb`)
- Map between the disease signature genes (potentially full-genome) and the LINCS 978 landmark genes — only the intersection is scored. Document the gene overlap count; this matters.
- For drugs without LINCS signatures (e.g., L-glutamine likely): mark explicitly as "not scored, no signature available." Do not skip silently.
- Output: `data/results/ranked_candidates_v1.csv` with columns: rank, drug_name, chembl_id, connectivity_score, normalized_score, p_value (if available), inclusion_reason, known_targets, mechanism_summary
3. **Build a secondary mechanistically-weighted ranking** (`notebooks/04_connectivity_scoring.ipynb`)
- For each drug, compute a prior weight based on whether its known targets are in sickle cell-relevant pathways (HbF regulation, hemoglobin, NO signaling, inflammation, oxidative stress)
- Produce a second ranking blending connectivity score with mechanistic prior
- Showing both raw and prior-weighted rankings is honest and informative
4. **Write a unit test** (`tests/test_scoring.py`)
- Use a reference example from the CMap paper or `cmapPy` documentation
- Verify the implementation matches the reference within tolerance
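The scoring described in tasks 1 and 2 can be sketched as below. This is a from-scratch illustration of a weighted KS-style enrichment score combined in the style of the weighted connectivity score from Subramanian et al. 2017; it is not `cmapPy`'s API and has not been checked against a published reference output, which is what `tests/test_scoring.py` is for. The function names and the genes-by-drugs DataFrame layout are assumptions.

```python
import numpy as np
import pandas as pd


def enrichment_score(profile: pd.Series, gene_set: set[str], weight: float = 1.0) -> float:
    """Signed weighted KS-style enrichment of gene_set in a drug z-score profile."""
    ranked = profile.sort_values(ascending=False)
    in_set = ranked.index.isin(list(gene_set))
    n_hit = int(in_set.sum())
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    # Hits step up in proportion to |z|; misses step down uniformly.
    hit_weights = np.where(in_set, np.abs(ranked.to_numpy()) ** weight, 0.0)
    total = hit_weights.sum()
    if total == 0:
        return 0.0
    running = np.cumsum(hit_weights / total - (~in_set) / n_miss)
    # The score is the running-sum value with the largest absolute deviation.
    return float(running[np.argmax(np.abs(running))])


def connectivity_score(profile: pd.Series, up_genes, down_genes) -> float:
    """Negative values mean the drug profile reverses the disease signature."""
    measured = set(profile.index)
    # Only genes present in both signature and profile are scored.
    es_up = enrichment_score(profile, set(up_genes) & measured)
    es_down = enrichment_score(profile, set(down_genes) & measured)
    if np.sign(es_up) == np.sign(es_down):
        return 0.0
    return (es_up - es_down) / 2.0


def rank_drugs(profiles: pd.DataFrame, up_genes, down_genes) -> pd.DataFrame:
    """profiles: rows are genes, columns are drugs. Most negative score ranks first."""
    scores = {
        drug: connectivity_score(profiles[drug].dropna(), up_genes, down_genes)
        for drug in profiles.columns
    }
    out = (
        pd.Series(scores, name="connectivity_score")
        .sort_values()
        .rename_axis("drug_name")
        .reset_index()
    )
    out.insert(0, "rank", range(1, len(out) + 1))
    return out
```

Drugs with no LINCS signature should never enter `profiles` as all-NaN columns that silently score 0.0; keep them out of the ranking and list them as "not scored" instead.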
### Week 4 — Recovery test and write-up
**Goal:** A 2-page document a sceptical pharma scientist can evaluate in 5 minutes.
**Critical:** Before looking at the rankings, pre-register the success criteria in writing:
> *"The MVP passes if hydroxyurea ranks in the top 10% (top 30 of 300) AND L-glutamine either ranks in the top 25% (top 75) OR is documented as unscorable due to missing LINCS signature. At least 4 of 5 negative-control drugs must rank in the bottom half."*
**Tasks:**
1. **Run the recovery test** (`notebooks/05_recovery_test.ipynb`)
- Pull the ranks of hydroxyurea and L-glutamine from `ranked_candidates_v1.csv`
- Pull the ranks of the 5 pre-specified negative controls (named from the ~50 negative-control drugs before any scores are viewed)
- Compute pass/fail against the pre-registered criteria
2. **Examine the top 10** (`notebooks/05_recovery_test.ipynb`)
- For each of the top 10 candidates, write a one-sentence mechanistic rationale (or note "no obvious rationale — possible false positive")
- Identify the single most interesting non-obvious candidate
- Many top-10 candidates will look mechanistically silly (HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due to widespread expression effects); document this honestly
3. **Write `docs/recovery_test_report.md`** (~2 pages)
- **Section 1 — Methodology:** What was built, in 5-6 sentences, with the GEO dataset, drug set composition, and scoring method named
- **Section 2 — Recovery test result:** Did hydroxyurea and L-glutamine pass? Did negative controls behave correctly? Pass/fail against pre-registered criteria
- **Section 3 — Top 10 candidates:** Brief table with each candidate, score, known mechanism, and a sentence on biological plausibility
- **Section 4 — One non-obvious candidate worth investigating:** A single paragraph on the most interesting result
- **Section 5 — Honest limitations:** Cell-composition confound, L1000 cell-line limitations, missing signatures, no mechanistic validation layer
- **Section 6 — What v2 would fix:** Cell-type deconvolution, knowledge graph for missing-signature drugs, second disease to test generalization
4. **Document data sources fully** (`docs/data_sources.md`)
- Every data source, version, download date, and license
- This is the artifact that proves reproducibility
5. **Document known limitations** (`docs/known_limitations.md`)
- The honest list of what would break this MVP at scale or in a different disease
- Useful for the next pharma conversation: "yes, we know these are limitations, here's how v2 addresses them"
---
## 7. Data sources reference
| Source | URL | Access | License | Use in MVP |
|---|---|---|---|---|
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics |
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets |
| LINCS L1000 | https://clue.io/data | Bulk download | Free for academic use; restrictions otherwise | Drug expression signatures |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) |
**Licensing note for LINCS:** Read the LINCS data use terms before commercial use. For the MVP (research/proof-of-concept), the terms are permissive. For productization, this needs legal review.
---
## 8. Reproducibility requirements
This is a science artifact, not a hack. Reproducibility is the whole point.
- All data downloads must record date and version
- All randomness must use a fixed seed (set in a top-level constant)
- All signature and profile files must include `created_date` and `pipeline_version`
- Every notebook must run end-to-end from a fresh checkout without manual intervention (other than downloading the raw data files, which have a documented script)
- The pre-registered success criteria must be committed to git *before* the recovery test is run
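The seed and stamping requirements above can be sketched as a small helper. The constant values and the names `make_rng` and `stamp` are placeholders chosen for illustration, not values fixed by this plan.

```python
from datetime import datetime, timezone

import numpy as np

# Assumption: placeholder values; the real ones live in a top-level constant.
RANDOM_SEED = 42
PIPELINE_VERSION = "0.1.0"


def make_rng(seed: int = RANDOM_SEED) -> np.random.Generator:
    """Single entry point for randomness so every draw is reproducible."""
    return np.random.default_rng(seed)


def stamp(payload: dict, confidence_tier: str) -> dict:
    """Return a copy of payload with the fields every persisted artifact needs."""
    if confidence_tier not in {"A", "B", "C"}:
        raise ValueError(f"unknown confidence tier: {confidence_tier!r}")
    return {
        **payload,
        "confidence_tier": confidence_tier,
        "created_date": datetime.now(timezone.utc).date().isoformat(),
        "pipeline_version": PIPELINE_VERSION,
    }
```

Routing every write through one stamping function makes it hard to persist a signature or profile without its tier, date, and pipeline version.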
---
## 9. Honest pitfalls (do not ignore these)
These are real risks documented during planning. They are not paranoia.
1. **Cell-composition confound in sickle cell expression data.** Whole-blood differential expression in sickle cell partly reflects different blood cell ratios, not disease biology. v1 acknowledges this; v2 should deconvolve.
2. **LINCS L1000 cell-line limitations.** L1000 signatures (978 directly measured landmark genes) were generated mostly in cancer cell lines (MCF7, A375, PC3, etc.). Signatures for non-oncology diseases may be noisy. This is a known field-wide limitation, not unique to QPharma.
3. **L-glutamine probably has no LINCS signature.** Amino acids and metabolites weren't LINCS priorities. If true, the ground-truth test only has hydroxyurea, which is weaker. Document honestly.
4. **Connectivity scoring surfaces broad-effect drugs as false positives.** HDAC inhibitors and broad kinase inhibitors often top connectivity rankings simply because they perturb many genes. Expect this; don't oversell them. The mechanistic prior in Week 3 helps filter.
5. **Hydroxyurea will probably pass the recovery test by construction.** Sickle cell + hydroxyurea is a well-studied pair. Passing this test is necessary but not sufficient to claim the platform generalizes. The next disease (when there is one) is the real test of generalization. Do not sell sickle cell results as proving the platform.
6. **The MVP has no mechanistic validation layer.** Multiple experts (Hicham, Nova In Silico) flagged that pure ML matching is not sufficient for extrapolation. The MVP knowingly omits the mechanistic layer; it's a phase-2 addition. Position the MVP as "discovery hypothesis generation," not "validated prediction."
7. **Top-ranked novel candidates have not been wet-lab validated.** They are computational hypotheses. Any "interesting candidate" surfaced in the write-up is a hypothesis to test, not a discovery. Use careful language.
---
## 10. What to do next (for the human picking this up)
### First session in Claude Code
1. Initialize the repo: `git init`, create the directory structure in section 4
2. Set up the Python environment with `pyproject.toml` (or `uv init` if using `uv`)
3. Open `notebooks/01_setup_identifiers.ipynb` and start Week 1, task 1
4. Commit early and often. Each notebook should be a separate commit when first complete.
### Pre-flight check before the recovery test (end of Week 3)
Before running the recovery test in Week 4, **commit the pre-registered success criteria to git**. This prevents post-hoc rationalization. If the criteria need to change after seeing partial results, that change must also be committed and explained.
### When the MVP is done
The deliverable to send to anyone (investor, advisor, pharma contact) is:
1. The 2-page `recovery_test_report.md`
2. The `ranked_candidates_v1.csv`
3. The signature and profile JSON/parquet files with their provenance
That's it. No slides yet. The single document is the artifact. If it passes the recovery test, you have earned the right to raise on the broader vision.
---
## 11. Strategic context (for future Claude Code sessions to understand the "why")
This MVP exists in a broader strategic context that was built through ~7 expert consultations. The key conclusions:
- **The architecture is two databases (disease signatures + drug profiles) + a matching engine.** Three independent experts described this unprompted. It is not a hypothesis; it is the standard model in the field.
- **The moat is the curated data, not the algorithm.** The matching algorithm is largely commodity (CMap is from 2006). The proprietary value is in the harmonized, curated, provenance-tracked data layer. Build accordingly.
- **The long-term technical architecture adds a knowledge graph (phase 2) and quantum-inspired optimization for combination search (phase 3).** Neither is in the MVP.
- **The go-to-market is "digital CRO for drug repurposing."** Quantum language is for investors; pharma clients hear "digital CRO."
- **The first paying buyers are R&D programme directors and BD teams in pharma**, approached either outbound (we found a match in your area) or via API/sandbox (run your shelved compounds through our engine, data stays on your side).
- **Synthetic trial arms and drug repurposing share data infrastructure.** This is a platform play, not a single product.
The MVP's job is to produce one credible result. Everything else follows from that.

README.md

@@ -0,0 +1 @@
# Reverso