Reverso MVP — Sickle Cell Repurposing Pipeline
For Claude Code: This is the project specification. Read this entire document before suggesting actions or writing code. The decisions in section "Locked decisions" have already been made by the founder after extensive expert consultation; do not re-litigate them. Where the plan calls for a choice, propose options but default to the spec.
1. Project context
What we're building
A minimum viable drug repurposing pipeline that:
- Pulls public biomedical data for sickle cell disease (one disease, deliberately scoped)
- Builds a disease signature (transcriptomic gene expression vector)
- Builds drug profiles for ~300 small-molecule compounds
- Runs CMap-style connectivity scoring to rank drugs by their potential to reverse the disease signature
- Validates via a recovery test: do the two known sickle cell drugs (hydroxyurea, L-glutamine) rank in the top of the list?
This is a proof-of-concept for a broader platform. The platform thesis is "AI-driven drug repurposing using disease signature + drug profile matching, with knowledge graphs as the long-term moat and quantum-inspired optimization as a phase-3 multiplier." None of that broader vision is in scope for this MVP. The MVP exists to produce one credible, reproducible result that proves the matching method works.
Why sickle cell
- Monogenic (HBB gene, single point mutation) — disease biology is unusually clean for a rare disease
- Rich public data — GEO has multiple expression studies, Open Targets has well-curated associations
- Two known repurposings/expansions as ground truth: hydroxyurea (originally for chronic myeloid leukemia, now standard sickle cell care) and L-glutamine (approved 2017). If the engine doesn't rank these highly, the engine is wrong.
- Strong unmet need narrative for investor conversations
What success looks like
A reproducible Jupyter notebook (or notebook set) that produces:
- A versioned sickle cell disease signature with provenance
- A drug profile dataset for ~300 compounds
- A ranked CSV of all ~300 drugs by connectivity score
- A 2-page write-up in docs/recovery_test_report.md containing: methodology, the ranks of hydroxyurea and L-glutamine (the recovery test), sanity check results for negative-control drugs, the top-10 candidates with brief mechanistic rationale, and honest limitations
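If hydroxyurea ranks in the top 10% and L-glutamine ranks in the top 25% (or is documented as unscorable for lack of a LINCS signature; see the full criteria in section 6, Week 4), the MVP passes. Pre-register this threshold before looking at the data.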
2. Locked decisions (do not re-litigate)
| Decision | Choice | Why |
|---|---|---|
| Target disease | Sickle cell disease (MONDO:0011382) | Monogenic, public data, two known ground-truth drugs |
| Drug modality | Small molecules only | Biologics are a fundamentally different problem |
| Matching method | CMap-style connectivity scoring (Lamb 2006 / Subramanian 2017) | Well-established, no training data needed, reference implementation exists (cmapPy) |
| Drug set size | ~300 compounds | Large enough to be meaningful, small enough to curate carefully |
| Patient stratification | None (one signature per disease) | Repurposing is disease-level; patient-level is the trial-design problem, out of scope |
| Quantum / quantum-inspired layer | Not in scope | Phase 3 multiplier, not relevant until classical baseline is proven |
| Knowledge graph / LLM extraction | Not in scope | Phase 2, after classical signature matching is validated |
| Build environment | Local notebook (Mac Studio, 96GB RAM) | All data fits locally; no cloud needed for MVP |
| Timeline | 2-4 weeks part-time | Cleaner than a hackathon, faster than "production" |
| Language | Python primary | Cheminformatics and bioinformatics ecosystems are mature in Python |
Things that are explicitly NOT in this MVP:
- Combination (2+ drug) matching
- Inverse/reverse matching (outcome → drugs)
- Multiple diseases
- Subtype-level signatures
- Patient/demographic stratification
- Knowledge graph construction
- LLM-based literature extraction
- Quantum / quantum-inspired optimization
- Mechanistic / ODE-based modelling
- Private pharma data ingestion
- API or productization
- Web UI
- Agentic orchestration
If a future Claude Code session is tempted to add any of these "while we're at it" — they delay the proof point that de-risks everything else. Build only what's in the spec.
3. Architecture overview
RAW DATA SOURCES
├── Open Targets (target-disease associations)
├── Orphanet / OMIM / MONDO (disease identifiers and definitions)
├── GEO (transcriptomic expression data — disease vs healthy)
├── ChEMBL (drug structures, targets, bioactivities)
├── LINCS L1000 (drug-induced expression signatures)
├── ClinicalTrials.gov (trial history, shelved compound discovery)
└── FDA labels / DailyMed (approved indications, safety)
│
▼
HARMONIZATION LAYER (Week 1-2)
├── Disease identifier resolution → canonical MONDO ID
├── Drug identity resolution → canonical InChIKey
├── Provenance + confidence tier attached to every record
│
▼
FEATURE LAYER (Week 1-2)
├── sickle_cell_signature_v1.json — disease signature vector with provenance
└── drug_profiles_v1.parquet — ~300 drug profiles with LINCS signatures
│
▼
MATCHING ENGINE (Week 3)
└── CMap connectivity scoring → ranked drug list
│
▼
VALIDATION + WRITE-UP (Week 4)
├── Recovery test: hydroxyurea + L-glutamine ranks
├── Sanity checks: negative controls rank low
└── 2-page report
Confidence tiers (critical design decision)
Every signature and drug profile carries a confidence tier:
- Tier A — measured data, peer-reviewed source, n>10 per group, recent
- Tier B — measured but small-n, older, or single-source
- Tier C — inferred / extrapolated / hypothesis-only
This is the most commercially important design decision in the whole pipeline. Sarborg's "1,700 rare disease signatures" are mostly Tier C (inferred). The platform's honesty about this is a differentiator, not a weakness. Every persisted artifact must include its tier.
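A minimal sketch of how the tier rules above might be encoded in src/provenance.py. The names Evidence and assign_tier, the field names, and the 10-year cutoff for "recent" are assumptions for illustration; the spec does not define "recent" numerically.

```python
from dataclasses import dataclass

RECENT_YEARS = 10  # assumption: the spec does not say what counts as "recent"


@dataclass(frozen=True)
class Evidence:
    measured: bool       # False for inferred / extrapolated / hypothesis-only records
    peer_reviewed: bool
    n_per_group: int     # size of the smallest comparison group
    age_years: int       # years since the source was published


def assign_tier(ev: Evidence) -> str:
    """Return 'A', 'B', or 'C' following the tier definitions above."""
    if not ev.measured:
        return "C"
    if ev.peer_reviewed and ev.n_per_group > 10 and ev.age_years <= RECENT_YEARS:
        return "A"
    # Measured, but small-n, older, or otherwise weaker.
    return "B"
```

Keeping the rule in one function means every persisted artifact gets its tier from the same logic, and the tier_rationale string can be generated alongside it.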
4. Directory structure
reverso-mvp/
├── PLAN.md # This file
├── README.md # Short project description
├── pyproject.toml # Dependencies (or requirements.txt)
├── .gitignore # Exclude data/ and notebooks checkpoints
├── data/
│ ├── raw/ # Downloaded data, never edited
│ │ ├── open_targets/
│ │ ├── geo/
│ │ ├── chembl/
│ │ └── lincs/
│ ├── processed/ # Cleaned, harmonized data
│ │ ├── sickle_cell_signature_v1.json
│ │ └── drug_profiles_v1.parquet
│ └── results/
│ └── ranked_candidates_v1.csv
├── notebooks/
│ ├── 01_setup_identifiers.ipynb
│ ├── 02_disease_signature.ipynb
│ ├── 03_drug_profiles.ipynb
│ ├── 04_connectivity_scoring.ipynb
│ └── 05_recovery_test.ipynb
├── src/
│ ├── __init__.py
│ ├── identifiers.py # MONDO, ChEMBL ID resolution
│ ├── disease.py # Signature construction
│ ├── drugs.py # Drug profile construction
│ ├── scoring.py # CMap connectivity scoring
│ └── provenance.py # Tier assignment, source tracking
├── tests/
│ └── test_scoring.py # Verify scoring against known reference
└── docs/
├── recovery_test_report.md # Final 2-page write-up
├── data_sources.md # Detailed list of where data came from
└── known_limitations.md # Honest pitfalls documented
5. Dependencies
# pyproject.toml core dependencies
python = ">=3.11,<3.13"
pandas = ">=2.0"
numpy = ">=1.24"
scipy = ">=1.11"
requests = ">=2.31"
chembl_webresource_client = ">=0.10" # ChEMBL API client
GEOparse = ">=2.0" # GEO dataset access
pydeseq2 = ">=0.4" # Differential expression in Python
cmapPy = ">=4.0" # Reference CMap connectivity implementation
pyarrow = ">=14.0" # Parquet I/O
jupyter = ">=1.0"
matplotlib = ">=3.7" # Sanity-check plots
seaborn = ">=0.13"
pydantic = ">=2.0" # Schema validation for signatures/profiles
Install with pip install -e . after creating pyproject.toml. All free, all open-source. No licensed data sources are used in the MVP.
6. Week-by-week build plan
Week 1 — Disease signature
Goal: One Tier-A signature vector for sickle cell with full provenance.
Tasks:
1. Pin identifiers (src/identifiers.py, notebooks/01_setup_identifiers.ipynb)
   - Sickle cell disease: MONDO:0011382, Orphanet:232, OMIM:603903
   - Causal gene: HBB (Ensembl ENSG00000244734, HGNC:4827)
   - Persist to data/processed/identifiers.json
2. Pull Open Targets data (notebooks/02_disease_signature.ipynb step 1)
   - Use the Open Targets Platform API or bulk Parquet download (https://platform.opentargets.org/downloads)
   - Pull: target-disease associations for MONDO:0011382, evidence sources, associated targets
   - Store raw in data/raw/open_targets/
   - This gives the "known biology" subgraph as a sanity reference
3. Identify and pull a GEO dataset (notebooks/02_disease_signature.ipynb step 2)
   - Search GEO for sickle cell expression studies with healthy controls
   - Criteria: n>=10 per group, RNA-seq or microarray, peer-reviewed publication
   - Candidates to evaluate (verify each is still available and choose the strongest):
     - GSE53441 (sickle cell vs healthy whole blood)
     - GSE35007 (sickle cell pediatric)
     - More recent studies preferred: search "sickle cell" with "Homo sapiens" filter
   - Use GEOparse to download
   - Document the chosen dataset's metadata fully in the signature provenance
4. Differential expression (notebooks/02_disease_signature.ipynb step 3)
   - For microarray data: log2-transform, normalize, use a limma-equivalent in Python (or call R via rpy2 if needed, but try to stay in Python)
   - For RNA-seq: use pydeseq2
   - Output: gene-level log fold change and adjusted p-value table
5. Build the signature (src/disease.py)
   - Take top ~250 up-regulated and top ~250 down-regulated genes by adjusted p-value (cut at q<0.05)
   - Map gene symbols to Entrez IDs and Ensembl IDs (use the mygene package or pyensembl)
   - Persist to data/processed/sickle_cell_signature_v1.json with schema:

         {
           "signature_id": "sickle_cell_v1",
           "disease_mondo_id": "MONDO:0011382",
           "up_regulated": [{"gene": "HBG2", "entrez_id": "3048", "log_fc": 2.1, "qvalue": 1e-8}, ...],
           "down_regulated": [...],
           "provenance": {
             "geo_accession": "GSE53441",
             "n_disease": 27,
             "n_healthy": 12,
             "platform": "Affymetrix HG-U133 Plus 2.0",
             "method": "limma",
             "created_date": "2026-..."
           },
           "confidence_tier": "A",
           "tier_rationale": "Measured RNA expression, n>10/group, peer-reviewed dataset",
           "limitations": ["Whole-blood expression confounded by cell composition differences", ...]
         }
Honest pitfall to document: Sickle cell whole-blood expression is partly driven by cell composition differences (different RBC/WBC ratios in patients vs controls), not just disease state. Note this in the signature's limitations field. A v2 would do cell-type deconvolution. v1 does not.
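The gene-selection rule in the "Build the signature" task can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of src/disease.py: the function name build_signature and the row keys (gene, log_fc, qvalue) are hypothetical, chosen to mirror the JSON schema above.

```python
def build_signature(de_rows, n_per_side=250, q_cut=0.05):
    """Pick the top up- and down-regulated genes from a differential expression table.

    de_rows: iterable of dicts with keys 'gene', 'log_fc', 'qvalue' (assumed layout).
    Genes are filtered at q < q_cut, then ranked by adjusted p-value within each direction.
    """
    significant = [r for r in de_rows if r["qvalue"] < q_cut]
    by_q = sorted(significant, key=lambda r: r["qvalue"])
    up = [r for r in by_q if r["log_fc"] > 0][:n_per_side]
    down = [r for r in by_q if r["log_fc"] < 0][:n_per_side]
    return {"up_regulated": up, "down_regulated": down}
```

If fewer than 250 genes pass the q-value cut in one direction, the list is simply shorter; that count is worth recording in the provenance block.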
Week 2 — Drug profiles
Goal: ~300 drug profiles with structure, targets, and LINCS expression signatures where available.
Tasks:
1. Curate the drug set deliberately (notebooks/03_drug_profiles.ipynb step 1)
   - Ground truth (n=2, non-negotiable): hydroxyurea (ChEMBL:CHEMBL467), L-glutamine (ChEMBL:CHEMBL930)
   - Related-mechanism drugs (n~50): HbF inducers, anti-inflammatory drugs studied in sickle cell, NO donors, antioxidants, drugs in current sickle cell clinical trials (search ClinicalTrials.gov for "sickle cell" interventional trials)
   - Negative controls (n~50): Drugs from unrelated areas such as antifungals, contraceptives, antihistamines for non-sickle indications, antibiotics. These should rank low; if they don't, the method has a bias to diagnose.
   - General sample (n~200): Randomly sampled drugs from the LINCS L1000 catalog (which covers roughly 20,000 small-molecule perturbagens as of the 2017 release). Use a fixed random seed for reproducibility.
   - Store the curated list with the reason for inclusion in data/processed/drug_set_v1.csv
2. Pull ChEMBL data for each drug (notebooks/03_drug_profiles.ipynb step 2, src/drugs.py)
   - Use chembl_webresource_client to fetch: ChEMBL ID, preferred name, InChIKey, canonical SMILES, known mechanisms of action, target list with bioactivity
   - Resolve drug name aliases to canonical ChEMBL IDs
   - Store raw responses in data/raw/chembl/
3. Pull LINCS L1000 signatures (notebooks/03_drug_profiles.ipynb step 3)
   - LINCS data portal: https://clue.io/data/CMap2020
   - Use the Level 5 signatures (MODZ aggregation across replicates). These are per cell line, dose, and time point, so collapsing them to one profile per drug needs an explicit, documented aggregation rule.
   - Format: 978 landmark genes × drugs, with z-scored expression changes
   - Critical honest note: For amino acids and metabolites like L-glutamine, LINCS coverage may be missing. If a ground-truth drug lacks a signature, document this in docs/known_limitations.md. The fallback for that specific drug is to flag it as "no signature available, would require mechanism-graph fallback in v2."
   - Store in data/raw/lincs/
4. Assemble the drug profile table (src/drugs.py)
   - One row per drug
   - Columns: chembl_id, name, inchikey, smiles, targets (list), mechanism_of_action, lincs_signature (978-vector or null), source_provenance, confidence_tier
   - Persist to data/processed/drug_profiles_v1.parquet
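The "general sample" step in the drug-set curation task depends on a fixed seed. A minimal sketch of one way to make that draw repeatable; SEED, sample_general_set, and the drug identifiers are placeholders, not names from the spec.

```python
import random

SEED = 20260101  # hypothetical top-level constant; any fixed integer works


def sample_general_set(catalog, excluded, n=200, seed=SEED):
    """Draw n drugs from the catalog, skipping those already curated by hand.

    The pool is sorted first so the result does not depend on input order,
    and a private Random instance keeps the draw independent of global state.
    """
    pool = sorted(set(catalog) - set(excluded))
    rng = random.Random(seed)
    return rng.sample(pool, min(n, len(pool)))
```

Sorting before sampling matters: without it, the same seed can yield a different sample if the catalog file is re-downloaded in a different row order.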
Week 3 — Connectivity scoring (the matching engine)
Goal: A ranked CSV of all ~300 drugs by their connectivity score against the sickle cell signature.
Tasks:
1. Implement CMap connectivity scoring (src/scoring.py, notebooks/04_connectivity_scoring.ipynb)
   - Use cmapPy as the reference toolkit (the Broad Institute's Python package for L1000 GCT/GCTX data); confirm that it exposes the scoring routine needed before depending on it, and otherwise implement the statistic from Subramanian et al. 2017 in src/scoring.py
   - Method: weighted Kolmogorov-Smirnov-based enrichment. For each drug, the score answers: how strongly does this drug's expression signature reverse the disease's up- and down-regulated gene sets?
   - Strongly negative connectivity scores = strong reversal = candidate match
   - Reference: Lamb et al. 2006 (Science); Subramanian et al. 2017 (Cell), the L1000 paper
2. Compute scores for all drugs (notebooks/04_connectivity_scoring.ipynb)
   - Map between the disease signature genes (potentially full-genome) and the LINCS 978 landmark genes; only the intersection is scored. Document the gene overlap count; this matters.
   - For drugs without LINCS signatures (e.g., L-glutamine likely): mark explicitly as "not scored, no signature available." Do not skip silently.
   - Output: data/results/ranked_candidates_v1.csv with columns: rank, drug_name, chembl_id, connectivity_score, normalized_score, p_value (if available), inclusion_reason, known_targets, mechanism_summary
3. Build a secondary mechanistically-weighted ranking (notebooks/04_connectivity_scoring.ipynb)
   - For each drug, compute a prior weight based on whether its known targets are in sickle cell-relevant pathways (HbF regulation, hemoglobin, NO signaling, inflammation, oxidative stress)
   - Produce a second ranking blending connectivity score with mechanistic prior
   - Showing both raw and prior-weighted rankings is honest and informative
4. Write a unit test (tests/test_scoring.py)
   - Use a reference example from the CMap paper or cmapPy documentation
   - Verify the implementation matches the reference within tolerance
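To make the scoring step concrete, here is a simplified pure-Python sketch of a weighted, KS-style connectivity score in the spirit of Subramanian et al. 2017. It is not cmapPy's API and not a validated reimplementation; the function names and the dict-based input format are assumptions, and normalization and p-values are left out.

```python
def enrichment_score(ranked_genes, weights, gene_set, p=1.0):
    """GSEA-style weighted running-sum statistic.

    ranked_genes: genes ordered from most up- to most down-regulated by the drug.
    weights: dict gene -> drug z-score. Returns the signed maximum deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    hit_total = sum(abs(weights[g]) ** p for g, h in zip(ranked_genes, hits) if h)
    if hit_total == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g, h in zip(ranked_genes, hits):
        running += abs(weights[g]) ** p / hit_total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def connectivity_score(drug_z, up_genes, down_genes):
    """Negative = the drug reverses the disease signature; positive = it mimics it."""
    up = set(up_genes) & set(drug_z)      # only genes measured for the drug are scored
    down = set(down_genes) & set(drug_z)
    ranked = sorted(drug_z, key=drug_z.get, reverse=True)
    es_up = enrichment_score(ranked, drug_z, up)
    es_down = enrichment_score(ranked, drug_z, down)
    if es_up * es_down > 0:               # same sign: no coherent connection
        return 0.0
    return (es_up - es_down) / 2.0
```

A toy check of the sign convention: a drug that pushes the disease's up-regulated genes to the bottom of its ranking and the down-regulated genes to the top should score close to -1, and swapping the two gene sets should flip the sign.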
Week 4 — Recovery test and write-up
Goal: A 2-page document a sceptical pharma scientist can evaluate in 5 minutes.
Critical: Before looking at the rankings, pre-register the success criteria in writing:
"The MVP passes if hydroxyurea ranks in the top 10% (top 30 of 300) AND L-glutamine either ranks in the top 25% (top 75) OR is documented as unscorable due to missing LINCS signature. At least 4 of 5 negative-control drugs must rank in the bottom half."
Tasks:
1. Run the recovery test (notebooks/05_recovery_test.ipynb)
   - Pull the ranks of hydroxyurea and L-glutamine from ranked_candidates_v1.csv
   - Pull the ranks of 5 pre-specified negative controls
   - Compute pass/fail against the pre-registered criteria
2. Examine the top 10 (notebooks/05_recovery_test.ipynb)
   - For each of the top 10 candidates, write a one-sentence mechanistic rationale (or note "no obvious rationale; possible false positive")
   - Identify the single most interesting non-obvious candidate
   - Many top-10 candidates will look mechanistically silly (HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due to widespread expression effects); document this honestly
3. Write docs/recovery_test_report.md (~2 pages)
   - Section 1, Methodology: What was built, in 5-6 sentences, with the GEO dataset, drug set composition, and scoring method named
   - Section 2, Recovery test result: Did hydroxyurea and L-glutamine pass? Did negative controls behave correctly? Pass/fail against pre-registered criteria
   - Section 3, Top 10 candidates: Brief table with each candidate, score, known mechanism, and a sentence on biological plausibility
   - Section 4, One non-obvious candidate worth investigating: A single paragraph on the most interesting result
   - Section 5, Honest limitations: Cell-composition confound, L1000 cell-line limitations, missing signatures, no mechanistic validation layer
   - Section 6, What v2 would fix: Cell-type deconvolution, knowledge graph for missing-signature drugs, second disease to test generalization
4. Document data sources fully (docs/data_sources.md)
   - Every data source, version, download date, and license
   - This is the artifact that proves reproducibility
5. Document known limitations (docs/known_limitations.md)
   - The honest list of what would break this MVP at scale or in a different disease
   - Useful for the next pharma conversation: "yes, we know these are limitations, here's how v2 addresses them"
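The pre-registered Week 4 criteria can be written down as code before any rankings exist, which makes the pass/fail call mechanical. A minimal sketch; evaluate_recovery and its input format are hypothetical, and a rank of None stands for "documented as unscorable".

```python
def evaluate_recovery(ranks, negative_controls, n_drugs=300):
    """Apply the pre-registered criteria to a dict of drug name -> 1-based rank.

    A rank of None means the drug had no LINCS signature and was not scored;
    for L-glutamine that counts as a pass only if it is documented as such.
    """
    hu = ranks.get("hydroxyurea")
    glu = ranks.get("L-glutamine")
    hu_ok = hu is not None and hu <= n_drugs // 10        # top 10%
    glu_ok = glu is None or glu <= n_drugs // 4           # top 25% or unscorable
    n_low = sum(
        1 for d in negative_controls
        if ranks.get(d) is not None and ranks[d] > n_drugs / 2
    )
    neg_ok = n_low >= 4                                    # at least 4 of 5 in bottom half
    return {
        "hydroxyurea_ok": hu_ok,
        "l_glutamine_ok": glu_ok,
        "negative_controls_ok": neg_ok,
        "passed": hu_ok and glu_ok and neg_ok,
    }
```

Committing a function like this alongside the written criteria, before notebook 05 is run, leaves little room for post-hoc reinterpretation of the thresholds.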
7. Data sources reference
| Source | URL | Access | License | Use in MVP |
|---|---|---|---|---|
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics |
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets |
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) |
Licensing note for LINCS: Read the LINCS data use terms before commercial use. For the MVP (research/proof-of-concept), the terms are permissive. For productization, this needs legal review.
8. Reproducibility requirements
This is a science artifact, not a hack. Reproducibility is the whole point.
- All data downloads must record date and version
- All randomness must use a fixed seed (set in a top-level constant)
- All signature and profile files must include created_date and pipeline_version
- Every notebook must run end-to-end from a fresh checkout without manual intervention (other than downloading the raw data files, which have a documented script)
- The pre-registered success criteria must be committed to git before the recovery test is run
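One way to satisfy the download-date, version, and seed requirements above is to stamp every raw file with a small provenance record. A sketch only: the function name, the constants, and the extra sha256 field (added so a later re-download can be compared byte-for-byte) are assumptions, not part of the spec.

```python
import hashlib
from datetime import date

PIPELINE_VERSION = "0.1.0"  # hypothetical value
RANDOM_SEED = 42            # hypothetical top-level seed constant


def provenance_record(source, source_version, payload: bytes, download_date=None):
    """Build the metadata to store next to a downloaded file."""
    return {
        "source": source,
        "source_version": source_version,
        "download_date": (download_date or date.today()).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "pipeline_version": PIPELINE_VERSION,
        "random_seed": RANDOM_SEED,
    }
```

Writing this record as a JSON sidecar next to each file in data/raw/ would also give docs/data_sources.md something to be generated from rather than maintained by hand.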
9. Honest pitfalls (do not ignore these)
These are real risks documented during planning. They are not paranoia.
1. Cell-composition confound in sickle cell expression data. Whole-blood differential expression in sickle cell partly reflects different blood cell ratios, not disease biology. v1 acknowledges this; v2 should deconvolve.
2. LINCS L1000 cell-line limitations. The 978 landmark genes were measured mostly in cancer cell lines (MCF7, A375, PC3, etc.). Signatures for non-oncology diseases may be noisy. This is a known field-wide limitation, not unique to Reverso.
3. L-glutamine probably has no LINCS signature. Amino acids and metabolites weren't LINCS priorities. If true, the ground-truth test only has hydroxyurea, which is weaker. Document honestly.
4. Connectivity scoring surfaces broad-effect drugs as false positives. HDAC inhibitors and broad kinase inhibitors often top connectivity rankings simply because they perturb many genes. Expect this; don't oversell them. The mechanistic prior in Week 3 helps filter.
5. Hydroxyurea will probably pass the recovery test by construction. Sickle cell + hydroxyurea is a well-studied pair. Passing this test is necessary but not sufficient to claim the platform generalizes. The next disease (when there is one) is the real test of generalization. Do not sell sickle cell results as proving the platform.
6. The MVP has no mechanistic validation layer. Multiple experts (Hicham, Nova In Silico) flagged that pure ML matching is not sufficient for extrapolation. The MVP knowingly omits the mechanistic layer; it's a phase-2 addition. Position the MVP as "discovery hypothesis generation," not "validated prediction."
7. Top-ranked novel candidates have not been wet-lab validated. They are computational hypotheses. Any "interesting candidate" surfaced in the write-up is a hypothesis to test, not a discovery. Use careful language.
10. What to do next (for the human picking this up)
First session in Claude Code
- Initialize the repo: git init, create the directory structure in section 4
- Set up the Python environment with pyproject.toml (or uv init if using uv)
- Open notebooks/01_setup_identifiers.ipynb and start Week 1, task 1
- Commit early and often. Each notebook should be a separate commit when first complete.
Pre-flight check before the recovery test (end of Week 3)
Before running the recovery test in Week 4, commit the pre-registered success criteria to git. This prevents post-hoc rationalization. If the criteria need to change after seeing partial results, that change must also be committed and explained.
When the MVP is done
The deliverable to send to anyone (investor, advisor, pharma contact) is:
- The 2-page recovery_test_report.md
- The ranked_candidates_v1.csv
- The signature and profile JSON/parquet files with their provenance
That's it. No slides yet. The single document is the artifact. If it passes the recovery test, you have earned the right to raise on the broader vision.
11. Strategic context (for future Claude Code sessions to understand the "why")
This MVP exists in a broader strategic context that was built through ~7 expert consultations. The key conclusions:
- The architecture is two databases (disease signatures + drug profiles) + a matching engine. Three independent experts described this unprompted. It is not a hypothesis; it is the standard model in the field.
- The moat is the curated data, not the algorithm. The matching algorithm is largely commodity (CMap is from 2006). The proprietary value is in the harmonized, curated, provenance-tracked data layer. Build accordingly.
- The long-term technical architecture adds a knowledge graph (phase 2) and quantum-inspired optimization for combination search (phase 3). Neither is in the MVP.
- The go-to-market is "digital CRO for drug repurposing." Quantum language is for investors; pharma clients hear "digital CRO."
- The first paying buyers are R&D programme directors and BD teams in pharma, approached either outbound (we found a match in your area) or via API/sandbox (run your shelved compounds through our engine, data stays on your side).
- Synthetic trial arms and drug repurposing share data infrastructure. This is a platform play, not a single product.
The MVP's job is to produce one credible result. Everything else follows from that.
12. Phase 2 track — Structure-based binding (scoped 2026-06-23)
Status: scoped, not committed. This is a follow-on track proposed after the MVP and its follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what those experiments empirically showed. Read §9–11 and the experiment commits first.
12.1 Why pivot modality (the evidence, not a hunch)
The expression-connectivity approach was built, validated, and pushed hard (gene-space expansion, cell-composition deconvolution, reference-library tau, supervised learning). The empirical verdict:
- Connectivity recovers hydroxyurea (top ~6–8%) but cannot achieve specificity — unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Unfixed by four independent methods.
- A supervised model on indication labels hit 0.925 CV AUC — but it was a degree-bias mirage: it learned drug popularity, not disease matching (it ranked hydroxyurea 231/300).
- The decisive test: with drug-popularity features removed, the model trained on the actual drug↔disease connectivity scored AUC 0.491 — pure chance. The expression-connectivity modality contains essentially no disease-specific therapeutic signal for this task.
This is a signal problem, not a model problem — no amount of model sophistication (diffusion, GNNs, etc.) extracts signal that isn't in the data. The response is to change modality to one with a strong, physical, drug-specific signal: does a molecule bind a sickle-relevant target? A drug that binds HbS is mechanistically specific by construction — the opposite of a coincidental expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is itself a diffusion model) actually has traction, because structure has physical ground truth.
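For readers less familiar with the metric behind the "pure chance" verdict: AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, so 0.5 means the scores carry no ranking information. A minimal illustration of that definition (not the code used in the experiments):

```python
def auc(scores_pos, scores_neg):
    """Pairwise (Mann-Whitney) AUC: fraction of positive/negative pairs ranked correctly.

    Ties count as half a win. 1.0 is perfect separation, 0.5 is chance.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Read this way, an AUC of 0.491 says a true drug-disease pair outscores a false one about as often as a coin flip would predict.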
12.2 Targets (sickle-specific, druggable, structurally characterised)
Small molecules only (§2). Curated shortlist with public structures and, crucially, known small-molecule binders to serve as positive controls:
| Target | Mechanism in sickle | Known binder (positive control) |
|---|---|---|
| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | voxelotor (binds α-Val1) |
| PKR (PKLR, red-cell pyruvate kinase) | Activator → ↓2,3-BPG → ↑O2 affinity | mitapivat, etavopivat |
| DNMT1 | HbF induction (de-repress γ-globin) | decitabine, azacitidine |
| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
| HDAC1/2 | HbF induction | vorinostat, panobinostat |
| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
| PDE9 | ↑cGMP, anti-adhesion | PF-04447943 (sickle trial) |
Hard/excluded for v1: BCL11A (transcription factor, no classic pocket — the γ-globin master repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.
12.3 Method (baseline → generative co-folding)
- Prepare structures. Pull target structures from the PDB; AF3/Boltz-predict any missing.
- Prepare ligands. Reuse the existing ~300-drug set (we already have canonical SMILES from ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
- Dock + score, in increasing sophistication:
- Baseline: classical docking (AutoDock Vina / smina) — fast, CPU, well-understood.
- Generative co-folding: an open AlphaFold3-class model — Boltz-2 (predicts the protein–ligand complex and a binding-affinity estimate, MIT-licensed), Chai-1, or DiffDock (a diffusion model for docking — the legitimate home for the "diffusion" instinct). These predict the bound pose directly and tend to beat classical docking.
- Report both; the baseline keeps us honest about whether the ML model adds anything.
12.4 Validation (a real recovery test, like §6 Week 4)
Pre-register before scoring: the known structure-based sickle drugs must rank as top binders to their targets — voxelotor→hemoglobin, mitapivat→PKR, decitabine→DNMT1. Negative controls (unrelated drugs) must not bind these pockets. This is a cleaner recovery test than the expression one, because binding is mechanistically specific — it should not have the coincidental-reverser problem that sank the connectivity approach.
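The binding recovery test can be phrased as a small check over a table of docking scores. A sketch under stated assumptions: the function name, the nested-dict input format, and the top_k cutoff are hypothetical (the spec says "top binders" without fixing a threshold), and scores follow the Vina-style convention that more negative means stronger predicted binding.

```python
def binding_recovery(scores, positives, negatives, top_k=10):
    """Check that each known binder ranks near the top for its target.

    scores: dict target -> dict ligand -> docking score (more negative = better).
    positives: dict target -> name of the known binder for that target.
    negatives: iterable of negative-control ligand names.
    """
    report = {}
    for target, ligand in positives.items():
        ranked = sorted(scores[target], key=scores[target].get)  # best binder first
        top = set(ranked[:top_k])
        report[target] = {
            "positive_rank": ranked.index(ligand) + 1,
            "positive_in_top_k": ligand in top,
            "negatives_in_top_k": sorted(top & set(negatives)),
        }
    return report
```

As with the expression recovery test, top_k and the list of negatives should be fixed and committed before any scores are computed.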
12.5 The real prize — integrate, don't replace
The long-term value is both modalities together: a candidate that reverses the disease signature (expression) and binds a sickle-relevant target (structure) is far more credible than either alone. Structure supplies the specificity the expression layer lacks; expression supplies the systems-level, target-agnostic view structure lacks. The platform thesis (§11) — two databases + a matching engine — extends naturally to a third (structures) feeding the same confidence-tiered data layer.
12.6 Honest pitfalls (do not ignore)
- Binding ≠ efficacy. A molecule can bind and do nothing therapeutic. Structure-based hits are still hypotheses (cf. §9.7).
- Only covers the enzyme/pocket subset. Sickle's biggest lever (γ-globin reactivation via BCL11A) is largely transcriptional and not small-molecule-tractable — structure-based screening is blind to it. Be explicit about coverage.
- Docking/affinity accuracy is limited. Pose prediction is decent; absolute affinity is hard. Validate on known binders before trusting novel scores.
- Compute. AF3-class models are GPU-heavy; the local Mac Studio (§2) may not suffice — this track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
- Moat. Structures and tools are public; the proprietary value is the curated target list, the integration with the expression layer, and provenance/tiering, not the docking software.
12.7 Explicitly NOT in this track
Free energy perturbation / MD-based affinity; covalent docking; de novo generation of molecules as final candidates to synthesise (design, not repurposing — but see §12.9 for the generate-then-retrieve hybrid, which is repurposing); BCL11A or any non-pocket target; biologics; combination binding.
12.8 Open decisions before committing
- Tooling: classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend: baseline first, for an honest reference — the lesson of the whole expression arc.)
- Compute: secure a GPU environment (the "all local" §2 assumption breaks here).
- Scope of v1: the 7-target shortlist above, or start with just Hb + PKR (the two with the cleanest positive controls) to de-risk the harness before scaling targets.
12.9 Door left open — generative-guided retrieval (generate → match existing)
A legitimate way to bring generative models into the repurposing frame (vs the design frame excluded in §12.7): don't generate molecules to synthesise — generate them as a search beacon.
The idea. Use a pocket-conditioned generative model (target-conditioned diffusion — e.g. TargetDiff, DiffSBDD, Pocket2Mol) to propose idealised binders for a sickle target. Then retrieve the nearest existing drugs to those proposals by chemical similarity (Tanimoto over Morgan fingerprints, or a learned molecular embedding). The retrieved neighbours — real, already-approved or clinical molecules — are the repurposing candidates. The generated molecule is never made; it only defines a region of chemical space worth searching. This is the user-proposed framing and it is sound: "generate the ideal, then find what we already have nearby."
Why it could add value. It can point at scaffolds / regions a known-binder-seeded or brute-force docking sweep would miss, and it makes the target's binding requirements explicit as geometry rather than as a single reference ligand.
Honest caveats (why it's a door, not a commitment).
- Generated molecules are often synthetically unrealistic / invalid — which is exactly why they must be used only as beacons, never as candidates.
- Similarity ≠ activity. Activity cliffs mean a near-neighbour of a good binder can be inert. So retrieved neighbours do not bypass validation — they must still be docked/scored (§12.3) and clear the binding recovery test (§12.4). The generative step proposes; it does not prove.
- Marginal-value question. Directly docking the whole existing library (§12.3) already covers chemical space. Whether generate-then-retrieve beats that — by efficiency or by surfacing non-obvious scaffolds — is an open empirical question and needs a head-to-head before it earns real investment.
- Only as good as the pocket conditioning of the generator, and the chemistry of the target.
Status: explore only after the §12.3–12.4 docking harness works and is validated on the known binders — same discipline as everywhere else: prove the baseline, then test whether the fancier method adds anything.
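The retrieval half of generate-then-retrieve is simple enough to sketch without any chemistry library. Here fingerprints are represented as sets of on-bit indices; in practice they would come from a Morgan fingerprint implementation such as RDKit's, which is an assumption and not something this spec fixes. The function names and parameters are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_neighbours(beacon_fp, library_fps, k=5, min_sim=0.0):
    """Return the k existing drugs most similar to a generated beacon molecule.

    library_fps: dict drug name -> frozenset of on-bit indices.
    Ties are broken by name so the output is deterministic.
    """
    scored = [(name, tanimoto(beacon_fp, fp)) for name, fp in library_fps.items()]
    scored = [s for s in scored if s[1] >= min_sim]
    return sorted(scored, key=lambda s: (-s[1], s[0]))[:k]
```

The output is a shortlist only: per the caveats above, each retrieved neighbour would still go through docking (§12.3) and the binding recovery test (§12.4) before being called a candidate.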