Reverso/PLAN.md
Junior B. 6c2c71d73d PLAN §12.9: leave door open for generative-guided retrieval
Reframe de novo generation into the repurposing frame per the founder's
idea: use a pocket-conditioned generative model (TargetDiff/DiffSBDD/
Pocket2Mol) to propose an idealised binder as a SEARCH BEACON, then
retrieve the nearest EXISTING drugs by chemical similarity (Tanimoto/
embedding) as repurposing candidates — the generated molecule is never
synthesised.

Caveats kept honest: generated molecules used only as beacons (often
synthetically invalid); similarity != activity, so retrieved neighbours
still must be docked + pass the binding recovery test; open question
whether it beats brute-force docking the existing library. Explore only
after the §12.3-12.4 docking baseline is validated. §12.7 exclusion
reworded to point here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:43:25 +02:00


Reverso MVP — Sickle Cell Repurposing Pipeline

For Claude Code: This is the project specification. Read this entire document before suggesting actions or writing code. The decisions in section "Locked decisions" have already been made by the founder after extensive expert consultation; do not re-litigate them. Where the plan calls for a choice, propose options but default to the spec.


1. Project context

What we're building

A minimum viable drug repurposing pipeline that:

  1. Pulls public biomedical data for sickle cell disease (one disease, deliberately scoped)
  2. Builds a disease signature (transcriptomic gene expression vector)
  3. Builds drug profiles for ~300 small-molecule compounds
  4. Runs CMap-style connectivity scoring to rank drugs by their potential to reverse the disease signature
  5. Validates via a recovery test: do the two known sickle cell drugs (hydroxyurea, L-glutamine) rank in the top of the list?

This is a proof-of-concept for a broader platform. The platform thesis is "AI-driven drug repurposing using disease signature + drug profile matching, with knowledge graphs as the long-term moat and quantum-inspired optimization as a phase-3 multiplier." None of that broader vision is in scope for this MVP. The MVP exists to produce one credible, reproducible result that proves the matching method works.

Why sickle cell

  • Monogenic (HBB gene, single point mutation) — disease biology is unusually clean for a rare disease
  • Rich public data — GEO has multiple expression studies, Open Targets has well-curated associations
  • Two known repurposings/expansions as ground truth: hydroxyurea (originally for chronic myeloid leukemia, now standard sickle cell care) and L-glutamine (approved 2017). If the engine doesn't rank these highly, the engine is wrong.
  • Strong unmet need narrative for investor conversations

What success looks like

A reproducible Jupyter notebook (or notebook set) that produces:

  1. A versioned sickle cell disease signature with provenance
  2. A drug profile dataset for ~300 compounds
  3. A ranked CSV of all ~300 drugs by connectivity score
  4. A 2-page write-up in docs/recovery_test_report.md containing: methodology, the ranks of hydroxyurea and L-glutamine (the recovery test), sanity check results for negative-control drugs, the top-10 candidates with brief mechanistic rationale, and honest limitations

If hydroxyurea ranks in the top 10% and L-glutamine ranks in the top 25%, the MVP passes. Pre-register this threshold before looking at the data.


2. Locked decisions (do not re-litigate)

| Decision | Choice | Why |
| --- | --- | --- |
| Target disease | Sickle cell disease (MONDO:0011382) | Monogenic, public data, two known ground-truth drugs |
| Drug modality | Small molecules only | Biologics are a fundamentally different problem |
| Matching method | CMap-style connectivity scoring (Lamb 2006 / Subramanian 2017) | Well-established, no training data needed, reference implementation exists (cmapPy) |
| Drug set size | ~300 compounds | Large enough to be meaningful, small enough to curate carefully |
| Patient stratification | None (one signature per disease) | Repurposing is disease-level; patient-level is the trial-design problem, out of scope |
| Quantum / quantum-inspired layer | Not in scope | Phase 3 multiplier, not relevant until classical baseline is proven |
| Knowledge graph / LLM extraction | Not in scope | Phase 2, after classical signature matching is validated |
| Build environment | Local notebook (Mac Studio, 96GB RAM) | All data fits locally; no cloud needed for MVP |
| Timeline | 2-4 weeks part-time | Cleaner than a hackathon, faster than "production" |
| Language | Python primary | Cheminformatics and bioinformatics ecosystems are mature in Python |

Things that are explicitly NOT in this MVP:

  • Combination (2+ drug) matching
  • Inverse/reverse matching (outcome → drugs)
  • Multiple diseases
  • Subtype-level signatures
  • Patient/demographic stratification
  • Knowledge graph construction
  • LLM-based literature extraction
  • Quantum / quantum-inspired optimization
  • Mechanistic / ODE-based modelling
  • Private pharma data ingestion
  • API or productization
  • Web UI
  • Agentic orchestration

If a future Claude Code session is tempted to add any of these "while we're at it" — they delay the proof point that de-risks everything else. Build only what's in the spec.


3. Architecture overview

RAW DATA SOURCES
├── Open Targets (target-disease associations)
├── Orphanet / OMIM / MONDO (disease identifiers and definitions)
├── GEO (transcriptomic expression data — disease vs healthy)
├── ChEMBL (drug structures, targets, bioactivities)
├── LINCS L1000 (drug-induced expression signatures)
├── ClinicalTrials.gov (trial history, shelved compound discovery)
└── FDA labels / DailyMed (approved indications, safety)
                        │
                        ▼
HARMONIZATION LAYER (Week 1-2)
├── Disease identifier resolution → canonical MONDO ID
├── Drug identity resolution → canonical InChIKey
└── Provenance + confidence tier attached to every record
                        │
                        ▼
FEATURE LAYER (Week 1-2)
├── sickle_cell_signature_v1.json — disease signature vector with provenance
└── drug_profiles_v1.parquet — ~300 drug profiles with LINCS signatures
                        │
                        ▼
MATCHING ENGINE (Week 3)
└── CMap connectivity scoring → ranked drug list
                        │
                        ▼
VALIDATION + WRITE-UP (Week 4)
├── Recovery test: hydroxyurea + L-glutamine ranks
├── Sanity checks: negative controls rank low
└── 2-page report

Confidence tiers (critical design decision)

Every signature and drug profile carries a confidence tier:

  • Tier A — measured data, peer-reviewed source, n>10 per group, recent
  • Tier B — measured but small-n, older, or single-source
  • Tier C — inferred / extrapolated / hypothesis-only

This is the most commercially important design decision in the whole pipeline. Sarborg's "1,700 rare disease signatures" are mostly Tier C (inferred). The platform's honesty about this is a differentiator, not a weakness. Every persisted artifact must include its tier.
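
The tier rules above could be encoded as a small function in src/provenance.py. The following is a minimal sketch, not the project's actual implementation: the `Evidence` fields, the `assign_tier` name, and the 10-year cutoff for "recent" are all assumptions made for illustration (the spec does not define "recent" numerically), and the single-source criterion is not modelled.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(str, Enum):
    A = "A"  # measured, peer-reviewed, n>10 per group, recent
    B = "B"  # measured but small-n, older, or single-source
    C = "C"  # inferred / extrapolated / hypothesis-only


@dataclass(frozen=True)
class Evidence:
    """Hypothetical summary of the evidence behind one artifact."""
    measured: bool
    peer_reviewed: bool
    min_n_per_group: int
    year: int


def assign_tier(ev: Evidence, current_year: int, max_age_years: int = 10) -> Tier:
    """Map evidence to a tier. max_age_years is an assumed cutoff for 'recent'."""
    if not ev.measured:
        return Tier.C  # anything inferred or extrapolated is Tier C
    recent = (current_year - ev.year) <= max_age_years
    if ev.peer_reviewed and ev.min_n_per_group > 10 and recent:
        return Tier.A
    return Tier.B  # measured, but fails at least one Tier A condition
```

Because `Tier` subclasses `str`, `Tier.A.value` serialises directly as `"A"` into the `confidence_tier` field of a persisted artifact.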


4. Directory structure

reverso-mvp/
├── PLAN.md                          # This file
├── README.md                        # Short project description
├── pyproject.toml                   # Dependencies (or requirements.txt)
├── .gitignore                       # Exclude data/ and notebooks checkpoints
├── data/
│   ├── raw/                         # Downloaded data, never edited
│   │   ├── open_targets/
│   │   ├── geo/
│   │   ├── chembl/
│   │   └── lincs/
│   ├── processed/                   # Cleaned, harmonized data
│   │   ├── sickle_cell_signature_v1.json
│   │   └── drug_profiles_v1.parquet
│   └── results/
│       └── ranked_candidates_v1.csv
├── notebooks/
│   ├── 01_setup_identifiers.ipynb
│   ├── 02_disease_signature.ipynb
│   ├── 03_drug_profiles.ipynb
│   ├── 04_connectivity_scoring.ipynb
│   └── 05_recovery_test.ipynb
├── src/
│   ├── __init__.py
│   ├── identifiers.py               # MONDO, ChEMBL ID resolution
│   ├── disease.py                   # Signature construction
│   ├── drugs.py                     # Drug profile construction
│   ├── scoring.py                   # CMap connectivity scoring
│   └── provenance.py                # Tier assignment, source tracking
├── tests/
│   └── test_scoring.py              # Verify scoring against known reference
└── docs/
    ├── recovery_test_report.md      # Final 2-page write-up
    ├── data_sources.md              # Detailed list of where data came from
    └── known_limitations.md         # Honest pitfalls documented

5. Dependencies

# pyproject.toml core dependencies
python = ">=3.11,<3.13"
pandas = ">=2.0"
numpy = ">=1.24"
scipy = ">=1.11"
requests = ">=2.31"
chembl_webresource_client = ">=0.10"   # ChEMBL API client
GEOparse = ">=2.0"                      # GEO dataset access
pydeseq2 = ">=0.4"                      # Differential expression in Python
cmapPy = ">=4.0"                        # Reference CMap connectivity implementation
pyarrow = ">=14.0"                      # Parquet I/O
jupyter = ">=1.0"
matplotlib = ">=3.7"                    # Sanity-check plots
seaborn = ">=0.13"
pydantic = ">=2.0"                      # Schema validation for signatures/profiles

Install with pip install -e . after creating pyproject.toml. All free, all open-source. No licensed data sources are used in the MVP.


6. Week-by-week build plan

Week 1 — Disease signature

Goal: One Tier-A signature vector for sickle cell with full provenance.

Tasks:

  1. Pin identifiers (src/identifiers.py, notebooks/01_setup_identifiers.ipynb)

    • Sickle cell disease: MONDO:0011382, Orphanet:232, OMIM:603903
    • Causal gene: HBB (Ensembl ENSG00000244734, HGNC:4827)
    • Persist to data/processed/identifiers.json
  2. Pull Open Targets data (notebooks/02_disease_signature.ipynb step 1)

    • Use the Open Targets Platform API or bulk Parquet download (https://platform.opentargets.org/downloads)
    • Pull: target-disease associations for MONDO:0011382, evidence sources, associated targets
    • Store raw in data/raw/open_targets/
    • This gives the "known biology" subgraph as a sanity reference
  3. Identify and pull a GEO dataset (notebooks/02_disease_signature.ipynb step 2)

    • Search GEO for sickle cell expression studies with healthy controls
    • Criteria: n>=10 per group, RNA-seq or microarray, peer-reviewed publication
    • Candidates to evaluate (verify each is still available and choose the strongest):
      • GSE53441 (sickle cell vs healthy whole blood)
      • GSE35007 (sickle cell pediatric)
      • More recent studies preferred — search "sickle cell" with "Homo sapiens" filter
    • Use GEOparse to download
    • Document the chosen dataset's metadata fully in the signature provenance
  4. Differential expression (notebooks/02_disease_signature.ipynb step 3)

    • For microarray data: log2-transform, normalize, use limma-equivalent in Python (or call R via rpy2 if needed — but try to stay in Python)
    • For RNA-seq: use pydeseq2
    • Output: gene-level log fold change and adjusted p-value table
  5. Build the signature (src/disease.py)

    • Take top ~250 up-regulated and top ~250 down-regulated genes by adjusted p-value (cut at q<0.05)
    • Map gene symbols to Entrez IDs and Ensembl IDs (use mygene package or pyensembl)
    • Persist to data/processed/sickle_cell_signature_v1.json with schema:
      {
        "signature_id": "sickle_cell_v1",
        "disease_mondo_id": "MONDO:0011382",
        "up_regulated": [{"gene": "HBG2", "entrez_id": "3048", "log_fc": 2.1, "qvalue": 1e-8}, ...],
        "down_regulated": [...],
        "provenance": {
          "geo_accession": "GSE53441",
          "n_disease": 27,
          "n_healthy": 12,
          "platform": "Affymetrix HG-U133 Plus 2.0",
          "method": "limma",
          "created_date": "2026-..."
        },
        "confidence_tier": "A",
        "tier_rationale": "Measured RNA expression, n>10/group, peer-reviewed dataset",
        "limitations": ["Whole-blood expression confounded by cell composition differences", ...]
      }
      

Honest pitfall to document: Sickle cell whole-blood expression is partly driven by cell composition differences (different RBC/WBC ratios in patients vs controls), not just disease state. Note this in the signature's limitations field. A v2 would do cell-type deconvolution. v1 does not.
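
The gene-selection step in task 5 can be sketched as follows. This is an illustration under stated assumptions: the input is taken to be a list of dicts with `gene`, `log_fc` and `qvalue` keys (the actual differential-expression table may look different), `build_signature` is a hypothetical name, and the provenance, tier, and ID-mapping fields from the schema above are left out for brevity.

```python
def build_signature(de_rows, n_top=250, q_cut=0.05):
    """Select the top up- and down-regulated genes from a DE table.

    de_rows: iterable of dicts with keys 'gene', 'log_fc', 'qvalue'
             (an assumed shape for the differential-expression output).
    Genes are filtered at q < q_cut, split by sign of log fold change,
    and the n_top most significant genes are kept in each direction.
    """
    significant = [r for r in de_rows if r["qvalue"] < q_cut]
    up = sorted((r for r in significant if r["log_fc"] > 0),
                key=lambda r: r["qvalue"])[:n_top]
    down = sorted((r for r in significant if r["log_fc"] < 0),
                  key=lambda r: r["qvalue"])[:n_top]
    return {
        "signature_id": "sickle_cell_v1",
        "disease_mondo_id": "MONDO:0011382",
        "up_regulated": up,
        "down_regulated": down,
    }
```

If fewer than `n_top` genes pass the q-value cut in one direction, the list is simply shorter; recording the final counts alongside the signature makes that visible downstream.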

Week 2 — Drug profiles

Goal: ~300 drug profiles with structure, targets, and LINCS expression signatures where available.

Tasks:

  1. Curate the drug set deliberately (notebooks/03_drug_profiles.ipynb step 1)

    • Ground truth (n=2, non-negotiable): hydroxyurea (ChEMBL:CHEMBL467), L-glutamine (ChEMBL:CHEMBL930)
    • Related-mechanism drugs (n~50): HbF inducers, anti-inflammatory drugs studied in sickle cell, NO donors, antioxidants, drugs in current sickle cell clinical trials (search ClinicalTrials.gov for "sickle cell" interventional trials)
    • Negative controls (n~50): Drugs from unrelated areas — antifungals, contraceptives, antihistamines for non-sickle indications, antibiotics. These should rank low; if they don't, the method has a bias to diagnose.
    • General sample (n~200): Randomly sampled drugs from the LINCS L1000 catalog (which has ~2000 perturbagens). Use a fixed random seed for reproducibility.
    • Store the curated list with the reason for inclusion in data/processed/drug_set_v1.csv
  2. Pull ChEMBL data for each drug (notebooks/03_drug_profiles.ipynb step 2, src/drugs.py)

    • Use chembl_webresource_client to fetch: ChEMBL ID, preferred name, InChIKey, canonical SMILES, known mechanisms of action, target list with bioactivity
    • Resolve drug name aliases to canonical ChEMBL IDs
    • Store raw responses in data/raw/chembl/
  3. Pull LINCS L1000 signatures (notebooks/03_drug_profiles.ipynb step 3)

    • LINCS data portal: https://clue.io/data/CMap2020
    • Use the Level 5 consensus signatures (MODZ aggregation across cell lines and replicates)
    • Format: 978 landmark genes × drugs, with z-scored expression changes
    • Critical honest note: For amino acids and metabolites like L-glutamine, LINCS coverage may be missing. If a ground-truth drug lacks a signature, document this in docs/known_limitations.md. The fallback for that specific drug is to flag it as "no signature available, would require mechanism-graph fallback in v2."
    • Store in data/raw/lincs/
  4. Assemble the drug profile table (src/drugs.py)

    • One row per drug
    • Columns: chembl_id, name, inchikey, smiles, targets (list), mechanism_of_action, lincs_signature (978-vector or null), source_provenance, confidence_tier
    • Persist to data/processed/drug_profiles_v1.parquet
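
The curation step in task 1 (fixed groups plus a seeded random sample) might look like the sketch below. The function name, the `SEED` value, and the row layout are assumptions for illustration; the real drug lists would come from the curated sources named above.

```python
import random

SEED = 20260101  # hypothetical top-level seed; any fixed integer works


def curate_drug_set(ground_truth, related, negative_controls, catalog,
                    n_general=200, seed=SEED):
    """Return rows of {drug_name, inclusion_reason} for the drug set.

    Hand-curated groups are kept in full. The general sample is drawn
    from the remaining catalog with a seeded RNG so reruns are identical.
    """
    fixed = {}
    groups = (("ground_truth", ground_truth),
              ("related_mechanism", related),
              ("negative_control", negative_controls))
    for reason, names in groups:
        for name in names:
            fixed.setdefault(name, reason)  # first listed reason wins
    # Sort the pool so the sample does not depend on set iteration order.
    pool = sorted(set(catalog) - set(fixed))
    rng = random.Random(seed)
    general = rng.sample(pool, min(n_general, len(pool)))
    rows = [{"drug_name": n, "inclusion_reason": r} for n, r in fixed.items()]
    rows += [{"drug_name": n, "inclusion_reason": "general_sample"} for n in general]
    return rows
```

Sorting the pool before sampling matters: without it, the same seed could yield different samples across Python processes because set ordering of strings is not stable.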

Week 3 — Connectivity scoring (the matching engine)

Goal: A ranked CSV of all ~300 drugs by their connectivity score against the sickle cell signature.

Tasks:

  1. Implement CMap connectivity scoring (src/scoring.py, notebooks/04_connectivity_scoring.ipynb)

    • Use cmapPy as the reference implementation (it has the Broad Institute's official implementation)
    • Method: weighted Kolmogorov-Smirnov-based enrichment. For each drug, the score answers: how strongly does this drug's expression signature reverse the disease's up- and down-regulated gene sets?
    • Strongly negative connectivity scores = strong reversal = candidate match
    • Reference: Lamb et al. 2006 (Science), Subramanian et al. 2017 (Cell) — the L1000 paper
  2. Compute scores for all drugs (notebooks/04_connectivity_scoring.ipynb)

    • Map between the disease signature genes (potentially full-genome) and the LINCS 978 landmark genes — only the intersection is scored. Document the gene overlap count; this matters.
    • For drugs without LINCS signatures (e.g., L-glutamine likely): mark explicitly as "not scored, no signature available." Do not skip silently.
    • Output: data/results/ranked_candidates_v1.csv with columns: rank, drug_name, chembl_id, connectivity_score, normalized_score, p_value (if available), inclusion_reason, known_targets, mechanism_summary
  3. Build a secondary mechanistically-weighted ranking (notebooks/04_connectivity_scoring.ipynb)

    • For each drug, compute a prior weight based on whether its known targets are in sickle cell-relevant pathways (HbF regulation, hemoglobin, NO signaling, inflammation, oxidative stress)
    • Produce a second ranking blending connectivity score with mechanistic prior
    • Showing both raw and prior-weighted rankings is honest and informative
  4. Write a unit test (tests/test_scoring.py)

    • Use a reference example from the CMap paper or cmapPy documentation
    • Verify the implementation matches the reference within tolerance
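
To make the scoring idea concrete, here is a minimal pure-Python sketch of a weighted KS-style connectivity score. It is illustrative only: it uses a GSEA-style weighted running sum and the Lamb 2006 convention of returning zero when both gene sets are enriched in the same direction, and it is not cmapPy's API or the Broad Institute's exact implementation. The function names and the dict-based input format are assumptions.

```python
def enrichment_score(ranked_genes, weights, gene_set, p=1.0):
    """Weighted running-sum enrichment of gene_set in a ranked gene list.

    ranked_genes: genes sorted by descending drug z-score.
    weights: the z-score of each gene, in the same order.
    Returns the signed maximum deviation of the running sum from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    hit_w = [abs(w) ** p if h else 0.0 for w, h in zip(weights, hits)]
    total = sum(hit_w)
    if total == 0.0:
        return 0.0
    running, best = 0.0, 0.0
    for w, h in zip(hit_w, hits):
        running += w / total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def connectivity_score(drug_z, up_genes, down_genes):
    """Return (score, overlap). Negative score = drug reverses the disease.

    drug_z: dict gene -> z-score for one drug (e.g. the landmark genes).
    Only disease genes present in drug_z are scored; overlap counts them.
    """
    ranked = sorted(drug_z, key=drug_z.get, reverse=True)
    weights = [drug_z[g] for g in ranked]
    up = set(up_genes) & set(drug_z)
    down = set(down_genes) & set(drug_z)
    es_up = enrichment_score(ranked, weights, up)
    es_down = enrichment_score(ranked, weights, down)
    overlap = len(up) + len(down)
    if es_up * es_down > 0:  # both sets move the same way: no connectivity
        return 0.0, overlap
    return (es_up - es_down) / 2.0, overlap
```

A drug that pushes the disease's up-regulated genes down and its down-regulated genes up gets a score near -1; a drug that mimics the disease gets a score near +1. Returning the overlap count alongside the score supports the requirement to document how many signature genes were actually scored.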

Week 4 — Recovery test and write-up

Goal: A 2-page document a sceptical pharma scientist can evaluate in 5 minutes.

Critical: Before looking at the rankings, pre-register the success criteria in writing:

"The MVP passes if hydroxyurea ranks in the top 10% (top 30 of 300) AND L-glutamine either ranks in the top 25% (top 75) OR is documented as unscorable due to missing LINCS signature. At least 4 of 5 negative-control drugs must rank in the bottom half."

Tasks:

  1. Run the recovery test (notebooks/05_recovery_test.ipynb)

    • Pull the ranks of hydroxyurea and L-glutamine from ranked_candidates_v1.csv
    • Pull the ranks of 5 pre-specified negative controls
    • Compute pass/fail against the pre-registered criteria
  2. Examine the top 10 (notebooks/05_recovery_test.ipynb)

    • For each of the top 10 candidates, write a one-sentence mechanistic rationale (or note "no obvious rationale — possible false positive")
    • Identify the single most interesting non-obvious candidate
    • Many top-10 candidates will look mechanistically silly (HDAC inhibitors and broad kinase inhibitors often dominate connectivity rankings due to widespread expression effects); document this honestly
  3. Write docs/recovery_test_report.md (~2 pages)

    • Section 1 — Methodology: What was built, in 5-6 sentences, with the GEO dataset, drug set composition, and scoring method named
    • Section 2 — Recovery test result: Did hydroxyurea and L-glutamine pass? Did negative controls behave correctly? Pass/fail against pre-registered criteria
    • Section 3 — Top 10 candidates: Brief table with each candidate, score, known mechanism, and a sentence on biological plausibility
    • Section 4 — One non-obvious candidate worth investigating: A single paragraph on the most interesting result
    • Section 5 — Honest limitations: Cell-composition confound, L1000 cell-line limitations, missing signatures, no mechanistic validation layer
    • Section 6 — What v2 would fix: Cell-type deconvolution, knowledge graph for missing-signature drugs, second disease to test generalization
  4. Document data sources fully (docs/data_sources.md)

    • Every data source, version, download date, and license
    • This is the artifact that proves reproducibility
  5. Document known limitations (docs/known_limitations.md)

    • The honest list of what would break this MVP at scale or in a different disease
    • Useful for the next pharma conversation: "yes, we know these are limitations, here's how v2 addresses them"
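
The pre-registered criteria for Week 4 can be encoded as a single pass/fail function so the check is mechanical rather than interpretive. This is a sketch under assumptions: `recovery_test` and its argument layout are hypothetical, ranks are taken to be 1-based with 1 as the strongest reversal, and an unscored drug is represented as `None`.

```python
def recovery_test(ranks, n_drugs, negative_controls,
                  hu="hydroxyurea", gln="L-glutamine"):
    """Evaluate the pre-registered criteria.

    ranks: dict drug_name -> rank (1 = strongest reversal), or None if the
           drug could not be scored (no LINCS signature).
    Passes if hydroxyurea is in the top 10%, L-glutamine is in the top 25%
    or unscorable, and at least 4 negative controls are in the bottom half.
    """
    hu_rank = ranks.get(hu)
    gln_rank = ranks.get(gln)
    hu_pass = hu_rank is not None and hu_rank <= 0.10 * n_drugs
    gln_unscorable = gln_rank is None
    gln_pass = gln_unscorable or gln_rank <= 0.25 * n_drugs
    bottom_half = [d for d in negative_controls
                   if ranks.get(d) is not None and ranks[d] > n_drugs / 2]
    neg_pass = len(bottom_half) >= 4
    return {
        "hydroxyurea_pass": hu_pass,
        "l_glutamine_pass": gln_pass,
        "l_glutamine_unscorable": gln_unscorable,
        "negative_controls_in_bottom_half": len(bottom_half),
        "negative_controls_pass": neg_pass,
        "mvp_pass": hu_pass and gln_pass and neg_pass,
    }
```

The `l_glutamine_unscorable` flag is reported separately because the criteria only accept that route when the missing signature is documented; the function cannot verify the documentation itself.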

7. Data sources reference

| Source | URL | Access | License | Use in MVP |
| --- | --- | --- | --- | --- |
| Open Targets | https://platform.opentargets.org | API, bulk Parquet | CC0 | Target-disease graph |
| MONDO | http://www.obofoundry.org/ontology/mondo.html | OBO file | CC BY 4.0 | Disease ID |
| Orphanet | https://www.orpha.net | Bulk XML | CC BY 4.0 | Rare disease metadata |
| OMIM | https://omim.org | Free for academic | License for commercial | Disease genetics |
| GEO | https://www.ncbi.nlm.nih.gov/geo/ | GEOparse, FTP | Public domain | Expression data |
| ChEMBL | https://www.ebi.ac.uk/chembl | Python client, bulk SQLite | CC BY-SA 3.0 | Drug structures, targets |
| LINCS L1000 | https://clue.io/data | Bulk download | Restricted academic free | Drug expression signatures |
| ClinicalTrials.gov | https://clinicaltrials.gov | API | Public domain | Trial history |
| FDA DailyMed | https://dailymed.nlm.nih.gov | API | Public domain | Approved labels |
| Reactome | https://reactome.org | API, bulk | CC0 | Pathway data (Week 3 prior) |

Licensing note for LINCS: Read the LINCS data use terms before commercial use. For the MVP (research/proof-of-concept), the terms are permissive. For productization, this needs legal review.


8. Reproducibility requirements

This is a science artifact, not a hack. Reproducibility is the whole point.

  • All data downloads must record date and version
  • All randomness must use a fixed seed (set in a top-level constant)
  • All signature and profile files must include created_date and pipeline_version
  • Every notebook must run end-to-end from a fresh checkout without manual intervention (other than downloading the raw data files, which have a documented script)
  • The pre-registered success criteria must be committed to git before the recovery test is run
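
One way to meet the date, version, and seed requirements is a pair of small helpers used by every notebook. This is a sketch, not a prescribed design: the constant values and function names are hypothetical, and the SHA-256 checksum is an addition beyond what the list above requires, included because it makes a recorded download verifiable.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

PIPELINE_VERSION = "0.1.0"  # hypothetical version string
RANDOM_SEED = 42            # hypothetical top-level seed constant


def sha256_of(path):
    """Checksum a file in 1 MiB chunks so large downloads fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_record(path, source, version):
    """Describe one raw download: what it is, when it was fetched, its hash."""
    return {
        "file": Path(path).name,
        "source": source,
        "version": version,
        "download_date": datetime.now(timezone.utc).date().isoformat(),
        "sha256": sha256_of(path),
    }


def stamp(artifact):
    """Return a copy of an artifact dict with the required metadata added."""
    return {
        **artifact,
        "created_date": datetime.now(timezone.utc).date().isoformat(),
        "pipeline_version": PIPELINE_VERSION,
        "random_seed": RANDOM_SEED,
    }
```

The records returned by `download_record` could be collected into docs/data_sources.md or a JSON manifest next to the raw files.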

9. Honest pitfalls (do not ignore these)

These are real risks documented during planning. They are not paranoia.

  1. Cell-composition confound in sickle cell expression data. Whole-blood differential expression in sickle cell partly reflects different blood cell ratios, not disease biology. v1 acknowledges this; v2 should deconvolve.

  2. LINCS L1000 cell-line limitations. The 978 landmark genes were measured mostly in cancer cell lines (MCF7, A375, PC3, etc.). Signatures for non-oncology diseases may be noisy. This is a known field-wide limitation, not unique to Reverso.

  3. L-glutamine probably has no LINCS signature. Amino acids and metabolites weren't LINCS priorities. If true, the ground-truth test only has hydroxyurea, which is weaker. Document honestly.

  4. Connectivity scoring surfaces broad-effect drugs as false positives. HDAC inhibitors and broad kinase inhibitors often top connectivity rankings simply because they perturb many genes. Expect this; don't oversell them. The mechanistic prior in Week 3 helps filter.

  5. Hydroxyurea will probably pass the recovery test by construction. Sickle cell + hydroxyurea is a well-studied pair. Passing this test is necessary but not sufficient to claim the platform generalizes. The next disease (when there is one) is the real test of generalization. Do not sell sickle cell results as proving the platform.

  6. The MVP has no mechanistic validation layer. Multiple experts (Hicham, Nova In Silico) flagged that pure ML matching is not sufficient for extrapolation. The MVP knowingly omits the mechanistic layer; it's a phase-2 addition. Position the MVP as "discovery hypothesis generation," not "validated prediction."

  7. Top-ranked novel candidates have not been wet-lab validated. They are computational hypotheses. Any "interesting candidate" surfaced in the write-up is a hypothesis to test, not a discovery. Use careful language.


10. What to do next (for the human picking this up)

First session in Claude Code

  1. Initialize the repo: git init, create the directory structure in section 4
  2. Set up the Python environment with pyproject.toml (or uv init if using uv)
  3. Open notebooks/01_setup_identifiers.ipynb and start Week 1, task 1
  4. Commit early and often. Each notebook should be a separate commit when first complete.

Pre-flight check before the recovery test (end of Week 3)

Before running the recovery test in Week 4, commit the pre-registered success criteria to git. This prevents post-hoc rationalization. If the criteria need to change after seeing partial results, that change must also be committed and explained.

When the MVP is done

The deliverable to send to anyone (investor, advisor, pharma contact) is:

  1. The 2-page recovery_test_report.md
  2. The ranked_candidates_v1.csv
  3. The signature and profile JSON/parquet files with their provenance

That's it. No slides yet. The single document is the artifact. If it passes the recovery test, you have earned the right to raise on the broader vision.


11. Strategic context (for future Claude Code sessions to understand the "why")

This MVP exists in a broader strategic context that was built through ~7 expert consultations. The key conclusions:

  • The architecture is two databases (disease signatures + drug profiles) + a matching engine. Three independent experts described this unprompted. It is not a hypothesis; it is the standard model in the field.
  • The moat is the curated data, not the algorithm. The matching algorithm is largely commodity (CMap is from 2006). The proprietary value is in the harmonized, curated, provenance-tracked data layer. Build accordingly.
  • The long-term technical architecture adds a knowledge graph (phase 2) and quantum-inspired optimization for combination search (phase 3). Neither is in the MVP.
  • The go-to-market is "digital CRO for drug repurposing." Quantum language is for investors; pharma clients hear "digital CRO."
  • The first paying buyers are R&D programme directors and BD teams in pharma, approached either outbound (we found a match in your area) or via API/sandbox (run your shelved compounds through our engine, data stays on your side).
  • Synthetic trial arms and drug repurposing share data infrastructure. This is a platform play, not a single product.

The MVP's job is to produce one credible result. Everything else follows from that.


12. Phase 2 track — Structure-based binding (scoped 2026-06-23)

Status: scoped, not committed. This is a follow-on track proposed after the MVP and its follow-up experiments. It does not change the MVP's locked decisions (§2); it responds to what those experiments empirically showed. Read §9-11 and the experiment commits first.

12.1 Why pivot modality (the evidence, not a hunch)

The expression-connectivity approach was built, validated, and pushed hard (gene-space expansion, cell-composition deconvolution, reference-library tau, supervised learning). The empirical verdict:

  • Connectivity recovers hydroxyurea (top ~6-8%) but cannot achieve specificity: unrelated drugs (norethindrone, ciprofloxacin) score as strong reversers. Four independent methods failed to fix this.
  • A supervised model on indication labels hit 0.925 CV AUC — but it was a degree-bias mirage: it learned drug popularity, not disease matching (it ranked hydroxyurea 231/300).
  • The decisive test: with drug-popularity features removed, the model trained on the actual drug↔disease connectivity scored AUC 0.491 — pure chance. The expression-connectivity modality contains essentially no disease-specific therapeutic signal for this task.

This is a signal problem, not a model problem — no amount of model sophistication (diffusion, GNNs, etc.) extracts signal that isn't in the data. The response is to change modality to one with a strong, physical, drug-specific signal: does a molecule bind a sickle-relevant target? A drug that binds HbS is mechanistically specific by construction — the opposite of a coincidental expression reverser. Structure is also where the generative-AI frontier (AlphaFold3, which is itself a diffusion model) actually has traction, because structure has physical ground truth.
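
For readers checking the AUC figures above: ROC AUC equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, so 0.5 is what a model with no signal produces. The sketch below computes it directly from that definition; the function name and the toy inputs in the usage note are hypothetical and are not the project's data.

```python
def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative), ties count 0.5.

    scores: model outputs, higher = more likely positive.
    labels: truthy for positives (e.g. drug is indicated for the disease).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])` gives a perfect separation, while a model that assigns every drug the same score lands exactly on 0.5.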

12.2 Targets (sickle-specific, druggable, structurally characterised)

Small molecules only (§2). Curated shortlist with public structures and, crucially, known small-molecule binders to serve as positive controls:

| Target | Mechanism in sickle | Known binder (positive control) |
| --- | --- | --- |
| Hemoglobin (HBB/HBA tetramer, HbS) | Anti-polymerisation; R-state stabiliser | voxelotor (binds α-Val1) |
| PKR (PKLR, red-cell pyruvate kinase) | Activator → ↓2,3-BPG → ↑O2 affinity | mitapivat, etavopivat |
| DNMT1 | HbF induction (de-repress γ-globin) | decitabine, azacitidine |
| LSD1 / KDM1A | HbF induction | tranylcypromine analogues |
| HDAC1/2 | HbF induction | vorinostat, panobinostat |
| EHMT2 (G9a) | HbF induction | UNC0642 (tool) |
| PDE9 | ↑cGMP, anti-adhesion | PF-04447943 (sickle trial) |

Hard/excluded for v1: BCL11A (transcription factor, no classic pocket — the γ-globin master repressor but not small-molecule-tractable yet) and any gene-therapy / biologic mechanism.

12.3 Method (baseline → generative co-folding)

  1. Prepare structures. Pull target structures from the PDB; AF3/Boltz-predict any missing.
  2. Prepare ligands. Reuse the existing ~300-drug set (we already have canonical SMILES from ChEMBL); expandable to the full ChEMBL/LINCS catalogue.
  3. Dock + score, in increasing sophistication:
    • Baseline: classical docking (AutoDock Vina / smina) — fast, CPU, well-understood.
    • Generative co-folding: an open AlphaFold3-class model such as Boltz-2 (MIT-licensed; predicts the protein-ligand complex and a binding-affinity estimate) or Chai-1, or alternatively DiffDock (a diffusion model for docking, the legitimate home for the "diffusion" instinct). These predict the bound pose directly and tend to beat classical docking on pose prediction.
    • Report both; the baseline keeps us honest about whether the ML model adds anything.

12.4 Validation (a real recovery test, like §6 Week 4)

Pre-register before scoring: the known structure-based sickle drugs must rank as top binders to their targets — voxelotor→hemoglobin, mitapivat→PKR, decitabine→DNMT1. Negative controls (unrelated drugs) must not bind these pockets. This is a cleaner recovery test than the expression one, because binding is mechanistically specific — it should not have the coincidental-reverser problem that sank the connectivity approach.

12.5 The real prize — integrate, don't replace

The long-term value is both modalities together: a candidate that reverses the disease signature (expression) and binds a sickle-relevant target (structure) is far more credible than either alone. Structure supplies the specificity the expression layer lacks; expression supplies the systems-level, target-agnostic view structure lacks. The platform thesis (§11) — two databases + a matching engine — extends naturally to a third (structures) feeding the same confidence-tiered data layer.

12.6 Honest pitfalls (do not ignore)

  1. Binding ≠ efficacy. A molecule can bind and do nothing therapeutic. Structure-based hits are still hypotheses (cf. §9.7).
  2. Only covers the enzyme/pocket subset. Sickle's biggest lever (γ-globin reactivation via BCL11A) is largely transcriptional and not small-molecule-tractable — structure-based screening is blind to it. Be explicit about coverage.
  3. Docking/affinity accuracy is limited. Pose prediction is decent; absolute affinity is hard. Validate on known binders before trusting novel scores.
  4. Compute. AF3-class models are GPU-heavy; the local Mac Studio (§2) may not suffice — this track likely needs a GPU box or cloud, the first MVP dependency to break the "all local" rule.
  5. Moat. Structures and tools are public; the proprietary value is the curated target list, the integration with the expression layer, and provenance/tiering, not the docking tool.

12.7 Explicitly NOT in this track

Free energy perturbation / MD-based affinity; covalent docking; de novo generation of molecules as final candidates to synthesise (design, not repurposing — but see §12.9 for the generate-then-retrieve hybrid, which is repurposing); BCL11A or any non-pocket target; biologics; combination binding.

12.8 Open decisions before committing

  • Tooling: classical-docking baseline first, or straight to Boltz-2/DiffDock? (Recommend: baseline first, for an honest reference — the lesson of the whole expression arc.)
  • Compute: secure a GPU environment (the "all local" §2 assumption breaks here).
  • Scope of v1: the 7-target shortlist above, or start with just Hb + PKR (the two with the cleanest positive controls) to de-risk the harness before scaling targets.

12.9 Door left open — generative-guided retrieval (generate → match existing)

A legitimate way to bring generative models into the repurposing frame (vs the design frame excluded in §12.7): don't generate molecules to synthesise — generate them as a search beacon.

The idea. Use a pocket-conditioned generative model (target-conditioned diffusion — e.g. TargetDiff, DiffSBDD, Pocket2Mol) to propose idealised binders for a sickle target. Then retrieve the nearest existing drugs to those proposals by chemical similarity (Tanimoto over Morgan fingerprints, or a learned molecular embedding). The retrieved neighbours — real, already-approved or clinical molecules — are the repurposing candidates. The generated molecule is never made; it only defines a region of chemical space worth searching. This is the user-proposed framing and it is sound: "generate the ideal, then find what we already have nearby."
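
The retrieval half of this idea reduces to a nearest-neighbour search under Tanimoto similarity. The sketch below assumes fingerprints have already been computed elsewhere (for example as Morgan fingerprints from a cheminformatics toolkit) and are represented as sets of on-bit indices; the function names and that representation are assumptions for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_existing(beacon_fp, library_fps, k=5):
    """Rank existing drugs by similarity to one generated 'beacon' molecule.

    beacon_fp: on-bit set for the generated molecule (never synthesised).
    library_fps: dict drug_name -> on-bit set for existing drugs.
    Returns the k most similar drugs as (name, similarity), best first;
    ties are broken alphabetically so the output is deterministic.
    """
    scored = [(tanimoto(beacon_fp, fp), name) for name, fp in library_fps.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(name, sim) for sim, name in scored[:k]]
```

The returned neighbours are only candidates for the docking and recovery-test steps; a high similarity value on its own says nothing about binding.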

Why it could add value. It can point at scaffolds / regions a known-binder-seeded or brute-force docking sweep would miss, and it makes the target's binding requirements explicit as geometry rather than as a single reference ligand.

Honest caveats (why it's a door, not a commitment).

  1. Generated molecules are often synthetically unrealistic / invalid — which is exactly why they must be used only as beacons, never as candidates.
  2. Similarity ≠ activity. Activity cliffs mean a near-neighbour of a good binder can be inert. So retrieved neighbours do not bypass validation — they must still be docked/scored (§12.3) and clear the binding recovery test (§12.4). The generative step proposes; it does not prove.
  3. Marginal-value question. Directly docking the whole existing library (§12.3) already covers chemical space. Whether generate-then-retrieve beats that — by efficiency or by surfacing non-obvious scaffolds — is an open empirical question and needs a head-to-head before it earns real investment.
  4. Only as good as the pocket conditioning of the generator, and the chemistry of the target.

Status: explore only after the §12.3-12.4 docking harness works and is validated on the known binders. The discipline is the same as everywhere else: prove the baseline, then test whether the fancier method adds anything.