Structure-binding track: scaffold + ligand-retrieval baseline

Start the structure-based binding branch (PLAN §12), baseline-first.

- src/binding.py: validated RDKit ligand retrieval (morgan_fp, tanimoto,
  retrieve_nearest = the §12.9 engine) + dock() stub documenting the
  blocked ARM-Mac toolchain
- scripts/binding_ligand_baseline.py: 300 drugs vs known binders
- docs/structure_binding_notes.md: status, toolchain blocker, next steps
- pyproject: [structure] extra (rdkit); data/raw/structures/ for PDBs

Step-0 finding: retrieval engine VALIDATED on in-set classes
(decitabine->azacitidine 0.62; vorinostat->scriptaid/belinostat) but the
distinctive binders voxelotor/mitapivat have no analog in our 300-drug
set (Tanimoto ~0.2). Needs (a) a bigger library and (b) real docking
(§12.3); (b) is blocked on the ARM-Mac docking toolchain (§12.6
pitfall 4).
Structures 5E83 (Hb+voxelotor) and 8XFD (PKR+mitapivat) fetched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:53:27 +02:00
parent 6c2c71d73d
commit 817bcda7dc
5 changed files with 195 additions and 0 deletions

src/binding.py Normal file

@@ -0,0 +1,89 @@
"""Structure-based binding track (PLAN §12).

Two capabilities:
  - ligand-based retrieval (RDKit, works now): find existing drugs near a query molecule in
    chemical space. Validated, and the engine behind §12.9 generative-guided retrieval.
  - structure-based docking (§12.3): score whether a ligand binds a target pocket. Blocked on an
    ARM-Mac docking toolchain (AutoDock Vina does not pip-install); see ``dock`` for options.

Caveat carried throughout: chemical similarity != activity, and docking != efficacy (§12.6).
"""
from __future__ import annotations

from pathlib import Path

from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator

RDLogger.DisableLog("rdApp.*")

_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

STRUCT_DIR = Path("data/raw/structures")

# Known sickle small-molecule binders, by target (positive controls for the §12.4 recovery test).
KNOWN_BINDERS = {
    "hemoglobin": "voxelotor",
    "PKR": "mitapivat",
    "DNMT1": "decitabine",
    "HDAC": "vorinostat",
}

# Curated target structures (PLAN §12.2). Add PDB ids as the harness grows.
TARGET_PDB = {
    "hemoglobin": "5E83",  # hemoglobin + voxelotor (GBT440)
    "PKR": "8XFD",  # pyruvate kinase R + mitapivat
}


def morgan_fp(smiles: str):
    """Morgan (ECFP4) fingerprint, or None for invalid / missing SMILES ('-666', '')."""
    if not isinstance(smiles, str) or smiles in ("-666", ""):
        return None
    mol = Chem.MolFromSmiles(smiles)
    return _MORGAN.GetFingerprint(mol) if mol else None


def tanimoto(smiles_a: str, smiles_b: str) -> float | None:
    fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
    if fa is None or fb is None:
        return None
    return DataStructs.TanimotoSimilarity(fa, fb)


def retrieve_nearest(
    query_smiles: str,
    library: dict[str, str],
    top_n: int = 5,
) -> list[tuple[float, str]]:
    """Rank a library of {name: smiles} by Tanimoto similarity to a query molecule.

    This is the §12.9 retrieval step: the query may be a known binder (positive-control beacon)
    or a generated idealised binder; the returned existing drugs are repurposing candidates that
    STILL require docking/validation (similarity != activity).
    """
    qfp = morgan_fp(query_smiles)
    if qfp is None:
        raise ValueError("invalid query SMILES")
    sims = []
    for name, smi in library.items():
        fp = morgan_fp(smi)
        if fp is not None:
            sims.append((DataStructs.TanimotoSimilarity(qfp, fp), name))
    return sorted(sims, reverse=True)[:top_n]


def dock(target: str, ligand_smiles: str) -> float:
    """Dock a ligand into a target pocket and return a binding score (PLAN §12.3).

    Blocked: AutoDock Vina does not pip-install on ARM Mac and no docking binary is on PATH.
    Resolve the toolchain first (one of):
      - conda/micromamba: ``vina`` (conda-forge) or ``smina`` (bioconda), osx-arm64 builds
      - an AF3-class co-folding model on GPU: Boltz-2 / Chai-1 / DiffDock (also predicts affinity)
      - x86 Vina binary under Rosetta 2
    Then: fetch TARGET_PDB[target], define the pocket box, prep the ligand (Meeko), score.
    """
    raise NotImplementedError(
        "Docking toolchain unresolved on ARM Mac (PLAN §12.6 pitfall 4 / §12.8). "
        "See docstring for options."
    )
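
Illustrative note (not part of the commit): the ranking done by retrieve_nearest can be sketched without RDKit by treating a fingerprint as a set of on-bit indices. Tanimoto similarity is then the size of the intersection divided by the size of the union. The helper names tanimoto_bits and rank_by_similarity and the toy bit sets below are hypothetical; they are not real fingerprints of any drug and are not in the repository.

```python
from __future__ import annotations


def tanimoto_bits(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_similarity(
    query: set[int],
    library: dict[str, set[int]],
    top_n: int = 5,
) -> list[tuple[float, str]]:
    """Return (score, name) pairs, best first, like retrieve_nearest does."""
    sims = [(tanimoto_bits(query, bits), name) for name, bits in library.items()]
    return sorted(sims, reverse=True)[:top_n]


# Hypothetical on-bit sets for illustration only (assumption, not real data).
query = {1, 2, 3, 4, 5}
toy_library = {
    "close_analog": {1, 2, 3, 4, 9},     # shares 4 of 6 distinct bits
    "distant": {1, 20, 21, 22, 23},      # shares 1 of 9 distinct bits
    "unrelated": {30, 31, 32},           # shares no bits
}

ranked = rank_by_similarity(query, toy_library, top_n=2)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```

A low top score, as with the "distant" entry here, means the library holds no close analog of the query. A high score still says nothing about activity, so hits remain candidates for docking and validation.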