Compare commits
1 Commits
main...structure-
| Author | SHA1 | Date | |
|---|---|---|---|
| | 817bcda7dc | | |
0
data/raw/structures/.gitkeep
Normal file
34
docs/structure_binding_notes.md
Normal file
@@ -0,0 +1,34 @@
# Structure-based binding track: working notes

Branch `structure-based-binding`. Implements PLAN §12. Baseline-first: start with the two cleanest
targets (hemoglobin + PKR) and de-risk the harness before scaling.

## Status (2026-06-23)

**Toolchain check (PLAN §12.6 pitfall 4, confirmed real):**
- ✅ RDKit installs on ARM Mac, so the ligand side is ready.
- ❌ AutoDock Vina does NOT pip-install on ARM Mac; no docking binary is available. Docking (§12.3)
  is **blocked on toolchain** and must be resolved via conda/micromamba (`vina`/`smina`), a GPU
  AF3-class model (Boltz-2/Chai-1/DiffDock), or an x86 Vina binary under Rosetta.

**Structures obtained:** `5E83` (hemoglobin + voxelotor), `8XFD` (PKR + mitapivat) in
`data/raw/structures/`.

**Step 0: ligand-based retrieval baseline (`scripts/binding_ligand_baseline.py`):**
RDKit Tanimoto similarity of our 300 drugs vs known sickle binders.
- Engine VALIDATED on in-set classes: `decitabine`→azacitidine (0.62); `vorinostat`→scriptaid
  (0.42), belinostat (0.28). It correctly clusters the DNMT1 / HDAC HbF-inducers.
- But voxelotor / mitapivat have **no analog** in our set (max Tanimoto ~0.20–0.26). A 300-drug
  library is too sparse to contain look-alikes of distinctive scaffolds.

**Takeaways:**
1. Ligand retrieval works but needs a **bigger drug library** to be useful for distinctive targets.
2. The targets without in-set analogs (Hb, PKR) need **actual docking** (§12.3), which scores
   binding directly with no look-alike required. That is the gating next step, and it needs the
   toolchain solved.

## Next steps
- [ ] Resolve the docking toolchain (recommend: micromamba + smina/vina, CPU, no GPU needed for baseline).
- [ ] Dock the known binders (voxelotor→5E83, mitapivat→8XFD) as positive controls (§12.4 recovery test).
- [ ] Expand the ligand library (full ChEMBL/LINCS) so retrieval has reach.
- [ ] Only then: AF3-class co-folding (Boltz-2/DiffDock) vs the docking baseline, and the §12.9 generative beacon.
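The Step 0 baseline in the notes above ranks library drugs by Tanimoto similarity between Morgan fingerprints. As a minimal sketch of what that score measures, the snippet below computes the Tanimoto coefficient on plain Python sets of on-bit indices and ranks a toy library against a query. The function names, bit sets, and drug names are invented for illustration and are not real fingerprints; the committed script uses RDKit's `DataStructs.TanimotoSimilarity` on 2048-bit Morgan vectors instead.

```python
def tanimoto_sets(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| on sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def rank_by_similarity(query: set[int], library: dict[str, set[int]], top_n: int = 5):
    """Return the top_n (score, name) pairs, highest similarity first."""
    scored = [(tanimoto_sets(query, bits), name) for name, bits in library.items()]
    return sorted(scored, reverse=True)[:top_n]


# Toy on-bit sets, NOT real fingerprints.
query = {1, 2, 3, 4}
library = {
    "toy_close": {1, 2, 3, 5},  # shares 3 of 5 distinct bits
    "toy_far": {4, 7, 8, 9},    # shares 1 of 7 distinct bits
}
ranked = rank_by_similarity(query, library)
```

Read this way, a maximum score of about 0.20 to 0.26, as reported for voxelotor and mitapivat, means the nearest library drug shares only about a fifth to a quarter of the combined on-bits with the query.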
@@ -33,6 +33,12 @@ dev = [
    "pytest>=8.0",
    "ruff>=0.5",
]
# Structure-based binding track (PLAN §12). Docking tool (vina/smina) is NOT pip-installable on
# ARM Mac; install via conda/micromamba or use a GPU AF3-class model (see docs/structure_binding_notes.md).
structure = [
    "rdkit>=2024.3",
    "requests>=2.31",
]

[tool.setuptools.packages.find]
where = ["."]
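Because the `structure` extra above can only cover the pip-installable half of the toolchain, a small preflight check may help before running anything from this track. The sketch below is one possible shape for it, using only the standard library. The function name `toolchain_status` is hypothetical, and the executable names `vina` and `smina` are assumptions about what a conda/micromamba install would put on PATH.

```python
import importlib.util
import shutil


def toolchain_status(binaries: tuple[str, ...] = ("vina", "smina")) -> dict[str, bool]:
    """Report which parts of the structure-track toolchain this environment can see."""
    # find_spec returns None when a top-level package is not importable.
    status = {"rdkit": importlib.util.find_spec("rdkit") is not None}
    for exe in binaries:
        # shutil.which returns the executable's path, or None if it is not on PATH.
        status[exe] = shutil.which(exe) is not None
    return status


if __name__ == "__main__":
    for name, ok in toolchain_status().items():
        print(f"{name:6s} {'found' if ok else 'MISSING'}")
```

On the ARM Mac setup described in the notes, this would be expected to report RDKit as found and both docking binaries as missing until the conda/micromamba route is resolved.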
66
scripts/binding_ligand_baseline.py
Normal file
@@ -0,0 +1,66 @@
"""Structure-based track, step 0: ligand-based retrieval baseline (PLAN §12.9 engine).

Docking (§12.3) needs a toolchain that doesn't pip-install on ARM Mac (AutoDock Vina); that's the
next dependency to solve. Meanwhile this runs now with pure RDKit: do any of our 300 drugs sit near
the KNOWN sickle binders (voxelotor, mitapivat, decitabine, vorinostat) in chemical space? This is
the retrieval engine §12.9 would point a generative beacon at, and a sanity check on the ligand data.

NOT docking and NOT a binding claim: chemical similarity only. Similarity != activity (§12.9).
"""

from __future__ import annotations

import pandas as pd
import requests
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator

RDLogger.DisableLog("rdApp.*")
MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

# Known sickle binders = positive-control beacons (targets: hemoglobin, PKR, DNMT1, HDAC).
BINDERS = ["voxelotor", "mitapivat", "decitabine", "vorinostat"]


def pubchem_smiles(name: str) -> str | None:
    for prop in ("SMILES", "ConnectivitySMILES", "IsomericSMILES", "CanonicalSMILES"):
        try:
            u = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/property/{prop}/JSON"
            d = requests.get(u, timeout=30).json()["PropertyTable"]["Properties"][0]
            if prop in d:
                return d[prop]
        except Exception:
            continue
    return None


def fp(smi: str):
    if not isinstance(smi, str) or smi in ("-666", ""):
        return None
    m = Chem.MolFromSmiles(smi)
    return MORGAN.GetFingerprint(m) if m else None


def main() -> None:
    binder_smi = {b: pubchem_smiles(b) for b in BINDERS}
    print("known-binder SMILES:", {k: (v[:34] + "..." if v else "MISSING") for k, v in binder_smi.items()})

    drugs = pd.read_csv("data/processed/drug_set_v1.csv")[["pert_iname", "canonical_smiles", "inclusion_reason"]]
    reason = dict(zip(drugs.pert_iname, drugs.inclusion_reason))
    drug_fp = {r.pert_iname: fp(r.canonical_smiles) for r in drugs.itertuples()}
    drug_fp = {k: v for k, v in drug_fp.items() if v is not None}
    print(f"fingerprinted {len(drug_fp)}/{len(drugs)} drugs\n")

    for b, smi in binder_smi.items():
        bfp = fp(smi)
        if bfp is None:
            print(f"{b}: no SMILES\n"); continue
        sims = sorted(((DataStructs.TanimotoSimilarity(bfp, v), k) for k, v in drug_fp.items()), reverse=True)
        print(f"nearest drugs to {b}:")
        for s, k in sims[:5]:
            print(f" {s:.3f} {k:22s} [{reason.get(k,'?')}]")
        print()


if __name__ == "__main__":
    main()
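`pubchem_smiles` above interpolates the compound name straight into the PUG REST path. That works for the four binders queried here, but a name containing a character such as `/` or `#` would change the meaning of the URL. A minimal sketch of a more defensive URL builder follows; `PUG_BASE` and `pubchem_property_url` are hypothetical names, not part of the commit, and the URL layout is copied from the script.

```python
from urllib.parse import quote

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name"


def pubchem_property_url(name: str, prop: str) -> str:
    """Build the PUG REST property URL, percent-encoding the compound name."""
    # safe="" also encodes "/", which would otherwise be read as a path separator.
    return f"{PUG_BASE}/{quote(name.strip(), safe='')}/property/{prop}/JSON"


url = pubchem_property_url("voxelotor", "SMILES")
```

The rest of `pubchem_smiles` could stay as it is and call this helper in place of the inline f-string, which would matter mainly if the binder list grows beyond simple one-word names.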
89
src/binding.py
Normal file
@@ -0,0 +1,89 @@
"""Structure-based binding track (PLAN §12).

Two capabilities:
- ligand-based retrieval (RDKit, works now): find existing drugs near a query molecule in
  chemical space. Validated, and the engine behind §12.9 generative-guided retrieval.
- structure-based docking (§12.3): score whether a ligand binds a target pocket. Blocked on an
  ARM-Mac docking toolchain (AutoDock Vina does not pip-install); see ``dock`` for options.

Caveat carried throughout: chemical similarity != activity, and docking != efficacy (§12.6).
"""

from __future__ import annotations

from pathlib import Path

from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator

RDLogger.DisableLog("rdApp.*")
_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

STRUCT_DIR = Path("data/raw/structures")

# Known sickle small-molecule binders, by target (positive controls for the §12.4 recovery test).
KNOWN_BINDERS = {
    "hemoglobin": "voxelotor",
    "PKR": "mitapivat",
    "DNMT1": "decitabine",
    "HDAC": "vorinostat",
}

# Curated target structures (PLAN §12.2). Add PDB ids as the harness grows.
TARGET_PDB = {
    "hemoglobin": "5E83",  # hemoglobin + voxelotor (GBT440)
    "PKR": "8XFD",  # pyruvate kinase R + mitapivat
}


def morgan_fp(smiles: str):
    """Morgan (ECFP4) fingerprint, or None for invalid / missing SMILES ('-666', '')."""
    if not isinstance(smiles, str) or smiles in ("-666", ""):
        return None
    mol = Chem.MolFromSmiles(smiles)
    return _MORGAN.GetFingerprint(mol) if mol else None


def tanimoto(smiles_a: str, smiles_b: str) -> float | None:
    fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
    if fa is None or fb is None:
        return None
    return DataStructs.TanimotoSimilarity(fa, fb)


def retrieve_nearest(
    query_smiles: str,
    library: dict[str, str],
    top_n: int = 5,
) -> list[tuple[float, str]]:
    """Rank a library of {name: smiles} by Tanimoto similarity to a query molecule.

    This is the §12.9 retrieval step: the query may be a known binder (positive-control beacon)
    or a generated idealised binder; the returned existing drugs are repurposing candidates that
    STILL require docking/validation (similarity != activity).
    """
    qfp = morgan_fp(query_smiles)
    if qfp is None:
        raise ValueError("invalid query SMILES")
    sims = []
    for name, smi in library.items():
        fp = morgan_fp(smi)
        if fp is not None:
            sims.append((DataStructs.TanimotoSimilarity(qfp, fp), name))
    return sorted(sims, reverse=True)[:top_n]


def dock(target: str, ligand_smiles: str) -> float:
    """Dock a ligand into a target pocket and return a binding score (PLAN §12.3).

    Blocked: AutoDock Vina does not pip-install on ARM Mac and no docking binary is on PATH.
    Resolve the toolchain first (one of):
      - conda/micromamba: ``vina`` (conda-forge) or ``smina`` (bioconda), osx-arm64 builds
      - an AF3-class co-folding model on GPU: Boltz-2 / Chai-1 / DiffDock (Boltz-2 also predicts affinity)
      - x86 Vina binary under Rosetta 2
    Then: fetch TARGET_PDB[target], define the pocket box, prep the ligand (Meeko), score.
    """
    raise NotImplementedError(
        "Docking toolchain unresolved on ARM Mac (PLAN §12.6 pitfall 4 / §12.8). "
        "See docstring for options."
    )
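The `dock` docstring lists "define the pocket box" as a step once the toolchain works. One common way to do that when the structure already contains a co-crystallized ligand, as 5E83 and 8XFD are described here, is to centre an axis-aligned box on the ligand's atoms and pad it. The sketch below illustrates that idea under stated assumptions: `DockBox`, `box_from_ligand_coords`, and the 4 Å padding are hypothetical choices that do not come from the PLAN, and the coordinates are invented rather than read from a PDB file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DockBox:
    center: tuple[float, float, float]
    size: tuple[float, float, float]


def box_from_ligand_coords(coords, padding: float = 4.0) -> DockBox:
    """Axis-aligned box around ligand atom coordinates, padded on every side (angstroms)."""
    if not coords:
        raise ValueError("need at least one atom coordinate")
    mins = [min(c[i] for c in coords) for i in range(3)]
    maxs = [max(c[i] for c in coords) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    # Each edge is the ligand's extent on that axis plus padding on both sides.
    size = tuple((hi - lo) + 2 * padding for lo, hi in zip(mins, maxs))
    return DockBox(center=center, size=size)


# Invented coordinates, not taken from any PDB entry.
toy_ligand = [(10.0, 2.0, -1.0), (14.0, 6.0, 3.0), (12.0, 4.0, 1.0)]
box = box_from_ligand_coords(toy_ligand)
```

AutoDock Vina's command line takes the search box as `--center_x/y/z` and `--size_x/y/z`, so a `DockBox` maps onto those six values directly. The padding is a tuning choice and would need checking against the §12.4 positive-control recovery test.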