# Sickle Cell Repurposing: Recovery Test Report

> **Status: COMPLETE.** Reproduce with `scripts/week1_*` → `week2_*` → `week3_scoring.py` →
> `week4_recovery_test.py`. ~2 pages, for a sceptical pharma scientist.

## Pre-registered success criteria

The MVP passes if:

- Hydroxyurea ranks in the **top 10%** (top 30 of 300), **AND**
- L-glutamine ranks in the **top 25%** (top 75) **OR** is documented as unscorable due to a missing LINCS signature, **AND**
- At least **4 of 5** negative-control drugs rank in the **bottom half**.

_Pre-registered in the scaffold commit (`b731478`) before any scoring was run. Primary ranking = raw connectivity. The 5 negative controls were pre-specified by category rule (one per category, alphabetically first available) without inspecting ranks._

---

## Section 1: Methodology

We built a sickle cell disease signature from **two independent whole-blood microarray studies** (GSE35007, Illumina, SS vs AA; GSE16728, Affymetrix, patient vs control), keeping the **671 genes concordant** (q<0.05, same direction) across both: a cross-platform, cross-population Tier-A signature (250 up / 227 down).

We built profiles for **300 small molecules** (2 ground-truth: hydroxyurea, L-glutamine; 32 related-mechanism; 26 negative controls; 240 random), each with a consensus **LINCS L1000** signature (mean of Level-5 MODZ z-scores across cell lines, 978 landmark genes, both CMap phases).

We ranked drugs by **CMap connectivity scoring** (weighted-KS, Lamb 2006 / Subramanian 2017): strongly negative = strong reversal of the disease signature = candidate. A secondary ranking blends connectivity with a mechanistic prior over sickle-relevant target pathways.

## Section 2: Recovery test result, **FAIL** (primary ranking)

| Drug | Rank | Percentile | Pass? |
|---|---|---|---|
| Hydroxyurea | 40 / 300 | top 13.3% | ❌ (needs top 30) |
| L-glutamine | 100 / 300 | top 33.3% | ❌ (WTCS=0, ambiguous; has a signature so not "missing") |

Negative controls (pre-specified; expected: bottom half):

| Control | Category | Rank | Bottom half? |
|---|---|---|---|
| clotrimazole | antifungal | 89 | ❌ |
| astemizole | antihistamine | 291 | ✅ |
| azithromycin | antibiotic | 82 | ❌ |
| ethinyl-estradiol | hormone | 98 | ❌ |
| caffeine | misc | 84 | ❌ |

**Only 1/5 negative controls in the bottom half (need ≥4).**

**Overall: FAIL on all three pre-registered criteria.** This is reported as-is, without adjustment.

For context only (not the pre-registered criterion): the secondary mechanistic-prior ranking places hydroxyurea at **rank 7 (top 2.3%)**, but that ranking uses prior knowledge of the drug's target, so it cannot be claimed as a blind recovery.

**Why it failed: the honest diagnosis.** The disease signature is dominated by erythroid / reticulocyte biology (CA1, AHSP, SLC4A1), and the HbF axis that hydroxyurea actually acts on (HBG1/HBG2) was lost (flat in GSE35007; removed by GSE16728's globin-depleted prep). Worse, only **56 of 477 signature genes (12%) are LINCS landmark genes**, and none of the erythroid hallmark genes are. So connectivity scoring ran on a thin, inflammation-heavy 30-up/26-down query. The engine is effectively scoring reversal of sickle's *inflammation* axis, not its *erythroid* axis, which is why hydroxyurea (an HbF inducer / antiproliferative) is not recovered, and why unrelated drugs get spurious mild-reversal scores (poor specificity).
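
The pass/fail verdict in this section can be re-derived from the reported ranks alone. The sketch below is illustrative and is not the code in `week4_recovery_test.py`: the variable and function names are hypothetical, and the ranks are copied by hand from the two tables above. Cutoffs are computed with integer arithmetic so that "top 10% of 300" is exactly 30.

```python
# Illustrative re-check of the pre-registered criteria.
# All names here are hypothetical; ranks are copied from the Section 2 tables.
N_DRUGS = 300

ground_truth_ranks = {"hydroxyurea": 40, "L-glutamine": 100}
negative_control_ranks = {
    "clotrimazole": 89,
    "astemizole": 291,
    "azithromycin": 82,
    "ethinyl-estradiol": 98,
    "caffeine": 84,
}
# L-glutamine has a LINCS signature, so the "unscorable" escape clause
# of criterion 2 does not apply.
lglutamine_has_signature = True


def in_top_percent(rank: int, percent: int, n: int = N_DRUGS) -> bool:
    """True if rank falls within the top `percent` of n drugs (rank 1 = best)."""
    return rank <= n * percent // 100


def in_bottom_half(rank: int, n: int = N_DRUGS) -> bool:
    """True if rank is strictly below the midpoint of the ranking."""
    return rank > n // 2


hydroxyurea_pass = in_top_percent(ground_truth_ranks["hydroxyurea"], 10)
lglutamine_pass = (
    in_top_percent(ground_truth_ranks["L-glutamine"], 25)
    or not lglutamine_has_signature
)
n_bottom_half = sum(in_bottom_half(r) for r in negative_control_ranks.values())
controls_pass = n_bottom_half >= 4

overall_pass = hydroxyurea_pass and lglutamine_pass and controls_pass

print(f"hydroxyurea top 10%: {hydroxyurea_pass}")
print(f"L-glutamine top 25% or unscorable: {lglutamine_pass}")
print(f"negative controls in bottom half: {n_bottom_half}/5 (pass: {controls_pass})")
print(f"overall: {'PASS' if overall_pass else 'FAIL'}")
```

Run against the reported ranks, each of the three criteria evaluates to false, which matches the FAIL recorded above.
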
## Section 3: Top 10 candidates (raw connectivity)

| Rank | Drug | Score | Known target / mechanism | Plausibility |
|---|---|---|---|---|
| 1 | laropiprant | −0.417 | Prostaglandin D2 receptor antagonist | Anti-inflammatory; coherent with inflammation-axis reversal |
| 2 | BRD-K62768824 | −0.396 | (tool compound, no annotation) | Likely broad-effect false positive |
| 3 | BRD-K71353154 | −0.393 | (tool compound) | Likely false positive |
| 4 | lisinopril | −0.358 | ACE inhibitor | **Non-obvious; see §4** |
| 5 | BRD-K53443165 | −0.358 | (tool compound) | Likely false positive |
| 6 | talnetant | −0.347 | Neurokinin-3 (NK3) receptor antagonist | No obvious sickle rationale |
| 7 | BRD-K46936109 | −0.342 | (tool compound) | Likely false positive |
| 8 | lawsone | −0.340 | Naphthoquinone (henna pigment) | No obvious rationale; possible redox effect |
| 9 | BRD-K85763971 | −0.338 | (tool compound) | Likely false positive |
| 10 | BRD-K36516410 | −0.323 | (tool compound) | Likely false positive |

As anticipated (PLAN §9.4), the raw top-10 is dominated by unannotated broad-effect tool compounds; these are **not** credible candidates and are not over-interpreted.

## Section 4: One non-obvious candidate worth investigating

**Lisinopril (ACE inhibitor), rank 4.** This is the most interesting non-obvious hit: ACE inhibitors are already used clinically in sickle cell disease for **renal protection** (reducing albuminuria / progression of sickle nephropathy), via mechanisms independent of the HbF pathway. Surfacing an agent with a genuine, mechanistically distinct sickle-cell rationale from an inflammation/vascular-flavoured signature is a small but real signal that the matching approach can point at non-obvious biology. **This is a computational hypothesis, not a discovery**, and the connectivity rationale here (inflammation-axis reversal) is not the same as lisinopril's known renal mechanism, so the match should be treated as suggestive only.
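
Because Sections 3 and 4 rest entirely on the raw connectivity score, it may help to see what that score measures. The following is a minimal sketch of a GSEA-style weighted-KS enrichment score and the combined weighted connectivity score (WTCS) in the spirit of Subramanian 2017. It is not the code in `week3_scoring.py`: the function names, the weighting exponent `p=1`, and the toy data are assumptions made for illustration. The sign convention follows Section 1 (negative = reversal), and the score is set to 0 when the up and down sets are enriched in the same direction, which is the "WTCS=0, ambiguous" case reported for L-glutamine.

```python
import numpy as np


def enrichment_score(z, in_set, p=1.0):
    """Weighted-KS enrichment of a gene set in a drug signature.

    z      : drug z-scores, one per gene (hypothetical input layout)
    in_set : boolean mask marking the query genes
    p      : weighting exponent (assumed 1.0 here)
    Returns the signed running-sum deviation with the largest magnitude.
    """
    z = np.asarray(z, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    order = np.argsort(-z)              # rank genes from most up to most down
    z_sorted, hits = z[order], in_set[order]
    n, n_hits = len(z), int(hits.sum())
    if n_hits == 0 or n_hits == n:
        return 0.0
    weights = np.where(hits, np.abs(z_sorted) ** p, 0.0)
    total = weights.sum()
    if total == 0:
        return 0.0
    p_hit = np.cumsum(weights) / total            # weighted hit fraction so far
    p_miss = np.cumsum(~hits) / (n - n_hits)      # miss fraction so far
    running = p_hit - p_miss
    return float(running[np.argmax(np.abs(running))])


def wtcs(z, up_mask, down_mask):
    """Combine up/down enrichment; 0 if both point the same way (ambiguous)."""
    es_up = enrichment_score(z, up_mask)
    es_down = enrichment_score(z, down_mask)
    if np.sign(es_up) == np.sign(es_down):
        return 0.0
    return (es_up - es_down) / 2.0


# Toy example: 10 genes, drug pushes disease-up genes down and vice versa.
z = np.array([3.0, 2.5, 2.0, 1.0, 0.5, -0.5, -1.0, -2.0, -2.5, -3.0])
disease_up = np.zeros(10, dtype=bool)
disease_up[[7, 8, 9]] = True       # disease-up genes sit at the bottom
disease_down = np.zeros(10, dtype=bool)
disease_down[[0, 1, 2]] = True     # disease-down genes sit at the top
reversal_score = wtcs(z, disease_up, disease_down)
print(reversal_score)
```

With only 30 up and 26 down landmark genes in the query, each running sum takes few steps, so small shifts in gene order can move the score noticeably. That is consistent with the noisy negative-control behaviour reported in Section 2.
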
## Section 5: Honest limitations

1. **Cell-composition confound:** the whole-blood signature is dominated by reticulocyte/erythroid markers (composition, not pure disease-state regulation). v2 needs deconvolution.
2. **Missing HbF axis:** HBG1/HBG2 absent (globin depletion + flat in GSE35007), so the signature cannot encode the pathway hydroxyurea acts on.
3. **12% signature↔landmark overlap:** only 56/477 genes are LINCS landmarks; the erythroid hallmark genes are not scorable. The query collapses to a generic inflammation/metabolic slice.
4. **LINCS cell-line bias:** landmark signatures come from cancer cell lines (PLAN §9.2); poorly suited to a blood disease.
5. **Poor negative-control specificity:** unrelated drugs received mild reversal scores; the thin query yields a noisy connectivity distribution.
6. **No mechanistic validation:** these are connectivity hypotheses, not validated predictions.

## Section 6: What v2 would fix

- **Cell-type deconvolution** of the disease signature to separate disease-state regulation from composition, recovering specificity.
- **A non-globin-depleted, RNA-seq whole-blood study** to retain the HbF axis.
- **Signature prediction** (DeepCE-style) or a mechanism/knowledge graph to score the ~88% of the signature that has no LINCS landmark. This is the single biggest lever on this result.
- **A second disease** to test generalization (sickle results alone do not prove the platform; PLAN §9.5).

---

### Bottom line

The pipeline is reproducible end-to-end and the method is sound, but on this signature it **does not recover the known sickle cell drugs**. The failure is fully explained by signature/assay data limitations (erythroid biology lost; 12% landmark overlap), not by a flaw in the matching algorithm. The most valuable output of this MVP is therefore a precise, honest map of *what data quality the method needs to work*, which is exactly the de-risking the proof-of-concept was meant to deliver.