ARDA · Research Results

Predicting Cellular Response to
Perturbations Never Run in the Lab

ARDA achieved the highest score on the PerturBench Norman19 benchmark. Frozen test split. Black-box evaluation. Artifact-complete reproducibility. Causal discovery guiding perturbation response prediction.

0.9109 · Cosine LogFC · highest, same-split

0.9684 · Pearson DE · top-20 differentially expressed genes

0.9281 · hardest subgroup · combo_seen0 (neither gene seen in training)

Vareon Research

Vareon Inc. · Vareon Limited · March 2026

The Problem: Predicting What Hasn't Been Measured

Gene perturbation experiments are expensive, slow, and combinatorially explosive. A single CRISPR screen can cost tens of thousands of dollars and take weeks. The space of combination perturbations — knocking out two or more genes simultaneously — grows combinatorially. For the roughly 20,000 human protein-coding genes, testing all pairwise combinations alone would require ~200 million experiments.
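The arithmetic behind that figure is easy to check (a quick sketch; 20,000 is a rounded gene count):

```python
from math import comb

n_genes = 20_000  # rounded count of human protein-coding genes

# Number of distinct two-gene knockout experiments: C(20000, 2).
pairwise = comb(n_genes, 2)
print(f"{pairwise:,}")  # 199,990,000 -> roughly 200 million experiments
```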

The PerturBench Norman19 benchmark poses the central question directly: given a perturbation combination that has never been run in the lab, what cellular response should be expected? This is not pattern-matching — it is prediction under genuine novelty, evaluated on a frozen test split with no data leakage.

The established state of the art is GEARS (Roohani et al. 2023) — a graph-enhanced gene activation and repression simulator that uses gene regulatory graph structure to predict perturbation outcomes. GEARS and similar supervised approaches train a model on observed perturbation-response pairs and hope it generalizes to unseen combinations. Simple additive baselines already achieve strong results (linear baseline: 0.9022 Cosine LogFC). But neither GEARS nor additive models discover the causal interaction structure — the synergies and antagonisms between genes that drive the most scientifically important phenotypes.
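The additive baseline mentioned above takes a few lines to sketch. The vectors below are synthetic stand-ins for measured single-knockout logFC profiles, not Norman19 data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 5_000

# Synthetic single-perturbation log fold-change profiles (illustrative only).
logfc_a = rng.normal(0.0, 0.5, n_genes)
logfc_b = rng.normal(0.0, 0.5, n_genes)

# Additive baseline: predict the unseen A+B combo as the sum of the two
# single-gene effects. Independent effects are captured; interactions are
# not, by construction.
pred_ab = logfc_a + logfc_b
```

That a baseline this simple scores 0.9022 says most of the signal in combo responses is additive; the remaining gap is precisely the interaction structure.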

What ARDA Does Differently

ARDA is not a perturbation prediction model. It is the Universal Discovery Engine — a governed scientific discovery platform that discovers causal structure from data and uses that structure to make predictions. The PerturBench benchmark tests whether causal discovery improves prediction quality on genuinely unseen perturbation combinations.

CDE (Causal Dynamics Engine)

ARDA's causal discovery mode. Recovers directed causal graphs from observational data. For PerturBench, CDE discovers the gene-gene interaction structure that drives perturbation responses — mechanism, not correlation.

Prediction with Causal Evidence

The primary ARDA contract (arda-predict + arda-cde) uses CDE-discovered causal structure as evidence to guide predictions. The causal graph tells the prediction engine which gene interactions matter, producing more accurate predictions on truly unseen combinations.
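A minimal sketch of that idea, with a hand-written causal graph standing in for CDE output (the ARDA API itself is not shown; all names and numbers here are hypothetical):

```python
import numpy as np

# Hypothetical discovered causal graph over gene symbols (edge: i -> j).
# In the real pipeline this structure would come from CDE.
causal_edges = {("A", "B"), ("B", "D")}

def predict_combo(single_logfc, g1, g2, interaction_term):
    # Start from the additive prediction...
    pred = single_logfc[g1] + single_logfc[g2]
    # ...and apply an interaction correction only when the causal graph
    # says the two perturbed genes actually influence one another.
    if (g1, g2) in causal_edges or (g2, g1) in causal_edges:
        pred = pred + interaction_term
    return pred

single_logfc = {"A": np.array([0.4, -0.2]), "B": np.array([0.1, 0.6])}
combo_pred = predict_combo(single_logfc, "A", "B", np.array([0.3, -0.1]))
```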

4 Discovery Modes

Symbolic discovery for governing equations. Neural discovery for latent dynamics. Neuro-Symbolic for interpretable decomposition. CDE for causal graphs. Each mode produces typed scientific claims — not token predictions. For PerturBench, the CDE mode is primary.

Governed Output

Every claim is typed and machine-readable. Negative controls validate that discovered structure is genuine. Claims that fail are recorded with context. Full provenance, deterministic replay, evidence ledger.

The key insight

Prediction models ask “what will happen?” ARDA asks “why does it happen?” first, then predicts. By discovering the causal structure underlying perturbation responses, ARDA's predictions generalize better to unseen combinations because they are grounded in mechanism rather than memorized patterns.

Results: Unified Same-Split Leaderboard

On the frozen PerturBench Norman19 test split, ARDA with causal evidence achieves the highest Cosine LogFC score across all models, including baselines and ablations.

#   Model                         Cosine LogFC   RMSE     Pearson DE
1   arda_predict_plus_cde         0.9109         0.0425   0.9684
2   linear_baseline               0.9022         0.0491   0.9571
3   arda_predict_plus_cde_gnn     0.9003         0.0450   0.9637
4   arda_predict_only             0.8968         0.0478   0.9650
5   nearest_neighbor_baseline     0.8236         0.0595   0.8590
6   GEARS (Roohani et al. 2023)   0.7158         0.0725   0.7929
7   control_baseline              0.0000         0.0996   0.0000

GEARS (Roohani et al. 2023) evaluated on the same frozen split with GO + co-expression features, averaged over 3 random seeds.
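The two headline metrics have standard definitions; the sketch below assumes Pearson DE ranks genes by the magnitude of the observed logFC, which may differ in detail from the benchmark's exact implementation:

```python
import numpy as np

def cosine_logfc(pred: np.ndarray, true: np.ndarray) -> float:
    # Cosine similarity between predicted and observed logFC vectors.
    return float(pred @ true / (np.linalg.norm(pred) * np.linalg.norm(true)))

def pearson_de(pred: np.ndarray, true: np.ndarray, k: int = 20) -> float:
    # Pearson correlation restricted to the top-k differentially
    # expressed genes, ranked here by |observed logFC|.
    top = np.argsort(-np.abs(true))[:k]
    return float(np.corrcoef(pred[top], true[top])[0, 1])

rng = np.random.default_rng(1)
true = rng.normal(0.0, 1.0, 200)
pred = true + rng.normal(0.0, 0.3, 200)  # a deliberately noisy prediction
print(round(cosine_logfc(pred, true), 3), round(pearson_de(pred, true), 3))
```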

Subgroup Analysis: Performance Scales with Novelty

The benchmark splits test combinations by how many constituent genes appeared in training. combo_seen0 (neither gene seen) is the hardest — genuinely novel combinations. ARDA with CDE maintains strong performance even on the most challenging subgroup.

Subgroup      Cosine LogFC   Description                       Conditions
combo_seen0   0.9281         neither gene seen in training     n=7
combo_seen1   0.8753         one gene seen in training         n=20
combo_seen2   0.9163         both genes seen in training       n=19
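The subgroup assignment itself is just a membership count over the training split; the gene symbols below are illustrative, not the actual split contents:

```python
def combo_subgroup(combo, train_genes):
    # Count how many of the combo's constituent genes were perturbed
    # (singly) somewhere in the training split.
    seen = sum(gene in train_genes for gene in combo)
    return f"combo_seen{seen}"

train_genes = {"KLF1", "MAP2K6", "CEBPE"}  # illustrative training set
print(combo_subgroup(("KLF1", "MAP2K6"), train_genes))  # combo_seen2
print(combo_subgroup(("KLF1", "FOXA1"), train_genes))   # combo_seen1
print(combo_subgroup(("TP73", "FOXA1"), train_genes))   # combo_seen0
```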

Ablation: Causal Evidence Matters

Removing CDE causal evidence (arda_predict_only) drops Cosine LogFC from 0.9109 to 0.8968. The causal graph provides the prediction engine with structural information about gene-gene interactions that pure supervised learning cannot recover from training data alone.

0.9109 · with CDE causal evidence (highest)

0.8968 · without CDE (arda_predict_only) · difference: -0.0141

0.9022 · linear baseline (reference)

What This Means for Discovery

Gene perturbation prediction is a discovery problem, not a prediction problem

The strongest linear baseline achieves 0.9022 through simple additive effects. Beating it requires understanding interaction structure — which genes amplify or suppress each other's effects. CDE discovers this structure. The prediction engine uses it.
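Concretely, for a single readout gene, the interaction is the residual left after subtracting the additive prediction (numbers invented for illustration):

```python
# Illustrative logFC of one readout gene under three perturbations.
effect_a = 0.8      # gene A knocked out alone
effect_b = 0.5      # gene B knocked out alone
observed_ab = 2.1   # A and B knocked out together

additive_pred = effect_a + effect_b         # what the linear baseline says
interaction = observed_ab - additive_pred   # positive: synergy; negative: antagonism
print(round(interaction, 3))  # 0.8 -> strong synergy the additive model misses
```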

Causal discovery transfers across domains

CDE was not built for gene perturbation. It is ARDA's general-purpose causal inference mode — the same engine that achieves 0.959 path fidelity on double-pendulum mechanics, 0.817 on gene regulatory networks, and 0.789 on clinical pharmacokinetics. The PerturBench result demonstrates that causal discovery improves downstream prediction in biology, just as it does in physics.

Black-box, agent-facing evaluation

ARDA was evaluated through its production API surfaces — the same REST API, SDK, and MCP tools that every customer uses. No special research mode. No hand-tuning. The benchmark ran against the same deployment that serves commercial customers.

Read the Full Paper

The complete peer-reviewed manuscript with benchmark methodology, unified leaderboard, subgroup analysis, and reproducibility protocol.
