Benchmark · Jan 29, 2026 · 12 min read · arXiv:2601.21800v1

BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

BioAgent Bench is a benchmark dataset and evaluation suite designed to measure how reliably AI agents can execute real, multi-step bioinformatics pipelines. It focuses on concrete output artifacts, end-to-end tool use, and robustness testing under controlled perturbations.

Authors: Dionizije Fa (Entropic), Marko Culjak (TakeLab, FER), Bruno Pandza (Entropic), Mateo Cupic (Entropic).

  • Tasks: 10 end-to-end pipelines
  • Models: 10 models across 3 harnesses
  • Constraints: runtime < 4h, RAM <= 48GB

Intro

BioAgent Bench is our attempt to evaluate bioinformatics agents the way they get used in real life. It bundles ten end-to-end pipelines with an agent harness and grader that checks outputs, artifacts, and traces. We test more than whether an agent finishes: we also probe how it behaves when inputs are corrupted, files are decoys, or prompts are bloated. The headline result: frontier models can complete complex pipelines, but robustness is still the bottleneck. And because many workflows use sensitive data or proprietary references, open-weight models remain a practical option even when their completion rates are lower.

Key takeaways

  • A benchmark dataset that mirrors practical bioinformatics work.
  • A head-to-head look at closed and open-weight models as agents.
  • An evaluation suite that records traces, grades outputs, and probes robustness.
Repos: bioagent-bench/bioagent-bench and bioagent-bench/bioagent-experiments.

Figure 1. End-to-end evaluation harness: task prompt + input data + optional references, tool execution, artifact capture, and LLM grading of outputs.

Why this benchmark exists

Bioinformatics pipelines often chain command-line tools, manage heterogeneous file formats, and interpret intermediate outputs that are domain-specific. Traditional benchmarks reduce this to static question answering or code generation, which misses the reality of tool orchestration and artifact production. BioAgent Bench instead defines end-to-end tasks that require agents to produce concrete outputs such as VCFs, CSVs, or QC reports.

Benchmark design

BioAgent Bench includes 10 tasks spanning bulk and single-cell RNA-seq, comparative genomics, variant calling, metagenomics, viral metagenomics, transcript quantification, and experimental evolution. Each task is a single instance that includes a natural language prompt, the required input files, and reference data when available.

Two constraints guided dataset selection: runtime below 4 hours and memory usage at or below 48GB. This keeps the suite reproducible and focuses on smaller organisms where reference data can be bundled as inputs. The tradeoff is that some large-organism workflows and external reference sourcing are out of scope.
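
These budgets are easy to enforce mechanically. As a minimal sketch (the paper does not describe its enforcement mechanism, and `run_trial`, `MAX_SECONDS`, and `MAX_BYTES` are illustrative names), a harness on Linux could hold each trial to the runtime and memory limits like this:

```python
import resource
import subprocess

MAX_SECONDS = 4 * 60 * 60      # 4-hour runtime budget
MAX_BYTES = 48 * 1024 ** 3     # 48 GB address-space cap

def _cap_memory():
    # Runs in the child process just before the agent command starts (POSIX only).
    resource.setrlimit(resource.RLIMIT_AS, (MAX_BYTES, MAX_BYTES))

def run_trial(cmd, workdir):
    """Run one agent command inside its sandbox folder under the benchmark budgets."""
    try:
        return subprocess.run(
            cmd,
            cwd=workdir,
            preexec_fn=_cap_memory,
            timeout=MAX_SECONDS,
            capture_output=True,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return None  # the trial exceeded the 4-hour budget
```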

Task definitions

  • Task: a single prompt with fixed inputs and a clear success criterion.
  • Trial: one execution of a task by an agent harness.
  • Transcript: full log of messages, tool calls, and intermediate artifacts.
  • Outcome: final produced result artifact.
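
For concreteness, the four concepts can be pictured as plain records. The field names below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    identifier: str            # e.g. "giab" or "metagenomics"
    prompt: str                # natural-language task description
    inputs: list[str]          # bundled input file paths
    references: list[str]      # bundled reference data, when available
    verifiable: bool           # strict pass/fail vs. LLM-graded

@dataclass
class Trial:
    task: Task
    model: str                 # model under evaluation
    harness: str               # e.g. "claude-code", "codex-cli", "opencode"
    transcript: list[dict] = field(default_factory=list)  # messages, tool calls, artifacts
    outcome: str | None = None # path to the final result artifact, if one was produced
```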

Figure 2. Task coverage across organisms, domains, and pipeline types.

Tasks and verifiability

Some tasks are fully verifiable with strict pass/fail criteria, while others require LLM grading because multiple valid pipelines exist and intermediate artifacts are voluminous. Output formats are typically CSV or TSV for automated evaluation, with VCFs or other bioinformatics artifacts where appropriate.

Identifier | Task | Language | Tool calls | Verifiable
--- | --- | --- | --- | ---
alzheimer-mouse | Alzheimer Mouse Models: Comparative Pathway Analysis | Python | No | No
comparative-genomics | Comparative Genomics: Co-evolving Gene Clusters | R | No | No
cystic-fibrosis | Cystic Fibrosis Mendelian Variant Identification | bash | Yes | Yes
deseq | RNA-Seq Differential Expression (DESeq2) | Python | Yes | No
evolution | Experimental Evolution Variant Calling (E. coli) | bash | Yes | No
giab | GIAB Variant Calling (NA12878) | bash | Yes | Yes
metagenomics | Metagenomics: Community Comparison (Cuatro Cienegas) | R | Yes | No
single-cell | Single-cell RNA-seq: Skeletal Muscle Exercise Response | Python | No | No
transcript-quant | Transcript Quantification (Simulated RNA-Seq) | bash | Yes | Yes
viral-metagenomics | Viral Metagenomics: Species Identification (Dolphin) | bash | Yes | Yes
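
For the fully verifiable rows above, grading can reduce to a deterministic table comparison. A minimal sketch, assuming the final artifact is a CSV or TSV keyed on an identifier column (`strict_match` and `load_table` are hypothetical helpers, not the benchmark's grader):

```python
import csv

def load_table(path, key_column):
    """Load a CSV or TSV artifact into a dict keyed on one identifier column."""
    delimiter = "\t" if path.endswith(".tsv") else ","
    with open(path, newline="") as handle:
        return {row[key_column]: row for row in csv.DictReader(handle, delimiter=delimiter)}

def strict_match(agent_path, truth_path, key_column):
    """Pass/fail: the agent's table must contain the same keys and values as the truth table."""
    agent = load_table(agent_path, key_column)
    truth = load_table(truth_path, key_column)
    return agent.keys() == truth.keys() and all(agent[k] == truth[k] for k in truth)
```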

Experimental setup

Each model runs inside a harness (Claude Code, Codex CLI, or OpenCode). Tasks are executed in a sandboxed folder with network access. The system prompt instructs the agent to produce artifacts for each pipeline step and stop when the final output is generated. A grader model (GPT-5.1) evaluates the outputs and traces.
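
The shape of that instruction is simple to picture. The wording below is illustrative only, not the benchmark's actual system prompt:

```python
SYSTEM_PROMPT_TEMPLATE = """\
You are a bioinformatics agent working inside {workdir}.
Complete the task below end to end using the tools available to you.
Write an artifact to disk after every pipeline step.
Stop once the final output file has been produced.

Task:
{task_prompt}
"""

def build_system_prompt(task_prompt: str, workdir: str) -> str:
    """Render the per-task system prompt handed to the harness (illustrative wording)."""
    return SYSTEM_PROMPT_TEMPLATE.format(workdir=workdir, task_prompt=task_prompt)
```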

Grader inputs

  • Input and reference data paths
  • Expected outcomes (ground-truth tables)
  • Agent outcomes (CSV or TSV artifacts)
  • Agent trace (folders and file paths)
  • Grading rubric prioritizing pipeline completion
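
Taken together, these inputs amount to a small structured payload. A sketch of how they might be bundled for the grader model; the field and function names are assumptions, not the benchmark's actual format:

```python
import json

def build_grader_payload(input_paths, reference_paths, expected_dir,
                         agent_artifacts, trace_paths, rubric_text):
    """Bundle one trial's grading context into a single JSON document for the grader model."""
    return json.dumps(
        {
            "input_paths": input_paths,          # task inputs
            "reference_paths": reference_paths,  # bundled reference data
            "expected_outcomes": expected_dir,   # ground-truth tables
            "agent_outcomes": agent_artifacts,   # CSV/TSV artifacts the agent produced
            "agent_trace": trace_paths,          # folders and file paths it created
            "rubric": rubric_text,               # rubric prioritizing pipeline completion
        },
        indent=2,
    )
```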

Grader outputs

  • Steps completed vs steps to completion
  • Final result reached (artifact exists)
  • Results match (task-specific correctness)
  • F1 score (GIAB only)
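
The GIAB F1 score is the usual harmonic mean of precision and recall over called variants. A simplified sketch of the arithmetic only; real GIAB comparisons go through dedicated tooling that also handles variant representation differences:

```python
def variant_f1(called, truth):
    """Simplified F1 over two sets of (chrom, pos, ref, alt) variant tuples."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: variants in both call set and truth set
    fp = len(called - truth)   # false positives: called but not in the truth set
    fn = len(truth - called)   # false negatives: in the truth set but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```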

Results snapshot

Frontier models achieve high completion rates across tasks. Claude Opus 4.5 reaches 100%, while Gemini 3 Pro, GPT-5.2, and Claude Sonnet 4.5 exceed 90%. The best open-weight model, GLM-4.7, reaches 82.5% in the Codex CLI harness, with other open-weight models ranging down to roughly 65%.


Figure 3. Completion heatmap across tasks and models, plus average completion rates.

Planning quality vs completion

When models are asked to produce a high-level plan without executing tools, planning quality correlates with overall completion (Pearson r = 0.61). The relationship is not deterministic: some models complete pipelines even with weaker explicit plans, indicating that baseline domain knowledge and agentic execution ability both matter.
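
The correlation itself is a one-liner once plan ratings and completion rates are tabulated per model. A sketch with made-up numbers, not the paper's data:

```python
import numpy as np

# Toy per-model values for illustration only; the paper reports r = 0.61
# over its full set of models and harnesses.
plan_rating = np.array([8.5, 7.9, 8.1, 5.2, 6.4, 4.9])        # graded plan quality
completion  = np.array([1.00, 0.95, 0.82, 0.90, 0.71, 0.66])  # pipeline completion rate

r = np.corrcoef(plan_rating, completion)[0, 1]
print(f"Pearson r = {r:.2f}")  # ~0.63 for these made-up values
```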

Interpretation

Stronger planning generally predicts better end-to-end results, but success can still occur when execution reliability compensates for weaker planning. Open-weight models tend to have lower plan ratings and more variable completion outcomes.


Figure 4. Plan rating vs pipeline completion rate across models.

Robustness and perturbations

Robustness tests include prompt bloat, corrupted inputs, and decoy files. Across tasks, the mean Jaccard overlap of categorical outputs is 0.43 and the mean Pearson correlation for numerical outputs is 0.73, indicating substantial variability between trials. Prompt bloat reduces completion by an average of 28 percentage points, and decoy or corrupted files sometimes slip through shallow file-selection heuristics.
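
Both stability measures are straightforward to compute between two trials of the same task. A sketch of the two metrics as described here, with toy values rather than the paper's data:

```python
import numpy as np

def jaccard(a, b):
    """Overlap between two trials' categorical outputs (e.g. reported gene or species sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def numeric_agreement(x, y):
    """Pearson correlation between two trials' numerical outputs (e.g. fold changes)."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

# Toy example: the same task run twice under perturbation.
print(jaccard({"GeneA", "GeneB", "GeneC"}, {"GeneB", "GeneC", "GeneD"}))  # 0.5
print(numeric_agreement([1.2, -0.4, 2.1], [1.0, -0.6, 2.4]))
```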

Corrupted inputs

Agents sometimes proceed despite corrupted files, or attempt to route around issues with alternative references. In other cases, obvious corruption causes early termination. The most concerning cases are subtle corruptions that preserve file structure but invalidate biological meaning.
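
To make that last failure mode concrete, here is one hypothetical example of a structure-preserving corruption; the benchmark's actual perturbations are not reproduced here and may differ:

```python
import random

def scramble_fastq_record(record_lines, rng=None):
    """Shuffle the bases of one FASTQ record while leaving its structure intact.

    The header, read length, and quality string are preserved, so format-level
    checks still pass, but the sequence no longer carries its original signal.
    """
    rng = rng or random.Random(0)
    header, seq, plus, qual = record_lines
    bases = list(seq)
    rng.shuffle(bases)
    return [header, "".join(bases), plus, qual]

print(scramble_fastq_record(["@read1", "ACGTACGT", "+", "IIIIIIII"]))
```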

Decoy files

Failures often stem from shallow heuristics such as globbing file name patterns instead of grounding tool choices in biological context. In metagenomics, for example, an agent selected a viral database instead of the required bacterial reference.
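
The difference between the two strategies is easy to illustrate. A hypothetical sketch, not taken from any agent's trace: a filename glob will happily return a decoy, while even a shallow look inside the file grounds the choice in its contents:

```python
from pathlib import Path

def pick_reference_shallow(workdir):
    """Fragile heuristic: take the first FASTA whose name looks like a reference."""
    return next(Path(workdir).glob("*reference*.fasta"), None)

def pick_reference_grounded(workdir, expected_keyword):
    """Safer: inspect each FASTA header and require a match to the task's stated organism."""
    for path in sorted(Path(workdir).glob("*.fasta")):
        with open(path) as handle:
            header = handle.readline()
        if expected_keyword.lower() in header.lower():
            return path
    return None
```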

Prompt bloat

Excess prompt text can cause agents to repeatedly restate the task, cycle through shallow reformulations, and end without producing artifacts. The behavior resembles that of weaker agent harnesses, where tool use and state tracking are lost under distraction.

Conclusion

Completion alone is a necessary but insufficient metric for real-world readiness. The benchmark shows that agents can construct pipelines and produce final artifacts, yet still miss step-level reasoning failures such as incorrect file selection or ignoring corrupted inputs. For clinical or regulated settings, the question is not just whether an agent produces a result, but whether it can justify its choices and avoid proceeding when evidence is unreliable.

BioAgent Bench shifts evaluation from "can it finish?" to "can it finish reliably, for the right reasons?" The benchmark captures realistic tool orchestration and structured outputs, while exposing brittle behaviors under perturbations. The next step is to expand task diversity, include larger and messier inputs, require external reference justification, and integrate robustness directly into primary scoring.

Read more

Dive into the full paper or explore the code and experiments.