Benchmark · Jan 29, 2026 · 12 min read · arXiv:2601.21800v1

BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

BioAgent Bench is a benchmark dataset and evaluation suite designed to measure how reliably AI agents can execute real, multi-step bioinformatics pipelines. It focuses on concrete output artifacts, end-to-end tool use, and robustness testing under controlled perturbations.

Authors: Dionizije Fa (Entropic), Marko Culjak (TakeLab, FER), Bruno Pandza (Entropic), Mateo Cupic (Entropic).

  • Tasks: 10 end-to-end pipelines
  • Models: 10 models across 3 harnesses
  • Constraints: runtime < 4h, RAM <= 48GB

Intro

BioAgent Bench is our attempt to evaluate bioinformatics agents the way they get used in real life. It bundles ten end-to-end pipelines with an agent harness and grader that checks outputs, artifacts, and traces. We test more than whether an agent finishes: we also probe how it behaves when inputs are corrupted, files are decoys, or prompts are bloated. The headline result: frontier models can complete complex pipelines, but robustness is still the bottleneck. And because many workflows use sensitive data or proprietary references, open-weight models remain a practical option even when their completion rates are lower.

Key takeaways

  • A benchmark dataset that mirrors practical bioinformatics work.
  • A head-to-head look at closed and open-weight models as agents.
  • An evaluation suite that records traces, grades outputs, and probes robustness.
Repos: bioagent-bench/bioagent-bench and bioagent-bench/bioagent-experiments.

Figure 1. End-to-end evaluation harness: task prompt + input data + optional references, tool execution, artifact capture, and LLM grading of outputs.

Why this benchmark exists

Bioinformatics pipelines often chain command-line tools, manage heterogeneous file formats, and interpret intermediate outputs that are domain-specific. Traditional benchmarks reduce this to static question answering or code generation, which misses the reality of tool orchestration and artifact production. BioAgent Bench instead defines end-to-end tasks that require agents to produce concrete outputs such as VCFs, CSVs, or QC reports.

Benchmark design

BioAgent Bench includes 10 tasks spanning bulk and single-cell RNA-seq, comparative genomics, variant calling, metagenomics, viral metagenomics, transcript quantification, and experimental evolution. Each task is a single instance that includes a natural language prompt, the required input files, and reference data when available.

Two constraints guided dataset selection: runtime below 4 hours and memory usage at or below 48GB. This keeps the suite reproducible and focuses on smaller organisms where reference data can be bundled as inputs. The tradeoff is that some large-organism workflows and external reference sourcing are out of scope.
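
These budgets are easy to enforce mechanically. As a minimal sketch (the paper does not describe its enforcement mechanism, and `run_trial`, `MAX_SECONDS`, and `MAX_BYTES` are illustrative names), a harness on Linux could hold each trial to the runtime and memory limits like this:

```python
import resource
import subprocess

MAX_SECONDS = 4 * 60 * 60      # 4-hour runtime budget
MAX_BYTES = 48 * 1024 ** 3     # 48 GB address-space cap

def _cap_memory():
    # Runs in the child process just before the agent command starts (POSIX only).
    resource.setrlimit(resource.RLIMIT_AS, (MAX_BYTES, MAX_BYTES))

def run_trial(cmd, workdir):
    """Run one agent command inside its sandbox folder under the benchmark budgets."""
    try:
        return subprocess.run(
            cmd,
            cwd=workdir,
            preexec_fn=_cap_memory,
            timeout=MAX_SECONDS,
            capture_output=True,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return None  # the trial exceeded the 4-hour budget
```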

Task definitions

  • Task: a single prompt with fixed inputs and a clear success criterion.
  • Trial: one execution of a task by an agent harness.
  • Transcript: full log of messages, tool calls, and intermediate artifacts.
  • Outcome: final produced result artifact.
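
For concreteness, the four concepts can be pictured as plain records. The field names below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    identifier: str            # e.g. "giab" or "metagenomics"
    prompt: str                # natural-language task description
    inputs: list[str]          # bundled input file paths
    references: list[str]      # bundled reference data, when available
    verifiable: bool           # strict pass/fail vs. LLM-graded

@dataclass
class Trial:
    task: Task
    model: str                 # model under evaluation
    harness: str               # e.g. "claude-code", "codex-cli", "opencode"
    transcript: list[dict] = field(default_factory=list)  # messages, tool calls, artifacts
    outcome: str | None = None # path to the final result artifact, if one was produced
```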

Figure 2. Task coverage across organisms, domains, and pipeline types.

Tasks and verifiability

Some tasks are fully verifiable with strict pass/fail criteria, while others require LLM grading because multiple valid pipelines exist and intermediate artifacts are voluminous. Output formats are typically CSV or TSV for automated evaluation, with VCFs or other bioinformatics artifacts where appropriate.

Identifier | Task | Language | Tool calls | Verifiable
--- | --- | --- | --- | ---
alzheimer-mouse | Alzheimer Mouse Models: Comparative Pathway Analysis | Python | No | No
comparative-genomics | Comparative Genomics: Co-evolving Gene Clusters | R | No | No
cystic-fibrosis | Cystic Fibrosis Mendelian Variant Identification | bash | Yes | Yes
deseq | RNA-Seq Differential Expression (DESeq2) | Python | Yes | No
evolution | Experimental Evolution Variant Calling (E. coli) | bash | Yes | No
giab | GIAB Variant Calling (NA12878) | bash | Yes | Yes
metagenomics | Metagenomics: Community Comparison (Cuatro Cienegas) | R | Yes | No
single-cell | Single-cell RNA-seq: Skeletal Muscle Exercise Response | Python | No | No
transcript-quant | Transcript Quantification (Simulated RNA-Seq) | bash | Yes | Yes
viral-metagenomics | Viral Metagenomics: Species Identification (Dolphin) | bash | Yes | Yes
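
For the fully verifiable rows above, grading can reduce to a deterministic table comparison. A minimal sketch, assuming the final artifact is a CSV or TSV keyed on an identifier column (`strict_match` and `load_table` are hypothetical helpers, not the benchmark's grader):

```python
import csv

def load_table(path, key_column):
    """Load a CSV or TSV artifact into a dict keyed on one identifier column."""
    delimiter = "\t" if path.endswith(".tsv") else ","
    with open(path, newline="") as handle:
        return {row[key_column]: row for row in csv.DictReader(handle, delimiter=delimiter)}

def strict_match(agent_path, truth_path, key_column):
    """Pass/fail: the agent's table must contain the same keys and values as the truth table."""
    agent = load_table(agent_path, key_column)
    truth = load_table(truth_path, key_column)
    return agent.keys() == truth.keys() and all(agent[k] == truth[k] for k in truth)
```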

Experimental setup

Each model runs inside a harness (Claude Code, Codex CLI, or OpenCode). Tasks are executed in a sandboxed folder with network access. The system prompt instructs the agent to produce artifacts for each pipeline step and stop when the final output is generated. A grader model (GPT-5.1) evaluates the outputs and traces.
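
The shape of that instruction is simple to picture. The wording below is illustrative only, not the benchmark's actual system prompt:

```python
SYSTEM_PROMPT_TEMPLATE = """\
You are a bioinformatics agent working inside {workdir}.
Complete the task below end to end using the tools available to you.
Write an artifact to disk after every pipeline step.
Stop once the final output file has been produced.

Task:
{task_prompt}
"""

def build_system_prompt(task_prompt: str, workdir: str) -> str:
    """Render the per-task system prompt handed to the harness (illustrative wording)."""
    return SYSTEM_PROMPT_TEMPLATE.format(workdir=workdir, task_prompt=task_prompt)
```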

Grader inputs

  • Input and reference data paths
  • Expected outcomes (ground-truth tables)
  • Agent outcomes (CSV or TSV artifacts)
  • Agent trace (folders and file paths)
  • Grading rubric prioritizing pipeline completion
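
Taken together, these inputs amount to a small structured payload. A sketch of how they might be bundled for the grader model; the field and function names are assumptions, not the benchmark's actual format:

```python
import json

def build_grader_payload(input_paths, reference_paths, expected_dir,
                         agent_artifacts, trace_paths, rubric_text):
    """Bundle one trial's grading context into a single JSON document for the grader model."""
    return json.dumps(
        {
            "input_paths": input_paths,          # task inputs
            "reference_paths": reference_paths,  # bundled reference data
            "expected_outcomes": expected_dir,   # ground-truth tables
            "agent_outcomes": agent_artifacts,   # CSV/TSV artifacts the agent produced
            "agent_trace": trace_paths,          # folders and file paths it created
            "rubric": rubric_text,               # rubric prioritizing pipeline completion
        },
        indent=2,
    )
```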

Grader outputs

  • Steps completed vs steps to completion
  • Final result reached (artifact exists)
  • Results match (task-specific correctness)
  • F1 score (GIAB only)
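
The GIAB F1 score is the usual harmonic mean of precision and recall over called variants. A simplified sketch of the arithmetic only; real GIAB comparisons go through dedicated tooling that also handles variant representation differences:

```python
def variant_f1(called, truth):
    """Simplified F1 over two sets of (chrom, pos, ref, alt) variant tuples."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: variants in both call set and truth set
    fp = len(called - truth)   # false positives: called but not in the truth set
    fn = len(truth - called)   # false negatives: in the truth set but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```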

Results snapshot

Frontier models achieve high completion rates across tasks. Claude Opus 4.5 reaches 100%, while Gemini 3 Pro, GPT-5.2, and Claude Sonnet 4.5 exceed 90%. The best open-weight model, GLM-4.7, reaches 82.5% in the Codex CLI harness, with other open-weight models ranging down to roughly 65%.


Figure 3. Completion heatmap across tasks and models, plus average completion rates.

Planning quality vs completion

When models are asked to produce a high-level plan without executing tools, planning quality correlates with overall completion (Pearson r = 0.61). The relationship is not deterministic: some models complete pipelines even with weaker explicit plans, indicating that baseline domain knowledge and agentic execution ability both matter.
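
The correlation itself is a one-liner once plan ratings and completion rates are tabulated per model. A sketch with made-up numbers, not the paper's data:

```python
import numpy as np

# Toy per-model values for illustration only; the paper reports r = 0.61
# over its full set of models and harnesses.
plan_rating = np.array([8.5, 7.9, 8.1, 5.2, 6.4, 4.9])        # graded plan quality
completion  = np.array([1.00, 0.95, 0.82, 0.90, 0.71, 0.66])  # pipeline completion rate

r = np.corrcoef(plan_rating, completion)[0, 1]
print(f"Pearson r = {r:.2f}")  # ~0.63 for these made-up values
```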

Interpretation

Stronger planning generally predicts better end-to-end results, but success can still occur when execution reliability compensates for weaker planning. Open-weight models tend to have lower plan ratings and more variable completion outcomes.


Figure 4. Plan rating vs pipeline completion rate across models.

Robustness and perturbations

Robustness tests include prompt bloat, corrupted inputs, and decoy files. Across tasks, the mean Jaccard overlap of categorical outputs is 0.43 and the mean Pearson correlation for numerical outputs is 0.73, indicating substantial variability between trials. Prompt bloat reduces completion by an average of 28 percentage points, and decoy or corrupted files sometimes slip through shallow file-selection heuristics.
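
Both stability measures are straightforward to compute between two trials of the same task. A sketch of the two metrics as described here, with toy values rather than the paper's data:

```python
import numpy as np

def jaccard(a, b):
    """Overlap between two trials' categorical outputs (e.g. reported gene or species sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def numeric_agreement(x, y):
    """Pearson correlation between two trials' numerical outputs (e.g. fold changes)."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

# Toy example: the same task run twice under perturbation.
print(jaccard({"GeneA", "GeneB", "GeneC"}, {"GeneB", "GeneC", "GeneD"}))  # 0.5
print(numeric_agreement([1.2, -0.4, 2.1], [1.0, -0.6, 2.4]))
```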

Corrupted inputs

Agents sometimes proceed despite corrupted files, or attempt to route around issues with alternative references. In other cases, obvious corruption causes early termination. The most concerning cases are subtle corruptions that preserve file structure but invalidate biological meaning.
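
To make that last failure mode concrete, here is one hypothetical example of a structure-preserving corruption; the benchmark's actual perturbations are not reproduced here and may differ:

```python
import random

def scramble_fastq_record(record_lines, rng=None):
    """Shuffle the bases of one FASTQ record while leaving its structure intact.

    The header, read length, and quality string are preserved, so format-level
    checks still pass, but the sequence no longer carries its original signal.
    """
    rng = rng or random.Random(0)
    header, seq, plus, qual = record_lines
    bases = list(seq)
    rng.shuffle(bases)
    return [header, "".join(bases), plus, qual]

print(scramble_fastq_record(["@read1", "ACGTACGT", "+", "IIIIIIII"]))
```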

Decoy files

Failures often stem from shallow heuristics such as globbing file name patterns instead of grounding tool choices in biological context. In metagenomics, for example, an agent selected a viral database instead of the required bacterial reference.
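
The difference between the two strategies is easy to illustrate. A hypothetical sketch, not taken from any agent's trace: a filename glob will happily return a decoy, while even a shallow look inside the file grounds the choice in its contents:

```python
from pathlib import Path

def pick_reference_shallow(workdir):
    """Fragile heuristic: take the first FASTA whose name looks like a reference."""
    return next(Path(workdir).glob("*reference*.fasta"), None)

def pick_reference_grounded(workdir, expected_keyword):
    """Safer: inspect each FASTA header and require a match to the task's stated organism."""
    for path in sorted(Path(workdir).glob("*.fasta")):
        with open(path) as handle:
            header = handle.readline()
        if expected_keyword.lower() in header.lower():
            return path
    return None
```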

Prompt bloat

Excess prompt text can cause agents to repeatedly restate the task, cycle through shallow reformulations, and end without producing artifacts. The behavior resembles that of weaker agent harnesses, where tool use and state tracking are lost under distraction.

Conclusion

Completion alone is a necessary but insufficient metric for real-world readiness. The benchmark shows that agents can construct pipelines and produce final artifacts, yet still miss step-level reasoning failures such as incorrect file selection or ignoring corrupted inputs. For clinical or regulated settings, the question is not just whether an agent produces a result, but whether it can justify its choices and avoid proceeding when evidence is unreliable.

BioAgent Bench shifts evaluation from "can it finish?" to "can it finish reliably, for the right reasons?" The benchmark captures realistic tool orchestration and structured outputs, while exposing brittle behaviors under perturbations. The next step is to expand task diversity, include larger and messier inputs, require external reference justification, and integrate robustness directly into primary scoring.

Read more

Dive into the full paper or explore the code and experiments.