End-to-end pipeline completion across models

We challenge models to run full bioinformatics pipelines and reproduce known results: variant calls, pathway enrichments, and QC outcomes. An LLM grades every result and every pipeline step against the reference truth, so you can see which models are dependable in real pipelines.

Read the BioAgent Bench blog post.

All Tasks Per Model

Open environment with reference data

Completion rates for end-to-end pipelines for each model in the with-reference benchmark.

Updated Jan 13, 2026, 11:32 AM

Tasks 10

Focused model: GPT 5.2 · 93% · Tasks 10 · Steps 39/42
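The page does not state how the completion percentage is derived. One reading consistent with the figures shown is that it is the share of graded pipeline steps that passed, since 39 of 42 steps rounds to 93%. The sketch below illustrates that reading only; the function name `completion_rate` and the rounding to a whole percent are assumptions, not the benchmark's documented method.

```python
def completion_rate(steps_passed: int, steps_total: int) -> float:
    """Return the percentage of graded steps that passed.

    Assumption: the dashboard percentage is steps passed / total steps.
    """
    if steps_total <= 0:
        raise ValueError("steps_total must be positive")
    if not 0 <= steps_passed <= steps_total:
        raise ValueError("steps_passed must be between 0 and steps_total")
    return 100.0 * steps_passed / steps_total


# Figures taken from the dashboard: GPT 5.2 across all tasks, 39 of 42 steps.
overall = completion_rate(39, 42)
print(f"{round(overall)}%")  # prints "93%"
```

Under the same assumption, a single task with 3 of 3 steps passed comes out at 100%.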

Experiment

Open environment with reference data

Runs grounded with reference artifacts to validate pipeline completion against known outputs.


Task-by-model completion grid

Models (columns): GPT 5.2, Minimax M2.1, GPT 5.1 Codex-Max, GLM 4.7, Kimi K2 Thinking, Qwen3 Coder 480B A35B, Devstral 2 2512, Gemini 3 Pro Preview

Tasks (rows): alzheimer-mouse, comparative-genomics, cystic-fibrosis, deseq, evolution, giab, metagenomics, single-cell, transcript-quant, viral-metagenomics


Focused: alzheimer-mouse · GPT 5.1 Codex-Max · 100% · Steps 3/3