Task-level completion across models

We challenge models to reproduce known bioinformatics results: variant calls, pathway enrichments, and QC outcomes. An LLM grades every result and pipeline step against the reference outputs, so you can see which models are dependable in real pipelines.

All Tasks Per Model

Open environment with reference data

Completion rates across all tasks for each model in the with-reference benchmark.

Updated Jan 13, 2026, 11:32 AM

Tasks 10
Focused model: Claude Opus 4.5 · 95% · Tasks 10 · Steps 39/41
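The percentage shown next to a step count appears to be the share of graded steps that passed. The sketch below works under that assumption, rounding to the nearest whole percent. The function name `completion_percent` is hypothetical and is not taken from the benchmark's own code.

```python
def completion_percent(passed_steps: int, total_steps: int) -> int:
    """Return the share of passed steps as a whole-number percentage.

    Assumption: the dashboard rounds passed/total to the nearest percent.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= passed_steps <= total_steps:
        raise ValueError("passed_steps must be between 0 and total_steps")
    return round(100 * passed_steps / total_steps)


# Step counts taken from the page: 39/41 overall, 3/3 for a single task.
print(completion_percent(39, 41))
print(completion_percent(3, 3))
```

Under this assumption, 39 of 41 steps gives 95% and 3 of 3 steps gives 100%, which matches the two figures displayed on the page.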

Experiment

Open environment with reference data

Runs are grounded in reference artifacts so that pipeline completion can be validated against known outputs.

Task-by-model completion grid

Models (columns): Claude Opus 4.5, GPT 5.2, Claude Sonnet 4.5, Minimax M2.1, GPT 5.1 Codex-Max, GLM 4.7, Kimi K2 Thinking, Qwen3 Coder 480B A35B, Devstral 2 2512, Gemini 3 Pro Preview

Tasks (rows): alzheimer-mouse, comparative-genomics, cystic-fibrosis, deseq, evolution, giab, metagenomics, single-cell, transcript-quant, viral-metagenomics


Focused: alzheimer-mouse · GPT 5.1 Codex-Max · 100% · Steps 3/3