Task-level completion across models
We challenge models to reproduce known bioinformatics results—variant calls, pathway enrichments, and QC outcomes. An LLM grades every result and pipeline step against the reference truth so you can see which models are dependable in real pipelines.
All Tasks Per Model
Open environment with reference data
Completion rates across all tasks for each model in the with-reference benchmark.
Updated Jan 13, 2026, 11:32 AM
Tasks 10
Focused model: Claude Opus 4.5 · 95% · Tasks 10 · Steps 39/41
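The headline percentage appears consistent with a step-level completion rate: 39 of 41 graded steps is about 95%. The following is a minimal sketch of that arithmetic, assuming the rate is simply passed steps divided by total steps. The function name `completion_rate` is hypothetical, and the dashboard's actual scoring formula is not stated on this page.

```python
def completion_rate(steps_passed: int, steps_total: int) -> float:
    """Return the percentage of graded steps that passed.

    Assumption: the dashboard figure is passed / total at the step level.
    This is an illustrative helper, not the benchmark's own code.
    """
    if steps_total <= 0:
        raise ValueError("steps_total must be positive")
    if not 0 <= steps_passed <= steps_total:
        raise ValueError("steps_passed must be between 0 and steps_total")
    return 100.0 * steps_passed / steps_total


# Figures shown on the page: 39/41 steps across all 10 tasks, 3/3 on one task.
overall = completion_rate(39, 41)
single_task = completion_rate(3, 3)

print(f"Overall: {overall:.0f}%")
print(f"Single task: {single_task:.0f}%")
```

Under this assumption, both figures shown on the page (39/41 and 3/3) round to the displayed percentages.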
Experiment
Open environment with reference data
Runs grounded with reference artifacts to validate pipeline completion against known outputs.
Heatmap: completion per task (rows) and model (columns).
Models: Claude Opus 4.5, GPT 5.2, Claude Sonnet 4.5, Minimax M2.1, GPT 5.1 Codex-Max, GLM 4.7, Kimi K2 Thinking, Qwen3 Coder 480B A35B, Devstral 2 2512, Gemini 3 Pro Preview
Tasks: alzheimer-mouse, comparative-genomics, cystic-fibrosis, deseq, evolution, giab, metagenomics, single-cell, transcript-quant, viral-metagenomics
Focused: alzheimer-mouse · GPT 5.1 Codex-Max · 100% · Steps 3/3