Tests an LLM's ability to select the correct tool and pass the right arguments, using both OpenAI function calling and Anthropic tool_use. Contains 250+ tasks drawn from two sources (TAB and BFCL) across 13+ categories, with dual-score evaluation: Tool Selection (70%) + Argument Accuracy (30%). Also includes irrelevance-detection tests, where the model must recognize when NOT to call a tool.
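The dual-score weighting can be sketched as follows. This is a minimal illustrative scorer, not the benchmark's actual implementation; the field names (`tool`, `args`) and the exact-match argument comparison are assumptions.

```python
def score_task(expected: dict, predicted: dict) -> float:
    """Dual-score: 70% tool selection + 30% argument accuracy (weights from
    the benchmark description; schema and matching rules are assumptions)."""
    # Irrelevance case: the correct behavior is to call no tool at all.
    if expected.get("tool") is None:
        return 1.0 if predicted.get("tool") is None else 0.0

    # Tool selection: binary credit for naming the right tool.
    tool_ok = 1.0 if predicted.get("tool") == expected["tool"] else 0.0

    # Argument accuracy: fraction of expected arguments matched exactly.
    exp_args = expected.get("args", {})
    if exp_args:
        pred_args = predicted.get("args", {})
        matches = sum(1 for k, v in exp_args.items() if pred_args.get(k) == v)
        arg_acc = matches / len(exp_args)
    else:
        arg_acc = 1.0  # no arguments expected, so nothing to get wrong

    return 0.7 * tool_ok + 0.3 * arg_acc
```

For example, calling the right tool with one of two expected arguments correct would score 0.7 + 0.3 * 0.5 = 0.85 under this sketch.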