Tests an LLM's ability to select the correct tool and pass the right arguments, using both OpenAI function calling and Anthropic tool_use. Contains 250+ tasks drawn from two sources (TAB and BFCL) across 13+ categories, with dual-score evaluation: Tool Selection (70%) + Argument Accuracy (30%). Also includes irrelevance-detection tests, where the model must recognize when NOT to call a tool.
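The dual-score weighting can be sketched as follows. This is a minimal illustrative scorer, not the benchmark's actual implementation; the field names (`tool`, `args`) and the exact-match argument comparison are assumptions.

```python
def score_task(expected: dict, predicted: dict) -> float:
    """Dual-score: 70% tool selection + 30% argument accuracy (weights from
    the benchmark description; schema and matching rules are assumptions)."""
    # Irrelevance case: the correct behavior is to call no tool at all.
    if expected.get("tool") is None:
        return 1.0 if predicted.get("tool") is None else 0.0

    # Tool selection: binary credit for naming the right tool.
    tool_ok = 1.0 if predicted.get("tool") == expected["tool"] else 0.0

    # Argument accuracy: fraction of expected arguments matched exactly.
    exp_args = expected.get("args", {})
    if exp_args:
        pred_args = predicted.get("args", {})
        matches = sum(1 for k, v in exp_args.items() if pred_args.get(k) == v)
        arg_acc = matches / len(exp_args)
    else:
        arg_acc = 1.0  # no arguments expected, so nothing to get wrong

    return 0.7 * tool_ok + 0.3 * arg_acc
```

For example, calling the right tool with one of two expected arguments correct would score 0.7 + 0.3 * 0.5 = 0.85 under this sketch.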