Best LLMs for Agentic Use Cases

A practical, opinionated guide to choosing the right model for your Promptise agent. Covers proprietary and open-source options, real-world ratings, and which model fits which job.

Last updated: April 2026

The agentic LLM landscape moves fast. This page reflects the state of available models as of April 2026. We update it as new models ship.

What makes a good agentic model?

Picking a chat model for an agent is different from picking one for a chatbot. An agent needs to:

Call tools reliably — invoke functions with the right arguments, in the right order, without hallucinating signatures
Reason across steps — chain multiple tool calls toward a goal, recover from errors, decide when to stop
Follow instructions precisely — respect role definitions, system prompts, output schemas
Handle structured output — return valid JSON when asked, match Pydantic schemas
Stay within budget — call only the tools that matter, not 40 in a row
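To make the first requirement concrete, here is a minimal sketch of checking a model-proposed tool call against the tool's real signature before running it. The `get_weather` tool and the `validate_tool_call` helper are hypothetical names used only for illustration; they are not part of Promptise.

```python
import inspect

def get_weather(city: str, units: str = "celsius") -> str:
    """Hypothetical tool, defined only for this illustration."""
    return f"Weather in {city} ({units})"

def validate_tool_call(tool, arguments: dict) -> list[str]:
    """Return the problems with a proposed call; an empty list means it binds cleanly."""
    params = inspect.signature(tool).parameters
    problems = []
    # Arguments the model invented that the tool does not accept
    for name in arguments:
        if name not in params:
            problems.append(f"unknown argument: {name}")
    # Required parameters (no default) that the model left out
    for name, param in params.items():
        if param.default is inspect.Parameter.empty and name not in arguments:
            problems.append(f"missing required argument: {name}")
    return problems

good_call = {"city": "Berlin"}
bad_call = {"location": "Berlin", "unit": "F"}  # hallucinated parameter names

print(validate_tool_call(get_weather, good_call))
print(validate_tool_call(get_weather, bad_call))
```

A model with reliable tool calling should rarely produce a call like `bad_call`; counting how often this check fails on your own tools is a cheap reliability signal.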

Models that score well on chat benchmarks (MMLU, HellaSwag) often fail on agentic ones. The benchmarks that matter for agents are:

Benchmark	What it measures
Berkeley Function Calling Leaderboard (BFCL)	Tool-call accuracy, parameter extraction, multi-step composition
τ²-Bench	Realistic multi-turn agent tasks across customer service domains
Terminal-Bench	Real terminal/coding agent tasks end-to-end
IFBench	Instruction-following reliability under structured constraints
ToolComp	Composing dependent tool calls toward a goal

Quick recommendations

If you don't want to read the whole page, here's where to start:

Your situation	Use this model
Just starting, want it to work	`openai:gpt-5-mini`: best balance of cost, speed, reliability
Production agent, need maximum reliability	`anthropic:claude-sonnet-4-6`: best tool calling in 2026
Maximum reasoning quality, cost no object	`anthropic:claude-opus-4-6` or `openai:gpt-5`
Multi-million token context	`google:gemini-3-pro`: 2M-token window
Self-hosted, no data leaves your infra	`ollama:qwen3-coder` or `ollama:deepseek-v3.2`
Local laptop dev	`ollama:phi-4-mini` or `ollama:qwen3-coder-30b`
Hybrid: cheap for simple steps, strong for hard ones	Use Promptise's per-node `model_override` in the Reasoning Engine

Proprietary models

Anthropic Claude 4.6 family

The current state-of-the-art for agentic workloads. Claude 4.6 leads tool-calling benchmarks and is the most reliable choice for production agents that orchestrate complex multi-tool workflows.

Model	Provider String	Input $/M	Output $/M	Speed	Context	Best For
Claude Opus 4.6	`anthropic:claude-opus-4-6`	$15	$75	20-30 t/s	1M	Maximum quality, hardest tasks
Claude Sonnet 4.6	`anthropic:claude-sonnet-4-6`	$3	$15	40-60 t/s	1M	Production agents, best value
Claude Haiku 4.5	`anthropic:claude-haiku-4-5`	$1	$5	80+ t/s	200K	Fast, simple agent tasks

Rating	Score
Tool calling reliability	★★★★★
Reasoning quality	★★★★★ (Opus) / ★★★★☆ (Sonnet)
Speed	★★★☆☆ (Opus) / ★★★★☆ (Sonnet) / ★★★★★ (Haiku)
Cost efficiency	★★☆☆☆ (Opus) / ★★★★☆ (Sonnet) / ★★★★★ (Haiku)
Setup complexity	★☆☆☆☆ (just an API key)

When to use each:

Opus 4.6 — Long-horizon agents, complex multi-agent orchestration, agents that need to debug their own mistakes. Use when getting the right answer matters more than the cost per call.
Sonnet 4.6 — The default choice for production. Near-Opus quality at 5x lower cost. Best tool-calling/cost ratio in the market right now.
Haiku 4.5 — Fast classification, routing, simple tool calls. Pair it with Sonnet via the Reasoning Engine for cost-optimized pipelines.

from promptise import build_agent
from promptise.config import HTTPServerSpec

agent = await build_agent(
    model="anthropic:claude-sonnet-4-6",
    servers={"tools": HTTPServerSpec(url="http://localhost:8080/mcp")},
    instructions="You are a senior analyst.",
)

OpenAI GPT-5 family

GPT-5.2 currently leads several agentic benchmarks alongside Claude. GPT-5-mini is the best entry-level model for cost-conscious agentic work.

Model	Provider String	Input $/M	Output $/M	Speed	Context	Best For
GPT-5	`openai:gpt-5`	$5	$20	30-50 t/s	1M	Top-tier reasoning, parity with Claude Opus
GPT-5-mini	`openai:gpt-5-mini`	$1.25	$10	60-80 t/s	400K	Best entry point — fast, cheap, capable
GPT-5-nano	`openai:gpt-5-nano`	$0.30	$1.20	100+ t/s	128K	High-volume, simple tasks

Rating	Score
Tool calling reliability	★★★★★
Reasoning quality	★★★★★ (GPT-5) / ★★★★☆ (mini)
Speed	★★★★☆
Cost efficiency	★★★★☆ (mini is exceptional value)
Setup complexity	★☆☆☆☆

When to use each:

GPT-5 — When you need the absolute best on coding and reasoning. Strong native function calling. Excellent for complex agentic workflows.
GPT-5-mini — Promptise's recommended default for getting started. Great tool calling, fast, cheap enough to iterate freely.
GPT-5-nano — Use for high-volume agent fleets where each call is simple (classification, routing, simple lookups).

agent = await build_agent(
    model="openai:gpt-5-mini",
    servers={"tools": HTTPServerSpec(url="...")},
)

Google Gemini 3 family

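Gemini 3 Pro's 2M-token context window is its standout feature. If your agent needs to read entire codebases, full document libraries, or hours of conversation history in a single prompt, Gemini 3 Pro has the largest window of the proprietary models on this page (Llama 4 Scout offers 10M tokens if you can self-host).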

Model	Provider String	Input $/M	Output $/M	Speed	Context	Best For
Gemini 3 Pro	`google:gemini-3-pro`	$2 / $4	$12 / $24	40-60 t/s	2M	Massive context, multimodal
Gemini 3 Flash	`google:gemini-3-flash`	$0.30	$2.50	100+ t/s	1M	Fast, very cheap

Where two prices are shown for Gemini 3 Pro, the second is the doubled rate: input pricing doubles beyond 200K tokens and output pricing doubles beyond 1M tokens.
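The effect of the pricing tiers on a single request can be sketched as follows. This uses the input rate and threshold from the table and note above. Whether the doubled rate applies to the whole request or only to the tokens past the threshold is not stated here, so the sketch assumes the whole request; check Google's pricing page before budgeting. The `tiered_cost` helper is a hypothetical name.

```python
def tiered_cost(tokens: int, base_rate: float, threshold: int) -> float:
    """Dollar cost for `tokens` at `base_rate` $/M, doubled once past `threshold`.

    Assumption: the doubled rate applies to the entire request, not only
    to the tokens above the threshold.
    """
    rate = base_rate * 2 if tokens > threshold else base_rate
    return tokens / 1_000_000 * rate

# Input rate ($2/M) and input threshold (200K tokens) from the table above
short_prompt = tiered_cost(150_000, base_rate=2.0, threshold=200_000)
long_prompt = tiered_cost(600_000, base_rate=2.0, threshold=200_000)

print(f"150K-token prompt: ${short_prompt:.2f}")
print(f"600K-token prompt: ${long_prompt:.2f}")
```

Under that assumption, quadrupling the prompt length costs eight times as much, which matters for agents that resend a long history on every step.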

Rating	Score
Tool calling reliability	★★★★☆
Reasoning quality	★★★★★ (Pro)
Speed	★★★★☆
Cost efficiency	★★★★☆ (Flash is excellent value)
Setup complexity	★★☆☆☆

When to use:

Gemini 3 Pro — Document analysis agents, codebase navigation, long-running conversations where the full history matters.
Gemini 3 Flash — High-throughput agent workers, real-time agents that need fast responses, multimodal tasks (image + text).

Open-source models

The gap between proprietary and open-source has narrowed dramatically in 2026. For self-hosted production deployments, open-source models are now genuinely competitive.

Qwen 3 family

The current leader in open-source agentic performance. Qwen3 models top the Berkeley Function Calling Leaderboard among open-weight models and ship under Apache 2.0 (commercially permissive).

Model	Provider String	Parameters	Hardware	Best For
Qwen3-Coder Next (80B MoE)	`ollama:qwen3-coder`	80B total / 3B active	1× A100 80GB or 2× RTX 4090	Coding agents, long-horizon reasoning
Qwen3-Coder-30B-A3B	`ollama:qwen3-coder-30b`	30B total / 3B active	1× RTX 4090	Best laptop/workstation option
Qwen3.5-397B-A17B	`ollama:qwen3-397b`	397B total / 17B active	4× A100 80GB	Top-tier self-hosted reasoning

Rating	Score
Tool calling reliability	★★★★★ (best open-source)
Reasoning quality	★★★★☆
Speed	★★★★☆ (MoE = fast inference)
Cost efficiency	★★★★★ (free + low compute via MoE)
Setup complexity	★★★☆☆ (requires GPU + Ollama/vLLM)

agent = await build_agent(
    model="ollama:qwen3-coder",  # local Ollama
    servers={...},
)

DeepSeek V3.2

The first open model to integrate "thinking mode" directly into tool use. Strong general reasoning, very good for autonomous agent workloads.

Model	Provider String	Parameters	Hardware	Best For
DeepSeek V3.2	`ollama:deepseek-v3.2`	671B total / 37B active	Multi-GPU cluster	Top open-source reasoning + tool use

Rating	Score
Tool calling reliability	★★★★★
Reasoning quality	★★★★★ (rivals GPT-4 Turbo)
Speed	★★★☆☆
Cost efficiency	★★★★☆ (free, but high compute)
Setup complexity	★★★★☆ (large model, complex serving)

GLM 4.7 / GLM 5

The strongest open-source competitor to Claude on agentic terminal tasks. It also offers hybrid reasoning modes (think/no-think).

Model	Provider String	Parameters	Best For
GLM-4.7 Flash	`ollama:glm-4.7-flash`	30B MoE	Lightweight agents, fast serving
GLM-5	`ollama:glm-5`	Larger MoE	Reasoning + coding + agent triple threat

Rating	Score
Tool calling reliability	★★★★★ (90.6% on tool use benchmarks)
Reasoning quality	★★★★★
Speed	★★★★☆
Cost efficiency	★★★★☆
Setup complexity	★★★☆☆

Llama 4 family

Meta's open-weight model family. Strong general performance and a 10M-token context window on Scout. Tool-calling reliability is lower than Qwen or DeepSeek, but ecosystem support is the broadest.

Model	Provider String	Parameters	Context	Best For
Llama 4 Scout	`ollama:llama4-scout`	17B active	10M	Massive context, document processing
Llama 4 Maverick	`ollama:llama4-maverick`	400B total	1M	General-purpose, multilingual

Rating	Score
Tool calling reliability	★★★☆☆ (improving but not best-in-class)
Reasoning quality	★★★★☆
Speed	★★★★☆
Cost efficiency	★★★★☆
Setup complexity	★★☆☆☆ (excellent ecosystem)

Mistral Large 3 / Medium 3.1

Apache 2.0, European, strong general capabilities. Mistral Medium 3.1 specifically is positioned as "Claude Sonnet quality at 8x lower cost" — true for many tasks.

Model	Provider String	Parameters	Best For
Mistral Large 3	`mistral:mistral-large-3`	675B / 41B active	Top-tier API-based open model
Mistral Medium 3.1	`mistral:mistral-medium-3.1`	~70B	Best price/performance in Mistral lineup
Mistral Small 4	`mistral:mistral-small-4`	~24B	25x cheaper than GPT-4o, capable

Rating	Score
Tool calling reliability	★★★★☆
Reasoning quality	★★★★☆
Speed	★★★★☆
Cost efficiency	★★★★★
Setup complexity	★☆☆☆☆ (API-based)

Lightweight local models

For laptop development, edge deployments, or privacy-sensitive workflows. Don't expect frontier performance — but they support real function calling on consumer hardware.

Model	Provider String	Parameters	Hardware	Best For
Phi-4-Mini	`ollama:phi-4-mini`	3.8B	M2 Mac, RTX 3060	Surprisingly strong tool calling on a laptop
Falcon 3	`ollama:falcon3`	7B / 10B	Consumer GPU	Native function calling, fast
Qwen3-Coder-30B-A3B	`ollama:qwen3-coder-30b`	30B / 3B active	M3 Max, RTX 4090	Best laptop option for coding agents

Rating	Score
Tool calling reliability	★★★☆☆
Reasoning quality	★★★☆☆
Speed	★★★★★ (small + local)
Cost efficiency	★★★★★ (zero per-token cost)
Setup complexity	★★☆☆☆ (Ollama install)

# Local development with Ollama
agent = await build_agent(
    model="ollama:phi-4-mini",
    servers={...},
)

Cost vs. quality cheat sheet

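Per-million-token costs as of April 2026, sorted by output cost from highest to lowest. Gemini 3 Pro is listed at its base rate, before the doubling described in its section above: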

Model	Input $/M	Output $/M	Tier
Claude Opus 4.6	$15	$75	Premium
GPT-5	$5	$20	Premium
Claude Sonnet 4.6	$3	$15	High
Gemini 3 Pro	$2	$12	High
GPT-5-mini	$1.25	$10	Mid
Claude Haiku 4.5	$1	$5	Mid
Gemini 3 Flash	$0.30	$2.50	Low
Mistral Medium 3.1	$0.40	$2	Low
GPT-5-nano	$0.30	$1.20	Low
Self-hosted (any open model)	$0	$0	Free + compute
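Per-million prices are easier to compare once they are turned into a cost per agent run. The sketch below does that arithmetic for a few rows of the table above. The workload numbers (50K input tokens, 5K output tokens, 1,000 runs per day) are made up for illustration; substitute your own measurements.

```python
# (input $/M, output $/M) copied from the table above
PRICES = {
    "Claude Opus 4.6": (15.00, 75.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5-mini": (1.25, 10.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5-nano": (0.30, 1.20),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one agent run for the given token counts."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Hypothetical workload: replace with token counts from your own traces
INPUT_TOKENS, OUTPUT_TOKENS, RUNS_PER_DAY = 50_000, 5_000, 1_000

for model in PRICES:
    per_run = run_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model:<18} ${per_run:.4f}/run  ${per_run * RUNS_PER_DAY:,.2f}/day")
```

Agents tend to be input-heavy because every step resends the conversation and tool results, so the input price often dominates the bill even though the output price is higher.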

Mixing models in one agent

This is where Promptise's Reasoning Engine shines. Use a cheap model for simple steps and an expensive one for hard reasoning — all in the same agent.

from promptise import build_agent
from promptise.config import HTTPServerSpec
from promptise.engine import PromptNode, ThinkNode, SynthesizeNode

agent = await build_agent(
    model="openai:gpt-5-mini",  # Default model for most nodes
    servers={"tools": HTTPServerSpec(url="...")},
    node_pool=[
        ThinkNode("think"),  # Uses default model
        PromptNode("research",
            inject_tools=True,
            model_override="anthropic:claude-haiku-4-5",  # Fast/cheap for tool calls
        ),
        PromptNode("deep_analysis",
            input_keys=["research_results"],
            model_override="anthropic:claude-opus-4-6",  # Strong model for hard reasoning
            max_tokens=8192,
        ),
        SynthesizeNode("final",
            model_override="openai:gpt-5-mini",  # Cheap for formatting
        ),
    ],
)

This pattern can cut LLM costs by 60-80% on complex agent workflows while maintaining quality where it matters.
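As a rough check on that range, the sketch below compares the mixed pipeline above with running every node on Claude Opus 4.6, using prices from the cheat sheet. The per-node token counts are invented for illustration, so the resulting percentage is only an example of the arithmetic, not a measured result.

```python
# (input $/M, output $/M) from the cheat sheet above
PRICES = {
    "openai:gpt-5-mini": (1.25, 10.00),
    "anthropic:claude-haiku-4-5": (1.00, 5.00),
    "anthropic:claude-opus-4-6": (15.00, 75.00),
}

# node -> (model, input_tokens, output_tokens); token counts are hypothetical
NODES = {
    "think": ("openai:gpt-5-mini", 2_000, 500),
    "research": ("anthropic:claude-haiku-4-5", 20_000, 2_000),
    "deep_analysis": ("anthropic:claude-opus-4-6", 6_000, 3_000),
    "final": ("openai:gpt-5-mini", 4_000, 1_000),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one node invocation."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

mixed = sum(cost(m, i, o) for m, i, o in NODES.values())
all_opus = sum(cost("anthropic:claude-opus-4-6", i, o) for _, i, o in NODES.values())
saving = 1 - mixed / all_opus

print(f"mixed: ${mixed:.4f}  all-Opus: ${all_opus:.4f}  saving: {saving:.0%}")
```

The saving depends mostly on how many tokens flow through the expensive node, so keep the input to `deep_analysis` as small as the task allows.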

How to evaluate a model for your use case

Don't trust benchmarks alone. The best test is your own workflow on a realistic task. Here's a quick eval template:

import asyncio
import time

from promptise import build_agent
from promptise.config import HTTPServerSpec

async def eval_model(model_id: str, task: str) -> dict:
    agent = await build_agent(
        model=model_id,
        servers={"tools": HTTPServerSpec(url="http://localhost:8080/mcp")},
        instructions="You are a research assistant.",
    )

    try:
        # perf_counter() is monotonic, so it is better suited to timing than time.time()
        start = time.perf_counter()
        result = await agent.ainvoke({"messages": [{"role": "user", "content": task}]})
        elapsed = time.perf_counter() - start
    finally:
        # Always shut the agent down, even if the call fails
        await agent.shutdown()

    # Each tool result comes back as one message of type "tool"
    tool_calls = [m for m in result["messages"] if getattr(m, "type", "") == "tool"]
    # The final answer is the last non-empty AI message
    response = next(
        (m.content for m in reversed(result["messages"])
         if getattr(m, "type", "") == "ai" and m.content),
        ""
    )

    return {
        "model": model_id,
        "time_seconds": elapsed,
        "tool_calls": len(tool_calls),
        "response_length": len(response),
        "response": response,
    }

# Test the same task across models
async def main():
    task = "Find the weather in Berlin, Tokyo, and NYC. Calculate the average temperature."
    for model in ["openai:gpt-5-mini", "anthropic:claude-sonnet-4-6", "google:gemini-3-flash"]:
        result = await eval_model(model, task)
        print(f"{model}: {result['time_seconds']:.1f}s, {result['tool_calls']} tools")

asyncio.run(main())

Run this with your real MCP servers. The script does not check correctness, so read each response yourself; among the models that answer correctly, prefer the one that needs the fewest tool calls and finishes fastest.
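One way to turn several eval runs into a ranking is sketched below. It sorts result dicts shaped like the ones `eval_model` returns, with an extra `correct` flag that you set by hand after checking each response. The flag, the `rank_results` helper, and the sample numbers are all hypothetical, and the tie-break order (correctness, then tool calls, then latency) is one reasonable choice rather than the only one.

```python
def rank_results(results: list[dict]) -> list[dict]:
    """Order results: correct answers first, then fewer tool calls, then lower latency."""
    return sorted(
        results,
        # False sorts before True, so `not correct` puts correct runs first
        key=lambda r: (not r["correct"], r["tool_calls"], r["time_seconds"]),
    )

# Made-up results; "correct" is a manual judgment added after reading each response
results = [
    {"model": "model-a", "time_seconds": 4.2, "tool_calls": 5, "correct": True},
    {"model": "model-b", "time_seconds": 6.0, "tool_calls": 3, "correct": True},
    {"model": "model-c", "time_seconds": 2.1, "tool_calls": 2, "correct": False},
]

for rank, r in enumerate(rank_results(results), start=1):
    print(f"{rank}. {r['model']}: correct={r['correct']}, "
          f"{r['tool_calls']} tools, {r['time_seconds']:.1f}s")
```

Running each model several times and ranking on averages gives a steadier picture, since a single run can be skewed by network latency or sampling variance.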

Tips and gotchas

Default to Claude Sonnet 4.6 for production

If you're shipping a real product, Claude Sonnet 4.6 is the safest choice in 2026. Best tool-calling reliability, near-Opus quality, predictable costs.

Default to GPT-5-mini for prototyping

For local development and iteration, GPT-5-mini is hard to beat. Fast, cheap, reliable enough to debug your agent logic without thinking about per-token costs.

Open-source isn't free

The model itself is free, but GPU hosting isn't. A 70B+ model needs a serious GPU (A100/H100). Factor in serving infrastructure, monitoring, and model updates before going self-hosted.
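A back-of-the-envelope estimate shows why. The memory needed just to hold the weights is the parameter count times the bytes per parameter; the sketch below works that out for a few precisions. It ignores the KV cache, activations, and serving overhead, so treat the results as a lower bound. The `weight_memory_gb` helper is a hypothetical name.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of params * bits / 8 bits-per-byte = billions of bytes = GB
    return params_billions * bits_per_param / 8

for label, params in [("70B dense", 70), ("30B MoE (3B active)", 30)]:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{label:<20} {bits:>2}-bit: {gb:6.1f} GB")
```

Note that a mixture-of-experts model still has to keep all of its parameters in memory; the smaller active count speeds up inference but does not shrink the footprint.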

Tool calling support varies by model

Some open-source models support tool calling natively (Qwen, DeepSeek, GLM, Phi-4). Others need prompt engineering to fake it. Check the model's docs before assuming it works with inject_tools=True.
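For models without native tool calling, the usual workaround is to ask for a JSON object in the prompt and parse it out of the reply. The sketch below shows that idea in isolation. It is not how Promptise implements `inject_tools`; the prompt wording, the `parse_tool_call` helper, and the `get_weather` tool name are all hypothetical.

```python
import json
import re

# Instruction you would add to the system prompt (wording is illustrative)
TOOL_PROMPT = (
    "To call a tool, reply with only a JSON object such as "
    '{"tool": "<name>", "arguments": {...}}.'
)

def parse_tool_call(text: str, known_tools: set[str]):
    """Extract (tool_name, arguments) from free-form model output, or return None."""
    # Grab everything from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    tool = call.get("tool") if isinstance(call, dict) else None
    # Reject calls to tools that do not exist
    if not isinstance(tool, str) or tool not in known_tools:
        return None
    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        return None
    return tool, arguments

reply = 'Sure! {"tool": "get_weather", "arguments": {"city": "Tokyo"}}'
print(parse_tool_call(reply, {"get_weather"}))
print(parse_tool_call("I think it is sunny.", {"get_weather"}))
```

This approach is fragile compared with native function calling: the model can wrap the JSON in prose, emit two objects, or invent a tool name, which is why the check against `known_tools` matters.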

Use semantic tool optimization for big tool sets

If your agent has 30+ tools, enable optimize_tools=True in build_agent(). This uses local embeddings to send only the relevant tools per query, cutting input tokens by 40-70% regardless of model.
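The general idea behind this kind of optimization can be sketched with the standard library alone: score each tool description against the query and send only the top matches. The example below uses word-count vectors as a toy stand-in for real embeddings. It is not Promptise's implementation, and the tool names and descriptions are made up.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_tools(query: str, tools: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the names of the top_k tools whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(tools, key=lambda name: cosine(q, embed(tools[name])), reverse=True)
    return ranked[:top_k]

# Hypothetical tool registry: name -> description
TOOLS = {
    "get_weather": "Get the current weather forecast and temperature for a city",
    "send_email": "Send an email message to a recipient",
    "query_database": "Run a SQL query against the sales database",
    "convert_currency": "Convert an amount between two currencies",
}

print(select_tools("What is the temperature in Berlin today?", TOOLS, top_k=1))
```

Real embeddings match on meaning rather than shared words, so they handle paraphrases that this toy version would miss, but the selection step works the same way.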