Skip to content

Best LLMs for Agentic Use Cases

A practical, opinionated guide to choosing the right model for your Promptise agent. Covers proprietary and open-source options, real-world ratings, and which model fits which job.

Last updated: April 2026

The agentic LLM landscape moves fast. This page reflects the state of available models as of April 2026. We update it as new models ship.


What makes a good agentic model?

Picking a chat model for an agent is different from picking one for a chatbot. An agent needs to:

  1. Call tools reliably — invoke functions with the right arguments, in the right order, without hallucinating signatures
  2. Reason across steps — chain multiple tool calls toward a goal, recover from errors, decide when to stop
  3. Follow instructions precisely — respect role definitions, system prompts, output schemas
  4. Handle structured output — return valid JSON when asked, match Pydantic schemas
  5. Stay within budget — call only the tools that matter, not 40 in a row

Models that score well on chat benchmarks (MMLU, HellaSwag) often fail on agentic ones. The benchmarks that matter for agents are:

Benchmark What it measures
Berkeley Function Calling Leaderboard (BFCL) Tool-call accuracy, parameter extraction, multi-step composition
τ²-Bench Realistic multi-turn agent tasks across customer service domains
Terminal-Bench Real terminal/coding agent tasks end-to-end
IFBench Instruction-following reliability under structured constraints
ToolComp Composing dependent tool calls toward a goal

Quick recommendations

If you don't want to read the whole page, here's where to start:

Your situation Use this model
Just starting, want it to work openai:gpt-5-mini — best balance of cost, speed, reliability
Production agent, need maximum reliability anthropic:claude-sonnet-4.6 — best tool calling in 2026
Maximum reasoning quality, cost no object anthropic:claude-opus-4.6 or openai:gpt-5
Multi-million token context google:gemini-3-pro — 2M token window
Self-hosted, no data leaves your infra ollama:qwen3-coder or ollama:deepseek-v3.2
Local laptop dev ollama:phi-4-mini or ollama:qwen3-coder-30b-a3b
Hybrid: cheap for simple steps, strong for hard ones Use Promptise's per-node model_override in the Reasoning Engine

Proprietary models

Anthropic Claude 4.6 family

The current state-of-the-art for agentic workloads. Claude 4.6 leads tool-calling benchmarks and is the most reliable choice for production agents that orchestrate complex multi-tool workflows.

Model Provider String Input $/M Output $/M Speed Context Best For
Claude Opus 4.6 anthropic:claude-opus-4-6 $15 $75 20-30 t/s 1M Maximum quality, hardest tasks
Claude Sonnet 4.6 anthropic:claude-sonnet-4-6 $3 $15 40-60 t/s 1M Production agents, best value
Claude Haiku 4.5 anthropic:claude-haiku-4-5 $1 $5 80+ t/s 200K Fast, simple agent tasks
Rating Score
Tool calling reliability ★★★★★
Reasoning quality ★★★★★ (Opus) / ★★★★☆ (Sonnet)
Speed ★★★☆☆ (Opus) / ★★★★☆ (Sonnet) / ★★★★★ (Haiku)
Cost efficiency ★★☆☆☆ (Opus) / ★★★★☆ (Sonnet) / ★★★★★ (Haiku)
Setup complexity ★☆☆☆☆ (just an API key)

When to use each:

  • Opus 4.6 — Long-horizon agents, complex multi-agent orchestration, agents that need to debug their own mistakes. Use when getting the right answer matters more than the cost per call.
  • Sonnet 4.6 — The default choice for production. Near-Opus quality at 5x lower cost. Best tool-calling/cost ratio in the market right now.
  • Haiku 4.5 — Fast classification, routing, simple tool calls. Pair it with Sonnet via the Reasoning Engine for cost-optimized pipelines.
from promptise import build_agent

agent = await build_agent(
    model="anthropic:claude-sonnet-4-6",
    servers={"tools": HTTPServerSpec(url="http://localhost:8080/mcp")},
    instructions="You are a senior analyst.",
)

OpenAI GPT-5 family

GPT-5.2 currently leads several agentic benchmarks alongside Claude. GPT-5-mini is the best entry-level model for cost-conscious agentic work.

Model Provider String Input $/M Output $/M Speed Context Best For
GPT-5 openai:gpt-5 $5 $20 30-50 t/s 1M Top-tier reasoning, parity with Claude Opus
GPT-5-mini openai:gpt-5-mini $1.25 $10 60-80 t/s 400K Best entry point — fast, cheap, capable
GPT-5-nano openai:gpt-5-nano $0.30 $1.20 100+ t/s 128K High-volume, simple tasks
Rating Score
Tool calling reliability ★★★★★
Reasoning quality ★★★★★ (GPT-5) / ★★★★☆ (mini)
Speed ★★★★☆
Cost efficiency ★★★★☆ (mini is exceptional value)
Setup complexity ★☆☆☆☆

When to use each:

  • GPT-5 — When you need the absolute best on coding and reasoning. Strong native function calling. Excellent for complex agentic workflows.
  • GPT-5-mini — Promptise's recommended default for getting started. Great tool calling, fast, cheap enough to iterate freely.
  • GPT-5-nano — Use for high-volume agent fleets where each call is simple (classification, routing, simple lookups).
agent = await build_agent(
    model="openai:gpt-5-mini",
    servers={"tools": HTTPServerSpec(url="...")},
)

Google Gemini 3 family

Gemini 3 Pro's massive 2M token context window is the killer feature. If your agent needs to read entire codebases, full document libraries, or hours of conversation history, Gemini is the only option.

Model Provider String Input $/M Output $/M Speed Context Best For
Gemini 3 Pro google:gemini-3-pro $2 / $4 $12 / $24 40-60 t/s 2M Massive context, multimodal
Gemini 3 Flash google:gemini-3-flash $0.30 $2.50 100+ t/s 1M Fast, very cheap

Gemini 3 Pro pricing doubles beyond 200K tokens (input) and 1M tokens (output)

Rating Score
Tool calling reliability ★★★★☆
Reasoning quality ★★★★★ (Pro)
Speed ★★★★☆
Cost efficiency ★★★★☆ (Flash is excellent value)
Setup complexity ★★☆☆☆

When to use:

  • Gemini 3 Pro — Document analysis agents, codebase navigation, long-running conversations where the full history matters.
  • Gemini 3 Flash — High-throughput agent workers, real-time agents that need fast responses, multimodal tasks (image + text).

Open-source models

The gap between proprietary and open-source has narrowed dramatically in 2026. For self-hosted production deployments, open-source models are now genuinely competitive.

Qwen 3 family

The current leader in open-source agentic performance. Qwen3 models top the Berkeley Function Calling Leaderboard among open weights and ship under Apache 2.0 (commercially permissive).

Model Provider String Parameters Hardware Best For
Qwen3-Coder Next (80B MoE) ollama:qwen3-coder 80B total / 3B active 1× A100 80GB or 2× RTX 4090 Coding agents, long-horizon reasoning
Qwen3-Coder-30B-A3B ollama:qwen3-coder-30b 30B total / 3B active 1× RTX 4090 Best laptop/workstation option
Qwen3.5-397B-A17B ollama:qwen3-397b 397B total / 17B active 4× A100 80GB Top-tier self-hosted reasoning
Rating Score
Tool calling reliability ★★★★★ (best open-source)
Reasoning quality ★★★★☆
Speed ★★★★☆ (MoE = fast inference)
Cost efficiency ★★★★★ (free + low compute via MoE)
Setup complexity ★★★☆☆ (requires GPU + Ollama/vLLM)
agent = await build_agent(
    model="ollama:qwen3-coder",  # local Ollama
    servers={...},
)

DeepSeek V3.2

The first open model to integrate "thinking mode" directly into tool use. Strong general reasoning, very good for autonomous agent workloads.

Model Provider String Parameters Hardware Best For
DeepSeek V3.2 ollama:deepseek-v3.2 671B total / 37B active Multi-GPU cluster Top open-source reasoning + tool use
Rating Score
Tool calling reliability ★★★★★
Reasoning quality ★★★★★ (rivals GPT-4 Turbo)
Speed ★★★☆☆
Cost efficiency ★★★★☆ (free, but high compute)
Setup complexity ★★★★☆ (large model, complex serving)

GLM 4.7 / GLM 5

The strongest open-source competitor to Claude on agentic terminal tasks. Hybrid reasoning modes (think/no-think) are unique among open models.

Model Provider String Parameters Best For
GLM-4.7 Flash ollama:glm-4.7-flash 30B MoE Lightweight agents, fast serving
GLM-5 ollama:glm-5 Larger MoE Reasoning + coding + agent triple threat
Rating Score
Tool calling reliability ★★★★★ (90.6% on tool use benchmarks)
Reasoning quality ★★★★★
Speed ★★★★☆
Cost efficiency ★★★★☆
Setup complexity ★★★☆☆

Llama 4 family

Meta's open model. Strong general performance and a 10M token context window on Scout. Lower tool-calling reliability than Qwen/DeepSeek but the most ecosystem support.

Model Provider String Parameters Context Best For
Llama 4 Scout ollama:llama4-scout 17B active 10M Massive context, document processing
Llama 4 Maverick ollama:llama4-maverick 400B total 1M General-purpose, multilingual
Rating Score
Tool calling reliability ★★★☆☆ (improving but not best-in-class)
Reasoning quality ★★★★☆
Speed ★★★★☆
Cost efficiency ★★★★☆
Setup complexity ★★☆☆☆ (excellent ecosystem)

Mistral Large 3 / Medium 3.1

Apache 2.0, European, strong general capabilities. Mistral Medium 3.1 specifically is positioned as "Claude Sonnet quality at 8x lower cost" — true for many tasks.

Model Provider String Parameters Best For
Mistral Large 3 mistral:mistral-large-3 675B / 41B active Top-tier API-based open model
Mistral Medium 3.1 mistral:mistral-medium-3.1 ~70B Best price/performance in Mistral lineup
Mistral Small 4 mistral:mistral-small-4 ~24B 25x cheaper than GPT-4o, capable
Rating Score
Tool calling reliability ★★★★☆
Reasoning quality ★★★★☆
Speed ★★★★☆
Cost efficiency ★★★★★
Setup complexity ★☆☆☆☆ (API-based)

Lightweight local models

For laptop development, edge deployments, or privacy-sensitive workflows. Don't expect frontier performance — but they support real function calling on consumer hardware.

Model Provider String Parameters Hardware Best For
Phi-4-Mini ollama:phi-4-mini 14B M2 Mac, RTX 3060 Surprisingly strong tool calling on a laptop
Falcon 3 ollama:falcon3 7B / 10B Consumer GPU Native function calling, fast
Qwen3-Coder-30B-A3B ollama:qwen3-coder-30b 30B / 3B active M3 Max, RTX 4090 Best laptop option for coding agents
Rating Score
Tool calling reliability ★★★☆☆
Reasoning quality ★★★☆☆
Speed ★★★★★ (small + local)
Cost efficiency ★★★★★ (zero per-token cost)
Setup complexity ★★☆☆☆ (Ollama install)
# Local development with Ollama
agent = await build_agent(
    model="ollama:phi-4-mini",
    servers={...},
)

Cost vs. quality cheat sheet

Per-million-token costs as of April 2026, sorted by output cost:

Model Input $/M Output $/M Tier
Claude Opus 4.6 $15 $75 Premium
GPT-5 $5 $20 Premium
Claude Sonnet 4.6 $3 $15 High
Gemini 3 Pro $2 $12 High
GPT-5-mini $1.25 $10 Mid
Claude Haiku 4.5 $1 $5 Mid
Mistral Medium 3.1 $0.40 $2 Low
Gemini 3 Flash $0.30 $2.50 Low
GPT-5-nano $0.30 $1.20 Low
Self-hosted (any open model) $0 $0 Free + compute

Mixing models in one agent

This is where Promptise's Reasoning Engine shines. Use a cheap model for simple steps and an expensive one for hard reasoning — all in the same agent.

from promptise import build_agent
from promptise.engine import PromptNode, ThinkNode, SynthesizeNode

agent = await build_agent(
    model="openai:gpt-5-mini",  # Default model for most nodes
    servers={"tools": HTTPServerSpec(url="...")},
    node_pool=[
        ThinkNode("think"),  # Uses default model
        PromptNode("research",
            inject_tools=True,
            model_override="anthropic:claude-haiku-4-5",  # Fast/cheap for tool calls
        ),
        PromptNode("deep_analysis",
            input_keys=["research_results"],
            model_override="anthropic:claude-opus-4-6",  # Strong model for hard reasoning
            max_tokens=8192,
        ),
        SynthesizeNode("final",
            model_override="openai:gpt-5-mini",  # Cheap for formatting
        ),
    ],
)

This pattern can cut LLM costs by 60-80% on complex agent workflows while maintaining quality where it matters.


How to evaluate a model for your use case

Don't trust benchmarks alone. The best test is your own workflow on a realistic task. Here's a quick eval template:

import asyncio
from promptise import build_agent
from promptise.config import HTTPServerSpec

async def eval_model(model_id: str, task: str) -> dict:
    agent = await build_agent(
        model=model_id,
        servers={"tools": HTTPServerSpec(url="http://localhost:8080/mcp")},
        instructions="You are a research assistant.",
    )

    import time
    start = time.time()
    result = await agent.ainvoke({"messages": [{"role": "user", "content": task}]})
    elapsed = time.time() - start

    tool_calls = [m for m in result["messages"] if getattr(m, "type", "") == "tool"]
    response = next(
        (m.content for m in reversed(result["messages"])
         if getattr(m, "type", "") == "ai" and m.content),
        ""
    )

    await agent.shutdown()
    return {
        "model": model_id,
        "time_seconds": elapsed,
        "tool_calls": len(tool_calls),
        "response_length": len(response),
        "response": response,
    }

# Test the same task across models
async def main():
    task = "Find the weather in Berlin, Tokyo, and NYC. Calculate the average temperature."
    for model in ["openai:gpt-5-mini", "anthropic:claude-sonnet-4-6", "google:gemini-3-flash"]:
        result = await eval_model(model, task)
        print(f"{model}: {result['time_seconds']:.1f}s, {result['tool_calls']} tools")

asyncio.run(main())

Run this with your real MCP servers. The model that uses fewer tool calls and produces a correct answer fastest is the right one for your workload.


Tips and gotchas

Default to Claude Sonnet 4.6 for production

If you're shipping a real product, Claude Sonnet 4.6 is the safest choice in 2026. Best tool-calling reliability, near-Opus quality, predictable costs.

Default to GPT-5-mini for prototyping

For local development and iteration, GPT-5-mini is hard to beat. Fast, cheap, reliable enough to debug your agent logic without thinking about per-token costs.

Open-source isn't free

The model itself is free, but GPU hosting isn't. A 70B+ model needs a serious GPU (A100/H100). Factor in serving infrastructure, monitoring, and model updates before going self-hosted.

Tool calling support varies by model

Some open-source models support tool calling natively (Qwen, DeepSeek, GLM, Phi-4). Others need prompt engineering to fake it. Check the model's docs before assuming it works with inject_tools=True.

Use semantic tool optimization for big tool sets

If your agent has 30+ tools, enable optimize_tools=True in build_agent(). This uses local embeddings to send only the relevant tools per query, cutting input tokens by 40-70% regardless of model.


Sources