Model Benchmarking

When building agentic workflows with resolute-agent, the choice of LLM model directly impacts report quality, execution cost, and reliability. This guide covers how to benchmark models and make informed production decisions.

Why Benchmark

Agentic workflows differ from simple prompt/response use cases. The model must:

  1. Use tools correctly — construct valid arguments for function calls across many tools
  2. Follow instructions — use provided data (IDs, names) from the prompt rather than asking for them
  3. Plan multi-step sequences — decide which tools to call and in what order
  4. Synthesize across sources — combine data from multiple tool responses into coherent output
  5. Recognize patterns — identify recurring issues, historical trends, and correlations

A model that excels at chat may fail at tool orchestration. Benchmarking against your actual workflow is the only reliable way to evaluate.

Methodology

1. Pick a Representative Input

Choose a real input that exercises the full workflow. For incident review, use an actual incident ID that has:

  • Multiple alerts and log entries
  • Historical precedent (past incidents on the same service)
  • Related tickets and documentation
  • Slack channel discussion

2. Hold Everything Constant

When comparing models, only the LLM configuration should change:

# Switch model via env var
kubectl set env deploy/my-agent-flow \
  AGENT_PROVIDER_TYPE=<anthropic|ollama> \
  AGENT_BASE_URL=<url> \
  AGENT_MODEL=<model>

kubectl rollout status deployment my-agent-flow

Keep the same:

  • System prompt
  • MCP tool configuration and tool count
  • Cost limits and max iterations
  • Compaction threshold
  • Input data (same incident/trigger)
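One lightweight way to keep runs attributable is to snapshot the configuration that varied alongside each run's output. A minimal sketch (the `AGENT_*` names follow the env vars above; the `.env` file layout is an assumption, not part of resolute-agent):

```shell
# Record the varying configuration for this run, so results stay attributable.
run_id="run-$(date +%Y%m%d-%H%M%S)"
{
  echo "provider=${AGENT_PROVIDER_TYPE:-unset}"
  echo "model=${AGENT_MODEL:-unset}"
} > "${run_id}.env"
echo "Recorded configuration in ${run_id}.env"
```

Everything not listed in the recorded file should be identical across runs.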

3. Trigger and Collect

Trigger the workflow and extract metrics from Temporal:

# Trigger via webhook
curl -X POST http://localhost:8080/hooks/trigger \
  -H "Content-Type: application/json" \
  -d '{"event": {...}}'

# Extract metrics from the agent child workflow
temporal workflow show --workflow-id <id>/agent-<name> \
  --namespace default --output json

The agent child workflow output contains all metrics needed for comparison.

Metrics to Compare

Quantitative

Metric                  What It Tells You
Iterations              How efficiently the model plans — fewer iterations = better planning
Input tokens            Total context consumed — affects cost and compaction frequency
Output tokens           Response verbosity
Cost (USD)              Direct API cost per run
Duration                Wall-clock time including tool calls
Tool calls              Total tools invoked — compare against expected minimum
Tool errors             Errors from tools (distinguish model errors from API limitations)
Compaction triggered    Whether context exceeded threshold — indicates context efficiency

Qualitative

Score each model’s output on:

Criterion             What to Look For
Correctness           Did it identify the right entity/issue?
Completeness          Did it gather data from all available sources?
Actionability         Are the recommendations specific and useful?
Pattern recognition   Did it find historical patterns and correlations?
Format compliance     Does the output match the requested structure?

Capability Requirements by Workflow Type

Different agentic workflows have different model requirements:

Workflow Type            Tool Count   Key Requirement                     Minimum Model Class
Simple Q&A with tools    5-10         Basic function calling              14B+ local, Haiku-class API
Multi-source synthesis   20-50        Instruction following + synthesis   32B+ local, Sonnet-class API
Complex orchestration    50-100+      Planning + tool use + synthesis     70B+ local, Sonnet/Opus-class API

Common Failure Modes

Model ignores provided data

The model asks “which incident should I investigate?” instead of using the ID from the prompt. This indicates insufficient instruction-following capability for the tool count.

Model doesn’t support tool use

Some local models lack function calling support entirely. Ollama will return a 400 error. Check model compatibility before benchmarking.

Excessive iterations

The model calls tools one at a time instead of planning a sequence, resulting in 2-3x more iterations than necessary. The data quality may still be acceptable, but cost and latency increase.

Context overflow without compaction

Without compaction enabled, large tool responses (e.g., 72 Atlassian tools with full schemas) can exhaust the context window. Always configure compaction for workflows with many tools.

Token Efficiency Analysis

Compare input tokens per iteration across models:

Tokens/iteration = Total input tokens / Iterations

Lower tokens/iteration with the same output quality indicates better planning — the model is making fewer, more targeted tool calls. However, extremely low token counts may indicate the model is skipping important data sources.
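The ratio above can be computed directly from the collected metrics. A quick sketch with awk (model names and token counts are illustrative, not benchmark results):

```shell
# Compute input tokens per iteration for each benchmarked run.
# Columns: model, total input tokens, iterations (illustrative numbers).
printf '%s\n' \
  'model-a 180000 12' \
  'model-b 240000 20' |
awk '{ printf "%s: %.0f tokens/iteration\n", $1, $2 / $3 }'
```

Here model-b consumes fewer tokens per iteration (12000 vs 15000), which suggests more targeted tool calls — but check output quality before concluding it plans better.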

Track PerTurnInputTokens from the agent output to identify context growth patterns:

  • Linear growth: Normal — each iteration adds tool results
  • Sudden spikes (>2x): Large tool response or redundant calls
  • Flat after compaction: Compaction working as expected
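The spike rule above (>2x turn-over-turn growth) is easy to check mechanically. A sketch over an illustrative PerTurnInputTokens series:

```shell
# Flag turns where input tokens more than doubled versus the previous turn.
# One per-turn input token count per line (illustrative values).
printf '%s\n' 4000 5200 6100 14000 14500 |
awk 'NR > 1 && $1 > 2 * prev { printf "turn %d: spike (%d -> %d tokens)\n", NR, prev, $1 }
     { prev = $1 }'
```

Flagged turns are where to look for an oversized tool response or a redundant call.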

Model Categories

API Models (Anthropic, OpenAI)

Pros: Best quality, no infrastructure, consistent availability
Cons: Per-token cost, data leaves your network, rate limits

Best for: Production workloads where quality matters more than cost.

Cloud-Routed Models (Ollama cloud tags)

Pros: Zero direct cost, good quality on large models
Cons: Routes through third-party infrastructure, rate limits, not truly self-hosted

Best for: Development, benchmarking, cost-constrained environments without data sovereignty requirements.

Local Models (Ollama local)

Pros: No API cost, data stays on-premises, no rate limits
Cons: Requires GPU infrastructure, quality varies significantly with parameter count

Best for: Air-gapped environments, high-volume low-cost processing, data sovereignty requirements.

Reproducing Benchmarks

A complete benchmark run:

# 1. Deploy with target model
kubectl set env deploy/my-agent-flow \
  AGENT_PROVIDER_TYPE=anthropic \
  AGENT_MODEL=claude-sonnet-4-6
kubectl rollout status deployment my-agent-flow

# 2. Port-forward
kubectl port-forward deploy/my-agent-flow 9090:8080 &

# 3. Trigger with test input
curl -X POST http://localhost:9090/hooks/trigger \
  -H "Content-Type: application/json" \
  -d @test-payload.json

# 4. Wait for completion, then extract metrics
temporal workflow show \
  --workflow-id <workflow-id>/agent-<name> \
  --namespace default --output json

# 5. Repeat steps 1-4 for each model

Save the workflow output JSON for each run. The output contains: iterations, input_tokens, output_tokens, total_cost, duration, verdict, per_turn_input_tokens, tokens_saved_by_compact, tool_calls, and response.
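Once each run's output JSON is saved (one file per model), the headline numbers can be pulled into a single comparison. A sketch assuming the metrics are saved as flat JSON files named `run-*.json` with the fields listed above (the file naming is an assumption):

```shell
# Print one summary line per saved benchmark run (skips if no files present).
for f in run-*.json; do
  [ -e "$f" ] || continue
  python3 -c '
import json, sys
d = json.load(open(sys.argv[1]))
print("%s: iterations=%s input_tokens=%s cost=$%s" %
      (sys.argv[1], d["iterations"], d["input_tokens"], d["total_cost"]))
' "$f"
done
```

The summary lines make it easy to spot outliers before doing the qualitative scoring.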

Decision Framework

Is report quality critical?
├── Yes → Claude Sonnet 4.6 (or Opus for maximum quality)
└── No
    ├── Is cost the primary constraint?
    │   ├── Yes, and cloud API is OK → Claude Haiku 4.5
    │   └── Yes, and must be self-hosted → Qwen3.5 32B+ (test first)
    └── Is data sovereignty required?
        ├── Yes → Local model with sufficient parameters (test 32B+)
        └── No → Claude Haiku 4.5 (best quality/cost ratio)

See Also