I benchmarked 11 LLMs on a 69-scenario tool-calling test suite. Intel/Qwen3.6-35B-A3B-int4-AutoRound delivered the best overall result: the top quality score (93/100) at a 2.0s median response time. The 35B model outperformed both 122B and 397B models in this test — size was not a proxy for capability.

Three findings worth highlighting:

- FP8 was not faster than int4: the 27B FP8 tied the 35B int4 on quality (93) but took 5.6× longer per response.
- Every model failed TC-60, a cross-turn prompt-injection scenario; none passed all three trials.
- Reliability gaps reveal what scores hide: the second-highest scorer also carried the largest cross-trial variance in the benchmark.

Methodology

69 scenarios across 16 capability categories, including tool selection, parameter precision, multi-step reasoning, error recovery, safety boundaries, and structured output. Three trials per scenario per model. The tool definition itself adds about 4,637 tokens to every prompt (52 tools, 18,548 characters).
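
That overhead is easy to verify against your own tool schema. A minimal sketch, assuming the definitions live in a tools.json file and using a Hugging Face tokenizer; the file name and model ID are illustrative, not taken from the benchmark:

```python
# Measure how many tokens a tool definition adds to every prompt.
# tools.json and the model ID are illustrative, not from the benchmark repo.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

with open("tools.json") as f:
    tools_text = f.read()

print("characters:", len(tools_text))                # 18,548 in this suite
print("tokens:", len(tokenizer.encode(tools_text)))  # ~4,637 in this suite
```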

Metrics:

- Score: quality on a 0–100 scale, aggregated across all trials.
- Speed: median response time, in seconds.
- Deployability: operational-fitness score; higher is easier to run in production.
- Reliability gap: cross-trial variance in percentage points; 0.0pp means every trial produced the same result.

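To make the aggregation concrete, here is a minimal sketch of the trial loop. The run_scenario callable is a stand-in for the real tool-eval-bench harness, and the real suite may weight scenarios differently; treat this as the shape of the computation, not the framework's code:

```python
# Shape of the trial loop: 69 scenarios x 3 trials per model.
# run_scenario is caller-supplied and stands in for the tool-eval-bench
# harness; it returns (passed: bool, latency_seconds: float) for one trial.
from statistics import median
from typing import Callable

def evaluate(model: str, scenarios: list, run_scenario: Callable, trials: int = 3):
    passes, latencies = [], []
    for scenario in scenarios:
        for _ in range(trials):
            passed, latency = run_scenario(model, scenario)
            passes.append(passed)
            latencies.append(latency)
    score = 100 * sum(passes) / len(passes)  # quality score on a 0-100 scale
    speed = median(latencies)                # median response time in seconds
    return score, speed
```
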
Hardware: tests ran on a four-node DGX Spark GB10 cluster. Results are GB10-specific — speeds on H100 or H200 will differ.

Results

| Model | Score | Speed (median) | Deployability | Reliability gap | Rating |
|---|---:|---:|---:|---:|:---:|
| Intel/Qwen3.6-35B-A3B-int4-AutoRound | 93.0 | 2.0s | 84 | 2.9pp | ★★★★★ |
| Qwen3.6-27B-FP8 | 93.0 | 11.3s | 69 | 0.0pp | ★★★★★ |
| qwen35-122b-hybrid-int4fp8 | 91.7 | 3.0s | 79 | 10.2pp | ★★★★★ |
| Qwen3.5-122B-A10B-FP8 | 90.0 | 6.8s | 70 | 0.0pp | ★★★★★ |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | 88.0 | 10.5s | 64 | 4.3pp | ★★★★ |
| Qwen3.6-27B-FP8 (alt config) | 88.0 | 16.0s | 64 | 0.0pp | ★★★★ |
| /models/glm-f16 | 82.3 | 5.7s | 66 | 7.3pp | ★★★★ |
| glm-64k:latest | 80.7 | 1.1s | 80 | 1.4pp | ★★★★ |
| qwen3:32b | 78.7 | 5.5s | 65 | 4.3pp | ★★★★ |
| gemma4:31b | 70.0 | 25.8s | 50 | 8.7pp | ★★★ |

Did not complete: qwen2.5vl:7b failed all 207 trials (69 scenarios × 3 trials) due to API configuration errors returning HTTP 400. This is a deployment issue, not a model capability finding.

Tiers

Tier 1: Balanced excellence (score ≥ 90, speed ≤ 7s)

Intel/Qwen3.6-35B-A3B-int4-AutoRound is the standout: 93 points at 2.0s. It posted the highest deployability score of any model tested (84) and a 2.9pp reliability gap that puts it firmly in production-grade territory. If I had to pick one model to deploy without further testing, this is it.

qwen35-122b-hybrid-int4fp8 at 91.7/3.0s looks like an obvious second choice on the headline numbers, but its 10.2pp reliability gap is the largest in the entire benchmark. The same scenarios produced different results across trials — not what you want from a production endpoint. I'd avoid this one until the cause is understood.

Qwen3.5-122B-A10B-FP8 at 90.0/6.8s with a 0.0pp reliability gap is the conservative pick: slower than the 35B, perfect cross-trial consistency.

Tier 2: Quality-first (score ≥ 90, speed ≥ 11s)

Qwen3.6-27B-FP8 ties the 35B for the top score (93) but takes 11.3s per response — 5.6× slower than the 35B int4 for identical quality. This is the result that surprised me most: FP8 precision did not produce faster inference than int4 quantization on this hardware. The 35B int4 wins on every metric except parameter count.

Tier 3: Speed-first (score 70–82, speed ≤ 6s)

glm-64k:latest leads on raw speed at 1.1s with 80.7 points and a 1.4pp reliability gap. Its deployability score (80) is the second-highest in the benchmark. For latency-sensitive applications where a 12-point quality drop is an acceptable trade for sub-2s responses, this is the pick.

qwen3:32b at 78.7/5.5s and /models/glm-f16 at 82.3/5.7s sit in the middle — neither fast enough to compete with glm-64k nor accurate enough to compete with the 35B int4. They aren't bad; they're just dominated.

Tier 4: Avoid

gemma4:31b took 25.8s per response with a 70/100 score. There is no use case I can construct where this is the right choice over the alternatives.

Intel/Qwen3.5-397B-A17B-int4-AutoRound is the model I expected to dominate. It didn't. 88 points at 10.5s — lower quality than the 35B and 5× slower. The 397B parameter count brings nothing on this benchmark that smaller models don't deliver more efficiently. This may say more about the benchmark scope than the model — tool-calling reliability is not what 397B-class models are typically optimised for — but if your workload looks anything like this, the maths doesn't favour the 397B.

Notable findings

FP8 was not faster than int4

The Qwen3.6-27B-FP8 took 11.3s. The Intel/Qwen3.6-35B-A3B-int4-AutoRound took 2.0s. Same family, same generation, identical headline score. The int4 model is 5.6× faster despite being larger.

This is counterintuitive if you assume "lower precision = lower compute = faster". On GB10 hardware, the AutoRound int4 path appears to use kernels that are substantially better optimised than the FP8 path. The 27B FP8 did post a perfect 0.0pp reliability gap, but the 35B int4's 2.9pp is already production-grade, so the speed cost buys very little in practice.

The pattern holds when comparing quantization variants within similar-family models:

| Model family | Quantization | Score | Speed | Score ÷ speed |
|---|---|---:|---:|---:|
| Qwen3.5-122B | hybrid-int4fp8 | 91.7 | 3.0s | 30.6 |
| Qwen3.5-122B | A10B-FP8 | 90.0 | 6.8s | 13.2 |
| Qwen3.6-27B | FP8 | 93.0 | 11.3s | 8.2 |

Score ÷ speed is a rough throughput-adjusted quality metric — higher means more quality per unit of inference time. The hybrid-int4fp8 variant of the 122B beats the pure FP8 variant by a factor of 2.3 on this metric, with higher quality as well. A single comparison doesn't prove a general rule about quantization methods, but it's enough to justify testing int4 variants before committing to FP8 as a default.

Practical takeaway: don't assume FP8 is the speed-optimal choice for your hardware. Benchmark against int4 with proper quantization (AutoRound, GPTQ, AWQ) before committing to a precision strategy.
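
A minimal way to run that comparison yourself, assuming both variants are served behind OpenAI-compatible chat endpoints; the URLs, model name, and probe prompt are illustrative:

```python
# Compare median latency for two quantization variants of the same model,
# assuming each is served behind an OpenAI-compatible /v1/chat/completions
# endpoint. URLs, model name, and the probe prompt are illustrative.
import time
from statistics import median

import requests

ENDPOINTS = {
    "int4-autoround": "http://node1:8000/v1/chat/completions",
    "fp8": "http://node2:8000/v1/chat/completions",
}
PAYLOAD = {
    "model": "default",
    "messages": [{"role": "user", "content": "List the tools you can call."}],
}

for name, url in ENDPOINTS.items():
    latencies = []
    for _ in range(10):  # small sample; the full benchmark ran 207 trials
        start = time.perf_counter()
        requests.post(url, json=PAYLOAD, timeout=120).raise_for_status()
        latencies.append(time.perf_counter() - start)
    print(f"{name}: median {median(latencies):.2f}s")
```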

Every model failed TC-60

TC-60 is a cross-turn sleeper injection scenario: an instruction is planted in earlier conversation context that triggers on a later turn. No model in the benchmark passed all three trials. Intel/Qwen3.6-35B was the closest, passing one of three.

This isn't a benchmark surprise — prompt injection resistance is an active research area — but it's worth saying explicitly: if your application processes content from untrusted sources (web pages, emails, documents, RAG retrievals), none of these models will reliably resist prompt injection on their own. You need defence in depth: input sanitisation, output validation, sandboxed execution, capability limits at the tool layer.
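
As a concrete illustration of the last of those, capability limits at the tool layer, a dispatcher can refuse anything outside an explicit allowlist before execution. This is a minimal sketch with hypothetical tool names, not a complete defence:

```python
# Tool-layer capability limit: validate the model's requested call against
# an explicit allowlist before executing anything. Tool names and the
# registry mapping are hypothetical.
ALLOWED_TOOLS = {"search_logs", "lookup_ioc"}  # read-only tools only

def dispatch(tool_call: dict, registry: dict):
    name = tool_call.get("name")
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not permitted in this context")
    return registry[name](**tool_call.get("arguments", {}))
```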

Reliability gaps reveal what scores hide

The qwen35-122b-hybrid-int4fp8 model scored 91.7 — second-highest in the benchmark — with a 10.2pp reliability gap. Same scenarios, different results across runs. A model that scores 95 once and 75 next time averages 85, which looks fine on a leaderboard and is unusable in production.

The 0.0pp models (Qwen3.5-122B FP8, Qwen3.6-27B FP8) tell the opposite story: every trial produced the same result. Slow but stable. For workflows where determinism matters more than peak performance, these have a real argument.

| Model | Reliability gap | Interpretation |
|---|---:|---|
| Qwen3.5-122B-A10B-FP8 | 0.0pp | Perfect consistency — no cross-trial variance |
| Qwen3.6-27B-FP8 | 0.0pp | Perfect consistency |
| glm-64k:latest | 1.4pp | Highly consistent |
| Intel/Qwen3.6-35B-A3B-int4 | 2.9pp | High consistency — production-viable |
| qwen35-122b-hybrid-int4fp8 | 10.2pp | Moderate variance — investigate before deploying |

Reliability gap is the metric I'd put alongside score whenever benchmarking models for production use. Score alone hides the variance.
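
If your framework doesn't report it directly, it's cheap to derive from per-trial results. A minimal sketch, assuming the gap is the spread between a model's best and worst complete trial runs; that max-minus-min definition is my reading of the metric, not something the benchmark confirms:

```python
# Reliability gap as the spread between a model's best and worst trial
# runs, in percentage points. The max-minus-min definition here is an
# assumption, not the benchmark's documented formula.
def reliability_gap(trial_scores: list[float]) -> float:
    """trial_scores: one 0-100 score per complete pass over the scenarios."""
    return max(trial_scores) - min(trial_scores)

# A model that scores 95 on one run and 75 on the next averages 85
# but carries a 20pp gap:
print(reliability_gap([95.0, 75.0]))  # 20.0
```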

Recommendations

For most use cases: Intel/Qwen3.6-35B-A3B-int4-AutoRound. Best deployability score, best speed-to-quality ratio, reliability gap small enough to be production-viable.

For latency-critical work (sub-2s response required): glm-64k:latest, accepting the 12-point quality drop.

For deterministic workflows where cross-trial consistency matters more than peak score: Qwen3.5-122B-A10B-FP8, accepting the 6.8s response time.

Between them, those three configurations cover the space. The 397B and the 122B-hybrid both have problems (performance and reliability, respectively) that disqualify them as default picks regardless of headline numbers.

What this means for our stack

I run a four-node DGX Spark cluster doing DFIR and OSINT work for regulated EMEA clients. Tool calling is the foundation of nearly every workflow I build — OSINT recon dispatch, evidence extraction, MITRE ATT&CK mapping, structured triage. These benchmark results have direct production implications.


Benchmark date: 22 April 2026. Framework: tool-eval-bench. Hardware: 4× NVIDIA DGX Spark GB10. 69 scenarios, 16 categories, 3 trials per scenario per model. 52-tool definition (~4,637 tokens) included in every prompt.