I benchmarked 11 LLMs on a 69-scenario tool-calling test suite. Intel/Qwen3.6-35B-A3B-int4-AutoRound delivered the best overall result: the top quality score (93/100) at a 2.0s median response time. The 35B model outperformed both 122B and 397B models in this test — size was not a proxy for capability.
Three findings worth highlighting:
- A 35B quantized model beat both 122B and 397B models on quality and speed.
- FP8 precision was not consistently faster than int4 quantization — the 27B FP8 model took 11.3s while the 35B int4 model took 2.0s.
- Every model failed TC-60 (Cross-Turn Sleeper Injection). Prompt injection resistance is unsolved in production-grade open-weight models.
Methodology
69 scenarios across 16 capability categories: tool selection, parameter precision, multi-step reasoning, error recovery, safety boundaries, structured output. Three trials per scenario per model. The tool definition itself adds about 4,637 tokens to every prompt (52 tools, 18,548 characters).
Metrics:
- Score: overall performance, 0–100.
- Median turn: median time to generate one response.
- Deployability: weighted score combining quality (70%) and responsiveness (30%).
- Pass@3: best result across the three trials — the model's capability ceiling.
- Pass^3: the result requiring all three trials to pass — the reliability floor.
- Reliability gap (pp): percentage-point difference between Pass@3 and Pass^3. Lower is better; large gaps mean unreliable cross-trial behaviour.
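The metrics above can be sketched in code. This is an illustrative reconstruction, not code from tool-eval-bench itself; in particular, the 0–100 responsiveness scale feeding deployability is an assumption.

```python
# Illustrative reconstruction of the report's metrics. The deployability
# weighting follows the 70/30 split above; the responsiveness scale (0-100)
# is an assumption, not taken from tool-eval-bench.

def pass_at_3(trials):
    """Ceiling: fraction of scenarios where at least one of three trials passed."""
    return sum(any(t) for t in trials) / len(trials)

def pass_all_3(trials):
    """Floor: fraction of scenarios where all three trials passed."""
    return sum(all(t) for t in trials) / len(trials)

def reliability_gap_pp(trials):
    """Percentage-point gap between the capability ceiling and reliability floor."""
    return 100 * (pass_at_3(trials) - pass_all_3(trials))

def deployability(quality, responsiveness):
    """Weighted 70% quality / 30% responsiveness, both on a 0-100 scale."""
    return 0.7 * quality + 0.3 * responsiveness

# 69 scenarios: 67 always pass, one is flaky, one always fails.
trials = [[True, True, True]] * 67 + [[True, False, True], [False, False, False]]
```

Note how one flaky scenario out of 69 already opens a ~1.4pp gap: the reliability-gap metric is deliberately sensitive to non-determinism that an averaged score would smooth over.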
Hardware: tests ran on a four-node DGX Spark GB10 cluster. Results are GB10-specific — speeds on H100 or H200 will differ.
Results
| Model | Score | Speed (median) | Deployability | Reliability gap | Rating |
|---|---|---|---|---|---|
| Intel/Qwen3.6-35B-A3B-int4-AutoRound | 93.0 | 2.0s | 84 | 2.9pp | ★★★★★ |
| Qwen3.6-27B-FP8 | 93.0 | 11.3s | 69 | 0.0pp | ★★★★★ |
| qwen35-122b-hybrid-int4fp8 | 91.7 | 3.0s | 79 | 10.2pp | ★★★★★ |
| Qwen3.5-122B-A10B-FP8 | 90.0 | 6.8s | 70 | 0.0pp | ★★★★★ |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | 88.0 | 10.5s | 64 | 4.3pp | ★★★★ |
| Qwen3.6-27B-FP8 (alt config) | 88.0 | 16.0s | 64 | 0.0pp | ★★★★ |
| /models/glm-f16 | 82.3 | 5.7s | 66 | 7.3pp | ★★★★ |
| glm-64k:latest | 80.7 | 1.1s | 80 | 1.4pp | ★★★★ |
| qwen3:32b | 78.7 | 5.5s | 65 | 4.3pp | ★★★★ |
| gemma4:31b | 70.0 | 25.8s | 50 | 8.7pp | ★★★ |
Did not complete: qwen2.5vl:7b failed all 207 trials (69 scenarios × 3 trials) due to API configuration errors returning HTTP 400. This is a deployment issue, not a model capability finding.
Tiers
Tier 1: Balanced excellence (score ≥ 90, speed ≤ 7s)
Intel/Qwen3.6-35B-A3B-int4-AutoRound is the standout: 93 points at 2.0s. It posted the highest deployability score of any model tested (84) and a 2.9pp reliability gap that puts it firmly in production-grade territory. If I had to pick one model to deploy without further testing, this is it.
qwen35-122b-hybrid-int4fp8 at 91.7/3.0s looks like an obvious second choice on the headline numbers, but its 10.2pp reliability gap is the largest in the entire benchmark. The same scenarios produced different results across trials — not what you want from a production endpoint. I'd avoid this one until the cause is understood.
Qwen3.5-122B-A10B-FP8 at 90.0/6.8s with a 0.0pp reliability gap is the conservative pick: slower than the 35B, perfect cross-trial consistency.
Tier 2: Quality-first (score ≥ 90, speed ≥ 11s)
Qwen3.6-27B-FP8 ties the 35B for the top score (93) but takes 11.3s per response — 5.6× slower than the 35B int4 for identical quality. This is the result that surprised me most: FP8 precision did not produce faster inference than int4 quantization on this hardware. The 35B int4 wins on every metric except parameter count.
Tier 3: Speed-first (score 70–83, speed ≤ 6s)
glm-64k:latest leads on raw speed at 1.1s with 80.7 points and a 1.4pp reliability gap. Its deployability score (80) is the second-highest in the benchmark. For latency-sensitive applications where 12 points of quality is an acceptable trade for sub-2s response, this is the pick.
qwen3:32b at 78.7/5.5s and /models/glm-f16 at 82.3/5.7s sit in the middle — neither fast enough to compete with glm-64k nor accurate enough to compete with the 35B int4. They aren't bad; they're just dominated.
Tier 4: Avoid
gemma4:31b took 25.8s per response with a 70/100 score. There is no use case I can construct where this is the right choice over the alternatives.
Intel/Qwen3.5-397B-A17B-int4-AutoRound is the model I expected to dominate. It didn't. 88 points at 10.5s — lower quality than the 35B and 5× slower. The 397B parameter count brings nothing on this benchmark that smaller models don't deliver more efficiently. This may say more about the benchmark scope than the model — tool-calling reliability is not what 397B-class models are typically optimised for — but if your workload looks anything like this, the maths doesn't favour the 397B.
Notable findings
FP8 was not faster than int4
The Qwen3.6-27B-FP8 took 11.3s. The Intel/Qwen3.6-35B-A3B-int4-AutoRound took 2.0s. Same family, similar generation, identical headline score. The int4 model is 5.6× faster despite being larger.
This is counterintuitive if you assume "lower precision = lower compute = faster". On GB10 hardware, the AutoRound int4 path appears to use kernels that are substantially better optimised than the FP8 path. The 27B FP8 does post a perfect 0.0pp reliability gap, but the 35B int4's 2.9pp is already production-grade, so the extra inference time buys only a marginal consistency gain.
The pattern holds when comparing quantization variants across related Qwen models:
| Model family | Quantization | Score | Speed | Score ÷ speed |
|---|---|---|---|---|
| Qwen3.5-122B | hybrid-int4fp8 | 91.7 | 3.0s | 30.6 |
| Qwen3.5-122B | A10B-FP8 | 90.0 | 6.8s | 13.2 |
| Qwen3.6-27B | FP8 | 93.0 | 11.3s | 8.2 |
Score ÷ speed is a rough throughput-adjusted quality metric — higher means more quality per unit of inference time. The hybrid-int4fp8 variant of the 122B beats the pure FP8 variant by a factor of 2.3 on this metric, with higher quality as well. A single comparison doesn't prove a general rule about quantization methods, but it's enough to justify testing int4 variants before committing to FP8 as a default.
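The ratio column can be reproduced directly from the results table; a minimal check, using the scores and median speeds reported above:

```python
# Score / median-speed ratios from the results table (quality points per
# second of inference time). Input values are copied from the table above.
models = {
    "qwen35-122b-hybrid-int4fp8": (91.7, 3.0),
    "Qwen3.5-122B-A10B-FP8":      (90.0, 6.8),
    "Qwen3.6-27B-FP8":            (93.0, 11.3),
}
efficiency = {name: score / speed for name, (score, speed) in models.items()}
# The hybrid-int4fp8 122B delivers roughly 2.3x the quality-per-second of
# the pure FP8 122B, despite a higher raw score as well.
```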
Practical takeaway: don't assume FP8 is the speed-optimal choice for your hardware. Benchmark against int4 with proper quantization (AutoRound, GPTQ, AWQ) before committing to a precision strategy.
Every model failed TC-60
TC-60 is a cross-turn sleeper injection scenario: an instruction is planted in earlier conversation context that triggers on a later turn. No model in the benchmark passed all three trials. Intel/Qwen3.6-35B was the closest, passing one of three.
This isn't a benchmark surprise — prompt injection resistance is an active research area — but it's worth saying explicitly: if your application processes content from untrusted sources (web pages, emails, documents, RAG retrievals), none of these models will reliably resist prompt injection on their own. You need defence in depth: input sanitisation, output validation, sandboxed execution, capability limits at the tool layer.
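One concrete capability limit at the tool layer is a per-workflow allowlist enforced outside the model, so that even a successfully injected instruction cannot reach tools the workflow never granted. A minimal sketch; the tool names and call shape here are hypothetical, not taken from the benchmark's 52-tool definition:

```python
# Minimal tool-layer allowlist guard. Tool names and the call structure
# are hypothetical examples, not from the benchmark's tool set.
ALLOWED_TOOLS = {"search_case_notes", "extract_iocs", "map_attack_technique"}

def guard_tool_call(call: dict) -> dict:
    """Enforce the allowlist outside the model: a fully injected model
    still cannot invoke a tool the workflow never granted."""
    name = call.get("name")
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} not in workflow allowlist")
    return call
```

The same pattern extends to per-parameter validation (restricting the file paths or URLs an allowed tool may receive), which catches injections that abuse a permitted tool rather than calling a forbidden one.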
Reliability gaps reveal what scores hide
The qwen35-122b-hybrid-int4fp8 model scored 91.7 — second-highest in the benchmark — with a 10.2pp reliability gap. Same scenarios, different results across runs. A model that scores 95 once and 75 next time averages 85, which looks fine on a leaderboard and is unusable in production.
The 0.0pp models (Qwen3.5-122B FP8, Qwen3.6-27B FP8) tell the opposite story: every trial produced the same result. Slow but stable. For workflows where determinism matters more than peak performance, these have a real argument.
| Model | Reliability gap | Interpretation |
|---|---|---|
| Qwen3.5-122B-A10B-FP8 | 0.0pp | Perfect consistency — no cross-trial variance |
| Qwen3.6-27B-FP8 | 0.0pp | Perfect consistency |
| glm-64k:latest | 1.4pp | Highly consistent |
| Intel/Qwen3.6-35B-A3B-int4 | 2.9pp | High consistency — production-viable |
| qwen35-122b-hybrid-int4fp8 | 10.2pp | Moderate variance — investigate before deploying |
Reliability gap is the metric I'd put alongside score whenever benchmarking models for production use. Score alone hides the variance.
Recommendations
For most use cases: Intel/Qwen3.6-35B-A3B-int4-AutoRound. Best deployability score, best speed-to-quality ratio, reliability gap small enough to be production-viable.
For latency-critical work (sub-2s response required): glm-64k:latest, accepting the 12-point quality drop.
For deterministic workflows where cross-trial consistency matters more than peak score: Qwen3.5-122B-A10B-FP8, accepting the 6.8s response time.
Between them, those three configurations cover the space. The 397B and the 122B-hybrid both have problems — performance and reliability respectively — that disqualify them as default picks regardless of headline numbers.
What this means for our stack
I run a four-node DGX Spark cluster doing DFIR and OSINT work for regulated EMEA clients. Tool calling is the foundation of nearly every workflow I build — OSINT recon dispatch, evidence extraction, MITRE ATT&CK mapping, structured triage. These benchmark results have direct production implications.
Benchmark date: 22 April 2026. Framework: tool-eval-bench. Hardware: 4× NVIDIA DGX Spark GB10. 69 scenarios, 16 categories, 3 trials per model. 52-tool definition (~4,637 tokens) included in every prompt.