NVIDIA GB10 / DGX Spark

On-silicon benchmarks: NVIDIA GB10 / DGX Spark

The data below was measured by AIMA on NVIDIA GB10 / DGX Spark and is traceable to the AIMA model catalog. Updated 2026-04. Benchmarks for more silicon (AMD, Ascend, Hygon DCU, Moore Threads, MetaX, Apple) are being compiled.

Updated: 2026-04. Source: AIMA catalog.

How it's measured

AIMA's built-in agent runs the benchmarks: for each model, after deployment it samples decode throughput (tok/s), time-to-first-token (TTFT), time-per-output-token (TPOT), VRAM usage, and max context, plus modality-specific end-to-end metrics (RTF for ASR, synthesis latency for TTS, image generation latency, vision-token handling for VLMs). All numbers come from measured logs and are written back to the catalog. See the AIMA docs for how to run them yourself.
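To make the three latency metrics concrete, here is a minimal sketch of how TTFT, TPOT, and decode throughput can be derived from per-token arrival timestamps. The function name, the dataclass, and the timestamp inputs are hypothetical illustrations; the source does not show AIMA's actual log format or sampling code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyMetrics:
    ttft_ms: float           # time from request start to the first output token
    tpot_ms: float           # mean gap between consecutive output tokens
    decode_tok_per_s: float  # output tokens per second during decode


def latency_metrics(request_start_s: float,
                    token_times_s: Sequence[float]) -> LatencyMetrics:
    """Derive TTFT, TPOT and decode throughput from token timestamps.

    Inputs are hypothetical: seconds on a monotonic clock, one timestamp
    per output token. This is not AIMA's real log schema.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two output tokens to compute TPOT")
    ttft_s = token_times_s[0] - request_start_s
    decode_s = token_times_s[-1] - token_times_s[0]
    if decode_s <= 0:
        raise ValueError("token timestamps must be strictly increasing")
    decode_tokens = len(token_times_s) - 1  # tokens produced after the first
    return LatencyMetrics(
        ttft_ms=ttft_s * 1000.0,
        tpot_ms=decode_s / decode_tokens * 1000.0,
        decode_tok_per_s=decode_tokens / decode_s,
    )


# Synthetic data: first token after 120 ms, then one token every 40 ms.
times = [0.120 + 0.040 * i for i in range(11)]
m = latency_metrics(0.0, times)
print(f"TTFT {m.ttft_ms:.0f} ms, TPOT {m.tpot_ms:.0f} ms, "
      f"{m.decode_tok_per_s:.1f} tok/s")
```

The sketch excludes the first token from the decode window, so prefill time affects TTFT only. Whether the catalog figures use the same convention is an assumption.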

gemma-4-26b-a4b-it

VLM benchmark + validated 2026-04-04
Metric | Value | Note
Decode throughput (tok/s) | 24–28 |
TTFT (ms) | 127–489 |
TPOT (ms) | 45–59 |
VRAM (MiB) | 92,800 |
Max context (tokens) | 155,648 |

Validated on NVIDIA GB10 / DGX Spark. Aggregate tok/s and TTFT come from the catalog summary; the AIMA matrix columns come from the same validated note.

glm-4.7-flash

LLM benchmark 2026-03-01
Metric | Value | Note
Decode throughput (tok/s) | 14.5–25.8 |
TTFT (ms) | 71–10,266 |
TPOT (ms) | 39–69.3 |
VRAM (MiB) | 60,400 |
Max context (tokens) | 65,536 |

128K context causes OOM on GB10; concurrency columns are from the file's explicitly labeled concurrency_1k block.
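As a rough cross-check of the glm-4.7-flash figures, single-stream decode throughput should be close to the reciprocal of TPOT. The sketch below applies that rule of thumb to the listed TPOT range. The helper name is hypothetical, and the assumption that the listed tok/s and TPOT ranges come from the same runs is not confirmed by the source.

```python
def tpot_to_tok_per_s(tpot_ms: float) -> float:
    """Single-stream decode rate implied by a time-per-output-token in ms."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    return 1000.0 / tpot_ms


# TPOT range listed for glm-4.7-flash (ms).
tpot_min_ms, tpot_max_ms = 39.0, 69.3
fastest = tpot_to_tok_per_s(tpot_min_ms)  # shortest TPOT gives the highest rate
slowest = tpot_to_tok_per_s(tpot_max_ms)  # longest TPOT gives the lowest rate
print(f"implied decode throughput: {slowest:.1f} to {fastest:.1f} tok/s")
```

This gives roughly 14.4 to 25.6 tok/s, close to the listed 14.5–25.8. The rule does not hold for aggregate throughput under concurrency, where tokens from several requests are counted together.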

qwen3.5-35b-a3b

LLM/VLM benchmark 2026-02-28
Metric | Value | Note
Decode throughput (tok/s) | 24.2–30 |
TTFT (ms) | 96–34,015 |
TPOT (ms) | 31–34 |
VRAM (MiB) | 67,100 |
Max context (tokens) | 131,072 |
1024 image tokens | 1025 |
Max single-image resolution | 3424x3424 |
Max images at 1024 res | 255 |

ttft_ms_max uses the 131072-token context-scaling result (34.015s); vision columns come from the measured vision section.

qwen3-coder-next-fp8

Code LLM benchmark
Metric | Value | Note
Decode throughput (tok/s) | 0.2–42.5 |
TTFT (ms) | 239–141,200 |
TPOT (ms) | 22.3–46.6 |
VRAM (MiB) | 113,691 |
Max context (tokens) | 262,144 |

The source file does not label the second value in concurrency_1k; it is preserved in the raw columns without further inference.

qwen3-asr-1.7b

ASR benchmark + validated 2026-03-20
Metric | Value | Note
VRAM (MiB) | 3,870 |
RTF | 0.076–0.17 | approx. 6–13x realtime
1.3s audio latency | 0.22s |
7s audio latency | 0.59s |
20s audio latency | 1.56s |
Total footprint | ~10.5 GB | model + KV cache + CUDA graphs
Passed: ASR

Audio timings are from the validated GB10 note. The total footprint of about 10.5 GB covers the model, KV cache, and CUDA graphs, as recorded in the notes.
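The real-time factor (RTF) is processing time divided by audio duration, so values below 1 mean faster than realtime. A minimal sketch, using the three latency rows listed for qwen3-asr-1.7b, shows how they relate to the RTF range. The function name is hypothetical, and treating each listed latency as the full processing time for that clip is an assumption.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s


# (audio duration in s, measured latency in s) from the table above.
samples = [(1.3, 0.22), (7.0, 0.59), (20.0, 1.56)]
rtfs = [real_time_factor(latency_s, audio_s) for audio_s, latency_s in samples]

for (audio_s, _), rtf in zip(samples, rtfs):
    # 1 / RTF is the speed-up relative to realtime playback.
    print(f"{audio_s:>4.1f}s audio: RTF {rtf:.3f}, {1 / rtf:.1f}x realtime")
```

The three clips work out to RTFs of about 0.169, 0.084, and 0.078, which sits inside or near the listed 0.076–0.17 range. The shortest clip has the highest RTF, consistent with fixed per-request overhead weighing more on short audio.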

qwen3-tts-0.6b

TTS benchmark + validated 2026-03-20
Metric | Value | Note
VRAM (MiB) | 2,048 |
GPU RTF | 0.8–0.9 |
7s audio synthesis | 5.4s |
20s audio synthesis | 16.2s |
CPU ARM64 RTF (GB10 reference) | ~5.0 |
Passed: TTS

GPU RTF is from the GB10 CUDA note; the same source also mentions CPU ARM64 on GB10 at about RTF 5.0, which is kept in this note rather than mixed into GPU metric columns.

z-image

ImageGen benchmark + validated 2026-03-31
Metric | Value | Note
VRAM (MiB) | 22,000 |
512×512 / 28 steps | 20s/image |
Passed: Image gen

Image generation latency is the GB10 512x512 / 28 steps figure from the model catalog; end-to-end pass comes from the GB10 OpenClaw E2E report.

qwen3.5-9b

VLM/LLM validated 2026-03-20 / 2026-03-31
Metric | Value | Note
Decode throughput (tok/s) | 13–17 |
TTFT (ms) | 30–200 |
VRAM (MiB) | 18,000 |
Max context (tokens) | 65,536 |
Passed: LLM chat, VLM

This row is included because GB10 validation is explicit, but the tok/s and TTFT numbers come from catalog estimates rather than a standalone measured benchmark log.

More silicon benchmarks are being compiled; watch the GitHub repository or the blog for updates.