NVIDIA GB10 / DGX Spark

On-silicon benchmarks: NVIDIA GB10 / DGX Spark

The data below was measured by AIMA on NVIDIA GB10 / DGX Spark and is traceable to the AIMA model catalog. Updated 2026-04. Benchmarks for more silicon (AMD, Ascend, Hygon DCU, Moore Threads, MetaX, Apple) are being compiled.


How it's measured

AIMA's built-in agent runs the benchmarks: for each model, after deployment it samples decode throughput (tok/s), time-to-first-token (TTFT), time-per-output-token (TPOT), VRAM usage, and max context, plus modality-specific end-to-end metrics (RTF for ASR, synthesis latency for TTS, image generation latency, vision-token handling for VLMs). All numbers come from measured logs and are written back to the catalog. See the AIMA docs for how to run them yourself.
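As a rough illustration of how the three latency metrics relate, the sketch below derives TTFT, TPOT, and decode throughput from per-token arrival timestamps. It is not AIMA's implementation: the function name `decode_metrics`, the `DecodeMetrics` type, and the timestamp-based input are assumptions made for this example, and it uses the common convention that TPOT is the mean gap between output tokens after the first one.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DecodeMetrics:
    ttft_ms: float           # time from request start to the first output token
    tpot_ms: float           # mean time per output token after the first
    decode_tok_per_s: float  # decode throughput implied by TPOT


def decode_metrics(request_start_s: float,
                   token_times_s: Sequence[float]) -> DecodeMetrics:
    """Hypothetical helper: derive latency metrics from token arrival times.

    token_times_s holds the wall-clock arrival time (in seconds) of each
    output token, in order. This input format is an assumption, not AIMA's.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two output tokens to compute TPOT")

    ttft_s = token_times_s[0] - request_start_s
    decode_s = token_times_s[-1] - token_times_s[0]
    if ttft_s < 0 or decode_s <= 0:
        raise ValueError("timestamps must be increasing and after the request start")

    # The first token is attributed to TTFT, so n tokens span n - 1 decode steps.
    tpot_s = decode_s / (len(token_times_s) - 1)
    return DecodeMetrics(
        ttft_ms=ttft_s * 1000.0,
        tpot_ms=tpot_s * 1000.0,
        decode_tok_per_s=1.0 / tpot_s,
    )


if __name__ == "__main__":
    # Synthetic timestamps for illustration only, not measured data.
    m = decode_metrics(0.0, [0.10, 0.15, 0.20, 0.25, 0.30])
    print(f"TTFT {m.ttft_ms:.1f} ms, TPOT {m.tpot_ms:.1f} ms, "
          f"decode {m.decode_tok_per_s:.1f} tok/s")
```

Under this convention a single request's decode throughput is the reciprocal of its TPOT. The ranges in the tables below may come from different runs or concurrency levels, so they need not satisfy that relation row by row.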

gemma-4-26b-a4b-it

VLM benchmark + validated 2026-04-04

Metric	Value	Note
Decode throughput (tok/s)	24–28
TTFT (ms)	127–489
TPOT (ms)	45–59
VRAM (MiB)	92,800
Max context (tokens)	155,648

Source: gemma-4-26b-a4b-it.yaml

Validated on NVIDIA GB10 / DGX Spark. Aggregate tok/s and TTFT come from the catalog summary; the AIMA matrix columns come from the same validated note.

glm-4.7-flash

LLM benchmark 2026-03-01

Metric	Value	Note
Decode throughput (tok/s)	14.5–25.8
TTFT (ms)	71–10,266
TPOT (ms)	39–69.3
VRAM (MiB)	60,400
Max context (tokens)	65,536

Source: glm-4.7-flash.yaml

128K context causes OOM on GB10; concurrency columns are from the file's explicitly labeled concurrency_1k block.

qwen3.5-35b-a3b

LLM/VLM benchmark 2026-02-28

Metric	Value	Note
Decode throughput (tok/s)	24.2–30
TTFT (ms)	96–34,015
TPOT (ms)	31–34
VRAM (MiB)	67,100
Max context (tokens)	131,072
1024 image tokens	1025
Max single-image resolution	3424×3424
Max images at 1024 res	255

Source: qwen3.5-35b-a3b.yaml

ttft_ms_max uses the 131072-token context-scaling result (34.015s); vision columns come from the measured vision section.

qwen3-coder-next-fp8

Code LLM benchmark

Metric	Value	Note
Decode throughput (tok/s)	0.2–42.5
TTFT (ms)	239–141,200
TPOT (ms)	22.3–46.6
VRAM (MiB)	113,691
Max context (tokens)	262,144

Source: qwen3-coder-next-fp8.yaml

Source file does not label the 2nd value in concurrency_1k; it is preserved in the raw columns without further inference.

qwen3-asr-1.7b

ASR benchmark + validated 2026-03-20

Metric	Value	Note
VRAM (MiB)	3,870
RTF	0.076–0.17	approx 6–13x realtime
1.3s audio latency	0.22s
7s audio latency	0.59s
20s audio latency	1.56s
Total footprint	~10.5 GB	model + KV cache + CUDA graphs

Passed: ASR

Source: qwen3-asr-1.7b.yaml, openclaw-multi.yaml

Audio timings are from the validated GB10 note; the total footprint of about 10.5 GB (model + KV cache + CUDA graphs) comes from the same notes.
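To show how the RTF range relates to the per-clip latencies in the table, here is a minimal sketch that assumes the conventional definition RTF = processing time / audio duration (the page does not state AIMA's exact formula). The helper name `real_time_factor` is hypothetical; the three (audio length, latency) pairs are the ones listed above.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Conventional RTF: seconds of compute per second of audio (lower is faster)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


# (audio duration in seconds, measured latency in seconds) from the table above
asr_samples = [(1.3, 0.22), (7.0, 0.59), (20.0, 1.56)]

for audio_s, latency_s in asr_samples:
    rtf = real_time_factor(latency_s, audio_s)
    # 1 / RTF is the speed-up relative to realtime playback.
    print(f"{audio_s:>4.1f}s audio: RTF {rtf:.3f}, about {1.0 / rtf:.1f}x realtime")
```

Under that assumption the three clips land at roughly 0.17, 0.08, and 0.08, which is in line with the reported 0.076–0.17 range: shorter clips carry proportionally more fixed overhead, so their RTF is higher.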

qwen3-tts-0.6b

TTS benchmark + validated 2026-03-20

Metric	Value	Note
VRAM (MiB)	2,048
GPU RTF	0.8–0.9
7s audio synthesis	5.4s
20s audio synthesis	16.2s
CPU ARM64 RTF (GB10 reference)	~5.0

Passed: TTS

Source: qwen3-tts-0.6b.yaml, openclaw-multi.yaml

GPU RTF is from the GB10 CUDA note. The same source also mentions CPU ARM64 on GB10 at about RTF 5.0; that figure is listed as a separate reference row rather than mixed into the GPU metrics.

z-image

ImageGen benchmark + validated 2026-03-31

Metric	Value	Note
VRAM (MiB)	22,000
512×512 / 28 steps	20s/image

Passed: Image gen

Source: z-image.yaml, BUG-gb10-openclaw-multi-e2e-20260331.md

Image generation latency is the GB10 512×512 / 28 steps figure from the model catalog; the end-to-end pass comes from the GB10 OpenClaw E2E report.

qwen3.5-9b

VLM/LLM validated 2026-03-20 / 2026-03-31

Metric	Value	Note
Decode throughput (tok/s)	13–17
TTFT (ms)	30–200
VRAM (MiB)	18,000
Max context (tokens)	65,536

Passed: LLM chat, VLM

Source: openclaw-multi.yaml, BUG-gb10-openclaw-multi-e2e-20260331.md, qwen3.5-9b.yaml

This row is included because GB10 validation is explicit, but the tok/s and TTFT numbers come from catalog estimates rather than a standalone measured benchmark log.

More silicon benchmarks are being compiled; watch the GitHub repository or the blog for updates.
