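Real-Hardware Measurements: NVIDIA GB10 / DGX Spark
The figures below were measured by AIMA on an NVIDIA GB10 / DGX Spark, and each one can be traced back to the corresponding YAML file in the AIMA model catalog.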
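Measurement method
AIMA's built-in agent runs the whole process. After deploying a model, it samples decode throughput (tok/s), time to first token (TTFT), time per output token (TPOT), peak GPU memory, and maximum context length. Speech recognition (ASR) models are additionally measured for real-time factor (RTF), speech synthesis (TTS) models for synthesis time, image generation models for end-to-end latency, and vision-language models (VLM) for vision-token processing time. All figures come from measured logs and are written back to the catalog. See the AIMA documentation for the commands to reproduce them.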
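As a rough illustration of how the latency metrics above relate to one another, the sketch below derives TTFT, TPOT, and the implied single-stream decode throughput from per-token arrival timestamps. The names `StreamMetrics` and `summarize_stream`, and the timestamp-list input, are assumptions made for this example; they are not AIMA's agent code or its log format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_ms: float       # time from request to first token
    tpot_ms: float       # mean time per output token after the first
    decode_tok_s: float  # single-stream decode throughput implied by TPOT


def summarize_stream(request_sent_s: float,
                     token_times_s: Sequence[float]) -> StreamMetrics:
    """Derive latency metrics from token arrival times (seconds, same clock).

    Hypothetical helper: the input shape is an assumption for illustration.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to compute TPOT")

    # TTFT covers queueing plus prefill: request sent -> first token.
    ttft_s = token_times_s[0] - request_sent_s

    # Decode phase: time from first to last token, divided by the
    # number of tokens produced after the first one.
    decode_s = token_times_s[-1] - token_times_s[0]
    if decode_s <= 0:
        raise ValueError("token timestamps must be increasing")
    tpot_s = decode_s / (len(token_times_s) - 1)

    return StreamMetrics(
        ttft_ms=ttft_s * 1000.0,
        tpot_ms=tpot_s * 1000.0,
        decode_tok_s=1.0 / tpot_s,
    )


if __name__ == "__main__":
    # Synthetic timestamps, not measured data.
    print(summarize_stream(0.0, [0.10, 0.15, 0.20, 0.25]))
```

For a single request, decode throughput is roughly `1000 / TPOT_ms`. Aggregate throughput under concurrency, or figures taken from a different summary, need not follow that relation exactly.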
gemma-4-26b-a4b-it
VLM benchmark + validated 2026-04-04

| Metric | Value | Notes |
|---|---|---|
| Decode throughput (tok/s) | 24–28 | |
| TTFT (ms) | 127–489 | |
| TPOT (ms) | 45–59 | |
| Peak GPU memory (MiB) | 92,800 | |
| Max context (tokens) | 155,648 | |

Validated on NVIDIA GB10 / DGX Spark. Aggregate tok/s and TTFT come from the catalog summary; the AIMA matrix columns come from the same validated note.
glm-4.7-flash
LLM benchmark 2026-03-01

| Metric | Value | Notes |
|---|---|---|
| Decode throughput (tok/s) | 14.5–25.8 | |
| TTFT (ms) | 71–10,266 | |
| TPOT (ms) | 39–69.3 | |
| Peak GPU memory (MiB) | 60,400 | |
| Max context (tokens) | 65,536 | |

A 128K context causes OOM on GB10. Concurrency columns come from the file's explicitly labeled concurrency_1k block.
qwen3.5-35b-a3b
LLM/VLM benchmark 2026-02-28

| Metric | Value | Notes |
|---|---|---|
| Decode throughput (tok/s) | 24.2–30 | |
| TTFT (ms) | 96–34,015 | |
| TPOT (ms) | 31–34 | |
| Peak GPU memory (MiB) | 67,100 | |
| Max context (tokens) | 131,072 | |
| 1024 image tokens | 1025 | |
| Max single-image resolution | 3424×3424 | |
| Max images at 1024 res | 255 | |

ttft_ms_max uses the 131,072-token context-scaling result (34.015 s); vision columns come from the measured vision section.
qwen3-coder-next-fp8
Code LLM benchmark

| Metric | Value | Notes |
|---|---|---|
| Decode throughput (tok/s) | 0.2–42.5 | |
| TTFT (ms) | 239–141,200 | |
| TPOT (ms) | 22.3–46.6 | |
| Peak GPU memory (MiB) | 113,691 | |
| Max context (tokens) | 262,144 | |

The source file does not label the second value in concurrency_1k; it is preserved in the raw columns without further inference.
qwen3-asr-1.7b
ASR benchmark + validated 2026-03-20

| Metric | Value | Notes |
|---|---|---|
| Peak GPU memory (MiB) | 3,870 | |
| RTF | 0.076–0.17 | approx. 6–13× realtime |
| 1.3s audio latency | 0.22s | |
| 7s audio latency | 0.59s | |
| 20s audio latency | 1.56s | |
| Total footprint | ~10.5 GB | model + KV cache + CUDA graphs |

Audio timings are from the validated GB10 note. The total footprint of about 10.5 GB covers model + KV cache + CUDA graphs, per the same notes.
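To make the RTF row easier to read against the latency rows, here is a minimal sketch that recomputes the real-time factor as processing time divided by audio duration, using the three clip timings listed in the table. The helper name `real_time_factor` is hypothetical, and the arithmetic only restates the table; it is not taken from AIMA's benchmark code.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than realtime."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


# (audio duration in s, measured latency in s) taken from the table rows.
samples = [(1.3, 0.22), (7.0, 0.59), (20.0, 1.56)]

for audio_s, latency_s in samples:
    rtf = real_time_factor(latency_s, audio_s)
    # The speed-up over realtime is simply the reciprocal of RTF.
    print(f"{audio_s:4.1f}s audio: RTF {rtf:.3f}, {1.0 / rtf:.1f}x realtime")
```

The three clips give RTFs of roughly 0.17, 0.084, and 0.078, which sit inside the reported 0.076–0.17 range. The 0.076 lower bound does not correspond exactly to any of the three listed clips.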
qwen3-tts-0.6b
TTS benchmark + validated 2026-03-20

| Metric | Value | Notes |
|---|---|---|
| Peak GPU memory (MiB) | 2,048 | |
| GPU RTF | 0.8–0.9 | |
| 7s audio synthesis | 5.4s | |
| 20s audio synthesis | 16.2s | |
| CPU ARM64 RTF (GB10 reference) | ~5.0 | |

GPU RTF is from the GB10 CUDA note. The same source also mentions CPU ARM64 on GB10 at about RTF 5.0, which is kept in this note rather than mixed into the GPU metric columns.
z-image
ImageGen benchmark + validated 2026-03-31

| Metric | Value | Notes |
|---|---|---|
| Peak GPU memory (MiB) | 22,000 | |
| 512×512 / 28 steps | 20s/image | |

Image generation latency is the GB10 512×512 / 28 steps figure from the model catalog; the end-to-end pass comes from the GB10 OpenClaw E2E report.
qwen3.5-9b
VLM/LLM validated 2026-03-20 / 2026-03-31

| Metric | Value | Notes |
|---|---|---|
| Decode throughput (tok/s) | 13–17 | |
| TTFT (ms) | 30–200 | |
| Peak GPU memory (MiB) | 18,000 | |
| Max context (tokens) | 65,536 | |

This model is included because its GB10 validation is explicit, but the tok/s and TTFT numbers come from catalog estimates rather than a standalone measured benchmark log.