AIMA — AI Infrastructure, managed by AI
A single Go binary that manages AI inference on your hardware: detect accelerators, pick an engine and config from a YAML knowledge base, deploy the model, run benchmarks, write the winning config back. The built-in agent drives the whole loop; AIMA is also an MCP server.
The engine is auto-picked per hardware
Three backends: vLLM, SGLang, llama.cpp. AIMA picks the currently fastest one for your accelerator, model, quantization, and context length from a YAML knowledge base — you never touch the vLLM parameter soup.
vLLM: high-throughput path for discrete GPUs (NVIDIA / AMD)
SGLang: structured generation, multi-node deployments, and workloads with a high prefix-cache hit rate
llama.cpp: GGUF, CPU, and lightweight deployments; first choice on Apple Silicon
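The selection step can be pictured as a lookup over benchmarked entries. The following is a minimal sketch, not AIMA's actual code: the `Key` and `Entry` types, their field names, and the throughput numbers are all hypothetical placeholders, and loading the entries from YAML is omitted.

```go
package main

import "fmt"

// Key identifies one deployment scenario. Field names are hypothetical.
type Key struct {
	Accelerator string
	Model       string
	Quant       string
	ContextLen  int
}

// Entry is one benchmarked result, as it might be loaded from the
// YAML knowledge base (the loading step is not shown here).
type Entry struct {
	Key          Key
	Engine       string  // "vllm", "sglang", or "llama.cpp"
	TokensPerSec float64 // measured throughput
}

// PickEngine returns the fastest recorded entry for the key.
// ok is false when nothing has been benchmarked for this scenario yet.
func PickEngine(kb []Entry, k Key) (best Entry, ok bool) {
	for _, e := range kb {
		if e.Key != k {
			continue
		}
		if !ok || e.TokensPerSec > best.TokensPerSec {
			best, ok = e, true
		}
	}
	return best, ok
}

func main() {
	// Placeholder values for illustration only, not real measurements.
	k := Key{"example-gpu", "qwen3-4b", "fp16", 8192}
	kb := []Entry{
		{k, "vllm", 100},
		{k, "sglang", 120},
		{Key{"example-cpu", "qwen3-4b", "q4", 8192}, "llama.cpp", 40},
	}
	if e, ok := PickEngine(kb, k); ok {
		fmt.Println(e.Engine)
	}
}
```

A miss (`ok == false`) is the case where no benchmark exists yet for the scenario, so a caller would have to fall back to exploration.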
Validated on silicon
NVIDIA, AMD, Huawei Ascend, Hygon DCU, Moore Threads, MetaX, Apple Silicon — all benchmarked on real hardware. CPU-only works too.
| Vendor / Chip | Status | Notes |
|---|---|---|
| NVIDIA GPU | ✓ validated | CUDA |
| AMD GPU | ✓ validated | ROCm |
| Huawei Ascend | ✓ validated | validated on silicon |
| Hygon DCU | ✓ validated | validated on silicon |
| Moore Threads | ✓ validated | validated on silicon |
| MetaX | ✓ validated | validated on silicon |
| Apple Silicon | ✓ validated | Metal |
| CPU-only | ○ supported | x86_64 + ARM64 |
LAN fleet via mDNS
Multiple machines on the same LAN auto-discover each other and form a fleet. Models, the knowledge base, and benchmark results sync across the fleet. The fastest config measured on one machine is immediately available to the whole fleet.
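One way to picture the sync step is a merge that keeps the faster record per scenario. This is a sketch under assumptions: the `Result` type, the string keys, and the numbers are hypothetical, and mDNS discovery and transport are left out entirely.

```go
package main

import "fmt"

// Result is a hypothetical benchmark record shared between peers.
type Result struct {
	Engine       string
	TokensPerSec float64
}

// Merge folds a peer's results into the local view, keeping the faster
// record for each scenario key. It reports how many keys were added or replaced.
func Merge(local, peer map[string]Result) int {
	changed := 0
	for key, r := range peer {
		cur, ok := local[key]
		if !ok || r.TokensPerSec > cur.TokensPerSec {
			local[key] = r
			changed++
		}
	}
	return changed
}

func main() {
	// Placeholder values for illustration only, not real measurements.
	local := map[string]Result{"gpu-x/qwen3-4b": {"vllm", 100}}
	peer := map[string]Result{
		"gpu-x/qwen3-4b": {"sglang", 120},
		"cpu/qwen3-4b":   {"llama.cpp", 10},
	}
	fmt.Println(Merge(local, peer), local["gpu-x/qwen3-4b"].Engine)
}
```

Because the merge only ever keeps the higher-throughput record, applying the same peer data twice changes nothing the second time.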
Offline / airgap: images preloaded
Works in air-gapped environments — engine images and common models can be preloaded offline. The full inference stack has no external network dependency.
MCP-native: it's a server, and it runs an agent inside
AIMA is an MCP server — point any MCP-compatible runtime at http://<aima-host>:6188/mcp and you get the full operational surface: hardware detection, model scan, engine selection, deployment, benchmark, fleet discovery, knowledge sync. AIMA also consumes MCP internally: the built-in PDCA agent (codename Explorer) plans benchmarks, deploys configs, samples metrics, and promotes winning configs to the shared knowledge base. When a new chip arrives, the agent runs the tuning matrix itself.
Runs in production as OpenClaw's inference backend — covering LLM, ASR, TTS, image generation, and VLM.
MCP config example
{
  "mcpServers": {
    "aima": { "type": "http", "url": "http://<aima-host>:6188/mcp" }
  }
}
The knowledge base: faster over time
"What runs fastest on this silicon" accumulates in a YAML knowledge base, not in a consultant's head. Every benchmark run writes the winning config back, so the first deployment of a model on a given piece of hardware explores and every later one is a table lookup.
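The explore-once, look-up-afterwards behaviour can be sketched as a cache with write-back. This is an illustration under assumptions, not AIMA's implementation: `Config`, `KB`, and `Resolve` are hypothetical names, and an in-memory map stands in for the YAML file.

```go
package main

import "fmt"

// Config is a hypothetical winning configuration for one hardware/model pair.
type Config struct {
	Engine string
	Flags  map[string]string
}

// KB is an in-memory stand-in for the YAML knowledge base.
type KB map[string]Config

// Resolve returns the stored config when one exists. Otherwise it runs
// explore (the expensive benchmark sweep), writes the winner back, and
// returns it. explored reports whether the sweep actually ran.
func (kb KB) Resolve(key string, explore func() Config) (cfg Config, explored bool) {
	if hit, ok := kb[key]; ok {
		return hit, false
	}
	cfg = explore()
	kb[key] = cfg
	return cfg, true
}

func main() {
	kb := KB{}
	runs := 0
	explore := func() Config { runs++; return Config{Engine: "llama.cpp"} }

	kb.Resolve("cpu/qwen3-4b", explore)                // first call explores
	_, explored := kb.Resolve("cpu/qwen3-4b", explore) // second call is a lookup
	fmt.Println(runs, explored)
}
```

The second call never invokes `explore`, which is the property the paragraph above describes.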
Get AIMA
macOS / Linux:
curl -fsSL https://raw.githubusercontent.com/Approaching-AI/AIMA/master/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/Approaching-AI/AIMA/master/install.ps1 | iex
Or grab a pre-built binary from Releases (macOS arm64 / Linux amd64·arm64 / Windows amd64), or build from source: git clone … && make build
After install: aima hal detect → aima onboarding → aima run qwen3-4b → aima serve