Open source · Apache 2.0

AIMA — AI Infrastructure, managed by AI

A single Go binary that manages AI inference on your hardware: detect accelerators, pick an engine and config from a YAML knowledge base, deploy the model, run benchmarks, write the winning config back. The built-in agent drives the whole loop; AIMA is also an MCP server.

The engine is auto-picked per hardware

Three backends: vLLM, SGLang, llama.cpp. AIMA picks the currently fastest one for your accelerator, model, quantization, and context length from a YAML knowledge base — you never touch the vLLM parameter soup.

vLLM

High-throughput path for discrete GPUs (NVIDIA / AMD)

SGLang

Structured generation / multi-node, high prefix-cache hit rate

llama.cpp

GGUF / CPU / lightweight deployments, first choice on Apple Silicon
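
A lookup of this kind could look roughly like the sketch below: among knowledge-base entries that match the accelerator, model, and quantization and that support the requested context length, take the one with the highest measured throughput. The record fields (`accelerator`, `max_context`, `tokens_per_s`) and all numbers are hypothetical; AIMA's actual YAML schema is not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    # Hypothetical record shape, not AIMA's real schema.
    accelerator: str
    model: str
    quant: str
    max_context: int
    engine: str
    tokens_per_s: float


# Illustrative numbers only, not real benchmark results.
KB = [
    Entry("gpu-a", "qwen3-4b", "fp16", 32768, "vllm", 100.0),
    Entry("gpu-a", "qwen3-4b", "fp16", 32768, "sglang", 90.0),
    Entry("gpu-a", "qwen3-4b", "fp16", 8192, "llama.cpp", 120.0),
]


def pick_engine(kb, accelerator, model, quant, context) -> Optional[Entry]:
    """Return the fastest matching entry that supports the context length."""
    candidates = [
        e for e in kb
        if (e.accelerator, e.model, e.quant) == (accelerator, model, quant)
        and e.max_context >= context
    ]
    # default=None covers hardware/model pairs that were never benchmarked.
    return max(candidates, key=lambda e: e.tokens_per_s, default=None)
```

With this toy data, a 4K-context request resolves to the llama.cpp entry, while a 16K-context request falls back to vLLM because the faster entry does not cover that context length.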

Validated on silicon

NVIDIA, AMD, Huawei Ascend, Hygon DCU, Moore Threads, MetaX, Apple Silicon — all benchmarked on real hardware. CPU-only works too.

| Vendor / Chip | Status | Notes |
| --- | --- | --- |
| NVIDIA GPU | validated | CUDA |
| AMD GPU | validated | ROCm |
| Huawei Ascend | validated | validated on silicon |
| Hygon DCU | validated | validated on silicon |
| Moore Threads | validated | validated on silicon |
| MetaX | validated | validated on silicon |
| Apple Silicon | validated | Metal |
| CPU-only | supported | x86_64 + ARM64 |

See the full benchmark data →

LAN fleet via mDNS

Multiple machines on the same LAN auto-discover each other and form a fleet. Models, the knowledge base, and benchmark results sync across the fleet. The fastest config measured on one machine is immediately available to the whole fleet.
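
One way to picture the sync step is as a merge of per-node result maps that keeps the faster config for each key. The sketch below shows only that merge; mDNS discovery and the transport are left out, and the key and record shapes are assumptions made for illustration, not AIMA's wire format.

```python
def merge_results(local: dict, remote: dict) -> dict:
    """Merge two result maps, keeping the higher-throughput config per key."""
    merged = dict(local)
    for key, cfg in remote.items():
        if key not in merged or cfg["tokens_per_s"] > merged[key]["tokens_per_s"]:
            merged[key] = cfg
    return merged


# Keys are (hardware, model); values are hypothetical config records
# with made-up throughput numbers.
node_a = {("gpu-a", "qwen3-4b"): {"engine": "vllm", "tokens_per_s": 100.0}}
node_b = {
    ("gpu-a", "qwen3-4b"): {"engine": "sglang", "tokens_per_s": 110.0},
    ("cpu", "qwen3-4b"): {"engine": "llama.cpp", "tokens_per_s": 12.0},
}

fleet_view = merge_results(node_a, node_b)
```

Because the merge keeps the maximum per key, the order in which nodes exchange results does not change the outcome in this example.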

Offline / airgap: images preloaded

Works in air-gapped environments — engine images and common models can be preloaded offline. The full inference stack has no external network dependency.

MCP-native: it's a server, and it runs an agent inside

AIMA is an MCP server: point any MCP-compatible runtime at http://<aima-host>:6188/mcp and you get the full operational surface (hardware detection, model scan, engine selection, deployment, benchmark, fleet discovery, knowledge sync). AIMA also consumes MCP internally: the built-in PDCA (plan-do-check-act) agent, codename Explorer, plans benchmarks, deploys configs, samples metrics, and promotes winning configs to the shared knowledge base. When a new chip arrives, the agent runs the tuning matrix itself.
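
The plan, deploy, sample, promote loop described above can be sketched as a small function. This is an illustration under stated assumptions, not Explorer's implementation: the tuning dimensions (`engine`, `batch_size`), the `fake_benchmark` stand-in, and its scores are all invented for the example.

```python
from itertools import product


def plan(engines, batch_sizes):
    """Plan: enumerate the tuning matrix as candidate configs."""
    return [{"engine": e, "batch_size": b} for e, b in product(engines, batch_sizes)]


def run_cycle(candidates, benchmark, knowledge_base, key):
    """Do/Check/Act: measure each candidate and promote the winner if it is better."""
    results = [(benchmark(c), c) for c in candidates]          # Do: deploy and sample
    best_score, best_cfg = max(results, key=lambda r: r[0])   # Check: compare scores
    current = knowledge_base.get(key)
    if current is None or best_score > current["score"]:      # Act: promote
        knowledge_base[key] = {"config": best_cfg, "score": best_score}
    return knowledge_base[key]


def fake_benchmark(cfg):
    # Stand-in with made-up scores; a real run would deploy and measure.
    base = {"vllm": 100.0, "llama.cpp": 40.0}[cfg["engine"]]
    return base + cfg["batch_size"]


kb = {}
candidates = plan(["vllm", "llama.cpp"], [1, 8])
winner = run_cycle(candidates, fake_benchmark, kb, ("gpu-a", "qwen3-4b"))
```

Injecting `benchmark` as a parameter keeps the loop itself independent of how a config is actually deployed and measured.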

Runs in production as OpenClaw's inference backend — covering LLM, ASR, TTS, image generation, and VLM.

MCP config example

{
  "mcpServers": {
    "aima": { "type": "http", "url": "http://<aima-host>:6188/mcp" }
  }
}
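
For clients without built-in MCP support, the first message on the wire is a JSON-RPC 2.0 `initialize` request. The sketch below builds that request with the Python standard library but does not send it. The protocol revision string, the client name, and the use of `localhost` in place of `<aima-host>` are assumptions; check which MCP revision your AIMA build speaks before relying on them.

```python
import json
import urllib.request


def build_initialize_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 `initialize` request for an MCP endpoint."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumed revision; match the server
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1.0"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP servers may answer with JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


# localhost stands in for <aima-host>; send with urllib.request.urlopen(req).
req = build_initialize_request("http://localhost:6188/mcp")
```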

The knowledge base: faster over time

"What runs fastest on this silicon" accumulates in a YAML knowledge base, not in a consultant's head. Every benchmark run writes the winning config back, so only the first deployment of a model on a given piece of hardware has to explore; every later deployment of that same pairing is a table lookup.

1st deployment: explore → Nth deployment: table lookup
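
The explore-once, look-up-afterwards behaviour amounts to a write-back cache keyed by hardware and model. A minimal in-memory sketch, assuming a hypothetical `KnowledgeBase` class and an injected `explore` function (neither is AIMA's real API, and the real store is YAML on disk):

```python
class KnowledgeBase:
    """In-memory stand-in for the YAML knowledge base (hypothetical API)."""

    def __init__(self, explore):
        self._explore = explore   # expensive benchmark sweep, injected by the caller
        self._entries = {}
        self.explorations = 0     # counts how often exploration was needed

    def best_config(self, hardware: str, model: str) -> dict:
        key = (hardware, model)
        if key not in self._entries:
            # First deployment for this pairing: explore, then write back.
            self.explorations += 1
            self._entries[key] = self._explore(hardware, model)
        # Every later deployment: plain lookup.
        return self._entries[key]


# The lambda is a placeholder for a real benchmark sweep.
kb_cache = KnowledgeBase(lambda hw, m: {"engine": "vllm", "hardware": hw, "model": m})
first = kb_cache.best_config("gpu-a", "qwen3-4b")
second = kb_cache.best_config("gpu-a", "qwen3-4b")
```

The second call returns the stored entry without invoking `explore` again, which is the whole point of writing results back.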

Get AIMA

# macOS / Linux
curl -fsSL https://raw.githubusercontent.com/Approaching-AI/AIMA/master/install.sh | sh
# Windows (PowerShell)
irm https://raw.githubusercontent.com/Approaching-AI/AIMA/master/install.ps1 | iex

Or grab a pre-built binary from Releases (macOS arm64 / Linux amd64·arm64 / Windows amd64), or build from source: git clone … && make build

Releases · GitHub

# After install
aima hal detect
aima onboarding
aima run qwen3-4b
aima serve
