Open source · Apache 2.0

AIMA — AI Infrastructure, managed by AI

A single Go binary that manages AI inference on your hardware: detect accelerators, pick an engine and config from a YAML knowledge base, deploy the model, run benchmarks, write the winning config back. The built-in agent drives the whole loop; AIMA is also an MCP server.

Install AIMA View on GitHub

The engine is auto-picked per hardware

Three backends: vLLM, SGLang, llama.cpp. AIMA picks the currently fastest one for your accelerator, model, quantization, and context length from a YAML knowledge base — you never touch the vLLM parameter soup.

vLLM

High-throughput path for discrete GPUs (NVIDIA / AMD)

SGLang

Structured generation / multi-node, high prefix-cache hit rate

llama.cpp

GGUF / CPU / lightweight deployments, first choice on Apple Silicon

Validated on silicon

NVIDIA, AMD, Huawei Ascend, Hygon DCU, Moore Threads, MetaX, Apple Silicon — all benchmarked on real hardware. CPU-only works too.

Vendor / Chip	Status	Notes
NVIDIA GPU	✓ validated	CUDA
AMD GPU	✓ validated	ROCm
Huawei Ascend	✓ validated	validated on silicon
Hygon DCU	✓ validated	validated on silicon
Moore Threads	✓ validated	validated on silicon
MetaX	✓ validated	validated on silicon
Apple Silicon	✓ validated	Metal
CPU-only	○ supported	x86_64 + ARM64

See the full benchmark data →

LAN fleet via mDNS

Multiple machines on the same LAN auto-discover each other and form a fleet. Models, the knowledge base, and benchmark results sync across the fleet. The fastest config measured on one machine is immediately available to the whole fleet.

Offline / airgap: images preloaded

Works in air-gapped environments — engine images and common models can be preloaded offline. The full inference stack has no external network dependency.

MCP-native: it's a server, and it runs an agent inside

AIMA is an MCP server — point any MCP-compatible runtime at http://<aima-host>:6188/mcp and you get the full operational surface: hardware detection, model scan, engine selection, deployment, benchmark, fleet discovery, knowledge sync. AIMA also consumes MCP internally: the built-in PDCA agent (codename Explorer) plans benchmarks, deploys configs, samples metrics, and promotes winning configs to the shared knowledge base. When a new chip arrives, the agent runs the tuning matrix itself.

Runs in production as OpenClaw's inference backend — covering LLM, ASR, TTS, image generation, and VLM.

MCP config example

mcp config

{
  "mcpServers": {
    "aima": { "type": "http", "url": "http://<aima-host>:6188/mcp" }
  }
}

The knowledge base: faster over time

"What runs fastest on this silicon" accumulates in a YAML knowledge base — not in a consultant's head. Every benchmark run writes the winning config back; the next time the same hardware meets the same model, it's a lookup, not an exploration. The first deployment explores; every subsequent one is a table lookup.

1st deployment: explore → Nth deployment: table lookup

Get AIMA

# macOS / Linux

Terminal

curl -fsSL https://aimaserver.com/install.sh | sh

# Windows (PowerShell)

PowerShell

irm https://aimaserver.com/install.ps1 | iex

Or grab a pre-built binary from Releases (macOS arm64 / Linux amd64·arm64 / Windows amd64), or build from source: git clone … && make build Releases · GitHub

# After install

aima hal detect

aima onboarding

aima run qwen3-4b

aima serve

After install: aima hal detect → aima onboarding → aima run qwen3-4b → aima serve

Read the docs (GitHub README) →