Why AIMA: let an agent be the inference operator
Two corners, one bet
Private LLM inference stacks today usually sit in one of two corners.
One corner is Ollama and LM Studio: a single binary, one engine (llama.cpp / GGUF), and defaults that work out of the box. The cost is throughput capped at the engine’s ceiling. Good for experiments; not for production at scale.
The other corner is raw vLLM, SGLang, and TensorRT-LLM: the numbers look good, but flag tuning, quantization selection, deployment wiring, and per-vendor quirks are all on you. Each new chip is basically a redo. The operator is you.
Both choices have a price. AIMA takes a different bet.
Replace the operator with an agent
AIMA’s core bet: swap the inference operator from a person to an agent.
In practice: AIMA detects the hardware (NVIDIA, AMD, Huawei Ascend, Hygon DCU, Moore Threads, MetaX, Apple, or CPU-only), picks the best engine and config from a YAML knowledge base, deploys the model, runs benchmarks, and writes the winning config back. The loop is automatic and continuous.
A built-in PDCA (plan-do-check-act) agent, codenamed Explorer, keeps running: it plans the next benchmark, deploys a candidate config, samples throughput and TTFT (time to first token), and promotes the winner to a shared knowledge base.
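One pass of that loop can be sketched in a few lines. This is an illustration only, not AIMA's code: the Candidate and Measurement types, the explore function, and the TTFT ceiling used as the rejection rule are all assumptions, and the benchmark numbers are made up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    engine: str          # hypothetical field, e.g. "vllm" or "llama.cpp"
    quantization: str    # hypothetical field, e.g. "fp16" or "q4"
    tensor_parallel: int


@dataclass(frozen=True)
class Measurement:
    tokens_per_second: float
    ttft_ms: float       # time to first token, in milliseconds


def explore(
    candidates: Iterable[Candidate],
    benchmark: Callable[[Candidate], Measurement],
    max_ttft_ms: float,
) -> Optional[Tuple[Candidate, Measurement]]:
    """Run one plan/deploy/sample/promote pass and return the winner, if any."""
    best = None
    for candidate in candidates:               # plan: pick the next config to try
        result = benchmark(candidate)          # deploy the config and sample it
        if result.ttft_ms > max_ttft_ms:       # assumed rule: reject slow first tokens
            continue
        if best is None or result.tokens_per_second > best[1].tokens_per_second:
            best = (candidate, result)         # promote: this is the new winner
    return best


# Invented numbers standing in for real benchmark runs.
FAKE_RESULTS = {
    Candidate("llama.cpp", "q4", 1): Measurement(45.0, 120.0),
    Candidate("vllm", "fp16", 1): Measurement(90.0, 200.0),
    Candidate("vllm", "fp16", 2): Measurement(150.0, 900.0),
}

winner = explore(FAKE_RESULTS, FAKE_RESULTS.__getitem__, max_ttft_ms=500.0)
print(winner)
```

In this toy run the third candidate has the highest throughput but is rejected for its TTFT, so the second one wins. How AIMA actually weighs throughput against TTFT is not stated here.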
When a new chip arrives, the agent runs the tuning matrix itself. No documentation to read, no parameter table to comb through, no trial-and-error from scratch.
Knowledge in YAML, not in someone’s head
Tuning results are written back to a YAML knowledge base. The answer to “on this silicon, for this model, with this quantization and this parallelism, what is the fastest configuration?” accumulates there, not in an engineer’s head and not in a document that may go stale.
The first deployment is exploration. Every subsequent one is a table lookup.
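The explore-once, look-up-afterwards pattern might look like the sketch below. It assumes a knowledge base keyed by (hardware, model, quantization, parallelism) and keeps it in a plain dictionary rather than a YAML file; the KnowledgeBase class, its resolve method, and the config fields are hypothetical names, not AIMA's schema.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, str, int]   # (hardware, model, quantization, parallelism)
Config = Dict[str, object]


class KnowledgeBase:
    """In-memory stand-in for the YAML knowledge base."""

    def __init__(self) -> None:
        self._entries: Dict[Key, Config] = {}
        self.explorations = 0     # how many times we had to tune from scratch

    def resolve(self, key: Key, explore: Callable[[Key], Config]) -> Config:
        """Return the stored config for key, exploring only on a miss."""
        if key not in self._entries:
            self.explorations += 1
            self._entries[key] = explore(key)
        return self._entries[key]


def fake_explore(key: Key) -> Config:
    # Stand-in for a real tuning run; the values are invented.
    return {"engine": "vllm", "max_batch_size": 64}


kb = KnowledgeBase()
key = ("example-gpu", "example-model-7b", "fp16", 1)
first = kb.resolve(key, fake_explore)    # miss: runs the exploration
second = kb.resolve(key, fake_explore)   # hit: plain dictionary lookup
print(kb.explorations, first == second)
```

The second call never reaches fake_explore, which is the whole point: the expensive step runs once per key.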
Machines on the same LAN form a fleet and share the knowledge base and benchmark results. The winning config measured on one machine is immediately available to the whole fleet.
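How the fleet reconciles results is not described here, so the following is only one plausible rule: when two machines hold an entry for the same key, keep the one with the higher measured throughput. The Entry type, the simplified (hardware, model) key, and the merge function are assumptions for illustration.

```python
from typing import Dict, NamedTuple, Tuple

Key = Tuple[str, str]   # (hardware, model), simplified for the example


class Entry(NamedTuple):
    config: str
    tokens_per_second: float


def merge(local: Dict[Key, Entry], remote: Dict[Key, Entry]) -> Dict[Key, Entry]:
    """Combine two knowledge bases, keeping the faster entry for each key."""
    merged = dict(local)
    for key, entry in remote.items():
        current = merged.get(key)
        if current is None or entry.tokens_per_second > current.tokens_per_second:
            merged[key] = entry
    return merged


# Invented entries from two machines on the same LAN.
machine_a = {
    ("example-gpu", "model-7b"): Entry("config-a1", 80.0),
    ("example-gpu", "model-13b"): Entry("config-a2", 40.0),
}
machine_b = {
    ("example-gpu", "model-7b"): Entry("config-b1", 95.0),
    ("other-chip", "model-7b"): Entry("config-b2", 30.0),
}

fleet = merge(machine_a, machine_b)
print(fleet[("example-gpu", "model-7b")].config)
```

After the merge, both machines would see three keys, and the shared model-7b entry carries machine B's faster config.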
Agent-native: MCP server plus a built-in agent
AIMA is an MCP server, and it also runs an agent internally.
From the outside: point any MCP-compatible runtime at AIMA’s port and you get the full operational surface — hardware detection, model scan, engine selection, deployment, benchmark, fleet discovery, knowledge sync. No REST wrapper to write, no official SDK to wait for.
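For a sense of what an MCP client sends, here is a sketch that builds two JSON-RPC 2.0 messages: tools/list to discover the available tools, and tools/call to invoke one. Both method names come from the MCP specification. The tool name hardware_detect is hypothetical, since AIMA's actual tool names, port, and transport are not given here. A real client would normally use an MCP SDK and complete the initialize handshake before sending either message.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)   # JSON-RPC requests carry a unique id per request


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request as used by MCP."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool; "hardware_detect" is a made-up name for illustration.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "hardware_detect", "arguments": {}},
)

print(list_tools)
print(call_tool)
```

The tools/list response is where a runtime would learn the real tool names and argument schemas, so nothing needs to be hard-coded on the client side.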
AIMA currently runs in production as OpenClaw’s inference backend, covering LLM, ASR, TTS, image generation, and VLM. Any other MCP-speaking runtime plugs in the same way.
From the inside: AIMA also consumes MCP. The Explorer agent drives the self-tuning loop — that is why a single binary can reach vLLM-level throughput without exposing vLLM’s flag soup to you.
Try it
AIMA is open source under the Apache 2.0 license. One command installs the binary. Run aima hal detect to see what hardware it finds. Run aima run <model> to deploy a model.
Code on GitHub.