
OpenClaw + llama.cpp: Self-Host an Open Model and Connect It to OpenClaw

Run a GGUF model with llama.cpp's built-in OpenAI-compatible server, then point your OpenClaw agent at it. CPU-friendly, GPU-accelerated, fully local — here's the full setup.

Why llama.cpp?

llama.cpp is the most efficient open-source runtime for GGUF models. It runs anywhere (CPU, CUDA, Metal, ROCm, Vulkan), ships an OpenAI-compatible HTTP server out of the box, and supports aggressive quantization (Q4_K_M and below), so it can fit larger models onto more modest hardware than Ollama typically manages.

Use this guide if you want to:

  • Run a model that isn't in Ollama's registry
  • Squeeze more performance out of a constrained box (CPU-only or low-VRAM GPU)
  • Pin a specific GGUF quantization for reproducibility
  • Avoid the Ollama daemon and run llama.cpp's server directly

Step 1: Install llama.cpp

On macOS:

brew install llama.cpp
# Or build from source for the absolute latest:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build --config Release

On Linux with CUDA:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
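
Either source build drops its binaries in build/bin rather than on your PATH. A quick way to confirm the build worked:

# Source builds put binaries in build/bin
./build/bin/llama-server --version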

On Docker (any host):

docker pull ghcr.io/ggml-org/llama.cpp:server-cuda
# Or :server for CPU-only
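
The Docker images bundle llama-server as their entrypoint, so you mount a models directory and pass the same server flags you'll meet in Step 3. A minimal sketch, assuming the model file downloaded in Step 2:

docker run --gpus all -p 8080:8080 -v ~/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/qwen3-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 --n-gpu-layers 999
# Drop --gpus all and use the :server tag for CPU-only hosts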

Step 2: Pull a GGUF Model

Grab any GGUF file from Hugging Face. Good starter models:

  • Qwen3-7B-Instruct (Q4_K_M) — ~4.5 GB, runs on CPU, excellent quality for size
  • Llama 4 8B Instruct (Q4_K_M) — ~5 GB, strong general purpose
  • DeepSeek V4 Flash (Q4) — ~25 GB, frontier-adjacent (needs 24 GB+ VRAM)

# Example: Qwen3-7B Q4_K_M
mkdir -p ~/models
huggingface-cli download Qwen/Qwen3-7B-Instruct-GGUF \
  qwen3-7b-instruct-q4_k_m.gguf --local-dir ~/models
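
Before wiring anything up, it's worth a quick smoke test that the file actually loads and generates. One way is the llama-cli binary that ships alongside llama-server:

# Load the model, generate up to 32 tokens, then exit
llama-cli -m ~/models/qwen3-7b-instruct-q4_k_m.gguf \
  -p "Say hello in five words." -n 32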

Step 3: Start the llama.cpp OpenAI-Compatible Server

llama-server \
  -m ~/models/qwen3-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 32768 \
  --n-gpu-layers 999

Drop --n-gpu-layers if you're CPU-only. Test the endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role":"user","content":"hello"}]
  }'

You should get a normal OpenAI-shaped response back.
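
For quick iteration, pipe the response through jq to pull out just the reply text (assuming jq is installed):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"hello"}]}' \
  | jq -r '.choices[0].message.content'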

Step 4: Point OpenClaw at the llama.cpp Server

llama.cpp's server speaks OpenAI's wire format, so OpenClaw uses the OpenAI provider with a custom baseURL:

{
  "models": {
    "providers": {
      "openai": {
        "baseURL": "http://host.docker.internal:8080/v1",
        "apiKey": "no-key-needed"
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "openai/local"
      }
    }
  }
}

The apiKey can be any non-empty string; llama.cpp ignores it unless you start the server with --api-key. The model ID after the slash is the name sent in each request; when llama-server is serving a single model it accepts any value there, so local is fine. If OpenClaw runs directly on the host rather than in a container, use http://localhost:8080/v1 as the baseURL instead of host.docker.internal.
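
If you'd rather lock the endpoint down and pin the model name explicitly, both are one flag each on the server side. A sketch (team-secret is a placeholder, not a default):

# Require a bearer token and report the model name as "local"
llama-server \
  -m ~/models/qwen3-7b-instruct-q4_k_m.gguf \
  --alias local \
  --api-key team-secret \
  --host 0.0.0.0 --port 8080

With --api-key set, the apiKey in the OpenClaw config above must match it exactly.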

Step 5: Restart OpenClaw and Test

Send SIGUSR1 to your OpenClaw process (or restart the container) to pick up the new model config. Then send a test message via Telegram, Discord, or the gateway web chat.
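
For example, assuming the process or container is named openclaw (adjust to your deployment):

# Host process:
kill -USR1 "$(pgrep -f openclaw)"
# Docker container:
docker restart openclaw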

Tip: llama-server loads the model once at startup and keeps it resident for the life of the process, so there's no per-reply cold start to tune away. (The keep-alive knob you may know from Ollama, where keep_alive: 0 unloads a model right after each reply, has no equivalent here; just leave the server running.)

Performance Notes

  • CPU-only: 7B Q4_K_M models hit ~10–25 tokens/sec on a recent M-series Mac or Ryzen 7 box. Usable for low-traffic bots.
  • Single consumer GPU (3090/4090): 7B–14B at full speed; 30B quantized fits if you stay below 24 GB VRAM.
  • Workstation GPU (A6000/H100): 70B Q4_K_M comfortable; very long contexts become practical with flash attention (-fa) and a quantized KV cache.
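
Rather than trusting rough numbers, measure your own box with the llama-bench tool that ships with llama.cpp; it reports prompt-processing (pp) and token-generation (tg) speeds in tokens/sec:

# 512-token prompt, 128-token generation
llama-bench -m ~/models/qwen3-7b-instruct-q4_k_m.gguf -p 512 -n 128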

llama.cpp vs Ollama vs Hosted

  • llama.cpp — Maximum control, run any GGUF, smallest footprint, hands-on. Best when you have a specific model file or want to squeeze a constrained box.
  • Ollama — Friendlier UX, model registry, daemon manages models for you. See the OpenClaw + Ollama guide.
  • Hosted (OpenClaw Launch) — Zero-ops, frontier models like Claude/GPT/DeepSeek V4 Pro, AI credits included. From $3/mo.

What's Next?

Skip the Setup

Get a frontier-quality AI agent running in 10 seconds with OpenClaw Launch — AI credits included. Plans from $3/mo.

Deploy Now