
OpenClaw + llama.cpp: Self-Host an Open Model and Connect It to OpenClaw

Run a GGUF model with llama.cpp's built-in OpenAI-compatible server, then point your OpenClaw agent at it. CPU-friendly, GPU-accelerated, fully local — here's the full setup.

Why llama.cpp?

llama.cpp is the most efficient open-source runtime for GGUF models. It runs anywhere (CPU, CUDA, Metal, ROCm, Vulkan), ships an OpenAI-compatible HTTP server out of the box, and supports aggressive quantization (Q4_K_M and below), so it can fit larger models onto more modest hardware than Ollama typically manages.

Use this guide if you want to:

  • Run a model that isn't in Ollama's registry
  • Squeeze more performance out of a constrained box (CPU-only or low-VRAM GPU)
  • Pin a specific GGUF quantization for reproducibility
  • Avoid the Ollama daemon and run llama.cpp's server directly

Step 1: Install llama.cpp

On macOS:

brew install llama.cpp
# Or build from source for the absolute latest:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build --config Release

On Linux with CUDA:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
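
Either source build drops its binaries in build/bin rather than on your PATH. A quick way to confirm the build worked:

# Source builds put binaries in build/bin
./build/bin/llama-server --version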

On Docker (any host):

docker pull ghcr.io/ggml-org/llama.cpp:server-cuda
# Or :server for CPU-only
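
The Docker images bundle llama-server as their entrypoint, so you mount a models directory and pass the same server flags you'll meet in Step 3. A minimal sketch, assuming the model file downloaded in Step 2:

docker run --gpus all -p 8080:8080 -v ~/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/qwen3-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 --n-gpu-layers 999
# Drop --gpus all and use the :server tag for CPU-only hosts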

Step 2: Pull a GGUF Model

Grab any GGUF file from Hugging Face. Good starter models:

  • Qwen3-7B-Instruct (Q4_K_M) — ~4.5 GB, runs on CPU, excellent quality for size
  • Llama 4 8B Instruct (Q4_K_M) — ~5 GB, strong general purpose
  • DeepSeek V4 Flash (Q4) — ~25 GB, frontier-adjacent (needs 24 GB+ VRAM)

# Example: Qwen3-7B Q4_K_M
mkdir -p ~/models
huggingface-cli download Qwen/Qwen3-7B-Instruct-GGUF \
  qwen3-7b-instruct-q4_k_m.gguf --local-dir ~/models
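
Before wiring anything up, it's worth a quick smoke test that the file actually loads and generates. One way is the llama-cli binary that ships alongside llama-server:

# Load the model, generate up to 32 tokens, then exit
llama-cli -m ~/models/qwen3-7b-instruct-q4_k_m.gguf \
  -p "Say hello in five words." -n 32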

Step 3: Start the llama.cpp OpenAI-Compatible Server

llama-server \
  -m ~/models/qwen3-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 32768 \
  --n-gpu-layers 999

Drop --n-gpu-layers if you're CPU-only. Test the endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role":"user","content":"hello"}]
  }'

You should get a normal OpenAI-shaped response back.
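
For quick iteration, pipe the response through jq to pull out just the reply text (assuming jq is installed):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"hello"}]}' \
  | jq -r '.choices[0].message.content'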

Step 4: Point OpenClaw at the llama.cpp Server

llama.cpp's server speaks OpenAI's wire format, so OpenClaw uses the OpenAI provider with a custom baseURL:

{
  "models": {
    "providers": {
      "openai": {
        "baseURL": "http://host.docker.internal:8080/v1",
        "apiKey": "no-key-needed"
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "openai/local"
      }
    }
  }
}

The apiKey can be any non-empty string; llama.cpp ignores it unless you start the server with --api-key. The model ID after the slash is the name sent in each request; when llama-server is serving a single model it accepts any value there, so local is fine. If OpenClaw runs directly on the host rather than in a container, use http://localhost:8080/v1 as the baseURL instead of host.docker.internal.
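
If you'd rather lock the endpoint down and pin the model name explicitly, both are one flag each on the server side. A sketch (team-secret is a placeholder, not a default):

# Require a bearer token and report the model name as "local"
llama-server \
  -m ~/models/qwen3-7b-instruct-q4_k_m.gguf \
  --alias local \
  --api-key team-secret \
  --host 0.0.0.0 --port 8080

With --api-key set, the apiKey in the OpenClaw config above must match it exactly.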

Step 5: Restart OpenClaw and Test

Send SIGUSR1 to your OpenClaw process (or restart the container) to pick up the new model config. Then send a test message via Telegram, Discord, or the gateway web chat.
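
For example, assuming the process or container is named openclaw (adjust to your deployment):

# Host process:
kill -USR1 "$(pgrep -f openclaw)"
# Docker container:
docker restart openclaw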

Tip: llama-server loads the model once at startup and keeps it resident for the life of the process, so there's no per-reply cold start to tune away. (The keep-alive knob you may know from Ollama, where keep_alive: 0 unloads a model right after each reply, has no equivalent here; just leave the server running.)

Performance Notes

  • CPU-only: 7B Q4_K_M models hit ~10–25 tokens/sec on a recent M-series Mac or Ryzen 7 box. Usable for low-traffic bots.
  • Single consumer GPU (3090/4090): 7B–14B at full speed; 30B quantized fits if you stay below 24 GB VRAM.
  • Workstation GPU (A6000/H100): 70B Q4_K_M comfortable; very long contexts become practical with flash attention (-fa) and a quantized KV cache.
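
Rather than trusting rough numbers, measure your own box with the llama-bench tool that ships with llama.cpp; it reports prompt-processing (pp) and token-generation (tg) speeds in tokens/sec:

# 512-token prompt, 128-token generation
llama-bench -m ~/models/qwen3-7b-instruct-q4_k_m.gguf -p 512 -n 128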

llama.cpp vs Ollama vs Hosted

  • llama.cpp — Maximum control, run any GGUF, smallest footprint, hands-on. Best when you have a specific model file or want to squeeze a constrained box.
  • Ollama — Friendlier UX, model registry, daemon manages models for you. See the OpenClaw + Ollama guide.
  • Hosted (OpenClaw Launch) — Zero-ops, frontier models like Claude/GPT/DeepSeek V4 Pro, AI credits included. From $3/mo.

What's Next?

Skip the Setup

Get a frontier-quality AI agent running in 10 seconds with OpenClaw Launch — AI credits included. Plans from $3/mo.

Deploy Now