Guide
OpenClaw + llama.cpp: Self-Host an Open Model and Connect It to OpenClaw
Run a GGUF model with llama.cpp's built-in OpenAI-compatible server, then point your OpenClaw agent at it. CPU-friendly, GPU-accelerated, fully local — here's the full setup.
Why llama.cpp?
llama.cpp is the reference runtime for GGUF models and one of the most efficient ways to run them. It runs anywhere — CPU, CUDA, Metal, ROCm, Vulkan — ships an OpenAI-compatible HTTP server out of the box, and supports aggressive quantization (Q4_K_M and below), so you can squeeze larger models onto modest hardware than Ollama's defaults typically allow.
Use this guide if you want to:
- Run a model that isn't in Ollama's registry
- Squeeze more performance out of a constrained box (CPU-only or low-VRAM GPU)
- Pin a specific GGUF quantization for reproducibility
- Avoid the Ollama daemon and run llama.cpp's server directly
Step 1: Install llama.cpp
On macOS:
brew install llama.cpp
# Or build from source for the absolute latest:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build --config Release
On Linux with CUDA:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
On Docker (any host):
docker pull ghcr.io/ggml-org/llama.cpp:server-cuda
# Or :server for CPU-only
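Whichever route you took, confirm the binary runs before moving on. For a source build the binaries land in build/bin/; Homebrew puts llama-server on your PATH:
llama-server --version
# For a source build:
./build/bin/llama-server --version
This should print the llama.cpp build number and commit.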
Step 2: Pull a GGUF Model
Grab any GGUF file from Hugging Face. Good starter models:
- Qwen3-7B-Instruct (Q4_K_M) — ~4.5 GB, runs on CPU, excellent quality for size
- Llama 4 8B Instruct (Q4_K_M) — ~5 GB, strong general purpose
- DeepSeek V4 Flash (Q4) — ~25 GB, frontier-adjacent (needs 24 GB+ VRAM)
# Example: Qwen3-7B Q4_K_M
mkdir -p ~/models
huggingface-cli download Qwen/Qwen3-7B-Instruct-GGUF \
  qwen3-7b-instruct-q4_k_m.gguf --local-dir ~/models
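Before starting the server, you can smoke-test the file directly with llama-cli, which ships alongside llama-server (path assumes the download location above):
llama-cli -m ~/models/qwen3-7b-instruct-q4_k_m.gguf -p "Say hello" -n 32
If you get a coherent completion back, the model file is good.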
Step 3: Start the llama.cpp OpenAI-Compatible Server
llama-server \
-m ~/models/qwen3-7b-instruct-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 32768 \
  --n-gpu-layers 999
Drop --n-gpu-layers if you're CPU-only. Test the endpoint:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role":"user","content":"hello"}]
}'
You should get a normal OpenAI-shaped response back.
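To pull out just the generated text, pipe the response through jq (assuming jq is installed):
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"hello"}]}' \
  | jq -r '.choices[0].message.content'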
Step 4: Point OpenClaw at the llama.cpp Server
llama.cpp's server speaks OpenAI's wire format, so OpenClaw uses the OpenAI provider with a custom baseURL:
{
"models": {
"providers": {
"openai": {
"baseURL": "http://host.docker.internal:8080/v1",
"apiKey": "no-key-needed"
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "openai/local"
}
}
}
}
The apiKey can be any non-empty string (llama.cpp ignores it unless you start the server with --api-key). The model ID after the slash can be anything for a single-model setup, since llama-server serves its one loaded model regardless of the model field; local is fine. If OpenClaw runs directly on the host rather than in Docker, use http://localhost:8080/v1 as the baseURL instead of host.docker.internal.
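If OpenClaw runs in Docker, verify the container can actually reach the host-side server before restarting anything. A quick check, assuming your container is named openclaw and has curl available:
docker exec openclaw curl -s http://host.docker.internal:8080/v1/models
Note that on Linux, host.docker.internal only resolves if the container was started with --add-host=host.docker.internal:host-gateway.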
Step 5: Restart OpenClaw and Test
Send SIGUSR1 to your OpenClaw process (or restart the container) to pick up the new model config. Then send a test message via Telegram, Discord, or the gateway web chat.
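For example, assuming a single local process whose command line matches openclaw (adjust the pattern, or the container name, to your setup):
kill -USR1 "$(pgrep -f openclaw)"
# Or, for a Docker deployment:
docker restart openclaw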
One knob you don't need: unlike Ollama, llama-server loads the model once at startup and keeps it resident in memory for the life of the process, so there's no 1–3 second cold-start on each reply and no keep-alive setting to tune.
Performance Notes
- CPU-only: 7B Q4_K_M models hit ~10–25 tokens/sec on a recent M-series Mac or Ryzen 7 box. Usable for low-traffic bots.
- Single consumer GPU (3090/4090): 7B–14B run at full speed; a 30B model fits in 24 GB of VRAM at Q4 or below.
- Workstation GPU (A6000/H100): 70B Q4_K_M runs comfortably; very long context windows are workable with flash attention and a quantized KV cache.
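These numbers vary with hardware, quantization, and build flags. llama.cpp ships llama-bench for measuring throughput on your own box (model path from Step 2):
llama-bench -m ~/models/qwen3-7b-instruct-q4_k_m.gguf -ngl 999
It reports prompt-processing and token-generation speeds for the model as loaded.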
llama.cpp vs Ollama vs Hosted
- llama.cpp — Maximum control, run any GGUF, smallest footprint, hands-on. Best when you have a specific model file or want to squeeze a constrained box.
- Ollama — Friendlier UX, model registry, daemon manages models for you. See the OpenClaw + Ollama guide.
- Hosted (OpenClaw Launch) — Zero-ops, frontier models like Claude/GPT/DeepSeek V4 Pro, AI credits included. From $3/mo.
What's Next?
- OpenClaw + Ollama — Friendlier alternative if you don't need llama.cpp's control
- OpenClaw + vLLM — Production-grade serving for higher throughput
- OpenClaw + LM Studio — GUI option around llama.cpp
- All supported models