Setup Guide
OpenClaw + vLLM: High-Throughput Local LLM Inference
vLLM is one of the fastest open-source LLM inference engines available. This guide walks you through installing vLLM, starting a model server, and connecting it to OpenClaw via its OpenAI-compatible API — so your AI agent runs entirely on your own hardware.
What Is vLLM?
vLLM is an open-source, high-throughput LLM inference and serving engine developed at UC Berkeley. It achieves state-of-the-art serving performance through two core innovations:
- PagedAttention — A memory management algorithm that virtually eliminates GPU memory waste from KV cache fragmentation. vLLM can serve 2–4× more requests per GPU compared to naive implementations.
- Continuous batching — Instead of waiting for a full batch to finish, vLLM processes new requests as soon as a slot becomes available. This dramatically improves GPU utilization and reduces latency under concurrent load.
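The effect of continuous batching can be sketched with a toy decode-only simulation (this is an illustration, not vLLM's actual scheduler): static batching waits for the longest request in each batch, while continuous batching refills a slot the moment a request finishes.

```python
import heapq

def static_batch_makespan(lengths, batch_size):
    """Naive batching: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batch_makespan(lengths, slots):
    """Continuous batching: a finished request's slot is refilled immediately."""
    free_at = [0] * slots              # time at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0
    for length in lengths:             # requests in arrival order
        start = heapq.heappop(free_at) # earliest-free slot takes the request
        done = start + length
        finish = max(finish, done)
        heapq.heappush(free_at, done)
    return finish

lengths = [10, 2, 2, 10]  # decode steps per request
print(static_batch_makespan(lengths, batch_size=2))   # 20
print(continuous_batch_makespan(lengths, slots=2))    # 14
```

With the same two slots, continuous batching finishes in 14 steps instead of 20 because short requests never hold a batch open waiting for a long one.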
vLLM exposes an OpenAI-compatible REST API, so any application that supports OpenAI can use vLLM as a drop-in local backend — including OpenClaw.
Why Use vLLM with OpenClaw?
If you are self-hosting OpenClaw and want to run local models, vLLM is the best choice for production workloads:
- Faster than Ollama for production — vLLM's continuous batching serves multiple concurrent users efficiently. Ollama processes one request at a time by default, making it slow under load.
- GPU memory efficiency — PagedAttention lets vLLM fit larger models into the same GPU VRAM and handle longer contexts without OOM errors.
- Multi-user concurrency — OpenClaw can serve many users simultaneously. vLLM handles concurrent inference without queueing everything sequentially.
- Tensor parallelism — vLLM can split large models across multiple GPUs automatically, enabling 70B+ parameter models without expensive hardware.
- OpenAI-compatible — vLLM's API is fully compatible with the OpenAI format that OpenClaw already uses, so no custom integration is needed.
vLLM vs Ollama vs LM Studio vs TGI
Here is how vLLM compares to other popular local inference options:
| Feature | vLLM | Ollama | LM Studio | TGI |
|---|---|---|---|---|
| Throughput | Very high (continuous batching) | Moderate | Low–moderate | High |
| Ease of setup | Moderate (needs CUDA) | Very easy | Very easy (GUI) | Moderate |
| GPU efficiency | Best (PagedAttention) | Good | Good | Good |
| Multi-user concurrency | Excellent | Limited | Limited | Good |
| OpenAI-compatible API | Yes (built-in) | Yes | Yes | Partial |
| Best for | Production, multi-user | Dev / single-user | Beginners / desktop | Production (HuggingFace) |
Bottom line: Use vLLM when you need production-grade throughput, multi-user concurrency, or the best GPU utilization. Use Ollama for quick local development or single-user setups where ease of install matters more than performance.
Prerequisites
- A Linux machine with an NVIDIA GPU (CUDA 12.1+)
- At least 16 GB VRAM for 7B models; 40 GB+ for 70B models
- Python 3.9–3.12
- CUDA toolkit installed (`nvidia-smi` should work)
How to Install vLLM
Option A: pip install
Create a virtual environment and install vLLM:
```bash
python3 -m venv vllm-env
source vllm-env/bin/activate
pip install vllm
```
This installs vLLM with all CUDA dependencies. The first install may take a few minutes as it downloads compiled CUDA kernels.
Option B: Docker
The official vLLM Docker image is the easiest way to get a reproducible environment:
```bash
docker pull vllm/vllm-openai:latest

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```
The `-v ~/.cache/huggingface:/root/.cache/huggingface` mount caches model weights so they are not re-downloaded on container restart.
How to Start the vLLM Server
Once installed, start the vLLM OpenAI-compatible server with a model from HuggingFace Hub:
```bash
# Serve Llama 3.1 8B
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Serve Qwen 3 32B
vllm serve Qwen/Qwen3-32B --port 8000

# Serve DeepSeek R1 14B with quantization (less VRAM)
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --quantization awq --port 8000
```
The server starts an OpenAI-compatible API at http://localhost:8000. You can verify it is running:
```bash
curl http://localhost:8000/v1/models
```
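Once `/v1/models` responds, you can exercise the chat endpoint. The snippet below only builds the OpenAI-format request body that vLLM's `/v1/chat/completions` route expects; POST it with any HTTP client (the model name must match what you passed to `vllm serve`):

```python
import json

def chat_request_body(model, user_message, max_tokens=64):
    """Build an OpenAI-format chat completion payload for vLLM."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

body = chat_request_body("meta-llama/Llama-3.1-8B-Instruct",
                         "Say hello in one word.")
print(json.dumps(body, indent=2))
# POST the printed JSON to http://localhost:8000/v1/chat/completions
# with header: Content-Type: application/json
```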
Note: Gated models such as Llama require HuggingFace access. Set `export HUGGING_FACE_HUB_TOKEN=hf_your_token` before running vLLM.

How to Connect vLLM to OpenClaw
vLLM's API is OpenAI-compatible, so you configure it as an OpenAI-format provider in your openclaw.json config file. OpenClaw will send all model requests to your local vLLM server.
1. Add vLLM as a model provider
In your OpenClaw config, add a provider entry pointing to your vLLM server. Since vLLM uses the OpenAI API format, use the openai provider type with a custom baseURL:
```json
{
  "models": {
    "providers": {
      "vllm": {
        "baseUrl": "http://localhost:8000/v1",
        "apiKey": "EMPTY"
      }
    }
  }
}
```

vLLM does not require a real API key by default. Use `"EMPTY"` or any non-empty string as the key value.
2. Set the default model
Set the agent's primary model to your vLLM-hosted model. Use the exact model name you passed to vllm serve, prefixed with your provider name:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "vllm/meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}
```

3. Full openclaw.json example
```json
{
  "models": {
    "providers": {
      "vllm": {
        "baseUrl": "http://localhost:8000/v1",
        "apiKey": "EMPTY"
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "vllm/meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}
```

4. Restart OpenClaw
After updating the config, restart your OpenClaw container. It will now route all model requests through your local vLLM server.
If OpenClaw runs inside a Docker container, replace `localhost` with `host.docker.internal`: `http://host.docker.internal:8000/v1`

Performance Tuning
Tensor Parallelism (Multi-GPU)
Spread a large model across multiple GPUs using the --tensor-parallel-size flag. This is required for models that do not fit in a single GPU's VRAM:
```bash
# Serve a 70B model across 2 GPUs
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000
```
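As a quick sanity check before provisioning GPUs, you can estimate the per-GPU weight footprint under tensor parallelism. This is a back-of-the-envelope sketch (it ignores KV cache, activations, and runtime overhead, so treat it as a lower bound; the helper is not part of vLLM):

```python
def weight_gb_per_gpu(n_params_b, bytes_per_param, tp_size):
    """Rough per-GPU weight memory: params (billions) * bytes, split evenly.

    Ignores KV cache, activations, and CUDA graph overhead.
    """
    total_gb = n_params_b * bytes_per_param  # 1B params * 1 byte ~= 1 GB
    return total_gb / tp_size

# Llama 3.3 70B in fp16 (2 bytes/param) split across 2 GPUs:
print(weight_gb_per_gpu(70, 2, tp_size=2))  # 70.0 GB per GPU
```

By this estimate, an unquantized fp16 70B model needs roughly 70 GB per GPU at `--tensor-parallel-size 2`, which is 80 GB class hardware; quantization or a higher tensor-parallel degree brings it down.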
Quantization
Quantization reduces VRAM usage at a slight quality cost. vLLM supports several quantization formats:
```bash
# AWQ quantization (recommended: best quality/size trade-off)
vllm serve TheBloke/Llama-3.1-8B-Instruct-AWQ \
  --quantization awq --port 8000

# GPTQ quantization
vllm serve TheBloke/Llama-3.1-8B-Instruct-GPTQ \
  --quantization gptq --port 8000
```
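The VRAM saving is easy to approximate from bits per parameter (a rough helper that ignores quantization scales and zero-points; not a vLLM API):

```python
def weight_gb(n_params_b, bits_per_param):
    """Approximate weight memory in GB for a model of n_params_b billion params."""
    return n_params_b * bits_per_param / 8

for name, bits in [("fp16", 16), ("awq-int4", 4)]:
    print(f"8B model, {name}: {weight_gb(8, bits):.1f} GB")
# 8B model, fp16: 16.0 GB
# 8B model, awq-int4: 4.0 GB
```

Four-bit AWQ cuts weight memory roughly 4x versus fp16, which is why a quantized 8B model fits comfortably on a 16 GB GPU with room left for KV cache.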
Adjusting Max Batch Size
Tune --max-num-seqs to control how many concurrent requests vLLM processes. Higher values improve throughput but require more VRAM:
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 32 \
  --port 8000
```
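To see why higher `--max-num-seqs` costs VRAM, estimate the KV-cache footprint per concurrent sequence. The helper below uses the standard per-token formula (K and V tensors for every layer) with Llama-3.1-8B-style dimensions assumed for the example (32 layers, 8 KV heads under GQA, head dim 128):

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, n_seqs,
                 bytes_per_elem=2):
    """Rough fp16 KV-cache size: 2 tensors (K and V) per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_len * n_seqs / 2**30

# Llama-3.1-8B-style config, 4096-token context, 32 concurrent sequences:
print(kv_cache_gib(32, 8, 128, ctx_len=4096, n_seqs=32))  # 16.0 GiB
```

At these assumed dimensions each 4096-token sequence holds about 0.5 GiB of KV cache, so 32 concurrent sequences alone consume roughly 16 GiB on top of the model weights.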
Troubleshooting
CUDA errors on startup
If vLLM fails with a CUDA error, verify your CUDA version matches vLLM's requirements:
```bash
nvidia-smi      # Check GPU and CUDA version
nvcc --version  # Check CUDA compiler version
pip show vllm   # Check installed vLLM version
```
vLLM requires CUDA 12.1 or higher. If you have an older version, either upgrade CUDA or install a vLLM version pinned to your CUDA release (check the vLLM installation docs).
Out of Memory (OOM)
If vLLM runs out of GPU memory, try one or more of these:
- Use a quantized model (`--quantization awq`)
- Reduce `--max-model-len` (context window) to a smaller value like `4096`
- Lower `--gpu-memory-utilization` from the default `0.9` to `0.7`
- Use a smaller model (e.g., 8B instead of 32B)
- Add more GPUs and enable tensor parallelism
Slow startup
The first time vLLM loads a model it compiles CUDA kernels and downloads model weights from HuggingFace. This can take 5–15 minutes depending on model size and network speed. Subsequent starts are much faster because weights are cached in ~/.cache/huggingface.
If you are using Docker, mount the cache directory to avoid re-downloading:
```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```
Frequently Asked Questions
Does OpenClaw support vLLM natively?
Yes. vLLM serves an OpenAI-compatible API, and OpenClaw can connect to any OpenAI-compatible endpoint. Configure vLLM as a provider in openclaw.json with your vLLM server's base URL and OpenClaw will use it seamlessly.
Is vLLM faster than Ollama?
Yes, significantly, under concurrent load. vLLM's continuous batching and PagedAttention deliver 2–10× higher throughput than Ollama when multiple users are sending requests simultaneously. For single-user, single-request use cases, the difference is smaller. Ollama is easier to install and better suited for local development; vLLM is built for production multi-user serving.
What models does vLLM support?
vLLM supports most popular open-source models from HuggingFace, including Llama 3, Qwen 3, Mistral, DeepSeek R1, Phi-4, Gemma 3, and many others. Full model support list: docs.vllm.ai/en/stable/models/supported_models.
Can I use vLLM without a GPU?
vLLM has experimental CPU inference support but performance is very slow — typically 10–50× slower than GPU. For CPU-only local inference, Ollama is a more practical choice. Alternatively, use a cloud model via OpenClaw Launch and avoid managing local hardware entirely.
What's Next
Once vLLM is connected to OpenClaw, explore these related guides:
- OpenClaw + Ollama — Easier local inference for development and single-user setups
- OpenClaw + LiteLLM — Proxy layer for routing between vLLM, cloud APIs, and multiple providers
- OpenClaw + OpenRouter — Access 100+ hosted cloud models without managing your own GPU
- OpenClaw Agent Guide — Configure agents, skills, and memory once your model is connected
Skip the GPU Altogether
Running vLLM requires a powerful NVIDIA GPU, careful CUDA setup, and ongoing infrastructure maintenance. If you want a production-grade AI agent without managing servers, OpenClaw Launch gives you cloud-hosted OpenClaw with leading models (Claude Opus 4, GPT-5.2, Gemini 2.5) in 10 seconds — no GPU required.