
Setup Guide

OpenClaw + vLLM: High-Throughput Local LLM Inference

vLLM is one of the fastest open-source LLM inference engines available. This guide walks you through installing vLLM, starting a model server, and connecting it to OpenClaw via its OpenAI-compatible API — so your AI agent runs entirely on your own hardware.

What Is vLLM?

vLLM is an open-source, high-throughput LLM inference and serving engine originally developed at UC Berkeley. It achieves state-of-the-art serving performance through two core innovations:

  • PagedAttention — A memory management algorithm that virtually eliminates GPU memory waste from KV cache fragmentation. vLLM can serve 2–4× more requests per GPU compared to naive implementations.
  • Continuous batching — Instead of waiting for a full batch to finish, vLLM processes new requests as soon as a slot becomes available. This dramatically improves GPU utilization and reduces latency under concurrent load.

vLLM exposes an OpenAI-compatible REST API, so any application that supports OpenAI can use vLLM as a drop-in local backend — including OpenClaw.

Why Use vLLM with OpenClaw?

If you are self-hosting OpenClaw and want to run local models, vLLM is the best choice for production workloads:

  • Faster than Ollama for production — vLLM's continuous batching serves multiple concurrent users efficiently. Ollama processes one request at a time by default, making it slow under load.
  • GPU memory efficiency — PagedAttention lets vLLM fit larger models into the same GPU VRAM and handle longer contexts without OOM errors.
  • Multi-user concurrency — OpenClaw can serve many users simultaneously. vLLM handles concurrent inference without queueing everything sequentially.
  • Tensor parallelism — vLLM can split large models across multiple GPUs automatically, enabling 70B+ parameter models without expensive hardware.
  • OpenAI-compatible — vLLM's API is fully compatible with the OpenAI format that OpenClaw already uses, so no custom integration is needed.

vLLM vs Ollama vs LM Studio vs TGI

Here is how vLLM compares to other popular local inference options:

| Feature                | vLLM                            | Ollama            | Oll­ama — LM Studio | TGI                      |
|------------------------|---------------------------------|-------------------|---------------------|--------------------------|
| Throughput             | Very high (continuous batching) | Moderate          | Low–moderate        | High                     |
| Ease of setup          | Moderate (needs CUDA)           | Very easy         | Very easy (GUI)     | Moderate                 |
| GPU efficiency         | Best (PagedAttention)           | Good              | Good                | Good                     |
| Multi-user concurrency | Excellent                       | Limited           | Limited             | Good                     |
| OpenAI-compatible API  | Yes (built-in)                  | Yes               | Yes                 | Partial                  |
| Best for               | Production, multi-user          | Dev / single-user | Beginners / desktop | Production (HuggingFace) |

Bottom line: Use vLLM when you need production-grade throughput, multi-user concurrency, or the best GPU utilization. Use Ollama for quick local development or single-user setups where ease of install matters more than performance.

Prerequisites

  • A Linux machine with an NVIDIA GPU (CUDA 12.1+)
  • At least 16 GB VRAM for 7B models; 40 GB+ for 70B models
  • Python 3.9–3.12
  • CUDA toolkit installed (nvidia-smi should work)

Apple Silicon / CPU: vLLM has experimental support for Apple Silicon (Metal) and CPU inference, but performance is significantly lower than on NVIDIA GPUs. For Mac local inference, Ollama is a better fit.
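Before installing, it is worth confirming the environment. A quick sanity check on an NVIDIA machine (assuming the NVIDIA driver and Python are already installed):

```shell
# Check GPU model, driver version, and total VRAM
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv

# Confirm a supported Python version (3.9–3.12)
python3 --version
```

If nvidia-smi fails here, fix the driver installation before touching vLLM — nothing downstream will work without it.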

How to Install vLLM

Option A: pip install

Create a virtual environment and install vLLM:

python3 -m venv vllm-env
source vllm-env/bin/activate
pip install vllm

This installs vLLM with all CUDA dependencies. The first install may take a few minutes as it downloads compiled CUDA kernels.
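To confirm the install worked, import vLLM and print its version from inside the virtual environment created above:

```shell
# A clean import with a version number means the CUDA wheels installed correctly
python -c "import vllm; print(vllm.__version__)"
```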

Option B: Docker

The official vLLM Docker image is the easiest way to get a reproducible environment:

docker pull vllm/vllm-openai:latest

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct

The -v ~/.cache/huggingface mount caches model weights so they are not re-downloaded on container restart.

How to Start the vLLM Server

Once installed, start the vLLM OpenAI-compatible server with a model from HuggingFace Hub:

# Serve Llama 3.1 8B
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Serve Qwen 3 32B
vllm serve Qwen/Qwen3-32B --port 8000

# Serve DeepSeek R1 14B (distilled). To cut VRAM use, serve a
# pre-quantized AWQ checkpoint instead (see Performance Tuning below).
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --port 8000

The server starts an OpenAI-compatible API at http://localhost:8000. You can verify it is running:

curl http://localhost:8000/v1/models

HuggingFace token: Some models (like Llama 3) require accepting a license on HuggingFace and setting export HUGGING_FACE_HUB_TOKEN=hf_your_token before running vLLM.
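Beyond listing models, you can send a test chat completion to confirm end-to-end inference works. This assumes the Llama 3.1 8B server from above is running on port 8000:

```shell
# Send one chat request through the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```

A JSON response with a choices array means the server is ready for OpenClaw.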

How to Connect vLLM to OpenClaw

vLLM's API is OpenAI-compatible, so you configure it as an OpenAI-format provider in your openclaw.json config file. OpenClaw will send all model requests to your local vLLM server.

1. Add vLLM as a model provider

In your OpenClaw config, add a provider entry pointing to your vLLM server. Since vLLM speaks the OpenAI API format, the provider only needs a baseUrl aimed at your vLLM endpoint:

{
  "models": {
    "providers": {
      "vllm": {
        "baseUrl": "http://localhost:8000/v1",
        "apiKey": "EMPTY"
      }
    }
  }
}

vLLM does not require a real API key by default. Use "EMPTY" or any non-empty string as the key value.
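You can verify these provider settings independently of OpenClaw with the OpenAI Python SDK, since both speak the same protocol. A minimal sketch, assuming the vLLM server from the previous section is running locally and the openai package is installed:

```python
from openai import OpenAI

# Same base URL and placeholder key that go into openclaw.json.
# vLLM ignores the key, but the client requires a non-empty string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```

If this script prints a reply, any failure inside OpenClaw is a config issue, not a server issue.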

2. Set the default model

Set the agent's primary model to your vLLM-hosted model. Use the exact model name you passed to vllm serve, prefixed with your provider name:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "vllm/meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}

3. Full openclaw.json example

{
  "models": {
    "providers": {
      "vllm": {
        "baseUrl": "http://localhost:8000/v1",
        "apiKey": "EMPTY"
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "vllm/meta-llama/Llama-3.1-8B-Instruct"
      }
    }
  }
}

4. Restart OpenClaw

After updating the config, restart your OpenClaw container. It will now route all model requests through your local vLLM server.

Docker networking note: If OpenClaw runs in Docker but vLLM runs on the host, replace localhost with host.docker.internal:
http://host.docker.internal:8000/v1
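On Linux, host.docker.internal is not defined by default; Docker 20.10+ can map it to the host gateway with an extra flag. A sketch of the idea — the image name and port here are placeholders, not OpenClaw's actual ones:

```shell
# Make host.docker.internal resolve to the Docker host on Linux
docker run --add-host=host.docker.internal:host-gateway \
  -p 3000:3000 \
  openclaw-image   # placeholder: substitute your actual OpenClaw image
```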

Performance Tuning

Tensor Parallelism (Multi-GPU)

Spread a large model across multiple GPUs using the --tensor-parallel-size flag. This is required for models that do not fit in a single GPU's VRAM:

# Serve a 70B model across 2 GPUs
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000

Quantization

Quantization reduces VRAM usage at a slight quality cost. vLLM supports several quantization formats:

# AWQ quantization (recommended — best quality/size trade-off)
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq --port 8000

# GPTQ quantization
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
  --quantization gptq --port 8000

Note that the --quantization flag tells vLLM how to load weights that are already quantized — it does not quantize a full-precision model on the fly, so pick a checkpoint published in the matching format.

Adjusting Max Batch Size

Tune --max-num-seqs to control how many concurrent requests vLLM processes. Higher values improve throughput but require more VRAM:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 32 \
  --port 8000
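To see continuous batching in action, fire several requests at once and watch them complete concurrently rather than one by one. A rough smoke test against the server above:

```shell
# Send 8 completion requests in parallel; vLLM batches them on the fly
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}' &
done
wait   # all 8 should finish in roughly the time of the slowest single request
```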

Troubleshooting

CUDA errors on startup

If vLLM fails with a CUDA error, verify your CUDA version matches vLLM's requirements:

nvidia-smi          # Check GPU and CUDA version
nvcc --version      # Check CUDA compiler version
pip show vllm       # Check installed vLLM version

vLLM requires CUDA 12.1 or higher. If you have an older version, either upgrade CUDA or install a vLLM version pinned to your CUDA release (check the vLLM installation docs).

Out of Memory (OOM)

If vLLM runs out of GPU memory, try one or more of these:

  • Use a quantized model (--quantization awq)
  • Reduce --max-model-len (context window) to a smaller value like 4096
  • Lower --gpu-memory-utilization from the default 0.9 to 0.7
  • Use a smaller model (e.g., 8B instead of 32B)
  • Add more GPUs and enable tensor parallelism
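A back-of-envelope way to reason about these options: weight memory alone is roughly the parameter count times bytes per parameter, before the KV cache and activations that --gpu-memory-utilization budgets for. This rough estimator is purely illustrative, not a vLLM API:

```python
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough weight-only VRAM estimate; excludes KV cache and activations."""
    return params_billions * bytes_per_param

# 8B model in FP16 (2 bytes/param) vs 4-bit AWQ (~0.5 bytes/param)
print(weight_vram_gb(8, 2.0))   # ~16 GB of weights alone
print(weight_vram_gb(8, 0.5))   # ~4 GB of weights alone
```

This is why an 8B FP16 model is tight on a 16 GB card once the KV cache is added, and why quantization or a smaller context window is usually the fastest OOM fix.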

Slow startup

The first time vLLM loads a model it compiles CUDA kernels and downloads model weights from HuggingFace. This can take 5–15 minutes depending on model size and network speed. Subsequent starts are much faster because weights are cached in ~/.cache/huggingface.

If you are using Docker, mount the cache directory to avoid re-downloading:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct

Frequently Asked Questions

Does OpenClaw support vLLM natively?

Yes. vLLM serves an OpenAI-compatible API, and OpenClaw can connect to any OpenAI-compatible endpoint. Configure vLLM as a provider in openclaw.json with your vLLM server's base URL and OpenClaw will use it seamlessly.

Is vLLM faster than Ollama?

Yes, significantly, under concurrent load. vLLM's continuous batching and PagedAttention can deliver 2–10× higher throughput than Ollama when multiple users send requests simultaneously. For single-user, single-request use cases the difference is smaller. Ollama is easier to install and better suited for local development; vLLM is built for production multi-user serving.

What models does vLLM support?

vLLM supports most popular open-source models from HuggingFace, including Llama 3, Qwen 3, Mistral, DeepSeek R1, Phi-4, Gemma 3, and many others. Full model support list: docs.vllm.ai/en/stable/models/supported_models.

Can I use vLLM without a GPU?

vLLM has experimental CPU inference support but performance is very slow — typically 10–50× slower than GPU. For CPU-only local inference, Ollama is a more practical choice. Alternatively, use a cloud model via OpenClaw Launch and avoid managing local hardware entirely.

What's Next

Once vLLM is connected to OpenClaw, explore these related guides:

Skip the GPU Altogether

Running vLLM requires a powerful NVIDIA GPU, careful CUDA setup, and ongoing infrastructure maintenance. If you want a production-grade AI agent without managing servers, OpenClaw Launch gives you cloud-hosted OpenClaw with leading models (Claude Opus 4, GPT-5.2, Gemini 2.5) in 10 seconds — no GPU required.

No GPU? No Problem.

Deploy your OpenClaw AI agent in 10 seconds with cloud-hosted models. No hardware, no setup, plans from $3/mo.
