Hermes Agent + vLLM: Production GPU Serving for Open-Weight Models
vLLM is a production-grade engine for serving open-weight LLMs, widely used under the hood by labs and inference providers. Pair it with Hermes Agent when you need throughput, prefix caching, and OpenAI-compatible HTTP on your own GPUs.
What Is vLLM?
vLLM is an open-source LLM serving engine built around PagedAttention — a memory management scheme that gives much higher throughput than naive transformer serving. It supports nearly every open-weight model on Hugging Face, exposes an OpenAI-compatible HTTP API, and handles batching, prefix caching, speculative decoding, and tensor / pipeline parallelism out of the box.
For Hermes Agent specifically, vLLM is the right backend when you outgrow Ollama or LM Studio — typically when you need to serve 10+ concurrent users, batch throughput above a few requests per second, or models that don't fit on a single consumer GPU.
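To make the PagedAttention idea above more concrete, here is a toy sketch of block-based KV cache allocation. It is not vLLM's code: the class name `PagedKVCache`, its methods, and the block size of 16 are all illustrative assumptions. The point it shows is that a sequence only claims a new fixed-size block when its current blocks are full, so memory is not reserved up front for the maximum context length.

```python
import math


class PagedKVCache:
    """Toy block allocator (hypothetical): each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids still unused
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_tokens(self, seq_id: str, n: int) -> None:
        """Make room for n more tokens, allocating blocks only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + n
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("no free KV blocks; the request would have to wait")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
cache.append_tokens("chat-a", 20)  # 20 tokens need 2 blocks
cache.append_tokens("chat-b", 5)   # 5 tokens need 1 block
cache.append_tokens("chat-a", 10)  # 30 tokens still fit in the same 2 blocks
print(len(cache.free_blocks))      # 5
```

In this sketch a short chat holds one block rather than a full context window's worth of memory, which is why many concurrent sequences can share one GPU.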
When to Use vLLM with Hermes
- Concurrent users. 10+ simultaneous chats — PagedAttention keeps GPU memory packed efficiently.
- Big models. Llama 70B, Mixtral 8x22B, DeepSeek-V3 671B — vLLM handles multi-GPU tensor / pipeline parallelism.
- Prefix caching. Agent workloads with shared system prompts get massive wins from vLLM's automatic prefix cache.
- Data residency. Sensitive workloads that can't leave your VPC.
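The prefix-caching point in the list above can be illustrated with a small sketch. It assumes a simplified scheme in which each full block of tokens is hashed together with everything before it, so two requests that start with the same system prompt produce the same leading hashes. The function names and the block size are hypothetical, and vLLM's real implementation differs in detail.

```python
import hashlib


def block_hashes(token_ids: list[int], block_size: int = 16) -> list[str]:
    """Hash each full block chained with the hash of the blocks before it."""
    hashes = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        text = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(text.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(cache: set[str], token_ids: list[int], block_size: int = 16) -> int:
    """Count leading tokens already in the cache, then store the remaining blocks."""
    hits = 0
    still_matching = True
    for h in block_hashes(token_ids, block_size):
        if still_matching and h in cache:
            hits += block_size
        else:
            still_matching = False
            cache.add(h)
    return hits


system_prompt = list(range(1000, 1064))  # 64 shared tokens = 4 full blocks
prefix_cache: set[str] = set()
first = cached_prefix_tokens(prefix_cache, system_prompt + [1, 2, 3])
second = cached_prefix_tokens(prefix_cache, system_prompt + [7, 8, 9, 10])
print(first, second)  # 0 64
```

The first request fills the cache; the second reuses all 64 system-prompt tokens and only has to prefill its own suffix, which is where the time-to-first-token savings for agent workloads come from.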
Step 1: Install vLLM
# CUDA 12.x recommended
pip install vllm
# Or run via Docker
docker run --gpus all -p 8000:8000 \
--ipc=host vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct
vLLM also supports AMD GPUs (ROCm), Intel CPUs/GPUs, and Apple Silicon (via the MLX integration), though CUDA on H100 / H200 / B200 remains the highest-throughput path.
Step 2: Serve a Model
# Single-GPU (8B at FP16 fits on one card; 70B at FP16 does not, see the sizing table)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--max-model-len 32768
# Two-GPU tensor parallel (e.g. 2x A100 80GB for 70B FP16)
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--port 8000
# With prefix caching for agent workloads (reuses the KV cache for shared prompt prefixes)
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--enable-prefix-caching \
--port 8000
Step 3: Point Hermes Agent at vLLM
vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint, so configure Hermes's OpenAI provider with a custom base URL:
# Hermes accepts OPENAI_API_BASE and OPENAI_API_KEY
export OPENAI_API_BASE=http://your-vllm-host:8000/v1
export OPENAI_API_KEY=local # vLLM accepts any token by default
hermes inference set openai
hermes model set meta-llama/Llama-3.3-70B-Instruct
# config.yaml equivalent:
# inference:
#   provider: openai
#   base_url: http://your-vllm-host:8000/v1
# model:
#   default: meta-llama/Llama-3.3-70B-Instruct
The model name must exactly match the name vLLM serves the model under. By default that is the value passed to vllm serve (or --model); to expose a different name, launch vLLM with --served-model-name.
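Because a mismatched model name is an easy mistake, it can help to ask the server which names it reports before configuring Hermes. The sketch below assumes the standard OpenAI-style `/v1/models` listing, which vLLM's OpenAI-compatible server provides; the helper names `served_model_ids` and `check_model` are hypothetical, and the host in the usage note is the placeholder from the text above.

```python
import json
import urllib.request


def served_model_ids(models_payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def check_model(base_url: str, expected: str, api_key: str = "local") -> bool:
    """Return True if `expected` is among the ids the server reports."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return expected in served_model_ids(payload)


# Offline demonstration with a hand-written payload in the OpenAI list shape.
sample = {
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.3-70B-Instruct", "object": "model"}],
}
print(served_model_ids(sample))
```

Against a live server, `check_model("http://your-vllm-host:8000/v1", "meta-llama/Llama-3.3-70B-Instruct")` should return `True` before you run the Hermes commands.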
Hardware Sizing for Common Models
| Model | Minimum Hardware | Recommended |
|---|---|---|
| Llama 3.1 8B (FP16) | 1× A10 24GB | 1× A100 80GB |
| Llama 3.3 70B (FP16) | 2× A100 80GB | 2× H100 80GB |
| Llama 3.3 70B (AWQ-INT4) | 1× A100 80GB | 1× H100 80GB |
| Mixtral 8x22B (FP16) | 4× A100 80GB | 4× H100 80GB |
| Qwen 3.5 32B (FP16) | 2× A100 40GB | 1× A100 80GB |
| DeepSeek V3 (FP8) | 16× H100 80GB (two nodes; the ~671 GB of FP8 weights exceed 8× 80GB) | 8× H200 141GB |
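The rows above follow from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below is a rough lower bound only, since it ignores the KV cache, activations, and quantization metadata. The function names are hypothetical, and the 0.9 usable fraction is an assumption chosen to mirror vLLM's default `--gpu-memory-utilization`.

```python
import math

# Approximate bytes per parameter for each weight format.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def min_gpus(params_billions: float, precision: str, gpu_gb: float,
             usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable memory covers the weights alone."""
    usable_per_gpu = gpu_gb * usable_fraction
    return math.ceil(weight_gb(params_billions, precision) / usable_per_gpu)


print(weight_gb(70, "fp16"))     # 140.0
print(min_gpus(70, "fp16", 80))  # 2
print(min_gpus(70, "int4", 80))  # 1
```

Treat the result as a floor: real deployments need extra headroom for the KV cache, which grows with context length and the number of concurrent sequences.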
Production Tips for Hermes Workloads
- Enable prefix caching. Hermes agents repeat large system prompts; --enable-prefix-caching typically cuts TTFT by 5–10×.
- Set --max-num-seqs high. Default is conservative. For chat workloads, 128+ is usually fine.
- Use quantized weights. AWQ-INT4 or GPTQ-INT4 lets a 70B model fit on a single 80GB GPU with minimal quality loss.
- Pin your vLLM version. The serving engine evolves fast; pin to a specific tag in production.
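- Watch running and waiting requests. vLLM exposes Prometheus metrics at /metrics, including the vllm:num_requests_running and vllm:num_requests_waiting gauges; alert when the waiting count stays above 5.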
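As a sketch of the alerting tip above, the following parses Prometheus text output and flags a sustained backlog. It assumes the gauges are exported as `vllm:num_requests_running` and `vllm:num_requests_waiting`, which is the naming in recent vLLM releases; check your own `/metrics` output, since names can change between versions. The helper names, the sample text, and the three-sample window are illustrative.

```python
def gauge_total(metrics_text: str, metric_name: str) -> float:
    """Sum every sample of `metric_name` in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric_name:
            total += float(value)
    return total


def sustained_backlog(samples: list[float], threshold: float = 5, window: int = 3) -> bool:
    """True if the last `window` samples all exceed `threshold`."""
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)


# Hand-written sample in the shape of a /metrics response (values are made up).
sample_text = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.3-70B-Instruct"} 12.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.3-70B-Instruct"} 7.0
"""
print(gauge_total(sample_text, "vllm:num_requests_waiting"))  # 7.0
print(sustained_backlog([2, 6, 7, 9]))                        # True
```

In production this logic usually lives in a Prometheus alert rule rather than a script; the sketch only shows what "pending above 5, sustained" means in terms of the scraped values.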
What's Next?
- Hermes Agent + Llama — Llama-specific tuning
- Hermes Agent + Mistral — Mixtral / Codestral on vLLM
- Hermes Agent + Ollama — Lighter-weight local alternative
- Hermes Agent + LM Studio — GUI alternative for desktop