Guide

Hermes Agent + vLLM: Production GPU Serving for Open-Weight Models

vLLM is the production-grade engine for serving open-weight LLMs — the same system used by major labs and inference providers under the hood. Pair it with Hermes Agent when you need throughput, prefix caching, and OpenAI-compatible HTTP on your own GPUs.

What Is vLLM?

vLLM is an open-source LLM serving engine built around PagedAttention — a memory management scheme that gives much higher throughput than naive transformer serving. It supports nearly every open-weight model on Hugging Face, exposes an OpenAI-compatible HTTP API, and handles batching, prefix caching, speculative decoding, and tensor / pipeline parallelism out of the box.

For Hermes Agent specifically, vLLM is the right backend when you outgrow Ollama or LM Studio — typically when you need to serve 10+ concurrent users, batch throughput above a few requests per second, or models that don't fit on a single consumer GPU.

When to Use vLLM with Hermes

Concurrent users. 10+ simultaneous chats — PagedAttention keeps GPU memory packed efficiently.
Big models. Llama 70B, Mixtral 8x22B, DeepSeek-V3 671B — vLLM handles multi-GPU tensor / pipeline parallelism.
Prefix caching. Agent workloads with shared system prompts get massive wins from vLLM's automatic prefix cache.
Data residency. Sensitive workloads that can't leave your VPC.

Step 1: Install vLLM

# CUDA 12.x recommended
pip install vllm

# Or run via Docker
docker run --gpus all -p 8000:8000 \
  --ipc=host vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct

vLLM also supports AMD GPUs (ROCm), Intel CPUs/GPUs, and Apple Silicon (via the MLX integration), though CUDA on H100 / H200 / B200 remains the highest-throughput path.

Step 2: Serve a Model

# Single-GPU
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --port 8000 \
  --max-model-len 32768

# Two-GPU tensor parallel (e.g. 2x A100 80GB for 70B FP16)
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000

# With prefix caching for agent workloads (huge speedup)
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --enable-prefix-caching \
  --port 8000

Step 3: Point Hermes Agent at vLLM

vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint, so configure Hermes's OpenAI provider with a custom base URL:

# Hermes accepts OPENAI_API_BASE and OPENAI_API_KEY
export OPENAI_API_BASE=http://your-vllm-host:8000/v1
export OPENAI_API_KEY=local   # vLLM accepts any token by default

hermes inference set openai
hermes model set meta-llama/Llama-3.3-70B-Instruct

# config.yaml equivalent:
# inference:
#   provider: openai
#   base_url: http://your-vllm-host:8000/v1
# model:
#   default: meta-llama/Llama-3.3-70B-Instruct

The model name must exactly match what vLLM was launched with — vLLM doesn't do model aliasing.

Hardware Sizing for Common Models

Model	Minimum Hardware	Recommended
Llama 3.1 8B (FP16)	1× A10 24 GB	1× A100 80GB
Llama 3.3 70B (FP16)	2× A100 80GB	2× H100 80GB
Llama 3.3 70B (AWQ-INT4)	1× A100 80GB	1× H100 80GB
Mixtral 8x22B (FP16)	4× A100 80GB	4× H100 80GB
Qwen 3.5 32B (FP16)	2× A100 40GB	1× A100 80GB
DeepSeek V3 (FP8)	8× H100 80GB	8× H200 141GB

Production Tips for Hermes Workloads

Enable prefix caching. Hermes agents repeat large system prompts; --enable-prefix-caching typically cuts TTFT by 5–10×.
Set --max-num-seqs high. Default is conservative. For chat workloads, 128+ is usually fine.
Use quantized weights. AWQ-INT4 or GPTQ-INT4 lets a 70B model fit on a single 80GB GPU with minimal quality loss.
Pin your vLLM version. The serving engine evolves fast; pin to a specific tag in production.
Watch num_running and num_pending. vLLM exposes Prometheus metrics; alert on pending > 5 sustained.

What's Next?

Hermes Agent + Llama — Llama-specific tuning
Hermes Agent + Mistral — Mixtral / Codestral on vLLM
Hermes Agent + Ollama — Lighter-weight local alternative
Hermes Agent + LM Studio — GUI alternative for desktop