Guide
Hermes Agent + Llama: Run Hermes on Meta's Open-Weight Models
Llama — Meta's open-weight model family — is one of the most natural fits for Hermes Agent. The same open-source philosophy underpins both projects, and Llama's permissive license means you can run Hermes end-to-end on your own hardware with no external API dependency.
What Is Llama?
Llama is Meta's family of open-weight large language models, released under a community license that permits commercial use up to 700M monthly active users. The current generation — Llama 4 — brings native multimodality, mixture-of-experts architecture, and a 10M-token context window in its largest variants.
Hermes Agent reaches Llama through three paths: self-hosted via Ollama, vLLM, or llama.cpp; hosted via Together AI, Groq, Fireworks, or Cerebras; or aggregated via OpenRouter (one key, auto-routed to the cheapest provider).
Llama Model Lineup for Hermes
| Model | Best For | Context | Notes |
|---|---|---|---|
| Llama 4 Maverick | Heavy reasoning, frontier-grade tool use | 1M tokens | MoE 400B total / 17B active |
| Llama 4 Scout | Long-context research, multi-doc agents | 10M tokens | MoE 109B / 17B active |
| Llama 3.3 70B | General agent default, strong instruction following | 128K tokens | Dense, runs on 2×A100 80GB |
| Llama 3.2 11B Vision | Multimodal chat with image input | 128K tokens | Runs on one consumer GPU |
| Llama 3.1 8B | Local self-host, low VRAM, fast | 128K tokens | Runs on 8GB VRAM (q4) |
Option 1: Hermes Agent on OpenClaw Launch (Easiest)
- Go to openclawlaunch.com/hermes-hosting and start a Hermes deploy.
- Select Llama 4 Maverick (or any other Llama variant) from the model dropdown.
- Connect Telegram, Discord, WhatsApp, or another channel.
- Click Deploy. Your Llama-powered Hermes Agent is live in roughly 30 seconds.
Option 2: Self-Hosted Llama via Ollama
Ollama is the easiest way to run Llama locally. Install Ollama, pull a model, and point Hermes at the local endpoint.
# Install Ollama (one-line install on macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a Llama model
ollama pull llama3.3:70b
# Tell Hermes to use Ollama
export OLLAMA_HOST=http://127.0.0.1:11434
hermes inference set ollama
hermes model set llama3.3:70bFor local-only deploys, see also Hermes Agent + Ollama for the full walkthrough including GPU sizing and memory tuning.
Option 3: Self-Hosted Llama via vLLM (Production)
vLLM is the production-grade serving engine for Llama. Use it when you need throughput, batch inference, or OpenAI-compatible HTTP for multiple clients.
# Run Llama 3.3 70B on vLLM (requires 2x A100 80GB or 1x H100)
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
# Point Hermes at the vLLM endpoint (OpenAI-compatible)
export OPENAI_API_BASE=http://127.0.0.1:8000/v1
export OPENAI_API_KEY=local
hermes inference set openai
hermes model set meta-llama/Llama-3.3-70B-InstructOption 4: Hosted Llama via OpenRouter, Groq, or Together AI
If you don't want to manage GPUs, hosted Llama is competitive with frontier closed models on cost — especially via Groq (extreme speed) and Cerebras (extreme speed at scale).
# OpenRouter — one key, auto-routed to cheapest provider
export OPENROUTER_API_KEY=sk-or-...
hermes inference set openrouter
hermes model set meta-llama/llama-4-maverick
# Groq — ~500 tokens/sec on Llama 3.3 70B
export GROQ_API_KEY=gsk_...
hermes inference set groq
hermes model set llama-3.3-70b-versatileWhen to Choose Llama over Closed Models
Choose Llama when open weights matter: regulated industries that need on-prem deployment, research workflows where reproducibility requires the same weights tomorrow, or cost-sensitive high-volume bots where serving your own model is cheaper than per-token API spend at scale.
Choose Llama when data residency matters: messages never leave your infrastructure. With Hermes + self-hosted Llama, you can run a fully air-gapped agent.
What's Next?
- Hermes Agent + Ollama — Local-first deploy
- Hermes Agent + vLLM — Production GPU serving for Llama
- Hermes Agent + OpenRouter — Hosted Llama via single key
- Hermes Agent + Mistral — Another strong open-weight option