Can Hermes Agent run on self-hosted Llama?

Yes. Point Hermes at any Ollama, vLLM, or llama.cpp endpoint. Set inference.provider to ollama or openai-compatible and set the model name.

Which Llama model should I use with Hermes?

Llama 3.3 70B is the recommended default for hosted deploys — strong instruction following, balanced cost. Llama 4 Maverick for frontier-grade reasoning. Llama 3.1 8B for local self-host on consumer GPUs.

Is Llama free to use commercially with Hermes?

Yes, under Meta’s community license up to 700M monthly active users. Most Hermes deployments fall well below that threshold. Self-hosted Llama incurs only your own hardware or hosted-provider costs.

← Home

Guide

Hermes Agent + Llama: Run Hermes on Meta's Open-Weight Models

Llama — Meta's open-weight model family — is one of the most natural fits for Hermes Agent. The same open-source philosophy underpins both projects, and Llama's permissive license means you can run Hermes end-to-end on your own hardware with no external API dependency.

What Is Llama?

Llama is Meta's family of open-weight large language models, released under a community license that permits commercial use up to 700M monthly active users. The current generation — Llama 4 — brings native multimodality, mixture-of-experts architecture, and a 10M-token context window in its largest variants.

Hermes Agent reaches Llama through three paths: self-hosted via Ollama, vLLM, or llama.cpp; hosted via Together AI, Groq, Fireworks, or Cerebras; or aggregated via OpenRouter (one key, auto-routed to the cheapest provider).

Llama Model Lineup for Hermes

Model	Best For	Context	Notes
Llama 4 Maverick	Heavy reasoning, frontier-grade tool use	1M tokens	MoE 400B total / 17B active
Llama 4 Scout	Long-context research, multi-doc agents	10M tokens	MoE 109B / 17B active
Llama 3.3 70B	General agent default, strong instruction following	128K tokens	Dense, runs on 2×A100 80GB
Llama 3.2 11B Vision	Multimodal chat with image input	128K tokens	Runs on one consumer GPU
Llama 3.1 8B	Local self-host, low VRAM, fast	128K tokens	Runs on 8GB VRAM (q4)

Option 1: Hermes Agent on OpenClaw Launch (Easiest)

Go to openclawlaunch.com/hermes-hosting and start a Hermes deploy.
Select Llama 4 Maverick (or any other Llama variant) from the model dropdown.
Connect Telegram, Discord, WhatsApp, or another channel.
Click Deploy. Your Llama-powered Hermes Agent is live in roughly 30 seconds.

Tip: OpenClaw Launch routes Llama requests through OpenRouter, which auto-selects the cheapest healthy provider behind the scenes. AI credits are included.

Option 2: Self-Hosted Llama via Ollama

Ollama is the easiest way to run Llama locally. Install Ollama, pull a model, and point Hermes at the local endpoint.

# Install Ollama (one-line install on macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a Llama model
ollama pull llama3.3:70b

# Tell Hermes to use Ollama
export OLLAMA_HOST=http://127.0.0.1:11434
hermes inference set ollama
hermes model set llama3.3:70b

For local-only deploys, see also Hermes Agent + Ollama for the full walkthrough including GPU sizing and memory tuning.

Option 3: Self-Hosted Llama via vLLM (Production)

vLLM is the production-grade serving engine for Llama. Use it when you need throughput, batch inference, or OpenAI-compatible HTTP for multiple clients.

# Run Llama 3.3 70B on vLLM (requires 2x A100 80GB or 1x H100)
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

# Point Hermes at the vLLM endpoint (OpenAI-compatible)
export OPENAI_API_BASE=http://127.0.0.1:8000/v1
export OPENAI_API_KEY=local
hermes inference set openai
hermes model set meta-llama/Llama-3.3-70B-Instruct

Option 4: Hosted Llama via OpenRouter, Groq, or Together AI

If you don't want to manage GPUs, hosted Llama is competitive with frontier closed models on cost — especially via Groq (extreme speed) and Cerebras (extreme speed at scale).

# OpenRouter — one key, auto-routed to cheapest provider
export OPENROUTER_API_KEY=sk-or-...
hermes inference set openrouter
hermes model set meta-llama/llama-4-maverick

# Groq — ~500 tokens/sec on Llama 3.3 70B
export GROQ_API_KEY=gsk_...
hermes inference set groq
hermes model set llama-3.3-70b-versatile

When to Choose Llama over Closed Models

Choose Llama when open weights matter: regulated industries that need on-prem deployment, research workflows where reproducibility requires the same weights tomorrow, or cost-sensitive high-volume bots where serving your own model is cheaper than per-token API spend at scale.

Choose Llama when data residency matters: messages never leave your infrastructure. With Hermes + self-hosted Llama, you can run a fully air-gapped agent.

What's Next?

Hermes Agent + Ollama — Local-first deploy
Hermes Agent + vLLM — Production GPU serving for Llama
Hermes Agent + OpenRouter — Hosted Llama via single key
Hermes Agent + Mistral — Another strong open-weight option