Guide

Hermes Agent + Llama: Run Hermes on Meta's Open-Weight Models

Llama — Meta's open-weight model family — is one of the most natural fits for Hermes Agent. The same open-source philosophy underpins both projects, and Llama's permissive license means you can run Hermes end-to-end on your own hardware with no external API dependency.

What Is Llama?

Llama is Meta's family of open-weight large language models, released under a community license that permits commercial use for products with up to 700M monthly active users (larger services need a separate license from Meta). The current generation, Llama 4, brings native multimodality, a mixture-of-experts architecture, and context windows of up to 10M tokens (Llama 4 Scout; Maverick supports 1M).

Hermes Agent reaches Llama through three paths: self-hosted via Ollama, vLLM, or llama.cpp; hosted via Together AI, Groq, Fireworks, or Cerebras; or aggregated via OpenRouter (one key, auto-routed to the cheapest provider).

Llama Model Lineup for Hermes

| Model | Best For | Context | Notes |
|---|---|---|---|
| Llama 4 Maverick | Heavy reasoning, frontier-grade tool use | 1M tokens | MoE, 400B total / 17B active |
| Llama 4 Scout | Long-context research, multi-doc agents | 10M tokens | MoE, 109B total / 17B active |
| Llama 3.3 70B | General agent default, strong instruction following | 128K tokens | Dense, runs on 2×A100 80GB |
| Llama 3.2 11B Vision | Multimodal chat with image input | 128K tokens | Runs on one consumer GPU |
| Llama 3.1 8B | Local self-host, low VRAM, fast | 128K tokens | Runs on 8GB VRAM (q4) |
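The hardware notes in the lineup follow from a rough rule of thumb: weight memory is about parameter count times bits per weight, divided by 8. Here is a minimal sketch of that arithmetic, assuming the approximation holds (it ignores KV cache, activations, and runtime overhead, so real requirements are higher). `estimate_weights_gb` is a hypothetical helper name, not a Hermes command.

```shell
# Rough weight-memory estimate in GB.
# $1 = parameter count in billions, $2 = bits per weight.
# Assumption: ignores KV cache, activations, and runtime overhead.
estimate_weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

estimate_weights_gb 70 16   # Llama 3.3 70B at 16-bit precision
estimate_weights_gb 8 4     # Llama 3.1 8B at 4-bit quantization
```

The 70B result (about 140 GB) is why that model is listed with two 80GB A100s, and the 8B result (about 4 GB) leaves headroom for context on an 8GB card.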

Option 1: Hermes Agent on OpenClaw Launch (Easiest)

  1. Go to openclawlaunch.com/hermes-hosting and start a Hermes deploy.
  2. Select Llama 4 Maverick (or any other Llama variant) from the model dropdown.
  3. Connect Telegram, Discord, WhatsApp, or another channel.
  4. Click Deploy. Your Llama-powered Hermes Agent is live in roughly 30 seconds.
Tip: OpenClaw Launch routes Llama requests through OpenRouter, which auto-selects the cheapest healthy provider behind the scenes. AI credits are included.

Option 2: Self-Hosted Llama via Ollama

Ollama is the easiest way to run Llama locally. Install Ollama, pull a model, and point Hermes at the local endpoint.

# Install Ollama (one-line install on macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a Llama model
ollama pull llama3.3:70b

# Tell Hermes to use Ollama
export OLLAMA_HOST=http://127.0.0.1:11434
hermes inference set ollama
hermes model set llama3.3:70b

For local-only deploys, see also Hermes Agent + Ollama for the full walkthrough including GPU sizing and memory tuning.
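Before switching Hermes over, it can help to confirm that Ollama is actually listening. Here is a minimal sketch, assuming Ollama's REST endpoint `/api/tags` (which lists pulled models) and the `OLLAMA_HOST` value used in the commands for this option. `ollama_tags_url` and `check_ollama` are hypothetical helper names, not part of Hermes or Ollama.

```shell
# Build the URL of Ollama's model-listing endpoint from OLLAMA_HOST.
# Falls back to the default local address when the variable is unset.
ollama_tags_url() {
  host=${OLLAMA_HOST:-http://127.0.0.1:11434}
  printf '%s/api/tags\n' "${host%/}"
}

# Return 0 if the endpoint answers with a successful HTTP status.
check_ollama() {
  curl -fsS --max-time 5 "$(ollama_tags_url)" >/dev/null 2>&1
}

if check_ollama; then
  echo "Ollama is reachable at $(ollama_tags_url)"
else
  echo "Ollama is not reachable; is the Ollama server running?" >&2
fi
```

If the check fails, the Hermes commands would likely fail the same way, so fix the Ollama side first.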

Option 3: Self-Hosted Llama via vLLM (Production)

vLLM is the production-grade serving engine for Llama. Use it when you need throughput, batch inference, or OpenAI-compatible HTTP for multiple clients.

# Run Llama 3.3 70B on vLLM across 2x A100 80GB
# (16-bit weights need ~140 GB; a single 80GB H100 requires a quantized checkpoint)
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000 --tensor-parallel-size 2

# Point Hermes at the vLLM endpoint (OpenAI-compatible)
export OPENAI_API_BASE=http://127.0.0.1:8000/v1
export OPENAI_API_KEY=local
hermes inference set openai
hermes model set meta-llama/Llama-3.3-70B-Instruct
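A quick way to confirm the endpoint is up is to list the models it serves. Here is a minimal sketch, assuming the standard OpenAI-compatible `/v1/models` route that vLLM exposes and the environment variables set in this option. `models_url` and `list_served_models` are hypothetical helper names.

```shell
# Build the models-listing URL from OPENAI_API_BASE.
# Falls back to the local vLLM address used in this guide.
models_url() {
  base=${OPENAI_API_BASE:-http://127.0.0.1:8000/v1}
  printf '%s/models\n' "${base%/}"
}

# Print the JSON list of served models, or fail if the server is down.
list_served_models() {
  curl -fsS --max-time 5 \
    -H "Authorization: Bearer ${OPENAI_API_KEY:-local}" \
    "$(models_url)"
}

list_served_models || echo "No response from $(models_url)" >&2
```

The `id` field in the response is the name the server expects in requests. By default vLLM uses the model path passed to `vllm serve`, so it should match the value given to `hermes model set`.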

Option 4: Hosted Llama via OpenRouter, Groq, or Together AI

If you don't want to manage GPUs, hosted Llama is competitive with frontier closed models on cost. Providers such as Groq and Cerebras also specialize in very fast inference, which helps for latency-sensitive chat agents.

# OpenRouter — one key, auto-routed to cheapest provider
export OPENROUTER_API_KEY=sk-or-...
hermes inference set openrouter
hermes model set meta-llama/llama-4-maverick

# Groq — ~500 tokens/sec on Llama 3.3 70B
export GROQ_API_KEY=gsk_...
hermes inference set groq
hermes model set llama-3.3-70b-versatile
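A missing or unexported API key is a common cause of failed hosted setups. Here is a minimal sketch of a pre-flight check, assuming the provider reads its key from the environment as in the commands for this option. `require_env` is a hypothetical helper, not a Hermes command.

```shell
# Fail early if a required variable is not exported to child processes.
# printenv only sees exported variables, which is what a child CLI sees too.
require_env() {
  if [ -z "$(printenv "$1")" ]; then
    echo "Missing or unexported environment variable: $1" >&2
    return 1
  fi
}

require_env OPENROUTER_API_KEY || echo "Export OPENROUTER_API_KEY before selecting openrouter" >&2
```

The same function works for `GROQ_API_KEY` or any other provider key.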

When to Choose Llama over Closed Models

Choose Llama when open weights matter: regulated industries that need on-prem deployment, research workflows where reproducibility requires the same weights tomorrow, or cost-sensitive high-volume bots where serving your own model is cheaper than per-token API spend at scale.
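The cost argument comes down to a break-even volume: the monthly token count at which a fixed self-hosting bill equals per-token API spend. Here is a minimal sketch of that arithmetic. The inputs are placeholders for illustration only, not real quotes, and `break_even_mtok` is a hypothetical helper name.

```shell
# Break-even volume in millions of tokens per month.
# $1 = fixed self-hosting cost in USD per month
# $2 = API price in US cents per million tokens
break_even_mtok() {
  echo $(( $1 * 100 / $2 ))
}

# Placeholder inputs (assumptions, not real prices):
# 3000 USD/month of GPU cost, 60 cents per million tokens via an API.
break_even_mtok 3000 60
```

With these placeholder numbers the break-even is 5000 million tokens per month. Below that volume the hosted API is cheaper, and above it self-hosting starts to pay off, before accounting for operations and staffing costs.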

Choose Llama when data residency matters: with a self-hosted model, prompts and responses are processed entirely on your own infrastructure. With Hermes + self-hosted Llama, you can run a fully air-gapped agent.

Deploy Hermes with Llama

Run Hermes on Meta Llama — self-hosted or managed — from one dashboard.
