Guide
Run Hermes Agent with Ollama
Hermes Agent works against any OpenAI-compatible local model server — including Ollama. Keep your conversations on-device, pay nothing per token, and choose any open-weights model your hardware can run.
Why Use Ollama with Hermes
Ollama runs large language models locally and exposes them through an OpenAI-compatible HTTP endpoint. Hermes Agent can target that endpoint as its model backend, giving you a fully local agent stack:
- Data never leaves your machine — Prompts, tool call results, and session memory stay on your hardware. No cloud API sees your conversations.
- Zero token costs — Local inference is free to run. No per-token billing, no rate limits imposed by an external service.
- Works offline — Once a model is pulled, it runs without an internet connection (Hermes itself may still need connectivity for web search or other external tools you enable).
- Full Hermes capability — You keep Hermes's complete feature set: Telegram, Discord, Slack, and other platform integrations; web search; shell execution; image generation (FAL.ai); approval policies; cron jobs; and the analytics dashboard.
Requirements
- Ollama installed and running (default endpoint: http://localhost:11434)
- At least one model pulled — for example ollama pull qwen3:32b or ollama pull llama3.2
- Hermes Agent installed — see the Hermes install guide if you haven't done this yet
- Enough RAM or VRAM for the model you choose (see the table below)
Step 1: Install and Start Ollama
Download Ollama from ollama.com/download for macOS, Linux, or Windows. After installation, start the server:
ollama serve
On macOS and Windows, Ollama typically starts automatically as a background service after install. Then pull a model:
ollama pull qwen3:32b
Verify the server is up by visiting http://localhost:11434 in your browser — it should return a short status response.
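If you'd rather script that check, a small stdlib-only Python probe (assuming the default port) reports whether the endpoint answers:

```python
import urllib.request
import urllib.error

def ollama_status(base_url: str = "http://localhost:11434") -> str:
    """Return a short status line for a local Ollama server."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            body = resp.read(64).decode(errors="replace")
            return f"up: {body}"  # Ollama's root path answers with a short status text
    except (urllib.error.URLError, OSError) as exc:
        return f"not reachable: {exc}"

print(ollama_status())
```

The same probe works from inside a Docker container if you swap in the base URL discussed in the networking section below.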
Step 2: Point Hermes at Your Ollama Endpoint
Hermes Agent accepts any OpenAI-compatible base URL as its model provider endpoint. Ollama exposes exactly this interface at /v1. In your Hermes config.yaml, set the provider base URL to your Ollama server:
# In your Hermes config.yaml — provider section
# Use the base URL that Ollama exposes (OpenAI-compatible):
#   http://localhost:11434/v1             (when Hermes is on the same host as Ollama)
#   http://host.docker.internal:11434/v1  (when Hermes runs in Docker on Mac/Windows)
#
# Refer to the upstream Hermes README for the exact field names in your version:
# https://github.com/NousResearch/hermes-agent
The specific config field names depend on the Hermes version you have installed. Check the upstream README for the provider configuration section. Look for an option that sets a custom OpenAI API base URL — this is what you point at http://localhost:11434/v1.
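To see what "OpenAI-compatible" means concretely, here is a minimal sketch of the kind of request any OpenAI-compatible client (Hermes included) sends to that base URL. The model name and prompt are placeholders; only the URL shape and JSON structure matter:

```python
import json
import urllib.request
import urllib.error

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible root

def chat_payload(model: str, prompt: str) -> dict:
    # The standard OpenAI chat-completions request body
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]
    except (urllib.error.URLError, OSError) as exc:
        return f"request failed: {exc}"

if __name__ == "__main__":
    print(chat("qwen3:32b", "Say hello in one word."))
```

If this request succeeds from the machine (or container) where Hermes runs, Hermes will be able to reach the same endpoint.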
After saving the config, restart Hermes to apply the change (Hermes does not hot-reload config — a container restart is required).
Step 3: Pick a Good Model for Agent Use
Not all local models are equal for agentic workflows. Agents rely on tool calling: the model must reliably emit structured function calls when needed, rather than explaining in prose what it would do. Models fine-tuned for tool use handle this well; models trained primarily for chat often fail silently by ignoring tool schemas.
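The difference is easy to spot in the response shape. Below is a generic OpenAI-style tool definition (a hypothetical get_weather tool, not an actual Hermes tool) and a check for whether the model answered with a structured call:

```python
import json

# A generic OpenAI-style tool definition of the kind an agent
# attaches to each request (hypothetical example tool).
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def model_used_tool(response: dict) -> bool:
    """A tool-capable model returns a structured tool_calls entry;
    a chat-only model tends to answer in prose instead."""
    message = response["choices"][0]["message"]
    return bool(message.get("tool_calls"))

# Shape of a correct tool-calling response:
good = {"choices": [{"message": {"tool_calls": [
    {"function": {"name": "get_weather",
                  "arguments": json.dumps({"city": "Berlin"})}}]}}]}
print(model_used_tool(good))  # → True
```

A model that prints "I would check the weather for you" instead of emitting tool_calls is the failure mode to watch for.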
Here are open-weights models with solid tool-calling track records that Ollama can run:
| Model | Parameters | VRAM Needed | Notable Strength |
|---|---|---|---|
| Qwen3 32B | 32B | 20 GB | Reliable tool-calling, multilingual |
| Llama 3.3 70B | 70B | 40 GB | Strong all-round agent performance |
| Llama 3.2 8B | 8B | 5 GB | Fast and lightweight |
| Mistral Small 3.1 | 24B | 14 GB | Solid reasoning, low VRAM cost |
| DeepSeek R1 14B | 14B | 9 GB | Strong coding and structured output |
If you are on Apple Silicon, unified memory counts toward VRAM — an M3 Max with 64 GB can run 70B models comfortably. On Linux with an NVIDIA GPU, check available VRAM with nvidia-smi before pulling a large model.
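The VRAM figures above can be roughly reconstructed from parameter count and quantization width. This sketch assumes about 4.8 bits per weight for a Q4_K_M quant and a 20% overhead factor for KV cache and activations; both numbers are assumptions, and real usage varies with context length:

```python
def vram_gb(params_billion: float, bits_per_weight: float,
            overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache
    and activations (overhead factor is an assumption)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# A Q4_K_M quant is roughly 4.5-5 bits per weight:
print(vram_gb(32, 4.8))  # Qwen3 32B — lands near the 20 GB in the table
print(vram_gb(8, 4.8))   # Llama 3.2 8B
```

Treat the output as a sanity check before pulling a model, not a guarantee that it fits.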
Docker Networking Gotcha
When Hermes runs inside a Docker container and Ollama runs on the host machine, localhost resolves to the container's own network namespace — not the host. Use these alternatives depending on your OS:
- macOS and Windows — Docker Desktop provides the special hostname host.docker.internal. Use http://host.docker.internal:11434/v1 as your base URL.
- Linux — Either start the Hermes container with --network host (then localhost:11434 works as expected), or use the host's LAN IP address (e.g. http://192.168.1.x:11434/v1). Find the LAN IP with ip route get 1 | awk '{print $7}'.
Also make sure Ollama is listening on the right interface. By default it binds to 127.0.0.1 only. To allow connections from Docker containers, set the environment variable OLLAMA_HOST=0.0.0.0 before starting Ollama, or export it in your shell profile.
Performance Tips
- GPU offload — Ollama automatically uses your GPU if drivers are installed. For NVIDIA, install the CUDA toolkit and verify with ollama run <model> — you should see GPU utilization in nvidia-smi. Apple Silicon offloads via Metal automatically.
- Quantization — Larger quantizations (Q8, F16) give better quality but need more VRAM. Smaller ones (Q4_K_M, Q4_0) fit tighter hardware with a modest quality trade-off. Ollama model tags like qwen3:32b-q4_K_M select a specific quantization.
- Concurrency — Ollama processes one request at a time by default. If multiple Hermes tools fire in parallel (which is common in agentic workflows), requests queue. Set OLLAMA_NUM_PARALLEL to allow concurrent generation if your VRAM budget allows it.
- Keep model loaded — Ollama unloads models after an idle timeout. Set OLLAMA_KEEP_ALIVE=-1 to keep the model in memory indefinitely, which eliminates reload latency between agent turns.
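Taken together, an environment for an agent workload might look like the fragment below, set before starting ollama serve. The specific values are illustrative assumptions, not tuned recommendations — adjust OLLAMA_NUM_PARALLEL to your VRAM budget:

```shell
# Illustrative settings for an agent workload — adjust to your hardware
export OLLAMA_HOST=0.0.0.0        # listen on all interfaces (Docker access)
export OLLAMA_NUM_PARALLEL=2      # allow two concurrent generations
export OLLAMA_KEEP_ALIVE=-1       # keep the model loaded indefinitely
```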
OpenClaw Launch with a Remote Ollama
If you want Hermes hosted and managed — without running your own server — but still want to back it with your own local models, you can point an OpenClaw Launch Hermes instance at a publicly-reachable Ollama endpoint. This requires:
- Your Ollama server exposed over HTTPS with a valid TLS certificate (a self-signed cert will be rejected by most HTTP clients).
- An authentication layer in front of Ollama — the default Ollama server has no authentication, so anyone who can reach the URL can make requests.
- Sufficient upstream bandwidth, since model inference tokens now travel over the internet between the managed instance and your machine.
For most users, the simpler path is either fully local (self-host Hermes + Ollama on the same machine) or fully managed (OpenClaw Launch with a cloud model provider). The hybrid approach is possible but adds operational overhead.
Local vs. Managed: How They Compare
| | Local Ollama + self-hosted Hermes | OpenClaw Launch + hosted models |
|---|---|---|
| Cost | Free (electricity + hardware) | From $3/mo + per-token API costs |
| Data privacy | Complete — stays on your hardware | Encrypted at rest, routed via API |
| Latency | Depends on your GPU | Fast cloud inference |
| Maintenance | You manage updates, restarts, Docker | Fully managed — zero ops |
| Model choice | Any model Ollama supports | Claude, GPT, Gemini, and others via OpenRouter |
Frequently Asked Questions
Does Hermes natively support Ollama as a named provider?
Hermes Agent is designed to work with any OpenAI-compatible endpoint, which is the interface Ollama exposes at /v1. Rather than a dedicated “Ollama” provider name, you configure the custom base URL for your local server. Check the upstream README for the exact field in your installed version.
Will tool calling work with local models?
It depends on the model. Models fine-tuned for tool use — such as recent Qwen3, Llama 3.x, and Mistral Small variants — support the OpenAI function-calling schema that Hermes uses. Models trained primarily for chat often ignore tool schemas. If you see Hermes describing actions in prose instead of executing tools, try a different model.
Can I switch between Ollama and a cloud model without reinstalling?
Yes. Changing the provider base URL in your Hermes config and restarting the container is all that's needed. You can maintain separate config files for your local and cloud setups and swap them as needed.
What if Ollama is slow?
The most common cause is CPU-only inference — check that Ollama is using your GPU. On Linux with NVIDIA, install the CUDA toolkit and the Ollama Linux package (which includes the CUDA backend). Also consider a smaller or more aggressively quantized model — a Q4_K_M 14B model running on a GPU is usually faster for agent use than a Q8 70B model running on CPU.
Does this work with Hermes hosted on OpenClaw Launch?
OpenClaw Launch managed instances use cloud model providers by default. You can point a managed instance at a public Ollama endpoint, but you must handle TLS and authentication yourself (see the section above). For full local privacy, self-host both Hermes and Ollama on your own machine.