Run Hermes Agent with Ollama

Hermes Agent works against any OpenAI-compatible local model server — including Ollama. Keep your conversations on-device, pay nothing per token, and choose any open-weights model your hardware can run.

Why Use Ollama with Hermes

Ollama runs large language models locally and exposes them through an OpenAI-compatible HTTP endpoint. Hermes Agent can target that endpoint as its model backend, giving you a fully local agent stack:

  • Data never leaves your machine — Prompts, tool call results, and session memory stay on your hardware. No cloud API sees your conversations.
  • Zero token costs — Local inference is free to run. No per-token billing, no rate limits imposed by an external service.
  • Works offline — Once a model is pulled, it runs without an internet connection (Hermes itself may still need connectivity for web search or other external tools you enable).
  • Full Hermes capability — You keep Hermes's complete feature set: Telegram, Discord, Slack, and other platform integrations; web search; shell execution; image generation (FAL.ai); approval policies; cron jobs; and the analytics dashboard.

Requirements

  • Ollama installed and running (default endpoint: http://localhost:11434)
  • At least one model pulled — for example ollama pull qwen3:32b or ollama pull llama3.2
  • Hermes Agent installed — see the Hermes install guide if you haven't done this yet
  • Enough RAM or VRAM for the model you choose (see the table below)

Step 1: Install and Start Ollama

Download Ollama from ollama.com/download for macOS, Linux, or Windows. After installation, start the server:

ollama serve

On macOS and Windows, Ollama typically starts automatically as a background service after install. Then pull a model:

ollama pull qwen3:32b

Verify the server is up by visiting http://localhost:11434 in your browser — it should return a short status response.
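If you prefer the command line, a quick probe against Ollama's `/api/tags` endpoint (which lists pulled models) does the same check. This is a sketch assuming the default port 11434:

```shell
# Probe the local Ollama server (assumes the default port 11434).
# --max-time keeps the check fast if nothing is listening.
if curl -fsS --max-time 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
  OLLAMA_UP=1
  echo "Ollama is up; pulled models:"
  curl -fsS http://localhost:11434/api/tags
else
  OLLAMA_UP=0
  echo "Ollama not reachable on :11434 - start it with 'ollama serve'"
fi
```

If the probe fails, make sure `ollama serve` is running and nothing else is bound to port 11434.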

Step 2: Point Hermes at Your Ollama Endpoint

Hermes Agent accepts any OpenAI-compatible base URL as its model provider endpoint. Ollama exposes exactly this interface at /v1. In your Hermes config.yaml, set the provider base URL to your Ollama server:

# In your Hermes config.yaml — provider section
# Use the base URL that Ollama exposes (OpenAI-compatible):
#   http://localhost:11434/v1        (when Hermes is on the same host as Ollama)
#   http://host.docker.internal:11434/v1  (when Hermes runs in Docker on Mac/Windows)
#
# Refer to the upstream Hermes README for the exact field names in your version:
# https://github.com/NousResearch/hermes-agent

The specific config field names depend on the Hermes version you have installed. Check the upstream README for the provider configuration section. Look for an option that sets a custom OpenAI API base URL — this is what you point at http://localhost:11434/v1.

After saving the config, restart Hermes to apply the change (Hermes does not hot-reload config — a container restart is required).
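Before restarting Hermes, you can smoke-test the exact endpoint it will call. This sketch sends a minimal OpenAI-style chat completion to Ollama's `/v1` interface; adjust the model name to one you actually pulled (`qwen3:32b` is assumed here):

```shell
# One-shot smoke test of the OpenAI-compatible endpoint Hermes will use.
BASE_URL="http://localhost:11434/v1"
BODY='{"model":"qwen3:32b","messages":[{"role":"user","content":"Say hello"}]}'
if curl -fsS --max-time 120 "$BASE_URL/chat/completions" \
     -H "Content-Type: application/json" -d "$BODY" >/dev/null 2>&1; then
  ENDPOINT_OK=1
  echo "chat/completions answered - Hermes can use this base URL"
else
  ENDPOINT_OK=0
  echo "no answer - check that Ollama is running and the model is pulled"
fi
```

If this succeeds but Hermes still cannot reach the model, the problem is almost always the base URL in your config (see the Docker networking section below).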

Step 3: Pick a Good Model for Agent Use

Not all local models are equal for agentic workflows. Agents rely on tool calling — the model must reliably emit structured function calls when needed rather than describing planned actions in prose. Models fine-tuned for tool use do this dependably; models trained primarily for chat often fail silently by ignoring tool schemas.

Here are open-weights models with solid tool-calling track records that Ollama can run:

Model               Parameters   VRAM Needed   Notable Strength
Qwen3 32B           32B          20 GB         Reliable tool-calling, multilingual
Llama 3.3 70B       70B          40 GB         Strong all-round agent performance
Llama 3.2 8B        8B           5 GB          Fast and lightweight
Mistral Small 3.1   24B          14 GB         Solid reasoning, low VRAM cost
DeepSeek R1 14B     14B          9 GB          Strong coding and structured output

If you are on Apple Silicon, unified memory counts toward VRAM — an M3 Max with 64 GB can run 70B models comfortably. On Linux with an NVIDIA GPU, check available VRAM with nvidia-smi before pulling a large model.
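As a rough rule of thumb (an approximation, not an exact figure): a Q4-quantized model needs about half a byte per weight, plus roughly 20% overhead for the KV cache and runtime. A quick back-of-envelope check:

```shell
# Back-of-envelope VRAM estimate for a Q4-quantized model.
# Assumption: ~0.5 bytes per weight at Q4, plus ~20% overhead.
PARAMS_B=32                             # model size in billions, e.g. qwen3:32b
EST_GB=$(( PARAMS_B * 5 * 12 / 100 ))   # 0.5 bytes/weight * 1.2, integer math
echo "~${EST_GB} GB VRAM for a ${PARAMS_B}B model at Q4"
```

For a 32B model this lands near the 20 GB figure in the table above; higher-precision quantizations need proportionally more.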

Docker Networking Gotcha

When Hermes runs inside a Docker container and Ollama runs on the host machine, localhost resolves to the container's own network namespace — not the host. Use these alternatives depending on your OS:

  • macOS and Windows — Docker Desktop provides the special hostname host.docker.internal. Use http://host.docker.internal:11434/v1 as your base URL.
  • Linux — Either start the Hermes container with --network host (then localhost:11434 works as expected), or use the host's LAN IP address (e.g. http://192.168.1.x:11434/v1). Find the LAN IP with ip route get 1 | awk '{print $7}'.
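The decision above can be sketched as a small shell snippet (assumes `--network host` on Linux and Docker Desktop elsewhere):

```shell
# Pick the Ollama base URL for a Hermes container based on the host OS.
case "$(uname -s)" in
  Linux) OLLAMA_HOST_NAME="localhost" ;;             # with --network host
  *)     OLLAMA_HOST_NAME="host.docker.internal" ;;  # Docker Desktop (macOS/Windows)
esac
OLLAMA_BASE_URL="http://${OLLAMA_HOST_NAME}:11434/v1"
echo "$OLLAMA_BASE_URL"
```

Whichever value it prints is what belongs in your Hermes provider base URL.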

Also make sure Ollama is listening on the right interface. By default it binds to 127.0.0.1 only. To allow connections from Docker containers, set the environment variable OLLAMA_HOST=0.0.0.0 before starting Ollama, or export it in your shell profile.
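For example, to rebind Ollama for the current shell session (the systemd lines are for the Linux service install):

```shell
# Make Ollama listen on all interfaces so Docker containers can reach it.
export OLLAMA_HOST=0.0.0.0       # default bind is loopback (127.0.0.1) only
echo "OLLAMA_HOST=$OLLAMA_HOST"  # restart 'ollama serve' after setting this
# On Linux with the systemd service, persist it instead:
#   sudo systemctl edit ollama    # add: [Service] Environment="OLLAMA_HOST=0.0.0.0"
#   sudo systemctl restart ollama
```

Note that binding to 0.0.0.0 exposes Ollama to your whole LAN — pair it with a firewall rule if other machines share the network.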

Performance Tips

  • GPU offload — Ollama automatically uses your GPU if drivers are installed. For NVIDIA, install the CUDA toolkit and verify with ollama run <model> — you should see GPU utilization in nvidia-smi. Apple Silicon offloads via Metal automatically.
  • Quantization — Larger quantizations (Q8, F16) give better quality but need more VRAM. Smaller ones (Q4_K_M, Q4_0) fit tighter hardware with a modest quality trade-off. Ollama model tags like qwen3:32b-q4_K_M select a specific quantization.
  • Concurrency — Ollama processes one request at a time by default. If multiple Hermes tools fire in parallel (which is common in agentic workflows), requests queue. Set OLLAMA_NUM_PARALLEL to allow concurrent generation if your VRAM budget allows it.
  • Keep model loaded — Ollama unloads models after an idle timeout. Set OLLAMA_KEEP_ALIVE=-1 to keep the model in memory indefinitely, which eliminates reload latency between agent turns.
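The last two tips are plain environment variables, set before `ollama serve` starts. A minimal sketch (the value 4 is an example — size it to your VRAM):

```shell
# Environment knobs for agent workloads - set before starting 'ollama serve'.
export OLLAMA_NUM_PARALLEL=4     # concurrent generations (each costs extra VRAM)
export OLLAMA_KEEP_ALIVE=-1      # keep the model loaded indefinitely
echo "parallel=$OLLAMA_NUM_PARALLEL keep_alive=$OLLAMA_KEEP_ALIVE"
```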

OpenClaw Launch with a Remote Ollama

If you want Hermes hosted and managed — without running your own server — but still want to back it with your own local models, you can point an OpenClaw Launch Hermes instance at a publicly-reachable Ollama endpoint. This requires:

  • Your Ollama server exposed over HTTPS with a valid TLS certificate (a self-signed cert will be rejected by most HTTP clients).
  • An authentication layer in front of Ollama — the default Ollama server has no authentication, so anyone who can reach the URL can make requests.
  • Sufficient upstream bandwidth, since model inference tokens now travel over the internet between the managed instance and your machine.

For most users, the simpler path is either fully local (self-host Hermes + Ollama on the same machine) or fully managed (OpenClaw Launch with a cloud model provider). The hybrid approach is possible but adds operational overhead.

Local vs. Managed: How They Compare

                Local Ollama + self-hosted Hermes       OpenClaw Launch + hosted models
Cost            Free (electricity + hardware)           From $3/mo + per-token API costs
Data privacy    Complete — stays on your hardware       Encrypted at rest, routed via API
Latency         Depends on your GPU                     Fast cloud inference
Maintenance     You manage updates, restarts, Docker    Fully managed — zero ops
Model choice    Any model Ollama supports               Claude, GPT, Gemini, and others via OpenRouter

Frequently Asked Questions

Does Hermes natively support Ollama as a named provider?

Hermes Agent is designed to work with any OpenAI-compatible endpoint, which is the interface Ollama exposes at /v1. Rather than a dedicated “Ollama” provider name, you configure the custom base URL for your local server. Check the upstream README for the exact field in your installed version.

Will tool calling work with local models?

It depends on the model. Models fine-tuned for tool use — such as recent Qwen3, Llama 3.x, and Mistral Small variants — support the OpenAI function-calling schema that Hermes uses. Models trained primarily for chat often ignore tool schemas. If you see Hermes describing actions in prose instead of executing tools, try a different model.

Can I switch between Ollama and a cloud model without reinstalling?

Yes. Changing the provider base URL in your Hermes config and restarting the container is all that's needed. You can maintain separate config files for your local and cloud setups and swap them as needed.
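One convenient way to keep both setups is a symlink to the active config. The file names here are illustrative, not required by Hermes:

```shell
# Keep two configs and symlink whichever one is active (hypothetical names).
ln -sf config.ollama.yaml config.yaml    # activate the local-Ollama config
# ln -sf config.cloud.yaml config.yaml   # ...or swap in the cloud config
ls -l config.yaml
```

Restart Hermes after swapping, as noted above.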

What if Ollama is slow?

The most common cause is CPU-only inference — check that Ollama is using your GPU. On Linux with NVIDIA, install the CUDA toolkit and the Ollama Linux package (which includes the CUDA backend). Also consider a smaller or more aggressively quantized model — a Q4_K_M 14B model running on a GPU is usually faster for agent use than a Q8 70B model running on CPU.
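A quick way to confirm where inference is running is `ollama ps`, whose PROCESSOR column shows the GPU/CPU split for each loaded model. A guarded sketch:

```shell
# Quick check: is the loaded model running on GPU or CPU?
if command -v ollama >/dev/null 2>&1; then
  HAVE_OLLAMA=1
  ollama ps    # PROCESSOR column shows the GPU/CPU split per loaded model
else
  HAVE_OLLAMA=0
  echo "ollama CLI not on PATH"
fi
```

Anything less than "100% GPU" means part of the model spilled to system RAM — pick a smaller model or a tighter quantization.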

Does this work with Hermes hosted on OpenClaw Launch?

OpenClaw Launch managed instances use cloud model providers by default. You can point a managed instance at a public Ollama endpoint, but you must handle TLS and authentication yourself (see the section above). For full local privacy, self-host both Hermes and Ollama on your own machine.

Related Guides

Try Hermes Agent on OpenClaw Launch

No GPU needed. Deploy a managed Hermes instance in seconds with leading cloud models — full agent capability, zero infrastructure to manage.

Deploy Now