← Home

Guide

Hermes Agent + LM Studio: Run Hermes on Fully Local Models

LM Studio is the friendliest way to run open-weight LLMs on your own machine. With its OpenAI-compatible local server, Hermes Agent can run end-to-end without ever calling an external API — ideal for sensitive data, offline use, or just keeping costs at zero.

What Is LM Studio?

LM Studio is a desktop app that downloads, runs, and serves open-weight LLMs on macOS, Windows, and Linux. It bundles a model browser, a chat UI, and most importantly an OpenAI-compatible HTTP server that any tool — including Hermes Agent — can call as a drop-in replacement for the OpenAI API.

LM Studio handles the hard parts of local inference: GPU offload, quantization, flash attention, and KV cache management. Most modern Macs with 16+ GB unified memory can run 8–14B models comfortably; a 70B model needs 64+ GB or a beefy Linux box with two consumer GPUs.

Why Pair Hermes with LM Studio?

  • Full privacy. Every message stays on your machine. No API key, no telemetry.
  • Zero per-token cost. Once the model is downloaded, inference is free forever.
  • Offline operation. Run Hermes on a laptop with no internet.
  • Easy model swapping. Switch from Llama to Qwen to DeepSeek in a click; Hermes's /model command picks them up.

Step 1: Install LM Studio and Download a Model

  1. Download LM Studio from lmstudio.ai and install it.
  2. Open LM Studio, click the search icon, and download a model. Good starter picks for Hermes:
    • Llama-3.3-70B-Instruct (q4_K_M) — best general agent quality, needs ~40 GB RAM
    • Qwen3.5-32B-Instruct (q4_K_M) — strong tool use, ~18 GB RAM
    • Llama-3.1-8B-Instruct (q4_K_M) — fast, ~5 GB RAM
    • DeepSeek-Coder-V3 (q4_K_M) — coding-focused agents
  3. Once downloaded, click Local Server, load your model, and click Start Server. The default endpoint is http://127.0.0.1:1234/v1.

Step 2: Point Hermes Agent at LM Studio

LM Studio's local server is OpenAI-compatible, so configure Hermes's OpenAI provider with a custom base URL:

# Hermes accepts OPENAI_API_BASE and OPENAI_API_KEY
export OPENAI_API_BASE=http://127.0.0.1:1234/v1
export OPENAI_API_KEY=lm-studio   # any non-empty string works

# Set Hermes to use OpenAI-compatible mode
hermes inference set openai
hermes model set llama-3.3-70b-instruct

# Or configure /opt/data/config.yaml directly:
# inference:
#   provider: openai
#   base_url: http://127.0.0.1:1234/v1
# model:
#   default: llama-3.3-70b-instruct

The model name should match what LM Studio shows in its server panel — copy it verbatim.

Step 3: Verify with a Quick Chat

# From Hermes CLI:
hermes chat "Say hi and confirm you are running locally."

If Hermes replies and LM Studio's server panel shows the request, you're wired up correctly. Now connect a channel (Telegram, Discord, WhatsApp) and your local model is reachable from a phone.

Hardware Sizing Quick Reference

HardwareComfortable Model SizeNotes
M2 / M3 MacBook (16 GB)3–8B q4Llama 3.1 8B at ~20 tokens/sec
M3 / M4 Pro Mac (36 GB)14–32B q4Qwen3.5 32B at ~15 tokens/sec
M3 Max / M4 Max (64+ GB)70B q4Llama 3.3 70B at ~8 tokens/sec
1× RTX 4090 (24 GB)14–32B q4Codestral 22B fits comfortably
2× RTX 4090 (48 GB total)70B q4Llama 3.3 70B at ~30 tokens/sec

Switching Models at Runtime

Load multiple models in LM Studio and serve any of them. Hermes's /model command switches between them without restart:

/model llama-3.3-70b-instruct
/model qwen3.5-32b-instruct
/model deepseek-coder-v3

LM Studio vs Ollama vs vLLM

LM StudioOllamavLLM
UIFull desktop GUICLICLI
Best forPersonal, easiestHeadless servers, scriptingProduction GPU throughput
PlatformsMac, Windows, LinuxMac, Windows, LinuxLinux (CUDA / ROCm)
OpenAI-compatibleYesYesYes

What's Next?

Prefer Managed?

If you don't want to run inference locally, OpenClaw Launch hosts Hermes with bundled AI credits.

Deploy Managed Hermes