Guide

Hermes Agent + LM Studio: Run Hermes on Fully Local Models

LM Studio is the friendliest way to run open-weight LLMs on your own machine. With its OpenAI-compatible local server, Hermes Agent can run end-to-end without ever calling an external API — ideal for sensitive data, offline use, or just keeping costs at zero.

What Is LM Studio?

LM Studio is a desktop app that downloads, runs, and serves open-weight LLMs on macOS, Windows, and Linux. It bundles a model browser, a chat UI, and most importantly an OpenAI-compatible HTTP server that any tool — including Hermes Agent — can call as a drop-in replacement for the OpenAI API.

LM Studio handles the hard parts of local inference: GPU offload, quantization, flash attention, and KV cache management. Most modern Macs with 16+ GB unified memory can run 8–14B models comfortably; a 70B model needs 64+ GB or a beefy Linux box with two consumer GPUs.

Why Pair Hermes with LM Studio?

Full privacy. Every message stays on your machine. No API key, no telemetry.
Zero per-token cost. Once the model is downloaded, inference is free forever.
Offline operation. Run Hermes on a laptop with no internet.
Easy model swapping. Switch from Llama to Qwen to DeepSeek in a click; Hermes's /model command picks them up.

Step 1: Install LM Studio and Download a Model

Download LM Studio from lmstudio.ai and install it.
Open LM Studio, click the search icon, and download a model. Good starter picks for Hermes:
- Llama-3.3-70B-Instruct (q4_K_M) — best general agent quality, needs ~40 GB RAM
- Qwen3.5-32B-Instruct (q4_K_M) — strong tool use, ~18 GB RAM
- Llama-3.1-8B-Instruct (q4_K_M) — fast, ~5 GB RAM
- DeepSeek-Coder-V3 (q4_K_M) — coding-focused agents
Once downloaded, click Local Server, load your model, and click Start Server. The default endpoint is http://127.0.0.1:1234/v1.

Step 2: Point Hermes Agent at LM Studio

LM Studio's local server is OpenAI-compatible, so configure Hermes's OpenAI provider with a custom base URL:

# Hermes accepts OPENAI_API_BASE and OPENAI_API_KEY
export OPENAI_API_BASE=http://127.0.0.1:1234/v1
export OPENAI_API_KEY=lm-studio   # any non-empty string works

# Set Hermes to use OpenAI-compatible mode
hermes inference set openai
hermes model set llama-3.3-70b-instruct

# Or configure /opt/data/config.yaml directly:
# inference:
#   provider: openai
#   base_url: http://127.0.0.1:1234/v1
# model:
#   default: llama-3.3-70b-instruct

The model name should match what LM Studio shows in its server panel — copy it verbatim.

Step 3: Verify with a Quick Chat

# From Hermes CLI:
hermes chat "Say hi and confirm you are running locally."

If Hermes replies and LM Studio's server panel shows the request, you're wired up correctly. Now connect a channel (Telegram, Discord, WhatsApp) and your local model is reachable from a phone.

Hardware Sizing Quick Reference

Hardware	Comfortable Model Size	Notes
M2 / M3 MacBook (16 GB)	3–8B q4	Llama 3.1 8B at ~20 tokens/sec
M3 / M4 Pro Mac (36 GB)	14–32B q4	Qwen3.5 32B at ~15 tokens/sec
M3 Max / M4 Max (64+ GB)	70B q4	Llama 3.3 70B at ~8 tokens/sec
1× RTX 4090 (24 GB)	14–32B q4	Codestral 22B fits comfortably
2× RTX 4090 (48 GB total)	70B q4	Llama 3.3 70B at ~30 tokens/sec

Switching Models at Runtime

Load multiple models in LM Studio and serve any of them. Hermes's /model command switches between them without restart:

/model llama-3.3-70b-instruct
/model qwen3.5-32b-instruct
/model deepseek-coder-v3

LM Studio vs Ollama vs vLLM

	LM Studio	Ollama	vLLM
UI	Full desktop GUI	CLI	CLI
Best for	Personal, easiest	Headless servers, scripting	Production GPU throughput
Platforms	Mac, Windows, Linux	Mac, Windows, Linux	Linux (CUDA / ROCm)
OpenAI-compatible	Yes	Yes	Yes

What's Next?

Hermes Agent + Ollama — CLI-only local inference
Hermes Agent + vLLM — Production-grade GPU serving
Hermes Agent + Llama — Meta's open-weight family
Hermes Agent + Mistral — Mixtral / Codestral for local use