Guide
Hermes Agent + LM Studio: Run Hermes on Fully Local Models
LM Studio is the friendliest way to run open-weight LLMs on your own machine. With its OpenAI-compatible local server, Hermes Agent can run end-to-end without ever calling an external API — ideal for sensitive data, offline use, or just keeping costs at zero.
What Is LM Studio?
LM Studio is a desktop app that downloads, runs, and serves open-weight LLMs on macOS, Windows, and Linux. It bundles a model browser, a chat UI, and most importantly an OpenAI-compatible HTTP server that any tool — including Hermes Agent — can call as a drop-in replacement for the OpenAI API.
LM Studio handles the hard parts of local inference: GPU offload, quantization, flash attention, and KV cache management. Most modern Macs with 16+ GB unified memory can run 8–14B models comfortably; a 70B model needs 64+ GB or a beefy Linux box with two consumer GPUs.
Why Pair Hermes with LM Studio?
- Full privacy. Every message stays on your machine. No API key, no telemetry.
- Zero per-token cost. Once the model is downloaded, inference is free forever.
- Offline operation. Run Hermes on a laptop with no internet.
- Easy model swapping. Switch from Llama to Qwen to DeepSeek in a click; Hermes's
/modelcommand picks them up.
Step 1: Install LM Studio and Download a Model
- Download LM Studio from lmstudio.ai and install it.
- Open LM Studio, click the search icon, and download a model. Good starter picks for Hermes:
- Llama-3.3-70B-Instruct (q4_K_M) — best general agent quality, needs ~40 GB RAM
- Qwen3.5-32B-Instruct (q4_K_M) — strong tool use, ~18 GB RAM
- Llama-3.1-8B-Instruct (q4_K_M) — fast, ~5 GB RAM
- DeepSeek-Coder-V3 (q4_K_M) — coding-focused agents
- Once downloaded, click Local Server, load your model, and click Start Server. The default endpoint is
http://127.0.0.1:1234/v1.
Step 2: Point Hermes Agent at LM Studio
LM Studio's local server is OpenAI-compatible, so configure Hermes's OpenAI provider with a custom base URL:
# Hermes accepts OPENAI_API_BASE and OPENAI_API_KEY
export OPENAI_API_BASE=http://127.0.0.1:1234/v1
export OPENAI_API_KEY=lm-studio # any non-empty string works
# Set Hermes to use OpenAI-compatible mode
hermes inference set openai
hermes model set llama-3.3-70b-instruct
# Or configure /opt/data/config.yaml directly:
# inference:
# provider: openai
# base_url: http://127.0.0.1:1234/v1
# model:
# default: llama-3.3-70b-instructThe model name should match what LM Studio shows in its server panel — copy it verbatim.
Step 3: Verify with a Quick Chat
# From Hermes CLI:
hermes chat "Say hi and confirm you are running locally."If Hermes replies and LM Studio's server panel shows the request, you're wired up correctly. Now connect a channel (Telegram, Discord, WhatsApp) and your local model is reachable from a phone.
Hardware Sizing Quick Reference
| Hardware | Comfortable Model Size | Notes |
|---|---|---|
| M2 / M3 MacBook (16 GB) | 3–8B q4 | Llama 3.1 8B at ~20 tokens/sec |
| M3 / M4 Pro Mac (36 GB) | 14–32B q4 | Qwen3.5 32B at ~15 tokens/sec |
| M3 Max / M4 Max (64+ GB) | 70B q4 | Llama 3.3 70B at ~8 tokens/sec |
| 1× RTX 4090 (24 GB) | 14–32B q4 | Codestral 22B fits comfortably |
| 2× RTX 4090 (48 GB total) | 70B q4 | Llama 3.3 70B at ~30 tokens/sec |
Switching Models at Runtime
Load multiple models in LM Studio and serve any of them. Hermes's /model command switches between them without restart:
/model llama-3.3-70b-instruct
/model qwen3.5-32b-instruct
/model deepseek-coder-v3LM Studio vs Ollama vs vLLM
| LM Studio | Ollama | vLLM | |
|---|---|---|---|
| UI | Full desktop GUI | CLI | CLI |
| Best for | Personal, easiest | Headless servers, scripting | Production GPU throughput |
| Platforms | Mac, Windows, Linux | Mac, Windows, Linux | Linux (CUDA / ROCm) |
| OpenAI-compatible | Yes | Yes | Yes |
What's Next?
- Hermes Agent + Ollama — CLI-only local inference
- Hermes Agent + vLLM — Production-grade GPU serving
- Hermes Agent + Llama — Meta's open-weight family
- Hermes Agent + Mistral — Mixtral / Codestral for local use