Gemma 4 12B is Google's open-source 12-billion-parameter language model released June 3, 2026. It features a unified multimodal decoder, 256K-token context window, and Multi-Token Prediction for lower latency. Licensed Apache 2.0.

Guide

Run Hermes Agent with Gemma 4 12B

Q: How much VRAM does Gemma 4 12B need?

At Q4_K_M quantization (Ollama default), Gemma 4 12B uses approximately 6.7 GB VRAM and downloads about 7.4 GB. At 8-bit it requires ~13.4 GB; at BF16 (full precision) ~26.7 GB. Google recommends 16 GB VRAM or unified memory for laptop use.

Q: Is Gemma 4 12B free to use?

Yes. Gemma 4 is released under the Apache 2.0 license. You can download, run, modify, and build commercial products on top of it at no cost. Running locally via Ollama incurs only electricity and hardware costs.

Q: Can Gemma 4 12B run a full Hermes Agent workflow?

Yes. Gemma 4 12B supports tool-calling via its instruct tuning, and its 256K context window handles long multi-step agentic workflows. Point Hermes at Ollama running gemma4:12b using the OpenAI-compatible endpoint at localhost:11434/v1.

Q: How do I run Gemma 4 locally with Ollama?

Install Ollama from ollama.com/download, then run "ollama run gemma4:12b". Ollama downloads the Q4_K_M model (~7.4 GB) and starts an OpenAI-compatible server on localhost:11434. Point Hermes at that endpoint as your provider base URL.

Gemma 4 12B is the hottest new open-source model of 2026 — multimodal, 256K context, and fast enough to run on a laptop GPU. Connect it to Hermes Agent via Ollama for a fully local AI agent that costs nothing per token and keeps every conversation on your hardware.

Why Gemma 4 12B Is Exploding Right Now

Google released Gemma 4 on June 3, 2026. Within days, “gemma 4 12b” became the single hottest rising query in Google Trends — up over 12,000% week-over-week. The reason is straightforward: it punches far above its weight class.

The 12B variant scores ~77.2% on MMLU Pro and ~78.8% on GPQA Diamond — beating the previous-generation Gemma 3 27B on both benchmarks while fitting on hardware that most developers already own. It is the first model in the Gemma family to use a unified multimodal decoder, meaning it natively understands text, image, and audio inputs without a bolted-on adapter. And it ships under Apache 2.0 — no usage restrictions, no licensing fees, fully open.

For Hermes Agent users, this matters because Hermes already exposes an OpenAI-compatible provider endpoint. Any model Ollama can serve, Hermes can use — including Gemma 4 12B the day it drops.

Gemma 4 Family at a Glance

Choose the variant that fits your hardware. The 12B is the sweet spot for local agent use; the MoE 26B A4B model is worth trying if you want near-12B quality at lower active VRAM cost.

Variant	Parameters	Context	VRAM (Q4)	Best For
E2B / E4B	2–4B (edge)	128K	1–3 GB	Mobile, edge devices
12B	11.9B + vision	256K	6.7 GB (Q4) / 13.4 GB (8-bit) / 26.7 GB (BF16)	Laptop-class local agent
26B A4B (MoE)	26B total, 4B active	256K	~8 GB (Q4)	Low-VRAM, near-12B quality
31B dense	31B	256K	~18 GB (Q4)	Flagship, best accuracy

Google recommends 16 GB VRAM or unified memory for the 12B in laptop use. In practice, Q4_K_M fits in 8 GB with careful VRAM management — see the quantization table below. The OpenClaw + Gemma 4 guide covers the OpenClaw-specific config if you run that framework instead.

What You Need

Ollama installed (macOS, Linux, or Windows) — the local model server that exposes an OpenAI-compatible endpoint at http://localhost:11434
At least 8 GB VRAM or 16 GB unified memory (Apple Silicon) for the 12B at Q4_K_M
Hermes Agent installed — see the What is Hermes Agent? and Deploy Hermes Agent guides if you are starting from scratch

Step 1: Pull Gemma 4 12B with Ollama

Open a terminal and run:

ollama run gemma4:12b

Ollama downloads the Q4_K_M quantization by default — approximately 7.4 GB. This command also starts an interactive session so you can verify the model responds before wiring it into Hermes. Press Ctrl+D to exit.

If you want a specific quantization, append the tag:

ollama pull gemma4:12b-q8_0   # higher quality, ~13 GB
ollama pull gemma4:12b-fp16   # full precision, ~24 GB

Verify the model is loaded and the server is running by visiting http://localhost:11434 in your browser — it returns a short status response when Ollama is up.

Step 2: Point Hermes at Your Ollama Endpoint

Hermes Agent works against any OpenAI-compatible base URL. Ollama exposes exactly this interface at /v1. In your Hermes config.yaml, set the provider base URL to your Ollama server:

# In your Hermes config.yaml — provider section
# Point the OpenAI-compatible base URL at Ollama:
#   http://localhost:11434/v1           (Hermes on the same host as Ollama)
#   http://host.docker.internal:11434/v1  (Hermes in Docker on Mac/Windows)
#
# Then set the model name to match what you pulled:
#   gemma4:12b
#
# See the upstream Hermes README for the exact field names in your version:
# https://github.com/NousResearch/hermes-agent

The specific config field names depend on your Hermes version. Look in the upstream README for the provider configuration section — the option that sets a custom OpenAI API base URL is what you point at http://localhost:11434/v1. Set the model name to gemma4:12b (the tag must match what ollama list shows).

After saving the config, restart Hermes. Config changes require a container restart — Hermes does not hot-reload the provider base URL.

Step 3: Test the Connection

With Hermes restarted, open the Hermes dashboard or send a message on whichever platform you have connected (Telegram, Discord, etc.). You should see a response within a few seconds — local inference on a laptop GPU typically produces 15–40 tokens per second at Q4_K_M.

If the agent is unresponsive, check two things: first, that Ollama is running (ollama list should show gemma4:12b); second, the Docker networking section below if Hermes runs inside a container.

Docker Networking: localhost vs. the Container

When Hermes runs inside a Docker container but Ollama runs on your host machine, localhost resolves to the container's own network — not your host. Use these alternatives:

macOS and Windows — Docker Desktop provides host.docker.internal. Use http://host.docker.internal:11434/v1 as your base URL.
Linux — Either start the Hermes container with --network host (then localhost:11434 works), or use the host's LAN IP. Find it with ip route get 1 | awk '{print $7}'.

Also confirm Ollama is listening beyond loopback. By default it binds to 127.0.0.1 only. To allow Docker container connections, set OLLAMA_HOST=0.0.0.0 before starting Ollama (or export it in your shell profile).

Quantization and VRAM Requirements

Gemma 4 12B is 11.9B parameters plus roughly 550M in the vision encoder. Here is what each quantization costs in VRAM:

Quantization	VRAM / Disk Size	Notes
BF16 (full precision)	26.7 GB	Highest quality; needs A100 / H100
8-bit	13.4 GB	Good quality; RTX 3090, 4090 or M3 Max 40GB
Q4_K_M (Ollama default)	~7.4 GB download / ~6.7 GB VRAM	Recommended for 8–16 GB VRAM
Q4_0	6.7 GB	Smallest 12B option; fits 8 GB GPUs with care

Q4_K_M is the practical default for most laptop-class hardware. Apple Silicon users on M3 Pro (18 GB) or M3 Max (36 GB+) can run BF16 comfortably — unified memory counts as VRAM on Apple Silicon.

Gemma 4 12B as an Agent Model

Gemma 4 12B introduces Multi-Token Prediction (MTP) drafters, which reduce generation latency by predicting multiple tokens per forward pass. For agentic workflows — where the model is invoked many times per task for planning, tool calling, and synthesizing results — this matters. Fewer round trips and faster token generation translate directly to shorter task completion times.

The model's 256K-token context window is also unusually large for a 12B class model. This means Hermes can include long session histories, large file contents, or multi-turn conversation memory without hitting context limits — a common pain point with smaller local models.

Multimodal support is native: Gemma 4 12B uses an encoder-free unified decoder architecture (the first in the Gemma family) that projects image and audio inputs directly into the token stream. If you connect Hermes to channels that send images or voice messages, Gemma 4 12B can process them without an external vision adapter. Note: video support is unconfirmed — do not assume it.

For tool-calling specifically, check the Hermes upstream repo for any Gemma-specific notes. If you see Hermes describing actions in prose instead of executing tools, the model may not be reliably emitting structured function calls — try gemma4:12b-instruct if a separate instruct variant is available in the Ollama library, as instruct-tuned versions typically have better tool-call adherence.

Gemma 4 12B vs. 26B: Which Should You Run?

The 26B A4B is a Mixture-of-Experts model: 26B total parameters, but only 4B are active per forward pass. This makes it faster and lighter than a dense 12B sounds — VRAM use at Q4 is comparable to the 12B or slightly lower, while reasoning quality is closer to a 26B dense model.

If your hardware can run the 12B comfortably, it is worth testing the 26B A4B once Ollama adds the tag — the active-parameter cost is similar but the effective model capacity is higher. The 12B remains the safer starting point because it is already widely available and benchmarked.

For comparison with the OpenClaw framework, see the OpenClaw + Gemma 4 guide, which covers the same model family from the OpenClaw configuration side.

Performance Tips

GPU offload — Ollama auto-detects your GPU. On NVIDIA, install the CUDA toolkit first. On Apple Silicon, Metal offload is automatic. Verify GPU use with nvidia-smi (NVIDIA) or Activity Monitor → GPU History (Mac).
Keep model in memory — Ollama unloads models after idle timeout. Set OLLAMA_KEEP_ALIVE=-1 to hold Gemma 4 in memory permanently, eliminating the ~10-second reload between agent sessions.
Parallelism — Agentic workflows often fire multiple tool calls concurrently. Set OLLAMA_NUM_PARALLEL=2 if your VRAM budget allows it; otherwise requests queue and add latency between tool steps.
Context length — Gemma 4 supports 256K tokens but Ollama's default context window is smaller. Set OLLAMA_CTX=65536 (or higher) to unlock longer contexts, at the cost of additional VRAM per active session.

Local vs. Hosted: When Your Laptop Can't Stay Online 24/7

	Local: Ollama + self-hosted Hermes	Hosted: OpenClaw Launch + cloud model
Cost	Free (electricity + hardware)	Subscription + per-token API costs — see pricing
Data privacy	Complete — stays on your hardware	Encrypted at rest, routed via API
Uptime	Only while your machine is on	24/7 managed
Model choice	Any model Ollama supports, incl. Gemma 4	Claude, GPT, Gemini, Gemma via OpenRouter
Setup time	20–40 min (install + download)	Under 2 min (visual configurator)
Maintenance	You manage updates, restarts, Docker	Zero ops — fully managed

The main limitation of local Gemma 4 + Hermes is that your agent goes offline when your laptop does. If you want your Telegram or Discord bot to respond at 3 AM while your machine is asleep, a managed instance on OpenClaw Launch is the practical path — no GPU required, and you can bring your own API key for any cloud model including Gemma 4 via OpenRouter. See the Hermes Agent BYOK guide for details.

Frequently Asked Questions

What is Gemma 4 12B?

Gemma 4 12B is Google's open-source 12-billion-parameter language model released June 3, 2026. It features a unified multimodal decoder (text, image, and audio inputs), a 256K-token context window, and Multi-Token Prediction for lower latency. It is licensed Apache 2.0 — free to download, modify, and use commercially.

How much VRAM does Gemma 4 12B need?

At Q4_K_M quantization (Ollama's default), the model uses approximately 6.7 GB VRAM and downloads about 7.4 GB. At 8-bit it requires ~13.4 GB; at BF16 (full precision) ~26.7 GB. Google recommends 16 GB VRAM or unified memory for laptop-class use, but Q4_K_M fits on 8 GB GPUs with careful management.

Is Gemma 4 12B free to use?

Yes. Gemma 4 is released under the Apache 2.0 license, which means you can download, run, modify, and build commercial products on top of it at no cost. Running it locally via Ollama incurs only the cost of electricity and hardware.

Can Gemma 4 12B run a full Hermes Agent workflow?

Yes, with the caveat that tool-calling reliability depends on how well the instruct tuning follows the function-calling schema Hermes uses. Gemma 4 12B's instruction tuning is strong, and the large context window (256K tokens) makes it well-suited for multi-step agentic tasks. If you encounter issues with tool execution, verify you are using the instruct variant and check the Hermes upstream docs for any model-specific notes.

How does Gemma 4 12B compare to the 26B variant?

The 26B A4B is a Mixture-of-Experts model with only 4B active parameters per forward pass, making it fast and VRAM-efficient despite its larger total parameter count. It delivers reasoning quality closer to a 26B dense model while using similar VRAM to the 12B at equivalent quantization. The 12B remains the safer starting point because it is more widely benchmarked and available today; try the 26B A4B once Ollama publishes a stable tag.

How do I run Gemma 4 locally with Ollama?

Install Ollama from ollama.com/download, then run ollama run gemma4:12b in your terminal. Ollama downloads the Q4_K_M model (~7.4 GB) and starts an OpenAI-compatible server on localhost:11434. Point Hermes at that endpoint as described in the steps above.