Guide
How to Use NVIDIA Nemotron 3 Ultra with Hermes Agent
NVIDIA's Nemotron 3 Ultra is a mixture-of-experts open model built for long-running agentic workflows — up to 1 million tokens of context, strong multi-step reasoning, and reliable tool use. This guide shows you how to wire it into Hermes Agent via OpenRouter or run it fully offline with Ollama.
What Is NVIDIA Nemotron 3 Ultra?
Nemotron 3 Ultra is an open-weights model from NVIDIA optimized for agentic use cases. Key specs:
- Mixture-of-experts architecture — roughly 550 B total parameters with ~55 B active per forward pass, so inference cost is much lower than a dense model of the same nominal size.
- Up to 1 million tokens of context window, making it practical for tasks that require reading large codebases, long documents, or extended conversation histories without truncation.
- Strong performance on tool-use benchmarks and multi-step reasoning chains, which is exactly what an autonomous agent like Hermes needs to complete complex goals reliably.
- Available under a permissive open license, so you can run it locally or via hosted APIs without usage restrictions tied to a commercial agreement.
Because Hermes Agent is designed to be model-agnostic — routing your chosen model through providers like OpenRouter — swapping in Nemotron 3 Ultra is mostly a config change rather than any code modification.
Managed Hermes on OpenClaw Launch (Easiest Path)
If you're using managed Hermes hosting on OpenClaw Launch, you don't need to edit any config files directly. The model picker in the dashboard handles everything:
- Open your dashboard and click your Hermes instance.
- Go to the Models tab and search for Nemotron in the model list.
- Select Nemotron 3 Ultra and save. Hermes hot-reloads the model selection — no restart needed.
If you want to supply your own OpenRouter API key (BYOK) to access Nemotron on your own quota, add it under Settings → API Keys in the same dashboard. Your key is encrypted at rest and never logged.
No Hermes instance yet? Deploy one in about 30 seconds at OpenClaw Launch → Hermes Hosting. The platform pre-configures OpenRouter routing so you can start testing Nemotron immediately.
Self-Hosted Hermes via OpenRouter
If you're running Hermes yourself, the recommended way to access Nemotron 3 Ultra is through OpenRouter, which hosts the model and provides an OpenAI-compatible endpoint.
1. Get an OpenRouter API key
Create a free account at openrouter.ai and copy your API key from the dashboard. Nemotron 3 Ultra has a free tier with rate limits — see the cost section below for details.
2. Find the current model slug
Model slugs on OpenRouter occasionally change between versions. Search for “Nemotron” on the OpenRouter models page and copy the exact slug shown (it will look something like nvidia/llama-3.1-nemotron-ultra-253b-v1). Always verify the slug directly on OpenRouter rather than relying on any guide — slugs are versioned and the one shown here may be outdated by the time you read this.
3. Configure Hermes
Edit your Hermes config (typically ~/.hermes/hermes.json or the bind-mounted /opt/data/.env depending on your deployment) to set OpenRouter as the provider and Nemotron as the primary model:
# In your Hermes environment or config
OPENROUTER_API_KEY=sk-or-v1-...
# In hermes.json agents.defaults section (representative — verify slug on openrouter.ai/models)
{
"agents": {
"defaults": {
"model": {
"primary": "openrouter/nvidia/llama-3.1-nemotron-ultra-253b-v1"
}
}
}
}The openrouter/ prefix tells Hermes to route the request through OpenRouter's API rather than hitting the provider directly.
4. Restart the gateway
# If running via Docker
docker restart your-hermes-container
# If running via PM2
pm2 reload hermesSend a test message to your bot. The first response may be slower as the model warms up on the OpenRouter side.
Running Nemotron 3 Ultra Locally via Ollama
For fully offline operation or when you want to avoid API costs, you can run Nemotron 3 Ultra locally using Ollama. Note that the full MoE model requires significant VRAM (or CPU RAM for quantized versions) — check the Ollama model page for hardware requirements before pulling.
1. Pull the model
ollama pull nemotron-3-ultraThis downloads the quantized weights. The download may be several tens of gigabytes depending on the quant level available.
2. Verify it runs
ollama run nemotron-3-ultra "Hello, what can you do?"3. Point Hermes at your local Ollama instance
# In your Hermes config
{
"agents": {
"defaults": {
"model": {
"primary": "ollama/nemotron-3-ultra"
}
}
},
"models": {
"providers": {
"ollama": {
"baseUrl": "http://localhost:11434"
}
}
}
}If Hermes is running in Docker and Ollama is on the host machine, replace localhost with host.docker.internal (Mac/Windows) or the host's Docker bridge IP (Linux, typically 172.17.0.1).
Free Tier and Cost Notes
Nemotron 3 Ultra is available on OpenRouter's free tier, which means you can test it without a paid plan. Free-tier access typically comes with:
- Rate limits on requests per minute and tokens per day.
- Potential queue delays during peak hours, since free requests are lower priority than paid ones.
- No SLA guarantees for uptime or latency.
For production agents or high-volume workflows, add credits to your OpenRouter account and the rate limits lift substantially. Because Nemotron is an open model, the per-token cost on OpenRouter is typically lower than frontier closed models — check the current pricing on the model page since rates are adjusted periodically.
Running locally via Ollama is effectively free after hardware costs, but you trade API convenience for setup complexity and the need to have a machine powerful enough to serve the model at acceptable speeds.
Nemotron vs Other Models for Hermes
Choosing a model for Hermes depends on your use case. Here's a rough comparison to help orient your decision:
- Nemotron 3 Ultra — Best for long-context agentic tasks (reading large codebases, multi-document synthesis, extended reasoning chains). Open weights, MoE efficiency. Use when you need maximum context and reliable tool use without a per-token premium.
- Claude Sonnet / Opus (via Anthropic BYOK) — Best for nuanced instruction following and safety-sensitive applications. Closed model, higher per-token cost but strong instruction adherence.
- OpenRouter free-tier models — Good for prototyping and low-volume bots where cost is the primary constraint.
- Local Ollama models (see Hermes + Ollama guide) — Best for privacy-sensitive workloads or environments without internet access.
Nemotron 3 Ultra sits in a sweet spot: open, efficient (MoE keeps inference cost low), very long context, and purpose-built for the kind of multi-step tool-calling that Hermes was designed around.
Troubleshooting
- Model slug not found — OpenRouter renames models on major version bumps. Search for “Nemotron” on openrouter.ai/models and update your config with the current slug.
- Rate-limit errors on free tier — Add credits to your OpenRouter account or reduce the concurrency in your Hermes agent settings.
- Ollama connection refused from Docker — On Linux, replace
localhostwith the Docker bridge IP (172.17.0.1). On Mac or Windows Desktop, usehost.docker.internal. - Very slow first response — Expected for large MoE models, especially on CPU-offloaded Ollama. The model loads layers into memory on the first call; subsequent calls are faster.
- Hermes ignoring the model change — Some config fields require a gateway restart to apply. Restart your container or PM2 process after editing the primary model field.
Further Reading
- What is Hermes Agent? — overview of the framework and its capabilities.
- Deploy Hermes Agent — full self-hosting walkthrough.
- Hermes + OpenRouter — detailed BYOK OpenRouter setup guide.
- Hermes + Ollama — running Hermes entirely offline with local models.
- Hermes Agent Memory — how Hermes stores and retrieves long-term context across conversations.