Can I use NVIDIA Nemotron 3 Ultra with Hermes Agent for free?

Yes. Nemotron 3 Ultra is available on OpenRouter's free tier, which lets you run it inside Hermes Agent without a paid plan. Free-tier access comes with rate limits and lower priority during peak hours. For production use, adding credits to your OpenRouter account removes most rate constraints. You can also run Nemotron locally for free using Ollama if you have hardware with sufficient VRAM or RAM.

What makes Nemotron 3 Ultra good for agentic AI tasks?

Nemotron 3 Ultra uses a mixture-of-experts (MoE) architecture with roughly 550B total parameters and ~55B active per inference, so it delivers strong reasoning at lower compute cost than a dense model of the same size. It supports up to 1 million tokens of context, handles multi-step tool-use chains reliably, and is released as open weights — all properties that align well with how Hermes Agent orchestrates long-running autonomous tasks.

How do I find the correct Nemotron model slug for OpenRouter?

Go to openrouter.ai/models and search for "Nemotron". Copy the exact slug shown next to the model you want (e.g. nvidia/llama-3.1-nemotron-ultra-253b-v1). Slugs are versioned and can change between OpenRouter releases, so always verify directly on the OpenRouter site rather than copying from a guide.

← Home

Guide

How to Use NVIDIA Nemotron 3 Ultra with Hermes Agent

NVIDIA's Nemotron 3 Ultra is a mixture-of-experts open model built for long-running agentic workflows — up to 1 million tokens of context, strong multi-step reasoning, and reliable tool use. This guide shows you how to wire it into Hermes Agent via OpenRouter or run it fully offline with Ollama.

What Is NVIDIA Nemotron 3 Ultra?

Nemotron 3 Ultra is an open-weights model from NVIDIA optimized for agentic use cases. Key specs:

Mixture-of-experts architecture — roughly 550 B total parameters with ~55 B active per forward pass, so inference cost is much lower than a dense model of the same nominal size.
Up to 1 million tokens of context window, making it practical for tasks that require reading large codebases, long documents, or extended conversation histories without truncation.
Strong performance on tool-use benchmarks and multi-step reasoning chains, which is exactly what an autonomous agent like Hermes needs to complete complex goals reliably.
Available under a permissive open license, so you can run it locally or via hosted APIs without usage restrictions tied to a commercial agreement.

Because Hermes Agent is designed to be model-agnostic — routing your chosen model through providers like OpenRouter — swapping in Nemotron 3 Ultra is mostly a config change rather than any code modification.

Managed Hermes on OpenClaw Launch (Easiest Path)

If you're using managed Hermes hosting on OpenClaw Launch, you don't need to edit any config files directly. The model picker in the dashboard handles everything:

Open your dashboard and click your Hermes instance.
Go to the Models tab and search for Nemotron in the model list.
Select Nemotron 3 Ultra and save. Hermes hot-reloads the model selection — no restart needed.

If you want to supply your own OpenRouter API key (BYOK) to access Nemotron on your own quota, add it under Settings → API Keys in the same dashboard. Your key is encrypted at rest and never logged.

No Hermes instance yet? Deploy one in about 30 seconds at OpenClaw Launch → Hermes Hosting. The platform pre-configures OpenRouter routing so you can start testing Nemotron immediately.

Self-Hosted Hermes via OpenRouter

If you're running Hermes yourself, the recommended way to access Nemotron 3 Ultra is through OpenRouter, which hosts the model and provides an OpenAI-compatible endpoint.

1. Get an OpenRouter API key

Create a free account at openrouter.ai and copy your API key from the dashboard. Nemotron 3 Ultra has a free tier with rate limits — see the cost section below for details.

2. Find the current model slug

Model slugs on OpenRouter occasionally change between versions. Search for “Nemotron” on the OpenRouter models page and copy the exact slug shown (it will look something like nvidia/llama-3.1-nemotron-ultra-253b-v1). Always verify the slug directly on OpenRouter rather than relying on any guide — slugs are versioned and the one shown here may be outdated by the time you read this.

3. Configure Hermes

Edit your Hermes config (typically ~/.hermes/hermes.json or the bind-mounted /opt/data/.env depending on your deployment) to set OpenRouter as the provider and Nemotron as the primary model:

# In your Hermes environment or config
OPENROUTER_API_KEY=sk-or-v1-...

# In hermes.json agents.defaults section (representative — verify slug on openrouter.ai/models)
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "openrouter/nvidia/llama-3.1-nemotron-ultra-253b-v1"
      }
    }
  }
}

The openrouter/ prefix tells Hermes to route the request through OpenRouter's API rather than hitting the provider directly.

4. Restart the gateway

# If running via Docker
docker restart your-hermes-container

# If running via PM2
pm2 reload hermes

Send a test message to your bot. The first response may be slower as the model warms up on the OpenRouter side.

Running Nemotron 3 Ultra Locally via Ollama

For fully offline operation or when you want to avoid API costs, you can run Nemotron 3 Ultra locally using Ollama. Note that the full MoE model requires significant VRAM (or CPU RAM for quantized versions) — check the Ollama model page for hardware requirements before pulling.

1. Pull the model

ollama pull nemotron-3-ultra

This downloads the quantized weights. The download may be several tens of gigabytes depending on the quant level available.

2. Verify it runs

ollama run nemotron-3-ultra "Hello, what can you do?"

3. Point Hermes at your local Ollama instance

# In your Hermes config
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/nemotron-3-ultra"
      }
    }
  },
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434"
      }
    }
  }
}

If Hermes is running in Docker and Ollama is on the host machine, replace localhost with host.docker.internal (Mac/Windows) or the host's Docker bridge IP (Linux, typically 172.17.0.1).

Free Tier and Cost Notes

Nemotron 3 Ultra is available on OpenRouter's free tier, which means you can test it without a paid plan. Free-tier access typically comes with:

Rate limits on requests per minute and tokens per day.
Potential queue delays during peak hours, since free requests are lower priority than paid ones.
No SLA guarantees for uptime or latency.

For production agents or high-volume workflows, add credits to your OpenRouter account and the rate limits lift substantially. Because Nemotron is an open model, the per-token cost on OpenRouter is typically lower than frontier closed models — check the current pricing on the model page since rates are adjusted periodically.

Running locally via Ollama is effectively free after hardware costs, but you trade API convenience for setup complexity and the need to have a machine powerful enough to serve the model at acceptable speeds.

Nemotron vs Other Models for Hermes

Choosing a model for Hermes depends on your use case. Here's a rough comparison to help orient your decision:

Nemotron 3 Ultra — Best for long-context agentic tasks (reading large codebases, multi-document synthesis, extended reasoning chains). Open weights, MoE efficiency. Use when you need maximum context and reliable tool use without a per-token premium.
Claude Sonnet / Opus (via Anthropic BYOK) — Best for nuanced instruction following and safety-sensitive applications. Closed model, higher per-token cost but strong instruction adherence.
OpenRouter free-tier models — Good for prototyping and low-volume bots where cost is the primary constraint.
Local Ollama models (see Hermes + Ollama guide) — Best for privacy-sensitive workloads or environments without internet access.

Nemotron 3 Ultra sits in a sweet spot: open, efficient (MoE keeps inference cost low), very long context, and purpose-built for the kind of multi-step tool-calling that Hermes was designed around.

Troubleshooting

Model slug not found — OpenRouter renames models on major version bumps. Search for “Nemotron” on openrouter.ai/models and update your config with the current slug.
Rate-limit errors on free tier — Add credits to your OpenRouter account or reduce the concurrency in your Hermes agent settings.
Ollama connection refused from Docker — On Linux, replace localhost with the Docker bridge IP (172.17.0.1). On Mac or Windows Desktop, use host.docker.internal.
Very slow first response — Expected for large MoE models, especially on CPU-offloaded Ollama. The model loads layers into memory on the first call; subsequent calls are faster.
Hermes ignoring the model change — Some config fields require a gateway restart to apply. Restart your container or PM2 process after editing the primary model field.