OpenClaw + NVIDIA NIM: Use NVIDIA Inference Microservices with OpenClaw
NVIDIA NIM delivers GPU-accelerated, OpenAI-compatible inference for Llama 4, Nemotron, Mistral, and other top models. Configure it as a custom provider in OpenClaw and get NVIDIA-optimized throughput behind your AI agent.
What Is NVIDIA NIM?
NVIDIA NIM (Inference Microservices) is NVIDIA's managed inference platform that serves optimized LLMs via a fully OpenAI-compatible REST API. Instead of generic cloud inference, NIM routes every request through NVIDIA's TensorRT-LLM stack — purpose-built for NVIDIA GPUs — so you get higher throughput and lower latency than equivalent CPU or non-NVIDIA GPU backends.
NIM is available in two modes. The hosted cloud service at build.nvidia.com gives you a free API key with starter credits and no infrastructure to manage. The self-hosted path lets you run NIM containers on your own NVIDIA GPU servers for maximum control and data privacy. Either way, the API surface is identical: an OpenAI-compatible endpoint you point OpenClaw at.
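Because the surface is OpenAI-compatible, a plain chat-completions POST is all it takes to talk to either mode. A minimal Python sketch using only the standard library (the endpoint path and header shape follow the standard OpenAI convention; the key placeholder is yours to fill in):

```python
import json
import urllib.request

NIM_BASE = "https://integrate.api.nvidia.com/v1"

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request for the NIM endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{NIM_BASE}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("meta/llama-4-scout-17b-16e-instruct", "Hello!", "nvapi-...")
# urllib.request.urlopen(req) would send it; a valid key is required for a real response.
```

The same request works unchanged against a self-hosted NIM container if you swap NIM_BASE for your local endpoint.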
Models Available via NIM
The NIM catalog covers models from Meta, NVIDIA, Mistral, Microsoft, and others. All are served with NVIDIA-optimized inference:
| Model | Family | Context | Best For |
|---|---|---|---|
| Llama 4 Scout 17B | Meta | 128K | Fast reasoning, low latency |
| Llama 4 Maverick 17B | Meta | 128K | Multi-turn chat, coding |
| Llama 3.3 70B Instruct | Meta | 128K | Strong all-round performance |
| Nemotron-4 340B Instruct | NVIDIA | 4K | NVIDIA flagship reasoning model |
| Mistral 7B Instruct | Mistral AI | 32K | Lightweight and fast |
| Mistral Large 2 | Mistral AI | 128K | Strong multilingual and coding |
| Phi-3 Mini 128K | Microsoft | 128K | Compact, long-context capable |
The full catalog at build.nvidia.com/explore lists every available NIM model along with live API playground links and per-model pricing. New models are added regularly as NVIDIA certifies new NIM containers.
How to Deploy via OpenClaw Launch
If you want a fully managed setup, deploy via OpenClaw Launch with AI credits included. For NVIDIA NIM specifically, you will need to bring your own NIM API key, since NIM is not part of the default OpenRouter catalog.
- Get a NIM API key from build.nvidia.com. Sign in with an NVIDIA account and generate a key under “Get API Key.” Free credits are included on sign-up.
- Go to openclawlaunch.com and open the configurator.
- Select BYOK (Bring Your Own Key), choose Custom OpenAI-compatible provider, and enter:
  - Base URL: https://integrate.api.nvidia.com/v1
  - API key: your NIM key from build.nvidia.com
  - Model: meta/llama-4-scout-17b-16e-instruct or another NIM model ID
- Pick your chat platform (Telegram, Discord, WhatsApp, WeChat, or browser gateway).
- Click Deploy. Your NIM-powered agent is live in about 10 seconds.
NIM model IDs follow the format org/model-name, for example meta/llama-4-maverick-17b-128e-instruct or nvidia/nemotron-4-340b-instruct. Copy the exact ID from the model's page on build.nvidia.com to avoid typos.
Self-Hosted Configuration
For a self-hosted OpenClaw instance, add NVIDIA NIM as a custom OpenAI-compatible provider in your openclaw.json. The NIM hosted cloud endpoint is at https://integrate.api.nvidia.com/v1 — just override the base URL:
```json
{
  "models": {
    "providers": {
      "nvidia-nim": {
        "type": "openai-compatible",
        "apiBase": "https://integrate.api.nvidia.com/v1",
        "apiKey": "nvapi-..."
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "nvidia-nim/meta/llama-4-scout-17b-16e-instruct"
      }
    }
  }
}
```
Replace nvapi-... with your NIM API key from build.nvidia.com. Swap the model ID to any model shown in the table above — find the exact slug on the model's detail page at build.nvidia.com.
If you run NIM self-hosted on your own NVIDIA GPU server, replace the apiBase with your local NIM container's endpoint, typically http://localhost:8000/v1 or your server's LAN address. The rest of the config is identical.
Which NIM Model Should You Pick?
A straightforward heuristic based on workload:
- Llama 4 Scout 17B — Best default for low-latency chat and Telegram/Discord bots. Fast token generation, 128K context, and the efficiency of the Llama 4 architecture at a fraction of the cost of larger variants.
- Llama 4 Maverick 17B — Stronger multi-turn reasoning and coding than Scout. Good choice if your agent runs agentic tool loops or handles longer coding sessions.
- Llama 3.3 70B Instruct — The proven workhorse from the Llama 3 family. Excellent all-round performance and broad compatibility with system prompts and tool-call formats.
- Nemotron-4 340B — NVIDIA's own flagship reasoning model. Use it for complex analysis, synthesis, and tasks that benefit from a much larger parameter count. Note the shorter 4K context window.
- Mistral variants — Mistral 7B is the lightest and cheapest per token. Mistral Large 2 is the stronger multilingual and coding choice when you need French, German, Spanish, or other languages alongside English.
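If you want the heuristic above as a starting point in code, one way to encode it looks like this (the model IDs are the ones from the catalog table; the priority order is a judgment call, and the Mistral variants are omitted because their exact NIM slugs should be copied from build.nvidia.com):

```python
def pick_nim_model(heavy_reasoning: bool = False, agentic_coding: bool = False) -> str:
    """Map workload traits to a NIM model ID, mirroring the heuristic above."""
    if heavy_reasoning:
        # Largest model in the table, but note the short 4K context window.
        return "nvidia/nemotron-4-340b-instruct"
    if agentic_coding:
        # Stronger multi-turn reasoning and tool loops than Scout.
        return "meta/llama-4-maverick-17b-128e-instruct"
    # Default: fast, cheap, 128K context.
    return "meta/llama-4-scout-17b-16e-instruct"
```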
Pricing Notes
NVIDIA NIM on build.nvidia.com uses a free-credits-then-pay-per-token model. Every new account receives a set of free inference credits on sign-up — enough for meaningful exploration and prototyping. After credits run out, billing is per-token at rates that vary by model size.
For exact per-model pricing, check the pricing tab on each model's page at build.nvidia.com/explore. As a general rule, smaller models (Mistral 7B, Llama 4 Scout) are significantly cheaper per million tokens than large models (Nemotron-4 340B).
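Per-token billing makes back-of-envelope cost math simple: tokens divided by a million, times the per-million rate, summed over input and output. A sketch with hypothetical rates (real per-model prices are on each model's page at build.nvidia.com/explore):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      price_in_per_m: float, price_out_per_m: float) -> float:
    """Estimate per-token billing: (tokens / 1M) * price-per-million, input plus output."""
    return (input_tokens / 1_000_000) * price_in_per_m \
         + (output_tokens / 1_000_000) * price_out_per_m

# Hypothetical rates for illustration only.
cost = estimate_cost_usd(500_000, 100_000, price_in_per_m=0.20, price_out_per_m=0.60)
```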
Self-hosted NIM containers require an NVIDIA GPU and an NGC entitlement for the specific model you want to run. After the initial hardware and entitlement cost, per-token billing goes to zero — useful for high-volume production workloads.
FAQ
What is NVIDIA NIM?
NVIDIA NIM (Inference Microservices) is a platform that serves optimized LLMs via OpenAI-compatible REST APIs, backed by NVIDIA's TensorRT-LLM inference stack. It is available as a hosted cloud service at build.nvidia.com or as self-hosted containers on NVIDIA GPU servers.
Can OpenClaw use NVIDIA NIM models?
Yes. OpenClaw supports any OpenAI-compatible API endpoint. Configure a custom provider with apiBase: "https://integrate.api.nvidia.com/v1" and your NIM API key, then set the model to any NIM model ID such as meta/llama-4-scout-17b-16e-instruct.
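Since a typo in the model ID is the most common misconfiguration, a loose sanity check on the org/model-name shape can catch mistakes before they reach your config. A sketch (the pattern is a heuristic derived from the lowercase IDs in this guide, not an official NIM rule):

```python
import re

# Loose org/model-name check, e.g. "meta/llama-4-scout-17b-16e-instruct".
SLUG_RE = re.compile(r"[a-z0-9][a-z0-9._-]*/[a-z0-9][a-z0-9._-]*")

def is_valid_nim_slug(model_id: str) -> bool:
    """Cheap shape check for a NIM model ID; always copy the real ID from build.nvidia.com."""
    return SLUG_RE.fullmatch(model_id) is not None
```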
Is NVIDIA NIM free?
The hosted service at build.nvidia.com provides free starter credits on sign-up. After those credits are exhausted, usage is billed per token at rates listed on each model's page. Self-hosted NIM is free at inference time but requires compatible NVIDIA GPU hardware and an NGC entitlement.
How is NVIDIA NIM different from other OpenAI-compatible APIs?
NIM uses NVIDIA's TensorRT-LLM runtime, which is specifically optimized for NVIDIA GPU hardware. This typically delivers higher throughput and lower latency than generic inference backends for the same model, particularly at larger batch sizes. The API surface is standard OpenAI-compatible, so integration is identical to any other custom provider in OpenClaw.
Can I run NIM on my own GPU server instead of build.nvidia.com?
Yes. NVIDIA publishes NIM containers on the NGC registry. Pull the container for your target model, run it on a compatible NVIDIA GPU, and point OpenClaw's apiBase at your server's local endpoint instead of the hosted URL. The rest of the OpenClaw config stays the same.
What's Next?
- OpenClaw + Ollama — Run open-source models locally with zero API costs and full data privacy
- OpenClaw + DeepSeek V4 — 1M context, frontier reasoning, ultra-low cost via OpenRouter
- Compare all models — See how NIM models stack up against Claude, GPT, Gemini, and others
- Deploy Now — Get an AI agent running in 10 seconds on OpenClaw Launch