OpenClaw + DeepSeek V4 on Ollama: Run DeepSeek V4 Locally with OpenClaw
Run a quantized DeepSeek V4 on your own GPU through Ollama and point your OpenClaw agent at it. No OpenRouter, no per-token bill, full data control. Here's the full setup, including the hardware reality check.
Why Run DeepSeek V4 on Ollama?
The hosted route through OpenRouter and OpenClaw Launch is the fastest way to use DeepSeek V4 — one click and you're live. But there are good reasons to run it locally with Ollama instead:
- Sensitive data that can't leave your network
- High-volume bots where API spend would dominate the budget
- Air-gapped or on-prem deployments
- A GPU box you already own sitting idle
The trade-off is hardware: DeepSeek V4 Flash quantized to 4-bit needs roughly 24 GB+ of VRAM to run comfortably, and Pro is much heavier. If your hardware can't take it, stick with the hosted OpenRouter route.
Hardware Reality Check
| Variant | Recommended VRAM | Disk | Realistic on consumer GPU? |
|---|---|---|---|
| V4 Flash Q4 | 24–32 GB | ~25 GB | RTX 3090 / 4090 yes; 4080 tight |
| V4 Flash Q8 | 40–48 GB | ~45 GB | A6000 / dual 3090 only |
| V4 Pro Q4 | 140 GB+ | ~250 GB | Multi-A100 / H100 only |
Running V4 Pro at home is unrealistic for most setups. V4 Flash Q4 is the practical local target.
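Not sure where your card lands? On an NVIDIA box, nvidia-smi (installed with the driver) reports total and free VRAM:

```
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
# A 3090/4090 reports 24576 MiB total; Q4 Flash wants most of that free
```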
Step 1: Install Ollama and Pull DeepSeek V4 Flash
On macOS or Linux:

```
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull DeepSeek V4 Flash (Q4 quantization)
ollama pull deepseek-v4-flash
# Verify it loads
ollama run deepseek-v4-flash "hello"
```

On the first run, Ollama downloads the model (~25 GB for Q4 Flash). The second response should be near-instant on a 3090/4090.
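You can also confirm the pull landed by asking Ollama's HTTP API, which is the same API OpenClaw will talk to in Step 3:

```
curl -s http://localhost:11434/api/tags
# The JSON should list deepseek-v4-flash among the installed models
```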
Step 2: Expose Ollama to OpenClaw
By default, Ollama listens on 127.0.0.1:11434. If your OpenClaw container runs on the same host, that's reachable as http://host.docker.internal:11434 from inside the container (on macOS/Windows Docker Desktop) or via the host network on Linux.
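A quick way to confirm that path before touching any config (assuming curl is available inside your OpenClaw image; the container name is a placeholder):

```
docker exec <openclaw-container> curl -s http://host.docker.internal:11434/api/version
# A small JSON blob with Ollama's version means the container can reach the host
```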
To listen on all interfaces (so a remote OpenClaw can reach it), set:

```
# macOS
launchctl setenv OLLAMA_HOST 0.0.0.0
# Then restart Ollama
# Linux (systemd)
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
```
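After the restart, confirm the daemon is bound to all interfaces rather than just loopback; on Linux, ss (from iproute2) shows the listener:

```
ss -ltn | grep 11434
# Expect 0.0.0.0:11434 or *:11434 here, not 127.0.0.1:11434
```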
Step 3: Point OpenClaw at the Ollama Endpoint

In your `openclaw.json`, configure Ollama as the provider and set DeepSeek V4 Flash as the default model:

```
{
  "models": {
    "providers": {
      "ollama": {
        "baseURL": "http://host.docker.internal:11434"
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/deepseek-v4-flash"
      }
    }
  }
}
```

Replace the `baseURL` with the actual address of the Ollama box if it's on a different machine (e.g. http://192.168.1.50:11434).
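Before restarting anything, consider hitting the exact baseURL OpenClaw will use with a one-off generation request; the address below matches the LAN example above, so adjust it to yours:

```
curl -s http://192.168.1.50:11434/api/generate \
  -d '{"model": "deepseek-v4-flash", "prompt": "Say hi in one word.", "stream": false}'
# A JSON reply with a "response" field means OpenClaw will be able to reach it too
```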
Step 4: Restart and Test
Restart your OpenClaw agent (a SIGUSR1 reload is enough for model swaps), then send a test message via Telegram, Discord, or the gateway web chat. The first reply on a fresh load takes a few seconds while Ollama warms up; subsequent replies should be near-instant on a 3090/4090.
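To avoid that warm-up between messages, set `OLLAMA_KEEP_ALIVE=24h` in the Ollama environment so the model stays loaded between bot replies. On the Linux systemd setup from Step 2, that's one extra line in the same override (24h is a sensible starting value, not a requirement):

```
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
```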
Without a keep-alive, Ollama unloads the model after a few minutes of idle and the next reply triggers a 5–15 second cold start.

Hosted vs Local: Which Should You Pick?
- Pick hosted (OpenRouter via OpenClaw Launch) if you want zero ops, instant deploy, and access to V4 Pro at $1.74/$3.48 per 1M tokens. See the DeepSeek V4 on OpenClaw Launch guide.
- Pick Ollama local if you have the hardware (24 GB+ VRAM), want zero per-token cost, or need to keep data on-prem. V4 Flash is realistic locally; V4 Pro is not for most setups.
- Run both — default to hosted Pro for hard tasks, route everyday chat to local Flash via the `/model` command, as shown below.
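A sketch of that split, assuming the `/model` chat command accepts the same provider/model ids as openclaw.json (the hosted id is a placeholder for whatever your hosted setup registers):

```
/model ollama/deepseek-v4-flash   # everyday chat on local Flash
/model <hosted-pro-id>            # switch back when a task needs Pro
```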
What's Next?
- Hosted DeepSeek V4 setup — The one-click route via OpenRouter
- OpenClaw + Ollama — General Ollama setup guide for any model
- Ollama Cloud — Run hosted Ollama if you don't have a GPU
- Compare all models — See where DeepSeek V4 sits vs Claude, GPT, Gemini