
OpenClaw with Local LLM: The Complete Guide (Ollama, vLLM, LM Studio)

Keep your code on your network. Pay $0 in API fees. Run Llama 4, Qwen 3, or DeepSeek V3 locally and connect it to OpenClaw. Here's every method that works.

Clawctl Team

Product & Engineering


A startup founder messaged me last week:

"I love OpenClaw but I can't send proprietary code to Claude's servers. Legal will kill me."

Fair. Most enterprise policies prohibit sending source code to third-party AI providers. Healthcare can't send patient data. Finance can't send trading algorithms. Defense can't send anything.

But here's the thing: OpenClaw doesn't care where your LLM lives.

You can run Llama 4, Qwen 3, DeepSeek V3, or any OpenAI-compatible model on your own hardware—and connect it to OpenClaw in 5 minutes.

No API costs. No data leaving your network. Full agent capabilities.

This guide covers every method that works.

Why Local LLMs + OpenClaw?

Concern            | Cloud API                  | Local LLM
Data privacy       | Data leaves your network   | Stays on your hardware
API costs          | $0.015–0.06 per 1K tokens  | $0 after hardware
Rate limits        | Yes                        | None
Latency            | 500ms–2s                   | 50–200ms
Offline capability | No                         | Yes
Compliance         | Depends on vendor          | You control everything

For agents that touch sensitive data, local is often the only option.

Method 1: Ollama (Easiest)

Ollama is the Docker of LLMs. One command to install, one command to run.

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Pull a model:

# Fast and capable (30GB+ VRAM at Int4)
ollama pull llama4:scout

# Best for coding (20GB VRAM)
ollama pull qwen2.5-coder:32b-q4_K_M

# Strong general-purpose (16GB VRAM)
ollama pull mistral-small3.1

Start the server:

ollama serve

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1.
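
Before wiring up OpenClaw, it's worth a quick sanity check that the endpoint answers. A minimal request, assuming the llama4:scout model pulled above:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:scout",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'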

Configure OpenClaw:

llm:
  name: local-ollama
  type: openai-compatible
  base_url: http://localhost:11434/v1
  model: llama4:scout
  timeout_ms: 60000

That's it. Your agent now uses a local model.

Method 2: vLLM (Best Performance)

vLLM is built for production. It's up to 24x faster than Hugging Face Transformers and supports continuous batching for multiple concurrent requests.

Install vLLM:

pip install vllm

Start the server:

vllm serve Qwen/Qwen3-32B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --tensor-parallel-size 2  # For multi-GPU
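
Those two tool flags are what let vLLM turn the model's output into structured OpenAI-style tool calls. Before connecting an agent, you can confirm tool calling works with a bare request; the get_weather function here is purely illustrative:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

A healthy setup responds with a tool_calls entry rather than plain text.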

Configure OpenClaw:

llm:
  name: local-vllm
  type: openai-compatible
  base_url: http://localhost:8000/v1
  model: Qwen/Qwen3-32B
  timeout_ms: 30000

vLLM shines when you need:

  • Multiple agents hitting the same model
  • High throughput (hundreds of requests/minute)
  • Multi-GPU setups

Method 3: LM Studio (GUI-based)

LM Studio is Ollama with a UI. Great for experimenting with models before committing.

  1. Download from lmstudio.ai
  2. Search and download a model
  3. Click "Start Server" in the Local Server tab
  4. Configure OpenClaw to use http://localhost:1234/v1

Configure OpenClaw:

llm:
  name: local-lmstudio
  type: openai-compatible
  base_url: http://localhost:1234/v1
  model: local-model
  timeout_ms: 60000
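
If the server rejects local-model as an unknown model, you can list the identifiers it actually exposes and copy one into the model field (assuming the default port):

curl http://localhost:1234/v1/models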

Method 4: llama.cpp (Maximum Control)

llama.cpp gives you raw inference with no overhead. It runs GGUF models on CPU, GPU, or mixed — and powers most other local LLM tools under the hood.

# Build from source (llama.cpp now uses CMake; the old Makefile build was removed)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Start OpenAI-compatible server
./build/bin/llama-server -m your-model.gguf --port 8080

API available at http://localhost:8080/v1. Useful when you need custom quantizations or models not yet in Ollama's library.
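
Because you pass flags to the server directly, you also control exactly how the model is split across hardware. A minimal sketch, assuming a GPU-enabled build; the layer count and context size are placeholders to tune for your card:

# Offload 35 layers to the GPU, keep the rest on CPU, serve an 8K context
./build/bin/llama-server -m your-model.gguf --port 8080 -ngl 35 -c 8192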

Which Local LLM Should You Use?

The local model landscape moves fast. Here's what's worth running as of April 2026:

General purpose:

Model                                 | VRAM          | Strength                       | Best For
Llama 4 Scout (109B MoE, 17B active)  | 30GB+ (Int4)  | Fast, multimodal, 10M context  | Quick tasks, triage, vision
Qwen 3 32B                            | 20GB          | Strong reasoning, tool use     | Complex agentic tasks
Gemma 3 (27B)                         | 18GB          | Google quality, 128K context   | Best mid-range option
Mistral Small 3.1 (24B)               | 16GB          | Fast, 128K context             | General tasks
DeepSeek V3 (quantized)               | 24GB+         | GPT-4 class reasoning          | Heavy analysis

Coding specialists:

Model               | VRAM  | Strength                                    | Best For
Qwen 2.5 Coder 32B  | 20GB  | 92.7% HumanEval — matches GPT-4o            | Code review, generation
Qwen 2.5 Coder 14B  | 10GB  | Best quality-per-VRAM for coding            | Sweet spot for most GPUs
Qwen 2.5 Coder 7B   | 6GB   | 88.4% HumanEval — beats models 5x its size  | Quick code tasks on limited hardware

Power user tier (128GB+ unified memory or multi-GPU):

Model                                | RAM/VRAM          | Strength                                             | Best For
Qwen 3.5 (397B MoE, 17B active)      | ~200GB (Q4)       | 76.4% SWE-Bench, native multimodal, agentic-trained  | Full-stack agent workflows
MiniMax M2.5 (230B MoE, 10B active)  | 101GB (3-bit)     | Benchmarks alongside Claude Sonnet                   | Agentic coding, tool use
Kimi K2.5 (1T MoE, 32B active)       | 240GB+ (1.8-bit)  | Native multimodal, Agent Swarm                       | Research, multi-agent workflows

Qwen 3.5 (released Feb 2026) is the newest option here — 397B total with 17B active params, 256K context, and agentic training focus. Needs enterprise hardware (~200GB at Q4). MiniMax M2.5 is more accessible — 10B active params means it's fast despite 230B total, and it scores 80.2% on SWE-Bench Verified. Runs on a 128GB M3/M4 Max. Kimi K2.5 needs 256GB+ RAM, so it's realistically an API model for most people.

Hardware reality check:

GPU                    | VRAM           | Max Model
RTX 3060               | 12GB           | 7–8B models
RTX 3090               | 24GB           | 32B models (quantized)
RTX 4090               | 24GB           | 32B models (quantized)
A100 40GB              | 40GB           | 70B models (quantized)
2x A100 / H100         | 80–160GB       | Full-precision large models
Mac M3/M4 Max (128GB)  | 128GB unified  | MiniMax M2.5 (3-bit), most MoE models

No GPU? Use CPU inference with llama.cpp — just expect 10–20x slower responses. Apple Silicon Macs with 32GB+ unified memory are surprisingly capable.
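
A rough way to estimate fit, consistent with the tables above: VRAM needed ≈ parameter count × bytes per weight, plus roughly 20% for the KV cache and runtime overhead. A 32B model at 4-bit (about 0.5 bytes per weight) is ~16GB of weights, which lands near the 20GB figures above once context is included; at 8-bit it roughly doubles.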

The Security Gap You're Not Thinking About

Running a local LLM solves the data privacy problem.

But you still have the agent security problem.

Your local LLM is private. Great. But the agent connected to it can still:

  • Execute arbitrary shell commands
  • Read/write any file on the system
  • Make HTTP requests to any domain
  • Access your API keys and credentials

Security researcher Maor Dayan's Shodan scan found 42,665 exposed OpenClaw instances in January 2026. 93.4% had authentication bypasses. The LLM location didn't matter — the deployment security did.

This is where Clawctl's managed deployment comes in.

Without Clawctl (Raw OpenClaw):

  • Local LLM ✓
  • Data stays on network ✓
  • Agent can run arbitrary code ⚠️
  • No audit trail ⚠️
  • No kill switch ⚠️
  • Credentials in plaintext ⚠️
  • No approval workflow ⚠️

With Clawctl Managed Deployment:

  • Local LLM ✓
  • Data stays on network ✓
  • Sandbox isolation — Agent can't escape its container
  • Full audit trail — Every action searchable, exportable
  • One-click kill switch — Stop everything instantly
  • Encrypted secrets vault — API keys encrypted at rest
  • Human-in-the-loop — 70+ risky actions blocked until you approve
  • Egress control — Only approved domains reachable
  • Prompt injection defense — Attack patterns detected and blocked

Example: Local LLM + Clawctl

# Start Ollama
ollama serve &

# Deploy OpenClaw with Clawctl
# Sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically

Configure your agent to use the local model:

llm:
  name: local
  type: openai-compatible
  base_url: http://host.docker.internal:11434/v1
  model: qwen3:32b

Now you have:

  • Zero API costs
  • Data on your network
  • Agent security from Clawctl
  • Full audit trail
  • Human approval for risky actions

Common Issues

"Connection refused to localhost"

When OpenClaw runs inside a container, localhost points at the container itself, not your host machine. To reach a model server on the host, use one of:

  • host.docker.internal (Docker Desktop)
  • Your machine's LAN IP
  • --network=host flag
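
On Linux, host.docker.internal is not defined by default and has to be mapped explicitly. A minimal sketch; the image name and environment variable are placeholders for however your OpenClaw container reads its LLM endpoint:

# Map host.docker.internal to the host gateway so the container can reach
# an Ollama server listening on the host's port 11434
docker run --add-host=host.docker.internal:host-gateway \
  -e LLM_BASE_URL=http://host.docker.internal:11434/v1 \
  your-openclaw-image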

"Model too slow"

  • Quantize: Use Q4_K_M instead of full precision
  • Batch: Enable continuous batching in vLLM
  • Upgrade: More VRAM keeps the whole model (and its context) on the GPU instead of offloading layers to CPU

"Tool calling doesn't work"

Not all models support structured tool calls. These have native tool-use support:

  • Qwen 3 / Qwen 2.5 Coder (robust tool calling)
  • Llama 4 Scout / Maverick (native tool calling)
  • Mistral Small 3.1 (function calling)
  • MiniMax M2.5 (agentic tool use)

Cost Comparison

Cloud API (1M tokens/month, output pricing):

Provider           | Output per 1M tokens
Claude Sonnet 4.5  | $15
GPT-4o             | $10
Gemini 2.5 Pro     | $10

Local LLM (1M tokens/month):

Setup                  | Cost
RTX 3090 (used)        | ~$800 one-time + electricity
Cloud GPU (A100)       | $1–3/hour
MacBook M3/M4 (32GB+)  | $0 (already own it)

At Claude Sonnet pricing, 10M output tokens a month runs about $150, so a used RTX 3090 pays for itself in 5–6 months.

At 50M tokens/month, typical for a busy agent, it pays for itself in about five weeks.

Don't Want to Manage Infrastructure?

Running your own LLM server, configuring Docker networking, setting up SSL, maintaining uptime — it adds up fast.

Clawctl handles the hard parts. You get a managed OpenClaw deployment with sandbox isolation, audit logging, and human-in-the-loop approvals. Bring your own local LLM or use a cloud API — Clawctl works with both.

The difference: KiloClaw and other managed hosts start at $9/mo but give you a shared environment with no sandbox isolation. Clawctl gives you a dedicated, isolated tenant with per-container Docker socket proxies, encrypted secrets, and egress filtering. When your agent touches customer data or production APIs, that isolation matters.

See plans and deploy in 60 seconds →

FAQ

Can I use a local LLM with OpenClaw?

Yes. OpenClaw supports any LLM that exposes an OpenAI-compatible API endpoint. This includes Ollama, vLLM, LM Studio, and llama.cpp. You configure it by setting type: openai-compatible and pointing base_url to your local server (e.g., http://localhost:11434/v1 for Ollama). No code changes needed.

What is the best local LLM for OpenClaw in 2026?

For most setups, Qwen 3 32B (20GB VRAM) offers the best balance of reasoning, tool calling, and speed. For coding-focused agents, Qwen 2.5 Coder 14B (10GB VRAM) is the sweet spot. On limited hardware (8GB), a quantized Gemma 3 12B or Qwen 2.5 Coder 7B is the best option. For enterprise setups with 128GB+ unified memory, Qwen 3.5 (397B MoE) and MiniMax M2.5 deliver near-Claude-level performance locally.

How much VRAM do I need to run a local LLM with OpenClaw?

A 7B model needs ~6GB VRAM. A 32B model (quantized to Q4) needs ~20GB. Most consumer GPUs (RTX 3090, RTX 4090) handle 32B models well. Apple Silicon Macs with 32GB+ unified memory can run 32B models and even some MoE models. For 70B+ models, you need 40GB+ VRAM or multi-GPU setups.

Is running a local LLM with OpenClaw secure?

The LLM itself is private — no data leaves your network. But the OpenClaw agent still has system access (shell commands, file operations, HTTP requests). A Shodan scan found 42,665 exposed OpenClaw instances, 93.4% with authentication bypasses. For production use, pair your local LLM with a managed deployment like Clawctl that provides sandbox isolation, audit trails, and human-in-the-loop approvals.

Can I use Ollama with OpenClaw in Docker?

Yes, but Docker containers can't reach localhost directly. Use host.docker.internal as the hostname (e.g., http://host.docker.internal:11434/v1). On Linux, you may need to add --add-host=host.docker.internal:host-gateway to your Docker run command. Alternatively, use your machine's LAN IP or run with --network=host.

How does Clawctl compare to self-hosting OpenClaw with a local LLM?

Self-hosting gives you full control but requires managing Docker, SSL certificates, firewall rules, security patches, and uptime yourself. Clawctl handles deployment infrastructure — sandbox isolation, encrypted secrets, egress filtering, auto-recovery — while you keep full control over your LLM choice. You can point Clawctl at a local Ollama instance or a cloud API. The tradeoff: $49/month for Clawctl vs. your time maintaining infrastructure.

What models support tool calling for OpenClaw agents?

Not all local models handle structured tool calls well. As of April 2026, the best options are: Qwen 3 / Qwen 2.5 Coder (robust tool calling), Llama 4 Scout / Maverick (native tool calling), Mistral Small 3.1 (function calling), Gemma 3 27B (tool use support), and MiniMax M2.5 (agentic tool use). Avoid older models without explicit tool-use training — they'll hallucinate function calls.

Deploy Your Local LLM Agent Securely

Running a local LLM is step one. Running it safely in production is step two.

Clawctl gives you a managed, secure OpenClaw deployment in 60 seconds. Sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically.

What you get:

  • Gateway authentication (256-bit, formally verified)
  • Container sandbox isolation
  • Network egress control (domain allowlist)
  • Human-in-the-loop approvals for 70+ risky actions
  • Full audit logging (searchable, exportable)
  • One-click kill switch
  • Prompt injection defense
  • Automatic security updates

Your model. Your data. Our guardrails. $49/month — cheaper than one incident.

Deploy securely with Clawctl →


More resources:

This content is for informational purposes only and does not constitute financial, legal, medical, tax, or other professional advice. Individual results vary. See our Terms of Service for important disclaimers.

Done researching? See how the options compare.

Self-hosting, cloud VMs, or managed hosting — we broke down the real costs side by side.