OpenClaw with Local LLM: The Complete Guide
A startup founder messaged me last week:
"I love OpenClaw but I can't send proprietary code to Claude's servers. Legal will kill me."
Fair. Most enterprise policies prohibit sending source code to third-party AI providers. Healthcare can't send patient data. Finance can't send trading algorithms. Defense can't send anything.
But here's the thing: OpenClaw doesn't care where your LLM lives.
You can run Llama 4, Qwen 3, DeepSeek V3, or any OpenAI-compatible model on your own hardware—and connect it to OpenClaw in 5 minutes.
No API costs. No data leaving your network. Full agent capabilities.
This guide covers every method that works.
Why Local LLMs + OpenClaw?
| Concern | Cloud API | Local LLM |
|---|---|---|
| Data privacy | Data leaves your network | Stays on your hardware |
| API costs | $0.015–0.06 per 1K tokens | $0 after hardware |
| Rate limits | Yes | None |
| Latency | 500ms–2s | 50–200ms |
| Offline capability | No | Yes |
| Compliance | Depends on vendor | You control everything |
For agents that touch sensitive data, local is often the only option.
Method 1: Ollama (Easiest)
Ollama is the Docker of LLMs. One command to install, one command to run.
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Pull a model:
# Fast and capable (12GB VRAM)
ollama pull llama4-scout
# Best for coding (20GB VRAM)
ollama pull qwen2.5-coder:32b-q4_K_M
# Strong general-purpose (16GB VRAM)
ollama pull mistral-small3.1
Start the server:
ollama serve
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1.
Configure OpenClaw:
llm:
  name: local-ollama
  type: openai-compatible
  base_url: http://localhost:11434/v1
  model: llama4-scout
  timeout_ms: 60000
That's it. Your agent now uses a local model.
Method 2: vLLM (Best Performance)
vLLM is built for production. It's up to 24x faster than Hugging Face Transformers and supports continuous batching for multiple concurrent requests.
Install vLLM:
pip install vllm
Start the server:
vllm serve Qwen/Qwen3-32B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --tensor-parallel-size 2  # for multi-GPU
Configure OpenClaw:
llm:
  name: local-vllm
  type: openai-compatible
  base_url: http://localhost:8000/v1
  model: Qwen/Qwen3-32B
  timeout_ms: 30000
vLLM shines when you need:
- Multiple agents hitting the same model
- High throughput (hundreds of requests/minute)
- Multi-GPU setups
Method 3: LM Studio (GUI-based)
LM Studio is Ollama with a UI. Great for experimenting with models before committing.
- Download from lmstudio.ai
- Search and download a model
- Click "Start Server" in the Local Server tab
- Configure OpenClaw to use http://localhost:1234/v1
Configure OpenClaw:
llm:
  name: local-lmstudio
  type: openai-compatible
  base_url: http://localhost:1234/v1
  model: local-model
  timeout_ms: 60000
Method 4: llama.cpp (Maximum Control)
llama.cpp gives you raw inference with no overhead. It runs GGUF models on CPU, GPU, or mixed — and powers most other local LLM tools under the hood.
# Build from source (llama.cpp now uses CMake; the old Makefile build is gone)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build -j
# Start OpenAI-compatible server
./build/bin/llama-server -m your-model.gguf --port 8080
API available at http://localhost:8080/v1. Useful when you need custom quantizations or models not yet in Ollama's library.
Which Local LLM Should You Use?
The local model landscape moves fast. Here's what's worth running as of April 2026:
General purpose:
| Model | VRAM | Strength | Best For |
|---|---|---|---|
| Llama 4 Scout (109B MoE, 17B active) | 30GB+ (Int4) | Fast, multimodal, 10M context | Quick tasks, triage, vision |
| Qwen 3 32B | 20GB | Strong reasoning, tool use | Complex agentic tasks |
| Gemma 3 (27B) | 18GB | Google quality, 128K context | Best mid-range option |
| Mistral Small 3.1 (24B) | 16GB | Fast, 128K context | General tasks |
| DeepSeek V3 (quantized) | 24GB+ | GPT-4 class reasoning | Heavy analysis |
Coding specialists:
| Model | VRAM | Strength | Best For |
|---|---|---|---|
| Qwen 2.5 Coder 32B | 20GB | 92.7% HumanEval — matches GPT-4o | Code review, generation |
| Qwen 2.5 Coder 14B | 10GB | Best quality-per-VRAM for coding | Sweet spot for most GPUs |
| Qwen 2.5 Coder 7B | 6GB | 88.4% HumanEval — beats models 5x its size | Quick code tasks on limited hardware |
Power user tier (128GB+ unified memory or multi-GPU):
| Model | RAM/VRAM | Strength | Best For |
|---|---|---|---|
| Qwen 3.5 (397B MoE, 17B active) | ~200GB (Q4) | 76.4% SWE-Bench, native multimodal, agentic-trained | Full-stack agent workflows |
| MiniMax M2.5 (230B MoE, 10B active) | 101GB (3-bit) | Benchmarks alongside Claude Sonnet | Agentic coding, tool use |
| Kimi K2.5 (1T MoE, 32B active) | 240GB+ (1.8-bit) | Native multimodal, Agent Swarm | Research, multi-agent workflows |
Qwen 3.5 (released Feb 2026) is the newest option here — 397B total with 17B active params, 256K context, and agentic training focus. Needs enterprise hardware (~200GB at Q4). MiniMax M2.5 is more accessible — 10B active params means it's fast despite 230B total, and it scores 80.2% on SWE-Bench Verified. Runs on a 128GB M3/M4 Max. Kimi K2.5 needs 256GB+ RAM, so it's realistically an API model for most people.
Hardware reality check:
| GPU | VRAM | Max Model |
|---|---|---|
| RTX 3060 | 12GB | 7–8B models |
| RTX 3090 | 24GB | 32B models (quantized) |
| RTX 4090 | 24GB | 32B models (quantized) |
| A100 40GB | 40GB | 70B models (quantized) |
| 2x A100 / H100 | 80–160GB | Full-precision large models |
| Mac M3/M4 Max (128GB) | 128GB unified | MiniMax M2.5 (3-bit), most MoE models |
No GPU? Use CPU inference with llama.cpp — just expect 10–20x slower responses. Apple Silicon Macs with 32GB+ unified memory are surprisingly capable.
The Security Gap You're Not Thinking About
Running a local LLM solves the data privacy problem.
But you still have the agent security problem.
Your local LLM is private. Great. But the agent connected to it can still:
- Execute arbitrary shell commands
- Read/write any file on the system
- Make HTTP requests to any domain
- Access your API keys and credentials
Security researcher Maor Dayan's Shodan scan found 42,665 exposed OpenClaw instances in January 2026. 93.4% had authentication bypasses. The LLM location didn't matter — the deployment security did.
This is where Clawctl's managed deployment comes in.
Without Clawctl (Raw OpenClaw):
- Local LLM ✓
- Data stays on network ✓
- Agent can run arbitrary code ⚠️
- No audit trail ⚠️
- No kill switch ⚠️
- Credentials in plaintext ⚠️
- No approval workflow ⚠️
With Clawctl Managed Deployment:
- Local LLM ✓
- Data stays on network ✓
- Sandbox isolation — Agent can't escape its container
- Full audit trail — Every action searchable, exportable
- One-click kill switch — Stop everything instantly
- Encrypted secrets vault — API keys encrypted at rest
- Human-in-the-loop — 70+ risky actions blocked until you approve
- Egress control — Only approved domains reachable
- Prompt injection defense — Attack patterns detected and blocked
Example: Local LLM + Clawctl
# Start Ollama
ollama serve &
# Deploy OpenClaw with Clawctl
# Sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically
Configure your agent to use the local model:
llm:
  name: local
  type: openai-compatible
  base_url: http://host.docker.internal:11434/v1
  model: qwen3:32b
Now you have:
- Zero API costs
- Data on your network
- Agent security from Clawctl
- Full audit trail
- Human approval for risky actions
Common Issues
"Connection refused to localhost"
Inside a Docker container, localhost refers to the container itself, not your host machine. Use one of:
- host.docker.internal (Docker Desktop; on Linux, add --add-host=host.docker.internal:host-gateway)
- Your machine's LAN IP
- The --network=host flag (Linux)
"Model too slow"
- Quantize: Use Q4_K_M instead of full precision
- Batch: Enable continuous batching in vLLM
- Upgrade: More VRAM = bigger context = better results
"Tool calling doesn't work"
Not all models support structured tool calls. These have native tool-use support:
- Qwen 3 / Qwen 2.5 Coder (robust tool calling)
- Llama 4 Scout / Maverick (native tool calling)
- Mistral Small 3.1 (function calling)
- MiniMax M2.5 (agentic tool use)
Cost Comparison
Cloud API (1M tokens/month, output pricing):
| Provider | Output per 1M tokens |
|---|---|
| Claude Sonnet 4.5 | $15 |
| GPT-4o | $10 |
| Gemini 2.5 Pro | $10 |
Local LLM (1M tokens/month):
| Setup | Cost |
|---|---|
| RTX 3090 (used) | ~$800 one-time + electricity |
| Cloud GPU (A100) | $1–3/hour |
| MacBook M3/M4 (32GB+) | $0 (already own it) |
At 10M output tokens/month (roughly $100–150 on the cloud APIs above), a used RTX 3090 pays for itself in 5–6 months.
At 100M tokens/month, it pays for itself in about two weeks.
Don't Want to Manage Infrastructure?
Running your own LLM server, configuring Docker networking, setting up SSL, maintaining uptime — it adds up fast.
Clawctl handles the hard parts. You get a managed OpenClaw deployment with sandbox isolation, audit logging, and human-in-the-loop approvals. Bring your own local LLM or use a cloud API — Clawctl works with both.
The difference: KiloClaw and other managed hosts start at $9/mo but give you a shared environment with no sandbox isolation. Clawctl gives you a dedicated, isolated tenant with per-container Docker socket proxies, encrypted secrets, and egress filtering. When your agent touches customer data or production APIs, that isolation matters.
See plans and deploy in 60 seconds →
FAQ
Can I use a local LLM with OpenClaw?
Yes. OpenClaw supports any LLM that exposes an OpenAI-compatible API endpoint. This includes Ollama, vLLM, LM Studio, and llama.cpp. You configure it by setting type: openai-compatible and pointing base_url to your local server (e.g., http://localhost:11434/v1 for Ollama). No code changes needed.
What is the best local LLM for OpenClaw in 2026?
For most setups, Qwen 3 32B (20GB VRAM) offers the best balance of reasoning, tool calling, and speed. For coding-focused agents, Qwen 2.5 Coder 14B (10GB VRAM) is the sweet spot. On limited hardware (8GB), Qwen 2.5 Coder 7B (6GB VRAM) is the best option. For enterprise setups with 128GB+ unified memory, Qwen 3.5 (397B MoE) and MiniMax M2.5 deliver near-Claude-level performance locally.
How much VRAM do I need to run a local LLM with OpenClaw?
A 7B model needs ~6GB VRAM. A 32B model (quantized to Q4) needs ~20GB. Most consumer GPUs (RTX 3090, RTX 4090) handle 32B models well. Apple Silicon Macs with 32GB+ unified memory can run 32B models and even some MoE models. For 70B+ models, you need 40GB+ VRAM or multi-GPU setups.
Is running a local LLM with OpenClaw secure?
The LLM itself is private — no data leaves your network. But the OpenClaw agent still has system access (shell commands, file operations, HTTP requests). A Shodan scan found 42,665 exposed OpenClaw instances, 93.4% with authentication bypasses. For production use, pair your local LLM with a managed deployment like Clawctl that provides sandbox isolation, audit trails, and human-in-the-loop approvals.
Can I use Ollama with OpenClaw in Docker?
Yes, but Docker containers can't reach localhost directly. Use host.docker.internal as the hostname (e.g., http://host.docker.internal:11434/v1). On Linux, you may need to add --add-host=host.docker.internal:host-gateway to your Docker run command. Alternatively, use your machine's LAN IP or run with --network=host.
How does Clawctl compare to self-hosting OpenClaw with a local LLM?
Self-hosting gives you full control but requires managing Docker, SSL certificates, firewall rules, security patches, and uptime yourself. Clawctl handles deployment infrastructure — sandbox isolation, encrypted secrets, egress filtering, auto-recovery — while you keep full control over your LLM choice. You can point Clawctl at a local Ollama instance or a cloud API. The tradeoff: $49/month for Clawctl vs. your time maintaining infrastructure.
What models support tool calling for OpenClaw agents?
Not all local models handle structured tool calls well. As of April 2026, the best options are: Qwen 3 / Qwen 2.5 Coder (robust tool calling), Llama 4 Scout / Maverick (native tool calling), Mistral Small 3.1 (function calling), Gemma 3 27B (tool use support), and MiniMax M2.5 (agentic tool use). Avoid older models without explicit tool-use training — they'll hallucinate function calls.
Deploy Your Local LLM Agent Securely
Running a local LLM is step one. Running it safely in production is step two.
Clawctl gives you a managed, secure OpenClaw deployment in 60 seconds. Sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically.
What you get:
- Gateway authentication (256-bit, formally verified)
- Container sandbox isolation
- Network egress control (domain allowlist)
- Human-in-the-loop approvals for 70+ risky actions
- Full audit logging (searchable, exportable)
- One-click kill switch
- Prompt injection defense
- Automatic security updates
Your model. Your data. Our guardrails. $49/month — cheaper than one incident.
Deploy securely with Clawctl →
More resources: