OpenClaw with Local LLM: The Complete Guide
A startup founder messaged me last week:
"I love OpenClaw but I can't send proprietary code to Claude's servers. Legal will kill me."
Fair. Most enterprise policies prohibit sending source code to third-party AI providers. Healthcare can't send patient data. Finance can't send trading algorithms. Defense can't send anything.
But here's the thing: OpenClaw doesn't care where your LLM lives.
You can run Llama 4, Qwen 3, DeepSeek V3, or any OpenAI-compatible model on your own hardware—and connect it to OpenClaw in 5 minutes.
No API costs. No data leaving your network. Full agent capabilities.
This guide covers every method that works.
Why Local LLMs + OpenClaw?
| Concern | Cloud API | Local LLM |
|---|---|---|
| Data privacy | Data leaves your network | Stays on your hardware |
| API costs | $0.015–0.06 per 1K tokens | $0 after hardware |
| Rate limits | Yes | None |
| Latency | 500ms–2s | 50–200ms |
| Offline capability | No | Yes |
| Compliance | Depends on vendor | You control everything |
For agents that touch sensitive data, local is often the only option.
Method 1: Ollama (Easiest)
Ollama is the Docker of LLMs. One command to install, one command to run.
Install Ollama:
```sh
curl -fsSL https://ollama.com/install.sh | sh
```
Pull a model:
```sh
# Fast and capable (12GB VRAM)
ollama pull llama4-scout

# Best for coding (20GB VRAM)
ollama pull qwen2.5-coder:32b-q4_K_M

# Strong general-purpose (16GB VRAM)
ollama pull mistral-small3.1
```
Start the server:
```sh
ollama serve
```
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1.
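Before pointing OpenClaw at the endpoint, it's worth a quick sanity check from code. Here's a minimal sketch using only Python's standard library, assuming the `llama4-scout` model pulled above and Ollama's default port:

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST a chat request to an OpenAI-compatible server, return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Requires `ollama serve` to be running:
# print(chat("http://localhost:11434/v1", "llama4-scout", "Say hello in one word."))
```

If this returns text, any OpenAI-compatible client (including OpenClaw) will work against the same URL.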
Configure OpenClaw:
```yaml
llm:
  name: local-ollama
  type: openai-compatible
  base_url: http://localhost:11434/v1
  model: llama4-scout
  timeout_ms: 60000
```
That's it. Your agent now uses a local model.
Method 2: vLLM (Best Performance)
vLLM is built for production. It's up to 24x faster than Hugging Face Transformers and supports continuous batching for multiple concurrent requests.
Install vLLM:
```sh
pip install vllm
```
Start the server:
```sh
vllm serve Qwen/Qwen3-32B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --tensor-parallel-size 2  # For multi-GPU
```
Configure OpenClaw:
```yaml
llm:
  name: local-vllm
  type: openai-compatible
  base_url: http://localhost:8000/v1
  model: Qwen/Qwen3-32B
  timeout_ms: 30000
```
vLLM shines when you need:
- Multiple agents hitting the same model
- High throughput (hundreds of requests/minute)
- Multi-GPU setups
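Continuous batching only pays off if requests actually arrive concurrently. A minimal stdlib-only sketch of several agents hitting the same vLLM server at once (URL and model name taken from the config above):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1"  # vLLM's default port

def build_request(prompt: str, model: str = "Qwen/Qwen3-32B") -> urllib.request.Request:
    """Build one OpenAI-compatible chat request for the vLLM server."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def ask(prompt: str) -> str:
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running, concurrent calls get batched together on the GPU:
# with ThreadPoolExecutor(max_workers=8) as pool:
#     answers = list(pool.map(ask, ["Summarize file A", "Summarize file B"]))
```

Threads are fine here because each worker just blocks on I/O; the batching itself happens server-side.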
Method 3: LM Studio (GUI-based)
LM Studio is Ollama with a UI. Great for experimenting with models before committing.
- Download from lmstudio.ai
- Search for and download a model
- Click "Start Server" in the Local Server tab
- Point OpenClaw at http://localhost:1234/v1
Configure OpenClaw:
```yaml
llm:
  name: local-lmstudio
  type: openai-compatible
  base_url: http://localhost:1234/v1
  model: local-model
  timeout_ms: 60000
```
Method 4: llama.cpp (Maximum Control)
llama.cpp gives you raw inference with no overhead. It runs GGUF models on CPU, GPU, or mixed — and powers most other local LLM tools under the hood.
```sh
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j

# Start OpenAI-compatible server
./llama-server -m your-model.gguf --port 8080
```
API available at http://localhost:8080/v1. Useful when you need custom quantizations or models not yet in Ollama's library.
Which Local LLM Should You Use?
The local model landscape moves fast. Here's what's worth running as of February 2026:
General purpose:
| Model | VRAM | Strength | Best For |
|---|---|---|---|
| Llama 4 Scout (109B MoE, 17B active) | 30GB+ (Int4) | Fast, multimodal, 10M context | Quick tasks, triage, vision |
| Qwen 3 32B | 20GB | Strong reasoning, tool use | Complex agentic tasks |
| Mistral Small 3.1 (24B) | 16GB | Fast, 128K context | General tasks |
| DeepSeek V3 (quantized) | 24GB+ | GPT-4 class reasoning | Heavy analysis |
Coding specialists:
| Model | VRAM | Strength | Best For |
|---|---|---|---|
| Qwen 2.5 Coder 32B | 20GB | 92.7% HumanEval — matches GPT-4o | Code review, generation |
| Qwen 2.5 Coder 7B | 6GB | 88.4% HumanEval — beats models 5x its size | Quick code tasks on limited hardware |
Power user tier (128GB+ unified memory or multi-GPU):
| Model | RAM/VRAM | Strength | Best For |
|---|---|---|---|
| Qwen 3.5 (397B MoE, 17B active) | ~200GB (Q4) | 76.4% SWE-Bench, native multimodal, agentic-trained | Full-stack agent workflows |
| MiniMax M2.5 (230B MoE, 10B active) | 101GB (3-bit) | Benchmarks alongside Claude Sonnet | Agentic coding, tool use |
| Kimi K2.5 (1T MoE, 32B active) | 240GB+ (1.8-bit) | Native multimodal, Agent Swarm | Research, multi-agent workflows |
Qwen 3.5 (released Feb 2026) is the newest option here — 397B total with 17B active params, 256K context, and agentic training focus. Needs enterprise hardware (~200GB at Q4). MiniMax M2.5 is more accessible — 10B active params means it's fast despite 230B total, and it scores 80.2% on SWE-Bench Verified. Runs on a 128GB M3/M4 Max. Kimi K2.5 needs 256GB+ RAM, so it's realistically an API model for most people.
Hardware reality check:
| GPU | VRAM | Max Model |
|---|---|---|
| RTX 3060 | 12GB | 7–8B models |
| RTX 3090 | 24GB | 32B models (quantized) |
| RTX 4090 | 24GB | 32B models (quantized) |
| A100 40GB | 40GB | 70B models (quantized) |
| 2x A100 / H100 | 80–160GB | Full-precision large models |
| Mac M3/M4 Max (128GB) | 128GB unified | MiniMax M2.5 (3-bit), most MoE models |
No GPU? Use CPU inference with llama.cpp — just expect 10–20x slower responses. Apple Silicon Macs with 32GB+ unified memory are surprisingly capable.
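The VRAM figures in these tables follow from a rough rule of thumb: weights take roughly params × bits ÷ 8 bytes, plus around 20% for KV cache, activations, and runtime overhead. A back-of-the-envelope estimator (the 1.2 overhead factor is an assumption, not a measured constant):

```python
def model_memory_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough memory estimate for running a model:
    weight bytes (params * bits / 8) plus ~20% for KV cache and runtime overhead."""
    weight_gb = params_billion * bits / 8  # 1B params at 8-bit ~= 1 GB of weights
    return round(weight_gb * overhead, 1)

print(model_memory_gb(32, 4))  # -> 19.2: a 32B model at Q4 fits a 24GB RTX 3090/4090
print(model_memory_gb(7, 4))   # -> 4.2: a 7B model at Q4 fits a 12GB RTX 3060 easily
```

Long contexts grow the KV cache well past that 20%, so treat these as lower bounds when you crank the context window.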
The Security Gap You're Not Thinking About
Running a local LLM solves the data privacy problem.
But you still have the agent security problem.
Your local LLM is private. Great. But the agent connected to it can still:
- Execute arbitrary shell commands
- Read/write any file on the system
- Make HTTP requests to any domain
- Access your API keys and credentials
Security researcher Maor Dayan's Shodan scan found 42,665 exposed OpenClaw instances in January 2026. 93.4% had authentication bypasses. The LLM location didn't matter — the deployment security did.
This is where Clawctl's managed deployment comes in.
Without Clawctl (Raw OpenClaw):
- Local LLM ✓
- Data stays on network ✓
- Agent can run arbitrary code ⚠️
- No audit trail ⚠️
- No kill switch ⚠️
- Credentials in plaintext ⚠️
- No approval workflow ⚠️
With Clawctl Managed Deployment:
- Local LLM ✓
- Data stays on network ✓
- Sandbox isolation — Agent can't escape its container
- Full audit trail — Every action searchable, exportable
- One-click kill switch — Stop everything instantly
- Encrypted secrets vault — API keys encrypted at rest
- Human-in-the-loop — 70+ risky actions blocked until you approve
- Egress control — Only approved domains reachable
- Prompt injection defense — Attack patterns detected and blocked
Example: Local LLM + Clawctl
```sh
# Start Ollama
ollama serve &

# Deploy OpenClaw with Clawctl:
# sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically
```
Configure your agent to use the local model:
```yaml
llm:
  name: local
  type: openai-compatible
  base_url: http://host.docker.internal:11434/v1
  model: qwen3:32b
```
Now you have:
- Zero API costs
- Data on your network
- Agent security from Clawctl
- Full audit trail
- Human approval for risky actions
Common Issues
"Connection refused to localhost"
Inside a Docker container, `localhost` refers to the container itself, not your machine. To reach a model server running on the host, use one of:
- `host.docker.internal` (Docker Desktop)
- Your machine's LAN IP
- The `--network=host` flag
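If your agent code might run both on the host and in a container, you can pick the base URL at runtime. A small heuristic sketch (the `/.dockerenv` check is a common convention, not a guarantee):

```python
import os

def ollama_base_url(port: int = 11434) -> str:
    """Pick the right base_url for reaching Ollama on the host machine.
    Inside a container, 'localhost' is the container itself, so fall back to
    host.docker.internal (works on Docker Desktop). Heuristic only."""
    in_docker = os.path.exists("/.dockerenv")
    host = "host.docker.internal" if in_docker else "localhost"
    return f"http://{host}:{port}/v1"

print(ollama_base_url())  # on a bare host -> http://localhost:11434/v1
```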
"Model too slow"
- Quantize: Use Q4_K_M instead of full precision
- Batch: Enable continuous batching in vLLM
- Upgrade: More VRAM = bigger context = better results
"Tool calling doesn't work"
Not all models support structured tool calls. These have native tool-use support:
- Qwen 3 / Qwen 2.5 Coder (robust tool calling)
- Llama 4 Scout / Maverick (native tool calling)
- Mistral Small 3.1 (function calling)
- MiniMax M2.5 (agentic tool use)
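The quickest way to verify tool support is to send the model a tool and see what comes back. A minimal OpenAI-style tool definition (the `read_file` tool here is hypothetical, purely for illustration):

```python
# Hypothetical tool; any OpenAI-compatible server with tool support accepts this shape.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file and return its contents",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

def tool_request(model: str, prompt: str) -> dict:
    """Chat request that lets the model decide whether to call the tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
        "tool_choice": "auto",
    }

# A tool-capable model answers with choices[0].message.tool_calls containing
# the function name and JSON arguments; a model without tool training replies
# with plain text (or the server rejects the request).
```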
Cost Comparison
Cloud API (1M tokens/month, output pricing):
| Provider | Output per 1M tokens |
|---|---|
| Claude Sonnet 4.5 | $15 |
| GPT-4o | $10 |
| Gemini 2.5 Pro | $10 |
Local LLM (1M tokens/month):
| Setup | Cost |
|---|---|
| RTX 3090 (used) | ~$800 one-time + electricity |
| Cloud GPU (A100) | $1–3/hour |
| MacBook M3/M4 (32GB+) | $0 (already own it) |
At 1M output tokens/month, the cloud bill is only about $15, so a used RTX 3090 takes over four years to pay for itself. At 10M tokens/month (~$150 of Claude Sonnet output), it breaks even in about five months. The economics flip once your agents run constantly.
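The break-even math is simple enough to script. A sketch using the output prices from the table above (electricity is ignored for simplicity, which is an assumption that favors local slightly):

```python
def payback_months(hardware_cost: float, tokens_m_per_month: float,
                   price_per_m_tokens: float = 15.0) -> float:
    """Months for local hardware to break even against a cloud API.
    price_per_m_tokens defaults to the Claude Sonnet output price above;
    electricity is ignored for simplicity."""
    monthly_cloud_cost = tokens_m_per_month * price_per_m_tokens
    if monthly_cloud_cost <= 0:
        return float("inf")  # no usage, so the hardware never pays for itself
    return round(hardware_cost / monthly_cloud_cost, 1)

print(payback_months(800, 10))  # used RTX 3090 at 10M tokens/month -> 5.3 months
print(payback_months(800, 1))   # at 1M tokens/month -> 53.3 months (~4.5 years)
```

Plug in your own token volume before buying hardware; the answer is very sensitive to it.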
Deploy Your Local LLM Agent Securely
Running a local LLM is step one. Running it safely in production is step two.
Clawctl gives you a managed, secure OpenClaw deployment in 60 seconds. Sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically.
What you get:
- Gateway authentication (256-bit, formally verified)
- Container sandbox isolation
- Network egress control (domain allowlist)
- Human-in-the-loop approvals for 70+ risky actions
- Full audit logging (searchable, exportable)
- One-click kill switch
- Prompt injection defense
- Automatic security updates
Your model. Your data. Our guardrails. $49/month — cheaper than one incident.
Deploy securely with Clawctl →
More resources: