
OpenClaw with Local LLM: The Complete Guide (Ollama, vLLM, LM Studio)

Keep your code on your network. Pay $0 in API fees. Run Llama 4, Qwen 3, or DeepSeek V3 locally and connect it to OpenClaw. Here's every method that works.

Clawctl Team

Product & Engineering


A startup founder messaged me last week:

"I love OpenClaw but I can't send proprietary code to Claude's servers. Legal will kill me."

Fair. Most enterprise policies prohibit sending source code to third-party AI providers. Healthcare can't send patient data. Finance can't send trading algorithms. Defense can't send anything.

But here's the thing: OpenClaw doesn't care where your LLM lives.

You can run Llama 4, Qwen 3, DeepSeek V3, or any OpenAI-compatible model on your own hardware—and connect it to OpenClaw in 5 minutes.

No API costs. No data leaving your network. Full agent capabilities.

This guide covers every method that works.

Why Local LLMs + OpenClaw?

| Concern | Cloud API | Local LLM |
| --- | --- | --- |
| Data privacy | Data leaves your network | Stays on your hardware |
| API costs | $0.015–0.06 per 1K tokens | $0 after hardware |
| Rate limits | Yes | None |
| Latency | 500ms–2s | 50–200ms |
| Offline capability | No | Yes |
| Compliance | Depends on vendor | You control everything |

For agents that touch sensitive data, local is often the only option.

Method 1: Ollama (Easiest)

Ollama is the Docker of LLMs. One command to install, one command to run.

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Pull a model:

# Fast and capable (12GB VRAM)
ollama pull llama4-scout

# Best for coding (20GB VRAM)
ollama pull qwen2.5-coder:32b-q4_K_M

# Strong general-purpose (16GB VRAM)
ollama pull mistral-small3.1

Start the server:

ollama serve

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1.
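Before editing OpenClaw's config, it's worth a quick sanity check that the endpoint answers. These are plain OpenAI-compatible calls; swap in whatever model tag you actually pulled:

```shell
# List the model tags Ollama exposes over its OpenAI-compatible API
curl -s http://localhost:11434/v1/models

# Minimal chat completion against a pulled model
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama4-scout", "messages": [{"role": "user", "content": "ping"}]}'
```

Note the first request after `ollama serve` can be slow while the model loads into memory; subsequent requests are fast.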

Configure OpenClaw:

llm:
  name: local-ollama
  type: openai-compatible
  base_url: http://localhost:11434/v1
  model: llama4-scout
  timeout_ms: 60000

That's it. Your agent now uses a local model.

Method 2: vLLM (Best Performance)

vLLM is built for production. It's up to 24x faster than Hugging Face Transformers and supports continuous batching for multiple concurrent requests.

Install vLLM:

pip install vllm

Start the server:

vllm serve Qwen/Qwen3-32B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --tensor-parallel-size 2  # For multi-GPU

Configure OpenClaw:

llm:
  name: local-vllm
  type: openai-compatible
  base_url: http://localhost:8000/v1
  model: Qwen/Qwen3-32B
  timeout_ms: 30000

vLLM shines when you need:

  • Multiple agents hitting the same model
  • High throughput (hundreds of requests/minute)
  • Multi-GPU setups
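You can also confirm tool calling works before wiring the agent in by sending a request with a tool schema by hand. The read_file tool below is a made-up example for illustration, not part of OpenClaw:

```shell
# Hand-roll a chat completion that offers the model a hypothetical read_file tool
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "messages": [{"role": "user", "content": "What is in config.yaml?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string"}},
          "required": ["path"]
        }
      }
    }]
  }'
```

A tool-capable model should respond with a `tool_calls` entry naming `read_file` rather than a plain text answer.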

Method 3: LM Studio (GUI-based)

LM Studio is Ollama with a UI. Great for experimenting with models before committing.

  1. Download from lmstudio.ai
  2. Search and download a model
  3. Click "Start Server" in the Local Server tab
  4. Configure OpenClaw to use http://localhost:1234/v1

Configure OpenClaw:

llm:
  name: local-lmstudio
  type: openai-compatible
  base_url: http://localhost:1234/v1
  model: local-model
  timeout_ms: 60000

Method 4: llama.cpp (Maximum Control)

llama.cpp gives you raw inference with no overhead. It runs GGUF models on CPU, GPU, or mixed — and powers most other local LLM tools under the hood.

# Build from source (llama.cpp builds with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Start the OpenAI-compatible server
./build/bin/llama-server -m your-model.gguf --port 8080

API available at http://localhost:8080/v1. Useful when you need custom quantizations or models not yet in Ollama's library.

Which Local LLM Should You Use?

The local model landscape moves fast. Here's what's worth running as of February 2026:

General purpose:

| Model | VRAM | Strength | Best For |
| --- | --- | --- | --- |
| Llama 4 Scout (109B MoE, 17B active) | 30GB+ (Int4) | Fast, multimodal, 10M context | Quick tasks, triage, vision |
| Qwen 3 32B | 20GB | Strong reasoning, tool use | Complex agentic tasks |
| Mistral Small 3.1 (24B) | 16GB | Fast, 128K context | General tasks |
| DeepSeek V3 (quantized) | 24GB+ | GPT-4 class reasoning | Heavy analysis |

Coding specialists:

| Model | VRAM | Strength | Best For |
| --- | --- | --- | --- |
| Qwen 2.5 Coder 32B | 20GB | 92.7% HumanEval, matches GPT-4o | Code review, generation |
| Qwen 2.5 Coder 7B | 6GB | 88.4% HumanEval, beats models 5x its size | Quick code tasks on limited hardware |

Power user tier (128GB+ unified memory or multi-GPU):

| Model | RAM/VRAM | Strength | Best For |
| --- | --- | --- | --- |
| Qwen 3.5 (397B MoE, 17B active) | ~200GB (Q4) | 76.4% SWE-Bench, native multimodal, agentic-trained | Full-stack agent workflows |
| MiniMax M2.5 (230B MoE, 10B active) | 101GB (3-bit) | Benchmarks alongside Claude Sonnet | Agentic coding, tool use |
| Kimi K2.5 (1T MoE, 32B active) | 240GB+ (1.8-bit) | Native multimodal, Agent Swarm | Research, multi-agent workflows |
Qwen 3.5 (released Feb 2026) is the newest option here — 397B total with 17B active params, 256K context, and agentic training focus. Needs enterprise hardware (~200GB at Q4). MiniMax M2.5 is more accessible — 10B active params means it's fast despite 230B total, and it scores 80.2% on SWE-Bench Verified. Runs on a 128GB M3/M4 Max. Kimi K2.5 needs 256GB+ RAM, so it's realistically an API model for most people.

Hardware reality check:

| GPU | VRAM | Max Model |
| --- | --- | --- |
| RTX 3060 | 12GB | 7–8B models |
| RTX 3090 | 24GB | 32B models (quantized) |
| RTX 4090 | 24GB | 32B models (quantized) |
| A100 40GB | 40GB | 70B models (quantized) |
| 2x A100 / H100 | 80–160GB | Full-precision large models |
| Mac M3/M4 Max (128GB) | 128GB unified | MiniMax M2.5 (3-bit), most MoE models |

No GPU? Use CPU inference with llama.cpp — just expect 10–20x slower responses. Apple Silicon Macs with 32GB+ unified memory are surprisingly capable.
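The rule of thumb behind these numbers: weight memory ≈ parameters × bits per weight ÷ 8, plus headroom for the KV cache and activations. A rough sketch (the 20% overhead figure is our own ballpark, not a measurement):

```shell
# Rough VRAM estimate for a dense model: weights + ~20% overhead
params_b=32   # parameters, in billions (e.g. a 32B model)
bits=4        # quantization level (Q4)
weights_gb=$(( params_b * bits / 8 ))        # 16 GB of raw weights
total_gb=$(( weights_gb + weights_gb / 5 ))  # ~19 GB with KV cache headroom
echo "~${total_gb}GB VRAM"
```

That lands right at the 20GB the table gives for quantized 32B models, and explains why the same model at full precision (16 bits) needs roughly four times as much.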

The Security Gap You're Not Thinking About

Running a local LLM solves the data privacy problem.

But you still have the agent security problem.

Your local LLM is private. Great. But the agent connected to it can still:

  • Execute arbitrary shell commands
  • Read/write any file on the system
  • Make HTTP requests to any domain
  • Access your API keys and credentials

Security researcher Maor Dayan's Shodan scan found 42,665 exposed OpenClaw instances in January 2026. 93.4% had authentication bypasses. The LLM location didn't matter — the deployment security did.

This is where Clawctl's managed deployment comes in.

Without Clawctl (Raw OpenClaw):

  • Local LLM ✓
  • Data stays on network ✓
  • Agent can run arbitrary code ⚠️
  • No audit trail ⚠️
  • No kill switch ⚠️
  • Credentials in plaintext ⚠️
  • No approval workflow ⚠️

With Clawctl Managed Deployment:

  • Local LLM ✓
  • Data stays on network ✓
  • Sandbox isolation — Agent can't escape its container
  • Full audit trail — Every action searchable, exportable
  • One-click kill switch — Stop everything instantly
  • Encrypted secrets vault — API keys encrypted at rest
  • Human-in-the-loop — 70+ risky actions blocked until you approve
  • Egress control — Only approved domains reachable
  • Prompt injection defense — Attack patterns detected and blocked

Example: Local LLM + Clawctl

# Start Ollama
ollama serve &

# Deploy OpenClaw with Clawctl
# Sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically

Configure your agent to use the local model:

llm:
  name: local
  type: openai-compatible
  base_url: http://host.docker.internal:11434/v1
  model: qwen3:32b

Now you have:

  • Zero API costs
  • Data on your network
  • Agent security from Clawctl
  • Full audit trail
  • Human approval for risky actions

Common Issues

"Connection refused to localhost"

Inside a Docker container, localhost refers to the container itself, not the host machine running your LLM server. Point the agent at one of:

  • host.docker.internal (Docker Desktop)
  • Your machine's LAN IP
  • --network=host flag
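A minimal sketch of the Docker side. The image name and LLM_BASE_URL variable are placeholders for illustration; the real values depend on your deployment. The --add-host flag itself is standard Docker:

```shell
# Map host.docker.internal to the host gateway (needed on Linux;
# Docker Desktop provides it automatically), then point the agent
# at the host's Ollama port.
docker run --rm \
  --add-host=host.docker.internal:host-gateway \
  -e LLM_BASE_URL=http://host.docker.internal:11434/v1 \
  openclaw/agent:latest
```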

"Model too slow"

  • Quantize: Use Q4_K_M instead of full precision
  • Batch: Enable continuous batching in vLLM
  • Upgrade: More VRAM = bigger context = better results

"Tool calling doesn't work"

Not all models support structured tool calls. These have native tool-use support:

  • Qwen 3 / Qwen 2.5 Coder (robust tool calling)
  • Llama 4 Scout / Maverick (native tool calling)
  • Mistral Small 3.1 (function calling)
  • MiniMax M2.5 (agentic tool use)

Cost Comparison

Cloud API (1M tokens/month, output pricing):

| Provider | Output per 1M tokens |
| --- | --- |
| Claude Sonnet 4.5 | $15 |
| GPT-4o | $10 |
| Gemini 2.5 Pro | $10 |

Local LLM (1M tokens/month):

| Setup | Cost |
| --- | --- |
| RTX 3090 (used) | ~$800 one-time + electricity |
| Cloud GPU (A100) | $1–3/hour |
| MacBook M3/M4 (32GB+) | $0 (already own it) |

At 1M output tokens/month (roughly $15 on Claude Sonnet), a used RTX 3090 takes over four years to pay for itself; at that volume the case for local is privacy, not cost.

At 10M tokens/month (roughly $150), it pays for itself in 5–6 months.
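The arithmetic behind the payback estimate, using the output pricing above (a sketch; real agent workloads also save on input-token spend, so treat it as conservative):

```shell
# Months to break even on a one-time GPU purchase vs. cloud output-token spend
gpu_cost=800       # used RTX 3090, USD
price_per_m=15     # Claude Sonnet output price, USD per 1M tokens
tokens_m=10        # monthly usage, in millions of tokens
monthly=$(( price_per_m * tokens_m ))   # $150/month of avoided API spend
echo "$(( gpu_cost / monthly )) months to break even"
```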

Deploy Your Local LLM Agent Securely

Running a local LLM is step one. Running it safely in production is step two.

Clawctl gives you a managed, secure OpenClaw deployment in 60 seconds. Sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically.

What you get:

  • Gateway authentication (256-bit, formally verified)
  • Container sandbox isolation
  • Network egress control (domain allowlist)
  • Human-in-the-loop approvals for 70+ risky actions
  • Full audit logging (searchable, exportable)
  • One-click kill switch
  • Prompt injection defense
  • Automatic security updates

Your model. Your data. Our guardrails. $49/month — cheaper than one incident.

Deploy securely with Clawctl →

