
OpenClaw + GPU: Run AI Agents on Your Local Hardware

That RTX 4090 sitting under your desk can run a 70B parameter model. Connect it to OpenClaw and your AI agent runs locally, privately, and fast. Here's the complete GPU setup.

Clawctl Team

Product & Engineering

You bought a GPU for gaming. Or machine learning. Or crypto (we don't judge).

Now it can run your personal AI agent.

No cloud API. No per-token fees. No data leaving your network.

Just your GPU, a local LLM, and OpenClaw doing the work while you sleep.

Why GPU + OpenClaw?

Modern GPUs are the standard for AI inference. The CUDA ecosystem is mature. The tooling works.

What your GPU gives you:

Feature | Benefit
Parallel cores | Fast matrix math (fast inference)
Tensor cores (RTX 30/40) | 2-4x faster for AI workloads
VRAM | Determines max model size
Local execution | Zero latency to cloud, zero data exposure

What OpenClaw gives you:

Feature | Benefit
Tool execution | Your LLM can run shell commands, hit APIs, read files
Workflow orchestration | Multi-step agent tasks
Trigger system | React to webhooks, schedules, events
MCP integration | Connect to any Model Context Protocol server

Together: a private, fast, capable AI agent running entirely on your hardware.

GPU Requirements

GPU | VRAM | Max Model | Notes
RTX 3060 | 12GB | 8B models | Entry point
RTX 3070 Ti | 8GB | 7B models | VRAM limited
RTX 3080 | 10/12GB | 8B-13B models | Good balance
RTX 3090 | 24GB | 34B models (Q4) | Sweet spot for prosumers
RTX 4070 Ti | 12GB | 8B-13B models | Faster than 3080
RTX 4080 | 16GB | 13B-22B models | Solid upgrade
RTX 4090 | 24GB | 34B models (Q4), 70B (Q2) | Best consumer GPU
A100 | 40/80GB | 70B+ models | Datacenter
H100 | 80GB | 70B-405B models | If money isn't real

Rule of thumb: at Q4 quantization, budget roughly 1GB of VRAM per billion parameters. That's generous enough to cover the weights plus the KV cache and context.
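
Once the driver from Step 1 is installed, you can check your actual headroom directly; nvidia-smi reports total and used VRAM per GPU:

# List each GPU with its total and currently used VRAM
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader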

Step 1: Install GPU Drivers + CUDA

Ubuntu/Debian:

# Install GPU drivers
sudo apt install -y nvidia-driver-545
sudo reboot

# Verify
nvidia-smi

Docker with GPU support:

# Install Container Toolkit
# Note: apt-key is deprecated on recent Ubuntu releases; NVIDIA's current docs
# use a signed keyring instead. The commands below follow the legacy repo setup.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker, then restart it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify GPU is visible to Docker:

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Step 2: Run a Local LLM on Your GPU

Option A: Ollama (Easiest)

curl -fsSL https://ollama.ai/install.sh | sh
ollama serve

Pull a model that fits your VRAM:

# 8GB VRAM
ollama pull llama3.1:8b

# 12GB VRAM
ollama pull llama3.1:8b-instruct-q8_0

# 24GB VRAM (a 70B Q4 model is ~40GB, so Ollama will offload part of it to CPU RAM)
ollama pull llama3.1:70b-instruct-q4_K_M

Ollama auto-detects your GPU via CUDA.
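
To confirm Ollama is really offloading to the GPU (and not silently falling back to CPU), load a model and check where it's running:

# Run a quick prompt, then check where the model is loaded
ollama run llama3.1:8b "Say hello"
ollama ps

# GPU memory use should jump while the model is resident
nvidia-smi --query-gpu=memory.used --format=csv,noheader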

Option B: vLLM (Highest Performance)

pip install vllm

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --gpu-memory-utilization 0.9

For multi-GPU setups:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice
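
Either way, vLLM serves an OpenAI-compatible API (port 8000 by default), so you can smoke-test it the same way you would Ollama:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'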

Option C: TensorRT-LLM (Maximum Speed)

The fastest inference engine. More setup required:

# Requires building TensorRT engines first. The commands below are illustrative;
# exact build/run scripts vary by TensorRT-LLM version.
# See: https://github.com/NVIDIA/TensorRT-LLM

python build.py --model_dir ./llama3 --output_dir ./llama3-trt
python run.py --engine_dir ./llama3-trt

Only worth it if you need absolute maximum throughput.

Step 3: Connect to OpenClaw

Configure OpenClaw to use your local GPU-powered LLM:

llm:
  name: gpu-local
  type: openai-compatible
  base_url: http://localhost:11434/v1  # Ollama
  # base_url: http://localhost:8000/v1  # vLLM
  model: llama3.1:70b-instruct-q4_K_M
  timeout_ms: 60000

Test the connection:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello"}]}'

Performance Tuning

Monitor GPU Usage

# Real-time monitoring
watch -n 1 nvidia-smi

# Detailed metrics
nvidia-smi dmon -s pucvmet
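
For a longer view (say, while an agent works through a job overnight), you can log utilization to a CSV and review it later; the fields below are standard nvidia-smi query properties:

# Sample GPU utilization, memory, and temperature every 5 seconds, append to a CSV
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu \
  --format=csv,noheader -l 5 >> gpu-usage.csv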

Optimize VRAM Usage

# Ollama: reserve extra VRAM headroom per GPU (value is in bytes)
OLLAMA_GPU_OVERHEAD=536870912 ollama serve   # ~512MB reserved

# vLLM: Control memory utilization
vllm serve ... --gpu-memory-utilization 0.85
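
To see how much VRAM your inference server is actually holding (useful when deciding whether a larger model or longer context will fit), ask the driver which processes own GPU memory:

# Show which processes hold GPU memory and how much
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv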

Quantization Trade-offs

Quantization | VRAM Reduction | Quality Loss
FP16 | Baseline | None
Q8_0 | 50% | Minimal
Q6_K | 60% | Small
Q4_K_M | 75% | Noticeable on edge cases
Q2_K | 85% | Significant

For agent workloads, Q4_K_M is the sweet spot—good enough for tool calling, small enough to fit larger models.
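
With Ollama, the quantization level is part of the model tag, so comparing footprints is just pulling two variants and checking their sizes. The tags below assume the 8B instruct variants in the Ollama library; exact tag names may differ, so check the library listing:

# Pull the same model at two quantization levels and compare download sizes
ollama pull llama3.1:8b-instruct-q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
ollama list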

Benchmarks: Cloud vs Local GPU

Metric | Claude API | RTX 4090 (Llama 70B Q4)
Latency (first token) | 800ms | 150ms
Throughput | 50 tok/s | 40 tok/s
Cost (1M tokens) | $18 | $0.50 (electricity)
Data privacy | Data leaves network | Stays local

Local GPUs are slower on raw throughput but faster to first token. And after the hardware cost, it's essentially free.
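
That electricity figure is easy to sanity-check. Assuming roughly 450W under sustained load and $0.15/kWh (both assumptions; adjust for your card and utility), 1M tokens at 40 tok/s comes out to around fifty cents:

# 1M tokens at 40 tok/s, ~450W draw, $0.15/kWh (illustrative numbers)
awk 'BEGIN {
  hours = 1000000 / 40 / 3600          # ~6.9 hours of generation
  kwh   = 450 / 1000 * hours           # ~3.1 kWh
  printf "%.1f h, %.2f kWh, ~$%.2f\n", hours, kwh, kwh * 0.15
}'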

The Security Gap Your GPU Won't Fix

Running an LLM on your GPU is step one.

Running it safely in production is step two.

Your $1,600 GPU solves inference. It doesn't solve:

  • Credential exposure (API keys in plaintext)
  • Network exfiltration (agent can POST anywhere)
  • Execution boundaries (agent can rm -rf /)
  • Audit trail (what did it actually do?)
  • Kill switch (how do you stop it?)

January 2026: 42,665 exposed OpenClaw instances found. 93.4% vulnerable. Hardware investment didn't protect them—deployment security did.

Without Clawctl (Raw OpenClaw):

User → Local LLM → OpenClaw → Unrestricted execution → ???

With Clawctl Managed Deployment:

User → Local LLM → Clawctl → Sandbox → Egress filter → Audit → Approved execution

Security Layer | Raw OpenClaw | Clawctl Managed
Gateway auth | None | 256-bit, verified
Credentials | Plaintext on disk | Encrypted vault
Sandbox | Disabled by default | Always on
Egress control | None | Domain allowlist
Audit trail | None | Full logging, searchable
Approvals | None | 70+ risky actions blocked
Kill switch | SSH in and hope | One click
Prompt injection | Vulnerable | Defense enabled

Full Setup: GPU + Clawctl

# 1. Start your local LLM
ollama serve &

# 2. Deploy with Clawctl
# Sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically

# 3. Configure local LLM in your agent
# Edit your agent config to use:
# base_url: http://host.docker.internal:11434/v1
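
One note if your agent runs in a container on Linux: host.docker.internal isn't defined there by default, so the container needs the host-gateway mapping to reach Ollama on the host. A quick reachability check (assumes the stock curl image):

# From inside a container, confirm the Ollama API on the host is reachable
docker run --rm --add-host=host.docker.internal:host-gateway \
  curlimages/curl -s http://host.docker.internal:11434/v1/models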

Now you have:

  • ✅ Local inference on your GPU
  • ✅ Zero API costs
  • ✅ Data stays on your network
  • ✅ Production-grade security
  • ✅ Full audit trail

Troubleshooting

"CUDA out of memory"

  • Use a smaller model or higher quantization
  • Check for other processes using GPU: nvidia-smi
  • Set memory limits in your inference server

"Model not using GPU"

  • Verify CUDA is installed: nvcc --version
  • Check Ollama sees GPU: ollama ps
  • For vLLM, ensure torch.cuda.is_available() returns True (one-liner below)
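
A quick way to run all three checks in one pass (the last line assumes Python with PyTorch installed, which vLLM requires anyway):

# CUDA toolkit visible?
nvcc --version

# Is Ollama running the model on the GPU?
ollama ps

# Can PyTorch (and therefore vLLM) see the GPU?
python3 -c "import torch; print(torch.cuda.is_available())"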

"Slow inference"

  • Tensor cores not enabled? Use FP16 models
  • Check for thermal throttling: nvidia-smi -q -d TEMPERATURE
  • Enable flash attention if supported

Deploy Your GPU-Powered Agent Securely

Your GPU is already paid for. Put it to work—safely.

Clawctl gives you a managed, secure OpenClaw deployment in 60 seconds. Sign up at clawctl.com/checkout, pick a plan, and your agent is provisioned automatically.

What Clawctl's managed deployment includes:

  • Gateway authentication (256-bit, formally verified)
  • Container sandbox isolation
  • Network egress control (Squid proxy, domain allowlist)
  • Human-in-the-loop approvals for 70+ risky actions
  • Full audit logging (searchable, exportable, up to 365 days)
  • One-click kill switch
  • Prompt injection defense (enabled by default)
  • Automatic security updates

Local inference on your GPU. Production-grade security from Clawctl. $49/month — cheaper than explaining a breach.

Deploy securely with Clawctl →

