A day-by-day log of everything that broke while self-hosting OpenClaw in production. Every failure mode, every 2 AM page, and what it actually cost to keep running.

Self-Hosted OpenClaw Broke Me: 12 Incidents in 30 Days

I self-hosted OpenClaw for 30 days. I kept a log.

Here's every incident that took the agent down, and every fix that was supposed to be the last one.

If you're evaluating whether to self-host, read this first. If you already self-host, you'll recognize at least half of these.

The Setup

One Docker host. One OpenClaw gateway. One sandbox. Three channels — Telegram, WhatsApp, Discord.

Single tenant. Personal use. The "how hard could it be" tier.

I gave myself 30 days to get it stable. I failed.

Day 2: The Egress Proxy Dropped DNS

The agent called an API. The call timed out.

Root cause: the egress proxy container couldn't resolve hostnames. Docker's embedded DNS server only serves user-defined networks, and the proxy was sitting on the default bridge.

The fix took an hour. I had to add the proxy container to a user-defined network and set dns: in the compose file.
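A quick way to confirm this kind of failure is to test name resolution from inside the proxy container before blaming the API. The sketch below is a minimal probe, not part of OpenClaw; the function name can_resolve and the example hostname are assumptions for illustration.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves, False if the lookup fails.

    resolver is injectable so the check can be exercised without a network.
    """
    try:
        resolver(hostname, 443)
        return True
    except socket.gaierror:
        return False


# Example usage inside the container (hostname is only an example target):
#   can_resolve("api.openai.com")
```

If the probe returns False while the same lookup works on the host, the container's network or dns: settings are the likely culprit rather than the upstream API.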

Time lost: 90 minutes.

Day 4: Workspace Permissions, Round One

The gateway runs as UID 1000. The sandbox runs as UID 1001.

Guess what happens when they share a bind-mounted workspace with umask 022?

The gateway wrote AGENTS.md owned by 1000:1000 with mode 644. The sandbox tried to write to it and hit EACCES.

Fix: set umask 002 in both containers, then chown -R 1000:1002 the workspace.
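The umask arithmetic is easy to check by hand. The sketch below models it with a simplified POSIX permission check; it assumes GID 1002 is a group the sandbox user belongs to, which the fix implies but does not state outright.

```python
import stat


def mode_for_new_file(umask):
    """Mode a newly created regular file gets: 0o666 with the umask bits cleared."""
    return 0o666 & ~umask


def can_write(mode, file_uid, file_gid, uid, gids):
    """Simplified POSIX check: owner bits first, then group bits, then other bits."""
    if uid == file_uid:
        return bool(mode & stat.S_IWUSR)
    if file_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# Before: gateway (UID 1000) creates the file under umask 022, group 1000.
before = mode_for_new_file(0o022)
# After: umask 002 and group 1002; the sandbox (UID 1001) is ASSUMED to be in 1002.
after = mode_for_new_file(0o002)

print(oct(before), can_write(before, 1000, 1000, 1001, {1001}))        # 0o644 False
print(oct(after), can_write(after, 1000, 1002, 1001, {1001, 1002}))    # 0o664 True
```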

I thought I was done. I was not.

Time lost: 2 hours.

Day 6: gpt-5.4 Doesn't Exist

I tried to switch models from openai/gpt-5.1-codex to openai/gpt-5.4.

The gateway accepted the config. No error.

Then every request failed with "Unknown model." The gateway fell back to a Gemini model I hadn't configured.

OpenClaw's model validator checks format, not existence. If your provider's model catalog says "gpt-5.4" doesn't exist, you find out at runtime.

Fix: pin the exact model name from the provider's API response, not from marketing blog posts.
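An existence check at config time would have caught this before the first request. The sketch below is hypothetical: validate_model is not an OpenClaw function, and the catalog is hard-coded here where a real setup would fill it from the provider's model-listing response.

```python
def validate_model(model_ref, catalog_ids):
    """Fail fast if a provider/model reference is not in the provider's catalog."""
    provider, _, model_id = model_ref.partition("/")
    if not provider or not model_id:
        raise ValueError(f"expected 'provider/model', got {model_ref!r}")
    if model_id not in catalog_ids:
        raise ValueError(f"{model_id!r} is not in the {provider} catalog")
    return model_ref


# Illustrative catalog only; in practice this comes from the provider's API.
catalog = {"gpt-5.1-codex"}
validate_model("openai/gpt-5.1-codex", catalog)  # passes
# validate_model("openai/gpt-5.4", catalog) would raise ValueError at config time.
```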

Time lost: 45 minutes plus 4 hours of confused debugging.

Day 8: WhatsApp QR Pairing, 9 Attempts

The WhatsApp channel runs on the Baileys library, which pairs via a QR code. The pairing window is short. The gateway's WebSocket to the Baileys runtime has timing sensitivities nobody documents.

I paired successfully on the ninth attempt.

This isn't a "me" problem. Search r/openclaw for "whatsapp qr pairing" and you'll find the same story five times.

Time lost: 3 hours spread across two days.

Day 11: Docker Socket Permission Denied

I restarted the host. The sandbox failed to start.

The error: "Cannot connect to Docker daemon at unix:///var/run/docker.sock."

The sandbox used to work. Nothing changed on my side.

Root cause: a Docker update shifted the docker group GID. The sandbox container's bound user no longer had write access to the socket.

The fix involved a groupadd inside the sandbox image and a rebuild.
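A GID drift like this can be detected at container start by comparing the socket's owning group with the groups the process actually has. This is a minimal sketch with hypothetical function names; it ignores the root user and the socket's "other" permission bits.

```python
import os


def socket_gid(path="/var/run/docker.sock"):
    """GID that owns the Docker socket as seen from inside the container."""
    return os.stat(path).st_gid


def has_socket_group(sock_gid, process_gids):
    """True if the process is a member of the group that owns the socket."""
    return sock_gid in process_gids


def check_docker_socket(path="/var/run/docker.sock"):
    """Compare the socket's group with this process's effective and supplementary groups."""
    gids = set(os.getgroups()) | {os.getegid()}
    return has_socket_group(socket_gid(path), gids)
```

Running check_docker_socket() in the sandbox entrypoint and logging a clear message when it returns False would point at the group mismatch directly instead of at the daemon.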

Time lost: 2.5 hours.

Day 13: The Traefik Route Went Stale

I ran docker compose up -d to apply a config change.

The gateway came up healthy on a new internal IP. Traefik kept routing to the old IP and returned 502.

Traefik picks a Docker network at startup. If your container has multiple networks, it picks arbitrarily. And it caches.

Fix: add traefik.docker.network=mynet as a container label, plus --providers.docker.network=mynet on the Traefik side. Restart Traefik.

Time lost: 90 minutes and a very confused 45 minutes where I thought DNS was broken.

Day 15: Egress Allowlist Wiped by Redeploy

I added three domains to the allowlist via the dashboard. Everything worked.

Then I redeployed the gateway for an unrelated fix. The allowlist reverted to the defaults.

Root cause: the compose generator didn't read egress_allowlist from state. Every redeploy started from defaults. Every tenant-specific domain vanished.
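The missing step amounts to merging persisted state into the defaults every time the config is generated. The sketch below assumes the state is a dict with an egress_allowlist key; build_allowlist and the default domain are placeholders, not the real compose generator.

```python
DEFAULT_ALLOWLIST = ["api.openai.com"]  # placeholder defaults for illustration


def build_allowlist(state, defaults=DEFAULT_ALLOWLIST):
    """Defaults first, then persisted domains; duplicates dropped, order kept."""
    merged = list(defaults) + list(state.get("egress_allowlist", []))
    return list(dict.fromkeys(merged))


# A redeploy that reads state keeps the tenant-specific domains.
state = {"egress_allowlist": ["drive.example.com", "api.openai.com"]}
print(build_allowlist(state))  # ['api.openai.com', 'drive.example.com']
```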

This one cost me a Friday. I thought the agent was "forgetting" the allowlist. It was being wiped by my own deploy tool.

Time lost: 5 hours including the rabbit hole.

Day 18: CRLF Line Endings in Bootstrap

The bootstrap script that seeds the workspace had CRLF line endings after a deploy.

The sandbox shell is /bin/sh, which chokes on CRLF.

The shell error was "syntax error near unexpected token," which is the least useful error in computing. It took me 40 minutes to notice the \r\n.

Root cause: a base64-encoded config passed through an encoder that normalized newlines to CRLF.
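A byte-level check finds this in seconds, since the carriage returns are invisible in most editors. A minimal sketch, with hypothetical helper names, that detects CRLF and normalizes a script to LF before it reaches the sandbox:

```python
def has_crlf(data: bytes) -> bool:
    """True if the payload contains any Windows-style line ending."""
    return b"\r\n" in data


def to_lf(data: bytes) -> bytes:
    """Normalize CRLF line endings to LF; other bytes are left untouched."""
    return data.replace(b"\r\n", b"\n")


script = b"#!/bin/sh\r\necho seeded\r\n"
if has_crlf(script):
    script = to_lf(script)
print(script)
```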

Time lost: 90 minutes.

Day 21: The "API Rate Limit" That Wasn't

Every outbound request started failing with "API rate limit exceeded."

I checked my OpenAI dashboard. I was nowhere near the limit.

Root cause: the egress proxy sidecar had died overnight. The gateway's HTTP_PROXY pointed at a dead container. Failed connections surfaced as "rate limit" in OpenClaw's error translator.

Fix: a proper health check on the egress proxy, plus a supervisor that restarts it.
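The health check can be as simple as a TCP connect to the proxy's listening port. The sketch below shows only the probe; the host, port, and restart logic depend on the deployment and are left out.

```python
import socket


def proxy_is_up(host, port, timeout=2.0):
    """Return True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Checking this before trusting a "rate limit" message separates a dead proxy from a real provider-side limit.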

Time lost: 4 hours and a near-purchase of a more expensive OpenAI tier I didn't need.

Day 23: The MCP Plugin That Wouldn't Load

I installed a custom MCP plugin for Google Drive access.

The plugin loaded at startup (I saw it in the logs). But the agent never called any of its tools. The mcp_call tool was missing from the agent's toolbox.

Root cause: OpenClaw's embedded runner discovers plugin tools independently from the gateway's plugin loader. The gateway had the plugin. The embedded runner didn't, because it needs a package.json with an openclaw.extensions field — which my plugin directory didn't have.

This is documented. Nowhere.
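A pre-flight check on the plugin directory would surface the missing field before a six-hour hunt. The sketch assumes the field is nested as an "openclaw" object containing an "extensions" key; that layout is inferred from the field name above and should be verified against a working plugin.

```python
import json


def has_extensions_field(package_json_text):
    """True if the manifest parses and contains openclaw.extensions (assumed layout)."""
    try:
        manifest = json.loads(package_json_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(manifest, dict):
        return False
    openclaw = manifest.get("openclaw")
    return isinstance(openclaw, dict) and "extensions" in openclaw


# Illustrative manifests; the extension path is a placeholder.
print(has_extensions_field('{"name": "drive-plugin"}'))
print(has_extensions_field('{"name": "drive-plugin", "openclaw": {"extensions": ["./index.js"]}}'))
```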

Time lost: 6 hours across three sessions.

Day 26: Bare Sandbox Image

OpenClaw's ensureDockerImage() helper falls back to debian:bookworm-slim if the sandbox image isn't pre-built.

I'd built the sandbox image on my old host. I migrated to a new host. The fallback kicked in silently.

The agent worked. Sort of. Until it tried to call python3 or curl. Neither exists in bare Debian bookworm-slim.

Fix: rebuild the sandbox image with tools, push it to a registry, pull on the new host.
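A silent fallback is easier to catch if the sandbox verifies its own toolchain at startup. A minimal sketch using the standard library's shutil.which; the tool list and the missing_tools name are illustrative.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools that are not on PATH; which is injectable for testing."""
    return [tool for tool in tools if which(tool) is None]


# Example startup check inside the sandbox:
#   gaps = missing_tools(["python3", "curl"])
#   if gaps: raise SystemExit(f"sandbox image is missing: {gaps}")
```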

Time lost: 2 hours plus a half-written blog post about Python being broken.

Day 28: Workspace Permissions, Round Two

I restored the workspace from a backup archive, and tar preserved the archived file permissions. The umask fix from Day 4 only affects newly created files, so it didn't apply to the extracted ones.

Cue the same EACCES errors. Different files this time.

Fix: recursive chmod in the entrypoint, applied every start.
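The entrypoint fixup can be sketched as a walk that adds group bits without touching anything else. This is an assumption about what "recursive chmod" means here: group read/write on files, group read/write/traverse on directories. The function name is hypothetical.

```python
import os
import stat


def open_to_group(root):
    """Add group rw to every file and group rwx to every directory under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        dir_mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        os.chmod(dirpath, dir_mode | stat.S_IRWXG)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symlinks alone; chmod would follow them
            file_mode = stat.S_IMODE(os.stat(path).st_mode)
            os.chmod(path, file_mode | stat.S_IRGRP | stat.S_IWGRP)
```

Because it only ORs bits in, running it on every start is idempotent and does not strip permissions that restored files already had.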

Time lost: 2 hours.

Day 30: The Total

Twelve incidents. 39 hours of unplanned work in 30 days.

At $50/hour (below market for anyone who can debug this stuff), that's $1,950.
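The arithmetic can be laid out from the log itself. The sketch below tallies the "Time lost" lines from each incident; those itemized figures sum to 36.5 hours, so the 39-hour total includes about 2.5 hours that are not broken out per incident above.

```python
HOURLY_RATE = 50      # dollars per hour, as stated
STATED_HOURS = 39     # total as stated

# Hours taken from each incident's "Time lost" line.
logged_hours = {
    "Day 2": 1.5, "Day 4": 2, "Day 6": 0.75 + 4, "Day 8": 3,
    "Day 11": 2.5, "Day 13": 1.5 + 0.75, "Day 15": 5, "Day 18": 1.5,
    "Day 21": 4, "Day 23": 6, "Day 26": 2, "Day 28": 2,
}

itemized = sum(logged_hours.values())
print(len(logged_hours), "incidents,", itemized, "itemized hours")
print("cost at stated total:", STATED_HOURS * HOURLY_RATE)   # 1950
print("cost at itemized total:", itemized * HOURLY_RATE)     # 1825.0
```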

Plus the VPS cost. Plus the OpenAI bill I'd already committed to. Plus three missed calls from the agent during outages. Plus the fact that I never actually used the agent for anything productive, because I was too busy fixing it.

Self-hosted OpenClaw cost me roughly $2,000 for zero value delivered.

What I Missed

This list is not comprehensive. Here are the things I haven't covered because they didn't happen to me — but they will, to somebody:

CVE-2026-25253 — an unauthenticated RCE in OpenClaw's gateway RPC. Over 4,000 instances got hit before the patch landed.
Shodan-exposed instances — 42,665 OpenClaw instances are publicly reachable, and 93.4% have auth misconfigurations.
CertiK's attack surface report — third-party security firm published four risk categories: gateway takeover, identity bypass, prompt injection, supply chain risk. None mitigated by default.
Prompt injection via tool output — indirect prompt injection through emails, web pages, and MCP responses. OpenClaw's default tool policy doesn't defend against it.
Secret exfil via egress — if your allowlist isn't tight, the agent can POST your API keys to an attacker-controlled domain and you'll never see it.
Container escape via docker.sock — the default OpenClaw setup mounts /var/run/docker.sock into the sandbox, giving the agent root access to your host. The fix is a per-tenant Docker socket proxy.
Firecracker microVM isolation — some IT pros are building their own Firecracker-isolated runtimes (h0mm, CongaLine). 132-232 hours of engineering work.

Every one of these is a Day 31.

Why I Switched to Clawctl

After 30 days of self-hosting, I added up what Clawctl would have cost me and laughed.

Their Starter plan is $49/month. Their Team plan is $299/month.

Clawctl ships with every fix I just described, already applied:

Egress proxy with DNS. Works on boot, every time. DNS configured automatically.
Workspace permissions. umask, chown, and recursive fixup in the entrypoint. No Day 4.
Model allowlist validation. The dashboard only shows models that exist in your provider's live catalog. No Day 6.
WhatsApp pairing retries. Server-side pairing loop with backoff. Paired on the first attempt in each of ten tests.
Managed Docker runtime. No socket permission hell. You don't touch Docker.
Traefik routing locked to the correct network. Label set by the platform. No stale routes.
Egress allowlist persisted in the database and re-applied on every redeploy. No Day 15.
Bootstrap files normalized to LF at build time. No Day 18.
Egress proxy health checks + auto-restart inside a 60-second window. No Day 21.
MCP plugin registry with pre-validated package.json schemas. Plugins that load actually work. No Day 23.
Pre-built sandbox image embedded in the gateway image. No Day 26.
SOC2-aligned audit logs for every tool call, every egress attempt, and every approval gate.

Plus a kill switch. Plus human-in-the-loop approvals. Plus 2FA. Plus a dashboard that actually shows you what the agent is doing.

The $49/month is insurance against another 39 hours of debugging.

Who Should Still Self-Host

I'm not going to pretend managed is the right answer for everyone.

Self-host if:

You enjoy infrastructure work as a hobby
You're running bleeding-edge OpenClaw builds the managed providers don't support
You're in an air-gapped environment where managed isn't an option
You only need the agent to run for a few hours a week and can afford downtime

Don't self-host if:

You're running OpenClaw for business use
Your time is worth more than $50/hour
You need the agent to be reliable
You handle any sensitive data
You need audit trails
You don't want to be paged at 2 AM

If you're reading this and recognize three or more of my incidents, you already know which group you're in.

The CTA

Here's what I'd do if I were starting over today:

Spin up a Clawctl Starter account. $49/month. Deploys in 60 seconds. Start here.
Bring your existing API keys. OpenRouter, OpenAI, Anthropic — whatever you're already paying for. Clawctl just hosts the runtime.
Configure your channels in the dashboard. Telegram and Discord pair in under a minute. WhatsApp QR works the first time.
Close this tab. Go build the thing you wanted to build before you got stuck debugging Docker networks.

Or self-host. I'll save you a seat in the club.

We have jackets. They're made of kernel log output.


