
How to Build a Voice Chat Interface for OpenClaw

Tutorial: Add real-time voice conversation to your OpenClaw agent using Deepgram, ElevenLabs, and Pipecat. Full code included.

Clawctl Team

Product & Engineering


OpenClaw's Telegram and Slack integrations work well for text. But what if you want to talk to your agent?

This tutorial shows how to build a real-time voice interface using:

  • Deepgram for speech-to-text
  • ElevenLabs for text-to-speech
  • Pipecat for pipeline orchestration
  • OpenClaw's Gateway as the backend

The result: open a browser, click a button, and have a voice conversation with your full-context agent.

Architecture Overview

Browser (mic) → Deepgram STT → OpenClaw Gateway → Agent → ElevenLabs TTS → Browser (speaker)

Key insight: OpenClaw's gateway exposes an OpenAI-compatible /v1/chat/completions endpoint. This means any tool that works with OpenAI's API can work with your agent—including voice pipelines.

The voice interface isn't a separate, simpler bot. It's the same agent with all its context, tools, and memory.

Prerequisites

  • OpenClaw running with the chatCompletions endpoint enabled
  • Deepgram API key ($200 in free credits for new accounts)
  • ElevenLabs API key
  • Python 3.10+

Step 1: Enable the Chat Completions Endpoint

Add this to your openclaw.json:

{
  "gateway": {
    "http": {
      "endpoints": {
        "chatCompletions": {
          "enabled": true
        }
      }
    }
  }
}

This exposes /v1/chat/completions on your gateway port.

Step 2: Add a Voice Agent Entry

The Pipecat config will use a model name like openclaw:voice. The part after the colon maps to an agent ID:

{
  "agents": {
    "list": [
      {
        "id": "voice",
        "workspace": "/path/to/your/openclaw",
        "model": "anthropic/claude-sonnet-4-5"
      }
    ]
  }
}

Sonnet is recommended for voice—fast enough for conversational flow. Opus gives better reasoning but adds noticeable latency.
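
Before adding audio, it's worth sanity-checking both config changes. A minimal sketch using the official OpenAI Python client, assuming CLAWCTL_GATEWAY_URL and CLAWCTL_GATEWAY_TOKEN are set in your environment (the same variables the server code below uses):

import os

from openai import OpenAI

# The standard OpenAI client, pointed at the OpenClaw gateway instead of
# api.openai.com. The gateway token stands in for the API key.
client = OpenAI(
    base_url=f"{os.getenv('CLAWCTL_GATEWAY_URL')}/v1",
    api_key=os.getenv("CLAWCTL_GATEWAY_TOKEN"),
)

# "openclaw:voice" routes to the agent with id "voice" from the config above.
resp = client.chat.completions.create(
    model="openclaw:voice",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)

If this prints a reply, the gateway, the agent mapping, and auth are all working.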

Step 3: The Server Code

The server is ~100 lines of Python using FastAPI and Pipecat:

import os

# Import paths as of recent Pipecat releases (they have moved between versions).
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.elevenlabs.tts import ElevenLabsTTSService
from pipecat.services.openai.llm import OpenAILLMService

async def run_bot(webrtc_connection):
    # transport, user_aggregator, and assistant_aggregator are built from
    # webrtc_connection and the shared LLM context elsewhere in server.py
    # (see Step 4 for the aggregators).
    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

    tts = ElevenLabsTTSService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
    )

    llm = OpenAILLMService(
        api_key=os.getenv("CLAWCTL_GATEWAY_TOKEN"),
        model="openclaw:voice",
        base_url=f"{os.getenv('CLAWCTL_GATEWAY_URL')}/v1",
    )

    pipeline = Pipeline([
        transport.input(),       # Browser audio in
        stt,                     # Speech to text
        user_aggregator,         # Accumulate user turns
        llm,                     # Your agent via gateway
        tts,                     # Text to speech
        transport.output(),      # Audio back to browser
        assistant_aggregator,    # Track assistant turns
    ])

    # PipelineTask and PipelineRunner setup follows in the full server.py.

The key line: OpenAILLMService points to OpenClaw's gateway instead of OpenAI. Pipecat treats it identically—same API format—but you get your full agent behind it.

Step 4: Voice System Prompt

Spoken replies should be short, natural, and free of formatting, so give the agent a voice-specific system prompt:

VOICE_SYSTEM = (
    "This conversation is happening via real-time voice chat. "
    "Keep responses concise and conversational — a few sentences "
    "at most unless the topic genuinely needs depth. "
    "No markdown, bullet points, code blocks, or special formatting."
)
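
How the prompt reaches the model depends on how the pipeline is assembled. One plausible wiring, sketched under the assumption that recent Pipecat import paths apply (the full server.py may differ): seed the shared LLM context with the prompt, then derive the two aggregators referenced in Step 3.

from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# Seed the conversation context with the voice system prompt.
context = OpenAILLMContext(
    messages=[{"role": "system", "content": VOICE_SYSTEM}]
)

# The pair tracks user and assistant turns against that shared context;
# these are the user_aggregator / assistant_aggregator used in the pipeline.
aggregators = llm.create_context_aggregator(context)
user_aggregator = aggregators.user()
assistant_aggregator = aggregators.assistant()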

Step 5: The Frontend

A single HTML file with one button. Click to connect, click to disconnect.

The browser captures your mic via WebRTC, streams audio to the server, and plays back the response through a standard audio element.

No framework, no build step. About 80 lines of JavaScript handling WebRTC signaling and ICE candidates.

Running It

cd voice-chat
cp .env.example .env
# Fill in your Deepgram + ElevenLabs keys + gateway URL

uv run server.py

Open http://localhost:7860, click the mic, and talk.

Remote Access

To use voice chat from your phone or another computer, you need the server accessible remotely.

Options:

  • Tailscale — Expose to your personal network, no port forwarding
  • Cloudflare Tunnel — Public URL with authentication
  • VPN — Access via your existing VPN

The voice chat server runs on the same machine as OpenClaw. Audio streams to the server, gets processed, and streams back.

Voice Activity Detection

Pipecat uses Silero VAD (voice activity detection) to determine when you've stopped talking. The stop_secs parameter controls the pause duration that triggers a send.

Recommended: 0.4 seconds. Short enough for natural conversation, long enough that it doesn't cut you off mid-thought.
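
In recent Pipecat releases the VAD analyzer is attached to the transport configuration. A sketch (treat the import paths as assumptions; they have moved between Pipecat versions):

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

# 0.4 s of silence ends the user's turn and sends the transcript onward.
vad_analyzer = SileroVADAnalyzer(params=VADParams(stop_secs=0.4))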

No wake word. No push-to-talk. Just talk naturally and pause when done.

Latency Expectations

Be realistic about latency:

  • Speech-to-text (Deepgram): Fast, minimal delay
  • Text-to-speech (ElevenLabs): Fast, streaming
  • Model inference: Depends on the model

With Sonnet, the total round-trip is reasonable for conversation. With Opus or thinking models, there's a noticeable pause—more like talking to someone who considers their response before speaking.

For quick one-off commands, the delay can be frustrating. For brainstorming, status updates, or longer discussions, it works well.

With Clawctl

If you're running Clawctl:

  • The gateway is already secured with token auth
  • The /v1/chat/completions endpoint is available
  • Voice conversations appear in your audit trail

Point the voice server at your Clawctl gateway URL and use your gateway token for authentication.
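
Concretely, the .env ends up looking something like this (variable names taken from the server code; values are placeholders):

DEEPGRAM_API_KEY=...
ELEVENLABS_API_KEY=...
ELEVENLABS_VOICE_ID=...
# Gateway base URL without the trailing /v1 (the server code appends it)
CLAWCTL_GATEWAY_URL=https://gateway.example.com
CLAWCTL_GATEWAY_TOKEN=...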

Complete Files

The full implementation includes:

  • bot.py — Pipecat pipeline configuration
  • server.py — FastAPI server
  • index.html — Browser frontend
  • pyproject.toml — Dependencies
  • .env.example — Configuration template

Get the complete code →

Deploy your agent with Clawctl →
