How to Build a Voice Chat Interface for OpenClaw
OpenClaw's Telegram and Slack integrations work well for text. But what if you want to talk to your agent?
This tutorial shows how to build a real-time voice interface using:
- Deepgram for speech-to-text
- ElevenLabs for text-to-speech
- Pipecat for pipeline orchestration
- OpenClaw's Gateway as the backend
The result: open a browser, click a button, and have a voice conversation with your full-context agent.
Architecture Overview
Browser (mic) → Deepgram STT → OpenClaw Gateway → Agent → ElevenLabs TTS → Browser (speaker)
Key insight: OpenClaw's gateway exposes an OpenAI-compatible /v1/chat/completions endpoint. This means any tool that works with OpenAI's API can work with your agent—including voice pipelines.
The voice interface isn't a separate, simpler bot. It's the same agent with all its context, tools, and memory.
Prerequisites
- OpenClaw running with the chatCompletions endpoint enabled
- Deepgram API key (200 hours free tier)
- ElevenLabs API key
- Python 3.10+
Step 1: Enable the Chat Completions Endpoint
Add this to your openclaw.json:
{
  "gateway": {
    "http": {
      "endpoints": {
        "chatCompletions": {
          "enabled": true
        }
      }
    }
  }
}
This exposes /v1/chat/completions on your gateway port.
Step 2: Add a Voice Agent Entry
The Pipecat config will use a model name like openclaw:voice. The part after the colon maps to an agent ID:
{
  "agents": {
    "list": [
      {
        "id": "voice",
        "workspace": "/path/to/your/openclaw",
        "model": "anthropic/claude-sonnet-4-5"
      }
    ]
  }
}
Sonnet is recommended for voice—fast enough for conversational flow. Opus gives better reasoning but adds noticeable latency.
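With the endpoint enabled and a voice agent defined, you can smoke-test the wiring before adding any audio. A minimal sketch using the openai Python client; the port and token below are placeholders, substitute your gateway's actual address and auth token:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18789/v1",  # placeholder gateway URL
    api_key="your-gateway-token",          # placeholder gateway token
)

resp = client.chat.completions.create(
    model="openclaw:voice",  # routes to the "voice" agent defined above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)

If this prints a reply, the gateway, agent mapping, and auth are all working.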
Step 3: The Server Code
The server is ~100 lines of Python using FastAPI and Pipecat:
import os

# Note: import paths may differ slightly across Pipecat versions.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport

async def run_bot(webrtc_connection):
    transport = SmallWebRTCTransport(
        webrtc_connection=webrtc_connection,
        params=TransportParams(audio_in_enabled=True, audio_out_enabled=True),
    )
    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
    tts = ElevenLabsTTSService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
    )
    llm = OpenAILLMService(
        api_key=os.getenv("CLAWCTL_GATEWAY_TOKEN"),
        model="openclaw:voice",
        base_url=f"{os.getenv('CLAWCTL_GATEWAY_URL')}/v1",
    )

    # Shared conversation context, kept in sync by the two aggregators
    context_aggregator = llm.create_context_aggregator(OpenAILLMContext())

    pipeline = Pipeline([
        transport.input(),               # Browser audio in
        stt,                             # Speech to text
        context_aggregator.user(),       # Accumulate user turns
        llm,                             # Your agent via gateway
        tts,                             # Text to speech
        transport.output(),              # Audio back to browser
        context_aggregator.assistant(),  # Track assistant turns
    ])
The key line: OpenAILLMService points to OpenClaw's gateway instead of OpenAI. Pipecat treats it identically—same API format—but you get your full agent behind it.
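The excerpt stops at building the pipeline. To actually drive it, wrap the pipeline in a task and hand it to a runner; a minimal sketch using Pipecat's standard PipelineTask and PipelineRunner:

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

# Inside run_bot, after building the pipeline:
task = PipelineTask(pipeline)
runner = PipelineRunner()
await runner.run(task)  # returns when the session ends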
Step 4: Voice System Prompt
For voice, you want the agent to answer in short, speakable sentences rather than formatted text:
VOICE_SYSTEM = (
    "This conversation is happening via real-time voice chat. "
    "Keep responses concise and conversational — a few sentences "
    "at most unless the topic genuinely needs depth. "
    "No markdown, bullet points, code blocks, or special formatting."
)
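One way to wire this in, assuming the context setup from Step 3 (OpenAILLMContext accepts an initial message list):

# Seed the shared context with the voice system prompt instead of
# creating it empty as in the Step 3 excerpt:
context = OpenAILLMContext(
    messages=[{"role": "system", "content": VOICE_SYSTEM}]
)
context_aggregator = llm.create_context_aggregator(context)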
Step 5: The Frontend
A single HTML file with one button. Click to connect, click to disconnect.
The browser captures your mic via WebRTC, streams audio to the server, and plays back the response through a standard audio element.
No framework, no build step. About 80 lines of JavaScript handling WebRTC signaling and ICE candidates.
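On the server side, signaling reduces to one POST endpoint: the browser sends its SDP offer, the server spins up the bot, and the SDP answer comes back in the response. The sketch below is modeled on Pipecat's small-webrtc examples; SmallWebRTCConnection and its method names are assumptions and vary across Pipecat versions:

from fastapi import BackgroundTasks, FastAPI
from pipecat.transports.network.webrtc_connection import SmallWebRTCConnection

app = FastAPI()

@app.post("/api/offer")
async def offer(request: dict, background_tasks: BackgroundTasks):
    # Browser POSTs {"sdp": ..., "type": "offer"}
    connection = SmallWebRTCConnection()
    await connection.initialize(sdp=request["sdp"], type=request["type"])
    background_tasks.add_task(run_bot, connection)  # run_bot from Step 3
    return connection.get_answer()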
Running It
cd voice-chat
cp .env.example .env
# Fill in your Deepgram + ElevenLabs keys + gateway URL
uv run server.py
Open http://localhost:7860, click the mic, and talk.
Remote Access
To use voice chat from your phone or another computer, you need the server accessible remotely.
Options:
- Tailscale — Expose to your personal network, no port forwarding
- Cloudflare Tunnel — Public URL with authentication
- VPN — Access via your existing VPN
The voice chat server runs on the same machine as OpenClaw. Audio streams to the server, gets processed, and streams back.
Voice Activity Detection
Pipecat uses Silero VAD (voice activity detection) to determine when you've stopped talking. The stop_secs parameter controls the pause duration that triggers a send.
Recommended: 0.4 seconds. Short enough for natural conversation, long enough that it doesn't cut you off mid-thought.
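In code, the threshold goes on the VAD analyzer that the transport uses; a sketch assuming recent Pipecat module paths:

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

# 0.4 s of silence ends the user's turn and triggers a send.
vad = SileroVADAnalyzer(params=VADParams(stop_secs=0.4))
# Pass this as vad_analyzer in the transport's TransportParams.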
No wake word. No push-to-talk. Just talk naturally and pause when done.
Latency Expectations
Be realistic about latency:
- Speech-to-text (Deepgram): Fast, minimal delay
- Text-to-speech (ElevenLabs): Fast, streaming
- Model inference: Depends on the model
With Sonnet, the total round-trip is reasonable for conversation. With Opus or thinking models, there's a noticeable pause—more like talking to someone who considers their response before speaking.
For quick commands, the latency can feel slow. For brainstorming, status updates, or longer discussions, it works well.
With Clawctl
If you're running Clawctl:
- The gateway is already secured with token auth
- The /v1/chat/completions endpoint is available
- Voice conversations appear in your audit trail
Point the voice server at your Clawctl gateway URL and use your gateway token for authentication.
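A .env along these lines covers everything the server reads. The variable names match the Step 3 code; the values are placeholders:

DEEPGRAM_API_KEY=...
ELEVENLABS_API_KEY=...
ELEVENLABS_VOICE_ID=...
CLAWCTL_GATEWAY_URL=https://your-gateway.example.com
CLAWCTL_GATEWAY_TOKEN=...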
Complete Files
The full implementation includes:
- bot.py — Pipecat pipeline configuration
- server.py — FastAPI server
- index.html — Browser frontend
- pyproject.toml — Dependencies
- .env.example — Configuration template