Barge-In and the Wire-Protocol Gotchas of Real-Time Voice Agents
The exact wire knobs that make real-time voice AI work: barge-in, snake_case audio frames, the Constrained endpoint, and the message shapes that bite.
Real-time voice AI looks simple on a slide: mic in, audio out, a model in the middle. The reality is a WebSocket carrying a half-dozen message shapes where one wrong key — mimeType instead of mime_type — closes the socket on the first audio frame with an error string that has nothing to do with the actual bug.
If you read our companion war story, We Put Gemini Live on a Phone Line, you know the journey. This post is the reference card: the exact knobs that make a browser-direct, full-duplex voice agent work against the consumer Gemini Live API, centered on the hardest UX detail — barge-in — and a table of the message shapes you have to send byte-for-byte.
Everything here is what Matrix runs in production for its browser-direct voice page, where the browser holds the WebSocket straight to Gemini and the backend only mints ephemeral tokens.
The handshake: get auth right or nothing else matters
Three things have to line up before a single audio frame flows.
1. Mint a plain ephemeral token
Token minting is a single POST to the v1alpha auth_tokens endpoint:
POST https://generativelanguage.googleapis.com/v1alpha/auth_tokens?key=<API_KEY>
{ "uses": 1, "expireTime": ..., "newSessionExpireTime": ... }
The trap: the auth_tokens API accepts a bidiGenerateContentSetup field that pre-binds model, voice, and tools to the token. It's tempting — fewer round trips. Don't bake it in. A pre-bound token switches the server to expect an Authorization: Token <name> HTTP header, and browsers cannot set custom headers on a WebSocket upgrade. The entire surface of new WebSocket(url, [protocols]) is the URL and Sec-WebSocket-Protocol. So mint a plain token and send the setup from the browser instead.
One more timing gotcha: newSessionExpireTime defaults to 60 seconds. If a user opens the agent picker, browses for a minute, then clicks Start, the session-init window has already lapsed and the socket opens-then-closes. Extend it (Matrix uses 600s).
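A minimal sketch of the minting call, assuming a backend that can keep the API key secret. The three request fields are the ones shown above, and the 600-second session-start window is the value mentioned in this section. The 30-minute token lifetime, the RFC 3339 timestamp strings, the injected post helper, and the name field read from the reply are assumptions for illustration, so check them against the current API reference before relying on them.

```typescript
// Request body for POST /v1alpha/auth_tokens. Field names come from the article;
// the concrete lifetimes below are illustrative assumptions.
interface TokenRequest {
  uses: number;
  expireTime: string;
  newSessionExpireTime: string;
}

const SESSION_START_WINDOW_S = 600; // value the article says Matrix uses
const TOKEN_LIFETIME_S = 30 * 60;   // hypothetical; pick what fits your sessions

function buildTokenRequest(nowMs: number): TokenRequest {
  const at = (offsetS: number): string =>
    new Date(nowMs + offsetS * 1000).toISOString();
  // Deliberately no bidiGenerateContentSetup: a pre-bound token would require
  // an Authorization header the browser cannot send on a WebSocket upgrade.
  return {
    uses: 1,
    expireTime: at(TOKEN_LIFETIME_S),
    newSessionExpireTime: at(SESSION_START_WINDOW_S),
  };
}

// Hypothetical transport hook so the sketch does not depend on a specific HTTP client.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

async function mintToken(
  apiKey: string,
  post: PostJson,
  nowMs: number = Date.now(),
): Promise<string> {
  const url =
    "https://generativelanguage.googleapis.com/v1alpha/auth_tokens?key=" +
    encodeURIComponent(apiKey);
  const reply = await post(url, buildTokenRequest(nowMs));
  // Assumption: the reply carries the token as name: "auth_tokens/<id>".
  const name = (reply as { name?: unknown } | null)?.name;
  if (typeof name !== "string") {
    throw new Error("auth_tokens reply had no string name field");
  }
  return name;
}
```

Only the returned token name should ever reach the browser; the API key stays on the server.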
2. Connect to the Constrained endpoint
This is the browser-friendly RPC variant. It accepts query-param auth and needs no header:
wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContentConstrained?access_token=auth_tokens/<id>
Use BidiGenerateContentConstrained, not BidiGenerateContent. The plain method wants that Authorization header you can't send from a browser. And do not URL-encode the / in auth_tokens/<id> — the gateway treats the encoded slash as a path component and the handshake fails.
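One way to make the no-encoding rule hard to break is to build the URL in a single helper. This is a sketch, not an official client; the helper name and the shape check on the token name are assumptions.

```typescript
const LIVE_WS_BASE =
  "wss://generativelanguage.googleapis.com/ws/" +
  "google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContentConstrained";

// Builds the browser-side socket URL from a token name like "auth_tokens/<id>".
function liveSocketUrl(tokenName: string): string {
  // Assumed shape check: exactly one slash and no characters that would
  // need escaping in a query string.
  if (!/^auth_tokens\/[^\/?#&\s]+$/.test(tokenName)) {
    throw new Error("unexpected token name: " + tokenName);
  }
  // Plain concatenation on purpose: encodeURIComponent would turn the
  // slash into %2F, which is the failure described above.
  return LIVE_WS_BASE + "?access_token=" + tokenName;
}
```

In the browser this would feed straight into new WebSocket(liveSocketUrl(tokenName)).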
3. Send the setup message at onopen
The browser sends setup the moment the socket opens. setup.model must be prefixed models/...:
{ "setup": { "model": "models/gemini-3.1-flash-live-preview", "generationConfig": { "responseModalities": ["AUDIO"] } } }
The server replies setupComplete and you're live. One last check before you change the model name: GET https://generativelanguage.googleapis.com/v1beta/models?key=… and confirm the model is live-capable on your key. A wrong model name doesn't error politely — the server accepts the upgrade, accepts the setup, then closes silently (1006 in the browser).
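The setup step can be sketched as two small helpers: one that builds the envelope and guarantees the models/ prefix, one that recognises the ack. The helper names are hypothetical; the message shape is the one shown above.

```typescript
interface SetupMessage {
  setup: {
    model: string;
    generationConfig: { responseModalities: string[] };
  };
}

// Builds the camelCase setup envelope, adding the models/ prefix if missing.
function buildSetup(model: string): SetupMessage {
  const name = model.indexOf("models/") === 0 ? model : "models/" + model;
  return {
    setup: {
      model: name,
      generationConfig: { responseModalities: ["AUDIO"] },
    },
  };
}

// True once the server has acknowledged the setup message.
function isSetupComplete(frame: unknown): boolean {
  return typeof frame === "object" && frame !== null && "setupComplete" in frame;
}

// Browser wiring (ws is a hypothetical WebSocket variable):
//   ws.onopen = () =>
//     ws.send(JSON.stringify(buildSetup("gemini-3.1-flash-live-preview")));
// Hold microphone frames back until isSetupComplete(parsedFrame) returns true.
```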
The message-shape table
This is the part to bookmark. The casing is not uniform: the setup envelope is camelCase, but realtime_input is snake_case, and responses come back camelCase. Get one key wrong and the socket closes with a misleading 1008.
| Direction | Purpose | Shape |
|---|---|---|
| → send | Setup | { "setup": { "model": "models/...", "generationConfig": {...} } } (camelCase) |
| → send | Audio frame | { "realtime_input": { "audio": { "mime_type": "audio/pcm;rate=16000", "data": "<base64>" } } } (snake_case, singular audio) |
| ← recv | Setup ack | { "setupComplete": {} } |
| ← recv | Model audio | serverContent.modelTurn.parts[].inlineData.{ mimeType, data } (camelCase) |
| ← recv | Barge-in | serverContent.interrupted: true |
| ← recv | Tool call | toolCall.functionCalls[].{ id?, name, args } |
| → send | Tool response | { "toolResponse": { "functionResponses": [{ id, name, response: { content \| error } }] } } |
Two specifics worth their own line, because both cost real debugging time:
- Audio frames are snake_case and singular. Use realtime_input and mime_type, not realtimeInput, not mimeType. And it's a single audio object, not the older realtime_input.mediaChunks: [{…}] array, which the gateway rejects outright (with the same misleading 1008 you'd get from a bad token).
- Inbound audio is camelCase and nested differently. Model speech arrives at serverContent.modelTurn.parts[].inlineData.{mimeType, data}. Same concept as your outbound frame, opposite casing convention.
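Encoding the shapes from the table once, as typed helpers, keeps the casing out of hand-written object literals. The sketch below assumes the caller already has base64-encoded 16 kHz PCM; the helper and type names are hypothetical, and only the fields listed in the table are modelled.

```typescript
interface InlineData { mimeType: string; data: string }

interface ServerFrame {
  setupComplete?: object;
  serverContent?: {
    interrupted?: boolean;
    modelTurn?: { parts?: Array<{ inlineData?: InlineData }> };
  };
  toolCall?: {
    functionCalls?: Array<{ id?: string; name: string; args?: unknown }>;
  };
}

// Outbound: snake_case keys, one audio object (not a mediaChunks array).
function audioFrame(base64Pcm16: string) {
  return {
    realtime_input: {
      audio: { mime_type: "audio/pcm;rate=16000", data: base64Pcm16 },
    },
  };
}

// Inbound: camelCase keys, audio nested under modelTurn.parts[].inlineData.
function inboundAudio(frame: ServerFrame): InlineData[] {
  const parts = frame.serverContent?.modelTurn?.parts ?? [];
  const chunks: InlineData[] = [];
  for (const part of parts) {
    if (part.inlineData) chunks.push(part.inlineData);
  }
  return chunks;
}

function wasInterrupted(frame: ServerFrame): boolean {
  return frame.serverContent?.interrupted === true;
}

// Success variant of the tool response; the error variant would carry
// response: { error } instead.
function toolResponse(call: { id?: string; name: string }, content: unknown) {
  return {
    toolResponse: {
      functionResponses: [{ id: call.id, name: call.name, response: { content } }],
    },
  };
}
```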
Barge-in: stop the sources, not the counter
Barge-in — the model going quiet the instant the human starts talking — is the single detail that separates "voice agent" from "press-and-hold walkie-talkie." It's also the easiest thing to get subtly, infuriatingly wrong.
The server does its half: when it detects the user speaking over the model, it clears its own pending turn and sends you serverContent.interrupted: true. Your job is the browser half, and the obvious implementation is a bug.
The wrong way
The first instinct is to reset your scheduling state — nextStartTime, an active flag, a counter. That does nothing audible. The browser has already called .start(t) on every queued AudioBufferSourceNode; the Web Audio graph keeps playing them out regardless of your bookkeeping. The symptom is unmistakable: the agent appears to talk over itself, finishing two or three sentences after the user has clearly interrupted.
The right way
Keep a Set<AudioBufferSourceNode> of every source you've queued. On interrupted: true, iterate the set, call .stop() and .disconnect() on each one, then clear the set. Audio cuts within ~50 ms.
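// Every source that has been scheduled with .start() and may still be playing.
const queuedSources = new Set<AudioBufferSourceNode>();

function reset() {
  for (const src of queuedSources) {
    // stop() can throw (for example if the source was never started); ignore it
    // so one bad node cannot keep the rest from being silenced.
    try { src.stop(); } catch {}
    src.disconnect();
  }
  queuedSources.clear();
}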
.stop() alone leaves the node wired into the graph; .disconnect() alone silences it but leaves the source scheduled and running. Do both.
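The other half of that bookkeeping is getting sources into the set when they are scheduled and out of it when they finish on their own. A minimal sketch, assuming a small registry class (the class and interface names are hypothetical) written against a structural type so it can be exercised without a browser:

```typescript
// Structural stand-in for AudioBufferSourceNode; the real node satisfies it.
interface StoppableSource {
  stop(): void;
  disconnect(): void;
}

class SourceRegistry<T extends StoppableSource> {
  private readonly live = new Set<T>();

  // Call right before src.start(t).
  track(src: T): T {
    this.live.add(src);
    return src;
  }

  // Call from the source's ended handler so finished nodes do not pile up.
  release(src: T): void {
    this.live.delete(src);
  }

  get size(): number {
    return this.live.size;
  }

  // Call on serverContent.interrupted: true.
  stopAll(): void {
    this.live.forEach((src) => {
      try { src.stop(); } catch { /* stop() may throw if never started */ }
      src.disconnect();
    });
    this.live.clear();
  }
}

// Browser wiring (ctx, chunk and nextStartTime are hypothetical names):
//   const src = registry.track(ctx.createBufferSource());
//   src.buffer = chunk;
//   src.connect(ctx.destination);
//   src.onended = () => registry.release(src);
//   src.start(nextStartTime);
```

Releasing on ended keeps the set bounded during long sessions, so an interrupt only has to touch the handful of sources that are actually still scheduled.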
The 60 ms drop window
There's a final race. The server can pipeline a stale audio chunk onto the wire just before it processes the interrupt — so a fragment of the old turn arrives after you've already torn down playback. If you naively queue it, you get a clipped echo of the interrupted sentence.
The fix is a short drop window: for ~60 ms after interrupted=true, swallow any inbound audio chunk. The window is long enough to catch in-flight stale audio but short enough that the model's next turn — the brief verbal acknowledgement ("Haan boliye?", "go ahead") you want the user to hear — plays immediately. Tune it tight; too wide and the next reply feels laggy.
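The drop window itself is a few lines of state. The sketch below takes the clock as a parameter so it can be tested deterministically; the class name is hypothetical and 60 ms is simply the default discussed above, not a measured constant.

```typescript
class DropWindow {
  private readonly windowMs: number;
  private until: number = -Infinity;

  constructor(windowMs: number = 60) {
    this.windowMs = windowMs;
  }

  // Call when serverContent.interrupted: true arrives.
  open(nowMs: number): void {
    this.until = nowMs + this.windowMs;
  }

  // True while inbound audio should be discarded as stale.
  shouldDrop(nowMs: number): boolean {
    return nowMs < this.until;
  }
}

// Browser wiring (hypothetical names):
//   if (wasInterrupted(frame)) { registry.stopAll(); dropWindow.open(performance.now()); }
//   for (const chunk of inboundAudio(frame)) {
//     if (!dropWindow.shouldDrop(performance.now())) enqueue(chunk);
//   }
```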
Takeaway: barge-in lives in two places that must agree. The server clears its turn and signals interrupted; the browser must hard-stop every queued source (.stop() + .disconnect()) and ride out a ~60 ms drop window. Reset your counters too, but the counters were never what kept the audio playing.
The secure-context gotcha that blocks you before any of this
Before you can capture a single microphone sample, navigator.mediaDevices.getUserMedia has to exist — and on an insecure origin, mediaDevices itself is undefined. Browsers treat only these as secure:
- https://anything
- http://localhost and http://127.0.0.1
Nothing else. A plain-HTTP LAN address like http://10.x.x.x:3000 will not give you the mic API, which is a confusing failure when "it works on my laptop" but breaks the moment a teammate hits your dev box by IP. The clean answer is a reverse proxy that terminates TLS for any hostname — Matrix uses Caddy with tls internal and on-demand certs, so any LAN IP gets a cert on first connect, no DNS or pre-provisioned cert list. (The dev-only alternative is the chrome://flags/#unsafely-treat-insecure-origin-as-secure per-machine flag.)
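Because the failure mode is an undefined property rather than a permission prompt, it helps to check up front and surface a readable reason. A small sketch, assuming a guard function (the name and the messages are hypothetical) that takes the browser's navigator and window.isSecureContext as inputs:

```typescript
// Structural view of the parts of navigator this check needs.
interface NavigatorLike {
  mediaDevices?: { getUserMedia?: unknown };
}

// Returns null when capture can be attempted, otherwise a reason to show the user.
function micBlocker(nav: NavigatorLike, isSecureContext: boolean): string | null {
  const hasApi =
    nav.mediaDevices !== undefined &&
    typeof nav.mediaDevices.getUserMedia === "function";
  if (hasApi) return null;
  return isSecureContext
    ? "This browser does not expose getUserMedia."
    : "Microphone capture needs https:// or localhost; this page is on an insecure origin.";
}

// Browser usage:
//   const reason = micBlocker(navigator, window.isSecureContext);
//   if (reason) showError(reason); // showError is a hypothetical UI helper
```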
How Matrix keeps this from rotting
These knobs are fragile by nature — a refactor that "just cleans up imports" can flip a casing or swap an endpoint and silently break voice. Matrix treats the wire path as sacred:
- The voice client (useGeminiLive.ts, the playback queue, the capture worklet) is held byte-identical across moves; the wire protocol is documented as an invariant in docs/LEARNINGS.md.
- The same composed prompt drives text chat, browser-direct voice, and the telephony bridge, so the agent behaves identically no matter how a contact reaches it: the channel changes, the brain doesn't.
- Audio sample-rate handling (48 kHz mic → 16 kHz capture, 24 kHz playback) is split across two AudioContexts on purpose; the audio pipeline post walks through why one context can't cleanly do both.
The meta-lesson from all of this: catch-all auth errors lie. Gemini Live's "Method doesn't allow unregistered callers" surfaced for at least four distinct bugs — wrong model name, wrong endpoint variant, wrong field casing, missing header — none actually about credentials. When the error is too generic to act on, stop trusting it and instrument the wire: monkey-patch the official SDK's send/recv, print every frame, and diff its output against yours. The casing difference shows up in one line.
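The instrument-and-diff step can be sketched without touching any SDK internals: wrap whatever object has a send method, log each frame, then compare key paths between a known-good frame and your own. Everything here is a hypothetical helper, not part of the official SDK, and the tap only covers string frames.

```typescript
interface FrameSender { send(data: string): void }

// Wraps send so every outbound frame is logged before it goes on the wire.
function tapSend(socket: FrameSender, log: (frame: string) => void): void {
  const original = socket.send.bind(socket);
  socket.send = (data: string): void => {
    log(data);
    original(data);
  };
}

// Collects dotted key paths, e.g. "realtime_input.audio.mime_type".
function keyPaths(
  value: unknown,
  prefix: string = "",
  out: Set<string> = new Set<string>(),
): Set<string> {
  if (Array.isArray(value)) {
    if (value.length === 0 && prefix) out.add(prefix + "[]");
    value.forEach((item) => keyPaths(item, prefix + "[]", out));
  } else if (typeof value === "object" && value !== null) {
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record);
    if (keys.length === 0 && prefix) out.add(prefix);
    keys.forEach((k) => keyPaths(record[k], prefix ? prefix + "." + k : k, out));
  } else if (prefix) {
    out.add(prefix);
  }
  return out;
}

// Key paths present in the reference frame but not yours, and vice versa.
function diffFrames(reference: unknown, mine: unknown) {
  const ref = keyPaths(reference);
  const own = keyPaths(mine);
  const missing: string[] = [];
  const extra: string[] = [];
  ref.forEach((p) => { if (!own.has(p)) missing.push(p); });
  own.forEach((p) => { if (!ref.has(p)) extra.push(p); });
  return { missing: missing.sort(), extra: extra.sort() };
}
```

Comparing key paths rather than whole payloads means the base64 audio never drowns out the one line that matters.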
Ship it
Real-time voice is a stack of small, exact decisions: a plain token, the Constrained endpoint, models/-prefixed setup, snake_case audio frames, and barge-in that actually stops the sources. Get those right and the rest is tuning.
Matrix ships all of it — browser-direct and a telephony bridge, with barge-in, recording, and eight prebuilt voices — so you configure an agent instead of reverse-engineering a WebSocket. Create a workspace and put a voice agent on a page (or a phone line) this afternoon. Next up: one agent, two voice paths.