
Voice Agents in Production: What Nobody Tells You

I've deployed voice agents that handle thousands of calls. The tech is the easy part. The hard part? Edge cases, emotional callers, silence detection, and the 47 things that break at 2 AM.


Robert Kopi

8 min read

It's 2:47 AM. An alert fires. One of our voice agents — deployed three weeks ago for a German insurance company's inbound claims line — has been cutting callers off mid-sentence for the past 40 minutes. Not hanging up. Just interrupting them, repeatedly, with "I'm sorry, I didn't catch that." Seventeen calls. Seventeen confused, frustrated people.

The culprit? A hold music track. The client's PBX was playing 8 seconds of music before connecting calls to the agent. Our VAD (voice activity detection) was treating that music as speech, pre-empting before the caller even said a word, and entering a confused loop. We hadn't tested for it because we hadn't known to test for it. Nobody told us.

That's the job. Not the impressive demo. Not the latency benchmark. The 2:47 AM investigation into why a German insurance claimant is being interrupted by their own hold music.

Here's what I've learned from the other side of it.

What a production voice agent actually looks like

The marketing version: "Just connect your LLM to a phone number." The reality: five tightly coupled systems that all have to work in under 800 milliseconds, or the call sounds wrong.

The stack I run: Deepgram Nova-3 for STT (streaming, ~150ms to first word), Groq + Llama 3.3 70B for inference (fastest available path to a token, ~200-400ms for a short response), ElevenLabs Flash v2.5 for TTS (~75ms latency to first audio byte, though real-world TTFB with network overhead is closer to 200ms), and LiveKit for WebRTC orchestration and turn management. Phone connectivity via Twilio or a SIP trunk.

Target end-to-end: under 800ms. What you actually get in production, P50: 900-1100ms. P90: 1.4-1.8 seconds. The benchmarks are real. The production numbers are different, because production has background noise, packet loss, network jitter, and callers on speakerphones in moving cars.
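
The budget math is worth doing explicitly. A quick sanity check, using the illustrative component numbers from this post (the network/orchestration figure is my own rough assumption, not a measured benchmark):

```python
# Rough per-turn latency budget (milliseconds). Figures are the
# illustrative numbers from this post, not vendor guarantees.
BUDGET_MS = {
    "stt_first_word": 150,             # Deepgram Nova-3 streaming
    "llm_short_response": 300,         # Groq + Llama 3.3 70B, midpoint of 200-400ms
    "tts_first_byte": 200,             # ElevenLabs Flash v2.5, real-world TTFB
    "network_and_orchestration": 150,  # WebRTC hops, turn logic (assumed)
}

total = sum(BUDGET_MS.values())
print(f"budget total: {total}ms (target < 800ms)")
```

The optimistic component numbers alone consume the entire 800ms target, which is exactly why production P50 lands above it: every retry, tool call, or jitter spike has no margin to absorb it.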

I also run Claude Sonnet for more complex reasoning tasks where I need tool use and careful judgment — but for raw voice latency, Groq wins the benchmark. The model matters less than how fast you can get the first token.

The silence problem

Silence detection is the number one source of broken voice experiences. Not model quality. Not voice synthesis. Silence.

Here's the problem: when a caller pauses — to think, to check their account number, to breathe after explaining a frustrating situation — your VAD has to decide: is this a natural pause within a sentence, or have they finished speaking? Get it wrong in one direction and you interrupt them. Get it wrong in the other and you sit in dead air for two full seconds while they wait for a response that isn't coming.

The key parameters in Deepgram's VAD: endpointing (how long to wait after silence before triggering turn-end, in milliseconds) and utterance_end_ms (the minimum pause length to consider an utterance complete). The out-of-the-box defaults are tuned for demos, not production. In a real call-center context, 500ms endpointing feels aggressive — callers who pause to pull up a reference number get cut off. I typically run 700-900ms for patient contexts, 500ms for quick-turnaround flows like appointment confirmations.
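
As a concrete sketch, here's how those two knobs might be set per call flow. The parameter names follow Deepgram's streaming API; the values are the starting points from this post, to be tuned against real calls, not treated as universal defaults:

```python
# Two VAD profiles for Deepgram's streaming API. `endpointing` and
# `utterance_end_ms` are the parameter names discussed above; values
# are starting points, not gospel.
VAD_PROFILES = {
    # Callers pause to dig up policy numbers: wait longer before turn-end.
    "patient_claims": {"endpointing": 800, "utterance_end_ms": 1200},
    # Quick yes/no confirmations: turn around fast.
    "appointment_confirm": {"endpointing": 500, "utterance_end_ms": 1000},
}

def vad_params(flow: str) -> dict:
    """Pick the VAD profile for a call flow, defaulting to the patient one."""
    return VAD_PROFILES.get(flow, VAD_PROFILES["patient_claims"])

print(vad_params("appointment_confirm"))
```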

There's a more sophisticated approach: semantic turn detection. Instead of purely audio-based silence thresholds, you run a lightweight classifier that looks at whether the transcribed text feels syntactically complete. "My policy number is—" is clearly incomplete. "My policy number is 4-4-7-2-1." is complete. Audio-only VAD can't tell the difference. Semantic VAD can. The latency cost is ~50-100ms. Worth it for complex call types.
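
Before reaching for a trained classifier, a crude heuristic already captures the shape of the decision: flag transcripts that end in words which almost never end a finished utterance. The word lists here are illustrative, not exhaustive:

```python
# Heuristic semantic turn detection: is the transcript syntactically
# complete enough to treat silence as end-of-turn? A trained classifier
# does this better; this sketch only shows the shape of the decision.
DANGLING_ENDINGS = {
    # English and German words that rarely end a finished utterance.
    "is", "are", "my", "the", "a", "an", "and", "or", "but", "to",
    "ist", "mein", "meine", "der", "die", "das", "und", "oder",
}

def looks_complete(transcript: str) -> bool:
    words = transcript.strip().rstrip(".?!").lower().split()
    if not words:
        return False  # nothing transcribed yet
    return words[-1] not in DANGLING_ENDINGS

print(looks_complete("My policy number is"))         # dangling: wait longer
print(looks_complete("My policy number is 44721."))  # complete: end the turn
```

In production this result feeds back into the endpointing decision: an incomplete-looking transcript earns the caller a longer silence allowance before the agent takes the turn.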

The hold music failure I described earlier is a VAD problem. So is the agent that keeps talking because it didn't register the caller saying "wait, stop" — a barge-in detection failure. These are not AI problems. They are signal processing problems, and they eat most of your production debugging time.

Emotional callers

This is the thing nobody writes about, because the demo callers are always polite and clear. Real callers are not always polite and clear.

Someone calls angry about a billing error at 8 AM before their second coffee. Someone calls crying because their claim was denied and they don't know how they'll pay for the damage. Someone calls with the aggressive, pressured tone of someone who's already been transferred three times and is about to hang up.

Your LLM, without any additional handling, generates a technically correct response with zero emotional awareness. "I understand your concern. Let me look up your account." Delivered in a calm, pleasant voice, to someone who just said they're extremely upset. That response will make things worse, not better. The words are fine. The failure to acknowledge the emotional register is the problem.

What I build now: a lightweight emotion classifier as a preprocessing step. Before the LLM sees the transcript, a fast inference call (Haiku, ~50ms) classifies the emotional state: neutral, frustrated, distressed, or hostile. The result gets injected into the LLM's system prompt as a context flag.

The system prompt then has branching instructions: "If EMOTIONAL_STATE = frustrated: Begin your response by explicitly acknowledging the difficulty of the situation before any procedural steps. Do not use corporate language like 'I understand your concern.' Say something real." The difference in caller outcomes is measurable — we've tracked an 18-22% reduction in call-escalation rates at clients where this is deployed versus those without it.
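
The preprocessing step looks roughly like this. `classify_emotion` stands in for the fast Haiku call described above, and the branch wording is illustrative, not the production prompt:

```python
# Inject a pre-classified emotional state into the system prompt before
# the main LLM sees the transcript. `classify_emotion` is a stub standing
# in for the ~50ms LLM classification call.
BRANCHES = {
    "neutral": "Respond normally.",
    "frustrated": ("Begin by explicitly acknowledging the difficulty of the "
                   "situation before any procedural steps. Avoid corporate "
                   "phrases like 'I understand your concern.'"),
    "distressed": "Slow down. Reassure first, procedure second.",
    "hostile": "Stay calm, do not argue, and offer a human handoff early.",
}

def classify_emotion(transcript: str) -> str:
    # Stub: in production this is a fast LLM classification call.
    return "frustrated" if "ridiculous" in transcript.lower() else "neutral"

def build_system_prompt(base_prompt: str, transcript: str) -> str:
    state = classify_emotion(transcript)
    return f"{base_prompt}\n\nEMOTIONAL_STATE = {state}\n{BRANCHES[state]}"

prompt = build_system_prompt("You are a claims assistant.",
                             "This is ridiculous, I've called three times!")
print("EMOTIONAL_STATE = frustrated" in prompt)
```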

There's also the de-escalation script pattern: a set of pre-written TTS audio clips for high-emotion moments, bypassing LLM generation entirely. "I want to make sure we resolve this completely for you. Can you give me one moment?" — said in a warm, unhurried voice, pre-generated with a high-quality voice model at the ideal emotional register. No LLM latency, no token cost, no hallucination risk. Just a real, human-sounding moment of calm.
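
The routing for that pattern is a few lines: for high-emotion states, return a pre-rendered clip; otherwise fall through to live generation. States and file paths here are illustrative:

```python
# Pre-generated de-escalation clips: for high-emotion moments, skip the
# LLM and TTS entirely and play a pre-rendered audio file. Paths and
# state names are illustrative.
DEESCALATION_CLIPS = {
    "distressed": "clips/resolve_completely.wav",
    "hostile": "clips/one_moment_please.wav",
}

def next_audio(state: str, generate_reply) -> str:
    """Return a canned clip for high-emotion states, else generate live."""
    clip = DEESCALATION_CLIPS.get(state)
    return clip if clip else generate_reply()

print(next_audio("hostile", lambda: "tts://generated"))
print(next_audio("neutral", lambda: "tts://generated"))
```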

The latency tax on every tool call

Every time your voice agent needs to look something up — a customer record, product availability, an appointment slot — you pay a latency tax. A single database lookup: 80-150ms. A CRM API call: 200-400ms. A third-party availability check: 300-600ms. Chain three of these together and you've added 600ms to 1.1 seconds to your response before the LLM has even started generating.

1.5 seconds of silence on a phone call is a long time. Past ~1.8 seconds, callers start saying "hello?" or just hang up.

Three techniques that help:

Speculative prefetching. If you know that 80% of calls to your claims line will need the caller's policy information, fetch it the moment the call connects — before they've said anything. By the time they give you their name and policy number, you've already got the record and you just confirm the match. The API cost is a few cents per call. The latency savings are hundreds of milliseconds.
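
The prefetch pattern, sketched with asyncio. `fetch_policy_record` and the timings are illustrative stubs, not a real CRM client:

```python
import asyncio

# Speculative prefetch: kick off the likely lookup the moment the call
# connects, so the result is (usually) resolved by the time the caller
# has identified themselves. `fetch_policy_record` is a stub.

async def fetch_policy_record(caller_id: str) -> dict:
    await asyncio.sleep(0.3)  # stands in for a ~300ms CRM/API call
    return {"caller_id": caller_id, "policy": "44721"}

async def handle_call(caller_id: str) -> dict:
    # Fire the lookup immediately, but don't await it yet.
    prefetch = asyncio.create_task(fetch_policy_record(caller_id))

    # ... greeting plays, caller states their name and policy number ...
    await asyncio.sleep(0.5)  # stands in for the opening exchange

    # By now the record has almost always resolved, so this await is free.
    return await prefetch

print(asyncio.run(handle_call("+49301234567")))
```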

Parallel tool calls. Don't chain tools sequentially. If you need a customer record AND a product catalog lookup, fire both in parallel and cut your tool latency roughly in half. Most orchestration frameworks support this, but it's rarely enabled by default.
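
The same pattern with asyncio.gather. The two fetchers are illustrative stubs with fixed delays, just to show that the combined wait is the slowest call, not the sum:

```python
import asyncio, time

# Parallel tool calls: fire independent lookups together with
# asyncio.gather instead of awaiting them one after another.

async def fetch_customer(cid: str) -> dict:
    await asyncio.sleep(0.2)  # stands in for a CRM call
    return {"customer": cid}

async def fetch_catalog(sku: str) -> dict:
    await asyncio.sleep(0.2)  # stands in for a catalog lookup
    return {"sku": sku}

async def respond():
    start = time.monotonic()
    customer, catalog = await asyncio.gather(
        fetch_customer("C-123"), fetch_catalog("SKU-9")
    )
    return customer, catalog, time.monotonic() - start

_, _, elapsed = asyncio.run(respond())
print(f"both lookups in {elapsed:.2f}s")  # ~0.2s total, not 0.4s
```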

Streaming TTS that starts early. The LLM generates tokens sequentially. Your TTS can start synthesizing audio on the first sentence while the LLM is still writing the second. By the time the caller hears sentence one end, sentence two's audio is already queued. The perceived latency is the latency to the first sentence, not the full response. This is the single highest-impact optimization most teams aren't doing.
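
A minimal version of that sentence-level handoff: buffer tokens, and dispatch each completed sentence to TTS while the LLM is still generating. `token_stream` and `synthesize` stand in for the real LLM and TTS clients:

```python
import re

# Sentence-level streaming: hand each completed sentence to TTS as soon
# as it appears, instead of waiting for the full LLM response.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def stream_to_tts(token_stream, synthesize):
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last fragment is a complete sentence.
        for sentence in parts[:-1]:
            synthesize(sentence)  # TTS starts before the LLM finishes
        buffer = parts[-1]
    if buffer.strip():
        synthesize(buffer)  # flush the final fragment

spoken = []
tokens = ["Your claim ", "is approved. ", "A letter ", "is on its way."]
stream_to_tts(iter(tokens), spoken.append)
print(spoken)
```

The caller starts hearing audio as soon as the first sentence closes, which is why perceived latency drops to the latency of sentence one.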

Language mixing (code-switching)

DACH callers — Germany, Austria, Switzerland — switch between German and English mid-sentence constantly. This isn't an edge case. It's standard behavior for any professional in a German-speaking country under 50. "Ich hab das schon submitted — gibt es ein Update zu meinem ticket?"

Most STT systems, configured for a single language, handle this badly. They either misrecognize the switched-language words entirely, or they transcribe them with such low confidence that the transcript becomes garbled. Your LLM then gets garbage input and produces a response that doesn't address what was actually asked.

Deepgram supports code-switching explicitly. AssemblyAI's Universal model handles it reasonably. OpenAI Whisper handles multilingual input but adds latency and isn't streaming-native. The fix: configure your STT with an explicit language list for your target market, enable code-switching mode, and test with real code-switched utterances — not just clean single-language test cases. German WER targets for production: under 12% is excellent, under 15% is acceptable. Test separately for German-English mixed utterances.
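
A sketch of that configuration as plain streaming parameters. The names follow Deepgram's streaming API (check the current docs before relying on exact values), and the test utterances show the kind of mixed input you should validate against:

```python
# STT configuration for code-switched German/English calls. Parameter
# names follow Deepgram's streaming query parameters; verify against
# current documentation before deploying.
STT_CONFIG = {
    "model": "nova-3",
    "language": "multi",      # multilingual / code-switching mode
    "endpointing": 800,       # patient-context value from above
    "interim_results": True,
    "smart_format": True,
}

# Test utterances must include real code-switched speech,
# not just clean single-language sentences.
TEST_UTTERANCES = [
    "Ich hab das schon submitted -- gibt es ein Update zu meinem ticket?",
    "Can you check meine Schadensnummer, please?",
]

print(STT_CONFIG["language"])
```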

The 47 things that break

Not literally 47. But here are the real ones, in rough order of how often they'll ruin your week:

  • Hold music before connection. Your VAD triggers on the music. Agent starts in a confused state. (Happened to me.)
  • DTMF tones. Callers press "1" on their keypad. VoIP codecs compress audio in a way that can corrupt DTMF signals. Your STT may transcribe the tone as speech fragments. Handle DTMF out-of-band via RFC 2833, not in-audio.
  • Fax machine tones. A non-trivial percentage of inbound calls to any business phone line are fax machines dialing in. Your agent will attempt to converse with a fax machine. The fax machine will not cooperate. Build fax detection (distinctive tone pattern) and drop those calls immediately.
  • Background noise thresholds. A caller in a car with the window down, in a café, on a construction site. Real-world noise reduces STT accuracy by 30%+ versus clean office audio. Noise suppression (Krisp or built-in WebRTC noise cancellation) is not optional in production.
  • VoIP jitter. Packet loss of even 3-5% causes audio artifacts that your STT model wasn't trained on. Jitter buffers need tuning (30-50ms is standard, adaptive buffers up to 200ms for bad connections). Unconfigured defaults will cause intermittent transcription failures that are nearly impossible to reproduce.
  • Call recording laws. Germany requires all-party consent. Switzerland also. France requires notification. The US varies by state (some one-party, some two-party). Your "I'll record this call for quality" disclaimer must fire before the agent says anything substantive. Miss this and you're looking at GDPR enforcement.
  • Speakerphone + echo. When a caller is on speakerphone, the agent's voice leaks back into the microphone and your STT transcribes the agent's own words as a new caller utterance. Acoustic echo cancellation (AEC) in WebRTC handles this, but it has to be enabled explicitly and tested.
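
The out-of-band DTMF fix from the list above amounts to routing at the event level: the telephony layer delivers digit events separately from the audio stream (the RFC 2833 approach), so tones never reach the STT. The event shape here is illustrative:

```python
# Out-of-band DTMF handling: digits arrive as discrete events from the
# telephony layer, never as audio. Event dicts here are illustrative,
# not any particular provider's schema.
def route_event(event: dict, stt_feed, dtmf_buffer: list):
    if event["type"] == "dtmf":
        dtmf_buffer.append(event["digit"])  # handle as a digit, not speech
    elif event["type"] == "audio":
        stt_feed(event["payload"])          # only real audio reaches STT

digits, audio = [], []
for ev in [{"type": "audio", "payload": b"..."},
           {"type": "dtmf", "digit": "1"},
           {"type": "dtmf", "digit": "#"}]:
    route_event(ev, audio.append, digits)
print("".join(digits))  # "1#"
```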

What I'd build differently

Human handoff as a first-class feature, not an afterthought. Every production voice agent needs a graceful transfer path to a human. Not a fallback. A designed transition. "I want to make sure you get the best help here — let me connect you with someone who can resolve this directly." Build the SIP transfer logic before you build anything else. Because you will need it, and needing it at 2 AM when you haven't built it yet is a very specific kind of bad.

Call recording and transcript review as your primary debugging tool. Every failed call tells you something. Build a transcript review UI — I use a basic Next.js page against a Postgres transcript store — and review the worst 20 calls every week. More signal than any benchmark or test suite.

A/B test your voice persona. Speed, pitch, gender presentation, accent, pacing — these are not soft decisions. They materially affect completion rates and caller trust. Run two voices against each other on the same call type for two weeks and measure. The results will surprise you. In one client deployment, a slower speaking pace increased task completion by 14% among callers over 55.

Gradual rollout starting with after-hours calls. No human backup available at 11 PM anyway — it's a low-stakes testing ground. Real callers, real edge cases, real transcripts to review, without the pressure of daytime volume. Run two weeks there before touching the main call queue.

The honest close

Voice agents are roughly three years behind text agents in maturity. Not because the models aren't capable — they're capable. But because the surface area of a phone call is enormous: audio quality, network infrastructure, human emotional variability, regional legal requirements, telephony quirks, VAD edge cases. The text chatbot sits inside a clean browser window with perfect Unicode input. The voice agent sits inside the actual phone network, which was built in the 1970s and has been patched ever since.

The tooling is catching up fast. ElevenLabs, Deepgram, LiveKit, Vapi, Retell — the infrastructure layer is genuinely good now. But if you deploy a voice agent today, plan to spend 40% of your time on edge cases that have nothing to do with your LLM. The model is fine. The real world is the problem.

The companies getting this right aren't the ones with the best AI. They're the ones who reviewed 200 call transcripts, fixed the DTMF handling, tuned their VAD parameters, built the emotional state classifier, and deployed to after-hours first. The unglamorous stuff. The stuff nobody demos.


Written by

Robert Kopi

AI Architect & ML Engineer. I build autonomous AI departments for European businesses — voice agents, intelligent sales systems, and multi-agent infrastructure that runs 24/7. NVIDIA Inception Program member. Based in Cyprus.
