voice ai cheat‑sheet: 7 takeaways devs shouldn't ignore in 2025

Sveinung Myhre
Founder, iwy.ai

1. the real latency budget is ~800 ms, not "sub‑second" hype
humans pause about 500 ms before replying in normal speech. the primer argues you should aim for ≤ 800 ms voice‑to‑voice to feel natural, then breaks down where those milliseconds go (mic, opus, stt, llm, tts, network). see the full 993 ms table – it's a sobering reminder of how many layers sit between a user and your code.
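to make the budget concrete, here's a tiny python sketch that sums a stage‑by‑stage budget. the numbers are illustrative placeholders in the spirit of the primer's table, not its exact rows:

```python
# illustrative voice-to-voice latency budget (placeholder numbers, not the primer's exact rows)
budget_ms = {
    "mic capture + buffering": 40,
    "opus encode/decode": 30,
    "network (both directions)": 80,
    "stt time-to-first-byte": 150,
    "llm time-to-first-token": 350,
    "tts time-to-first-byte": 120,
    "client playout buffer": 30,
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:<28} {ms:>4} ms ({ms / total:.0%})")
print(f"{'total':<28} {total:>4} ms (target: <= 800 ms)")
```

notice the llm line dwarfs everything else – which is exactly takeaway #2.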
2. llm time‑to‑first‑token is the biggest single chunk
in that breakdown, llm ttfb alone eats ~350 ms. optimizing everywhere else won't save you if your model is slow. the guide's latency table shows gpt‑4o mini at 290 ms median vs claude sonnet at 1.4 s. choose models with speed in mind, or run a tuned llama on groq if you really need snappy responses.
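you can measure ttfb for yourself by timing the first streamed token. a minimal sketch with the openai python sdk – swap in whatever model you're evaluating:

```python
# time the llm's first streamed token (a decent proxy for ttfb in a voice pipeline)
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "say hi"}],
    stream=True,
)
for chunk in stream:
    # the first chunk that carries actual text marks time-to-first-token
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"ttft: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```

run it a few dozen times and look at the median, not the best case – tail latency is what users feel.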
3. pipecat proved 500 ms is possible (but $)
by co‑locating stt, llm, and tts on the same gpu cluster and tuning for latency over throughput, the authors hit ~500 ms round‑trip. they admit it's pricey and not yet the norm, but it sets a north star for "instant" voice agents.
4. stt + tts still beat shiny speech‑to‑speech for prod
openai and google's end‑to‑end audio models sound amazing, yet the primer says they're slower for long chats, costlier, and prone to odd quirks (repetition, unfinished sentences). today, stitching best‑in‑class stt + llm + tts is still the pragmatic path.
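here's the stitched pipeline in its simplest batch form. this sketch uses openai's own stt and tts for brevity (a prod stack would stream every stage and mix vendors like deepgram + cartesia), and the file names are hypothetical:

```python
# minimal batch-style sketch of the cascaded stt -> llm -> tts pipeline
# (real-time agents stream every stage; this only shows the stitching)
from openai import OpenAI

client = OpenAI()

# 1. speech-to-text: transcribe the user's turn
with open("user_turn.wav", "rb") as audio:  # hypothetical input file
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# 2. llm: generate the agent's reply
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
text = reply.choices[0].message.content

# 3. text-to-speech: synthesize the reply
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
speech.write_to_file("agent_turn.mp3")
```

the upside of the cascade: every stage is swappable, debuggable, and loggable – none of which is true for an opaque speech‑to‑speech model.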
Pro Tip:
At iwy.ai, we've optimized this exact stack to get voice agents deployed in 30 seconds. We handle the complexity so you can focus on your agent's personality.
5. deepgram (us) and gladia (eu) own low‑latency stt
today's production stacks lean on deepgram (~150 ms ttfb in us) or gladia (similar latency, better multilingual support). whisper's 500 ms+ delay knocks it out of real‑time contention unless you run a turbo variant on groq.
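for a feel of deepgram's api, here's a pre‑recorded transcription call against its rest endpoint. real‑time stacks use the websocket api instead, and the model name "nova-2" is an assumption – check your account for current models:

```python
# one-shot deepgram transcription over rest
# (low-latency production use goes over deepgram's websocket api instead)
import os
import requests

with open("user_turn.wav", "rb") as f:  # hypothetical input file
    audio = f.read()

resp = requests.post(
    "https://api.deepgram.com/v1/listen?model=nova-2",  # model name is an assumption
    headers={
        "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
        "Content-Type": "audio/wav",
    },
    data=audio,
)
resp.raise_for_status()
print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```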
6. cost math: hosting is noise, apis dominate
a sample ten‑minute session using deepgram + gpt‑4o + cartesia runs ≈ $0.025/min (about $0.25 total); cloud compute for the agent itself is <1 % of that. optimize api usage, not vm pennies.
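the arithmetic is quick enough to sanity‑check yourself:

```python
# back-of-envelope session cost, using the primer's figures
api_cost_per_min = 0.025      # deepgram + gpt-4o + cartesia, per the primer
session_minutes = 10

api_cost = api_cost_per_min * session_minutes   # $0.25 for the session
compute_cost = api_cost * 0.01                  # "<1% of that" -> about $0.0025

print(f"api spend:     ${api_cost:.2f}")
print(f"compute spend: ${compute_cost:.4f}  (noise)")
```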
7. function calls are your integration backbone
multi‑turn voice agents lean hard on llm function calling: pulling rag data, driving crm actions, transferring calls. but prompt drift across turns degrades reliability, so keep the function set tight and prompts clean.
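a minimal function‑calling setup in openai's tools format – the transfer_call tool here is hypothetical, but it shows the "tight set, clean description" shape:

```python
# keep the function set tight: one well-described tool per real action
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "transfer_call",  # hypothetical tool for illustration
            "description": "Transfer the caller to a human agent.",
            "parameters": {
                "type": "object",
                "properties": {
                    "department": {"type": "string", "enum": ["sales", "support"]},
                },
                "required": ["department"],
            },
        },
    },
]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "i need to talk to a human in support"}],
    tools=tools,
)
# note: tool_calls can be None if the model chose to reply in text instead
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)  # transfer_call {"department": "support"}
```

every tool you add dilutes the prompt, so rerun your eval suite whenever the set changes.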
tl;dr for builders
- chase sub‑800 ms, but know llm ttfb is the wall.
- mix‑and‑match best apis; speech‑to‑speech isn't ready.
- budget for api tokens, not servers.
- treat function calls as first‑class; test them under prompt bloat.
want more? the illustrated primer is gold – read the whole thing here → voiceaiandvoiceagents.com