[Prompt]
Herman and Corn, I would like to discuss real-time conversation models for voice agents, specifically for a home preparedness checklist use case. Please address the following questions regarding the current state of the technology: 1. What is the current cost of these APIs, and how has it evolved since OpenAI first released theirs? 2. What are the unique architectural differences and value propositions of true real-time models compared to the traditional 'speech-to-text, LLM, then text-to-speech' pipeline? 3. Are there any models capable of running locally on edge or embedded devices? 4. What are some recommended platforms or frameworks for building real-time AI agents without having to manage the underlying infrastructure?

[Response]
Corn: Imagine for a second that your basement is rapidly turning into an indoor swimming pool because a pipe just burst. You are standing there, knee-deep in cold water, holding a flashlight in one hand and a wrench in the other. You do not have a spare hand to swipe through a PDF or type a query into a search engine. You need to know where the secondary shut-off valve is, and you need to know right now. So you shout to your smart home assistant, and you wait. One second. Two seconds. Three seconds. By the time that little glowing ring finishes spinning and the AI starts its polite, slow response, you are already reconsidering your life choices. This is the moment where the difference between a three-second delay and a three-hundred-millisecond delay becomes the difference between a minor repair and a five-figure insurance claim.

Herman: It is the ultimate stress test for user interface design. We have spent years in what we used to call the digital sandwich era, which we talked about way back in episode twelve eighteen. You remember that, Corn? That awkward posture where you are holding your phone like a slice of pizza, shouting at a cursor that refuses to move until you stop talking and wait for the cloud to catch up. For home preparedness, that latency is not just annoying. It is a failure state. If the tool makes you feel more panicked because it is slow, it is not a tool for emergencies. It is a liability. In a high-cortisol situation, the human brain cannot handle the cognitive load of managing a slow interface while also trying to stop a flood.

Corn: Well, Herman Poppleberry, it sounds like you have been waiting all week to dive into the guts of why our voice assistants are finally starting to act like actual humans instead of very slow librarians. Today is your lucky day. We have a prompt from Daniel about the current state of real-time conversation models, specifically looking at them through the lens of a home preparedness checklist. He wants to know about the costs, the architectural shifts, the local edge models, and the platforms that make this stuff actually work without needing a degree in distributed systems.

Herman: Daniel always points us toward the most interesting intersection of technology and utility. And honestly, the timing is perfect. We are sitting here in late March, twenty twenty-six, and the landscape has shifted more in the last six months than it did in the previous six years. We have officially moved out of the era of what I call the cascaded pipeline. For a long time, if you wanted an AI to talk to you, it was basically three different programs wearing a trench coat. You had a speech-to-text model listening, a large language model thinking, and a text-to-speech model talking.

Corn: The classic three-headed monster. And the problem with that monster is that it is slow because each head has to wait for the other one to finish its homework before it can start. If the speech-to-text takes half a second, and the language model takes a second to generate the first few words, and the voice synthesizer takes another half second to buffer, you are already at two seconds of silence before the AI even says hello. In a crisis, two seconds of silence feels like an eternity. It is the silence of uncertainty.

Herman: That is exactly the latency floor we discussed in episode fifteen sixty-four, titled Beyond the Transcript. But the shift Daniel is asking about is the move to native omni-modal or native speech-to-speech models. This is where the model is not transcribing your voice into text at all. It is hearing the audio tokens directly. It is thinking in audio and responding in audio. It skips the middleman entirely. When we talk about native models, we are talking about a neural network that has been trained on raw audio waveforms or spectrograms alongside text. It does not see the word water; it hears the frequency and the cadence of you saying the word water.

Corn: It sounds more efficient, but I am guessing it is not cheaper. Daniel’s first question is about the cost. How do the numbers look now that we are deep into twenty twenty-six, especially compared to when OpenAI first dropped their Realtime API back in late twenty twenty-four?

Herman: The pricing has actually become much more palatable, though it still carries what I call the real-time tax. When the first versions of these APIs came out, they were prohibitively expensive for anything other than high-end enterprise use cases. But as of this month, following the update to GPT-realtime-one-point-five, the costs have dropped by about forty percent. Right now, you are looking at roughly thirty-two dollars per one million audio input tokens and sixty-four dollars per one million audio output tokens.

Corn: Thirty-two and sixty-four. For those of us who do not speak in tokens, what does that actually mean for a ten-minute conversation about how to brace a water heater during an earthquake?

Herman: If you are having a continuous, high-quality conversation, you are looking at somewhere between five and ten cents per minute of active audio. Now, compare that to text tokens on the same model, which are only four dollars for input and sixteen dollars for output. Audio is still significantly more expensive because the data density is so much higher. You are not just sending the meaning of the words; you are sending the prosody, the tone, the background noise, the emotional weight. All of that has to be processed. In fact, one minute of audio can be equivalent to anywhere from several hundred to a few thousand text tokens, depending on the sampling rate and the codec being used.
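[Editor's note: a back-of-the-envelope sketch of the math Herman is doing here. The per-million-token rates are the ones he quotes; the tokens-per-minute figure is an assumption, since as he notes it varies with the codec and sampling rate.]

```python
# Back-of-the-envelope cost estimate for a real-time audio session.
# The per-million-token rates are the ones quoted in the episode; the
# tokens-per-minute value is an assumed figure that varies by codec
# and sampling rate.

AUDIO_INPUT_PER_M = 32.00    # dollars per 1M audio input tokens
AUDIO_OUTPUT_PER_M = 64.00   # dollars per 1M audio output tokens
TOKENS_PER_MINUTE = 1200     # assumption; not a published figure

def session_cost(minutes: float, talk_ratio: float = 0.5) -> float:
    """Estimate dollars for `minutes` of active audio, where
    `talk_ratio` is the fraction spoken by the user (input side)."""
    input_tokens = minutes * talk_ratio * TOKENS_PER_MINUTE
    output_tokens = minutes * (1 - talk_ratio) * TOKENS_PER_MINUTE
    return (input_tokens * AUDIO_INPUT_PER_M
            + output_tokens * AUDIO_OUTPUT_PER_M) / 1_000_000

# A ten-minute conversation about bracing a water heater:
print(f"10-minute call: ~${session_cost(10):.2f}")
```

Under these assumptions a ten-minute call lands around fifty to sixty cents, which is consistent with the five-to-ten-cents-per-minute range Herman mentions.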

Corn: So if I am building a home preparedness bot, I am paying for the privilege of the AI hearing the panic in my voice?

Herman: In a way, yes. But that leads directly into Daniel’s second point about the architectural differences. In a traditional cascaded pipeline, the AI is deaf to your tone. You could be screaming in terror or whispering a joke, and the speech-to-text model would just output the same string of characters. The native models we have now, like the ones from OpenAI or the new Nemotron-three VoiceChat that NVIDIA just announced, they actually perceive the audio. If you sound rushed, the model can sense that and give you shorter, more urgent instructions. If there is a loud crashing sound in the background, the model knows it was interrupted. It can even distinguish between the sound of a human voice and the sound of rushing water.

Corn: That interruption part is huge. In the old days, if you started talking while the AI was mid-sentence, it would just keep blathering on like a politician on a scripted teleprompter. You had to wait for it to finish its thought before you could tell it that, actually, the house is on fire now.

Herman: That is the full-duplex capability. Because these native models are streaming audio tokens in both directions simultaneously, they support what we call barge-in. The moment the model detects your voice, it can kill its own output stream and start listening. This is why the response times have plummeted from twenty-five hundred milliseconds down to about two hundred or three hundred milliseconds. That is on par with the natural gap in human turn-taking, which is around two hundred milliseconds. It feels like a real person is on the other end of the line. But there is a technical cost here, Corn. This is what we call the inference tax. To keep that latency low, the model has to keep its KV cache—the memory of the conversation—warm and ready at all times. You cannot just spin it down between sentences.
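[Editor's note: the barge-in logic Herman describes can be sketched as a tiny state machine. This is a toy illustration, not any real SDK's API; the event names and the `BargeInAgent` class are invented for the example.]

```python
# Toy full-duplex agent loop illustrating barge-in: the instant user
# speech is detected while the agent is talking, the output stream is
# cancelled and the agent returns to listening. Event names and the
# class itself are illustrative, not a real vendor API.

from enum import Enum, auto

class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInAgent:
    def __init__(self):
        self.state = State.LISTENING
        self.cancelled_responses = 0

    def on_response_start(self):
        # Model begins streaming audio out.
        self.state = State.SPEAKING

    def on_user_voice_detected(self):
        # Barge-in: if we are mid-utterance, kill the output stream
        # immediately, then go back to listening.
        if self.state is State.SPEAKING:
            self.cancel_output()
        self.state = State.LISTENING

    def cancel_output(self):
        # Stand-in for flushing the outgoing audio buffer.
        self.cancelled_responses += 1

agent = BargeInAgent()
agent.on_response_start()
agent.on_user_voice_detected()   # user interrupts mid-sentence
print(agent.state, agent.cancelled_responses)
```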

Corn: So it is like keeping the engine of a car running at a red light instead of turning it off. It uses more fuel, or in this case, more GPU memory, but it allows you to floor it the second the light turns green.

Herman: And the latency in those old cascaded pipelines was inherently non-deterministic. Sometimes the speech-to-text would be fast, but the language model would get stuck in a long reasoning chain, or the network jitter would delay the voice synthesis. With a native omni-modal model, the pipeline is unified. The latency is much more predictable, which is vital when you are trying to talk someone through a medical emergency or a structural failure.

Corn: It also solves the problem of the AI sounding like a robot trying to win a spelling bee. When you skip the text layer, you lose that weird, jerky transition between words. But I have to ask, isn't there a downside to not having text in the middle? I like being able to read what the AI said. If it is just audio to audio, how do we debug it? How do we make sure it is not hallucinating that the shut-off valve is behind the fridge when it should be in the garage?

Herman: That is the big trade-off. It is much harder to guardrail a native speech-to-speech model. With text, you can run a filter to check for safety or accuracy before the voice synthesizer ever sees it. With audio, you are basically monitoring a live stream. We are seeing new techniques where a secondary, smaller model watches the audio stream for specific keywords or safety violations, but it is definitely more of a black box. For something like a home preparedness checklist, you really need that model to be grounded in a specific set of facts. You do not want it improvising how to handle a gas leak.
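[Editor's note: the secondary-watcher idea Herman mentions can be sketched as a lightweight filter over a transcript side-channel. The trigger phrases and the flagging scheme here are invented for illustration; a production guardrail would use a dedicated classifier model rather than substring matching.]

```python
# Minimal safety watcher for a streaming transcript side-channel.
# The trigger phrases are illustrative only; real guardrails for
# audio agents typically run a small classifier model instead.

TRIGGER_PHRASES = ("gas leak", "open flame", "carbon monoxide")

def watch(transcript_chunks):
    """Yield (chunk, flagged) pairs; flag any chunk containing a
    safety-critical phrase so a supervising process can intervene."""
    for chunk in transcript_chunks:
        lowered = chunk.lower()
        flagged = any(p in lowered for p in TRIGGER_PHRASES)
        yield chunk, flagged

stream = ["Okay, checking the water heater strap.",
          "Wait, I think I smell a gas leak near the stove."]
flags = [f for _, f in watch(stream)]
print(flags)  # → [False, True]
```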

Corn: I can see the headline now. Local man blows up kitchen after AI suggests using a scented candle to find the leak. So, how do we ground these things? Are people still using RAG, you know, retrieval augmented generation, with audio?

Herman: They are. You can still feed text-based documentation into these models as context. The model reads the manual for your specific brand of generator and then uses that knowledge to inform its audio responses. The magic is that it can translate the dry, technical text of a manual into a calm, spoken instruction that matches the urgency of the situation. This ties into what we discussed in episode fifteen forty-four about the inference era. We are spending so much on the hardware now because these models are not just retrieving data; they are performing a high-wire act of real-time translation and synthesis. The GPU overhead for this is massive. You are looking at needing high-bandwidth memory just to handle the audio token stream in real-time without the system choking.
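[Editor's note: a tiny sketch of the grounding step Herman describes, with a keyword-overlap retriever standing in for a real embedding index. The manual snippets are invented for illustration.]

```python
# Sketch of grounding a voice agent with retrieval: score manual
# snippets by keyword overlap with the user's (transcribed) question
# and feed the best match into the model's context. A real system
# would use an embedding index; these snippets are made up.

MANUAL = [
    "The main water shut-off valve is in the garage, next to the panel.",
    "To start the generator, open the fuel valve and set choke to full.",
    "Replace smoke detector batteries every twelve months.",
]

def retrieve(question: str, docs=MANUAL) -> str:
    q_words = set(question.lower().split())
    # Pick the snippet sharing the most words with the question.
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

context = retrieve("where is the water shut-off valve")
print(context)
```

The retrieved snippet becomes text context for the speech-to-speech model, which then renders it as a calm spoken instruction rather than improvising an answer.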

Corn: Speaking of hardware, let's get to Daniel’s third question about local and edge models. Because if the internet goes out during a storm, that thirty-two-dollar-per-million-token API in the cloud is about as useful as a screen door on a submarine. Can I run this stuff on my own gear?

Herman: This is where it gets really exciting for the DIY preparedness crowd. We have seen a massive push toward what I call the small-but-mighty models. If you have a decent consumer GPU with at least twelve gigabytes of video memory, you can run something like Orpheus TTS three-B, which was released late last year. It is a three-billion-parameter model that handles emotional, high-quality speech synthesis locally. But for the full pipeline, we are seeing people use things like Whisper-large-v-three-turbo for the listening part, paired with a quantized Llama-three-point-one-eight-B model for the thinking.

Corn: Three billion parameters sounds like a lot for a smart thermostat to handle.

Herman: It is, which is why we have things like Piper and the new CosyVoice two. Piper is the speed king for embedded devices. It can run ten times faster than real-time on a basic processor, like a Raspberry Pi five. It is not as soulful as the big cloud models, but for a checklist, it is perfect. And then there is the Synaptics Astra SR-eighty, which is a dedicated chip specifically designed for always-on, low-power edge AI audio. It handles the wake-word detection and the basic speech processing without ever needing to wake up a big, power-hungry server. We are even seeing high-end home hubs equipped with Jetson Orin modules that can run a fully local, sub-five-hundred-millisecond voice agent.

Corn: So the dream of a local, indestructible home assistant is actually becoming a reality. I could have a little box in my pantry that knows every emergency procedure, works without Wi-Fi, and doesn't charge me by the token.

Herman: That is the goal. For a preparedness use case, I would argue that a local-first fallback is mandatory. You want your high-intelligence cloud model for the complex planning, but you need a quantized, local model for the moment the fiber optic line gets clipped by a falling branch. The latency on a local model is often even better because you are not dealing with round-trip network jitter. You are talking about sub-one-hundred-millisecond response times. Imagine the AI responding before you even finish your sentence because it is running on a dedicated NPU three feet away from you.

Corn: Okay, so we have the tech and we have the models. But Daniel also asked about the platforms. I am a busy sloth, Herman. I do not want to spend my weekends configuring Python worker nodes and managing WebRTC signaling servers. If I want to build this checklist agent tomorrow, what am I actually using?

Herman: There are three main players right now, and they each serve a different type of builder. First, you have Vapi. It is spelled V-A-P-I and stands for Voice API. They are very developer-centric. It is a bring-your-own-model approach. They handle all the messy telephony and the WebRTC stuff, and they charge you about five cents a minute on top of whatever the model costs. It is very flexible if you want to swap between OpenAI, Groq, or your own custom models. They have a great dashboard for visualizing the latency of each step in your pipeline.

Corn: Five cents a minute feels like a fair price to pay to avoid having to learn how a phone call actually works. What else?

Herman: Then there is Retell AI. They are more of an all-in-one, low-code platform. They are known for being incredibly reliable and having great out-of-the-box handling for things like background noise and interruptions. They are doing over forty million calls a month now. They have a flat fee of about seven cents a minute. If you want something that just works and meets all the security standards like HIPAA, Retell is usually the choice for professionals. They have built-in state management, which is crucial for Daniel's checklist idea. It keeps track of which steps you have completed even if the conversation wanders.

Corn: And what if I am a cheapskate? Or, let's say, a highly motivated open-source enthusiast?

Herman: Then you look at LiveKit Agents. LiveKit is the infrastructure layer that a lot of these other companies are actually built on. It is open-source, and it is significantly cheaper—we are talking like half a cent per minute for the infrastructure—but you are responsible for hosting it. You have to manage the underlying servers. It is the most powerful option because you have total control over the pipeline. You can write custom logic in Python or Go to handle things like sensor data from your home. If your smart home detects a leak, LiveKit can trigger the agent to call you.

Corn: So Vapi for the experimenters, Retell for the pros, and LiveKit for the people who enjoy pain and low server costs. Got it. Now, let's talk about the actual application. Building a home preparedness checklist is not just about having the AI read a list. It has to be smart. If I tell the AI, I found the wrench, but it is rusted shut, it needs to be able to pivot. It needs to tell me to grab the WD-forty and move to the next step while I wait for the lubricant to work.

Herman: That is where the state management comes in. A good real-time agent needs to maintain a structured checklist in the background while having a fluid conversation in the foreground. This is why the native models are so much better for this. They can maintain the context of where you are in the house, what you are holding, and how much time has passed. In twenty twenty-six, we are seeing these agents use what we call multi-turn persistence. They don't just forget the last five minutes of the conversation because a new event happened. They can also handle non-linear inputs. You might jump from step two to step five because you happened to be standing near the electrical panel, and the AI should be able to adapt without getting confused.
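[Editor's note: the non-linear checklist state Herman describes can be sketched in a few lines. Steps can be completed in any order, and the remaining work is always queryable; the step names are invented for the example.]

```python
# Minimal non-linear checklist state: the user can jump between steps
# (step five before step three) and the agent can always report what
# remains, in the original order. Step names are illustrative.

class Checklist:
    def __init__(self, steps):
        self.steps = list(steps)
        self.done = set()

    def complete(self, step):
        if step in self.steps:
            self.done.add(step)

    def remaining(self):
        # Preserve original ordering so spoken instructions stay coherent.
        return [s for s in self.steps if s not in self.done]

plan = Checklist([
    "locate shut-off valve",
    "kill power to the basement",
    "move valuables upstairs",
    "call the plumber",
])
plan.complete("move valuables upstairs")   # user jumped ahead
plan.complete("locate shut-off valve")
print(plan.remaining())
```

In a real agent this structure sits behind the conversation: the model narrates fluidly in the foreground while tool calls update the checklist in the background.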

Corn: It is like having a really calm friend on the phone who also happens to have memorized the FEMA handbook and your specific home insurance policy. I think one thing people overlook is the psychological aspect. In a crisis, your brain does not work at full capacity. You get tunnel vision. You forget basic things. Having a voice that is not just fast, but sounds human and empathetic, actually lowers your cortisol levels. It helps you think more clearly.

Herman: There is actually research on that. The prosody of a voice can either escalate or de-escalate a situation. The old text-to-speech voices were often too flat or too chirpy, which can be incredibly grating when you are dealing with a flood. The native models can match your energy. If you are breathing hard and talking fast, the AI can adopt a firm, steady, and slightly slower tone to help ground you. It is a subtle form of biofeedback. We call this emotional mirroring, and it is a key feature of the latest omni-modal releases.

Corn: That is fascinating. It is the AI acting as a co-regulator for your nervous system. I bet that also helps with the barge-in issue. If the AI is being too chatty, you can just say, stop talking and tell me where the gas mask is, and it immediately snaps to attention.

Herman: One thing I should mention though, is a new hurdle that popped up just yesterday. The FCC's new mandate for SIP six-oh-three-plus response codes went into effect.

Corn: S-I-P six-oh-three-plus? Herman, you are losing me. Speak sloth to me.

Herman: Basically, it is a new rule for how phone companies handle AI-generated calls. If you are building a service that calls people to remind them to check their smoke detectors or to give them emergency alerts, the phone companies now have to provide much more transparency if they block that call. It is a win for legitimate services because it means your emergency bot won't get accidentally flagged as a telemarketer selling car warranties. But it also means there is more paperwork and technical overhead for the platforms like Vapi and Retell to stay compliant. It is all about distinguishing between helpful automation and spam.

Corn: It is always a cat-and-mouse game between the people trying to be helpful and the people trying to sell me a cruise I didn't win. So, if we are looking at the big picture for Daniel, it seems like the barrier to entry has never been lower, but the ceiling for what you can build has never been higher.

Herman: That is the perfect summary. We have moved from the digital sandwich to the ambient assistant. We are at a point where you can build a system that is low-latency, emotionally aware, and potentially local. If I were starting a project like this today, I would focus entirely on interruptibility. That is the gold standard. If your agent cannot handle being interrupted mid-sentence without losing its place in the checklist, it is not ready for a real-world emergency. You also need to decide on your buy-versus-build strategy. If you need to get to market in a week, use Retell. If you want to build a custom, local-first infrastructure that you own entirely, go with LiveKit.

Corn: And I would add that you should not over-engineer the voice. In a crisis, I do not need the AI to sound like a Hollywood star. I need it to be clear, concise, and incredibly fast. Speed is the ultimate feature. If I have to wait for the AI to finish a beautiful, poetic sentence about the nature of water damage while my basement is flooding, I am going to throw the device out the window.

Herman: Speed is a function of the entire stack. From the audio sampling rate—usually twenty-four kilohertz for high quality—to the way you handle the WebSocket connection. This is why we are seeing companies move away from Python for the real-time parts of the stack and toward languages like Rust or Go. Every millisecond you shave off the processing time is a millisecond the user isn't spending in a state of uncertainty. And remember, the bottleneck is rarely the speech-to-text anymore; it is the time-to-first-token in the generation phase.

Corn: It is a good reminder that even as these models get smarter, the basics of performance still matter. You can have the smartest model in the world, but if it is running on a slow connection or a bloated framework, it is useless in a pinch. We are also seeing a lot of innovation in the way these models handle background noise. If you are in a storm, there is wind, there is rain, maybe there is a siren in the distance. Traditional speech-to-text would often hallucinate those sounds into words.

Herman: The native models are much better at distinguishing between the user's voice and the environment. They can actually use the environment as context. They might hear the wind and ask, hey, should we check the shutters? Or they might hear the sound of a smoke alarm and prioritize that over whatever else you were talking about. This is the difference between a model that is just processing text and a model that is truly situated in your world.

Corn: That is the kind of proactive AI I can get behind. Not just waiting for me to ask a question, but actually observing the situation through the audio feed. It is a little creepy, but in an emergency, I will take all the help I can get. It is the ultimate expression of what we called the inference era in episode fifteen forty-four. We are no longer just asking computers to store our data. We are asking them to be our eyes and ears in the physical world.

Herman: And the cost of that is high, but the value in those critical moments is immeasurable. For Daniel, the takeaway is clear: the technology is ready. The challenge now is the implementation. How do you design a conversation that is helpful without being intrusive? How do you ensure the local fallback is just as capable as the cloud version? These are the engineering problems of twenty twenty-six.

Corn: So, to wrap this up for Daniel, if you are building that home preparedness agent, start with a managed platform like Vapi or Retell to get your logic right. Use a native model like GPT-realtime for the cloud version to get that sub-three-hundred-millisecond response time. But make sure you have a local fallback using something like Piper or a quantized Whisper model on an edge device like a Jetson Orin. And for the love of all that is holy, make sure it knows where the water shut-off valve is.

Herman: And maybe keep a manual wrench nearby, just in case. No matter how fast the AI is, it still cannot physically turn a valve for you. At least not until we get those humanoid robots we keep hearing about.

Corn: One crisis at a time, Herman. Always have a backup for the backup. That is the first rule of preparedness. This has been a deep dive into the guts of how we talk to machines, and honestly, I am feeling a lot better about the next time a pipe bursts in my basement. At least I know the AI will be able to hear me scream.

Herman: It will hear you, it will understand you, and it might even tell you to take a deep breath. Which, let's be honest, is something we all need from time to time. The transition from the digital sandwich to the ambient assistant is finally complete. We are building interfaces that adapt to humans instead of forcing humans to adapt to the interface.

Corn: Especially you, Herman Poppleberry. You get a little worked up about those token costs. I will drink to that. Preferably a drink that isn't made of basement flood water.

Herman: Good call. Someone has to keep an eye on the bottom line while you are busy being a cheeky sloth.

Corn: Fair enough. Well, we have covered the costs, the architecture, the local edge models, and the platforms. I think Daniel has plenty to chew on for his next project. It is amazing to see how far we have come from those awkward days of shouting at our phones.

Herman: We are finally in the era of true real-time interaction. It is a good time to be a builder.

Corn: Alright, we should probably wrap this up before Herman starts reading the FCC mandate line by line. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a huge thank you to Modal for providing the GPU credits that power this show and allow us to explore these deep technical topics every week.

Herman: If you want to dive deeper into the archives or see the full transcript for this episode, head over to myweirdprompts dot com. You can search for all our past episodes there, including the ones we mentioned today about the inference era and omni-modal audio.

Corn: And if you are enjoying the show, do us a favor and leave a review on your favorite podcast app. It really does help other people find the show, and it makes Herman feel like all those hours spent reading research papers were worth it.

Herman: They are always worth it. The data never sleeps.

Corn: We will see you next time on My Weird Prompts. Stay safe out there, and maybe go check your smoke detectors.

Herman: Goodbye everyone.

Corn: Later.