I Built Real-Time AI Conversations for NPCs. Here's Why It's the Wrong Architecture.
Two AI agents, talking to each other in real-time. They don't know what the other one will say. There's no script. It works. And it's also completely the wrong way to build this for a game.
The video version · same thesis, looser edits
In the last update to our Daily Dream project, my daughter and I built a character designer, pivoting to a robust procedural SVG system along the way. The characters looked distinct. They looked like a cast. But they didn’t feel like a cast yet.
So, I set out to make them talk. Not just pre-recorded barks, but genuine, dynamic dialogue between characters.
The v0.5 Premise: Live Bidirectional AI
I wanted to see if two NPCs could autonomously hold a conversation. Using the Gemini Live API, I set up a real-time, low-latency, bidirectional audio stream.
The architecture involved a Python FastAPI server with WebSockets acting as the middleman. Two Mii instances were spawned, each armed with their own personality prompt, streaming audio back and forth. I even wired up the Web Audio API to analyze the audio stream in real-time to drive synchronized mouth animations.
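The core of that middleman is just a relay: whatever one agent speaks gets forwarded into the other agent's input stream. Here is a minimal sketch of the pattern using in-memory queues as stand-ins; the real server forwards chunks over WebSockets to and from the Gemini Live API, and the names are illustrative, not the project's actual code:

```python
import asyncio

async def relay(src: asyncio.Queue, dst: asyncio.Queue) -> None:
    """Forward audio chunks from one agent's outgoing stream to the other
    agent's incoming stream until an end-of-turn sentinel (None) arrives."""
    while True:
        chunk = await src.get()
        await dst.put(chunk)
        if chunk is None:  # the speaking agent finished its turn
            return

async def demo() -> list:
    # Stand-ins for the two live API streams; real code would read from
    # and write to WebSocket connections instead of local queues.
    yoku_out: asyncio.Queue = asyncio.Queue()
    koji_in: asyncio.Queue = asyncio.Queue()
    for chunk in (b"tu", b"na", b"!", None):  # Yoku "speaks", then yields
        await yoku_out.put(chunk)
    await relay(yoku_out, koji_in)
    return [koji_in.get_nowait() for _ in range(koji_in.qsize())]

print(asyncio.run(demo()))  # [b'tu', b'na', b'!', None]
```

In the real thing, each direction runs this loop concurrently, which is exactly what makes the stream bidirectional.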
After a few tuning iterations—realizing that personality prompts have to be strong and opposed to create interesting dialogue, rather than just polite AI agreement—it started working. I had Yoku (a cat persona) and Koji (a dog persona) fiercely debating whether tuna or chicken was superior.
It was an impressive tech demo. And as an architect, it made me realize exactly why I was going to throw most of it away for the final game.
The Architect’s Critique: Four Fatal Flaws
The live AI-to-AI demo works beautifully for a tech showcase, but it is fundamentally the wrong architecture for a production game environment. Here are the four reasons why:
1. Cost
Every second of conversation requires two live API streams running in parallel. For a game where dialogue happens dozens of times across ten or more characters per session, this is economically unsustainable. A pre-generated script, by comparison, costs orders of magnitude less—you pay once to generate it, and then it’s effectively free to replay.
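A back-of-envelope comparison makes the gap concrete. The rates below are placeholders, not real Gemini pricing:

```python
def live_cost(dialogue_minutes: float, rate_per_stream_minute: float,
              streams: int = 2) -> float:
    """Live AI-to-AI dialogue: every minute is billed on both streams."""
    return dialogue_minutes * rate_per_stream_minute * streams

def scripted_cost(generation_cost: float, replays: int) -> float:
    """Pre-generated script: a one-time cost amortized over every replay."""
    return generation_cost / replays

# Illustrative numbers only: 30 min of dialogue per session at $0.10 per
# stream-minute, vs. a $0.50 one-time generation replayed 1,000 times.
print(live_cost(30, 0.10))        # ~$6 per session, every session
print(scripted_cost(0.50, 1000))  # a twentieth of a cent per replay
```

Whatever the real rates are, the shape of the curve is the same: live cost scales with playtime, scripted cost scales with nothing.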
2. Control
Real-time conversation means you cannot predict, edit, or test what the characters will say. In game design, you want the emotional beats to land. You want a joke to hit properly. Live AI is a wild animal; scripted AI is a director’s tool. If an agent goes off-tone during a live session, you can’t fix it without re-rolling the entire conversation.
3. Latency
Live streams inherently have unpredictable timing. Game dialogue needs to fit tightly into animation windows and scene timings. Pre-generated audio files have exact, known durations, ensuring your scene transitions are perfect. Live audio simply doesn’t guarantee that sync.
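Known durations are what make tight scheduling possible at all. A minimal cue-sheet sketch (the line durations here are made up):

```python
def build_cue_sheet(lines):
    """Turn [(speaker, duration_seconds), ...] into [(start_time, speaker), ...]
    plus the exact total scene length -- only possible because pre-generated
    audio files have known durations."""
    t, cues = 0.0, []
    for speaker, duration in lines:
        cues.append((round(t, 2), speaker))
        t += duration
    return cues, round(t, 2)

cues, total = build_cue_sheet([("Yoku", 2.4), ("Koji", 3.1), ("Yoku", 1.5)])
print(cues)   # [(0.0, 'Yoku'), (2.4, 'Koji'), (5.5, 'Yoku')]
print(total)  # 7.0
```

With a live stream, neither the start times nor the total exists until the audio has already arrived, so every animation window and scene transition has to guess.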
4. Repeatability
If a player triggers the exact same scene twice, live AI will generate two different conversations. While this sounds cool for novelty, it’s terrible for game feel. Players want a world that is consistent and memorable, not a slot machine that changes fundamental character interactions on a whim.
The “Kid Director” Approach
Ultimately, live AI-to-AI is a tech demo, not a feature. The right architecture for a game is actually much more boring: an LLM generates a script ahead of time, a second LLM pass validates it, TTS turns the lines into audio, and the game caches everything aggressively.
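That boring pipeline can be sketched with stubbed stages. The stage functions, cache scheme, and dialogue lines below are illustrative placeholders, not the project's actual code; the real versions would call a text model, a validator model, and a TTS service:

```python
import hashlib

CACHE: dict[str, list[dict]] = {}  # scene_key -> validated script

def scene_key(scene_id: str, prompt: str) -> str:
    """Same scene + same prompt -> same key, so a script is generated once
    and replayed for free."""
    return hashlib.sha256(f"{scene_id}:{prompt}".encode()).hexdigest()

def generate_script(prompt: str) -> list[dict]:
    """Stub for the first LLM pass (the real version calls a text model)."""
    return [{"speaker": "Yoku", "line": "Tuna. Obviously."},
            {"speaker": "Koji", "line": "Chicken. I will die on this hill."}]

def validate_script(script: list[dict]) -> list[dict]:
    """Stub for the second LLM pass; here it just enforces basic shape."""
    return [t for t in script if t.get("speaker") and t.get("line")]

def get_dialogue(scene_id: str, prompt: str) -> list[dict]:
    key = scene_key(scene_id, prompt)
    if key not in CACHE:  # generate and validate once...
        CACHE[key] = validate_script(generate_script(prompt))
    return CACHE[key]     # ...replay forever

first = get_dialogue("cafeteria_debate", "Yoku vs Koji, tuna vs chicken")
second = get_dialogue("cafeteria_debate", "Yoku vs Koji, tuna vs chicken")
print(first is second)  # True: the same scene twice is the exact same script
```

A TTS stage would hang off the same cache key, writing audio files whose durations are then known before the scene ever plays.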
But there is an even more important reason to use scripted dialogue: human direction.
When the script is generated ahead of time, my 13-year-old daughter can read it before the characters perform it. She can tell me that Yoku is being too mean, or Koji isn’t funny enough. And here’s the magic—she isn’t proofreading; she’s directing.
She tells me, “Make Yoku more sarcastic and obsessed with naps.” I update the prompt, regenerate the script, and the characters perform the new lines. Live AI is the model improvising. Scripted AI is the model performing—and someone gets to sit in the director’s chair.
In our next iteration, we’ll build that exact scripted pipeline and see how much better our cast performs under proper direction.