The Hierarchy of Needs for Voice AI
Every week we talk to teams building voice AI. Smart teams, well-funded teams, teams with great engineers. And almost every one of them is working on the wrong thing.
They'll tell us about the sophisticated multi-turn memory system they're building while their P95 call latency is 900ms. They'll walk us through a beautifully crafted persona while their bot silently drops 8% of calls due to telephony edge cases. They'll show off a demo of their bot handling an angry customer gracefully, then admit they haven't tested it on a real SIP trunk yet.
This is a failure of prioritization. It's not stupidity — it's the natural outcome of building voice AI without a coherent mental model of what matters and when.
The Six Levels
We think about production voice AI as a strict hierarchy. You cannot skip levels. Progress at a higher level is meaningless if a lower level is broken.
**Level 1: Functional.** Does the call complete without crashing? Can the bot understand speech input and produce speech output? Does it handle silence, background noise, accents, and cross-talk without falling apart? This sounds like table stakes, but you'd be surprised how many production bots fail here on the long tail of real calls.
**Level 2: Low-latency.** Is the interaction fast enough to feel like a conversation? The human threshold for an uncomfortable pause is around 500ms. P95 latency under 400ms is a hard requirement for any voice bot that users won't immediately hang up on. Most AI voice stacks, assembled naively from off-the-shelf components, don't clear this bar.
**Level 3: Natural.** Does the conversation feel human? Does the bot speak in a way that matches the register of the domain? Does it recover gracefully from user confusion? This is where most teams start — and it's the third level, not the first.
**Level 4: Robust.** Does the bot handle adversarial input? Can it recover from mid-call crashes? Does it behave predictably when users go off-script? This is where edge-case coverage, error handling, and stress testing live.
**Level 5: Ergonomic.** Is the bot pleasant to interact with? Does it avoid asking for information it already has? Does it remember context from earlier in the call? Does it respect users' time? This is the level of craft and polish.
**Level 6: Complete.** Does the bot accomplish everything it's supposed to accomplish — across all the call types, all the user populations, all the system integrations that the business needs? This is where features live.
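To make the Level 2 bar concrete, here is a minimal sketch of checking a batch of measured response latencies against the 400ms P95 threshold. It assumes latency is measured per turn in milliseconds (for example, from the end of user speech to the first bot audio; that definition is our assumption here) and uses the nearest-rank percentile method. The names `p95` and `meets_latency_bar` are hypothetical, chosen for illustration.

```python
# Sketch only: the latency definition and function names are assumptions.
LATENCY_BUDGET_MS = 400  # the P95 bar stated for Level 2


def p95(samples_ms):
    """Return the nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Nearest-rank: smallest 1-based rank covering at least 95% of samples.
    # Integer arithmetic for ceil(0.95 * n) avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_bar(samples_ms, budget_ms=LATENCY_BUDGET_MS):
    """True when the P95 latency is strictly under the budget."""
    return p95(samples_ms) < budget_ms
```

With 18 turns at 200ms and two slow turns at 900ms and 950ms, `p95` returns 900 and the check fails, even though the median looks healthy. That is the reason to track the tail rather than the average.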
Why You Can't Skip Levels
The temptation to skip to naturalness or completeness is understandable. Those are the things that show up in demos. A smooth, natural voice interaction is immediately impressive. A robust telephony stack is invisible.
But here's the problem: a natural bot that drops 5% of calls is a bot that fails 1 in 20 customers. A feature-complete bot with 800ms latency is a bot that people hang up on. Worse, the failures at lower levels corrupt your ability to evaluate higher levels at all — if your bot crashes under load, you have no idea whether your naturalness improvements are actually working in production.
There's also a resource allocation problem. Naturalness optimizations are expensive and slow-moving. Getting from "good" to "great" on naturalness might take months of prompt engineering and model fine-tuning. Getting from 400ms to 200ms latency might take a week of infrastructure work. If latency is broken, that week delivers far more user-facing value than months of naturalness work.
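The strict ordering argued for in this section can be sketched as a simple gate: given a pass/fail result per level, find the lowest level that is failing and work there first. This is an illustrative sketch, not a description of how Guava implements anything. The `lowest_failing_level` helper is hypothetical, and how each level's pass/fail result is produced is left to your own measurements.

```python
from typing import Dict, Optional

# The six levels, lowest first, as named in this post.
LEVELS = ["Functional", "Low-latency", "Natural", "Robust", "Ergonomic", "Complete"]


def lowest_failing_level(results: Dict[str, bool]) -> Optional[str]:
    """Return the first level, bottom-up, that is not known to pass.

    A level missing from `results` counts as failing: an unmeasured
    level cannot be assumed healthy. Returns None when all levels pass.
    """
    for name in LEVELS:
        if not results.get(name, False):
            return name
    return None
```

For example, `lowest_failing_level({"Functional": True, "Low-latency": False, "Natural": True})` returns `"Low-latency"`: the passing result for naturalness does not change where the work should go next.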
Guava's Position
We built Guava with this hierarchy baked in. Our obsession with latency comes before our work on naturalness. Our telephony reliability work comes before our ergonomics features. This isn't modesty about what we can build — it's a deliberate prioritization of what matters.
The checklist architecture is itself a manifestation of the hierarchy. By forcing structure at Level 1 (the bot always knows what it's trying to do), we ensure that natural conversation (Level 3) can operate within a reliable frame. The bot can be natural about *how* it collects a field because the *what* is determined by the structure.
When teams come to us with a voice AI problem, the first question we ask isn't "what do you want the bot to say?" It's "does your call stack currently complete calls reliably?" Start at the bottom. The top will follow.