Voice AI breaks every prompting instinct you've built for text. The tricks that work in ChatGPT — numbered lists, markdown formatting, structured responses — actively hurt you in a voice interface. A caller doesn't see bullet points. They hear "asterisk asterisk important asterisk asterisk" and hang up.
I've built and deployed voice agents on VAPI, ElevenLabs Conversational AI, and Twilio. The failure modes are consistent across platforms, and most of them trace back to system prompts that were written for text models and never adapted for speech. This post covers exactly what to change.
Why voice prompting is fundamentally different
Text interfaces are asynchronous and visual. The user reads at their own pace, scrolls, re-reads. Voice is synchronous and linear. A caller hears one word after another, in real time, with no rewind button.
That creates four constraints that don't exist in text prompting:
No markdown. Your LLM will confidently generate **bold text**, - bullet points, and code blocks. The TTS engine reads all of it aloud. Design every sentence to sound natural when spoken.
Latency sensitivity. Most voice AI pipelines have 500ms–1.5s of latency between the caller finishing a sentence and the agent starting to respond. Long first tokens hurt. Prompts that produce verbose preambles ("Certainly! That's a great question. Let me look into that for you...") feel broken on the phone because the silence before the response is already uncomfortable.
Turn-taking is explicit. In text, a user clicks send when they're done. In voice, the VAD (voice activity detection) decides when someone has finished speaking. Your agent needs to signal when it's done and waiting for a response, or the caller will keep talking, thinking the agent hasn't answered yet.
Interruptions happen constantly. Callers talk over agents. Real people do this even with other humans. Your system prompt needs to handle graceful interruption recovery — not freeze, not repeat the full sentence, not apologize three times.
Core principles for voice system prompts
Write in speech, not prose
Every sentence in your system prompt should read naturally if spoken aloud. Test this: read your prompt out loud. If you stumble, the TTS will too.
Bad (text-optimized):
When a user asks about pricing, provide the following information:
- Basic plan: $29/month
- Pro plan: $99/month
- Enterprise: contact sales
Good (voice-optimized):
When someone asks about pricing, tell them we have three plans. The Basic plan is
twenty-nine dollars a month. The Pro plan is ninety-nine dollars a month. For
Enterprise pricing, offer to connect them with the sales team.
Notice the differences: numbers spelled out, no lists, instructions framed as what to say rather than what to show.
Spell out numbers and acronyms
TTS engines are inconsistent with numerals. "$29" might be read as "twenty-nine dollars" or "dollar twenty-nine" depending on the engine and context. "API" might be "ay-pee-eye" or "ap-ee". Don't leave it to chance.
Explicit is always better:
- $29 → twenty-nine dollars
- API → A-P-I or the API
- SaaS → sass (phonetically fine) or spell it out
- URL → U-R-L or "the link"
- Phone numbers → five-five-five, eight-six-seven, five-three-oh-nine is clearer than digits
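If you maintain several prompts, a quick lint pass catches raw numerals and acronyms before the TTS engine guesses at them. This is a hypothetical helper, not part of any platform SDK — the regexes are deliberately crude and will flag false positives, which is fine for a pre-deployment review:

```python
import re

# Hypothetical lint helper: flag raw numerals, currency, and all-caps
# acronyms in a voice prompt so they can be spelled out by hand.
def lint_voice_prompt(prompt: str) -> list[str]:
    issues = []
    # Numerals and currency amounts, e.g. "$29" or "14".
    for match in re.finditer(r"\$?\d[\d,\.]*", prompt):
        issues.append(f"numeral: {match.group()}")
    # Runs of 2-5 capital letters, e.g. "API" or "URL".
    for match in re.finditer(r"\b[A-Z]{2,5}\b", prompt):
        issues.append(f"acronym: {match.group()}")
    return issues

lint_voice_prompt("The Basic plan is $29/month via our API.")
# → ["numeral: $29", "acronym: API"]
```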
Give explicit turn signals
At the end of a turn, your agent should signal it's waiting. This prevents awkward silence where the caller isn't sure if the agent finished.
Add instructions like:
End each response with a clear question or invitation to respond.
Do not end on a statement and then wait silently.
After providing information, ask "Does that help?" or "What else can I help you with?"
Keep responses short
The sweet spot for a single voice turn is 1–3 sentences. If you need to convey more, break it across turns with questions in between.
Keep each response under 40 words unless the caller explicitly asks for more detail.
If you need to share multiple pieces of information, chunk it: share one piece,
then ask if they want to continue.
VAPI: system prompt structure and tool calling
VAPI is the most feature-complete hosted voice AI platform right now. It handles the full stack: telephony, VAD, STT, LLM orchestration, TTS, and function calling. You interact with it through a JSON assistant config and a system prompt.
VAPI's system prompt field is just a string passed to the LLM as the system message. But VAPI injects additional context around it — call metadata, tool definitions, and its own instructions for call control.
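In practice the prompt sits inside the assistant config alongside the model, voice, and first-message settings. A minimal sketch of that config as a Python dict — field names reflect VAPI's API at the time of writing, so treat them as assumptions and verify against the current docs:

```python
# Sketch of a VAPI assistant config. Field names ("firstMessage",
# "model.messages", "voice") are assumptions -- check VAPI's API docs.
assistant = {
    "name": "Support Agent",
    "firstMessage": "Hi, thanks for calling Acme support. What can I help you with?",
    "model": {
        "provider": "openai",
        "model": "gpt-4o",
        # The system prompt is just a string in the system message.
        "messages": [
            {
                "role": "system",
                "content": "You are Aria, an AI customer support agent...",
            }
        ],
    },
    "voice": {"provider": "11labs", "voiceId": "your-voice-id"},
}
```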
VAPI tool calling
VAPI exposes built-in functions you can reference in your prompt:
- transferCall — transfer to a phone number or SIP endpoint
- endCall — hang up
- dtmf — send dial tones
- Custom tools you define in the assistant config
Tell the model explicitly when to use these:
If the caller asks to speak with a human or says "agent" or "representative",
use the transferCall function to transfer to +15551234567.
If the caller becomes abusive or uses threatening language, use the endCall
function immediately without explanation.
If the caller does not respond after two prompts, use the endCall function and say
"It sounds like you may have stepped away. Feel free to call back anytime. Goodbye."
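Custom tools live in the assistant config using an OpenAI-style function schema. The wrapper fields here ("type", "server") and the ticket tool itself are illustrative assumptions — verify the shape against VAPI's current tool docs:

```python
# Hypothetical custom tool definition for the VAPI assistant config.
# The inner "function" block mirrors OpenAI function calling; the
# wrapper fields are assumptions -- check VAPI's docs before use.
create_ticket_tool = {
    "type": "function",
    "function": {
        "name": "createSupportTicket",
        "description": "Create a support ticket for the caller.",
        "parameters": {
            "type": "object",
            "properties": {
                "summary": {
                    "type": "string",
                    "description": "One-sentence summary of the issue",
                },
                "callbackNumber": {
                    "type": "string",
                    "description": "Caller's phone number for follow-up",
                },
            },
            "required": ["summary"],
        },
    },
    # VAPI POSTs tool calls to your server; this URL is a placeholder.
    "server": {"url": "https://example.com/vapi/tools"},
}
```

Once defined, you reference the tool by name in the prompt the same way as the built-ins: "If the issue requires account changes, use the createSupportTicket function."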
Voicemail detection
VAPI has built-in voicemail detection. When it detects a voicemail greeting, it can trigger a different prompt path. Add a voicemail message instruction:
If you reach voicemail, leave the following message and then end the call:
"Hi, this is Alex calling from Acme Software. I'm reaching out about your recent
support ticket. Please call us back at 1-800-555-0100 at your convenience.
Have a great day."
After leaving the message, use the endCall function.
VAPI prompt structure
A complete VAPI system prompt follows this structure:
[Identity and role — 2-3 sentences]
[What you can and cannot help with — explicit scope]
[Tone and speaking style instructions]
[Key information the agent needs — product details, FAQs, etc.]
[Tool usage instructions — when to transfer, end call, etc.]
[Edge case handling — abusive callers, silence, voicemail]
ElevenLabs Conversational AI: first_message and pronunciation guides
ElevenLabs Conversational AI is voice-first in a way VAPI isn't. The primary interface is a WebSocket API and a UI widget, making it well-suited for web-based voice chat and embedded voice interfaces in apps.
The first_message field
ElevenLabs has a separate first_message field in the agent config, distinct from the system prompt. This is the agent's opening line, spoken immediately when the call connects. It's not generated by the LLM — it's rendered directly by TTS.
Keep it short and natural:
"Hi, thanks for calling Acme support. I'm an AI assistant. What can I help you with today?"
Never make the first message a question that requires a yes/no answer, because callers often say "hello?" first and you'll collide with your own VAD.
Dynamic variables
ElevenLabs supports dynamic_variables in prompts, which you inject at session creation. Useful for personalizing calls:
You are a customer support agent for {{company_name}}.
The caller's name is {{caller_name}} and their account ID is {{account_id}}.
Their current plan is {{current_plan}}.
Greet them by name when addressing them for the first time.
You pass these at session creation through the API.
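Concretely, the variables ride along in the conversation-initiation message sent over the WebSocket when the session opens. A sketch of that payload — the message type and field names are assumptions based on the ElevenLabs docs, so verify before relying on them:

```python
import json

# Sketch of the session-initiation payload for ElevenLabs
# Conversational AI over WebSocket. Message type and field names
# are assumptions -- verify against the current ElevenLabs docs.
init_message = {
    "type": "conversation_initiation_client_data",
    "dynamic_variables": {
        "company_name": "Clearbase",
        "caller_name": "Sam",
        "account_id": "acct_1234",
        "current_plan": "Pro",
    },
}

# Serialized and sent as the first message after the socket opens.
payload = json.dumps(init_message)
```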
Pronunciation guides
ElevenLabs has a pronunciation dictionary feature. Use it aggressively for brand names, technical terms, and anything the default TTS butchers. For things you can't add to the dictionary, handle it in the prompt:
The product is called "Kwil" — spelled K-W-I-L but pronounced like "quill."
When mentioning the product name, say "Kwil" naturally as one word.
Twilio + LLM: the DIY approach
Twilio doesn't give you an LLM — it gives you telephony primitives. Twilio Voice handles the call, Twilio Media Streams gives you real-time audio over WebSockets, and you bring your own STT, LLM, and TTS pipeline.
This approach is more work but gives you complete control. The system prompt runs in your own LLM call, so standard prompting rules apply — with the same voice-specific constraints. Your application code decides when to send audio to the caller and when to listen.
You are a voice AI assistant. You will receive transcribed text from callers.
Respond in plain spoken English only. No markdown, no lists, no formatting.
Keep responses under 30 words unless the caller asks for a detailed explanation.
Always end your turn with a question or clear invitation to respond.
In a Twilio pipeline, you're building each piece yourself: the WebSocket handler, the STT integration (Deepgram works well), the LLM call, and the TTS output back to Twilio.
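The WebSocket side reduces to handling a few JSON frame types. This sketch follows Twilio's Media Streams message shapes ("start", "media", "stop" events, base64 mulaw payloads); the STT/LLM/TTS calls it gestures at are placeholders, not real integrations:

```python
import base64
import json

# Minimal sketch of a Twilio Media Streams frame handler. Event names
# and the base64 payload field follow Twilio's Media Streams protocol;
# the downstream STT/LLM/TTS steps are placeholders.
def handle_frame(frame: str, audio_buffer: bytearray) -> str:
    msg = json.loads(frame)
    event = msg["event"]
    if event == "start":
        # Call metadata arrives here (stream SID, codec, etc.).
        return "started"
    if event == "media":
        # Payload is base64-encoded 8 kHz mulaw audio.
        audio_buffer.extend(base64.b64decode(msg["media"]["payload"]))
        return "buffered"
    if event == "stop":
        # Real pipeline: flush buffer to STT, call the LLM with the
        # voice system prompt, stream TTS audio back as "media" frames.
        return "stopped"
    return "ignored"
```

In production you'd stream audio to STT continuously rather than waiting for "stop" — that's where the latency budget from earlier gets spent.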
Full example system prompts
Inbound customer support agent (SaaS)
You are Aria, an AI customer support agent for Clearbase, a database management platform.
You help customers with: billing questions, account access issues, feature questions,
and bug reports.
You cannot: access customer data directly, process refunds, or make account changes.
For anything requiring account changes, offer to create a support ticket or transfer
to a human agent.
Speaking style:
- Speak naturally, like a knowledgeable colleague
- Keep each response to 1-3 sentences
- Never say "certainly", "absolutely", or "of course"
- Don't apologize more than once for the same issue
- End each turn with a question
Key information:
- Support hours are Monday through Friday, 9 AM to 6 PM Eastern
- Emergency support for outages is available 24/7 at our status page: status.clearbase.io
- Billing cycles run on the first of each month
- Free trial is 14 days, no credit card required
If the caller asks to speak with a human, use the transferCall function.
If there is no response after two consecutive prompts, say "I haven't heard anything —
feel free to call back when you're ready. Goodbye." and use the endCall function.
Outbound appointment reminder agent
You are a scheduling assistant calling on behalf of Bright Dental.
You are calling to confirm an upcoming appointment.
The patient's name is {{patient_name}}.
Their appointment is on {{appointment_date}} at {{appointment_time}} with
{{provider_name}}.
The clinic address is {{clinic_address}}.
Your goal: confirm whether the patient will attend, reschedule if needed,
or cancel if requested.
Script flow:
1. Introduce yourself and state the reason for calling
2. Ask if they can confirm their appointment
3. If yes: confirm the details and end the call warmly
4. If they need to reschedule: collect their preferred date and time range,
then say you'll have the team call back to finalize — use the endCall
function after collecting this
5. If they want to cancel: acknowledge, thank them, and use the endCall function
Keep every response under 25 words.
Do not repeat information they've already confirmed.
If you reach voicemail, leave a brief message with the appointment details and
callback number: 555-234-5678, then end the call.
Lead qualification agent
You are Jordan, an AI assistant for Meridian, a commercial real estate platform.
You are calling leads who recently requested information about our platform.
Your goal is to qualify the lead by learning:
1. Company size (number of employees or square footage they manage)
2. Their current tool or process for managing properties
3. Timeline — are they actively evaluating solutions now, or researching for later?
4. Decision-making role — are they the decision maker?
Rules:
- Ask one question at a time
- Never read the list out loud — discover this information through natural conversation
- If they seem interested and meet basic criteria (managing more than 10 properties or
50 employees), offer to schedule a demo call using the scheduleDemo function
- If they are not a fit, thank them politely and use the endCall function
- Maximum call length is 5 minutes — if the conversation runs long, wrap up gracefully
Do not discuss pricing on this call. If asked, say "Our team will go through all the
details on the demo call — I'd hate to give you incomplete information."
Handling interruptions in prompts
Tell the model how to handle interruptions:
If your previous response was interrupted, do not repeat what you said.
Acknowledge the interruption briefly if appropriate ("Sure, go ahead") and respond
to what the caller just said.
Never say "As I was saying" — just continue naturally.
Also handle confusion recovery:
If you don't understand what the caller said, ask one clarifying question.
Do not ask for clarification more than twice in a row.
If still unclear after two attempts, offer to transfer to a human agent.
Voice-specific failure modes
Model rambling. The LLM generates 150 words when 20 would do. Fix this with an explicit word limit. "Keep responses under 30 words" works better than "be concise."
Over-apologizing. "I'm so sorry to hear that. I apologize for any inconvenience. I'm sorry you're experiencing this." Three apologies in four sentences destroys caller confidence. One acknowledgment, then solve the problem.
Unnatural phrasing. "I'd be happy to assist you with that today!" Nobody talks like this. Write your prompt examples the way a competent human colleague would actually speak.
Silence black holes. The agent asks a question, the caller doesn't respond, and nothing happens. Always include explicit silence handling with a timeout response and fallback to endCall.
Hallucinated phone numbers. If the agent needs to give a phone number or URL, hard-code it in the system prompt. Don't let the model guess. A hallucinated support number is a support disaster.
Testing voice agents before launch
Don't test by reading the transcript. Call the agent. Call it with bad audio, call it while eating, call it and interrupt constantly. The transcript looks fine in 90% of failure cases — the audio experience is what breaks.
Specific test scenarios every voice agent should pass:
- Say nothing for 10 seconds after the agent asks a question
- Interrupt the agent mid-sentence three times in a row
- Ask something completely outside the agent's scope
- Give a name with an unusual pronunciation
- Say "I want to talk to a real person" at different points in the call
- Call from a noisy environment
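To run these scenarios against a real audio path repeatedly, it helps to script the test call itself. A hypothetical smoke-test helper that places an outbound call through VAPI's REST API — the endpoint and body fields are assumptions, so check the current API reference:

```python
import json
import urllib.request

# Hypothetical test-call helper. Endpoint URL and body fields
# ("assistantId", "customer.number") are assumptions -- verify
# against VAPI's current API reference before use.
def place_test_call(api_key: str, assistant_id: str, to_number: str) -> urllib.request.Request:
    body = {
        "assistantId": assistant_id,
        "customer": {"number": to_number},
    }
    return urllib.request.Request(
        "https://api.vapi.ai/call",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Send with urllib.request.urlopen(...) once the config is verified.
req = place_test_call("your-api-key", "your-assistant-id", "+15550000000")
```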
Everything that makes a good text system prompt still applies; voice adds constraints that don't exist in text. The platforms are improving fast, but the model still needs explicit instructions. Get the prompt right and voice agents can be genuinely good. Get only the platform right and callers hang up.



