The Anatomy of Great Speech
- Julie Ask

- Oct 10
Why care about speech quality (i.e., comprehension)?
The use of conversational interfaces - voice and text - is on the upswing among both consumers and employees. Just two years ago, my research at Forrester showed that less than 20% of US online adults were comfortable using voice to communicate with brands or to transact; comfort with chat for the same tasks was less than 30%. With the explosion in adoption and usage of conversational interfaces, there is no doubt that consumers are getting more comfortable using their words to do stuff.
Sounds simple, but if we are going to use conversational interfaces to “do stuff” like code, draw pictures, do analysis, and more, then we need to think about speech - and what makes it understandable.
Backdrop - Why am I writing this blog today?
I attended a high-quality event this summer. The speakers had amazing stories to tell and experiences to share, but their deliveries fell broadly into two categories: 1) easy to understand and engaging, or 2) flat and difficult to follow despite the quality of the content. I finally realized that some of the presenters were reading essays from a teleprompter. Two things stood out about my comprehension difficulties: 1) it is hard to concentrate on someone talking like a machine for 20+ minutes, and 2) written essays use vocabulary and sentence structure very different from the spoken word.
How a human speaks certainly impacts comprehension. Speakers might use vocabulary that we don’t understand. They may also articulate words poorly or make errors (e.g., “wabbit” for “rabbit”), speak too quickly or mumble, misplace stress on syllables or words (which in English can change the meaning), phrase groups of words poorly, or have too many speech disfluencies (i.e., interruptions in the flow of speech). Finally, muscle weakness, neurological conditions, hearing impairment, and more can affect one’s speech.
None of these issues was in play during these presentations. The humans sounded like machines, so I wanted to dig into synthetic voices - an interface that relatively few consumers use today beyond retrieving simple answers to questions, but one we expect to see used more often.
What I Learned About Voice And Comprehension
I called a friend of mine (a linguist with voice assistant expertise from his work at both Google and Meta) to ask him some questions and get some direction on where to look … and then turned to Perplexity to fill in some blanks. We focused our conversations on synthetic speech. I learned that speech comprehension is not just passive decoding of words - it is an active predictive process that depends heavily on biology, memory, culture, Gricean Maxims, pronunciations, and more.
First, I learned a lot about the paradoxical role disfluencies play in comprehension. Researchers estimate that there are at least six disfluencies for every 100 words of speech. Disfluencies such as pauses or filler words give the listener time to process what has been said and to predict what comes next. Here are a few highlights that really stood out:
Moderate speech disfluencies aid our listening comprehension. Disfluencies (i.e., hesitations, fillers, or repairs such as “um” or “uh”) can increase our comprehension. Interestingly, speakers use “um” and “uh” differently, with “uh” signaling a shorter delay than “um.” It turns out listeners lean in when they hear “uh,” and as a result they can process the ensuing information more quickly (by about 120 ms). Excessive disfluencies (more than 1 per 10 words) do the opposite. It is ironic that so many of us practice speaking with no filler words. (A small sketch after these highlights shows how such rates, repairs, and pauses might be checked in a transcript.)
Syntactic bracketing (compared to other tactics) increases referential accuracy by 27%. Machines use this tactic (one of many) to parse speech and understand intent or meaning. As humans, we sometimes speak in perfectly formulated sentences the first time we utter a phrase; when we do, we are easy for a machine to understand. More often than not, though, our brains get ahead of our speech (i.e., our brains work faster than our lungs can support speech) and we correct ourselves mid-sentence. For example, in the phrase “you said you would … no I should feed Diego,” the machine can bracket two separate phrases - the abandoned start and the repair - and interpret their meanings.
Attention modulation - disfluent pauses (500-800 ms) improve word encoding. Attention modulation refers to the brain’s ability to focus on specific words and process them. This comes into play in noisy environments or when speakers pause while talking. A listener’s ability to process speech can be affected by the length of a pause.
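To make those numbers concrete, here is a small Python sketch of my own - an illustration only, not anything from the research or from a real speech product. It assumes a hypothetical word-level transcript with timestamps and checks the three things above: the filler rate, a crude cue for a self-repair, and the length of pauses between words.

```python
# Toy analysis of a hypothetical timestamped transcript: (word, start_s, end_s).
FILLERS = {"um", "uh", "er"}
EDIT_TERMS = {"no", "i mean", "sorry"}  # crude cues that a self-repair is starting

transcript = [
    ("you", 0.00, 0.20), ("said", 0.20, 0.50), ("you", 0.50, 0.65),
    ("would", 0.65, 0.90), ("um", 1.55, 1.80),           # filled pause
    ("no", 2.45, 2.60), ("I", 2.60, 2.70),               # self-repair begins
    ("should", 2.70, 3.00), ("feed", 3.00, 3.30), ("Diego", 3.30, 3.80),
]
words = [w for w, _, _ in transcript]

# 1) Filler rate: ~6 per 100 words is typical; more than 1 per 10 words (i.e.,
#    more than 10 per 100) is the "excessive" threshold cited above.
filler_count = sum(1 for w in words if w.lower() in FILLERS)
rate_per_100 = 100 * filler_count / len(words)
print(f"fillers per 100 words: {rate_per_100:.1f}",
      "(excessive)" if rate_per_100 > 10 else "(moderate)")

# 2) Self-repair: look for an editing term followed by a restart.
for i, w in enumerate(words):
    if w.lower() in EDIT_TERMS and 0 < i < len(words) - 1:
        print("possible self-repair:")
        print("  abandoned start:", " ".join(words[:i]))
        print("  repair:         ", " ".join(words[i:]))

# 3) Pauses: the gap between one word's end and the next word's start.
#    The research above says roughly 500-800 ms helps word encoding.
for (_, _, end), (nxt, start, _) in zip(transcript, transcript[1:]):
    gap_ms = (start - end) * 1000
    if gap_ms > 0:
        in_range = 500 <= gap_ms <= 800
        print(f"pause of {gap_ms:.0f} ms before '{nxt}'" +
              (" (in the helpful 500-800 ms range)" if in_range else ""))
```

Real voice assistants do all of this with far richer acoustic and language models, but the quantities being measured are the same ones the research points to.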
Second, we dug into the mechanics of speech and how they can aid or inhibit comprehension. Each of us speaks with a dialect, speed, vocabulary, and more that is unique to us, yet similar to those around us (i.e., we blend with the people we spend time with - think mirror neurons). In English, emphasis on one word or syllable can change the meaning: stressing different words in “I never said she took it” implies different things, and shifting the stress in “record” turns the noun into the verb. We also gesture, make facial expressions, or use other non-verbal cues to communicate. See figure:

And finally - a few more loose ends. I learned about:
Linguistic predictability - refers to the brain's ability to anticipate upcoming words, sounds, or meanings based on context, grammar, and prior experience. We need to do this well for comprehension. Prediction operates at several levels: the phoneme (faster identification of sounds), syntax (grammar that allows smoother parsing of sentence structure), and semantics (faster recognition of words and meanings). (A toy sketch after these items shows one simple way to put a number on how predictable each word is.)
Prosody and prosodic constituents (intonation, rhythm, stress, and phrasing) provide an acoustic and rhythmic framework for speech. These elements often reflect and support syntactic structure, which in turn organizes meaning (semantics). The systems work together to help listeners segment, interpret, and understand spoken language efficiently. (A second sketch below pulls out two of the acoustic dimensions that prosody lives in.)
Gricean Maxims help guide conversations and help shape neural prediction mechanisms (source: Effectiviology). The quantity (as informative as needed, and no more), quality (i.e., truthful), relevance, and manner (i.e., clear and orderly) of communication all impact comprehension.
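To make “predictability” a bit more concrete, here is a toy Python sketch of my own - again just an illustration, not a model of the brain and not something from my sources. It estimates how surprising each word is given the word before it, using bigram counts over a tiny made-up corpus; language models (and, loosely, our brains) do the same kind of anticipation at vastly larger scale.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus standing in for "prior experience."
corpus = (
    "the dog chased the ball . the dog caught the ball . "
    "the cat chased the dog . the dog chased the cat ."
).split()

vocab = set(corpus)
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def surprisal(prev, word, alpha=0.5):
    """-log2 P(word | prev), with add-alpha smoothing so unseen pairs stay finite."""
    counts = bigram_counts[prev]
    p = (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))
    return -math.log2(p)

sentence = "the dog chased the cat".split()
for prev, word in zip(sentence, sentence[1:]):
    print(f"surprisal of '{word}' after '{prev}': {surprisal(prev, word):.2f} bits")

# "dog" after "the" is common in this corpus, so its surprisal is low;
# "cat" after "the" is rarer, so its surprisal is higher. Predictable words
# are processed faster -- the intuition behind linguistic predictability.
```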
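And for prosody, a rough sketch of my own that pulls two of the acoustic dimensions prosody lives in - pitch and loudness - out of a speech recording. It assumes the open-source librosa audio library and a hypothetical file name; extracting these contours is the easy part, and mapping them onto prosodic constituents and syntax is where the real research effort goes.

```python
import librosa
import numpy as np

# "speech_sample.wav" is a hypothetical file name -- swap in any recording.
y, sr = librosa.load("speech_sample.wav", sr=None)

# Pitch (F0) contour, limited to a plausible range for speech.
f0, _, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Loudness proxy: root-mean-square energy per analysis frame.
rms = librosa.feature.rms(y=y)[0]

voiced_f0 = f0[~np.isnan(f0)]  # pyin returns NaN for unvoiced frames
print(f"duration: {len(y) / sr:.1f} s")
print(f"median pitch of voiced frames: {np.median(voiced_f0):.0f} Hz")
print(f"pitch range: {voiced_f0.min():.0f}-{voiced_f0.max():.0f} Hz")
print(f"mean frame energy (RMS): {rms.mean():.4f}")
```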


