Sound Communication: The Holdcom Blog

Behold, The Power of Real Human Audio

While waiting on hold, most listeners can tell almost instantly whether the message on hold program contains human narration or synthetic speech. Granted, artificial speech technology has come a long way in replicating the human voice, but very noticeable differences still exist. Although synthetic speech and natural speech are no longer at completely opposite ends of the spectrum, natural speech comes out on top due to the persistent shortcomings of artificial voice. Unlike a synthetic voice, which relies on concatenation to assemble a sequence of prerecorded words and phrases, a human speaker has the intelligence to interpret context and information. Human speakers can also naturally adjust pronunciation, pacing, tone, mood, style, dynamics, pitch and intensity: all the emotional elements that put the "human" into the human voice.

English text in particular is full of heteronyms, abbreviations and numbers, which are difficult for speech synthesis programs to interpret. Pronunciation and prosody are common issues in text to speech (TTS), and handling them convincingly requires a phonetic representation of the text. There are often problematic patterns in wave concatenation, as well as difficulties in synthetically reproducing high-pitched female and child voices.
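To make the heteronym problem concrete, here is a minimal Python sketch of the kind of lookup a TTS front end might need. The table, the respellings and the `pronounce` function are all hypothetical and invented for illustration; a real synthesizer would use a full pronunciation lexicon and would have to work out the part of speech on its own.

```python
# Hypothetical table: (spelling, part of speech) -> rough phonetic respelling.
# The respellings are informal illustrations, not a standard phonetic alphabet.
HETERONYMS = {
    ("lead", "noun"): "LED",
    ("lead", "verb"): "LEED",
    ("record", "noun"): "REK-erd",
    ("record", "verb"): "rih-KORD",
    ("present", "noun"): "PREZ-ent",
    ("present", "verb"): "prih-ZENT",
}


def pronounce(word, part_of_speech):
    """Return a phonetic respelling for a known heteronym.

    Falls back to the plain spelling when the word is not in the table.
    """
    return HETERONYMS.get((word.lower(), part_of_speech), word)


print(pronounce("record", "noun"))  # as in "a world record"
print(pronounce("record", "verb"))  # as in "record a message"
```

The spelling alone is not enough to pick a pronunciation. The program needs the context as a second input, which a human narrator supplies without thinking about it.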

Here are some frequent disadvantages of synthetic speech:

  • Many English spellings are pronounced differently depending on context (heteronyms, a kind of homograph).
  • It is often difficult to get numbers read out in the specific way you intend.
  • Roman numerals can be read as letters.
  • Tends to sound robotic and unnatural.
  • Poor pronunciation.
  • Lacks character and personality.
  • Compound words are sometimes hard to pronounce correctly.
  • Hard to find the right stress, duration and intonation of the text.
  • Lacks ability for a dynamic voice.
  • Cannot produce an accent.
  • Concatenative synthesis requires collecting speech samples, labeling them correctly and then assembling them into words, which is extremely time consuming.
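The number and Roman numeral items in the list above can be illustrated with a short Python sketch. All of the function names are hypothetical, and the year reading only handles four-digit strings. The point is that the same characters have several valid readings, and the text alone does not say which one the writer intended.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def two_digits(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def as_digits(text):
    """Read a digit string one digit at a time, like an extension or PIN."""
    return " ".join(ONES[int(ch)] for ch in text)


def as_year(text):
    """Read a four-digit string as two pairs, one common way to say a year."""
    first, second = int(text[:2]), int(text[2:])
    if second == 0:
        return two_digits(first) + " hundred"
    if second < 10:
        return two_digits(first) + " oh " + ONES[second]
    return two_digits(first) + " " + two_digits(second)


def roman_to_int(text):
    """Convert a Roman numeral to an integer instead of spelling its letters."""
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (IV = 4).
        if nxt in ROMAN and ROMAN[nxt] > value:
            total -= value
        else:
            total += value
    return total


print(as_digits("1984"))    # one nine eight four
print(as_year("1984"))      # nineteen eighty-four
print(roman_to_int("XIV"))  # 14
```

A human narrator reading "Extension 1984" or "Chapter XIV" picks the right reading from context. A synthesizer has to be told which function applies, or guess.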

Although synthetic speech is finding its way into numerous business applications, the professional human voice is what callers respond to more positively. The natural flow of speech and pronunciation in audio programming is something only authentic human voices can accomplish. Speech synthesis continues to develop and is becoming increasingly successful in high-level IVR and telephony transactions. Even with all the advances in synthetic speech technology, businesses still prefer natural speech because the human voice is the real deal, and synthetic speech is, for lack of a better term, a "knock-off" that is missing the personal "human" touch.

Tags: resources, message on hold, voice talent