
A composite photo of a computer with a human mouth superimposed to illustrate computerized speech synthesis.

Speech synthesizers

by Chris Woodford. Last updated: January 21, 2017.

How long will it be before your computer gazes deep into your eyes and, with all the electronic sincerity it can muster, mutters those three little words that mean so much: "I love you"? In theory, it could happen right this minute: virtually every modern Windows PC has a speech synthesizer (a computerized voice that turns written text into speech) built in, mostly to help people with visual disabilities who can't read tiny text printed on a screen. How exactly do speech synthesizers go about converting written language into spoken words? Let's take a closer look!

Photo: Humans don't communicate by printing words on their foreheads for other people to read, so why should computers? One day it may be common to speak commands to your computer and get back a spoken reply!

What is speech synthesis?

Computers do their jobs in three distinct stages called input (where you feed information in, often with a keyboard or mouse), processing (where the computer responds to your input, say, by adding up some numbers you typed in or enhancing the colors on a photo you scanned), and output (where you get to see how the computer has processed your input, typically on a screen or printed out on paper). Speech synthesis is simply a form of output where a computer or other machine reads words to you out loud in a real or simulated voice played through a loudspeaker; the technology is often called text-to-speech (TTS).

Talking machines are nothing new—somewhat surprisingly, they date back to the 18th century—but computers that routinely speak to their operators are still extremely uncommon. True, we drive our cars with the help of computerized navigators, engage with computerized switchboards when we phone utility companies, and listen to computerized apologies at railroad stations when our trains are running late. But hardly any of us talk to our computers (with voice recognition) or sit around waiting for them to reply. Professor Stephen Hawking is a truly unique individual—in more ways than one: can you think of any other person who's famous for talking with a computerized voice? All that may change in the future as computer-generated speech becomes less robotic and more human.

How does speech synthesis work?

Let's say you have a paragraph of written text that you want your computer to speak aloud. How does it turn the written words into ones you can actually hear? There are essentially three stages involved, which I'll refer to as text to words, words to phonemes, and phonemes to sound.

1. Text to words

Reading words sounds easy, but if you've ever listened to a young child reading a book that was just too hard for them, you'll know it's not as trivial as it seems. The main problem is that written text is ambiguous: the same written information can often mean more than one thing and usually you have to understand the meaning or make an educated guess to read it correctly. So the initial stage in speech synthesis, which is generally called pre-processing or normalization, is all about reducing ambiguity: it's about narrowing down the many different ways you could read a piece of text into the one that's the most appropriate.

Preprocessing involves going through the text and cleaning it up so the computer makes fewer mistakes when it actually reads the words aloud. Things like numbers, dates, times, abbreviations, acronyms, and special characters (currency symbols and so on) need to be turned into words—and that's harder than it sounds. The number 843 might refer to a quantity of items ("eight hundred and forty three"), a year or a time ("eight forty three"), or a padlock combination ("eight four three"), each of which is read out slightly differently. While humans follow the sense of what's written and figure out the pronunciation that way, computers generally don't have the power to do that, so they have to use statistical probability techniques (typically Hidden Markov Models) or neural networks (computer programs structured like arrays of brain cells that learn to recognize patterns) to arrive at the most likely pronunciation instead. So if the word "year" occurs in the same sentence as "843," it might be reasonable to guess this is a date and pronounce it "eight forty three." If there were a decimal point before the numbers ("0.843"), they would need to be read differently again, digit by digit: "zero point eight four three."
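
Here's the flavor of that normalization step in code. This is only a sketch in Python (the rules and function names are invented for illustration, not taken from any real TTS engine): it expands decimals digit by digit and applies one crude context rule for a "year" reading.

import re

# Toy text normalizer (a sketch, not production code): expands a few number
# patterns into words. Real systems use far richer rule sets plus statistical
# models to choose between readings such as "eight hundred and forty three"
# and "eight forty three".

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}

def digits_one_by_one(digits):
    """'843' -> 'eight four three' (padlock or decimal style)."""
    return " ".join(ONES[int(d)] for d in digits)

def as_year(digits):
    """'843' -> 'eight forty three' (date style, three-digit case only)."""
    century, tens, units = int(digits[0]), int(digits[1]), int(digits[2])
    if tens:
        tail = TENS[tens] + (" " + ONES[units] if units else "")
    else:
        tail = "oh " + ONES[units]
    return ONES[century] + " " + tail

def normalize(text):
    # Decimals: digits after the point are always read one by one.
    text = re.sub(r"\b(\d+)\.(\d+)\b",
                  lambda m: digits_one_by_one(m.group(1)) + " point " +
                            digits_one_by_one(m.group(2)),
                  text)
    # Crude context rule: a three-digit number next to "year" is probably
    # a date, so read it date-style.
    text = re.sub(r"\byear (\d{3})\b",
                  lambda m: "year " + as_year(m.group(1)), text)
    return text

print(normalize("In the year 843 the code was 0.843"))
# -> In the year eight forty three the code was zero point eight four three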

Preprocessing also has to tackle homographs, words pronounced in different ways according to what they mean. The word "read" can be pronounced either "red" or "reed," so a sentence such as "I read the book" is immediately problematic for a speech synthesizer. But if it can figure out that the preceding text is entirely in the past tense, by recognizing past-tense verbs ("I got up... I took a shower... I had breakfast... I read a book..."), it can make a reasonable guess that "I read [red] a book" is probably correct. Likewise, if the preceding text is "I get up... I take a shower... I have breakfast..." the smart money should be on "I read [reed] a book."
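
To make that tense-checking idea concrete, here's a deliberately crude sketch (my own toy rule, not how any real synthesizer is written): it counts telltale past-tense and present-tense verbs in the surrounding text and picks a pronunciation for "read" accordingly. Real systems lean on part-of-speech taggers and statistical language models instead.

# Toy homograph resolver (a sketch only): decide whether "read" should be
# spoken as "reed" (present tense) or "red" (past tense) by looking at
# nearby verbs.

PAST_CLUES = {"got", "took", "had", "was", "were", "did", "went"}
PRESENT_CLUES = {"get", "take", "have", "am", "is", "are", "do", "go"}

def pronounce_read(sentence):
    words = set(sentence.lower().replace(".", " ").split())
    past_votes = len(words & PAST_CLUES)
    present_votes = len(words & PRESENT_CLUES)
    return "red" if past_votes > present_votes else "reed"

print(pronounce_read("I got up. I took a shower. I read a book."))  # red
print(pronounce_read("I get up. I take a shower. I read a book."))  # reed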

2. Words to phonemes

Having figured out the words that need to be said, the speech synthesizer now has to generate the speech sounds that make up those words. In theory, this is a simple problem: all the computer needs is a huge alphabetical list of words and details of how to pronounce each one (much as you'd find in a typical dictionary, where the pronunciation is listed before or after the definition). For each word, we'd need a list of the phonemes that make up its sound.

What are phonemes?

Crudely speaking, phonemes are to spoken language what letters are to written language: they're the atoms of spoken sound—the sound components from which you can make any spoken word you like. The word cat consists of three phonemes making the sounds /k/ (as in can), /a/ (as in pad), and /t/ (as in tusk). Rearrange the order of the phonemes and you could make the words "act" or "tack."

There are only 26 letters in the English alphabet, but over 40 phonemes. That's because some letters and letter groups can be read in multiple ways (a, for example, can be read differently, as in 'pad' or 'paid'), so instead of one phoneme per letter, there are phonemes for all the different letter sounds. Some languages need more or fewer phonemes than others (typically 20-60).

Okay, that's phonemes in a nutshell—a hugely simplified definition that will do us for starters. If you want to know more, start with the article on phonemes in Wikipedia.

In theory, if a computer has a dictionary of words and phonemes, all it needs to do to read a word is look it up in the list and then read out the corresponding phonemes, right? In practice, it's harder than it sounds. As any good actor can demonstrate, a single sentence can be read out in many different ways according to the meaning of the text, the person speaking, and the emotions they want to convey (in linguistics, this idea is known as prosody and it's one of the hardest problems for speech synthesizers to address). Within a sentence, even a single word (like "read") can be read in multiple ways (as "red"/"reed") because it has multiple meanings. And even within a word, a given phoneme will sound different according to the phonemes that come before and after it.
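
In code, the dictionary approach really is just a lookup. The sketch below uses a tiny, hand-typed sample of word-to-phoneme entries in the ARPAbet-style notation used by the freely available CMU Pronouncing Dictionary; a real system loads tens of thousands of entries and, as discussed above, still needs context to choose between the variants of a homograph.

# Word-to-phoneme lookup (a sketch): each word maps to one or more
# pronunciations, written here as lists of ARPAbet-style phoneme symbols.

PRON_DICT = {
    "cat":  [["K", "AE", "T"]],
    "act":  [["AE", "K", "T"]],
    "tack": [["T", "AE", "K"]],
    "read": [["R", "IY", "D"],   # "reed" (present tense)
             ["R", "EH", "D"]],  # "red"  (past tense)
}

def to_phonemes(word, variant=0):
    """Return one pronunciation; choosing the right variant for a homograph
    is the hard part and needs the kind of context analysis described above."""
    pronunciations = PRON_DICT.get(word.lower())
    if pronunciations is None:
        raise KeyError(word + " not in dictionary - fall back to letter-to-sound rules")
    return pronunciations[variant]

print(to_phonemes("cat"))              # ['K', 'AE', 'T']
print(to_phonemes("read", variant=1))  # ['R', 'EH', 'D']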

An alternative approach involves breaking written words into their graphemes (written component units, typically made from the individual letters or syllables that make up a word) and then generating phonemes that correspond to them using a set of simple rules. This is a bit like a child attempting to read words he or she has never previously encountered (the reading method called phonics is similar). The advantage of doing that is that the computer can make a reasonable attempt at reading any word, whether or not it's a real word stored in the dictionary, a foreign word, or an unusual name or technical term. The disadvantage is that languages such as English have large numbers of irregular words that are pronounced in a very different way from how they're written (such as "colonel," which we say as "kernel" and not "coll-o-nell"; and "yacht," which is pronounced "yot" and not "yach-t")—exactly the sorts of words that cause problems for children learning to read and people with what's known as surface dyslexia (also called orthographic or visual dyslexia).
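
A letter-to-sound fallback can be sketched as an ordered list of spelling rules, tried longest match first. The handful of rules below is my own invention and wildly incomplete; real rule sets (and the machine-learned models that have largely replaced them) are far bigger and still stumble over irregular words like "colonel," which is why an exception dictionary is usually kept alongside them.

# Toy letter-to-sound rules (a sketch): map spelling chunks to ARPAbet-ish
# phonemes, longest chunk first. Real systems use hundreds of context-
# sensitive rules or trained models, plus an exception dictionary for
# irregular words ("colonel", "yacht", ...).

RULES = [            # ordered: longer, more specific spellings first
    ("ck", ["K"]),
    ("sh", ["SH"]),
    ("ee", ["IY"]),
    ("a",  ["AE"]),
    ("c",  ["K"]),
    ("k",  ["K"]),
    ("t",  ["T"]),
    ("s",  ["S"]),
]

def letters_to_phonemes(word):
    word = word.lower()
    phonemes, i = [], 0
    while i < len(word):
        for spelling, sounds in RULES:
            if word.startswith(spelling, i):
                phonemes.extend(sounds)
                i += len(spelling)
                break
        else:
            i += 1   # letter not covered by any rule: skip it (a real system wouldn't)
    return phonemes

print(letters_to_phonemes("tack"))  # ['T', 'AE', 'K']
print(letters_to_phonemes("cats"))  # ['K', 'AE', 'T', 'S']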

3. Phonemes to sound

Okay, so now we've converted our text (our sequence of written words) into a list of phonemes (a sequence of sounds that need speaking). But where do we get the basic phonemes that the computer reads out loud when it's turning text into speech? There are three different approaches. One is to use recordings of humans saying the phonemes, another is for the computer to generate the phonemes itself by generating basic sound frequencies (a bit like a music synthesizer), and a third approach is to mimic the mechanism of the human voice.

Concatenative

Speech synthesizers that use recorded human voices have to be preloaded with little snippets of human sound they can rearrange. In other words, a programmer has to record lots of examples of a person saying different things, then break the spoken sentences into words and the words into phonemes. If there are enough speech samples, the computer can rearrange the bits in any number of different ways to create entirely new words and sentences. This type of speech synthesis is called concatenative (from Latin words that simply mean to link bits together in a series or chain). Since it's based on human recordings, concatenation is the most natural-sounding type of speech synthesis and it's widely used by machines that have only limited things to say (for example, corporate telephone switchboards). Its main drawback is that it's limited to a single voice (a single speaker of a single sex) and (generally) a single language.
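
Stripped to its essentials, concatenative synthesis is "look up each unit's recording and glue the samples together," as in the sketch below. The unit file names are hypothetical and the simple butt-joining is a big simplification: real systems store thousands of units (usually diphones or longer chunks rather than single phonemes) and carefully smooth the joins so the seams aren't audible.

import wave

# Concatenative synthesis in miniature (a sketch): join prerecorded sound
# units end to end into one output file. All unit recordings are assumed
# to share the same sample rate and format.

UNIT_FILES = {            # phoneme -> prerecorded snippet (hypothetical paths)
    "K":  "units/k.wav",
    "AE": "units/ae.wav",
    "T":  "units/t.wav",
}

def synthesize(phonemes, out_path="word.wav"):
    frames, params = [], None
    for ph in phonemes:
        with wave.open(UNIT_FILES[ph], "rb") as unit:
            if params is None:
                params = unit.getparams()      # copy format from the first unit
            frames.append(unit.readframes(unit.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(b"".join(frames))

# synthesize(["K", "AE", "T"])   # -> a crude "cat," given suitable recordings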

Formant

If you consider that speech is just a pattern of sound that varies in pitch (frequency) and volume (amplitude)—like the noise coming out of a musical instrument—it ought to be possible to make an electronic device that can generate whatever speech sounds it needs from scratch, like a music synthesizer. This type of speech synthesis is known as formant, because formants are the 3–5 key (resonant) frequencies of sound that the human vocal apparatus generates and combines to make the sound of speech or singing. Unlike speech synthesizers that use concatenation, which are limited to rearranging prerecorded sounds, formant speech synthesizers can say absolutely anything—even words that don't exist or foreign words they've never encountered. That makes formant synthesizers a good choice for GPS satellite (navigation) computers, which need to be able to read out many thousands of different (and often unusual) place names that would be hard to memorize. In theory, formant synthesizers can easily switch from a male to a female voice (by roughly doubling the frequency) or to a child's voice (by trebling it), and they can speak in any language. In practice, concatenation synthesizers now use huge libraries of sounds so they can say pretty much anything too. A more obvious difference is that concatenation synthesizers sound much more natural than formant ones, which still tend to sound relatively artificial and robotic.
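
As a toy illustration of the formant idea, the sketch below builds half a second of a vowel-like sound by mixing sine waves at roughly the first three formant frequencies of an "ah" vowel (the frequency values are standard textbook averages; mixing plain sine waves is a huge simplification of how real formant synthesizers filter a source waveform through resonators and vary the formants over time).

import math
import struct
import wave

# Formant synthesis in miniature (a toy sketch): mix sine waves at a few
# resonant frequencies to get a static, vowel-ish buzz and save it as a
# WAV file you can play back.

SAMPLE_RATE = 16000
FORMANTS = [(730, 1.0), (1090, 0.5), (2440, 0.25)]  # rough "ah" formants: (Hz, relative level)

def vowel(duration=0.5):
    total_level = sum(level for _, level in FORMANTS)
    samples = []
    for i in range(int(SAMPLE_RATE * duration)):
        t = i / SAMPLE_RATE
        value = sum(level * math.sin(2 * math.pi * freq * t)
                    for freq, level in FORMANTS)
        samples.append(int(12000 * value / total_level))   # keep within 16-bit range
    return samples

with wave.open("ah.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)             # 16-bit samples
    out.setframerate(SAMPLE_RATE)
    out.writeframes(b"".join(struct.pack("<h", s) for s in vowel()))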

Articulatory

The most complex approach to generating sounds is called articulatory synthesis, and it means making computers speak by modeling the amazingly intricate human vocal apparatus. In theory, that should give the most realistic and humanlike voice of all three methods. Although numerous researchers have experimented with mimicking the human voicebox, articulatory synthesis is still by far the least explored method, largely because of its complexity. The most elaborate form of articulatory synthesis would be to engineer a "talking head" robot with a moving mouth that produces sound in a similar way to a person by combining mechanical, electrical, and electronic components, as necessary.

What are speech synthesizers used for?

Work your way through a typical day and you might encounter all kinds of recorded voices, but as technology advances, it's getting harder to figure out whether you're listening to a simple recording or a speech synthesizer. You might have an alarm clock that wakes you up by speaking the time, probably using crude, formant speech synthesis. If you have a speaking GPS system in your car, that might use either concatenated speech synthesis (if it has only a relatively limited vocabulary) or formant synthesis (if the voice is adjustable and it can read place names). If you have an ebook reader, maybe you have one with a built-in narrator? If you're visually impaired, you might use a screen reader that speaks the words aloud from your computer screen (most modern Windows computers have a program called Narrator that you can switch on to do just this). Whether you use it or not, it's likely your cellphone has the ability to listen to your questions and reply through an intelligent personal assistant—Siri (iPhone), Cortana (Microsoft), or Google Assistant/Now (Android). If you're out and about on public transportation, you'll hear recorded voices all the time speaking safety and security announcements or telling you what trains and buses are coming along next. Are they simple recordings of humans... or are they using concatenated, synthesized speech? See if you can figure it out! One really interesting use of speech synthesis is in teaching foreign languages. Speech synthesizers are now so realistic that they're good enough for language students to use by way of practice.

Who invented speech synthesis?

Talking computers sound like something out of science fiction—and indeed, the most famous example of speech synthesis is exactly that. In Stanley Kubrick's groundbreaking movie 2001: A Space Odyssey (based on the novel by Arthur C. Clarke), a computer called HAL famously chatters away in a humanlike voice and, at the end of the story, breaks into a doleful rendition of the song Daisy Bell (A Bicycle Built for Two) as an astronaut takes it apart.

Here's a whistle-stop tour through the history of speech synthesis:

Experiment for yourself!

Why not experience a bit of speech synthesis for yourself? Here are two examples of what the first sentence of this article sounds like read out by Microsoft Sam (a formant speech synthesizer built into Windows XP) and Microsoft Anna (a more natural-sounding formant synthesizer in Windows Vista and Windows 7). Notice how much the technology improved in just the five years or so between those two speech synthesizers being released.

Audio clip: Sam

Audio clip: Anna

If you have a modern computer (Windows or Mac), it's almost certainly got a speech synthesizer lurking in it somewhere.
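
For example, a few lines of Python using the cross-platform pyttsx3 library (one of several available wrappers; it drives the operating system's native speech engine, such as SAPI on Windows) are enough to make your machine talk:

# Drive the operating system's built-in speech synthesizer from Python.
# Requires: pip install pyttsx3
import pyttsx3

engine = pyttsx3.init()              # picks a suitable driver for your platform
engine.setProperty("rate", 170)      # speaking rate (roughly words per minute)
engine.say("How long will it be before your computer mutters: I love you?")
engine.runAndWait()                  # block until the speech has finished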
