Which do you think is better, Google speech or Siri? Have you ever wondered what makes each one work?
Multimodality Speech
At about age two, maturation of the nervous system allows us to begin thinking in words. Even as adults, we can still think in earlier thought carriers: smells, sounds, images, and feelings. But if you are typical, you think in words, often in phrases and sentences. We have the ability to vocalize words, and we develop the ability to write them. We recognize that a word may exist as a thought, a spoken word, a written word, or even a recorded word.

A couple of generations ago, after WWII and the Korean War, mainframe computers were first being applied to language. People in both the linguistics and math disciplines got involved and were asking: How do we understand language, the way that sentences are put together? How can we tell whether a sentence is one that people can understand?
A Mechanical-Logic Approach
At this point, enter Morris Halle, Noam Chomsky, and the MIT linguistics group. They developed “transformational grammar,” a series of rules for assembling the components of spoken language into recognizable sentences. Think back to the use of BASIC in the early computer days, and you can understand that early MIT focus: develop a series of rules that put sentences and meaning into a form Joe Average will understand. Phonetics, phonemes, stress, and meaning were all linked to symbols, which were strung together into “well-formed” sentences. Just as in BASIC 101, rule 01 leads step by step to rule 52. Turn the transformational program on, run the rules, and out comes a generated sentence.
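To make the “run the rules in order” idea concrete, here is a toy sketch in Python. The sentence plan, the three rules, and the crude agreement test are all invented for illustration; they are not Halle and Chomsky's actual transformational rules, only a picture of sequential rule application.

# A toy illustration of sequential rule application, loosely in the spirit
# of the generative idea described above. The rules here are invented.

def rule_expand_plan(plan):
    # Rule 1: expand an abstract plan (subject, verb, object) into word slots.
    return [plan["subject"], plan["verb"], plan["object"]]

def rule_agreement(words):
    # Rule 2: crude subject-verb agreement for a singular subject.
    subject, verb, obj = words
    if not subject.endswith("s"):      # naive singular test, for illustration only
        verb = verb + "s"
    return [subject, verb, obj]

def rule_linearize(words):
    # Rule 3: string the word slots together into a surface sentence.
    return " ".join(words).capitalize() + "."

def generate(plan):
    # Turn the program on, run the rules in order, and out comes the sentence.
    result = plan
    for rule in (rule_expand_plan, rule_agreement, rule_linearize):
        result = rule(result)
    return result

print(generate({"subject": "the child", "verb": "hear", "object": "the word"}))
# -> The child hears the word.

The point is only the shape of the process: each rule takes the output of the previous one, and the last rule hands you the surface sentence.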
Using Rules to Cross-Index Language Modalities
I suppose that transformational grammar still exists, at least esoterically, but practicality left generative linguistics a long time ago. All, that is, except the idea of rules. The generative principle was that you start with a plan for a sentence and develop it into the finished sentence through a series of steps. Reverse that process on a written document, converting it down to the phonemic level, and you get a spoken sentence.
Designing a Text to Speech Converter
The rule system used in text-to-speech conversion programs must be hierarchical. The first series of rules analyzes the written document into its phonemic components, taking into account historic spelling anomalies. Next, depending on the target spoken dialect, a phonetic allophone must be assigned to each phoneme, and the phonemes must be arranged serially for a speech generator. Word spacing, dashes, ellipses, periods, and other punctuation must be programmed into that speech generator so that words are separated in time and remain distinct. Superimposed on the word stream are “phrase markers” or “intonation patterns” that mark the beginnings and ends of spoken phrases and sentences; in spoken language, again depending on dialect, stress marks these components for listeners.
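As a rough picture of that hierarchy, the Python sketch below walks a short sentence through three stages: lexicon lookup to phonemes, dialect-dependent allophone assignment, and insertion of pauses and a sentence-final intonation marker. The lexicon, allophone table, and markers are all made up for illustration; a real text-to-speech front end would use a full pronunciation dictionary, letter-to-sound rules for unknown words, and a trained prosody model.

# A minimal sketch of the hierarchy described above, with an invented
# three-word lexicon and made-up allophone and prosody rules.

# Stage 1: written words to phonemes (spelling anomalies handled by lookup).
LEXICON = {
    "the":     ["DH", "AH"],
    "cat":     ["K", "AE", "T"],
    "laughed": ["L", "AE", "F", "T"],   # "gh" pronounced /f/: a spelling anomaly
}

# Stage 2: dialect-dependent allophones for each phoneme (invented values).
ALLOPHONES = {
    "general":  {"T": "t"},        # plain /t/
    "american": {"T": "t-flap"},   # very roughly, a flapped /t/
}

def to_phonemes(text):
    # Analyze the written sentence into serial phonemic components.
    tokens = text.lower().rstrip(".!?").split()
    return [LEXICON.get(tok, ["?"]) for tok in tokens]

def to_allophones(phoneme_words, dialect="general"):
    # Assign the chosen dialect's allophone for each phoneme.
    table = ALLOPHONES[dialect]
    return [[table.get(p, p.lower()) for p in word] for word in phoneme_words]

def add_prosody(allophone_words, text):
    # Stage 3: keep words distinct in time and mark sentence-final intonation.
    stream = []
    for word in allophone_words:
        stream.extend(word)
        stream.append("<pause>")
    stream.append("<fall>" if text.endswith(".") else "<rise>")
    return stream

sentence = "The cat laughed."
print(add_prosody(to_allophones(to_phonemes(sentence), "american"), sentence))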
State of the Art
Several speech platforms now perform this kind of conversion. We are already in the next phase, with some success in converting sentences from one language into another, but the problem of meaning, not phonemes, continues to be the major complication.