Basic concepts of Automatic Speech Recognition

Read Time: 4 min

Speech is an integral part of communication. The importance of speech recognition has grown over the years as technology has advanced. This tutorial introduces automatic speech recognition (ASR) systems: why such models are needed, the key terms, the architecture of such systems, and their applications.

Table of Contents:

  • What is Automatic Speech Recognition?
  • Why is it desirable to study and research?
  • What are the challenges faced?
  • What are the key terms used for Speech Recognition?
  • How does Automatic Speech Recognition work?
  • Applications of Speech Recognition System
  • What can be improved more in this domain?
Let’s dive in!!

What is Automatic Speech Recognition?

Speech recognition is the accurate conversion of spoken utterances into text, such as words, sentences or syllables. This process is automated using technologies such as deep learning.

Necessity of automatic speech recognition

  • Speaking is the primary way humans communicate.
  • Speech is a form of communication that anyone can understand, whether literate or illiterate.
  • Speech recognition can help preserve endangered languages, such as those of local tribes or of historical interest.

Challenges faced

Speech recognition is a task where accuracy is critical; an inaccurate transcript defeats the purpose of recognition.

Here are the key challenges that make speech recognition difficult:

  • Resources: Every person has their own way of speaking. It can be really difficult for a machine to model such patterns accurately.
  • Speech style: Whether a person talks continuously or in an isolated manner affects the accuracy of the ASR system. With continuous speech, it is somewhat easier to predict the next word. Here, continuous speech refers to speech that follows a fixed, rule-governed pattern, such as a teacher explaining a topic with a slide show. Isolated speech is more random, with no fixed pattern, as in a conversation.
  • Environmental conditions: Environmental noise is unavoidable and reduces accuracy. Examples include background noise, room acoustics and instrument noise (e.g. microphone interruptions).
  • Task-specific problems: Even when noise is minimal, many other factors affect speech recognition, such as the morphological structure of sentences and the type and range of the vocabulary.

Key terms used in speech analysis

If you are new to this domain, you will come across several unfamiliar terms that you need to understand first. So let’s start:

  • Phoneme: The smallest unit of sound that distinguishes one word from another.
  • Acoustics: The branch of physics that deals with sound waves. The “acoustics of a room” describes how sound waves behave and interact inside that room.
  • Lexicon: A list of words together with their pronunciations, like a vocabulary. It is a bridge between a language and the information conveyed in that language.
  • Utterance: The smallest continuous unit of speech with a clear beginning and end.
  • Syllable: An unbroken unit of sound within a word, typically built around a single vowel sound.

Typical Automatic Speech Recognition System: Architecture

The basic architecture of an automatic speech recognition system contains three parts:

  • Acoustic model (AM)
  • Pronunciation Model (PM)
  • Language Model (LM)

Steps followed for speech recognition:

  • Analysis of the speech signal: As a first step, the signal is acquired, processed, analysed and split into short segments of discrete samples, i.e. frames.

Why are frames necessary for analysis?

A speech signal can be analysed only in segments of fixed length. Splitting the signal into frames produces such fixed-length segments, each short enough that the properties of the sound stay roughly constant within it.
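To make the idea of framing concrete, here is a minimal sketch in plain Python. The function name `frame_signal`, the 25 ms frame length, the 10 ms hop and the 16 kHz sample rate are illustrative assumptions (common choices, not values given in this tutorial), and the input is a synthetic sine tone rather than real speech.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a list of samples into overlapping, fixed-length frames.

    frame_ms and hop_ms are assumed defaults, not prescribed values.
    Trailing samples that cannot fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len  # consecutive frames overlap when hop_len < frame_len
    return frames


# One second of a synthetic 440 Hz tone sampled at 16 kHz (stand-in for speech).
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440.0 * n / sample_rate) for n in range(sample_rate)]

frames = frame_signal(signal, sample_rate)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

Every frame has the same length, which is what lets the later steps treat each frame as one fixed-size unit of analysis.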

  • Once the signal is split into frames, features are extracted from each frame. These features are generally Mel Frequency Cepstral Coefficients (MFCCs), as they approximate the way the human ear perceives sound.
  • The extracted acoustic features, along with their corresponding frames, are sent to the decoder.
  • In parallel, an acoustic model is built from the phonemes of the words and sent to the decoder block.
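The "mel" in MFCC refers to the mel scale, a frequency scale designed to follow human pitch perception. The sketch below is not a full MFCC pipeline (which typically also involves a Fourier transform, a filter bank, a logarithm and a cosine transform); it only illustrates the scale itself, using one widely used conversion formula. Other variants of the formula exist, and the helper names here are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common formula; variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(low_hz, high_hz, count):
    """Return `count` (>= 2) frequencies in Hz, equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (count - 1)
    return [mel_to_hz(low_mel + i * step) for i in range(count)]


# Ten points between 0 Hz and 8 kHz: dense at low frequencies, sparse at high ones.
centres = mel_spaced_frequencies(0.0, 8000.0, 10)
print([round(f) for f in centres])
```

The printed frequencies crowd together at the low end and spread out at the high end, mirroring the fact that the ear resolves low frequencies more finely than high ones.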

Why are phonemes used rather than words as the basic unit?

Let’s understand it with an example. Suppose the training data has three words: orange, banana and apple, so the model knows the phonemes of these words. Now a new word (e.g. pineapple) appears in the test data. If words are used as the basic units, the model will not be able to detect pineapple easily, because it has never seen that word. If phonemes are used, then even though pineapple did not occur before, the model has a much better chance of identifying it, since most of its phonemes have occurred in the training words.

Note: Words can be used as the basic unit if the vocabulary is limited or if there is enough data to train the model.
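The fruit example can be checked with a few lines of Python. The pronunciations below are hand-written, ARPAbet-style approximations chosen for illustration (an assumption, not entries copied from a real pronunciation dictionary), and the variable names are hypothetical.

```python
# Hand-written, ARPAbet-style pronunciations (illustrative assumption).
train_lexicon = {
    "orange": ["AO", "R", "AH", "N", "JH"],
    "banana": ["B", "AH", "N", "AE", "N", "AH"],
    "apple": ["AE", "P", "AH", "L"],
}

new_word = "pineapple"
new_pron = ["P", "AY", "N", "AE", "P", "AH", "L"]

# Word-level view: the new word was never seen during training.
word_seen = new_word in train_lexicon

# Phoneme-level view: collect every phoneme that occurred in the training words.
seen_phonemes = {p for pron in train_lexicon.values() for p in pron}
unseen = [p for p in new_pron if p not in seen_phonemes]
covered = len(new_pron) - len(unseen)

print("word seen in training:", word_seen)
print("phonemes covered:", covered, "of", len(new_pron))
print("unseen phonemes:", unseen)
```

With these toy pronunciations the word itself is unknown, yet all but one of its phonemes were already observed, which is why a phoneme-based model generalises better to new words.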

  • Next is the language model, which determines how words should be arranged in the given context. It is essentially a probability model that tells how likely different words are to occur given the preceding words. For example, “Rice is ___” could be completed with “grain”, “food” or “boy”. Here the probability of “grain” or “food” is high compared to “boy”.
  • Third is the pronunciation model, which links phonemes to words. It usually relies on a pronunciation dictionary compiled by linguists or domain experts.
  • Pronunciation varies from person to person, and these variations originate in the speech production system. To capture them, an articulatory-feature-based pronunciation model can be used.
  • Finally, the decoder takes the inputs from all these models (AM, LM, PM) and produces the output text.
  • The decoder block is usually based on a Hidden Markov Model (HMM).
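The “Rice is ___” example can be sketched as a tiny count-based language model whose scores are combined with acoustic scores, which is roughly what a decoder does when it weighs its inputs. Everything here is a toy assumption: the five-sentence corpus, the add-one smoothing, and especially the acoustic log-likelihoods, which are made-up numbers standing in for the output of a real acoustic model. A real decoder (for example an HMM-based one) searches over whole word sequences rather than a single slot.

```python
import math
from collections import Counter

# Toy training corpus (an assumption for illustration only).
corpus = [
    "rice is food",
    "rice is grain",
    "rice is food",
    "wheat is grain",
    "the boy is young",
]

trigram_counts = Counter()
context_counts = Counter()
vocab = set()
for sentence in corpus:
    words = sentence.split()
    vocab.update(words)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        trigram_counts[(w1, w2, w3)] += 1
        context_counts[(w1, w2)] += 1


def lm_prob(w1, w2, w3):
    """P(w3 | w1, w2) estimated from counts with add-one smoothing."""
    return (trigram_counts[(w1, w2, w3)] + 1) / (context_counts[(w1, w2)] + len(vocab))


# Hypothetical acoustic log-likelihoods for the word after "rice is".
# "boy" is deliberately given the best acoustic score.
acoustic_logp = {"grain": -4.2, "food": -4.0, "boy": -3.9}

# Decoder-style combination: acoustic score plus language model score.
combined = {
    word: score + math.log(lm_prob("rice", "is", word))
    for word, score in acoustic_logp.items()
}
best = max(combined, key=combined.get)

for word in acoustic_logp:
    print(word, round(lm_prob("rice", "is", word), 3), round(combined[word], 3))
print("chosen word:", best)
```

Although “boy” has the best acoustic score in this made-up setup, the language model makes “food” the overall winner, because “rice is boy” is unlikely in the training text.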

Software based on Speech recognition

Voice Assistants: Google Now, Apple’s Siri, Amazon Alexa etc.


Research options in speech recognition systems

  • Developing models that are robust to variations in age and regional accent.
  • Creating systems that can handle noisy real-life settings, such as speech at conferences, where crowds produce a lot of background noise.
  • Designing efficient models that can cope with variations in pronunciation and also work on new languages.

In summary, we have discussed how speech recognition works, the challenges faced in real-time implementation, its key elements, and software based on speech recognition. In the next tutorial, I will walk you through how the decoder works.

Happy Learning!!