Speech is an integral part of communication, and the importance of speech recognition has grown over the years with the advancement of technology. Through this tutorial, you will get familiar with automatic speech recognition (ASR) systems: why such models are needed, key terms, the architecture of a typical system, and applications.
Table of Contents:
- What is Automatic Speech Recognition?
- Necessity of Automatic Speech Recognition
- Challenges faced
- Key terms used in speech analysis
- Typical Automatic Speech Recognition System: Architecture
- Software Based on Speech Recognition
- Research Options in Speech Recognition Systems
- Useful Resources
What is Automatic Speech Recognition?
Speech recognition refers to the accurate transcription of spoken utterances into text, i.e. words, syllables, and sentences. Technologies like deep learning help automate this process.
Necessity of Automatic Speech Recognition
- Speech is the primary mode of communication for humans.
- Speech is a form of communication that can be understood by anyone, literate or illiterate.
- Speech recognition can help preserve endangered languages and the oral histories of local tribes.
Speech recognition is a task where accuracy plays a very important role; without it, the recognition has little value.
Challenges faced
Here are the key challenges that make speech recognition difficult:
- Speaker variability: Every human has their own way of speaking, and it can be really difficult for a machine to recognize such varied patterns accurately.
- Speech style: Whether a person talks continuously or in an isolated manner affects the ASR system's efficiency. Continuous speech here refers to a fixed pattern of speaking governed by some rule, like a teacher explaining a topic with a slide show; in that case it is easier to predict the next word. Isolated speech is random, with no fixed pattern observed, like a casual conversation.
- Environmental conditions: Environmental noise is an unavoidable factor that decreases accuracy, e.g. background noise, room acoustics, and instrument-based noise (microphone interruptions).
Key terms used in speech analysis
If you are a newbie in this domain, you will come across various unfamiliar terms and keywords, which you need to understand first. So let's start:
- Phoneme: The smallest unit of sound that distinguishes one word from another.
- Acoustics: A branch of physics that deals with sound waves. When we say "the acoustics of a room", we mean how sound waves behave and interact within that room.
- Lexicon: A list of words with their specific pronunciations, like a vocabulary. It is a bridge between a language and the information conveyed through that language.
- Utterance: The smallest continuous unit of speech, with a proper beginning and end.
- Syllable: A unit of pronunciation within a word, typically built around a single vowel sound.
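To make the phoneme and lexicon terms concrete, here is a minimal sketch of a lexicon as a lookup table. The phoneme symbols loosely follow the ARPAbet convention used by the CMU Pronouncing Dictionary, and the entries themselves are illustrative, not authoritative pronunciations.

```python
# A toy lexicon: each word maps to its phoneme sequence (ARPAbet-style
# symbols; entries are illustrative approximations).
lexicon = {
    "apple":  ["AE", "P", "AH", "L"],
    "banana": ["B", "AH", "N", "AE", "N", "AH"],
    "orange": ["AO", "R", "AH", "N", "JH"],
}

def phonemes_of(word):
    """Look up a word's pronunciation in the lexicon."""
    return lexicon[word]

print(phonemes_of("apple"))  # ['AE', 'P', 'AH', 'L']
```

In a real ASR system the lexicon plays exactly this bridging role: it connects the word-level units of the language model to the phoneme-level units of the acoustic model.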
Typical Automatic Speech Recognition System: Architecture
The basic architecture of an automatic speech recognition system contains three parts:
- Acoustic model (AM)
- Pronunciation Model (PM)
- Language Model (LM)
Steps followed for speech recognition:
- Analysis of the speech signal: As a first step, the signal is obtained, processed, analysed, and converted into discrete samples, i.e. frames.
Why are frames necessary for analysis?
Speech signals can only be analysed over a fixed length. A speech signal changes continuously, but over a short window (typically 20–30 ms) it can be treated as approximately stationary, so the signal is split into fixed-length, usually overlapping frames for analysis.
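The framing step can be sketched in a few lines of NumPy. The frame and hop lengths below (25 ms and 10 ms at a 16 kHz sampling rate) are common defaults, not values prescribed by this tutorial.

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into fixed-length, overlapping frames.

    frame_len: samples per frame (e.g. 400 for 25 ms at 16 kHz)
    hop_len:   samples between frame starts (e.g. 160 for 10 ms at 16 kHz)
    """
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([
        signal[i * hop_len : i * hop_len + frame_len]
        for i in range(n_frames)
    ])

# 1 second of a dummy 16 kHz signal -> 25 ms frames with a 10 ms hop
sig = np.arange(16000, dtype=float)
frames = frame_signal(sig, frame_len=400, hop_len=160)
print(frames.shape)  # (98, 400)
```

Because consecutive frames overlap, no part of the signal is analysed in isolation at a frame boundary.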
- Features are extracted from the discrete samples. These are generally Mel-Frequency Cepstral Coefficients (MFCCs), as they closely match the acoustic features that the human ear perceives.
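A rough sketch of the mel feature computation for a single frame is shown below. It stops at the log mel filterbank energies; full MFCCs would additionally apply a DCT to these log energies and keep roughly the first 13 coefficients. The filter count and FFT size are typical defaults, assumed here for illustration.

```python
import numpy as np

def hz_to_mel(hz):
    # Standard mel-scale formula
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def log_mel_energies(frame, sr=16000, n_fft=512, n_mels=26):
    """Log mel filterbank energies for one frame (MFCCs would add a DCT)."""
    # Power spectrum of the windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Filter centre frequencies, evenly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    # Triangular filters rising to each centre bin and falling to the next
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return np.log(fbank @ spectrum + 1e-10)

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # 25 ms of a 440 Hz tone
feats = log_mel_energies(frame)
print(feats.shape)  # (26,)
```

The mel spacing is what makes these features perceptually motivated: the filters are narrow at low frequencies and wide at high frequencies, mirroring how the human ear resolves pitch.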
- The extracted acoustic features, along with their corresponding frames, are sent to the decoder.
- The phonemes of words are used to build the acoustic model, which is also sent to the decoder block.
Why phoneme is used rather than words as a basic unit?
Let's understand it with an example. Suppose the training data has three words: orange, banana, and apple, so the model knows the phonemes of these words. Now a new word (e.g. pineapple) appears in the test data. If word tokens are used as base units, the model will not be able to recognize pineapple. If phonemes are used, then even though pineapple never occurred before, the model will still be able to identify it well, because its phonemes occurred in previously seen words.
Note: A word can be used as the basic unit if the vocabulary is limited or there is enough data to train the model.
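The pineapple example can be made concrete with a small script. The pronunciations are approximate ARPAbet-style renderings chosen for illustration; note that even here one sound of "pineapple" (the "AY" diphthong) never appears in training, which shows the generalization is about shared sounds, not guaranteed full coverage.

```python
# Toy illustration of why phonemes generalize better than whole words.
# Pronunciations are approximate ARPAbet-style renderings for illustration.
train_lexicon = {
    "apple":  ["AE", "P", "AH", "L"],
    "banana": ["B", "AH", "N", "AE", "N", "AH"],
    "orange": ["AO", "R", "AH", "N", "JH"],
}

# Phoneme inventory the model has seen during training
seen_phonemes = {p for prons in train_lexicon.values() for p in prons}

# An unseen test word built mostly from already-seen sounds
pineapple = ["P", "AY", "N", "AE", "P", "AH", "L"]

covered = [p for p in pineapple if p in seen_phonemes]
missing = [p for p in pineapple if p not in seen_phonemes]
print(covered)  # ['P', 'N', 'AE', 'P', 'AH', 'L']
print(missing)  # ['AY']
```

A word-level model would treat "pineapple" as entirely unknown; a phoneme-level model already has acoustic models for six of its seven sounds.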
Language and Pronunciation Model
- The language model determines the arrangement of words for a given context. It is essentially a probability model that tells how likely different words are to occur given the recent words. For example, in "Rice is a ___", the probability of "grain" or "food" is much higher than that of "boy".
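The simplest version of such a probability model is a bigram model, which estimates how likely each word is given only the previous word. The tiny corpus below is hypothetical, purely to make the "grain vs. boy" comparison computable.

```python
from collections import Counter

# A toy bigram language model: estimate P(next_word | previous_word)
# from counted word pairs in a (hypothetical) training corpus.
corpus = ("rice is a grain . rice is a food . "
          "he is a boy . rice is a grain .").split()

pair_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def bigram_prob(prev, word):
    """P(word | prev) by maximum likelihood estimation."""
    return pair_counts[(prev, word)] / unigram_counts[prev]

# After "a", "grain" is more likely than "boy" in this corpus
print(bigram_prob("a", "grain"))  # 0.5
print(bigram_prob("a", "boy"))    # 0.25
```

Real language models condition on longer histories (n-grams or neural networks), but the idea is the same: score candidate word sequences so the decoder can prefer plausible ones.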
- The pronunciation model provides the link between phonemes and words.
- Pronunciation varies from person to person; the human speech production system creates these variations.
- Finally, the decoder takes the inputs from the different models (AM, LM, PM) and produces the output.
- The decoder block usually consists of a Hidden Markov Model (HMM).
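The core of HMM decoding is the Viterbi algorithm, which finds the most likely sequence of hidden states (here, phonemes) given the observed acoustic frames. The two-state model and its probabilities below are made up for illustration; real decoders work over far larger state spaces in log-probability space.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence (Viterbi)."""
    # Best probability of each state after the first observation
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor for state s at time t
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Backtrack from the best final state
    state = max(V[-1], key=V[-1].get)
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]

# Hypothetical 2-phoneme HMM decoding three acoustic frames
states = ["P", "AH"]
start_p = {"P": 0.6, "AH": 0.4}
trans_p = {"P": {"P": 0.3, "AH": 0.7}, "AH": {"P": 0.2, "AH": 0.8}}
emit_p = {"P": {"f1": 0.9, "f2": 0.1}, "AH": {"f1": 0.2, "f2": 0.8}}
print(viterbi(["f1", "f2", "f2"], states, start_p, trans_p, emit_p))
# ['P', 'AH', 'AH']
```

In an ASR decoder, the transition probabilities come from the language and pronunciation models while the emission probabilities come from the acoustic model, and the recovered state path is read off as the recognized words.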
Software Based on Speech Recognition
Voice Assistants: Google Now, Apple’s Siri, Amazon Alexa, etc.
Research Options in Speech Recognition Systems
- Developing models that are robust to variations in age and regional accent.
- Creating systems that can handle noisy real-life settings, such as speech in conferences where a crowd produces a lot of background noise.
- Designing efficient models that can cope with variations in pronunciation and work on new languages as well.
In summary, we have discussed what speech recognition is, the challenges faced in real-time implementation, its key elements, and software based on speech recognition. In the next tutorial, I will walk you through the decoder's functioning.
Let us know through the comment if it was helpful for you! Happy Learning!
If you want to build a sentiment analysis classifier as well, refer to the following blog.