Navigating the Evolution of Automatic Speech Recognition (ASR)


Automatic Speech Recognition (ASR) is a rapidly advancing technology of immense relevance in today’s digital age. As a voice-based technology, it makes human-computer interaction more accessible. The inner workings of ASR can be complex, however, so in this post we aim to demystify the concept, present it in a simple and understandable manner, and give an overview of the different eras of ASR and what led to the transitions between them.

The Evolution of Automatic Speech Recognition

In the ever-evolving landscape of technology, the journey of Automatic Speech Recognition (ASR) has been nothing short of remarkable. Over the years, numerous approaches have been explored and refined, each contributing to the advancement of this pivotal technology. In this blog post, we delve into the fascinating evolution of ASR, shedding light on the key milestones and transitions that have shaped its trajectory.

Statistical Approach  

Hidden Markov Models (HMMs) were one of the earliest fundamental approaches to speech recognition. They were embedded within the HCLG (HMM – Context – Lexicon – Grammar) framework, which decomposed the problem into distinct components, each addressing a specific aspect of the task. With advancements in hardware and the flourishing domain of Big Data, however, the field of ASR transitioned into the end-to-end era, where the individual components were consolidated into a single Deep Learning model.

End-to-end Approach  

Neural networks have revolutionized speech recognition, marking a significant milestone in technological advancement. By leveraging large datasets and powerful computational capabilities (e.g. GPUs), they have significantly impacted applications like voice assistants and transcription services. Systems can now handle a wide range of languages, accents, and speech variations more effectively, often approaching human-level accuracy. Notably, the year 2017 heralded a transformative era in Deep Learning with the introduction of the Transformer architecture, which swiftly found its application in ASR. However, as we navigate the landscape of ASR, we must ponder: are Transformers the ultimate solution, or are there yet uncharted territories awaiting exploration? In the ensuing sections of this blog post, we will embark on a journey into the intricacies of ASR, unraveling its inner workings and exploring the possibilities that lie ahead.

Unveiling the Spectrum of Automatic Speech Recognition: From HMMs to Whisper and Deep Learning Techniques

Automatic Speech Recognition (ASR) involves a series of complex steps that enable machines to convert spoken language into written text. To understand the fundamental mechanisms of ASR, let’s delve into key components, focusing on two main approaches: Hidden Markov Models (HMMs) and Deep Learning.

Hidden Markov Models (HMMs)

Speech recognition systems comprise two primary components: the Acoustic Model (AM) and the Language Model (LM). Initially, this division was finely delineated within the HCLG (HMM – Context – Lexicon – Grammar) framework. HCLG is a graph constructed by composing Grammar, Lexicon, Context, and HMM Weighted Finite-State Transducers (WFSTs). In this graph, vertices represent symbols that are emitted as the graph is traversed, together forming a sentence, while the edges connecting the vertices carry costs (penalties) that are incurred whenever an edge is taken. Solving for the least costly path through the graph yields the best transcript for the input audio.
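To make the “least costly path” idea concrete, here is a minimal sketch in Python: a toy word lattice whose edge costs stand in for negative log-probabilities, searched with Dijkstra’s algorithm. The graph, words, and costs are invented for illustration and are far simpler than a real HCLG transducer built with a toolkit such as Kaldi/OpenFst.

```python
import heapq

# A toy decoding lattice: vertices are word hypotheses, edges carry costs
# (standing in for negative log-probabilities). This is NOT a real HCLG graph,
# only an illustration of "cheapest path = best transcript".
edges = {
    "<start>": [("i", 1.0), ("eye", 2.5)],
    "i":       [("scream", 2.0), ("see", 0.8)],
    "eye":     [("see", 1.5)],
    "scream":  [("<end>", 0.5)],
    "see":     [("<end>", 0.5)],
}

def best_path(graph, start="<start>", end="<end>"):
    """Dijkstra search for the cheapest path through the lattice."""
    heap = [(0.0, start, [])]
    visited = set()
    while heap:
        cost, state, path = heapq.heappop(heap)
        if state == end:
            return cost, path
        if state in visited:
            continue
        visited.add(state)
        for nxt, edge_cost in graph.get(state, []):
            heapq.heappush(heap, (cost + edge_cost, nxt, path + [nxt]))
    return float("inf"), []

cost, words = best_path(edges)
print(cost, " ".join(w for w in words if w != "<end>"))  # -> 2.3 i see
```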

The Acoustic Model (AM) of the HCLG system consists of the HMM, Context, and Lexicon WFSTs, which together model speech as a sequence of phonemes. The Grammar WFST hosts the Language Model (LM), which captures the likelihood of word sequences. Notably, the AM and LM can be trained separately and combined using WFST operations, which makes this setup highly convenient and powerful in low-resource ASR settings.
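The convenience of keeping the AM and LM separate comes down to a simple log-linear combination of their scores. The sketch below uses invented costs (negative log-probabilities) and an invented LM weight to show how the language model can overturn an acoustically preferred hypothesis; real systems obtain these scores from the trained models and tune the weight on held-out data.

```python
# Hypothetical scores for two competing hypotheses (lower cost is better).
# The values and the LM weight are invented for illustration only.
hypotheses = {
    "recognize speech":   {"am_cost": 12.0, "lm_cost": 3.0},
    "wreck a nice beach": {"am_cost": 11.5, "lm_cost": 7.5},
}
lm_weight = 1.0  # relative importance of the language model

def total_cost(h):
    # Log-linear combination: acoustic cost plus weighted language-model cost.
    return h["am_cost"] + lm_weight * h["lm_cost"]

best = min(hypotheses, key=lambda k: total_cost(hypotheses[k]))
print(best)  # -> "recognize speech": the LM rescues the acoustically weaker hypothesis
```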

Deep Learning

Deep Speech emerged as a significant advancement within the realm of end-to-end ASR. This architecture simplified ASR system development for diverse use cases, circumventing the complexity associated with traditional methods. Deep Speech’s architecture does not require hand-designed components to address background noise, reverberation, or speaker variation; instead, it directly learns a function that is robust to such effects. Furthermore, it eliminates the need for a phoneme dictionary, revolutionizing the conventional approach to ASR. This raises the question: how can such a “simple” architecture perform ASR without requiring the basic units of speech (phonemes)? Before diving into that, it is important to mention that the architecture alone was not enough to compete with the complex HCLG systems; it required significantly more data to reach competitive accuracy.

Connectionist Temporal Classification (CTC) is the objective function optimized in a setup like Deep Speech’s. So how does CTC help answer that question? In ASR, two sequences are at play, and if a system learns to map one onto the other, it has learned to solve the ASR problem. In HCLG, these were the sequence of input speech frames and the sequence of phonemes. The CTC loss is designed so that the second sequence can instead consist of characters, sub-words, or words. A graph similar to HCLG is constructed, but with character states and a few special states, and as the CTC loss is optimized the system learns to assign the correct states to different sections of the speech input; in other words, the model learns to map speech directly to characters, sub-words, or words. In an HCLG system, the sequential nature of speech is modeled by the graph structure; Recurrent Neural Networks (RNNs) can do the same, but much more competently, because they can (in theory) remember the entire history of a sequence, whereas an HMM remembers only the previous state. Deep Speech combined recurrent layers with standard feed-forward blocks, fed the network a large amount of training data, and left the alignment to the CTC loss.
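As a hedged sketch of how this looks in practice, the snippet below computes the CTC loss with PyTorch’s torch.nn.CTCLoss on random placeholder outputs. In a real system the log-probabilities would come from an acoustic model (an RNN or Transformer) run over audio features; the vocabulary size and sequence lengths here are assumptions chosen purely for illustration.

```python
import torch
import torch.nn as nn

# Minimal CTC training step on placeholder data.
# Shapes follow torch.nn.CTCLoss: log_probs is (time, batch, classes).
vocab_size = 29              # e.g. blank + 26 letters + space + apostrophe
time_steps, batch = 100, 2   # 100 acoustic frames, 2 utterances

log_probs = torch.randn(time_steps, batch, vocab_size).log_softmax(dim=-1)
log_probs.requires_grad_(True)

# Target character indices (index 0 is reserved for the CTC blank).
targets = torch.randint(low=1, high=vocab_size, size=(batch, 20))
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # CTC handles the alignment internally; we only supply lengths
print(loss.item())
```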

Transformers and Whisper  

We mentioned that RNNs could, in theory, model infinite history. In practice, this is far from true: problems such as vanishing gradients cause RNNs to forget parts that appeared earlier in a long sequence. On top of that, training RNNs is slow, because they consume sequences one element at a time, which increases complexity and training time. Transformers were introduced to alleviate both issues. They consume a sequence all at once, i.e., all elements are processed in parallel. Since all elements of the sequence are visible at all times, there is no forgetting, and the parallel processing means that learning proceeds via simple backpropagation instead of backpropagation through time, as is the case for RNNs.
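The mechanism that makes this parallel processing possible is self-attention, described next. As a preview, here is a minimal, single-head sketch of scaled dot-product attention; the learned query/key/value projections of a real Transformer are omitted, and the input is random placeholder data rather than real speech features.

```python
import torch

def self_attention(x):
    """Minimal (single-head, unmasked) scaled dot-product self-attention.

    Every position attends to every other position in a single matrix multiply,
    which is what lets a Transformer process the whole sequence in parallel.
    """
    d = x.size(-1)
    q, k, v = x, x, x                              # real models use learned projections
    scores = q @ k.transpose(-2, -1) / d ** 0.5    # pairwise similarities between positions
    weights = scores.softmax(dim=-1)               # how strongly each position attends to the others
    return weights @ v                             # context-aware representation per position

# A toy "sequence" of 6 frames with 8 features each (random placeholders).
frames = torch.randn(1, 6, 8)
print(self_attention(frames).shape)  # -> torch.Size([1, 6, 8])
```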

Transformers deal with sequences using self-attention. This mechanism extracts the information present in an element by considering it in the context of the entire sequence: it computes similarities (relations) with the other elements and thereby captures meaningful information about the sequence as a whole. Transformers make it possible to feed huge amounts of task-specific data into a model with a very large number of parameters.

CTC is still relevant in the Transformer age; however, it has never been the only way to align two sequences of different lengths. Between Deep Speech and the Transformer, the Encoder-Decoder framework was developed. The Encoder consumes the input sequence without having to produce the output sequence; that job is outsourced to the Decoder, and the two networks are connected via a mechanism similar to self-attention, called cross-attention. Via cross-attention, the system computes similarities between elements of the output and the input sequence and learns the required alignment. Since Transformers have proven superior to RNNs, today’s Encoder-Decoders are composed of Transformers. And this brings us to the Whisper speech recognition system. Whisper is currently the state of the art in the field, and a gross simplification would be to say that it is a Transformer Encoder-Decoder model trained on 680,000 hours of data. In comparison, Deep Speech used 5,000 hours of training data, while in the HCLG days the amount of data was around a few hundred hours. Since the introduction of the Transformer, the trend has been to scale both the number of model parameters and the amount of training data.
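Using a pretrained Whisper checkpoint has become quite accessible. The snippet below is a sketch based on the Hugging Face transformers library, assuming it (and ffmpeg for audio decoding) is installed; the checkpoint name and the audio file path are placeholders, and any Whisper checkpoint with a 16 kHz mono recording should work similarly.

```python
from transformers import pipeline

# Load a pretrained Whisper checkpoint as an ASR pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",   # smaller checkpoints trade accuracy for speed
)

# Transcribe a local audio file (placeholder path).
result = asr("example_recording.wav")
print(result["text"])
```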

In the next section, we will explore the challenges faced by ASR systems that have not been resolved by scaling parameters and data.

Facing the Challenges in Automatic Speech Recognition (ASR) Technology

In the realm of Automatic Speech Recognition (ASR) technology, we encounter a multitude of challenges that impact the accuracy and effectiveness of these systems. This section delves into three central challenges: Robustness, Out-of-vocabulary Words, and Hallucinations. These obstacles shed light on the complex aspects of ASR technology, highlighting the advancements needed to further enhance the performance of these systems.

1. Robustness 

Robustness in Automatic Speech Recognition (ASR) refers to the system’s ability to accurately transcribe speech under various conditions, including different accents and language variations. Accents and language diversity present a significant challenge to ASR systems due to the diverse ways in which individuals pronounce words and phrases.

Different accents and dialects introduce variations in pronunciation and acoustic characteristics, such as pitch, rhythm, and intonation patterns. For example, speakers with regional accents may pronounce words differently or use distinct speech patterns compared to standard pronunciation. Additionally, speakers of different languages may have different phonetic inventories or speech sounds, further complicating the recognition process. These variations pose challenges for ASR systems, as they must be trained to recognize and interpret speech from a diverse range of speakers and linguistic backgrounds.

To address this challenge, researchers are developing ASR systems that are more robust and adaptable to diverse speech patterns and linguistic variations. This may involve collecting and annotating speech data from a wide range of speakers and languages to improve the system’s performance across different accents and dialects. Overall, improving the robustness of ASR systems is crucial for enhancing their performance in real-world applications and ensuring accurate transcription across diverse linguistic and cultural contexts.

2. Out-of-vocabulary Words  

Out-of-vocabulary (OOV) words pose a significant challenge for Automatic Speech Recognition (ASR) systems, as they are words that the system has not encountered during training. These words may include new terms, names, or specialized vocabulary that are not present in the system’s lexicon or training data.

When faced with OOV words, ASR systems may struggle to accurately recognize and transcribe them, leading to errors in the transcription output. This is because the system lacks the necessary linguistic information or context to correctly identify and interpret these unfamiliar words. One common scenario where OOV words occur is in conversational speech or domain-specific content, where speakers may use slang, jargon, or technical terms that are not part of the system’s vocabulary. For example, in a medical setting, clinicians may use specialized terminology that is not commonly found in general speech or written text.

Addressing the challenge of OOV words requires strategies to improve the robustness and adaptability of ASR systems. One approach is to continually update and expand the system’s lexicon and training data to include a broader range of vocabulary, including OOV words encountered in real-world usage. This may involve incorporating domain-specific dictionaries or datasets to cover specialized terms and jargon. Another approach is to handle OOV words during the recognition process itself: researchers are exploring techniques such as phonetic similarity or word embeddings to map OOV words to similar words or concepts in the system’s vocabulary, as sketched below.
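The sketch below is a toy stand-in for such mapping: it uses plain string similarity from Python’s standard library in place of phonetic distances or learned embeddings, and the medical lexicon and cutoff value are invented for illustration.

```python
import difflib

# Invented domain lexicon; a real system would use its full recognition vocabulary.
lexicon = {"patient", "diagnosis", "antibiotic", "prescription", "treatment"}

def map_oov(word, vocabulary, cutoff=0.7):
    """Return the closest in-vocabulary word, or the word itself if none is close enough."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(map_oov("antibotic", lexicon))   # -> "antibiotic" (a plausible misrecognition is repaired)
print(map_oov("neurology", lexicon))   # -> "neurology" (no close match, kept as-is)
```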

Generally speaking, addressing the challenge of OOV words is essential for improving the accuracy and performance of ASR systems, particularly in real-world scenarios where speakers may use diverse vocabulary and language variations. By developing robust techniques for handling OOV words, ASR systems can better adapt to the dynamic nature of spoken language and provide more accurate and reliable transcription outputs.

3. Hallucinations 

Hallucinations in Automatic Speech Recognition (ASR) systems refer to the phenomenon where the system transcribes text that is not actually present in the audio file. Despite the impressive performance of current state-of-the-art architectures like Whisper, hallucinations remain a persistent challenge in ASR technology.

Hallucinations can occur due to various factors, but one of the primary causes is the presence of poor-quality or corrupted samples in the training data. When ASR systems are trained on data that contains inaccuracies, background noise, or biases, they may inadvertently learn to transcribe nonexistent words or phrases. These hallucinations can significantly degrade the accuracy and reliability of the ASR system, leading to erroneous transcription outputs. The presence of hallucinations underscores the critical importance of high-quality training data in ASR. Training data should be carefully curated and preprocessed to remove any inaccuracies, distortions, or artifacts that could potentially lead to hallucinations. Additionally, robust quality control measures should be implemented to ensure that the training data accurately reflects real-world speech patterns and characteristics.

To mitigate the impact of hallucinations, researchers are developing techniques to improve the robustness and resilience of ASR systems. This may include incorporating advanced signal processing algorithms to filter out noise and distortions, as well as researching new model architectures that are more robust to variations in speech quality. Furthermore, ongoing research is focused on developing novel evaluation metrics and benchmarking procedures to assess the performance of ASR systems in terms of hallucination detection and mitigation. By systematically evaluating and addressing the issue of hallucinations, researchers aim to improve the overall accuracy and reliability of ASR systems, ultimately enhancing their utility and effectiveness in real-world applications.

In summary, advancements in deep learning, language modeling, and large-scale datasets have significantly improved ASR accuracy and performance, but the challenges of robustness, out-of-vocabulary words, and hallucinations still need to be solved.
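As an illustration of what simple hallucination screening can look like, the sketch below flags two common symptoms of hallucinated output: transcripts that loop over the same phrase and transcripts that are implausibly long for the audio duration. This is a toy heuristic with invented thresholds, not an established evaluation metric.

```python
from collections import Counter

def flag_suspicious(transcript, audio_seconds, max_chars_per_sec=25, max_ngram_repeat=4, n=3):
    """Toy heuristic: flag looping phrases or implausibly dense transcripts."""
    words = transcript.split()
    # Count word n-grams; a heavily repeated n-gram suggests a looping hallucination.
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    looping = any(count > max_ngram_repeat for count in ngrams.values())
    # Too many characters for the audio duration also suggests invented text.
    too_dense = len(transcript) > max_chars_per_sec * audio_seconds
    return looping or too_dense

print(flag_suspicious("thank you thank you thank you thank you thank you thank you", 2.0))  # True
print(flag_suspicious("the patient was discharged on friday", 3.0))                          # False
```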

Conclusion 

The evolution of Automatic Speech Recognition (ASR) technology from its nascent stages to the present era represents a remarkable journey marked by transformative advancements. From simple systems relying on statistical approaches to sophisticated neural networks and transformers, ASR has undergone a profound transformation, driven by the quest for seamless human-computer interaction.

The journey underscores the relentless pursuit of innovation and improvement in ASR technology, with current Deep Learning-based systems standing as a testament to the remarkable progress achieved. These advanced ASR systems excel at adapting to diverse settings and linguistic contexts, paving the way for enhanced accessibility and usability across various applications.

At Fraunhofer IAIS and Lamarr Institute, we are committed to addressing the challenges inherent in ASR technology while continuously enhancing model capabilities. Through rigorous research and development efforts, we strive to overcome obstacles such as robustness, out-of-vocabulary words, and hallucinations, enabling the seamless integration of ASR into daily lives. Our aim is to empower individuals with greater accessibility through voice technology, promising a future where ASR serves as a ubiquitous and indispensable tool for communication and interaction.

