The Saxon State Parliament is committed to enhancing the accessibility of its plenary sessions to meet the societal need for barrier-free access. To achieve this, the Parliament has introduced Live Automatic Speech Recognition (ASR) Software, a significant step towards ensuring inclusivity. This technology enables real-time conversion of spoken language into digital text, providing a crucial tool for live captioning. Live ASR not only facilitates greater participation for the hearing-impaired by allowing them to read what is spoken during debates but also signifies a broader commitment to digital inclusiveness within the legislative process.
In this blog article, we will explore the specific needs that prompted the Saxon State Parliament to adopt ASR technology. Detailed discussions will follow on how the software was customized to accommodate the unique vocabulary of state parliaments and the Saxon dialect, its integration within the parliamentary infrastructure, and the broader digitalization benefits that emerged from its implementation.
Why barrier-free plenary sessions are important: The Saxon State Parliament as an example
Accessible plenary sessions are crucial for ensuring equal participation in parliamentary debates. This involves providing transcripts alongside live streams to accommodate individuals with hearing impairments.
Mr. Kindler, head of the IT department at the Saxon State Parliament, highlighted the need for automated live subtitling that accurately reflects the diverse dialects and specialized terminology used by representatives. Transcribing the Saxon dialect together with precise political and legal terms called for software that can handle various dialects while maintaining a low Word Error Rate (WER), roughly the share of words that are recognized incorrectly. In addition, data security and control required an on-premise deployment of the software. Fraunhofer IAIS was chosen to provide Live-ASR Software tailored to these requirements.
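To make the Word Error Rate more tangible, here is a minimal sketch of how it is commonly computed: the number of word substitutions, deletions and insertions needed to turn the recognized text into the reference text, divided by the number of reference words. The function name and the example sentences are our own illustration and are not taken from the deployed system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing in hypothesis (deletion)
                curr[j - 1] + 1,     # extra word in hypothesis (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one of five reference words is recognized incorrectly.
wer = word_error_rate("the parliament passes the law",
                      "the parliament passed the law")
```

In this example one word out of five is wrong, so the WER is 0.2, or 20 percent. Because insertions count as errors too, the WER can exceed 1.0 for very poor transcripts.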
The ASR Software
In response to these requirements, the Saxon State Parliament collaborated with Fraunhofer IAIS. The Live-ASR Software, developed by Fraunhofer IAIS, was refined to meet the challenges of real-time transcription in intricate linguistic contexts. Leveraging expertise in linguistic analysis and machine learning, the software effectively interpreted the nuances of the Saxon dialect and transcribed parliamentary terminology accurately. This solution facilitated live captioning and contributed to digitalization efforts within the Saxon State Parliament.
The Live Automatic Speech Recognition (ASR) technology, developed by scientists of Lamarr’s Partner Institution Fraunhofer IAIS, specializes in real-time conversion of speech into digital text, even in challenging environments with background noise or dialects. The technology supports German and English and is adaptable for specific applications. It offers benefits such as on-premise or cloud-based deployment, customization for dialects or specialized vocabulary, and high data security. It can be employed in various sectors, including parliaments, the broadcast industry, and voice assistants in call centers and other voice-controlled applications.
The ASR works by capturing an audio signal, in this case the speaker at a parliamentary debate. The signal is digitized, and the digital signal is mapped to phonetic units. This is done using a hybrid acoustic model consisting of a neural network and hidden Markov models (HMM); more information can be found in our previous blog post. The hybrid model outputs phonemes, for example /k/, /æ/, and /t/. Two further components then turn these phonemes into words: a lexicon, which contains words and their phonetic representations, and a language model, which predicts the probability of a word given its context, e.g. the previously recognized words. With both, the phonemes in the example could be recognized as “cat.” This all happens in less than half a second, with minimal hardware requirements.
Newer state-of-the-art model architectures exist as well, as described in our previous blog post. However, the described architecture was chosen for its minimal resource requirements: a single CPU core is enough for real-time recognition while maintaining very high accuracy. The interplay of acoustic model, lexicon and language model is shown in the following graphic:
Another advantage of the described architecture is its easy adaptability: each model can be trained separately, allowing an optimal adaptation to the use case. The acoustic model was trained on a huge collection of audio recordings from a variety of environments, with different noise conditions and accents, together with the corresponding (phonetic) transcriptions. This results in an acoustic model that is very robust, in particular against noise. The phonetic lexicon can be updated on a regular basis using text and corresponding phonetic representations from a variety of sources, either automatically created or manually curated. This simple lexicon update allows the recognizer to incorporate newly appearing words without requiring acoustic data or retraining of the acoustic model. In particular, it allowed us to incorporate the special political and local terms. Finally, the language model is trained on a huge corpus of text data, requiring no acoustic recordings or phonetic representations. This made it easy to incorporate special political and local phrases from protocols of previous parliamentary meetings, political texts, and other local sources such as newspapers. All these adaptations ensure a high level of quality for the deployed system and enable regular updates, especially of the lexicon and language model, to incorporate new, up-to-date terms and phrases. This adaptation is further described in the next section.
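The division of labor between lexicon and language model, and the effect of a lexicon update, can be illustrated with a small toy sketch. Everything in it is an assumption for illustration: the phoneme symbols, the probabilities, the function name `recognize` and the simplification that the phonemes of exactly one word are already known. The real system decodes continuous speech and is considerably more involved.

```python
from typing import Dict, Optional, Tuple

Phonemes = Tuple[str, ...]

# Hypothetical toy lexicon: word -> phonetic representation (symbols are made up).
lexicon: Dict[str, Phonemes] = {
    "cat": ("k", "ae", "t"),
    "right": ("r", "ay", "t"),
    "write": ("r", "ay", "t"),  # same pronunciation as "right"
}

# Hypothetical language model: P(word | previous word) for a few word pairs.
bigram_prob: Dict[Tuple[str, str], float] = {
    ("the", "cat"): 0.010,
    ("the", "right"): 0.010,
    ("to", "write"): 0.020,
    ("to", "right"): 0.001,
    ("the", "landtag"): 0.005,  # known from text only, no audio needed
}


def recognize(phonemes: Phonemes, previous_word: str,
              floor: float = 1e-6) -> Optional[str]:
    """Pick the most probable word for a phoneme sequence, or None if unknown."""
    # Lexicon step: all words whose pronunciation matches the phonemes.
    candidates = [word for word, pron in lexicon.items() if pron == phonemes]
    if not candidates:
        return None
    # Language model step: prefer the candidate that best fits the context.
    return max(candidates,
               key=lambda word: bigram_prob.get((previous_word, word), floor))


# The same phonemes lead to different words depending on the context.
after_to = recognize(("r", "ay", "t"), "to")    # "write"
after_the = recognize(("r", "ay", "t"), "the")  # "right"

# A word missing from the lexicon cannot be recognized ...
new_pronunciation = ("l", "a", "n", "t", "a", "k")
before_update = recognize(new_pronunciation, "the")  # None
# ... until its pronunciation is added. The acoustic model is not touched.
lexicon["landtag"] = new_pronunciation
after_update = recognize(new_pronunciation, "the")   # "landtag"
```

The point of the sketch is the last step: adding a single lexicon entry is enough to make a new term recognizable, which is why vocabulary updates are cheap in this kind of architecture.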
Adaptations
In implementing the ASR software for transcription in the Saxon State Parliament, a significant challenge arose in accurately transcribing the Saxon dialect and parliamentary vocabulary. This required extensive adaptation and reevaluation of the ASR system, utilizing text protocols and video recordings of previous parliamentary debates. (For those seeking further insights into ASR model training, our colleagues have covered this topic extensively in a previous blog post.) The text protocols, containing vocabulary specific to parliamentary contexts, served as ideal training data. By analyzing existing video recordings and their corresponding protocols, the ASR system underwent iterative refinement during and after training.
This iterative process involved fine-tuning the software with a diverse range of parliamentary speeches, resulting in a customized vocabulary tailored to the subjects discussed in debates. This included the names of parliamentarians, specialized political and legal terms, and even local Saxon terminology. Fine-tuning entailed providing substantial amounts of text and, at times, audio data to the training algorithm. Analogous to human learning from text and audio examples, the training algorithm taught the speech recognition model to better comprehend specific terms and accents.
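How additional text alone can shift a language model towards parliamentary vocabulary can be shown with a deliberately simple sketch. It uses a bigram model with add-one smoothing, which is our own simplification; the type of language model used in the deployed system is not specified here, and the sentences, function names and variable names are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Counts = Tuple[Counter, Counter]


def tokenize(sentence: str) -> List[str]:
    # "<s>" marks the start of a sentence.
    return ["<s>"] + sentence.lower().split()


def count_bigrams(sentences: Iterable[str]) -> Counts:
    """Count how often each word occurs as context and each word pair occurs."""
    context_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts


def probability(prev: str, word: str, counts: Counts, vocab_size: int) -> float:
    """P(word | prev) with add-one smoothing, so unseen pairs are never zero."""
    context_counts, bigram_counts = counts
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)


# Invented example texts: general language vs. parliamentary protocols.
general_text = ["the meeting is open", "the weather is nice"]
protocol_text = ["the state parliament is in session",
                 "the state parliament passes the budget"]

# Both models share one vocabulary so that their probabilities are comparable.
vocabulary = {tok for s in general_text + protocol_text for tok in tokenize(s)}

general_model = count_bigrams(general_text)
adapted_model = count_bigrams(general_text + protocol_text)

p_general = probability("state", "parliament", general_model, len(vocabulary))
p_adapted = probability("state", "parliament", adapted_model, len(vocabulary))
```

With these toy texts, the probability of "parliament" following "state" rises from 1/13 to 3/15 once the protocol sentences are included, without any audio data being involved.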
Following successful adaptation, Lamarr’s Partner Institution Fraunhofer IAIS initiated a testing phase of the software in collaboration with the Saxon State Parliament, preceding its full implementation.
Integrating Live-ASR Software: Implementation and Functionality
The speech recognition system was integrated into the existing recording and streaming infrastructure of the Saxon State Parliament. To achieve this, the speech recognition software was deployed as an on-premise installation on a server located in the computer center of the parliament. Audio/video streams from redundant streaming servers were then connected to the speech recognition system, which generates a transcript for the audio stream. The recognition results are delivered via a dedicated output connection from the speech recognition system to a text field on the live-stream website, alongside the live video. The text field keeps the whole transcript, so viewers can scroll back and follow the full context of a speech. The following graphic provides an overview of the complete system. A second option, adding the transcript directly to the audio/video stream, was planned and is shown in the graphic. However, only the text field output was implemented for the Saxon State Parliament, because the ability to scroll back and to see a longer text context was perceived as more usable by the users.
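The difference between the two output options can be sketched in a few lines. The class and method names below are hypothetical and do not reflect the actual interface of the Live-ASR Software; the sketch only shows why a text field can offer the full history while an in-video caption shows just the most recent lines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptBuffer:
    """Collects recognized text segments in the order they arrive."""
    segments: List[str] = field(default_factory=list)

    def append(self, text: str) -> None:
        # Ignore empty results, e.g. during pauses in the debate.
        text = text.strip()
        if text:
            self.segments.append(text)

    def full_text(self) -> str:
        # Text field option: the whole transcript, so viewers can scroll back.
        return "\n".join(self.segments)

    def caption_window(self, max_segments: int = 2) -> str:
        # In-video option: only the most recent segments fit on screen.
        if max_segments <= 0:
            return ""
        return "\n".join(self.segments[-max_segments:])


# Invented recognition results arriving one after another.
transcript = TranscriptBuffer()
for result in ["The session is opened.", "  ", "We begin with item one.",
               "The first speaker has the floor."]:
    transcript.append(result)

text_field = transcript.full_text()
caption = transcript.caption_window(max_segments=2)
```

Here `text_field` contains all three non-empty segments, while `caption` only contains the last two, which mirrors the longer text context that made the text field the preferred option.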
Significance and Future Trends: The Effects of Live-ASR Implementation
Implementation of the Live-ASR Software has had significant implications, particularly for hearing-impaired individuals, by providing real-time captioning of plenary sessions. This capability not only enhances immediate understanding and engagement for attendees but also extends to a broader audience through live streaming services, ensuring inclusivity and transparency in governance. It’s worth noting that while the Saxon State Parliament implemented the live ASR by choice, the European Union passed the European Accessibility Act (EAA) in 2019, aiming to make products and services more accessible for persons with disabilities and the elderly. This includes access to audio-visual media services, thereby anticipating a potential increase in demand for ASR implementations to meet legal requirements by 2030.
Additionally, an unforeseen benefit emerged from the implementation, beyond the increased accessibility for viewers of the debates. As Mr. Kindler mentions, the excellent text recognition capabilities have also been used for creating reports of the debates, potentially streamlining the generation of minutes for meetings and committees of the Saxon State Parliament and thereby reducing personnel effort. This shows the versatility an ASR solution can offer: because the quality of the live ASR is so high, the Saxon State Parliament can use it for purposes beyond live captioning.
Conclusion – The Impact of Live-ASR Software in Governance
This project serves as an example for other legislative bodies, highlighting the transformative potential of integrating Fraunhofer IAIS Live-ASR software into governance. The on-premise deployment, the customization for dialects and specialized vocabulary, and the high quality of the software have made it a suitable solution for the parliament, not only demonstrating a commitment to inclusivity but also enhancing future operational efficiency. To end with the words of Mr. Kindler, the satisfaction derived from five years of productive usage underscores the efficacy of the Live-ASR system.
If you are interested in learning more about the Live-ASR developed by Fraunhofer IAIS Scientists, you can visit the website here.