How Audio Mining, Generative AI, and LLMs Are Transforming Media Archive Searches

Conceptual image of a library filled with digital assets like images, documents, and videos.
© Borin – stock.adobe.com

AI-Driven Solutions for Media Archives

In the rapidly growing digital landscape, media archives containing vast amounts of audio and video data – such as those from radio and TV broadcasts – require advanced tools for efficient content search and retrieval. The Fraunhofer IAIS Audio Mining System enables fast and efficient searches for spoken content and specific speakers. Several AI-based technologies are employed in this system: automatic speaker diarization (segmenting a file into different speaker sections), speaker recognition (identifying known speakers), and automatic speech recognition (ASR), i.e., the transcription of spoken content.

These technologies offer significant advantages, particularly for journalists who need to quickly search through archives to find quotes from past broadcasts. For instance, a journalist can search for the exact moment when a politician spoke about a certain topic, without the need to manually review hours of footage. However, keyword-based search methods face limitations when it comes to finding general topics or semantically related content.
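To make the limitation concrete, here is a minimal sketch of keyword search over segment-level metadata. The `Segment` fields, the `keyword_search` helper, and the sample data are hypothetical illustrations and not the actual data model of the Audio Mining System.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One speaker turn, as diarization, speaker recognition and ASR might describe it.
    All field names are assumptions made for this sketch."""
    file_id: str      # hypothetical identifier of the broadcast
    start_s: float    # segment start in seconds
    end_s: float      # segment end in seconds
    speaker: str      # label assigned by speaker recognition
    transcript: str   # text produced by ASR


def keyword_search(segments, keyword, speaker=None):
    """Return segments whose transcript contains the keyword (case-insensitive),
    optionally restricted to a single speaker."""
    needle = keyword.lower()
    return [
        s for s in segments
        if needle in s.transcript.lower()
        and (speaker is None or s.speaker == speaker)
    ]


# Invented sample data, for illustration only.
archive = [
    Segment("news_001", 12.0, 31.5, "Speaker A", "We will invest in electric cars."),
    Segment("news_001", 31.5, 48.0, "Speaker B", "Battery vehicles are too expensive."),
    Segment("talk_007", 5.0, 20.0, "Speaker A", "The budget debate continues."),
]

hits = keyword_search(archive, "electric cars")
```

In this toy archive, the search finds Speaker A's statement but misses Speaker B's remark about battery vehicles, which is semantically related yet shares no keyword with the query.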

Revolutionizing Archive Search with Generative AI

The rise of generative AI, particularly with tools like ChatGPT, has transformed how we interact with technology. Large Language Models (LLMs) allow for natural language queries, providing an intuitive interface for users to search through large datasets. Recognizing this, Fraunhofer IAIS developed a prototype that integrates generative AI with its media archive search system, creating an advanced tool that goes beyond simple keyword search.

Fraunhofer IAIS Audio Mining System
© Fraunhofer IAIS
Fig. 1: The original Fraunhofer IAIS Audio Mining System, with speaker recognition and transcription.

How the Audio Mining LLM Prototype Works: ASR, LLM, and RAG Technologies

The Audio Mining LLM Prototype combines the AI-generated metadata of the Fraunhofer IAIS Audio Mining System (which includes automatic segmentation of media files, speaker recognition, and ASR transcription) with a semantic search function powered by Retrieval-Augmented Generation (RAG). This allows users to ask detailed questions, such as “Did Tim Walz serve in the military?”. The system then retrieves and ranks the most relevant segments based on semantic similarity, using sentence embeddings to find the best matches.

For an in-depth look at how embeddings work, see the blog post on the meaning of words through vectors.
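The retrieval step can be sketched as follows. This is a minimal illustration and not the prototype's implementation: the `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model, and the function names and sample sentences are invented for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence-embedding model: a bag-of-words count vector.
    A real system would use dense vectors from a trained model instead."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, segments, top_k=5):
    """Score every segment against the question and return the top_k
    (score, segment) pairs, highest similarity first."""
    q = embed(question)
    scored = [(cosine(q, embed(s)), s) for s in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Invented transcript snippets, for illustration only.
segments = [
    "The minister announced new funding for rail transport.",
    "Tomorrow will be sunny with light wind.",
    "Rail transport was the main topic of the debate.",
]

results = retrieve("What did the minister say about rail transport?", segments, top_k=2)
```

With a real embedding model, segments that paraphrase the question without sharing its words would also score highly, which is the main advantage over keyword matching.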

© Fraunhofer IAIS
Fig. 2: The Audio Mining LLM Prototype answers the question “Did Tim Walz serve in the military?”. It gives a textual answer and lists the five most relevant segments based on semantic similarity.

The system uses a Large Language Model to formulate answers based on relevant content retrieved from the archive. Because all responses are grounded in verifiable archive data, the risk of “hallucinations” (cases where large language models generate false information) is minimized. This makes the prototype well suited to reliable, fact-based media archive searches, rather than to answering general knowledge questions as tools such as ChatGPT do.
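One common way to ground an LLM's answer in retrieved content is to place the retrieved segments directly in the prompt and instruct the model to answer only from them. The sketch below assumes this approach. The prompt wording, the `build_grounded_prompt` helper, and the segment fields are hypothetical, and the call to an actual LLM is deliberately left out.

```python
def build_grounded_prompt(question, retrieved_segments):
    """Assemble a prompt that asks the model to answer only from the given segments.
    Each segment is a dict with hypothetical keys 'speaker', 'source' and 'text'."""
    context_lines = [
        f"[{i}] {seg['speaker']} ({seg['source']}): {seg['text']}"
        for i, seg in enumerate(retrieved_segments, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the archive segments below. "
        "Cite segment numbers, and say so if the segments do not contain the answer.\n\n"
        f"Archive segments:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Invented example data, for illustration only.
retrieved = [
    {"speaker": "Speaker A", "source": "news_001", "text": "We will invest in rail."},
    {"speaker": "Speaker B", "source": "talk_007", "text": "Rail funding is too low."},
]

prompt = build_grounded_prompt("What was said about rail funding?", retrieved)
```

Numbering the segments lets the answer point back to specific archive passages, so a user can check each claim against the original recording.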

Key Applications of the Audio Mining LLM Prototype in the Media Industry

Traditionally, keyword search makes it relatively easy to find out whether someone mentioned a particular topic, for example every segment in which Kamala Harris mentioned electric cars. Finding out which stance the speaker took on that topic is much harder. The Audio Mining LLM Prototype offers significant potential for various applications, particularly in the media industry. For example, investigative journalists can use the system to quickly search for specific statements or opinions expressed by public figures, such as instances where Kamala Harris spoke negatively about electric cars, something that goes beyond a simple keyword search.

This is where the Audio Mining LLM Prototype stands out. The system transcribes every spoken word, allowing it to capture even subtle opinions and details without relying on manually annotated metadata. The RAG approach ensures that it pulls relevant segments from the archive, uncovering insights that traditional search methods might miss.

Another application is in public video-on-demand platforms, where users could benefit from a system that not only allows searches by title or genre but also provides personalized recommendations based on individual preferences or specific questions. This could significantly enhance the user experience, offering new ways to interact with media archives.

Challenges and Limitations of the Current Prototype

While the Audio Mining LLM Prototype shows promising results, it remains a prototype with room for improvement. One of the main challenges lies in video and audio segmentation, where short, irrelevant segments may be returned, reducing the usefulness of the search results. Future advancements in contextual retrieval systems could address this limitation, improving segment quality and relevance.

© Fraunhofer IAIS
Fig. 3: The Audio Mining LLM Prototype sometimes returns segments that are irrelevant to the question, for example the last two segments returned in this screenshot.

Another current limitation is the system’s inability to conduct quantitative analysis. For instance, the prototype cannot count how many times a topic was mentioned by a specific person or provide a comprehensive list of all mentions. If you ask the system to list all supporters and opponents of electric vehicles based on the video archive, it won’t provide a full list because the current configuration of the RAG only selects the five most relevant snippets. Future updates could involve fine-tuning the RAG system to handle more complex analytical tasks, including counting occurrences and generating lists based on the entire archive.
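The effect of the fixed top-five cutoff can be illustrated with a small sketch. The similarity scores and segment names below are invented, and `top_k_segments` is a hypothetical helper rather than part of the prototype.

```python
def top_k_segments(scored_segments, k=5):
    """Keep only the k highest-scoring (score, segment) pairs,
    as a fixed-size retrieval step would."""
    return sorted(scored_segments, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical similarity scores for eight segments that all mention the topic.
scored = [(0.9 - 0.05 * i, f"segment_{i}") for i in range(8)]

seen_by_llm = top_k_segments(scored, k=5)

mentions_in_archive = len(scored)       # what a full count would need to see
mentions_visible = len(seen_by_llm)     # what the LLM actually receives
```

Here the archive contains eight relevant segments, but the model only ever sees five of them, so any count or list it produces from this context is necessarily incomplete.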

Future Directions: Expanding the Capabilities of Audio Mining with LLMs

Following its successful debut as a proof of concept at the International Broadcasting Convention 2024 (IBC), the Audio Mining LLM Prototype has received positive feedback from the media industry. Looking ahead, future developments will focus on tailored solutions for use cases such as quantitative media analysis, news verification, and enhanced search interfaces for video-on-demand services.

At Fraunhofer IAIS, we are continuously exploring how to adapt RAG technology for specific industry needs, enhancing the power and usability of media archives for journalists, researchers, and the public. We expect to complete several RAG-related projects by the end of this year, with the results to be shared on the ML Blog by the Lamarr Institute, so stay tuned!

Dr. Christoph Schmidt

Dr. Christoph Schmidt is Head of the Business Unit “Speech Technologies” at Fraunhofer IAIS in Sankt Augustin. He researches topics such as automatic speech recognition, speaker recognition, and large language models / generative AI in the media industry.

Joran Poortstra

Joran Poortstra holds a Master’s degree in Finance from the University of Lund and a Master’s degree in Economics from the University of Bonn. He currently serves as a Business Developer in the Speech Technologies Team at Fraunhofer IAIS. In this role, he is responsible for identifying new customers and ensuring that the Fraunhofer Speech Technologies remain competitive in the market. Additionally, he provides technical consulting for customers, ensuring their […]
