Language technologies already help many people and businesses make their everyday work more efficient. Thanks to Machine Learning (ML), natural language processing has reached a very advanced level. Large language models, also known as foundation models, are evolving rapidly and can already perform demanding tasks such as generating computer programs and newspaper articles, while also taking different media into account. In this post, we take a behind-the-scenes look at language technologies, which can also be regarded as contributions to Artificial Intelligence (AI).
Foundation Models as the basis of AI systems
If a sentence begins with the words “The dog,” many different words can appear in the next position, such as “ran” or “barked.” Many other words, on the other hand, are not possible there for syntactic or content-related reasons, such as “green” or “gladly.” So-called language models have therefore been defined to calculate the probability of the possible next words; they can be formalized as conditional probabilities: p(v3 | v1 = The, v2 = dog). In our example, the words “ran” and “barked” should receive a high conditional probability, while “green” and “gladly” should receive a probability close to zero.
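To make this concrete, the following minimal sketch queries a small pretrained language model (GPT-2 via the Hugging Face transformers library, chosen here purely for illustration) for the conditional probabilities of a few candidate next words after the starting text “The dog”:

```python
# Minimal sketch: conditional next-word probabilities p(v3 | v1="The", v2="dog"),
# using the small GPT-2 model only as an accessible stand-in for a language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The dog", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, num_tokens, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token

# Compare a few candidate continuations (note the leading space in GPT-2 tokens);
# if a word splits into several tokens, only its first piece is looked up here.
for word in [" ran", " barked", " green", " gladly"]:
    token_id = tokenizer.encode(word)[0]
    print(f"p({word.strip()!r} | 'The dog') = {probs[token_id].item():.6f}")
```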
How can such conditional probabilities be captured by a model? Deep neural networks with association modules (attention modules) have proven particularly suitable for predicting conditional probabilities. In this approach, the words of the language are represented by a limited number of tokens: common words have their own tokens, while rarer words are composed of several tokens. Each token is represented by a context-dependent embedding vector that captures the meaning of the token. By taking neighboring words into account, these vectors can also resolve the meaning of ambiguous words such as “turkey,” which can refer to an animal or a country depending on the context. The algorithm for calculating such embeddings was described in detail in the posts “Capturing the meaning of words through vectors” and “BERT: How Do Vectors Accurately Describe the Meaning of Words?”.
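As a small illustration of this subword tokenization, the following sketch uses the standard BERT WordPiece tokenizer from the transformers library (only as an example; other models use different token inventories):

```python
# Sketch: common words keep their own token, rarer words are split into pieces.
# The BERT WordPiece tokenizer is used here purely as an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("The dog barked loudly"))
# frequent words typically appear as whole tokens, e.g. ['the', 'dog', ...]

print(tokenizer.tokenize("An unpronounceable pseudoword like blorficated"))
# rare or invented words are typically decomposed into several '##'-prefixed pieces
```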
A language model must generate the probabilities of the words in a sentence one after the other, as illustrated in the following animation. At the beginning, the model receives only the start symbol (v1 = BOS) as input. Then the BERT algorithm is applied, layer by layer, to the words of the sentence that are already known. Each layer contains a number of parallel association modules, each of which generates a new context-sensitive embedding vector for every input token.
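The following sketch approximates a single such layer using PyTorch’s built-in multi-head attention; the module counts, embedding size, and random input vectors are assumptions for illustration, and real models stack many such layers:

```python
# Sketch of one layer of parallel association (attention) modules, assuming
# 8 modules and 512-dimensional embeddings; real models stack many such layers.
import torch
import torch.nn as nn

num_tokens, embed_dim, num_modules = 4, 512, 8   # e.g. BOS, "The", "dog", "ran"
attention = nn.MultiheadAttention(embed_dim, num_modules, batch_first=True)

x = torch.randn(1, num_tokens, embed_dim)        # stand-in for the input embeddings

# Causal mask: each token may only associate with itself and earlier tokens.
mask = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)

new_embeddings, _ = attention(x, x, x, attn_mask=mask)
print(new_embeddings.shape)                      # (1, num_tokens, embed_dim): one new
                                                 # context-sensitive vector per token
```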
From the embedding vector in the rightmost position, a probability vector is predicted using a logistic regression model. For each possible token, this vector estimates the probability that it appears in the next position. Next, the tokens (v1 = BOS, v2 = The) are used as input, and a forecast for the third token is calculated. This process continues until the last token is predicted from (v1 = BOS, v2 = The, v3 = dog, v4 = he#, v5 = peeked, v6 = the). Each time, new context-sensitive embeddings are calculated for all previous words, providing additional information for the rightmost token. The model is trained on texts from a large training dataset: when the preceding tokens are entered as starting text, it learns to assign the highest possible probability to the token actually observed in the text. The details of these models are presented in the book “Künstliche Intelligenz – Was steckt hinter der Technologie der Zukunft?”.
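A simplified generation loop of this kind, again sketched with the small GPT-2 model as a stand-in, repeatedly predicts a probability vector from the rightmost embedding and appends the most probable token; the last lines indicate the training objective of maximizing the probability of the token actually observed:

```python
# Sketch of the token-by-token generation loop and of the training objective,
# using the small GPT-2 model purely as a stand-in for a large language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The dog", return_tensors="pt").input_ids
for _ in range(10):                                   # generate 10 further tokens
    with torch.no_grad():
        logits = model(ids).logits                    # (1, length, vocab_size)
    next_id = logits[0, -1].argmax()                  # most probable next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1) # append and repeat
print(tokenizer.decode(ids[0]))

# Training: the model is optimized so that the token actually observed in the
# training text receives the highest possible probability (cross-entropy loss).
loss = model(ids, labels=ids).loss
```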
Generating coherent texts with the GPT-3 language model
The GPT-3 model can process input texts of up to 2048 tokens and can therefore take a very large context into account. It has 96 layers, each with 96 parallel association modules, and a total of 175 billion free parameters, allowing it to generate very expressive context-dependent embeddings. It was trained on a text collection of books, Wikipedia, and websites comprising about 500 billion tokens, more than 100 times the amount of text a person could read in a lifetime. The language model can be steered to a large extent by the starting text it is given. Instructed by one or more examples, it can generate text for a specific purpose, such as translations into another language or summaries of a document (few-shot prompts).
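A few-shot prompt of this kind is nothing more than a starting text containing a handful of worked examples. The following sketch builds such a prompt for translation; the example pairs and the final sentence are invented for illustration, and the model is expected to continue the pattern after the last arrow:

```python
# Sketch of a few-shot prompt for translation; the examples are invented for
# illustration. The model continues the text after the last "=>".
few_shot_prompt = """Translate English to German:
sea otter => Seeotter
cheese => Käse
The dog barked loudly. => Der Hund bellte laut.
How is the weather today? =>"""

# This string would be passed to the language model as the starting text;
# the generated continuation is then the desired translation.
print(few_shot_prompt)
```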
When GPT-3 receives an input that could be the beginning of a newspaper article, it can generate “news articles” of many hundreds of words that are hardly distinguishable from human contributions. The generated texts contain almost no syntactic errors and are logically plausible, although the stated facts are not always correct. The average human accuracy in detecting articles produced with GPT-3 was around 52%, barely above chance, which suggests that people can hardly distinguish synthetically generated texts from human-written ones.
Verification through extensive test collections
In the meantime, additional models such as Gopher and PaLM, with 280 billion and 540 billion parameters respectively, have been developed with an architecture similar to GPT-3. Both models outperform GPT-3 in the quality of the generated texts. To assess their performance comprehensively, they were tested on a collection of more than 150 benchmark tasks from a wide range of application areas, including medicine, logical reasoning, and history. Unlike BERT, these models were not fine-tuned for the individual tasks but were instructed through few-shot prompts. Gopher improved on the accuracy of GPT-3 in more than 82% of these tasks. PaLM even achieved a higher score than the average score of humans solving the same tasks. A notable feature of PaLM is its ability to draw logical conclusions better than previous models. This is supported by prompts that provide a logical chain of reasoning for a sample problem, showing the model how to solve a problem by breaking it down into steps. An example of such a query is sketched below, with the model’s response following the final question:
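A chain-of-thought prompt of this kind contains one fully worked example followed by the new question; the word problems in the sketch below are illustrative, and the model is expected to produce the step-by-step reasoning and the answer after the final “A:”:

```python
# Sketch of a chain-of-thought prompt (the word problems are invented for
# illustration). The worked example shows the model how to reason step by step.
chain_of_thought_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""

# The model is expected to continue with a step-by-step derivation such as
# "23 - 20 = 3, 3 + 6 = 9. The answer is 9."
print(chain_of_thought_prompt)
```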
Through this chain-of-thought prompting, the model’s ability to answer logical questions is significantly enhanced. Importantly, the same example prompt can be reused for many different logical problems. Following this pattern, the model can even explain jokes.
However, these models also have weaknesses. Since they represent the associations found in their training data, they can reproduce biases about certain population groups. They are also not immune to misjudgments and factual errors, which they may reproduce from the training data or simply generate because they sound plausible. The article “Fact-based text generation with retrieval networks” describes techniques for reducing these problems.
Foundation Models: A new basis for Artificial Intelligence
There are now models that use the same technique (parallel association modules, attention) to generate embeddings for image content and texts simultaneously and to associate them with each other. With this capability, images matching a text and descriptions fitting an image can be generated. Similar embeddings can be created for videos, 3D images, and motion sequences. Because of their broad range of applications, these models are referred to as “Foundation Models,” and many researchers believe they form the basis for the development of advanced AI systems.
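As a small illustration of such joint text-image embeddings, the following sketch uses the openly available CLIP model (only as an accessible example of this family of multimodal models; the image file name is a placeholder) to measure how well a set of captions matches an image:

```python
# Sketch: joint embeddings for an image and several texts with the CLIP model,
# used here only as an example of a multimodal foundation model.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")              # hypothetical local image file
captions = ["a photo of a dog", "a photo of a turkey", "a city at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Similarity of the image embedding to each text embedding, as probabilities.
print(outputs.logits_per_image.softmax(dim=1))
```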
Due to the large number of parameters and the large amount of training data required, existing Foundation Models have so far been developed only by large internet companies and are not fully available to interested researchers. To enable researchers and businesses to benefit from the advantages of modern language models, the “OpenGPT-X” project was launched in early 2022. Under the leadership of the Fraunhofer Institutes IAIS and IIS, the project is advancing the development of a powerful open AI language model for Europe.