Databases containing extensive text collections, such as news articles, publications, or patents, are often challenging to navigate due to the sheer number of documents they contain. It has therefore long been common practice to tag stored texts with keywords, facilitating quick and targeted searches. In the past, this was done through manual keyword assignment, a labor-intensive and costly process requiring annotators with expertise in the covered topics. Even with experts available, the results may not be consistent, as different experts may use different terms for the same topic. Using a controlled keyword list for the sake of uniformity, on the other hand, requires meticulous maintenance to keep up with new topics.
Hence, we explored how short texts, such as news agency reports or summaries of publications and patents, can be automatically tagged with content-related keywords. In professional usage, this process is referred to as “tagging”.
To better understand this task, it is important to clarify the possible goals of tagging: keywords (tags) should directly capture the essential themes of a text. It is also crucial that these keywords exist in a broader context, meaning that a selected keyword should appear in other texts on similar topics as well. Choosing a term that is either too specific or too general is therefore not useful. The approaches presented in the next two sections, “Word Clouds” and “Topic Models,” fulfill these conditions and come with specific advantages and disadvantages.
Word Clouds
In the first decade of the millennium, Word Clouds were highly popular in Web 2.0. They are built from a large collection of texts (referred to as the corpus), and word selection relies on simple frequency statistics. Words from the cloud that appear in one of the texts can be used as tags for that text. The advantage lies in the simplicity of the algorithm. A drawback, however, is that the context and purpose of the texts are ignored, and the meaningfulness of the identified words is often questionable. Post-cleaning of the Word Cloud is therefore recommended to remove function words and other frequently occurring but non-meaningful terms. An often-cited example is the dominance of the term “http” in a corpus consisting of HTML-encoded files. A further drawback is that a text containing none of the words from the Word Cloud cannot be assigned any tags.
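To illustrate this frequency-based selection, the following minimal sketch counts word frequencies over a small placeholder corpus and tags each text with the cloud words it contains; the corpus, stop-word list, and cloud size are illustrative assumptions, not the data or parameters of any real system.

```python
from collections import Counter
import re

# Toy corpus standing in for a larger text collection (placeholder data).
corpus = [
    "The laser beam is focused on the metal surface.",
    "A pulsed laser welds the metal sheets together.",
    "The patent describes a new laser welding method.",
]

# Simple stop-word list; in practice this post-cleaning step needs care
# (e.g. removing artifacts such as "http" in HTML-derived corpora).
stop_words = {"the", "a", "is", "on", "of", "new", "together"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Count word frequencies over the whole corpus.
counts = Counter(w for doc in corpus for w in tokenize(doc) if w not in stop_words)

# The most frequent words form the "word cloud".
cloud = {w for w, _ in counts.most_common(5)}

# A text is tagged with the cloud words it actually contains.
for doc in corpus:
    tags = cloud & set(tokenize(doc))
    print(tags, "<-", doc)
```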
Topic Models
Several years later, Topic Models emerged, built on a significantly more complex theoretical foundation. Here, topics are understood as the themes addressed in texts. The most frequently used topic modeling algorithm was LDA (Latent Dirichlet Allocation). In this text clustering method, a relationship between the texts of a corpus, the words contained in those texts, and “invisible” (latent) topics is established via a probability distribution. Latent topics are characterized by words that are highly likely to appear in texts addressing that particular topic. The texts themselves are also assigned probabilities with respect to the topics, which makes sense because a text can relate to several topics at once. To tag a text, words from the topics that represent the text’s themes with high probability can be used. An advantage over Word Clouds is that theme-centered keywords can be assigned to every text. However, the problem of non-content terms such as “http” persists here as well.
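A minimal sketch of this idea, using scikit-learn’s LatentDirichletAllocation rather than any specific implementation discussed here, could look as follows; the corpus, number of topics, and number of tag words are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus; a real application would use many more texts.
corpus = [
    "laser beam welding of metal sheets",
    "pulsed laser cutting of steel plates",
    "neural network for image classification",
    "deep learning model for object detection",
]

# Bag-of-words representation of the corpus.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# Fit an LDA model with a small number of latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)          # per-text topic probabilities
words = vectorizer.get_feature_names_out()

# For each text: pick its most probable topic and use that topic's
# highest-weighted words as tags.
for doc, probs in zip(corpus, doc_topics):
    topic = probs.argmax()
    top_words = [words[i] for i in lda.components_[topic].argsort()[::-1][:3]]
    print(top_words, "<-", doc)
```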
Semantic Tagging
Given this situation, we developed a multi-stage processing chain (workflow) to overcome or compensate for the drawbacks of the existing approaches. We build on word embedding vectors, an approach that has seen significant development in recent years. The best-known algorithm for calculating embedding vectors is Word2Vec. Word embedding vectors allow semantic similarities between terms to be inferred without human interpretation. This property has been known for some years, starting with the publications on the Word2Vec algorithm. Building on this insight, the program “StarSpace” (explanation and program code) was developed. StarSpace calculates embeddings for entities: the method works for words and texts as well as for metadata describing texts, such as text categories, keywords, or author names.
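As a small illustration of this property, the following sketch trains a Word2Vec model with gensim on a placeholder corpus and queries the nearest neighbours of a term. With a corpus this tiny the results are not meaningful; the sketch only shows the mechanism, not the setup used in the workflow described here.

```python
from gensim.models import Word2Vec

# Placeholder corpus of tokenized sentences; useful embeddings need far more text.
sentences = [
    ["laser", "beam", "welding", "of", "metal", "sheets"],
    ["pulsed", "laser", "cutting", "of", "steel", "plates"],
    ["laser", "diode", "emits", "a", "coherent", "beam"],
    ["welding", "torch", "joins", "steel", "plates"],
]

# Train a small Word2Vec model (gensim 4.x API).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200, seed=1)

# Terms used in similar contexts end up with similar vectors,
# so nearest neighbours in vector space suggest semantic relatedness.
print(model.wv.most_similar("laser", topn=3))
```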
Our approach begins by collecting keyword candidates from the individual texts of a corpus, using a simple, fully automated algorithm. Here, we broaden the term “keyword” to include phrases consisting of multiple words, which we refer to as key phrases. For a large corpus, this yields several thousand or even tens of thousands of key phrases, which are then filtered in two steps. First, using count-based statistics from information theory, we determine the information content of each part of a key phrase, i.e., each individual word, in relation to the corpus. This allows us to eliminate phrases that do not contain any words with high information content. The remaining phrases are then sorted by their frequency of occurrence in the texts, and the most frequent phrases are retained. For large corpora, StarSpace allows several thousand key phrases to be learned.
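Since the exact extraction and filtering procedure is not spelled out above, the following is only a rough sketch of the described steps: collecting adjacent word n-grams as candidates, scoring words with a count-based information measure, discarding phrases without high-information words, and keeping the most frequent remainder. Corpus, n-gram lengths, and thresholds are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Placeholder corpus; thresholds below are illustrative assumptions.
corpus = [
    "A pulsed laser beam welds the thin metal sheets.",
    "The pulsed laser beam cuts the steel plate.",
    "A control unit adjusts the pulsed laser beam.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# 1. Candidate key phrases: here simply all adjacent word bigrams and trigrams.
def candidates(tokens, n_max=3):
    for n in range(2, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

word_counts = Counter(w for doc in corpus for w in tokenize(doc))
total = sum(word_counts.values())

# 2. Count-based information content of a word: -log2 of its relative frequency.
def info_content(word):
    return -math.log2(word_counts[word] / total)

phrase_counts = Counter(p for doc in corpus for p in candidates(tokenize(doc)))

# 3. Keep only phrases containing at least one high-information word ...
MIN_INFO = 3.0   # illustrative threshold
informative = {p: c for p, c in phrase_counts.items()
               if any(info_content(w) >= MIN_INFO for w in p.split())}

# 4. ... and of those, retain the most frequent ones as key-phrase candidates.
for phrase, count in Counter(informative).most_common(5):
    print(count, phrase)
```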
In this context, learning means that the key phrases, treated as categories of a classification system, are assigned by the StarSpace model to the texts in which they actually appear. The effect we exploit for tagging is that the learned key phrases can also be assigned to new texts, even if those texts do not literally contain the phrases but are semantically similar to texts in the learning corpus. For the vectors of the key phrases, StarSpace exploits the property, known from the Word2Vec algorithm, that semantic similarity is encoded in word embedding vectors.
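A sketch of what the training data and call might look like, assuming StarSpace’s standard classification mode (trainMode 0) with its default `__label__` prefix, multi-word key phrases joined by underscores, and a compiled `starspace` binary on the PATH; the concrete configuration of the workflow described above may differ.

```python
import subprocess

# Placeholder texts with their key phrases (labels). In StarSpace's default
# classification mode (trainMode 0), labels are marked with the prefix "__label__".
examples = [
    ("a pulsed laser beam welds the thin metal sheets", ["laser_beam", "metal_sheet"]),
    ("the pulsed laser beam cuts the steel plate", ["laser_beam", "steel_plate"]),
]

# Write one training line per text: the words followed by its labels.
with open("train.txt", "w", encoding="utf-8") as f:
    for text, phrases in examples:
        labels = " ".join("__label__" + p for p in phrases)
        f.write(f"{text} {labels}\n")

# Train the model; assumes a compiled `starspace` binary on the PATH.
subprocess.run([
    "starspace", "train",
    "-trainFile", "train.txt",
    "-model", "tagModel",
    "-trainMode", "0",
], check=True)
```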
Tagging of Patent Abstracts
We extensively studied the example of tagging summaries of patent documents (Patent Abstracts). A partial result is shown in Figure 1. The key phrases were extracted from a very large corpus of patent abstracts.
Here, terms from the field of laser technology are displayed. A connecting line between two terms indicates that the word embedding vectors of these terms are very similar. In terms of content, this means that the terms are used in similar contexts (with respect to other terms and words in general) in the corpus, so a semantic similarity between them can be assumed. Our approach provides two essential results: first, the key phrases extracted from the corpus stand in a semantic similarity relation to one another, which can be visualized and provides a quick overview of the topics in a text collection. Second, using the learned StarSpace model parameters, suitable key phrases can be assigned to thematically related new texts, even if these phrases do not occur literally in those texts.
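As a sketch of how such a similarity graph can be derived, assuming the tab-separated embedding file that StarSpace writes alongside the binary model (the file name `tagModel.tsv` and the threshold below are placeholders), cosine similarities between the key-phrase vectors can be turned into graph edges:

```python
import numpy as np

# Load the tab-separated embedding file written next to the binary model
# (e.g. "tagModel.tsv"); keep only the key-phrase (label) entities.
vectors = {}
with open("tagModel.tsv", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        name, vec = parts[0], np.array(parts[1:], dtype=float)
        if name.startswith("__label__"):
            vectors[name.removeprefix("__label__")] = vec

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Connect key phrases whose embedding vectors are very similar; these pairs
# form the edges of a similarity graph like the one shown in Figure 1.
THRESHOLD = 0.7  # illustrative value
names = sorted(vectors)
edges = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = cosine(vectors[a], vectors[b])
        if sim >= THRESHOLD:
            edges.append((a, b, sim))

for a, b, sim in sorted(edges, key=lambda e: -e[2]):
    print(f"{a} -- {b}  (similarity {sim:.2f})")
```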