The daily routine of many companies includes creating and reviewing reports that translate financial metrics from countless tables into written form. This monotonous and repetitive task is often done manually, consuming significant time and money. With the increasing availability of digital financial and legal documents, processing them automatically is becoming crucial. The goal is to extract central patterns from the texts and assist users. This is where Artificial Intelligence (AI) can be a valuable tool to support or fully automate parts of this work.
The Problem: Despite the high availability of financial and legal documents, text processing or machine learning systems are rarely used. One reason is that these documents contain sensitive information, so they may be used and processed only by authorized individuals and for authorized purposes.
Anonymizer – a tool for automatic anonymization of reports
To address this issue, we have initiated a new project with PricewaterhouseCoopers (PwC): the Anonymizer. This tool enables the automatic anonymization of financial documents. It recognizes and censors sensitive data, such as locations, names of individuals or companies, and other information (e.g., emails, phone numbers) that would allow data to be attributed to a specific company.
During the collaboration on the Anonymizer tool, PwC provided training data by annotating sensitive information in financial reports in advance. Machine Learning models could then learn from this data without sensitive information leaving the circle of authorized recipients. Based on the latest insights from the field of Machine Learning, a team from the Lamarr Institute (formerly known as Machine Learning Competence Center Rhine-Ruhr (ML2R)) and Fraunhofer IAIS developed new methods to automatically anonymize sensitive information in financial documents. PwC employees supported the development of the ML methods with their domain-specific expertise and evaluated the results. The resulting Anonymizer tool is described below.
Recognizing relationships between words through numerical word representations
The Anonymizer tool aims to automatically recognize and censor sensitive data, such as locations, names of individuals or companies, and other information. In research, this problem is known as “Named Entity Recognition” (NER), which refers to the automatic identification and classification of proper nouns. Here, a proper noun is a sequence of words describing a real-world entity, such as a company name.
The first step in creating the Anonymizer focused on embedding words into a vector space. The idea behind this is to find a numerical representation for words in which words with similar meanings also have similar representations. For example, the words “monkey” and “chimpanzee” should be close to each other in this space. Additionally, numerical representation allows mathematical operations that relate words (see Figure 1). Models trained on these representations can learn these relationships and incorporate them into their predictions.
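The idea of "similar meaning, similar representation" can be sketched with a few hand-made vectors. The three-dimensional values below are invented for illustration and are not taken from the Anonymizer; real embeddings are learned from text and have hundreds of dimensions. The sketch measures closeness with cosine similarity and shows one arithmetic operation of the kind word vectors allow (the analogy words are an assumption, since Figure 1 is not reproduced here).

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "monkey":     [0.90, 0.80, 0.10],
    "chimpanzee": [0.85, 0.90, 0.15],
    "invoice":    [0.10, 0.05, 0.95],
    "man":        [1.00, 0.00, 0.20],
    "woman":      [1.00, 1.00, 0.20],
    "king":       [3.00, 0.00, 0.80],
    "queen":      [3.00, 1.00, 0.80],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, exclude):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = [word for word in EMBEDDINGS if word not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(EMBEDDINGS[w], target))


# Words with similar meaning should be closer than unrelated words.
sim_related = cosine_similarity(EMBEDDINGS["monkey"], EMBEDDINGS["chimpanzee"])
sim_unrelated = cosine_similarity(EMBEDDINGS["monkey"], EMBEDDINGS["invoice"])
print(f"monkey vs chimpanzee: {sim_related:.3f}")
print(f"monkey vs invoice:    {sim_unrelated:.3f}")

# Vector arithmetic that relates words: king - man + woman.
target = [k - m + w for k, m, w in zip(
    EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])]
analogy = nearest(target, exclude={"king", "man", "woman"})
print("king - man + woman is closest to:", analogy)
```

With these toy values the related pair scores far higher than the unrelated pair, which is the property a downstream model can exploit.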
In addition to the traditional embedding of words into a vector space, modern embeddings also incorporate the context of the sentence. These embeddings are mostly based on well-known language models trained on millions of words from large text datasets. Such language models can distinguish words with multiple meanings based on their context. For example, the word “bird” is assigned a different representation in the sentences “The early bird catches the worm.” and “My family doctor is Dr. Bird.” In our application, we use Flair, one of these modern, context-aware embedding approaches.
Classification using Recurrent Neural Networks and Conditional Random Fields
In the second step of Anonymizer development, we trained a recurrent neural network (RNN) on the embedded words. The RNN predicts which entity class each word belongs to (e.g., per for persons or org for organizations). Through feedback connections in its architecture, a recurrent neural network has a kind of memory: at every step, information from previous inputs is fed back into the network. As with modern embeddings, previously seen words can thus be taken into account when classifying the current word.
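The feedback mechanism can be made concrete with a minimal sketch of a simple (Elman-style) recurrent step. This is not the Anonymizer's network: the weights and the two-dimensional word vectors are hand-picked assumptions, nothing is trained, and the classification layer is omitted. The sketch only shows that the same word ends up with a different internal state depending on the words that came before it.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, w_in, w_rec):
    """One recurrent step: h_t = tanh(W_in x_t + W_rec h_{t-1})."""
    from_input = matvec(w_in, x)
    from_memory = matvec(w_rec, h_prev)
    return [math.tanh(a + b) for a, b in zip(from_input, from_memory)]


def run_rnn(sequence, w_in, w_rec):
    """Feed a sequence of word vectors through the RNN, returning all states."""
    h = [0.0] * len(w_rec)  # the memory starts empty
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec)
        states.append(h)
    return states


# Hand-picked, untrained weights and toy word vectors (assumptions).
W_IN = [[0.5, -0.3], [0.2, 0.8]]
W_REC = [[0.6, 0.1], [-0.4, 0.7]]
VEC = {"Dr.": [1.0, 0.0], "early": [0.0, 1.0], "bird": [0.5, 0.5]}

# The same word vector for "bird", preceded by two different words.
state_after_dr = run_rnn([VEC["Dr."], VEC["bird"]], W_IN, W_REC)[-1]
state_after_early = run_rnn([VEC["early"], VEC["bird"]], W_IN, W_REC)[-1]

print("state for 'bird' after 'Dr.':  ", state_after_dr)
print("state for 'bird' after 'early':", state_after_early)
```

A classifier placed on top of these states would therefore see different inputs for the two occurrences, even though the word vector itself is identical.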
In the third step, the predicted entities were treated as a sequence and assessed by a Conditional Random Field (CRF). A CRF is a probabilistic model that evaluates how likely it is that the elements of the input sequence appear in this order. The model is first trained on the training data, learning which label sequences occur frequently, and assigns them a higher score during prediction. For example, if the RNN predicts the classes per, per, and org for three consecutive words, this is probably a sequence that rarely occurs. It is more likely an organization name containing a person’s name (org, org, org) or a list of individuals (per, per, per). The “Robert Koch Institute” illustrates this well: the CRF must decide whether it is an organization (org, org, org) or a list of individuals (per, per, per).
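The effect on the “Robert Koch Institute” example can be sketched with a small sequence decoder. The per-word scores and the transition scores below are invented for illustration; in a real CRF they are learned from training data and normalized into probabilities. The sketch uses the Viterbi algorithm, a standard way to find the best-scoring label sequence in a linear-chain model, and is not necessarily how the Anonymizer decodes.

```python
LABELS = ["per", "org"]
WORDS = ["Robert", "Koch", "Institute"]

# Per-word scores as a classifier might produce them (invented values).
EMISSIONS = [
    {"per": 2.0, "org": 1.5},  # Robert
    {"per": 2.0, "org": 1.5},  # Koch
    {"per": 0.0, "org": 3.0},  # Institute
]

# Transition scores: staying in one class is common, switching is rare.
TRANSITIONS = {
    ("per", "per"): 1.0, ("org", "org"): 1.0,
    ("per", "org"): -2.0, ("org", "per"): -2.0,
}


def greedy(emissions):
    """Pick the best label for each word independently."""
    return [max(scores, key=scores.get) for scores in emissions]


def viterbi(emissions, transitions, labels):
    """Return (score, path) of the best-scoring label sequence."""
    # best[label] = (score of the best path ending in label, that path)
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for scores in emissions[1:]:
        new_best = {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p][0] + transitions[(p, cur)])
            score = best[prev][0] + transitions[(prev, cur)] + scores[cur]
            new_best[cur] = (score, best[prev][1] + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


greedy_path = greedy(EMISSIONS)
best_score, best_path = viterbi(EMISSIONS, TRANSITIONS, LABELS)
print("word-by-word prediction:", greedy_path)
print("sequence-level decoding:", best_path, "score", best_score)
```

Word by word, the first two tokens look like a person's name, but once the rare per-to-org switch is penalized, labeling all three words as an organization scores highest.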
Results and quality of anonymization
In anonymization, we consider the binary case for classification: either the word belongs to the class of sensitive words and should be anonymized, or it does not. There are two important criteria by which the quality of anonymization can be measured:
- How precise is the classification (Precision), i.e., how many of the predictions were correct?
- How many of the sensitive pieces of information were found and censored (Recall)?
These two metrics trade off against each other. Relying only on high-confidence predictions can easily increase precision but costs recall, because sensitive words are overlooked. For each ML application, the appropriate criteria must be selected and weighted. In the case of financial reports, recall plays the greater role, as the focus is on finding as much sensitive data as possible. If the tool misses sensitive data, the remainder must be removed manually, which significantly reduces the utility and scalability of the automation.
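Both metrics and the trade-off between them can be worked through on a small example. The ten tokens, their gold labels, and the confidence values below are invented and do not come from the Anonymizer's test data; the decision threshold is likewise an assumption used to illustrate "relying only on confident predictions".

```python
def precision_recall(gold, predicted):
    """Precision and recall for the binary 'sensitive' class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)      # found
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)  # false alarm
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)  # overlooked
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def predict(confidences, threshold):
    """Mark a token as sensitive if the model's confidence reaches the threshold."""
    return [c >= threshold for c in confidences]


# True = token is sensitive. Confidence = model's score for 'sensitive'.
GOLD = [True, True, True, True, False, False, False, False, False, False]
CONFIDENCE = [0.95, 0.90, 0.60, 0.35, 0.55, 0.40, 0.20, 0.10, 0.05, 0.02]

strict = precision_recall(GOLD, predict(CONFIDENCE, threshold=0.8))
lenient = precision_recall(GOLD, predict(CONFIDENCE, threshold=0.3))

print(f"strict threshold:  precision={strict[0]:.2f}, recall={strict[1]:.2f}")
print(f"lenient threshold: precision={lenient[0]:.2f}, recall={lenient[1]:.2f}")
```

With the strict threshold every prediction is correct but half of the sensitive tokens are missed; with the lenient threshold all sensitive tokens are found at the cost of two false alarms. For anonymization, the second behavior is the safer starting point.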
In our tool, the trained model achieves almost perfect anonymization on the test data (99% recall). It is gratifying that, despite prioritizing recall, the overall model maintains a precision of over 90%. This means that, on the one hand, almost all sensitive data (99%) is discovered and anonymized by the tool, and on the other hand, text is rarely mistakenly anonymized.
Anonymizer: Application in practice
To make the tool as user-friendly as possible, we have created two ways to use it: the Anonymizer exists as a command-line tool or can be operated through a web application.
The web application allows users to upload text documents and view the anonymized content (see Figure 2). The interface consists of two sections: a left section with controls and a right section where the anonymized document is displayed. Sensitive entities are highlighted in different colors depending on their type. In Figure 2, names of individuals, companies, and locations are highlighted in red, green, and blue, respectively. Additionally, the tool gives users the option to activate masking so that sensitive entities are completely blacked out. Once the document is anonymized, users can download the processed document, which is free of sensitive elements.
The command-line tool provides users with an interface to the functionalities of the Anonymizer, so it can easily be integrated into other processes and projects. It can also be used to quickly anonymize thousands of documents. For this purpose, users can create a template that specifies the settings for the run, or provide the settings through the console.
In summary, the Anonymizer tool enables the anonymization of sensitive information, such as names of individuals, locations, organizations, numbers, phone numbers, dates, and URLs, in a document. Sensitive information can be anonymized in all common file formats by either blacking out text or replacing text snippets with generic tags. We use state-of-the-art techniques in deep learning, natural language processing, and rule-based post-processing. Ultimately, the Anonymizer addresses the bottleneck in document sharing and facilitates the use of AI solutions in the financial sector and businesses in general.
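The two output modes mentioned above, blacking out and replacing with generic tags, can be sketched as a small function. This is an illustration under assumptions, not the Anonymizer's code: the function name `anonymize`, the span format `(start, end, label)`, the tag style, and the example sentence are all hypothetical, and the entity spans are given by hand rather than predicted by a model.

```python
def anonymize(text, entities, mode="tag"):
    """Replace character spans in text.

    entities: non-overlapping (start, end, label) spans (hypothetical format).
    mode: "tag" inserts a generic tag, "mask" blacks the span out.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])  # keep the text before the entity
        if mode == "tag":
            pieces.append(f"<{label.upper()}>")
        else:
            pieces.append("\u2588" * (end - start))  # full-block character
        cursor = end
    pieces.append(text[cursor:])  # keep the text after the last entity
    return "".join(pieces)


# Invented example sentence with hand-annotated spans.
TEXT = "Jane Doe joined Example GmbH in Bonn."
ENTITIES = [(0, 8, "per"), (16, 28, "org"), (32, 36, "loc")]

tagged = anonymize(TEXT, ENTITIES, mode="tag")
masked = anonymize(TEXT, ENTITIES, mode="mask")
print(tagged)
print(masked)
```

Tags keep the document readable for downstream text processing, while masking preserves the original length and layout of the text.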
More information in the associated publication:
Leveraging Contextual Text Representations for Anonymizing German Financial Documents. D. Biesner, R. Ramamurthy, M. Lübbering, B. Fürst, H. Ismail, L. Hillebrand, A. Ladi, M. Pielka, R. Stenzel, T. Khameneh, V. Krapp, I. Huseynov, J. Schlums, U. Stoll, U. Warning, B. Kliem, C. Bauckhage, R. Sifa. AAAI Workshop on Knowledge Discovery from Unstructured Data in Financial Services (KDF), 2020.