How does a GAN work?
The idea behind the term “GAN” is fascinating and was first published in 2014 by Ian Goodfellow and colleagues: the system consists of two components, both neural networks, called the “Generator” and the “Discriminator” (see the figure below). Together they perform the actual task of the system: from a large amount of input data (training data), such as photos, they generate new data that closely resembles that input, without any further human intervention. This is a variant of unsupervised learning, because the system does not rely on external supervision to judge how similar the generated output is to the training data. The Generator component is tasked with generating new data. The Discriminator component is tasked with distinguishing the data generated by the Generator from the training data, i.e., with recognizing it as artificially generated.
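To make the two roles concrete, here is a minimal sketch of the two components in PyTorch. It assumes flattened grayscale images and simple fully connected networks; all layer sizes and dimensions are illustrative choices, not taken from the original publication.

```python
import torch.nn as nn

LATENT_DIM = 100   # size of the random input vector (illustrative)
IMG_DIM = 28 * 28  # flattened image size, e.g. 28x28 grayscale (illustrative)

# Generator: maps a random input vector to an artificial image.
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, IMG_DIM),
    nn.Tanh(),  # pixel values scaled to [-1, 1]
)

# Discriminator: maps an image to the probability "real photo".
discriminator = nn.Sequential(
    nn.Linear(IMG_DIM, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)
```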
Using photos as an example, one could characterize this principle as a kind of game: the Generator network tries to outsmart the Discriminator network with artificial photos that are as realistic as possible. During the learning process, the Generator network produces increasingly “realistic” images, while the Discriminator network learns to distinguish real from artificial data with increasing accuracy. After a new image is generated, it is presented to the Discriminator network together with a real photo, which must then decide which of the two images is a real photo and which is artificially generated. Importantly, the two images do not have to depict the same motif, only roughly the same subject area. Apart from the general theme, such as portrait shots, the Discriminator network must base its decision solely on micro-properties of the images, such as contrast, color gradients, depth of field, and resolution. At the beginning of training, this distinction is still easy, because the Generator network has no access to any of the real photos as input, only to a randomly generated input vector. One could picture this input like the static on an old television set when no channel is tuned in; a contemporary analogy would be a large QR code with randomly distributed black and white fields.

The Generator component is therefore a generative neural network: it is supposed to learn to reproduce the general range of variation of the images in the training data (more details on this can be found in another blog post). The network must learn to transform the repeatedly drawn random input values in such a way that an image as realistic as possible is produced as output. After each attempt, it receives only the binary decision of the Discriminator component, “real photo” or “artificial image”, as feedback.
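The adversarial game can be written as an alternating training loop. This is a minimal sketch assuming the `generator` and `discriminator` from the block above and a `dataloader` that yields batches of flattened real photos (both assumptions); optimizer settings are illustrative.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

for real_batch in dataloader:  # batches of real photos (assumed to exist)
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # 1) Discriminator step: tell real photos from generated images.
    noise = torch.randn(batch_size, LATENT_DIM)  # the random input vector
    fake_batch = generator(noise)
    d_loss = (bce(discriminator(real_batch), real_labels)
              + bce(discriminator(fake_batch.detach()), fake_labels))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator step: its only feedback is the Discriminator's
    #    verdict, and it tries to make that verdict say "real".
    g_loss = bce(discriminator(fake_batch), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```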
GANs were initially used to produce realistic-looking images trained on photos. Particularly remarkable were generated human portraits (see figure above), which even people with relevant expertise in the field could not distinguish from real photos.
Which improvements have been proposed for the original system?
An important modification to the GAN architecture is to replace the simple Generator input, i.e., the randomly generated vector, with a data vector from the training data, which is additionally perturbed with random values before being processed by the Generator component. This principle resembles that of the Variational Autoencoder, which is described in the blog post. It opens up an interesting alternative: such a GAN can be trained to change real inputs in such a way that certain properties are preserved while others are systematically altered.
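In code, this modification only changes what is fed into the Generator. A minimal sketch, assuming the networks from above; the helper name and the noise scale are hypothetical:

```python
import torch

def perturbed_input(real_vec, noise_scale=0.1):
    """Instead of a purely random vector, start from a real data
    vector and add random values before it enters the Generator.
    noise_scale is an illustrative hyperparameter."""
    return real_vec + noise_scale * torch.randn_like(real_vec)

# fake = generator(perturbed_input(real_vec))
```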
The figure above schematically illustrates the Generator component of a GAN that turns real photos into images in a specific artistic style. Further publications describe successful attempts to modify other aspects of photos, such as turning a summer landscape into a winter landscape or a line drawing into a photo-realistic object.
One drawback of the simple GAN architecture is its slow and not always successful convergence, i.e., the learning behavior of the system, which demands extremely high computational effort. The success of training also depends on many externally chosen parameters, such as the network architecture of the two components, the number of neurons, and the learning rate for weight adjustment. It can also happen that the Generator and Discriminator components settle into a degenerate state in which the Generator produces only a narrow range of outputs, and even obviously unrealistic Generator outputs are no longer recognized as artificial by the Discriminator component. In the literature, this is referred to as mode collapse.
A new idea that seeks to overcome this difficulty through “regularization” (see also our blog post on deep neural networks) is to compare the output of the Generator component with a “reverse transformation” performed by a second GAN, as illustrated in the following figure. The learning task of the second GAN is the inverse of the first: in our example above, the second GAN is supposed to learn to generate realistic photos from painted pictures. Note that this objective does not mean comparing the reverse-transformed photo directly with the original input of the first GAN, as in supervised learning; instead, the Discriminator component of the second GAN is tasked with distinguishing the reverse-transformed photo from “arbitrary” original photos.
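The coupling of the two GANs can be sketched as follows. All network names (`G_ab`, `G_ba`, `D_a`, `D_b`) and sizes are hypothetical, and the sketch only shows the joint generator objective; each Discriminator is trained separately against its pool of original images, as described above.

```python
import torch
import torch.nn as nn

IMG_DIM = 28 * 28  # flattened image size (illustrative)

def make_gen():
    return nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU(),
                         nn.Linear(256, IMG_DIM), nn.Tanh())

def make_disc():
    return nn.Sequential(nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
                         nn.Linear(256, 1), nn.Sigmoid())

G_ab, G_ba = make_gen(), make_gen()   # photo -> painting, painting -> photo
D_b, D_a = make_disc(), make_disc()   # judges paintings / judges photos
bce = nn.BCELoss()

def generator_step(photo_batch):
    """Joint generator objective for the two coupled GANs."""
    real = torch.ones(photo_batch.size(0), 1)
    painting = G_ab(photo_batch)    # first GAN: photo to painted picture
    back_photo = G_ba(painting)     # second GAN: reverse transformation
    # D_a compares the reverse-transformed photo with arbitrary original
    # photos (in its own training step) -- not with the specific input.
    return bce(D_b(painting), real) + bce(D_a(back_photo), real)
```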
How can a GAN learn to translate words?
It has already been mentioned that GANs belong to the unsupervised machine learning methods, so in principle no detailed per-example targets are needed from which error values could be calculated. This property would be very desirable for machine translation: it would then not be necessary to obtain or produce large amounts of parallel text in two languages. Approaches that exploit the GAN property of unsupervised learning do not initially aim to translate entire sentences, but rather individual words or terms, as this task is less complex. Even so, a solution would be very interesting, as it would allow dictionary-like translation lists to be generated automatically, without human intervention.
The first question is in what form words or texts need to be represented for processing in a GAN. Embedding vectors offer a suitable option; the principle is explained in this blog post. Embedding vectors must therefore first be computed for both languages between which translation is desired. There are several practical methods for this, which are explained in our blog post on text to embedding vectors.
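One practical option, shown here as a sketch, is training Word2Vec models with gensim. The corpus variables are hypothetical placeholders for tokenized text in each language, and the parameters are illustrative:

```python
from gensim.models import Word2Vec

# english_sentences / french_sentences: lists of tokenized sentences,
# assumed to be prepared beforehand (hypothetical names).
en_model = Word2Vec(english_sentences, vector_size=300, window=5, min_count=5)
fr_model = Word2Vec(french_sentences, vector_size=300, window=5, min_count=5)

en_vec = en_model.wv["house"]  # embedding vector for an English word
fr_vocab = fr_model.wv         # keyed vectors for the French vocabulary
```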
This principle is illustrated in this figure using translations from English to French (see also “Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders”): for one learning step, one English and one French embedding vector are selected at random. These do not have to be vectors of the same word, because learning is unsupervised. The Generator component then produces a “French” embedding vector from the English embedding vector, and the Discriminator component has to decide which of the two French vectors is an original and which is a generated version.
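A minimal sketch of such a learning step, assuming embedding vectors of the same dimensionality in both languages; using a single linear layer as the Generator is a common minimal choice here, and all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

EMB_DIM = 300  # must match the embedding vectors (illustrative)

# Generator: maps an English embedding into the French embedding space.
mapper = nn.Linear(EMB_DIM, EMB_DIM, bias=False)

# Discriminator: outputs the probability "original French vector".
disc = nn.Sequential(
    nn.Linear(EMB_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

def learning_step(en_vec, fr_vec):
    """en_vec and fr_vec are randomly chosen embedding vectors --
    they need not belong to the same word."""
    generated_fr = mapper(en_vec)
    # The Discriminator must tell the original from the generated vector.
    return disc(fr_vec), disc(generated_fr)
```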
The embedding vectors produced by a GAN after the learning phase are not yet “translated” words. For this, the generated embedding vector must be assigned to a word in the target language. This assignment is done by determining the similarity to known words in the target language using cosine similarity and selecting the most similar word.
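This lookup step fits in a few lines; a sketch with a hypothetical helper, assuming the French vocabulary is available as a matrix of embedding vectors plus a parallel word list:

```python
import torch
import torch.nn.functional as F

def translate(generated_vec, fr_vocab_matrix, fr_words):
    """Assign a generated embedding to the most similar known French
    word via cosine similarity (illustrative helper).
    fr_vocab_matrix: tensor of shape [vocab_size, EMB_DIM],
    fr_words: list of the corresponding French words."""
    sims = F.cosine_similarity(fr_vocab_matrix,
                               generated_vec.unsqueeze(0), dim=1)
    return fr_words[sims.argmax().item()]
```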
Experiments with variations of this architecture have shown that better results are achieved when the Generator network is a (variational) autoencoder (see figure above). In this case, the Discriminator component evaluates the internal representation of the embedding vectors inside the autoencoder. To arrive at a translation, these internal vectors must first be transformed back with the Decoder, and then a word of the target language must be determined using cosine similarity.
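The structural difference to the previous sketch is what the Discriminator sees: the internal code rather than the embedding itself. A minimal sketch with illustrative names and sizes:

```python
import torch.nn as nn

EMB_DIM, CODE_DIM = 300, 128  # illustrative sizes

# (Variational) autoencoder used as the Generator component.
encoder = nn.Sequential(nn.Linear(EMB_DIM, CODE_DIM), nn.ReLU())
decoder = nn.Sequential(nn.Linear(CODE_DIM, EMB_DIM))

# Here the Discriminator judges the internal representation (the code),
# not the output embedding vector itself.
code_disc = nn.Sequential(nn.Linear(CODE_DIM, 1), nn.Sigmoid())

# Translation still requires decoding back to an embedding and then a
# cosine-similarity lookup in the target vocabulary, e.g.:
# fr_vec = decoder(encoder(en_vec)); word = translate(fr_vec, ...)
```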
In the publication “Unsupervised Word Translation with Adversarial Autoencoder,” this architecture is combined with the cyclic architecture described above. The result is a complex system of two GANs that has achieved good results in translation between several languages. However, the difficulties of slow convergence and of having to choose the hyperparameters within narrow limits remain.
GANs thus open up an interesting way to generate new, domain-typical examples from large amounts of data alone, without the specification of rules or other expert knowledge. The cyclic interleaving of two GANs increases the stability of the system’s behavior, and combining example data with a random distribution as input allows the produced results to be controlled in a targeted way.
For a balanced assessment, the use of GANs for dubious or even criminal purposes in the “deep fake” context must not be ignored; however, going into details would exceed the scope of this post. Intensive work is underway on analysis methods to detect such fakes (see, for example, “Deep learning for deepfakes creation and detection: A survey”).