Generative neural models have achieved success in application areas where hand-curated data for supervised learning is scarce or unavailable. A frequently cited example is the generation of realistic-looking artificial images, such as portrait photos, using Generative Adversarial Networks (GANs) trained on a large number of unlabeled photos. In this post, we explain the general principle behind the generative approach. As an example architecture, we do not choose GANs, which have a complicated internal architecture and a complex training process, but the simple autoencoder network, which is often used as a component of GANs.

**What can generative neural models do better (than discriminative models)?**

In statistical modeling, a distinction is made between discriminative and generative models. In the realm of Machine Learning, this differentiation can be characterized as follows: Discriminative models define their learning performance exclusively with respect to the chosen training data. In other words, given the training set, the trained model generates an output for an input that does not belong to that set. This implies, on the one hand, that only the regularities derived from the training data are used to calculate the output; on the other hand, the training data assumes the role of an absolute standard. Examples include classifiers such as decision trees, (simple) support vector machines, and neural networks trained with backpropagation and the standard loss function (MSE, mean squared error).

In contrast, a generative Machine Learning model describes the indirectly accessible conditional probability distribution (in the Bayesian sense) that generated the input-output pairs used for training. This approach fundamentally differs in perspective from the discriminative one. Generative models aim to make a generalized statement about the latent probability distributions underlying an observable dataset. This is often expressed by saying that the generative model reconstructs the latent probability distribution (and thereby the resulting real observations).

The key aspect of this approach lies in the ability to generate new outputs using a trained model and to obtain insights into the imbalance of training data. We can better understand the practical implications of this somewhat abstract characterization by considering the example of the neural autoencoder.

**A discriminative model: Neural autoencoder**

Neural autoencoders are neural networks with one or more internal layers of neurons whose dimension is smaller than the dimension of the input vectors. The learning task is to reconstruct the input vector as accurately as possible in the output layer. Autoencoders were originally developed to generate compressed, low-dimensional encodings of data (see Figure 1). Neural autoencoders, according to the above description, are discriminative models optimized to learn a deterministic mapping between input and output data. What changes are needed to turn an autoencoder into a generative model?
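Before turning to that question, the deterministic autoencoder just described can be sketched in a few lines of numpy. The dimensions and the random weights below are purely illustrative; in practice both mappings would be trained with backpropagation to minimize the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 8-dimensional inputs, 3-dimensional bottleneck.
input_dim, code_dim = 8, 3

# Random weights stand in for trained parameters.
W_enc = rng.normal(scale=0.1, size=(input_dim, code_dim))
W_dec = rng.normal(scale=0.1, size=(code_dim, input_dim))

def encode(x):
    # Compress the input into a low-dimensional code.
    return np.tanh(x @ W_enc)

def decode(z):
    # Map the code back into the input space.
    return z @ W_dec

x = rng.normal(size=(5, input_dim))       # a small batch of inputs
x_hat = decode(encode(x))                 # deterministic reconstruction

assert x_hat.shape == x.shape             # output lives in the input space
assert encode(x).shape == (5, code_dim)   # ... via a narrower bottleneck
```

Note that `encode` and `decode` are plain deterministic functions: the same input always yields the same code, which is exactly the property the generative variant gives up.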

Applied to autoencoders, the generative principle states that the encoder should not only map the available data optimally but also learn the latent probability distribution from which its internal encodings are actually generated. With such an encoder model, one can generate new encodings independently of the training data by sampling from the learned probability distribution. Using a large generated dataset, one can estimate how unevenly weighted the original training examples are; by selectively choosing subsets of the training data, a more suitable training set can then be obtained for other learning tasks.

This means that we now want to use the autoencoder to learn two probabilistic inference functions, as graphically depicted in Figure 2 (for further explanations, see the last section on the Wasserstein autoencoder): The first function *Q*_VAE(*Z*|*X*) maps the input space 𝒳 to the space of latent autoencoder representations 𝒵. The second function *P*_G(*X*|*Z*) then maps the latent space 𝒵 to the output space (which is identical to the input space). How this looks for autoencoders based on a neural network is schematically shown in Figure 3. Both inference functions replace the deterministic computation rule for forward propagation in neural networks with a modified function. This function calculates the internal values of the autoencoder from the input values, additionally applying a predefined probability distribution known as the prior distribution; usually, a Gaussian distribution is used for this purpose. This theoretical concept requires fundamental modifications for practical implementation, which are realized in the variational autoencoder.
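A minimal sketch of the two stochastic mappings may help here. All names, dimensions, and the untrained random parameters are illustrative assumptions; the point is only that the encoder now yields the parameters of a Gaussian over the latent space rather than a fixed code, and that new data can be generated by sampling the Gaussian prior and decoding, without any training example being involved.

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, latent_dim = 8, 2

# Illustrative, untrained parameters for the two inference functions.
W_mu = rng.normal(scale=0.1, size=(input_dim, latent_dim))
W_logvar = rng.normal(scale=0.1, size=(input_dim, latent_dim))
W_dec = rng.normal(scale=0.1, size=(latent_dim, input_dim))

def q_encode(x):
    """Q(Z|X): map an input to a Gaussian over the latent space and sample it."""
    mu, logvar = x @ W_mu, x @ W_logvar
    return rng.normal(mu, np.exp(0.5 * logvar))

def p_decode(z):
    """P(X|Z): map a latent code back to the data space."""
    return z @ W_dec

# Generative use: sample codes directly from the Gaussian prior and decode
# them -- no training example is involved at this point.
z_new = rng.standard_normal((10, latent_dim))
x_new = p_decode(z_new)
assert x_new.shape == (10, input_dim)
```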

**A generative model: Variational autoencoder**

The standard backpropagation learning rule of the autoencoder cannot be applied to stochastic functions. This problem is solved by a technique called reparameterization, borrowed from variational calculus, hence the name variational autoencoder. Essentially, the reparameterization of the learning rule causes the deterministic gradients arriving at the internal layer of the autoencoder during the backpropagation step to be varied by the parameters of the probability distribution (see Figure 4).

The calculation of the gradients themselves, from the comparison of the computed output and the learning target (which in this case is identical to the input), must also be modified. This means that a loss function other than the usual MSE must be defined. The reason is that the autoencoder does not simply learn the deterministic relation between input and output but aims to minimize the difference between the distribution of the computed outputs and the distribution of the training targets. Hence, a loss function is needed whose gradient is defined even in areas of the data space where no data are available and where the probabilities of the data distribution are sometimes very low.
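As a sketch of such a modified loss, consider the objective commonly used for variational autoencoders (the negative evidence lower bound, not anything specific to this post): the reconstruction error is combined with a Kullback-Leibler term that, for a diagonal Gaussian encoder and a standard-normal prior, has a simple closed form.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction term: how well the decoder output matches the input.
    recon = np.sum((x - x_hat) ** 2)
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form for a diagonal Gaussian.
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)
    return recon + kl

x = np.ones(4)
# A perfect reconstruction with a standard-normal posterior gives zero loss.
assert vae_loss(x, x, np.zeros(4), np.zeros(4)) == 0.0
# Any deviation of the posterior from the prior is penalised, even when the
# reconstruction itself is perfect.
assert vae_loss(x, x, np.ones(4), np.zeros(4)) > 0.0
```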

For these reasons, so-called divergence measures are suitable for comparing two probability distributions. Unlike loss functions such as MSE, divergence measures are not defined directly over the errors calculated from the outputs of a neural network compared to the training targets. Rather, divergence measures compare the latent probability distribution (which “generated” the training data) with the probability distribution produced by the currently learning model. The theoretical formulas of divergence measures therefore involve integrals. In practice, these integrals cannot be computed analytically; they are approximated by evaluating the training data and model outputs. In the next section, we describe an example of a variational autoencoder with the Wasserstein divergence as the loss function.
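The remark about approximating these integrals can be illustrated with the KL divergence between two one-dimensional Gaussians, where a closed form exists to check against. This is a toy example, not the estimator actually used for autoencoders: the integral E_p[log p(x) − log q(x)] is replaced by an average over samples drawn from p.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two Gaussians p = N(0, 1) and q = N(1, 1); KL(p || q) = 0.5 in closed form.
def log_normal(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Monte Carlo approximation of the divergence integral: average the
# log-density ratio over samples from p instead of integrating analytically.
samples = rng.normal(0.0, 1.0, size=200_000)
kl_estimate = np.mean(log_normal(samples, 0.0, 1.0)
                      - log_normal(samples, 1.0, 1.0))

# The sample average comes close to the analytic value of 0.5.
assert abs(kl_estimate - 0.5) < 0.05
```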

## Wasserstein autoencoder

The conceptual difference between variational autoencoders with the standard loss function and those with the Wasserstein divergence is shown in Figure 5. Figure 5a visualizes the relationships for a variational autoencoder, and Figure 5b for a Wasserstein autoencoder. Both figures are divided into an upper and a lower half. The upper, dark blue half represents the latent, unobservable data space, where the latent correspondences of real data (the “codes”) are shown as white triangles. The lower, light blue half represents the real data space, with the observed (training) data as circles and the reconstructions of the autoencoder as squares.

Some explanations of the symbols in the figures: 𝒳 denotes the real data space where the observed data *X* (the small circles) lie. 𝒵 is the latent data space where the latent codes *Z* lie. The inference function *P*_G(*X*|*Z*) “reconstructs” real data (squares) from latent codes (triangles) based on the learned probability distribution *P*_Z. The inference functions *Q*_VAE(*Z*|*X*) (Figure 5a) and *Q*_WAE(*Z*|*X*) (Figure 5b) map the real observed data (circles) to the latent codes (triangles).

In the case of the VAE in Figure 5a, the standard loss function leads different real data *x* to induce different inference functions *Q*_VAE(*Z*|*X*=*x*) and thus local probability distributions *P*_Z in 𝒵. This is indicated by the several orange circles around the latent codes in Figure 5a. As a result, the latent probability distribution *Q*_Z (white circle) is inadequately approximated, which leads to inaccurate reconstructions of real data by the autoencoder model (green arrows). In contrast, the Wasserstein loss function of the WAE (Figure 5b) allows a consistent approximation *P*_Z of the latent distribution *Q*_Z, so that different real data can be better distinguished in their reconstructions. More information about Wasserstein autoencoders can be found in the paper by Ilya Tolstikhin et al.
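One variant from the Tolstikhin et al. paper, WAE-MMD, penalizes the mismatch between the aggregate distribution of codes and the prior using the maximum mean discrepancy (MMD), which can be estimated purely from two sets of samples. The following is a toy sketch; the RBF kernel, its width, and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def mmd2(x, y, gamma=0.5):
    """Biased estimate of squared MMD between sample sets x and y (RBF kernel)."""
    def k(a, b):
        # Pairwise squared distances, then the Gaussian (RBF) kernel.
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

prior = rng.standard_normal((500, 2))          # samples from the prior P_Z
codes_good = rng.standard_normal((500, 2))     # codes matching the prior
codes_bad = rng.standard_normal((500, 2)) + 5  # codes far from the prior

# Codes whose aggregate distribution matches the prior incur a much
# smaller penalty than codes concentrated far away from it.
assert mmd2(prior, codes_good) < mmd2(prior, codes_bad)
```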

In this blog post, we have outlined the principle of generative models using the example of the neural autoencoder. In practical applications, generative models are often hidden in complex neural or hybrid architectures, such as GANs. The successes of this relatively new type of neural network in image processing, autonomous systems, and more recently, machine translation, demonstrate the appropriateness of this approach for evaluating very large real datasets with the goal of approximating generalizable models.