To get an understanding of a VAE, we'll first start from a simple network and add parts step by step.

A common way of describing a neural network is as an approximation of some function we wish to model. However, a network can also be thought of as a data structure that holds information.

Let's say we had a network comprised of a few deconvolution layers. We set the input to always be a vector of ones. Then, we train the network to reduce the mean squared error between its output and one target image. The "data" for that image is now contained within the network's parameters.
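As a toy sketch of this idea (in numpy, with a single linear layer standing in for the deconvolution stack, and a random 16-value vector standing in for the flattened target image), training against a constant input of ones stores the image entirely in the weights:

```python
import numpy as np

# Hypothetical sketch: a single linear layer in place of the
# deconvolution layers. The input is always a vector of ones, so after
# training, the "image" lives entirely in the weights W.
rng = np.random.default_rng(0)
target = rng.random(16)   # stand-in for a flattened target image
x = np.ones(4)            # constant input vector of ones
W = np.zeros((16, 4))     # the network's parameters

lr = 0.05
for _ in range(2000):
    out = W @ x                           # forward pass
    grad = np.outer(out - target, x)      # gradient of the squared error w.r.t. W
    W -= lr * grad                        # gradient descent step

print(np.allclose(W @ x, target, atol=1e-3))  # True: the weights reproduce the image
```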

Now, let's try it on multiple images. Rather than a vector of ones, we'll use a one-hot vector for the input. [1, 0, 0, 0] could mean a cat image, while [0, 1, 0, 0] could mean a dog. This works, but we can only store up to 4 images. Using a longer vector means adding more and more parameters so the network can memorize the different images.
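To see why one-hot inputs amount to pure memorization, note that a linear decoder applied to a one-hot vector simply selects one stored column of its weight matrix. A small hypothetical numpy sketch:

```python
import numpy as np

# Hypothetical sketch: with one-hot inputs, a linear decoder is just a
# lookup table -- each column of W stores one memorized "image".
rng = np.random.default_rng(1)
images = rng.random((16, 4))   # 4 flattened 16-pixel images, one per column
W = images.copy()              # a perfectly trained decoder has memorized them

cat = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot code for the cat image
dog = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot code for the dog image

assert np.allclose(W @ cat, images[:, 0])   # picks out the stored cat
assert np.allclose(W @ dog, images[:, 1])   # picks out the stored dog
```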

To fix this, we use a vector of real numbers rather than a one-hot vector. We can think of this as a code for an image, which is where the terms encode/decode come from. For example, [3.3, 4.5, 2.1, 9.8] could represent the cat image, while [3.4, 2.1, 6.7, 4.2] could represent the dog. This initial vector is known as our latent variables.

Choosing the latent variables randomly, like I did above, is obviously a bad idea. In an autoencoder, we add another component that takes in the original images and encodes them into vectors for us. The deconvolutional layers then "decode" the vectors back into the original images.

We've finally reached a stage where our model has some hint of a practical use. We can train our network on as many images as we'd like. If we save the encoded vector of an image, we can reconstruct it later by passing it into the decoder portion. What we have here is the standard autoencoder.
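A minimal sketch of such an autoencoder, assuming single linear layers in place of the convolutional encoder and deconvolutional decoder, and random 16-pixel "images" as training data:

```python
import numpy as np

# Hypothetical sketch of a standard autoencoder: a linear encoder maps
# 16-pixel "images" down to a 4-dim latent code, a linear decoder maps
# the code back, and both are trained jointly on reconstruction MSE.
rng = np.random.default_rng(0)
X = rng.random((16, 8))              # 8 tiny flattened training images, one per column
W_enc = rng.normal(0, 0.1, (4, 16))  # encoder weights
W_dec = rng.normal(0, 0.1, (16, 4))  # decoder weights

lr = 0.01
for _ in range(5000):
    Z = W_enc @ X                    # encode: images -> latent vectors
    X_hat = W_dec @ Z                # decode: latent vectors -> reconstructions
    E = X_hat - X                    # reconstruction error
    g_dec = E @ Z.T                  # gradient w.r.t. decoder weights
    g_enc = W_dec.T @ E @ X.T        # gradient w.r.t. encoder weights
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

mse = np.mean((W_dec @ (W_enc @ X) - X) ** 2)
print(mse)   # reconstruction error after training
```

Saving `W_enc @ x` for any image `x` lets us reconstruct it later by multiplying by `W_dec`.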

However, we're trying to build a generative model here, not just a fuzzy data structure that can "memorize" images. We can't generate anything yet, since we don't know how to create latent vectors other than encoding them from images.

There's a simple solution here. We add a constraint on the encoding network that forces it to generate latent vectors that roughly follow a unit normal distribution. It is this constraint that separates a variational autoencoder from a standard one.

Generating new images is now easy: all we need to do is sample a latent vector from the unit gaussian and pass it into the decoder.
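As a sketch (with a hypothetical `decode` function standing in for the trained decoder network):

```python
import numpy as np

# Hypothetical sketch: once the decoder is trained, generation is just
# sampling from the unit gaussian and decoding.
rng = np.random.default_rng(0)
n_z = 4                                  # latent dimensionality

def decode(z, W=rng.random((16, n_z))):  # stand-in for the trained decoder
    return W @ z

z = rng.standard_normal(n_z)   # sample a latent vector from N(0, I)
generated_image = decode(z)    # decode it into a (flattened) image
print(generated_image.shape)   # prints (16,)
```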

In practice, there's a tradeoff between how accurate our network can be and how closely its latent variables can match the unit normal distribution.

We let the network decide this itself. For our loss term, we sum up two separate losses: the generative loss, which is a mean squared error that measures how accurately the network reconstructed the images, and a latent loss, which is the KL divergence that measures how closely the latent variables match a unit gaussian.

generation_loss = mean(square(generated_image - real_image))

latent_loss = KL-Divergence(latent_variable, unit_gaussian)

loss = generation_loss + latent_loss
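In numpy, for a single image, this combined loss might look as follows (a sketch with made-up values; the latent loss uses the standard closed form for the KL divergence between a diagonal gaussian and the unit gaussian):

```python
import numpy as np

# Sketch of the combined loss for one image, with arbitrary example
# values standing in for the network's outputs.
rng = np.random.default_rng(0)
real_image = rng.random(16)
generated_image = rng.random(16)
z_mean = np.array([0.1, -0.2, 0.0, 0.3])     # encoder's mean vector
z_stddev = np.array([0.9, 1.1, 1.0, 0.8])    # encoder's stddev vector

generation_loss = np.mean((generated_image - real_image) ** 2)
# Closed-form KL divergence from N(z_mean, z_stddev^2) to N(0, I):
latent_loss = 0.5 * np.sum(z_mean**2 + z_stddev**2 - np.log(z_stddev**2) - 1)
loss = generation_loss + latent_loss
print(loss)
```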

In order to optimize the KL divergence, we need to apply a simple reparameterization trick: instead of the encoder generating a vector of real values, it generates a vector of means and a vector of standard deviations.

This lets us calculate KL divergence as follows:

# z_mean and z_stddev are two vectors generated by encoder network

latent_loss = 0.5 * tf.reduce_sum(tf.square(z_mean) + tf.square(z_stddev) - tf.log(tf.square(z_stddev)) - 1, 1)
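As a sanity check on this closed form, we can numerically integrate the KL divergence for a 1-d gaussian and compare (a numpy sketch, with arbitrary example values for the mean and standard deviation):

```python
import numpy as np

# Numerically integrate KL(N(mu, sigma^2) || N(0, 1)) on a fine grid and
# compare it against the closed form 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1).
mu, sigma = 0.5, 1.5   # arbitrary example values

x = np.linspace(-10, 10, 200001)
p = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
q = np.exp(-(x**2) / 2) / np.sqrt(2 * np.pi)
kl_numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])

kl_closed = 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1)
print(abs(kl_numeric - kl_closed) < 1e-6)   # prints True
```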

When we're calculating loss for the decoder network, we can just sample from the standard deviations and add the mean, and use that as our latent vector:

samples = tf.random_normal([batchsize, n_z], 0, 1, dtype=tf.float32)

sampled_z = z_mean + (z_stddev * samples)
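The same sampling step in numpy, to make the scale-and-shift explicit (with made-up stand-in values for the encoder's outputs):

```python
import numpy as np

# Draw unit-gaussian noise, then scale and shift it so the result is
# distributed as N(z_mean, z_stddev^2). Because the noise is sampled
# separately, gradients can flow through z_mean and z_stddev.
rng = np.random.default_rng(0)
batchsize, n_z = 64, 4
z_mean = np.full((batchsize, n_z), 2.0)     # stand-in encoder mean outputs
z_stddev = np.full((batchsize, n_z), 0.5)   # stand-in encoder stddev outputs

samples = rng.standard_normal((batchsize, n_z))
sampled_z = z_mean + (z_stddev * samples)
print(sampled_z.shape)   # prints (64, 4)
```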

In addition to allowing us to generate random latent variables, this constraint also improves the generalization of our network.

To visualize this, we can think of the latent variable as a transfer of data.

Let's say you were given a bunch of pairs of real numbers between [0, 10], along with a name. For example, 5.43 means apple, and 5.44 means banana. When someone gives you the number 5.43, you know for sure they're talking about an apple. We can essentially encode infinite information this way, since there's no limit on how many different real numbers we can have between [0, 10].

However, what if there was gaussian noise of one added whenever someone tried to tell you a number? Now when you receive the number 5.43, the original number could have been anywhere around [4.4 ~ 6.4], so the other person could just as well have meant banana (5.44).

The greater the variance of the added noise, the less information we can pass using that one variable.
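We can simulate this noisy-channel intuition directly (a hypothetical sketch using the apple/banana example above):

```python
import numpy as np

# "apple" is 5.43 and "banana" is 5.44. With unit gaussian noise added
# to every transmission, the received number almost never tells the
# listener which of the two was actually sent.
rng = np.random.default_rng(0)
apple, banana = 5.43, 5.44

sent = apple + rng.standard_normal(100000)   # always transmit "apple"
decoded = np.where(np.abs(sent - apple) < np.abs(sent - banana), "apple", "banana")
accuracy = np.mean(decoded == "apple")
print(accuracy)   # barely better than the 0.5 of random guessing
```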

Now we can apply this same logic to the latent variable passed between the encoder and decoder. The more efficiently we can encode the original image, the higher we can raise the standard deviation on our gaussian until it reaches one.

This constraint forces the encoder to be very efficient, creating information-rich latent variables. This improves generalization, so latent variables that we either randomly generated or obtained by encoding non-training images will produce a nicer result when decoded.

How well does it work?

I ran a few tests to see how well a variational autoencoder would work on the MNIST handwriting dataset.