While GANs seem to be overtaking VAEs as the leading class of generative model, I’m still trying to fully understand the mechanism behind VAEs before I get started with GANs. If you’re new to VAEs, these tutorials applied to MNIST data helped me understand the encoding/decoding engines, the latent-space arithmetic potential, etc.:

Miriam Shiffman, code in Tensorflow: http://blog.fastforwardlabs.com/2016/08/12/introducing-variational-autoencoders-in-prose-and.html

Francois Chollet, code in Keras: https://blog.keras.io/building-autoencoders-in-keras.html
The thing that I can’t get my head fully around is the use of the Gaussian KL-divergence term in the overall cost function. From Francois’ code:
> def vae_loss(x, x_decoded_mean):
>     xent_loss = objectives.binary_crossentropy(x, x_decoded_mean)
>     kl_loss = -0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma), axis=-1)
>     return xent_loss + kl_loss
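For reference, if I’m reading the snippet right (with `z_log_sigma` standing in for the log-variance, log σ²), the expression inside `K.mean` is the negated closed-form KL divergence between the encoder’s diagonal Gaussian and a unit Gaussian, which I believe is the result derived in Appendix B of the Kingma & Welling paper:

```latex
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\big)
  = -\tfrac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```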
I think I understand the purpose of the kl_loss term: it is there to ensure that the encoded (or latent-space) variables are an efficient descriptor of the input (in this instance, that the model utilizes its allotted two Gaussian latent units as efficiently as possible to describe the set of handwritten numerals). What I can’t understand is the intuition behind the derivation of the kl_loss function applied to the latent variables… it seems to want to reduce both the realized z_mean and z_log_sigma.
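I tried to convince myself numerically with a quick sketch (plain Python, my own helper function, treating `z_log_sigma` as log σ²). If I evaluate the per-unit penalty directly, it is zero exactly when the mean is 0 and the variance is 1, and it grows when either the mean drifts away from 0 or the variance moves away from 1 in either direction, so the term doesn’t actually push sigma toward zero:

```python
import math

def kl_to_standard_normal(mu, log_sigma_sq):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one latent unit.

    This is the per-dimension quantity that, averaged over units,
    appears (negated) inside the kl_loss term of the Keras snippet.
    """
    return -0.5 * (1.0 + log_sigma_sq - mu**2 - math.exp(log_sigma_sq))

# The penalty vanishes exactly at the standard normal (mu=0, sigma=1):
assert abs(kl_to_standard_normal(0.0, 0.0)) < 1e-12

# Pushing the mean away from 0 increases the penalty (the mu^2 term):
assert kl_to_standard_normal(2.0, 0.0) > kl_to_standard_normal(1.0, 0.0) > 0

# Moving the variance away from 1 in EITHER direction also costs something:
assert kl_to_standard_normal(0.0, math.log(4.0)) > 0    # sigma^2 = 4
assert kl_to_standard_normal(0.0, math.log(0.25)) > 0   # sigma^2 = 1/4
```

So my reading is that the term regularizes each latent unit toward a standard normal, rather than literally minimizing z_mean and z_log_sigma — but I still don’t have good intuition for where the formula itself comes from.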
The appendices of what I believe to be the original VAE paper ( https://arxiv.org/pdf/1312.6114.pdf ) include a formal derivation, but I’ve been struggling to get my head around it… Is there anyone with a deeper math/statistics background to whom this is intuitive? Perhaps @rachel?
Thanks a lot, and I apologize if this is a distraction from the main course material… I had been struggling with it before the course started and figured someone here might know what’s going on.