Does anyone in the class have a very strong grasp of the entire Wasserstein GAN paper? Even with the simplified read-through, parts of it are still opaque to me. Would love to get a little study group in SF together to clarify understanding of this paper, ideally before the next class, since it appears to be pivotal in making GANs performant.

I’m in SF, so can meet at study group on Friday or arrange another time / location in the city.

Note that all of these posts go into much more detail than we need for the implementation next week, which can be summarized as:

Remove the log() from the loss

Clip the weights

But it’s a great paper so if you’re interested in studying it deeply I’m sure you’ll find it interesting. I wouldn’t prioritize it above the various homework assignments however.

Thanks, @jeremy. Vincent’s explanations are a great add-on. He includes his Wasserstein distance Jupyter notebook so you can mess around with the math behind the visualizations yourself.

Here’s the Tensorflow implementation on github that explanation came from:

Copied & pasted with relevant sentences bolded so you don’t even have to leave the forums

A pretty interesting paper that takes on the problem of stability in GANs and interpretability of the loss function during training. GANs essentially are models that try to learn the distribution of real data by minimizing f-divergence (difference in probabilty distribution) by generating adversarial data. The convergence in min max objective of the originally proposed GAN can be interpreted as minimizing the Jensen Shannon (JS) divergence. In this paper, the authors point out the shortcomings in such metrics when the support of the two distributions being compared do not overlap and propose using the earth movers/wasserstein distance as an alternative to JS. The parallel lines example provides a nice intuition to the differences in the f-divergence metrices. Note that when the f-divergence is discrete as in JS, KL we might face problems in learning models with gradients as the divergence loss is not differentiable everywhere.

Theorem 1 proposed in the paper is probably the key takeaway for anyone wondering why wasserstein distance might help in training GANS. The theorem basically states that a distribution mapping function (critic) that is continuous with respect to its parameters and locally lipschitz has a continuous and almost everywhere differentiable wasserstein distance.

A continuous and almost everywhere differentiable metric would mean we can strongly train the discriminator before doing an update to the generator which in turn would receive improved reliable gradients to train from the discriminator. With the earlier formulations of GAN such training was not possible since training discriminator strongly would lead to vanishing gradients.

Given that neural networks are generally continuous w.r.t to its parameters, the thing to make sure is the critic being Lipschitz. By clipping the weight parameters in the critic, we prevent the model from saturating while the growth is made almost linear. This would mean the gradients of the function is bounded by the slope of this linearity becoming Lipschitz bound.

Can anyone share their WGANs training time? I’d be interested to see what speeds you are getting. Probably @jeremy can share how long it took to train the GAN he presented in lesson 10.
I’m currently training a WGAN on the LSUN Bridges dataset, on a P2 instance and it took about 50 hours to process about 2 million samples through the generator (first in batches of 64, then 512). That is about 32k generator iterations of 64 sized batches. This gives a speed of about 11 samples per second (through the generator).
In the paper good results start appearing at around 100k and then great ones at about 300k generator iterations.
Currently there is something that resembles bridges in the generated images, it still has a long way to go:

I’m afraid I don’t remember - although I never really run anything more than overnight, so that’s a max of about 10 hours. Just 100x100 images however.

Is it possible to use pretrained models with GANs? I wonder if I use a pretrained discriminator maybe it will be difficult for the generator to “catch up”

fred just saw your question: This paper (2018) looks at transfer learning in GANs (e.g., use an imagenet trained model to start with). There seems to be mixed results but i found it really interesting that density vs diversity seems more important in helping the outcome (yay for GPU not choking on imagenet size datasets…)

But they supply all you need to play in a github repo here