Can you discuss your “super resolution” of sentence idea a little bit more?

I think Jeremy will talk about this, as he mentioned in the comment.

Can you explain what effect using such a relationship of training iterations for Generator vs. Discriminator over training time would have?

I’m testing this myself to get concrete results, but I strongly feel that the generator is trained in a decreasingly strict manner at first and then in an increasingly strict manner, since the training ratio first decreases to a minimum of 1 and then increases again.

This peaked shape is inspired by lecture 8. The idea is to ease up generator training initially but make it harder eventually, so that it learns better.
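To make the shape concrete, here is a minimal sketch of one possible schedule. The function name `critic_iters`, the linear V shape, and `max_ratio=5` are all assumptions for illustration; the post only says that the ratio goes down, bottoms out at 1, and goes back up.

```python
def critic_iters(step, total_steps, max_ratio=5):
    """Hypothetical schedule: discriminator iterations per generator step.

    Starts at max_ratio, falls linearly to 1 at the midpoint of training,
    then rises linearly back to max_ratio.
    """
    if total_steps < 2:
        return 1
    mid = (total_steps - 1) / 2
    # Distance from the midpoint, scaled to [0, 1].
    frac = abs(step - mid) / mid
    return max(1, round(1 + (max_ratio - 1) * frac))


schedule = [critic_iters(s, 9) for s in range(9)]
print(schedule)  # [5, 4, 3, 2, 1, 2, 3, 4, 5]
```

In a training loop you would call something like this once per generator step to decide how many discriminator batches to run first.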

Hi, I’ve been pondering how to combine image generators and pre-trained image models. Intuitively, it seems like you should somehow be able to use a pre-trained image model like Resnet or VGG to jump start a generator. As in, the pre-trained model already knows about edges, curves, and more abstract features. So shouldn’t we be able to take advantage of that?
Though when I get down to actually considering the implementation, things kinda fall apart. Going from a low-dimensional space (the output class, or output dimensions) to a higher dimensional space is tough. You can’t just “undo” a linear layer, especially with something like ReLU. And same with convolutions. There are many many input configurations that could have generated the outputs at any given layer of a model. So it seems like you’d explode into some insanely huge possible input space as you went backward from layer to layer. But still, the question remains… that it seems there should be some kind of way to extract the information known from the pre-trained model.
So I guess my question is… are there any ways to extract an approximation of the input space that, when given a set of transformations (the neural net), could have been used to generate a given output? If you could, I would think that you could sample from the known possible input space at each layer, and then go from there… any thoughts on this?
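A tiny sketch of why going backward is ambiguous, using a made-up one-row weight matrix `W` (not taken from any real model). Different inputs collapse to the same output after a linear layer plus ReLU, so the output alone cannot tell you which input produced it.

```python
def linear(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, x) for x in v]


# Toy layer mapping 2 inputs to 1 output (weights chosen for illustration).
W = [[1.0, -1.0]]

out_a = relu(linear(W, [3.0, 1.0]))  # 3 - 1 = 2
out_b = relu(linear(W, [5.0, 3.0]))  # 5 - 3 = 2, same output as above
out_c = relu(linear(W, [0.0, 4.0]))  # -4 is clipped to 0
out_d = relu(linear(W, [1.0, 9.0]))  # -8 is also clipped to 0

print(out_a, out_b, out_c, out_d)
```

The linear layer loses information because it reduces dimension, and ReLU loses more by mapping every negative pre-activation to zero.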

From experimenting, I found that Adam with WGANs doesn’t just work worse: it completely fails to train a meaningful generator.

From the WGAN paper:

Finally, as a negative result, we report that WGAN training becomes unstable at times when one uses a momentum based optimizer such as Adam [8] (with β1 > 0) on the critic, or when one uses high learning rates. Since the loss for the critic is nonstationary, momentum based methods seemed to perform worse. We identified momentum as a potential cause because, as the loss blew up and samples got worse, the cosine between the Adam step and the gradient usually turned negative. The only places where this cosine was negative was in these situations of instability. We therefore switched to RMSProp [21] which is known to perform well even on very nonstationary problems.
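A toy illustration of the cosine the quote talks about. This is not the paper’s experiment: the gradient sequence is made up, and `adam_direction_cosines` is a hypothetical helper that just runs the standard Adam moment updates and compares the resulting direction with the current gradient. When the gradient suddenly flips sign (a crude stand-in for a nonstationary loss), the momentum term still points the old way.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adam_direction_cosines(grads, beta1, beta2=0.999, eps=1e-8):
    """Cosine between Adam's bias-corrected direction and the current gradient."""
    m = [0.0] * len(grads[0])
    v = [0.0] * len(grads[0])
    cosines = []
    for t, g in enumerate(grads, start=1):
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        direction = [mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
        cosines.append(cosine(direction, g))
    return cosines


# Five identical gradients, then an abrupt sign flip (made-up data).
grads = [[1.0, 0.5]] * 5 + [[-1.0, -0.5]]

with_momentum = adam_direction_cosines(grads, beta1=0.9)
without_momentum = adam_direction_cosines(grads, beta1=0.0)

print(with_momentum[-1] < 0)     # momentum still points the old way
print(without_momentum[-1] > 0)  # no momentum: direction follows the gradient
```

With β1 = 0 the first-moment estimate is just the current gradient, which is roughly what RMSProp does, so the cosine cannot turn negative in this setup.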

I’m hoping there’ll be a chance to cover this on Monday since we didn’t get to it last week. This is one of the key missing pieces for me in terms of scaling training.

No worries! I’m always amazed by how much we cover, and that there’s the chance to ask questions and interact along the way. I’m also very grateful for all of the help you’ve provided. I’m sad the course is ending, although I’ve got a page worth of possible projects that should keep me busy until next year.

There was a paper at ICLR which talked about GAN convergence, and said that if you normalize the weights by dividing by the spectral norm, it converged a lot better.
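For anyone wondering what "dividing by the spectral norm" means in practice, here is a minimal pure-Python sketch. The spectral norm is the largest singular value of the weight matrix, and power iteration is one common way to estimate it. The helper names and the toy matrix are mine, not from the paper. PyTorch has a `torch.nn.utils.spectral_norm` wrapper for doing this on a real layer.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def spectral_norm(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration."""
    Wt = transpose(W)
    u = normalize([1.0] * len(W))
    for _ in range(n_iters):
        v = normalize(matvec(Wt, u))
        u = normalize(matvec(W, v))
    # sigma is approximately u^T W v once u and v have converged.
    return sum(ui * wi for ui, wi in zip(u, matvec(W, v)))


W = [[3.0, 0.0], [0.0, 1.0]]  # toy weights with singular values 3 and 1
sigma = spectral_norm(W)
W_sn = [[w / sigma for w in row] for row in W]

print(sigma)                # close to 3.0
print(spectral_norm(W_sn))  # close to 1.0 after normalization
```

After dividing by `sigma`, the layer cannot stretch any input vector by more than a factor of about 1.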

I’ve tried this with 0.4 - installed using conda install, and it still fails. Jeremy must have been compiling from source when he was experimenting with 0.4, so is it possible that the in-place worked for him because of that - that the code was specifically compiled for his GPU?

Unfortunately it’s only for images and raw vector data, so you’d need Python to preprocess the data to begin with. Putting everything else aside, I’m not convinced that losing PyTorch, in addition to not being able to use your own hardware, would be worth it.

Looks great for beginners wanting to do deep learning for images though!

Yep, which also means we have to be careful to denormalize our generated data accordingly.
I also tried to find out why tanh is used instead of sigmoid, but no success yet (the DCGAN paper you refer to just mentions that bounded activations are better, but that’s also true of the sigmoid, which is bounded between 0 and 1).
The only reason I can think of is that it works better experimentally on some datasets, maybe because the target values are close to zero: a sigmoid has to saturate to output values near 0, which isn’t the case for tanh since it is defined between -1 and 1.
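A small sketch of the two points above, assuming pixels in [0, 255] are scaled linearly to [-1, 1] to match the tanh range (the notebook’s actual normalization may differ). It shows the round trip for denormalizing and compares the slope of the two activations at a pre-activation of zero. It doesn’t settle why DCGAN picked tanh.

```python
import math


def normalize_pixel(p):
    """Map a pixel in [0, 255] to [-1, 1] (assumed scaling)."""
    return p / 127.5 - 1.0


def denormalize_pixel(x):
    """Map a tanh output in [-1, 1] back to [0, 255]."""
    return (x + 1.0) * 127.5


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
print(denormalize_pixel(normalize_pixel(200)))   # back to about 200
print(sigmoid_grad(0.0), tanh_grad(0.0))         # 0.25 1.0
```

At zero, tanh outputs 0 (the middle of its range) with slope 1, while sigmoid outputs 0.5 with slope 0.25.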

Is there any particular reason why dropout is not used in the Darknet notebook? The training loss is significantly lower than the validation loss at the end, so I would expect dropout to help…

More generally, are there cases where dropout shouldn’t be used? I thought that if we take a network without dropout, make each layer x% wider, then add 1/(1+x%) dropout probability, we should get better results. Is this true?
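One way to read the arithmetic in that question, assuming the goal is to keep the expected number of active units equal to the original width. Under that reading, 1/(1+x) is the keep probability and the drop probability is x/(1+x). The function name `matched_dropout` is made up, and the code only checks the arithmetic; it says nothing about whether the results are actually better.

```python
def matched_dropout(width, x):
    """Widen a layer by a fraction x and pick a matching dropout rate.

    Returns (wide_width, keep_prob, drop_prob) such that
    wide_width * keep_prob is approximately the original width.
    """
    wide = round(width * (1 + x))
    keep_prob = 1 / (1 + x)
    drop_prob = 1 - keep_prob
    return wide, keep_prob, drop_prob


wide, keep_prob, drop_prob = matched_dropout(512, 0.25)
print(wide, keep_prob, drop_prob)  # 640 units, keep about 0.8, drop about 0.2
print(wide * keep_prob)            # about 512 active units in expectation
```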

My intuition is that it’s because batchnorm is used, and it takes some additional work to make the two work together.
But people have had some success using dropout with batchnorm on CIFAR-10.

I am trying to understand CycleGAN by reading the source code. I don’t understand why their generator model has so many of these layers (see bottom) compared to the one we had in WGAN. I also couldn’t find anything about this in their paper. Does anyone know?

The input size is one factor in the loop, but there are more layers than I would expect.
ref:

I have a question regarding the loss functions for the WGAN.

In Jeremy’s code, the gradients are updated as follows:

For the discriminator training:
Loss = Dis(real_sample) - Dis(Gen(seed))
To minimize the loss, the optimizer will push the value of Dis(real_sample) to low negative values and Dis(Gen(seed)) to high positive values, so that the difference becomes large but negative.

For the generator training:
Loss = Dis(Gen(seed))
To minimize the loss, the optimizer will change weights so that the value of Dis(Gen(seed)) becomes small or negative.

This is the opposite direction to the above, which makes sense to me because the two optimizers work against each other and therefore aim to push the same value in opposite directions. This can be seen after the 2 epochs run in the notebook: after the second iteration, “real loss” is pushed to smaller negative values and “fake loss” becomes a larger positive number.

Now looking at the paper (see 1:37:00 in the video), the loss for the generator is given as MINUS Dis(Gen(seed)), i.e. the same direction as the discriminator. Basically this would mean that (intuitively) it would push the weights of the generator so as to improve the score of the discriminator during the generator phase, which does not make sense to me.
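A sketch that may help compare the two sign conventions. It assumes the notebook losses are exactly as written in this post (both minimized), and that the paper’s Algorithm 1 updates the critic by gradient ascent on f(real) - f(fake) while the generator descends on -f(fake). Under those assumptions, a notebook critic Dis that equals minus the paper’s critic f gives identical losses. The scores are made-up numbers and the function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


# Convention described in the post: both losses are minimized.
def notebook_critic_loss(d_real, d_fake):
    return mean(d_real) - mean(d_fake)


def notebook_generator_loss(d_fake):
    return mean(d_fake)


# Paper convention rewritten as losses to minimize: the critic ascends
# f(real) - f(fake), so the equivalent loss is its negative.
def paper_critic_loss(f_real, f_fake):
    return -(mean(f_real) - mean(f_fake))


def paper_generator_loss(f_fake):
    return -mean(f_fake)


# Made-up critic scores; f is defined as minus Dis.
d_real, d_fake = [-2.0, -1.0], [1.5, 0.5]
f_real = [-d for d in d_real]
f_fake = [-d for d in d_fake]

print(notebook_critic_loss(d_real, d_fake), paper_critic_loss(f_real, f_fake))
print(notebook_generator_loss(d_fake), paper_generator_loss(f_fake))
```

So if this reading is right, the minus sign in the paper’s generator loss goes together with the critic being maximized rather than minimized, and the generator still works against the critic in both versions.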