Lesson 12 (2019) discussion and wiki

Does using mixup mean less dropout could be used?

Very probably yes.


What about it? It’s calculated normally.

I didn’t quite catch how Jeremy was suggesting to use mixup with NLP?

He didn’t, you’ll have to figure it out!


After you have the embeddings for a sequence of tokens, you can then do a mixup linear combination of two different embedding sequences
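
A minimal sketch of that idea (hypothetical code, not from the lesson; it assumes PyTorch and two already-embedded sequences of the same shape):

    import torch

    def embedding_mixup(emb_a, emb_b, alpha=0.4):
        # emb_a, emb_b: embedded token sequences of shape (seq_len, emb_dim)
        # Sample the mixing coefficient from a Beta distribution, as in the mixup paper
        lam = torch.distributions.Beta(alpha, alpha).sample()
        # Linear combination of the two embedding sequences
        mixed = lam * emb_a + (1 - lam) * emb_b
        # The loss would then be lam * loss(pred, y_a) + (1 - lam) * loss(pred, y_b)
        return mixed, lam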


That would be the first thing that comes to mind, but if you consider that people often combine word embeddings to get a sentence embedding, then in a way the sentence embedding is already a kind of mixup (at least to me).

Are there certain models where fp16 doesn’t provide much speed up?

It seemed good on vision, but when training a language model I was only getting about a 10% speed-up.

Not sure if it was something specific to my code or if there's something about AWD-LSTM.

Is there an intuitive way to understand why mixup is better than other data augmentation techniques? I’m not picking up on it.


The speed-up depends on your model, and you also need the relevant dimensions (batch size and layer sizes) to be multiples of 8.
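
As a rough illustration (a hypothetical helper, not fastai code; the 400/1150 sizes are just the usual AWD-LSTM defaults):

    def round_up_to_multiple_of_8(n):
        # Round a layer size up so fp16 matrix multiplies can use the tensor cores
        return ((n + 7) // 8) * 8

    emb_size    = round_up_to_multiple_of_8(400)   # 400  (already a multiple of 8)
    hidden_size = round_up_to_multiple_of_8(1150)  # 1152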


Is mixup a type of “crappify” function as discussed in part 1?

No, not really. It doesn’t alter the resolution of the image for instance.

The more I watch the lectures, the more I want to write my own deep learning training library. (In Swift, probably… :smile:)


Python question: Jeremy says “the expansion is either 1 or 4” - is there a way to enforce this in the function definition? I’ve wanted & tried to do this in the past, but haven’t found a way (besides defensive programming with a descriptive error message). (Curious if anyone knows this, not nec. a live question)

Probably you could assert it? Like:

assert expansion in (1, 4)

In our implementation, the blocks just have an expansion attribute; you can assert on its value.
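
For example, a simplified sketch (not the course's exact code) of checking it at construction time:

    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, expansion, ni, nh):
            super().__init__()
            # Python can't restrict an int argument to {1, 4} in the signature itself,
            # so the check happens when the block is built
            assert expansion in (1, 4), f"expansion must be 1 or 4, got {expansion}"
            self.expansion = expansion
            self.conv = nn.Conv2d(ni, nh * expansion, kernel_size=3, padding=1)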

Thanks both, that’s what I meant by defensive programming. Good to know.


I noticed from previous notebooks, and also in this class, that max pooling is never used; instead, stride-2 convolutions are used to reduce the spatial dimensions of layers. So is stride 2 superior? Is it better practice to use stride-2 convolutions instead of max pooling?
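
For reference, a small sketch (plain PyTorch, not the course's code) of the two downsampling options being compared:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)

    # Max pooling halves the spatial size with no learnable parameters
    pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)                   # -> (1, 64, 16, 16)

    # A stride-2 convolution halves the spatial size and also learns how to
    # combine the information that pooling would simply discard
    strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(x)  # -> (1, 64, 16, 16)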


I’ve seen ResNeXt in some papers. Is it the same as XResNet or are those different things?

I was always interested to understand how people come up with all these architectures and tricks to make models more effective. It's great to see these tips and tricks discussed in the lectures. It seems you don't always need sophisticated mathematical derivations to come up with new ideas in deep learning.