Lesson 12 (2019) discussion and wiki

Does using mixup mean less dropout could be used?

Very probably yes.


What about it? It’s calculated normally.

I didn’t quite catch how Jeremy was suggesting to use mixup with NLP?

He didn’t, you’ll have to figure it out!


After you have the embeddings for a sequence of tokens, you can then do a mixup linear combination of two different embedding sequences
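
A minimal sketch of that idea (hypothetical code, not from the lesson; it assumes PyTorch and two already-embedded sequences of the same shape):

    import torch

    def embedding_mixup(emb_a, emb_b, alpha=0.4):
        # emb_a, emb_b: embedded token sequences of shape (seq_len, emb_dim)
        # Sample the mixing coefficient from a Beta distribution, as in the mixup paper
        lam = torch.distributions.Beta(alpha, alpha).sample()
        # Linear combination of the two embedding sequences
        mixed = lam * emb_a + (1 - lam) * emb_b
        # The loss would then be lam * loss(pred, y_a) + (1 - lam) * loss(pred, y_b)
        return mixed, lam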


That would be the first thing that comes to mind, but if you consider that people often combine word embeddings to get a sentence embedding, then in a way the sentence embedding is already a kind of mixup (at least to me).

Are there certain models where fp16 doesn’t provide much speed up?

It seemed good on vision, but when training a language model I was only getting about a 10% speed-up.

Not sure if it was something specific to my code or if there's something about AWD-LSTM.

Is there an intuitive way to understand why mixup is better than other data augmentation techniques? I’m not picking up on it.


The speed-up depends on your model, and you also need the relevant dimensions (batch size and layer sizes) to be multiples of 8.
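
As a rough illustration (a hypothetical helper, not fastai code; the 400/1150 sizes are just the usual AWD-LSTM defaults):

    def round_up_to_multiple_of_8(n):
        # Round a layer size up so fp16 matrix multiplies can use the tensor cores
        return ((n + 7) // 8) * 8

    emb_size    = round_up_to_multiple_of_8(400)   # 400  (already a multiple of 8)
    hidden_size = round_up_to_multiple_of_8(1150)  # 1152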


Is mixup a type of “crappify” function as discussed in part 1?

No, not really. It doesn’t alter the resolution of the image for instance.

The more I watch the lectures, the more I want to write my own deep learning training library. (In Swift, probably… :smile:)


Python question: Jeremy says “the expansion is either 1 or 4” - is there a way to enforce this in the function definition? I’ve wanted & tried to do this in the past, but haven’t found a way (besides defensive programming with a descriptive error message). (Curious if anyone knows this, not nec. a live question)

Probably you could assert it? Like:

assert expansion in (1, 4)

In our implementation, the blocks just have an expansion attribute; you can assert on its value.
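
For example, a simplified sketch (not the course's exact code) of checking it at construction time:

    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, expansion, ni, nh):
            super().__init__()
            # Python can't restrict an int argument to {1, 4} in the signature itself,
            # so the check happens when the block is built
            assert expansion in (1, 4), f"expansion must be 1 or 4, got {expansion}"
            self.expansion = expansion
            self.conv = nn.Conv2d(ni, nh * expansion, kernel_size=3, padding=1)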

Thanks both, that’s what I meant by defensive programming. Good to know.


I noticed from previous notebooks, and also in this class, that max pooling is never used; instead, stride-2 convolutions are used to reduce the spatial dimensions of layers. So is stride 2 superior? Is it better practice to use stride-2 convolutions instead of max pooling?
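
For reference, a small sketch (plain PyTorch, not the course's code) of the two downsampling options being compared:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)

    # Max pooling halves the spatial size with no learnable parameters
    pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)                   # -> (1, 64, 16, 16)

    # A stride-2 convolution halves the spatial size and also learns how to
    # combine the information that pooling would simply discard
    strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(x)  # -> (1, 64, 16, 16)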


I’ve seen ResNeXt in some papers. Is it the same as XResNet or are those different things?

I was always interested to understand how people come up with all these architectures and tricks to make models more effective. It's great to see these tips and tricks discussed in the lectures. It seems you don't always need sophisticated mathematical derivations to come up with new ideas in deep learning.