Does using mixup mean less dropout could be used?
Very probably yes.
What about it? It's calculated normally.
I didn't quite catch how Jeremy was suggesting to use mixup with NLP?
He didn't, you'll have to figure it out!
After you have the embeddings for a sequence of tokens, you can do a mixup linear combination of two different embedding sequences.
That would be the first thing that comes to mind. But considering that people often combine word embeddings to get a sentence embedding, in a way the sentence embedding is already a mixup (at least to me).
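For what it's worth, a minimal sketch of the first idea in PyTorch (the function name is made up, and it assumes the two token sequences are already padded to the same length; the targets would be mixed with the same lambda):

import torch

def mixup_embeddings(emb_a, emb_b, alpha=0.4):
    # emb_a, emb_b: tensors of shape [seq_len, emb_dim] for two examples.
    # Sample the mixing coefficient from a Beta distribution, as in the mixup paper.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * emb_a + (1 - lam) * emb_b
    return mixed, lam  # lam is also used to mix the two targets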
Are there certain models where fp16 doesn't provide much speedup?
It seemed to be good on vision, but training a language model I was only getting about a 10% speedup.
Not sure if it was something specific to my code or if there's something about AWD-LSTM.
Is there an intuitive way to understand why mixup is better than other data augmentation techniques? I'm not picking up on it.
The speedup depends on your model, and you also need to have everything (batch size, layer sizes, vocab) be a multiple of 8.
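To illustrate the multiple-of-8 point: on recent NVIDIA GPUs, fp16 matmuls only hit the tensor cores when the dimensions involved are multiples of 8, so it can pay to pad sizes up. A sketch with AWD-LSTM-ish sizes (the helper is made up):

import torch.nn as nn

def round_to_8(n):
    # Round n up to the nearest multiple of 8 (tensor-core friendly).
    return ((n + 7) // 8) * 8

# An embedding size of 400 is already a multiple of 8, but a hidden size
# of 1150 and a ~60k vocab are not, so round them up before going fp16.
vocab_sz, emb_sz, hidden_sz = round_to_8(60002), 400, round_to_8(1150)
emb = nn.Embedding(vocab_sz, emb_sz)
rnn = nn.LSTM(emb_sz, hidden_sz)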
Is mixup a type of "crappify" function as discussed in part 1?
No, not really. It doesn't alter the resolution of the image, for instance.
The more I watch the lectures, the more I want to write my own Deep Learning model-training library. (In Swift, probably...)
Python question: Jeremy says "the expansion is either 1 or 4" - is there a way to enforce this in the function definition? I've wanted & tried to do this in the past, but haven't found a way (besides defensive programming with a descriptive error message). (Curious if anyone knows this, not necessarily a live question.)
Probably you could assert it? Like:
assert expansion in (1, 4)
In our implementation, the blocks just have an expansion attribute. You can assert on its value.
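If you want the constraint visible in the signature itself, typing.Literal (Python 3.8+) is one option: static checkers like mypy enforce it, though at runtime you still want the assert. A sketch with a made-up function name:

from typing import Literal

def make_block(ni, nf, expansion: Literal[1, 4] = 1):
    # A type checker flags calls passing anything other than 1 or 4;
    # the assert catches bad values at runtime too.
    assert expansion in (1, 4), f"expansion must be 1 or 4, got {expansion}"
    ...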
Thanks both, that's what I meant by defensive programming. Good to know.
I noticed from previous notebooks, and also in this class, that max pooling is never used; instead, stride-2 convolutions are used to reduce the spatial dimensions. So is stride 2 superior? Is it better practice to use stride 2 instead of max pooling?
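For reference, a PyTorch sketch of the two downsampling options (sizes are arbitrary): a stride-2 convolution learns its downsampling along with the features, while max pooling is a fixed operation.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)

# Option 1: strided convolution - downsampling is learned with the features.
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)

# Option 2: stride-1 convolution followed by max pooling - downsampling is fixed.
pooled = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.MaxPool2d(2))

print(strided(x).shape, pooled(x).shape)  # both: [1, 64, 14, 14]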
I've seen ResNeXt in some papers. Is it the same as XResNet, or are those different things?
I was always interested to understand how people come up with all these architectures and tricks to make models more effective. It is great to see these tips and tricks discussed in the lectures. It seems you don't always need to perform sophisticated mathematical derivations to come up with new ideas in Deep Learning.