MaxPool2d is actually used in the XResNet
It’s not the same thing AFAIK; it’s our version of the bag-of-tricks ResNet.
I thought it was average pooling only
No, there is one in the very first layers ;)
OH, that’s right! hahah thanks @sgugger.
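For anyone else who was confused by this: here’s roughly where that max pool sits. A minimal sketch of an XResNet-style stem, assuming the bag-of-tricks layout (channel sizes approximate, not the exact fastai source):

```python
# Sketch of an XResNet-style stem: three 3x3 convs replace the usual 7x7,
# followed by the MaxPool2d in question. Not the actual fastai code.
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

stem = nn.Sequential(
    conv_bn_relu(3, 32, stride=2),
    conv_bn_relu(32, 32),
    conv_bn_relu(32, 64),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # the max pool in the first layers
)
# Average pooling only shows up at the end of the network,
# as nn.AdaptiveAvgPool2d(1) before the classifier head.
```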
Here are the benchmarks Jeremy is sharing: https://github.com/cgnorthcutt/benchmarking-keras-pytorch
But I am still not sure when I should use one or the other. Are there any best-practice rules? Sorry if this is a dumb question hahah
When running out of memory, does gc.collect() work?
On the leaderboard, the accuracy from a single run is recorded. What is the variance across different train/test splits? Do we care?
Depends on the type of memory. I guess if you hit a CUDA exception, you need to use something else or restart the kernel.
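A rough sketch of the usual cleanup dance in a notebook, assuming PyTorch: gc.collect() only reclaims Python objects, so CUDA-side caches need torch.cuda.empty_cache() on top of that, and a truly wedged kernel still needs a restart.

```python
import gc
import torch

# Drop references to large objects first, e.g. `del learn` (hypothetical name).
gc.collect()  # reclaim unreferenced Python objects

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # hand PyTorch's cached CUDA blocks back to the driver
    # Sanity check: how much GPU memory is still allocated vs. reserved
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```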
I agree that deeply understanding the model and optimizing it like the Bag of Tricks paper did is great. Don’t you think automatic search for good architectures still has a place, though? For example, I’ve seen you recently liked a new paper that found a SOTA architecture for object detection through automatic architecture search.
There is an official train/test split to use.
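On the variance question: one cheap way to get an error bar without touching the official split is to repeat training with different seeds and report mean ± std. A sketch, where `train_and_eval` is a hypothetical stand-in for your own training loop:

```python
import statistics

def train_and_eval(seed: int) -> float:
    # Hypothetical stand-in: set the seed, train on the official train split,
    # and return accuracy on the official test split.
    raise NotImplementedError

accs = [train_and_eval(seed) for seed in range(5)]  # 5 seeds, same split
print(f"accuracy: {statistics.mean(accs):.4f} +/- {statistics.stdev(accs):.4f}")
```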
It’s still very immature and hard to reproduce, in our experience.
Any idea why?
Is it safe to overfit when you are doing transfer learning? It looks like Jeremy overfitted before doing transfer learning.
It’s still very new, that’s why.
I think Jeremy has an argument against cross-validation for deep learning; I forgot what it was.
Would love to hear this argument against X-val in DL…
Confusing naming, indeed