@jeremy @sgugger I assume you have run experiments to select these parameters for v2. Do you think these results could be specific to this dataset?
The problem with the head was my mistake. I removed the initial batchnorm in the head without testing that change properly. I thought it wouldn’t have any negative impact, but I was wrong.
I’m interested in the eps issue myself, and I also noticed it in the RSNA comp. I’m still not sure when it should be high and when it should be low. Perhaps the recent SAdam is the better approach: https://arxiv.org/abs/1908.00700v2
I also noticed something similar with eps, although I just played around with a couple of values on a quick baseline I had for RSNA.
Thanks for sharing the SAdam paper. This looks like an interesting approach: they replace eps with an activation function (softplus) that has its own hyperparameter to tune. In addition, it seems possible to swap in a different activation function, which adds yet another choice to tune. So there is probably some additional work to be done to find generic parameter choices that work on a wide variety of problems.
Nevertheless, this is very interesting work! Thanks again for sharing!
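For anyone following along, here is a rough sketch of the eps-versus-softplus idea as I read it from the paper, not the authors' reference code. It only compares the denominator of the update for a single scalar second-moment estimate `v`. The function names are made up for illustration, and `beta=50.0` is an assumed value for the softplus hyperparameter, so check the paper before relying on it.

```python
import math

def softplus(x, beta=50.0):
    # softplus(x) = (1 / beta) * log(1 + exp(beta * x)),
    # written in a numerically stable form (valid for beta > 0).
    z = beta * x
    return max(x, 0.0) + math.log1p(math.exp(-abs(z))) / beta

def adam_denominator(v, eps=1e-8):
    # Standard Adam: sqrt of the second-moment estimate plus a small eps.
    return math.sqrt(v) + eps

def sadam_denominator(v, beta=50.0):
    # Softplus variant (assumed form): softplus applied to sqrt(v), no eps.
    return softplus(math.sqrt(v), beta)

if __name__ == "__main__":
    # Compare how the two denominators behave as v shrinks toward zero.
    for v in (1.0, 1e-4, 1e-8, 0.0):
        print(v, adam_denominator(v), sadam_denominator(v))
```

The point of the comparison: when `v` is large the two denominators are nearly identical, but as `v` goes to zero the softplus version levels off at `log(2) / beta` instead of at `eps`. So a smaller `beta` plays a role similar to a larger eps, which is why it is still a hyperparameter that needs tuning.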