Hello.
I was also asking myself these kinds of questions, and I ran some experiments on the Rossmann data last Friday.
All I varied was the network depth and the layer sizes (along with a few dropout-probability variants).
Here is what I learned (on a local machine with a GTX 1070):
The default batch size (64) is pretty small and slow. Only a much bigger batch size (8196) pushed my GPU utilization close to 100% (memory consumption was still only around 50%, i.e. ~4 GB). On top of that, training time for 6 epochs dropped from 17 minutes to just 2 minutes, which let me run far more experiments per evening. The best result in terms of speed and accuracy (if I may use such a composite criterion) came from a [2048, 1024, 512, 64] net (and similar ones) with bs=8196.

The only problem was that after 6 epochs the training loss was much lower than the validation loss, so my net happily overfitted. Adding more weight decay and/or dropout didn't help much in terms of accuracy, although it did bring the training loss much closer to the validation loss. BUT the original batch size (64), while pretty slow to train with, gave much less overfitting. That was encouraging, but I couldn't push the validation metric significantly lower: my best results with both bs=64 and bs=8196 were in the 0.1045–0.105 range. My guess is that the data itself only contains so much information, and any net larger than [1000, 500] has enough parameters to capture (almost) all of it.
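For reference, here is a minimal PyTorch sketch of the kind of plain fully connected net I mean. The layer sizes are the ones above; the dropout probability and the 230-dimensional input are just illustrative placeholders, not tuned values:

```python
import torch
import torch.nn as nn

def make_mlp(n_in, layer_sizes, p_drop=0.1):
    """Plain fully connected net with ReLU, BatchNorm and dropout.

    n_in        -- number of input features (continuous + embedding outputs)
    layer_sizes -- hidden layer widths, e.g. [2048, 1024, 512, 64]
    p_drop      -- dropout probability (illustrative, not tuned)
    """
    sizes = [n_in] + layer_sizes
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU(), nn.BatchNorm1d(b), nn.Dropout(p_drop)]
    layers.append(nn.Linear(sizes[-1], 1))  # single regression output (log sales)
    return nn.Sequential(*layers)

# ~230 inputs counting all the embeddings, as in my Rossmann setup
model = make_mlp(230, [2048, 1024, 512, 64])
print(model)
```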
I also tried to formulate some kind of rule of thumb for how many parameters to use. My starting point was the size of the input layer (around 230 input values, counting all the embeddings). All I found is that 1x (256 neurons in the first layer) and 2x (512) are not enough, at least for this case. So 4x (around 1000 in the first layer seems to be a good spot) may work, although many, many more experiments are still needed; a small sketch of this rule follows.
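As a hypothetical helper (the "start at a multiple of the input size and halve down to a floor" scheme is my own assumption, not something from the notebook), the rule of thumb could look like this:

```python
def rule_of_thumb_layers(n_in, mult=4, floor=256):
    """Hypothetical layer-size generator: start at mult * n_in,
    then halve the width until it drops below the floor."""
    width = mult * n_in
    sizes = []
    while width >= floor:
        sizes.append(width)
        width //= 2
    return sizes

# With ~230 inputs this gives [920, 460], roughly the [1000, 500]-ish
# shape that worked reasonably well for me.
print(rule_of_thumb_layers(230))
```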
Hope this was useful for you in some way; I'd be very interested in your experiments with the number of parameters (layers) on your data (or your thoughts on the Rossmann one).
(Sorry for my English.)
TL;DR: Next time I encounter a task similar to Rossmann, I will probably start with a bigger (8k-ish) batch size and try different layer sizes (starting with [4x, 2x, 1x, …] down to something like 512 or 256, where x is the number of input values), to find where the accuracy stops improving. Then I'll train that net with bs=64 to see whether I can avoid the overfitting while still improving the result. And then (as it is often more time-consuming) I'll try to improve the data itself (more cleaning, maybe more feature engineering). A rough sketch of this plan is below.
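As a rough sketch of that two-phase plan (the random tensors and the small net here are placeholders, not the real Rossmann pipeline):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors standing in for the processed features/targets.
X = torch.randn(10_000, 230)
y = torch.randn(10_000, 1)
train_ds = TensorDataset(X, y)

# Phase 1: 8k-ish batches for fast architecture search (high GPU utilization).
fast_dl = DataLoader(train_ds, batch_size=8192, shuffle=True)
# Phase 2: the default small batch size, to check whether overfitting goes away.
slow_dl = DataLoader(train_ds, batch_size=64, shuffle=True)

def train(model, dl, epochs=6, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for xb, yb in dl:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()

# Example: a [920, 460]-shaped net (4x/2x of the 230 inputs).
model = torch.nn.Sequential(
    torch.nn.Linear(230, 920), torch.nn.ReLU(),
    torch.nn.Linear(920, 460), torch.nn.ReLU(),
    torch.nn.Linear(460, 1),
)
train(model, fast_dl)   # quick architecture search with big batches
train(model, slow_dl)   # then the slower bs=64 run to compare overfitting
```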
These are just my thoughts after a couple of evenings playing with the Rossmann data. I'm far from claiming they apply to other cases, but maybe it's better than nothing as a starting point.