Tabular_learner layers=[] parameter guidelines?

(Peter Walkley) #1

I am having trouble understanding what the layers= parameter does and how we should use it in tabular models. The Rossman notebook has layers=[1000,500] and lesson4-tabular has layers=[200,100], but where did these numbers come from? Lesson 4 (around 50 minutes in) says that we would come back to them, but I don’t see that it ever happened. Are there any guidelines, or a section explaining this in detail?

The values don’t seem to have any relation to the data frame size that I could see, and there isn’t much in the documentation. I can see the effect if I look at constructed models via ‘learn.model’, but why 200,100 as opposed to 20,10 or 20000,10000?

I’m still climbing the python learning curve, so looking in the fastai code at the moment is raising too many unknowns and “stuff to learn” for me to understand.

PS: I’m quite happy to be told I’m blind and be pointed at something!

The question seems to have been asked before on the 2018 course, but there wasn’t ever really an answer: Understanding of the layers parameter in get_tabular_learner

(Cary) #2

Yes, I have the exact same question. I really hope someone can explain it.

(Zak Raicik) #3

I have the same question. Is this arbitrary?

(Kyle Nesgood) #4

Looking for input as well. @sgugger - any suggestions? I don’t want to bother Jeremy :slight_smile:

(Amit Kayal) #5

I also have the same doubts and am not able to understand this API.

(Zak Raicik) #6

@jeremy can you provide some guidance?

(Ravi Sekar Vijayakumar) #7

My understanding is that they are the numbers of hidden neurons in the two hidden layers: the first hidden layer has 1000 and the second has 500 (for Rossman). My guess on how these were arrived at would be that it depends on the number of classes.
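To make the interpretation above concrete, here is a minimal sketch (plain Python, no fastai) of how a `layers=[...]` list maps to a stack of fully connected layers and how strongly it drives the parameter count. The input width of 230 is the rough Rossman figure mentioned later in this thread, and the single output and the omission of BatchNorm/dropout parameters are simplifying assumptions.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def count_params(n_in, layers, n_out=1):
    """Total parameters in the stack of Linear layers that a
    layers=[...] list produces (ignoring BatchNorm, which is small)."""
    sizes = [n_in] + list(layers) + [n_out]
    return sum(linear_params(a, b) for a, b in zip(sizes, sizes[1:]))

n_in = 230  # rough input width after embeddings (assumption from this thread)
print(count_params(n_in, [200, 100]))    # lesson4-tabular net -> 66401
print(count_params(n_in, [1000, 500]))   # Rossman notebook net -> 732001
```

So going from [200,100] to [1000,500] is roughly an 11x jump in capacity, which is why the choice matters far more than the list's superficial similarity suggests.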

(Pavel) #8

I was also asking myself these kinds of questions, and I conducted some experiments on the Rossman data last Friday.
All I tried was changing the network depth and the layer sizes (with some variants of dropout probabilities).
Here’s what I learned (on a local machine with a GTX 1070).

The default batch size (64) is pretty low and slow. Only a much bigger bs (8196) got my GPU utilization close to 100% (yet memory consumption was still close to 50%, i.e. 4 GB). Moreover, training time for 6 epochs dropped from 17 min to just 2 min (which let me run far more experiments than a 17-minute loop would). The best result in terms of speed and accuracy (if I may use such a composite criterion) was a [2048, 1024, 512, 64] net (and similar ones) with bs=8196. The only problem was that after 6 epochs the training loss was much lower than the validation loss (so my net successfully overfitted :slight_smile: ). Adding more weight decay and/or dropout didn’t help much in terms of accuracy (although the training loss became much closer to the validation loss). BUT the initial bs (64), although pretty slow to train, gave much less overfitting. That was encouraging, but I couldn’t get the error significantly lower (my best results at both bs=64 and bs=8196 were in the range of 0.1045–0.105ish). My take is that the data itself contains only so much information, and nets bigger than [1000, 500] have enough parameters to capture (almost) all of it.
I also tried to formulate some kind of rule of thumb for how many parameters we should use. My starting point was the size of the input layer (something around 230 input values, counting all embeddings). All I found is that x1 (256 neurons in the initial layer) and x2 (512) are not enough (for this case at least). So x4 may work, I hope (as 1000 in the first layer is a pretty good spot), although many, many more experiments are still needed.
Hope this was in some way useful for you; I would be very interested in your experiments with the number of parameters (layers) on your data (or maybe your thoughts on the Rossman one).
(Sorry for my English :slight_smile: ).

TLDR; Next time I encounter a task similar to Rossman, I will probably start with a bigger (8k-ish) bs and different layer sizes (starting with [4x, 2x, 1x, …] down to something like 512 or 256, where x is the number of input values), trying to find where accuracy stops improving. Then I’ll train that net with bs=64 to see if I can avoid overfitting while still getting better accuracy. And then (as it is often more time-consuming) I will try to improve my data (more cleaning, maybe more feature engineering).
These are just my thoughts after a couple of evenings playing with the Rossman data. I’m far from thinking they’re applicable to other cases, but maybe it’s better than nothing as a starting point :slight_smile:
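The [4x, 2x, 1x, …] heuristic above can be sketched as a tiny helper. Note that `suggest_layers` is a hypothetical name for illustration, not a fastai API; the multiplier of 4 and the floor of 256 are the guesses from this post, not established defaults.

```python
def suggest_layers(n_inputs, multiplier=4, floor=256):
    """Return a [4x, 2x, 1x, ...] layer-size list, halving down to
    (and including) the first size at or below `floor`."""
    sizes = []
    size = n_inputs * multiplier
    while size > floor:
        sizes.append(size)
        size //= 2
    sizes.append(size)
    return sizes

print(suggest_layers(230))  # Rossman-ish input width -> [920, 460, 230]
print(suggest_layers(256))  # -> [1024, 512, 256]
```

The result would then just be passed as the layers= argument (e.g. `layers=suggest_layers(n_inputs)`) as a starting point for the search, before tuning by hand.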

(Peter Walkley) #9

Thanks Pavel. I did some experimentation with batch sizes after your post and certainly see the training times come down. One interesting thing I saw was that after a while, the time needed by the learning rate finder went up quite significantly, as it needed a few epochs to decide the rate: so maximising the GPU memory in use isn’t necessarily the “best” optimisation to go for. My dataset isn’t that big and I’m using a 1070Ti with 8GB RAM, so it’s possible the whole lot fits into the CUDA space available.

Judging by the replies, I at least asked a good question :slight_smile:

(Pavel) #10

My thought on that: the LR finder needs roughly the same number of batches to test all the learning rates, regardless of batch size. With big BS’s, that means more than one epoch to test them all.

Yes, you’re right. I mentioned it in conjunction with GPU utilization (I mean the GPU processor itself). In this case it’s definitely GPU parallelism that drives the loop time down (much more than GPU memory usage, whose role here is just not to be a bottleneck).
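A quick back-of-the-envelope check of the point above: the LR finder runs a roughly fixed number of iterations (fastai v1's lr_find defaults to num_it=100), so a large batch size can push it past a full epoch. The dataset size here is an illustrative assumption, not the exact Rossman row count.

```python
def lr_finder_epochs(n_rows, bs, num_it=100):
    """Epochs the LR finder consumes: fixed iterations / batches per epoch."""
    batches_per_epoch = -(-n_rows // bs)  # ceiling division
    return num_it / batches_per_epoch

n_rows = 800_000  # roughly Rossman-sized training set (assumption)
print(lr_finder_epochs(n_rows, bs=64))    # tiny fraction of an epoch (0.008)
print(lr_finder_epochs(n_rows, bs=8196))  # just over one full epoch
```

So at bs=64 the LR finder is nearly free, while at bs=8196 it has to chew through the whole dataset, which matches the slowdown Peter observed.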

By the way, did you encounter similar overfitting in your task (using a bigger BS and/or deeper and wider layers)?

(Peter Walkley) #11

Hi Pavel

Yes, overfitting is an issue. The “right” number of epochs seems to vary between 3 and 5 for my data set. I’m currently looking at the various callbacks available, such as saving model state as you go and the early stopping callback.

I haven’t tried a huge number of experiments with different values of the layers parameter as, until your reply, I felt it would all be guesswork. I asked the original question in the hope there was an answer and I’d just not picked up on it in the lessons or notes.

I’m not entirely surprised at overfitting as what I’m doing is a proof of concept and my set is smaller than I’d like. Much of this is a learning exercise in how to get the best out of fastai, so “quick if wrong” is (sort of !) helpful.

There is also a latent possibility that it’s a time-series-dependent issue (which I’m trying to engineer out with pre-processing of the data). I’ve been watching the time series discussion thread with interest, but I’m currently too far behind the learning curve to understand a lot of it.