Tabular-Learner fine tuning

joresh · March 6, 2019, 12:04am

Hi all,

I have been using the fastai tabular learner in some competitions and am getting pretty good scores. But I am not getting the best scores. With other algorithms like XGBoost and LightGBM, getting an excellent score takes a lot of parameter fine tuning. In fastai, the only fine tuning I could understand from Jeremy’s lessons is setting the learning rate. Are there any other parameters we can fine tune? And how do we do it? The LR is the most important parameter, but what other parameters impact the accuracy?

Thanks

PegasusWithoutWinds · March 6, 2019, 5:31am

It really depends on which aspect of the model you are trying to improve.

If you are trying to reduce the training loss, you could try to train a bigger and deeper model, simply train longer, or do a hyperparameter search. Hyperparameter tuning is a huge topic, but you are on the right track: the learning rate is generally considered the most important one. If you believe that you have fiddled with it enough, you can try the β of momentum and the mini-batch size. I would usually stop there. At the very end, you could always customize the architecture, but for most people that is not necessary.

If there is a big gap between your training loss and the validation loss, meaning that your model has a high variance and is not generalizing well, you could try regularization techniques like dropout and data augmentation.
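
To make the β of momentum mentioned above concrete, here is a minimal pure-Python sketch of gradient descent with momentum on a toy one-dimensional objective. The function names and the toy objective are illustrative assumptions, not fastai code. In fastai v1, `fit_one_cycle` accepts a `moms` tuple for the momentum range, though that is worth checking against the docs for your version.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=50):
    """Minimise a 1-D function with gradient descent plus momentum.

    beta controls how much of the previous update is carried over:
    beta=0 is plain gradient descent, values near 1 keep a long memory
    of past gradients.
    """
    w, velocity = w0, 0.0
    for _ in range(steps):
        grad = grad_fn(w)
        # Exponentially decaying sum of past gradients.
        velocity = beta * velocity + grad
        w = w - lr * velocity
    return w


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
def toy_grad(w):
    return 2.0 * (w - 3.0)


for beta in (0.0, 0.5, 0.9):
    print(beta, sgd_momentum(toy_grad, w0=0.0, beta=beta))
```

With β close to 1 the iterate can overshoot and oscillate before settling, which is one reason momentum and learning rate are usually tuned together.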

joresh · March 6, 2019, 7:35pm

Thanks!

I am attempting to use the fastai functions in the Santander competition on Kaggle. It is a tabular data classification problem.

I am creating the data object as below

data = (TabularList.from_df(train_modified, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(list(range(10000, 30000)))
.label_from_df(cols=dep_var)
.add_test(TabularList.from_df(test, path=path, cat_names=cat_names, cont_names=cont_names))
.databunch())

I see that the batch size argument is in the ‘from_df’ function as below

from_df(path, df:DataFrame, dep_var:str, valid_idx:Collection[int], procs:Optional[Collection[TabularProc]]=None, cat_names:OptStrList=None, cont_names:OptStrList=None, classes:Collection[T_co]=None, test_df=None, bs:int=64, val_bs:int=None, num_workers:int=8, dl_tfms:Optional[Collection[Callable]]=None, device:device=None, collate_fn:Callable=‘data_collate’, no_check:bool=False) → DataBunch

I could not find the β of momentum parameter. In which function do we specify it?

Regards

joresh · March 6, 2019, 10:56pm

When I passed bs=32 to from_df(), I got the error
TypeError: __init__() got an unexpected keyword argument 'bs'

How do I pass a different batch size while creating the data object?

Thanks in advance!

mindtrinket · March 6, 2019, 11:17pm

Try creating your data object like this. I define “test” outside of the object.

data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                       .random_split_by_pct(valid_pct=0.1, seed=42)
                       .label_from_df(cols=dep_var)
                       .add_test(test)
                       .databunch(bs=1024))

joresh · March 6, 2019, 11:34pm

That worked! Thanks! I assumed that from_df() would pass the ‘bs’ to the functions it calls. Looks like that does not happen, or I did not understand the calls.

spacecadet · April 2, 2019, 8:03am

Did you finally get the fastai tabular learner to work for the Santander competition?

joresh · April 11, 2019, 3:34pm

Yes, I did, but I got better scores using XGBoost. Therefore I switched to GBM.

spacecadet · April 12, 2019, 11:21am

What accuracy did you manage to achieve with gradient boosting?
I used fastai tabular, but my accuracy on the test set maxed out at about 0.86.
I will leave a link to the kernel below; I hope everyone can check it out.

https://www.kaggle.com/yngspacecadet/fastai-tabular-model

mindtrinket · April 12, 2019, 1:20pm

I was running into something very similar in Santander and asked the exact same question for running NN on the datasets.

I was able to get to .88 with an augment that did some column shuffling of the data.

I saw another kernel here that got to .89 by using K-folds, which is as high as the GBM.

So I think it was really the K-fold. Is there a way to do that in Fast.ai?

Oh and congrats on your first Kaggle competition @spacecadet!
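
On the K-fold question: one possible approach with the data block API shown earlier in this thread is to generate one set of validation indices per fold and pass each set to `split_by_idx`. Below is a minimal sketch of the index generation only. `kfold_indices` is a hypothetical helper, not a fastai function, and the folds are not stratified, which may matter for an imbalanced target.

```python
import random


def kfold_indices(n_rows, n_folds=5, seed=42):
    """Yield (fold_number, validation_indices) pairs covering every row exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed so the folds are reproducible
    for fold in range(n_folds):
        # Every n_folds-th shuffled row, offset by the fold number, is held out.
        yield fold, sorted(indices[fold::n_folds])


for fold, valid_idx in kfold_indices(n_rows=10, n_folds=5):
    # In the thread's setup, valid_idx would go to .split_by_idx(valid_idx),
    # a fresh learner would be trained, and test predictions averaged over folds.
    print(fold, valid_idx)
```

Training a fresh learner per fold multiplies the training time by the number of folds, so it may be worth settling the learning rate on a single split first.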

marvin · April 24, 2019, 5:04pm

I was working exclusively with the tabular learner for the past three weeks.
Here are my three take-aways:

1. No tuning other than a stage-wise learning rate is required. (*)
2. [EDIT: This might not be true. Redo experiments.] When forecasting values in a time series, it makes no difference whether you predict the value itself or its percentage change.
3. Feature engineering makes the most difference, specifically converting implicit knowledge into explicit variables. For instance, for a time series, adding a feature that measures the distance from the moving average immediately improves accuracy.
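
As an illustration of the moving-average feature described above, here is a minimal pure-Python sketch. The function name, the window size, and the sample values are assumptions made for the example, and whether the current value belongs in the window depends on what is known at prediction time.

```python
def distance_from_moving_average(values, window=3):
    """Return value minus the mean of the trailing window, per position.

    The window includes the current value. Positions without a full
    window get None.
    """
    features = []
    for i in range(len(values)):
        if i + 1 < window:
            features.append(None)
            continue
        trailing = values[i + 1 - window : i + 1]
        features.append(values[i] - sum(trailing) / window)
    return features


sales = [10.0, 12.0, 11.0, 15.0, 9.0]  # made-up series for illustration
print(distance_from_moving_average(sales, window=3))
```

The resulting column could be added to the continuous variables (`cont_names` in the data objects shown earlier) once the leading None values are filled or dropped.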

After having applied all three lessons, my tabular learner went from ~70% rmspe to about 95% rmspe with a mean absolute percentage error (mape) of 2 - 3%.

Looking back, my case really is about the same as the Rossmann example in the sense of dealing with a feature engineering problem in the first place.

(*) Stage-wise LR means that you start with a mega-rate, then use a rate ten times higher than the optimal one, and finally use the optimal rate to fine-tune. Example:

https://gist.github.com/marvin-hansen/6f43f6e9744e6b8262346207354db4e8

run_learner.py

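def run_learner(learn):
    """Train in three stages with decreasing learning rates, then plot the losses."""
    # Stage 1: start with a "mega" rate
    learn.fit_one_cycle(3, 3e-4)
    # Stage 2: about 10x the final rate, with stronger weight decay
    learn.fit_one_cycle(5, 1e-6, wd=0.3)
    # Stage 3: the final (optimal) rate for fine-tuning, with weaker weight decay
    learn.fit_one_cycle(5, 1e-7, wd=0.1)
    # Plot training and validation losses
    learn.recorder.plot_losses()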
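
If you want to experiment with the stages, one option is to keep them in a table and loop over it. The sketch below uses the same `fit_one_cycle` calls and values as `run_learner` above. `run_stages`, `STAGES`, and `RecordingLearner` are hypothetical names introduced here, and the recording class is only a stand-in for dry runs without fastai installed.

```python
# (epochs, max learning rate, weight decay) per stage, copied from run_learner.
STAGES = [
    (3, 3e-4, None),
    (5, 1e-6, 0.3),
    (5, 1e-7, 0.1),
]


def run_stages(learn, stages=STAGES):
    """Call learn.fit_one_cycle once per stage, in order."""
    for epochs, max_lr, wd in stages:
        if wd is None:
            learn.fit_one_cycle(epochs, max_lr)
        else:
            learn.fit_one_cycle(epochs, max_lr, wd=wd)


class RecordingLearner:
    """Stand-in for a fastai Learner (dry runs only): records each call."""

    def __init__(self):
        self.calls = []

    def fit_one_cycle(self, epochs, max_lr, wd=None):
        self.calls.append((epochs, max_lr, wd))


dry_run = RecordingLearner()
run_stages(dry_run)
print(dry_run.calls)
```

With a real learner you would pass it to `run_stages` in place of `dry_run`; keeping the stages as data makes it easier to log which schedule produced which result.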

joresh · April 24, 2019, 5:38pm

Thanks for the insight! I too have learnt the hard way that better feature engineering beats better algorithms. Maybe that is why ML is still an ‘art’ and not a science.

marvin · April 24, 2019, 7:18pm

Thank you @joresh

I should have mentioned that I actually built an experiment pipeline and spent a few days testing every possible combination of features, in and out, on the actual model until I got a decent combination.

From my point of view, making results reproducible remains the first step before tuning them, because that way you can separate cause from effect across many experiments.

However, what I have just learned over the past few days is that the tabular learner can be very prone to overfitting, so I am currently rebuilding the pipeline to redo the experiments w.r.t. the best results from automated validation on clean data excluded from training. Looking at my code base, I have about 10X more LOC on procs and automation than on actual DL…
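
A minimal sketch of the "clean data excluded from training" idea: split the row indices once, with a fixed seed, into train, validation, and hold-out sets, and only score final models on the hold-out set. `three_way_split` and its default percentages are assumptions for illustration. For a time series, a chronological cut is usually preferable to the random shuffle used here.

```python
import random


def three_way_split(n_rows, valid_pct=0.1, holdout_pct=0.1, seed=42):
    """Split row indices into train / validation / hold-out lists with a fixed seed."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed, same split on every run
    n_holdout = int(n_rows * holdout_pct)
    n_valid = int(n_rows * valid_pct)
    holdout = indices[:n_holdout]
    valid = indices[n_holdout:n_holdout + n_valid]
    train = indices[n_holdout + n_valid:]
    return train, valid, holdout


train_idx, valid_idx, holdout_idx = three_way_split(1000)
print(len(train_idx), len(valid_idx), len(holdout_idx))
```

The validation indices can drive model selection during experiments, while the hold-out rows stay untouched until the end.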

Not sure whether it’s an art, science, or just plain engineering PITA.