A walk with fastai2 - Tabular - Study Group and Online Lectures Megathread

Your assumption is correct here.

As with most things it depends on many factors, but usually you get very, very close to 100%, perhaps on your validation set. Then you'd need a separate test set to verify. Was the test set too easy? Not representative enough? Then perhaps. Or there could be biases in the model. Short answer: yes, it's possible, but not very likely.

Making two labeled test_dls would be one way to ensure that everything is set up the same. Make sure training is shuffled (it isn't by default), and then override the train and validation DataLoaders. So yes, it can be done.

2 Likes

Awesome, Zachary - very helpful!
One quick question: Do you know if it’s possible to load tabular models from fastai1 in fastai2?

I tried training on the Titanic dataset with Bayesian hyperparameter tuning, and I got around 82% accuracy after 14 epochs (as suggested by the tuning).
I also trained the model using the fastai defaults without changing any parameters and got an accuracy of around 83% after 7 epochs.

So is Bayesian optimization the better way to tune hyperparameters, or is there another method?
@muellerzr

If you use the raw weights and set up the model in exactly the same way, yes, but most of the time learn.save() keeps the optimizer state too, in which case the answer would be no.
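The weight-transfer half of that can be sketched with plain PyTorch; the checkpoint layout below just mimics a save that bundles model and optimizer state, and the model setup is illustrative:

```python
import torch
import torch.nn as nn

# Two identically-configured models, standing in for the fastai1 and fastai2 versions
src = nn.Linear(4, 2)
dst = nn.Linear(4, 2)

# learn.save() typically bundles the optimizer state alongside the weights
ckpt = {"model": src.state_dict(), "opt": {}}
weights = ckpt.get("model", ckpt)   # keep only the raw model weights
dst.load_state_dict(weights)        # works because the architectures match
```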

@muellerzr kindly help me out when you are free.

Thanks

Just so everyone knows it’s a thing here’s the link to part 3, Text which will begin tonight:

1 Like

Anyone know if you’re able to get the index values of a TabularPandas object? Doesn’t seem like you can do the usual .index like you would on a DataFrame.

Edit: And I think I just answered my own question. Doing .items gives you a Pandas DataFrame version of our TabularPandas object, on which you can then call .index. There probably should be an index attribute attached by default so you don't have to do this.

1 Like

Is there a general rule of thumb for how many unique categorical values you should have per column? I have a dataset with many categorical features, each ranging from 10 to several hundred unique values, and I'm obviously trying to cut down on the latter. I have around 19k rows in this dataset.

Some approaches I'm taking to cut down the number of unique values per column are:

  1. Creating new labels that group others together
  2. Re-labeling examples that don’t occur very often simply as “other”
  3. Dropping rows completely if they have multiple “rare” values for several columns

I'm not sure which of these methods (if any) are useful, though I have a feeling that having a couple hundred unique values per column certainly isn't helping my model's accuracy…
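Approach 2 above (re-labeling rare values as "other") can be done in plain pandas before the data ever reaches fastai; the threshold, column name, and toy data below are illustrative:

```python
import pandas as pd

def collapse_rare(df, col, min_count=50):
    """Replace values of `col` occurring fewer than `min_count` times with 'other'."""
    counts = df[col].value_counts()
    rare = counts[counts < min_count].index
    df[col] = df[col].where(~df[col].isin(rare), "other")
    return df

df = pd.DataFrame({"city": ["a"] * 60 + ["b"] * 3 + ["c"] * 2})
df = collapse_rare(df, "city", min_count=50)
print(df["city"].nunique())  # 2 -- "a" survives; "b" and "c" become "other"
```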

Hello everyone, I’ve just gotten started watching the lectures. I’m trying to run the 02_Regression_and_Permutation_Importance notebook in Google Colab. I kept everything as is. However, I get the following error. Anyone know what is going on?

1 Like

Yes, I need to update that notebook, as those should go into a tabular_config :slight_smile:

You can do it like this:
config = tabular_config(ps=[0.001, 0.01], embed_p=0.04)

learn = tabular_learner(dls, layers=[1000,500], config=config, y_range=y_range, metrics=rmse, loss_func=MSELossFlat())

Hi! Thank you for this amazing study group!

Does anybody know how to define which is the positive class in a CategoryBlock? I am getting an encoding where 0 is the positive class, which goes against the standard for binary classification.

Thanks!

Thanks, Mueller, for the amazing videos. Can someone post their work on tabular datasets? That way we can get more examples and more datasets. Please share your work on tabular datasets.

Is it possible to get the encoded data from a dls.test_dl? I’m trying to load and process a test dataset and then get the encoded values for all the columns so I can use that data in non-fastai models (XGBoost, RF, etc.).

When I originally create my DataLoader for training, I can call to.train.xs since I used TabularPandas to feed my original training set into my DataLoader. Is there a way I can access the transforms, apply them, and view the applied results from a dls.test_dl?

Just call dl.xs, I believe (since it shouldn't have a train or valid separation)

dl.xs just shows the non-encoded, original values. Also, it only shows them per index, not for the entire dataset.

After making the dl, call dl.process(). That should encode them all

(Also dl.dataset for the dataset)

There is no .process() for a DataLoader generated from your 'original' dl used in training (i.e., a dl created by calling dls.test_dl). You can do that for your training dl, though. Even after calling .process() and then creating a new test_dl with my test data, the dataset is still unencoded (which makes sense, but I wanted to mention it).

fastai2 version: 0.0.17
fastcore version: 0.1.17

Hi!

Do any of these notebooks contain an example of regression with multiple dependent variables? I am not sure exactly how to proceed.

Thanks!

No, I'm afraid they don't out of the box. TabularPandas doesn't like that very much. My best recommendation would be to use a NumPy DataLoader for tabular instead and work with that. See my article here, @vrodriguezf: https://muellerzr.github.io/fastblog/2020/04/22/TabularNumpy.html