A walk with fastai2 - Tabular - Study Group and Online Lectures Megathread

Your assumption is correct here.

As with most things it depends on many factors, but usually you get very, very close to 100%, perhaps on your validation set. Then you'd need a separate test set to verify. Was the test set too easy? Not representative enough? Then perhaps. Or there could be biases in the model. Short answer: yes, it's possible, but not very likely.

Making two labeled test_dls would be one way to ensure that everything is set up the same. Make sure training is shuffled (it isn't by default), and then override the train and validation DataLoaders. So yes, it can be done.

2 Likes

Awesome, Zachary - very helpful!
One quick question: Do you know if it’s possible to load tabular models from fastai1 in fastai2?

I tried training on the Titanic dataset with Bayesian hyperparameter tuning, and I got around 82% accuracy after 14 epochs (as suggested by the tuning).
I also trained the model using the fastai defaults without changing any parameters and got an accuracy of around 83% after 7 epochs.

So is Bayesian optimization the better way to tune hyperparameters, or is there another method?
@muellerzr

If you use the raw weights and set up the model in exactly the same way, yes, but most of the time learn.save() keeps the optimizer state too, in which case the answer would be no.
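The weight-transfer half of that can be sketched with plain PyTorch; the checkpoint layout below just mimics a save that bundles model and optimizer state, and the model setup is illustrative:

```python
import torch
import torch.nn as nn

# Two identically-configured models, standing in for the fastai1 and fastai2 versions
src = nn.Linear(4, 2)
dst = nn.Linear(4, 2)

# learn.save() typically bundles the optimizer state alongside the weights
ckpt = {"model": src.state_dict(), "opt": {}}
weights = ckpt.get("model", ckpt)   # keep only the raw model weights
dst.load_state_dict(weights)        # works because the architectures match
```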

@muellerzr kindly help me out when you are free.

Thanks

Just so everyone knows it’s a thing here’s the link to part 3, Text which will begin tonight:

1 Like

Anyone know if you’re able to get the index values of a TabularPandas object? Doesn’t seem like you can do the usual .index like you would on a DataFrame.

Edit: And I think I just answered my own question. Doing .items gives you a Pandas DataFrame version of our TabularPandas object, on which you can then call .index. There probably should be an index attribute attached by default so you don't have to do this.

1 Like

Is there a general rule of thumb for how many unique categorical values you should have per column? I have a dataset with many categorical features, each ranging from 10 to several hundred unique values, and I'm obviously trying to cut down on the latter. I have around 19k rows in this dataset.

Some approaches I'm taking to cut down the number of unique values per column are:

  1. Creating new labels that group others together
  2. Re-labeling examples that don’t occur very often simply as “other”
  3. Dropping rows completely if they have multiple “rare” values for several columns

I'm not sure which of these methods (if any) are useful, though I have a feeling that having a couple hundred unique values per column certainly isn't helping my model's accuracy…
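Approach 2 above (re-labeling rare values as "other") can be done in plain pandas before the data ever reaches fastai; the threshold, column name, and toy data below are illustrative:

```python
import pandas as pd

def collapse_rare(df, col, min_count=50):
    """Replace values of `col` occurring fewer than `min_count` times with 'other'."""
    counts = df[col].value_counts()
    rare = counts[counts < min_count].index
    df[col] = df[col].where(~df[col].isin(rare), "other")
    return df

df = pd.DataFrame({"city": ["a"] * 60 + ["b"] * 3 + ["c"] * 2})
df = collapse_rare(df, "city", min_count=50)
print(df["city"].nunique())  # 2 -- "a" survives; "b" and "c" become "other"
```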

Hello everyone, I’ve just gotten started watching the lectures. I’m trying to run the 02_Regression_and_Permutation_Importance notebook in Google Colab. I kept everything as is. However, I get the following error. Anyone know what is going on?

1 Like

Yes, I need to update that notebook, as those should go into a tabular_config :slight_smile:

You can do it like this:
config = tabular_config(ps=[0.001, 0.01], embed_p=0.04)

learn = tabular_learner(dls, layers=[1000,500], config=config, y_range=y_range, metrics=rmse, loss_func=MSELossFlat())

Hi! Thank you for this amazing study group!

Does anybody know how to define which is the positive class in a CategoryBlock? I am getting an encoding where 0 is the positive class, which goes against the standard for binary classification.

Thanks!

Thanks, Mueller, for the amazing videos. Can someone post their work on tabular datasets? That way we can get more examples and more datasets. Please share your work on tabular datasets.

Is it possible to get the encoded data from a dls.test_dl? I’m trying to load and process a test dataset and then get the encoded values for all the columns so I can use that data in non-fastai models (XGBoost, RF, etc.).

When I originally create my DataLoader for training, I can call to.train.xs since I used TabularPandas to feed my original training set into my DataLoader. Is there a way I can access the transforms, apply them, and view the applied results from a dls.test_dl?

Just call dl.xs, I believe (since it shouldn't have a train or valid separation)

dl.xs just shows the non-encoded, original values. Also, it only shows them per index, not for the entire dataset.

After making the dl, call dl.process(). That should encode them all

(Also dl.dataset for the dataset)

There is no .process() for a DataLoader generated from your 'original' dl used in training (i.e., a dl created by calling dls.test_dl). You can do that for your training dl, though. Even after calling .process() and then creating a new test_dl with my test data, the dataset is still unencoded (which makes sense, but I wanted to mention it).

fastai2 version: 0.0.17
fastcore version: 0.1.17

Hi!

Do any of these notebooks contain an example of regression with multiple dependent variables? I am not sure exactly how to proceed.

Thanks!

No, I'm afraid they don't out of the box. TabularPandas doesn't like that very much. My best recommendation would be to use a NumPy DataLoader for tabular instead and work with that. See my article here, @vrodriguezf: https://muellerzr.github.io/fastblog/2020/04/22/TabularNumpy.html