A walk with fastai2 - Tabular - Study Group and Online Lectures Megathread

AjayStark · April 3, 2020, 9:54am

Yeah, that’s what I should do, instead of looking at the source and feeling “wow, this is above my head”.
Thanks, I’ll do that.

AjayStark · April 5, 2020, 4:10pm

Hi @muellerzr
Is the Inception model only for categorical tabular models?
How should I deal with a time series regression problem?

Thanks,

shyampagadi · April 5, 2020, 4:59pm

Hi,
How can we add a test file and predict on unlabeled data?

Thanks.

muellerzr · April 5, 2020, 5:02pm

@shyampagadi I tried to include many examples of this. See the regression notebook.

shyampagadi · April 5, 2020, 5:12pm

Thank you for your quick response, and thank you for these wonderful videos. I am learning a lot from them.

muellerzr · April 5, 2020, 5:24pm

I am unfamiliar with the time series library; I only brought it in to show an example. You should use the time-series megathread for questions related to it.

AjayStark · April 5, 2020, 5:44pm

Sure, thank you.

tabularguy · April 6, 2020, 8:08am

I want to train on new data with the model already in memory.

Say I do something like:
train_to = to.new(train_df2)
train_to.process()

train_dbunch = train_to.dataloaders(bs=512, path=path)
learn.dls = train_dbunch

If I call learn.fit_one_cycle(1), would this train on the new TabularPandas instead of the initial one I used when I created the learner? I am having trouble telling whether what I am doing actually trains the old model on new data.

Srinivas · April 6, 2020, 6:10pm

Gloria,
If you watch the last lecture of Jeremy’s ML course (lecture 12) starting at 33:00:00, he explains his choice of embeddings for the categories in the Rossmann competition, the general principles he uses for mapping categorical variables into embeddings, and why this is useful.

His actual formula for the choice of emb_szs is now (in fastai2) different from what it was in that video (IIRC), but the principles are the same. Also, the prior lecture (lecture 11) starting at 1:17:35 explains the feature engineering of the Rossmann data in detail.

muellerzr · April 6, 2020, 6:12pm

Correct, this was also changed in v1 after further testing (so they’re both the same).

Srinivas · April 7, 2020, 2:27am

Just to confirm my understanding, the adults dataset actually has 6 cat variables (workclass, education, marital-status, occupation, relationship, race) but when we fix_na for the 3 cont_vars we create 3 cols of booleans indicating where there is a na and these three booleans create the 7th cat variable hence there are 7 emb_szs as in [(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (3, 3)]. Is this right?

Srinivas · April 7, 2020, 3:04am

Only education-num of the cont vars has na - so I guess that is the 7th cat variable with T, F and na value of its own hence the (3, 3) for the 7th value in the emb_szs right?
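For reference, the emb_szs quoted above are consistent with the rule of thumb fastai2 uses by default, as I recall it from the library source: `min(600, round(1.6 * n_cat ** 0.56))`. The sketch below is a plain-Python restatement, not the library code itself, so treat the exact constants as an assumption and check them against your installed version. The cardinalities are the ones from the post above.

```python
def emb_sz_rule(n_cat: int) -> int:
    """Embedding width for a categorical column with n_cat levels.

    Assumed to match fastai2's default heuristic; verify against your install.
    """
    return min(600, round(1.6 * n_cat ** 0.56))


# Number of levels per categorical column, as quoted in the post above.
cardinalities = [10, 17, 8, 16, 7, 6, 3]

# Pair each cardinality with the width the rule suggests for it.
emb_szs = [(n, emb_sz_rule(n)) for n in cardinalities]
print(emb_szs)
# [(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (3, 3)]
```

The last pair, (3, 3), is the one the boolean missing-value column would produce if it has three levels.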

Srinivas · April 7, 2020, 3:21am

Q: How do we get 42 in this cell?
matrix = create_explain_matrix(tot, cat_dims, cat_idxs, 42)
Is it the sum of the emb_szs (6+8+5+8+5+4+3 = 39) plus the 3 cont vars?

muellerzr · April 7, 2020, 3:21am

Yes. I will answer the rest of your questions in the morning. I know the documentation was not 100% clear when I wrote that; I kinda figured it out, then chucked it in. I’ll refactor and verbosify it a bit.
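The arithmetic behind the 42 can be checked in a couple of lines. This is only a sketch: `n_cont`, `emb_total` and `n_inputs` are names made up for illustration, and it assumes the 42 is the width of the model input after embedding (all embedding widths plus the continuous columns), as the question suggests.

```python
# (cardinality, embedding width) pairs quoted earlier in the thread.
emb_szs = [(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (3, 3)]
n_cont = 3  # number of continuous variables mentioned in the question

# Total width contributed by the embeddings.
emb_total = sum(width for _, width in emb_szs)

# Embedded categorical features plus one input per continuous variable.
n_inputs = emb_total + n_cont
print(emb_total, n_inputs)  # 39 42
```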

Srinivas · April 7, 2020, 3:23am

Thx. Gnight.

AjayStark · April 7, 2020, 8:10am

Hi, is it possible for a tabular model to reach 100% accuracy?
Maybe the validation set is so simple that the model gets it right every time, or maybe the relation between the inputs and outputs is very simple. In such cases, is 100% accuracy possible and legitimate?

tabularguy · April 7, 2020, 8:36am

@muellerzr

How can you update the data loaded into the learner to train on new data? Do you directly overwrite the old dls object in the learner? I am doing this without using .copy(). I am not sure if the model is actually training on new data or not. I am assuming that categorical variables that are not NA but were not shown to the embeddings at the first model creation are getting mapped to the NA embedding and updated that way?

muellerzr · April 8, 2020, 2:52am

Your assumption is correct here.
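A minimal sketch of that behaviour, in plain Python rather than fastai code: the vocabulary is fixed when the model is first created, index 0 is reserved for the `#na#` placeholder, and any category not in that vocabulary falls back to index 0. The helper names (`build_vocab`, `encode`) and the category values are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, List

NA_TOKEN = "#na#"  # placeholder level kept at index 0


def build_vocab(train_values: Iterable[str]) -> Dict[str, int]:
    """Map each category seen at training time to an index; 0 is reserved for NA_TOKEN."""
    levels = sorted(set(train_values))
    return {NA_TOKEN: 0, **{v: i + 1 for i, v in enumerate(levels)}}


def encode(values: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Encode values with a fixed vocab; unseen categories fall back to index 0."""
    return [vocab.get(v, vocab[NA_TOKEN]) for v in values]


# Vocabulary built from the original training data (hypothetical values).
vocab = build_vocab(["Private", "State-gov", "Private", "Self-emp"])

# New data containing a category the vocabulary has never seen.
codes = encode(["Private", "Federal-gov", "Self-emp"], vocab)
print(codes)  # [1, 0, 2]
```

Because the unseen value shares row 0 of the embedding matrix with genuinely missing values, further training on the new data updates that shared row rather than learning a dedicated embedding for the new category.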

As with most things, it depends on many factors, but usually you get very, very close to 100%, perhaps on your validation set. Then you’d need a separate test set to verify. Was the test set too easy? Not representative enough? Then perhaps. Or perhaps there could be biases in the model. Short answer: yes, it’s possible, but not very likely.

Making two labeled test_dl would be one way to ensure that everything is set up the same. Make sure the training one is shuffled (it is not by default), and then override the train and validation DataLoader. So yes, it can be done.

alexbonde · April 8, 2020, 6:44am

Awesome Zachary - very helpful!
One quick question: Do you know if it’s possible to load tabular models from fastai1 in fastai2?

AjayStark · April 8, 2020, 6:53am

I tried training on the Titanic dataset with Bayesian hyperparameter tuning, and I got around 82% accuracy after 14 epochs (as suggested by the tuning).
I also trained the model using the fastai defaults, without changing any parameters, and got an accuracy of around 83% after 7 epochs.

So is Bayesian optimization the better way to tune hyperparameters, or is there another method?
@muellerzr