Tabular learner problem

Hi.

After finishing 09_tabular I decided to try it all on Kaggle’s house price prediction dataset.
I did some data cleaning (no NaNs, fixed some outliers, etc., mostly for practice) and assembled a model (after going through 09_tabular, the tabular tutorial, and the tabular docs on fastai).
This is the model:

procs = [Categorify, FillMissing, Normalize]
cat_names = x.select_dtypes('category').columns.tolist()
cont_names = x.select_dtypes('number').columns.tolist()
cont_names.remove('SalePrice')  # the target must not stay in cont_names
splits = RandomSplitter(valid_pct=0.2)(range_of(x))

to = TabularPandas(x,
                   procs=procs,
                   cat_names=cat_names,
                   cont_names=cont_names,
                   y_names='SalePrice',
                   splits=splits)

dls = to.dataloaders(bs=32)

learn = tabular_learner(dls, metrics=mse)

Something is off and I’m getting errors here and there when I try to move forward.

  1. When I try lr_find(), the suggested values are all over the place: 0.02 on one “run all” (just to make sure I didn’t overwrite anything) and 2e-8 on another.
  2. fit_one_cycle seems to be working (losses and mse are decreasing) and I was hopeful at first, but then I ran into more errors.
  3. learn.show_results(), run immediately after training, throws an error:
    ValueError: Wrong number of items passed 2, placement implies 1
    I didn’t do anything differently from the tutorial here:
    https://docs.fast.ai/tutorial.tabular.html
  4. When I tried to predict values for another df, I got a “missing value” error. I checked, and I had indeed missed a NaN, but isn’t procs supposed to take care of that?
  5. When I try to predict on a new dataset (prepped the same way as the train set: same number of columns, matching column names, and identical nunique/unique values) with:
    dl = learn.dls.test_dl(test_df)
    learn.get_preds(dl=dl)
    I get an IndexError: index out of range in self. I tried a few things from SO and checked the fastai forums
    ([Solved] Problem with tabular_learner.predict() on a single row)
    but couldn’t make it work in my case.
    Predicting a single value with learn.predict(x_test_proxy.loc[0, :])[2] worked, but the result looks strange. The docs say the 2nd index is the decoded value, yet I’m getting values between -1 and 1, while the target is in the hundreds of thousands. Could that be the problem? And why the negative values?
    5.5. Also, the tabular tutorial says:
    To get prediction on a new dataframe, you can use the test_dl method of the [DataLoaders](https://docs.fast.ai/data.core.html#DataLoaders). That dataframe does not need to have the dependent variable in its column.
    That’s fine, but if I don’t have a target column (‘SalePrice’ in my case; why would I? I’m trying to predict it), it throws an error that there is no ‘SalePrice’ column. I had to manually create a column of zeros in the test set. Maybe it tries to drop the target and can’t find it?
  6. Am I right to assume that fastai takes the cat features, turns them into entity embeddings (EE) with a simple NN, and feeds them together with the cont features into another NN to predict values?
    6.5. Am I right to assume that, when predicting new, unseen test values, the model should transform them the same way it did the training data, predict, and then transform the predictions back?
  7. At some point I was just going to look for PyTorch examples of EE (there are a lot for Keras, but I want to study torch), and while I found a few, the code was very difficult to read and understand at my level. I mean, when you think about how EE works, it doesn’t sound all that difficult, so there should be an ‘easy’ implementation; but no, there are multiple custom classes involved, 50 lines of code each.
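For reference, here is the kind of minimal PyTorch EE sketch I had in mind. Everything in it is made up for illustration (the class name, cardinalities, embedding widths, and layer sizes are not from the house dataset or any library):

```python
import torch
import torch.nn as nn

class EntityEmbeddingNet(nn.Module):
    """One nn.Embedding per categorical feature; the embedding vectors are
    concatenated with the continuous features and fed to a small MLP."""
    def __init__(self, cardinalities, emb_szs, n_cont, hidden=100):
        super().__init__()
        self.embeds = nn.ModuleList(
            nn.Embedding(card, sz) for card, sz in zip(cardinalities, emb_szs))
        self.mlp = nn.Sequential(
            nn.Linear(sum(emb_szs) + n_cont, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_cat) integer codes; x_cont: (batch, n_cont) floats
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        return self.mlp(torch.cat(embs + [x_cont], dim=1))

# made-up example: two cat features with 10 and 4 levels, five cont features
model = EntityEmbeddingNet(cardinalities=[10, 4], emb_szs=[6, 3], n_cont=5)
out = model(torch.randint(0, 4, (8, 2)), torch.randn(8, 5))  # shape (8, 1)
```

The point being: the embeddings are just nn.Embedding layers trained jointly with the rest of the network by backprop, not a separate pre-step.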

I could really use some docs, but as far as tabular goes there was very little info, unless I was looking in the wrong place.

Sorry for the wall of text, but I’ve been fighting these issues for a few days now and decided it was time to ask for help. Any help will be much appreciated.

Thanks.

I managed to fix most of the problems, mostly by experimenting.
But still, maybe someone could give a brief comment on 6 (slightly modified)?

I really like the idea of embedding cat features. The whole ‘king is to queen as boy is to girl’ example looks like magic, and I like magic. So I started to read and watch more on the topic. But back to the question:
Basically, we can use fastai’s TabularPandas to convert cat features to EE. Am I right to assume that fastai takes the cat features, embeds them with a simple NN, and rearranges the df to reflect that? When exactly does the learning of the embeddings happen?
If I fetch to.xs after TabularPandas, it looks like the procs were applied, but the number of columns is the same. There is also no place (at least in the tutorial) to set the length of the output embedding vectors. Is some heuristic used by default? Since the number of columns before and after TabularPandas was the same, I assumed the cat features had not yet been embedded (only mapped to integer codes).
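(In case it helps future readers: from what I can tell reading the fastai source, the default embedding width does come from a cardinality-based rule of thumb, exposed as emb_sz_rule / get_emb_sz in fastai.tabular. Re-implemented here in plain Python so it can be checked without fastai installed; treat it as my reading of the source, not gospel:)

```python
def emb_sz_rule(n_cat: int) -> int:
    # fastai's (v2) default heuristic for choosing an embedding width
    # from a categorical feature's cardinality, capped at 600
    return min(600, round(1.6 * n_cat ** 0.56))

# low-cardinality features get short vectors, high-cardinality ones wider:
sizes = {n: emb_sz_rule(n) for n in (2, 5, 25, 1000)}
# e.g. 2 -> 2, 5 -> 4, 25 -> 10, 1000 -> 77
```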

I would really like to use fastai to embed cat values (among other things) and then feed the resulting df to different models.
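In case it’s useful to anyone trying the same thing: once a model is trained, each category’s vector is just a row of the corresponding embedding layer’s weight matrix (in fastai’s TabularModel the layers sit in learn.model.embeds, if I’m reading the source right), so you can map the Categorify codes to vectors yourself. A sketch with a plain, untrained nn.Embedding standing in for a trained layer; the column name and sizes are made up:

```python
import torch
import torch.nn as nn
import pandas as pd

# stand-in for one trained embedding layer (cardinality 4, width 3)
emb = nn.Embedding(4, 3)

# integer codes for one categorical column, as Categorify would produce
codes = pd.Series([0, 2, 1, 2], name='Zoning')

with torch.no_grad():
    vecs = emb(torch.tensor(codes.values))  # (4 rows, 3 dims)

# expand the single cat column into 3 numeric columns for another model
emb_df = pd.DataFrame(
    vecs.numpy(),
    columns=[f'Zoning_emb{i}' for i in range(vecs.shape[1])])
```

The resulting emb_df can then be concatenated with the cont features and handed to any other model (random forest, GBM, etc.).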

The problem was a different number of classes for some of the cat features in train and test. It was a pain to figure out and work around.
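Concretely, what worked for me was forcing the same set of levels on both frames before building TabularPandas, so the categories line up everywhere. A minimal pandas sketch (the column name and levels are made up):

```python
import pandas as pd

train = pd.DataFrame({'Zoning': ['RL', 'RM', 'RL']})
test = pd.DataFrame({'Zoning': ['RL', 'C']})  # 'C' never appears in train

# build one categorical dtype from the union of levels so the
# integer codes are consistent between train and test
all_levels = sorted(set(train['Zoning']) | set(test['Zoning']))
zoning_dtype = pd.CategoricalDtype(categories=all_levels)

train['Zoning'] = train['Zoning'].astype(zoning_dtype)
test['Zoning'] = test['Zoning'].astype(zoning_dtype)
```

After this, no value in the test set maps to an unknown code, which was exactly what blew up get_preds for me.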

I have written a rather long post on SO for any future readers:


Isn’t that a bug? I have the same problem: my train dataloader does not contain examples of every possible label, so the test set gets categorized differently. That seems like an error to me.