Hi.
After finishing 09_tabular, I decided to try it all out on Kaggle’s house price prediction dataset.
I did some data cleaning (removed NaNs, fixed some outliers, etc., mostly for practice) and assembled a model (after going through 09_tabular, the tabular tutorial, and the tabular docs on fastai).
This is the model:
from fastai.tabular.all import *

# x is the cleaned training DataFrame, including the 'SalePrice' column
procs = [Categorify, FillMissing, Normalize]
cat_names = x.select_dtypes('category').columns.tolist()
cont_names = x.select_dtypes('number').columns.tolist()
# Hold out a random 20% of the rows for validation
splits = RandomSplitter(valid_pct=0.2)(range_of(x))
to = TabularPandas(x,
                   procs=procs,
                   cat_names=cat_names,
                   cont_names=cont_names,
                   y_names='SalePrice',
                   splits=splits)
dls = to.dataloaders(bs=32)
learn = tabular_learner(dls, metrics=mse)
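One thing that may be worth checking in the setup above: `select_dtypes('number')` returns every numeric column, so if `x` still contains `SalePrice`, the target also ends up in `cont_names`. The sketch below uses a tiny made-up frame (the column names other than `SalePrice` and all the values are placeholders, not the real data) to show the check and one way to exclude the target. Whether this is behind any of the errors described below is an assumption, not something confirmed here.

```python
import pandas as pd

# Toy stand-in for the cleaned training frame `x`; all values are made up.
x = pd.DataFrame({
    "Neighborhood": pd.Series(["A", "B", "A", "C"], dtype="category"),
    "LotArea": [8450, 9600, 11250, 9550],
    "SalePrice": [208500, 181500, 223500, 140000],
})
dep_var = "SalePrice"

cat_names = x.select_dtypes("category").columns.tolist()
cont_names = x.select_dtypes("number").columns.tolist()

# The numeric selector does not know which column is the target,
# so the target shows up among the continuous features.
target_leaked = dep_var in cont_names

# One way to keep the target out of the feature lists.
cont_names = [c for c in cont_names if c != dep_var]
```

After the last line, `cont_names` holds only the feature columns, and `dep_var` can be passed separately as `y_names`.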
Something is off and I’m getting errors here and there when I try to move forward.
1. When I try lr_find(), the suggested values vary wildly between runs: 0.02 on one “run all” (just to make sure I hadn’t overwritten anything) and 2e-8 on the next.
2. `fit_one_cycle` seems to be giving me something (losses and mse are decreasing) and I was hopeful at first, but then I hit more errors.
3. `learn.show_results()` run immediately after training throws an error:
       ValueError: Wrong number of items passed 2, placement implies 1
   I called it with no arguments, just like the tutorial here: https://docs.fast.ai/tutorial.tabular.html
4. When I tried to predict values for another df, I got a “missing value” error. I checked, and I had indeed missed a NaN, but isn’t `procs` supposed to take care of that?
5. When I try to predict on a new dataset (prepped the same way as the train set: same number of columns, matching column names, same nunique and unique values) with:
       dl = learn.dls.test_dl(test_df)
       learn.get_preds(dl=dl)
   I get an IndexError: index out of range in self
   I tried a few things from SO and I checked the fastai forums ([Solved] Problem with tabular_learner.predict() on a single row) but couldn’t make it work in my case.
   Predicting a single value with learn.predict(x_test_proxy.loc[0, :])[2] worked, but the result looks strange. The docs say the 2nd index is the decoded value, but I’m getting values between -1 and 1, while the prices are in the hundreds of thousands. Could it be a probability? But why, and why would it be negative?
5.5. Also, you write in the tabular tutorial: “To get prediction on a new dataframe, you can use the `test_dl` method of the [`DataLoaders`](https://docs.fast.ai/data.core.html#DataLoaders). That dataframe does not need to have the dependent variable in its column.”
   That’s fine, but if I don’t have the target column (‘SalePrice’ in my case; why would I, when I need to predict it?), it throws an error saying that I don’t have a ‘SalePrice’ column. I had to manually create a column of zeros in the test set. Maybe it tries to `drop` it and can’t find it?
6. Am I right to assume that fastai will take the categorical features, learn entity embeddings (EE) for them with a simple NN, and use them together with the continuous features in another NN to predict values?
6.5. Am I right to assume that, when predicting new, unseen test values, the model should transform them the way it did for train, predict, and then transform the predicted values back?
7. At some point I was just going to look for PyTorch examples of EE (there are a lot in Keras, but I want to study torch), and while I found a few (I think), the code was very difficult to read and understand at my level. When you think about how EE works, it doesn’t sound all that difficult, so there should be an ‘easy’ implementation, but no, there are multiple custom classes involved with 50 lines of code each.
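For the entity embedding question, the general idea can be sketched in a few lines of PyTorch. This is not fastai’s implementation; the class name `TinyTabularNet`, the layer sizes, and the inputs are all made up for illustration. In this sketch each categorical column gets its own `nn.Embedding`, the embedded vectors are concatenated with the continuous features, and everything is trained together as one network.

```python
import torch
from torch import nn


class TinyTabularNet(nn.Module):
    """Hypothetical minimal model: one embedding table per categorical column."""

    def __init__(self, cardinalities, n_cont, emb_dim=4, hidden=16):
        super().__init__()
        # One lookup table per categorical column; each maps an integer
        # code in [0, cardinality) to a learned vector of size emb_dim.
        self.embeds = nn.ModuleList(
            [nn.Embedding(card, emb_dim) for card in cardinalities]
        )
        n_in = emb_dim * len(cardinalities) + n_cont
        self.layers = nn.Sequential(
            nn.Linear(n_in, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_cat) integer codes; x_cont: (batch, n_cont) floats.
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        features = torch.cat(embedded + [x_cont], dim=1)
        return self.layers(features)


# Two categorical columns with 5 and 3 distinct codes, two continuous columns.
model = TinyTabularNet(cardinalities=[5, 3], n_cont=2)
x_cat = torch.tensor([[0, 2], [4, 1], [3, 0]])
x_cont = torch.randn(3, 2)
preds = model(x_cat, x_cont)  # one regression output per row
```

With a table built as `nn.Embedding(5, 4)`, looking up a code of 5 or more fails on CPU with an `IndexError`, so the integer codes in new data have to stay inside the range the table was built for.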
I would happily read some docs, but as far as tabular goes, I found very little information, unless I was looking in the wrong place.
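On the questions above about predictions landing between -1 and 1 and about transforming unseen data the same way as the training data: values in that range, including negative ones, are what a standardized quantity looks like. The following is a minimal sketch of that round trip, assuming (nothing in the output above confirms this) that the quantity was standardized with the training mean and standard deviation. The prices and the helper names `standardize` and `destandardize` are made up for illustration.

```python
from statistics import mean, pstdev

# Made-up training targets on the original price scale.
train_prices = [208500.0, 181500.0, 223500.0, 140000.0, 250000.0]

# Statistics are computed once on the training data and then reused
# unchanged for any new data.
mu = mean(train_prices)
sigma = pstdev(train_prices)


def standardize(value):
    """Map a price to the scale the model would see."""
    return (value - mu) / sigma


def destandardize(z):
    """Map a standardized value back to the original price scale."""
    return z * sigma + mu


scaled = [standardize(p) for p in train_prices]
restored = [destandardize(z) for z in scaled]
```

Here `scaled` is centered on zero, with negative values for prices below the training mean, and `restored` recovers the original prices.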
Sorry for the wall of text, but I’ve been fighting these issues for a few days now and decided it was time to ask for help. Any help will be much appreciated.
Thanks.