Fastai v2 tabular

muellerzr · October 23, 2019, 7:06pm

Yes. I just noticed that and adjusted

muellerzr · October 23, 2019, 7:39pm

@sgugger would it be possible to maintain dtypes when generating our databunch? This came up in a discussion on my Kaggle kernel here: we’re running into memory errors because columns we want to keep as int8 or int16 are converted to int64 (for cat) and float64 (for cont). I noticed this in the source code too. Are there any plans to adjust this? It is a big memory user.

A comparison was made with memory usage:

Before: 330128 [pandas dataframe]
After: 1550000 [TabularPandas]

Let me know your thoughts

Edit: I checked the _40 nb and it seems you got rid of that hard-coded dtype, if I’m not mistaken? (and so it maintains the type)
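As a rough illustration of why the dtype matters here, the sketch below compares the raw buffer size of the same values stored at different widths. It assumes NumPy is available; the array names and the row count are made up for the example, and it does not reproduce the numbers quoted above.

```python
import numpy as np

n_rows = 1_000_000  # hypothetical row count, not taken from the thread

# Category codes that fit in int8, and the same values upcast to int64
codes_int8 = np.zeros(n_rows, dtype=np.int8)
codes_int64 = codes_int8.astype(np.int64)

# Continuous values stored as float32, and the same values upcast to float64
conts_float32 = np.zeros(n_rows, dtype=np.float32)
conts_float64 = conts_float32.astype(np.float64)

# `nbytes` is the size of the underlying data buffer in bytes
int_ratio = codes_int64.nbytes // codes_int8.nbytes
float_ratio = conts_float64.nbytes // conts_float32.nbytes
print(int_ratio, float_ratio)
```

Upcasting every int8 column to int64 multiplies that column's buffer by eight, so a frame made mostly of small integer codes can grow several times over.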

hhaider5 · October 23, 2019, 9:35pm

“NameError: name ‘TabularPandas’ is not defined”

How do I deal with this?

hhaider5 · October 23, 2019, 9:47pm

Does this mean that we can pass in a test dataset to get predictions, as in a competition?
If so, I would be obliged if you could demonstrate this on any prediction competition.
It would be immensely helpful for me.

muellerzr · October 23, 2019, 9:48pm

If it’s labeled, you can use the labels. Otherwise it operates like the normal test set did back in 1.0 (where there were no labels, as in Kaggle competitions).

For the import, what libraries are you importing before your call to TabularPandas?

hhaider5 · October 23, 2019, 9:51pm

I went through your Kaggle kernel, i.e. https://www.kaggle.com/muellerzr/fastai-v2-starter-code , cloned the GitHub repo, and imported the modules.
I think TabularPandas now gets recognized.

hhaider5 · October 23, 2019, 9:54pm

[This is where I’m stuck](https://colab.research.google.com/drive/1ZVz9fg6g0lTzeqG-lSJDBdTwaOy3DiWU)

I’m going through a Kaggle competition, but I’m stuck here.

muellerzr · October 23, 2019, 9:57pm

At the beginning it looks like you’re still using fastai 1.0, not 2.0 (you’re using TabularList).

hhaider5 · October 23, 2019, 10:00pm

How do I fix it?
I would be obliged if you could edit it.

muellerzr · October 23, 2019, 10:01pm

Look at my notebook on Kaggle or the adults notebook (notebook 40-41) on the fastai dev repo to see how the new API is done.

hhaider5 · October 23, 2019, 10:04pm

Roger that.
Your kernel only goes up to training.
It would be immensely helpful if you could extend your kernel with a small example of predicting on a test set.

muellerzr · October 23, 2019, 10:15pm

See the post I made earlier today on test sets: A Brief Guide to Test Sets in v2 (you can do labelled now too!)

hhaider5 · October 23, 2019, 10:16pm

Roger that

hhaider5 · October 23, 2019, 10:20pm

Here, “learn” refers to the model that has already been trained, and the test set won’t contain the target-label column, right?
One more question: do I have to specify cat_names and cont_names again, even though I already specified them for training, i.e. for the “to” data-bunch?

hhaider5 · October 23, 2019, 10:31pm

It says “AttributeError: ‘Learner’ object has no attribute ‘fit_one_cycle’”


muellerzr · October 29, 2019, 2:13pm

@sgugger how do we use IndexSplitter? I’m trying to walk through Rossmann at the moment. I’m attempting:

splits = IndexSplitter(valid_idx)

But in creating the TabularPandas it will throw an error saying function object is not iterable. So then I tried IndexSplitter(valid_idx)(valid_idx) but that also did not work. Advice?

sgugger · October 29, 2019, 2:17pm

A splitter always takes the items (or something of the same length), so you have to pass

splits = IndexSplitter(valid_idx)(items)

If your items are in a dataframe, you can also just pass a range of the same size

splits = IndexSplitter(valid_idx)(range_of(df))
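To make the two-step call concrete, here is a minimal pure-Python sketch of the pattern. `index_splitter` and this `range_of` are simplified stand-ins written for illustration, not fastai's actual implementation; the point is only that calling the splitter with `valid_idx` returns a function, and calling that function with the items returns the splits.

```python
def range_of(items):
    """Stand-in for fastai's `range_of`: the indices 0 .. len(items) - 1."""
    return list(range(len(items)))

def index_splitter(valid_idx):
    """Simplified stand-in for `IndexSplitter`: returns a function, not the splits."""
    valid = list(valid_idx)
    valid_set = set(valid)
    def _inner(items):
        # Every index that is not in the validation set goes to training
        train = [i for i in range_of(items) if i not in valid_set]
        return train, valid
    return _inner

items = ["a", "b", "c", "d", "e"]

splitter = index_splitter([0, 1])        # step 1: still a function
train_idx, valid_idx = splitter(items)   # step 2: the actual splits

# Only the length of the argument matters, so a range of the same size also works
same_from_range = index_splitter([0, 1])(range_of(items))
print(train_idx, valid_idx)
```

Passing the result of step 1 directly as `splits` would hand over a function, which would explain the "function object is not iterable" error mentioned above.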

muellerzr · October 29, 2019, 2:22pm

Thanks @sgugger! That makes sense. I get a ValueError now:

ValueError: operands could not be broadcast together with shapes (0,) (802943,)

Or I guess the better question is: should I be getting my valid_idx a different way than

cut = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()
valid_idx = range(cut)

sgugger · October 29, 2019, 2:27pm

I can’t help without seeing the full stack trace.

muellerzr · October 29, 2019, 2:27pm

Sure, sorry!

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-1ed41788e5c8> in <module>()
      1 to = TabularPandas(train_df, procs=procs, cat_names=cat_vars, cont_names=cont_vars,
----> 2                    y_names=dep_var, is_y_cat=False, splits=splits)

/usr/local/lib/python3.6/dist-packages/fastai2/tabular/core.py in __init__(self, df, procs, cat_names, cont_names, y_names, is_y_cat, splits, do_setup)
     31     def __init__(self, df, procs=None, cat_names=None, cont_names=None, y_names=None, is_y_cat=True, splits=None, do_setup=True):
     32         if splits is None: splits=[range_of(df)]
---> 33         df = df.iloc[sum(splits, [])].copy()
     34         super().__init__(df)
     35 

ValueError: operands could not be broadcast together with shapes (0,) (802943,)
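One plausible reading of this trace, assuming `splits` holds a NumPy array of indices (this is not confirmed anywhere in the thread): `sum(splits, [])` starts by evaluating `[] + splits[0]`, and adding an empty list to an array is treated as elementwise addition rather than concatenation. The sketch below reproduces that behaviour with small made-up index arrays.

```python
import numpy as np

train_idx = np.arange(3, 10)   # made-up training indices, as a NumPy array
valid_idx = list(range(3))     # made-up validation indices, as a plain list

try:
    # sum() starts from [] and evaluates [] + train_idx, which NumPy
    # handles as elementwise addition of shapes (0,) and (7,)
    sum([train_idx, valid_idx], [])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# With plain lists on both sides, `+` concatenates and sum(..., []) flattens
list_splits = [train_idx.tolist(), valid_idx]
flat = sum(list_splits, [])
print(flat)
```

If that is what is happening, converting each split to a plain list before passing `splits` would be one thing to try.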

The notebook is here