Using fastai v1 for tabular data

I know this isn’t covered in the course just yet, but I’m trying to use fastai for a project of mine and I’m running into some issues.

I have followed the steps to create a DataBunch:

tfms = [Categorify]
data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var = 'Price',  tfms=tfms, cat_names=cat_vars, bs = 32)
learn = get_tabular_learner(data, layers = [1000,500], emb_szs = {'Date':50, 'Month':6, 'Month_Year':34},
                           metrics = [exp_rmspe], ps = [0.0, 0.0])

But when I try to create the learn object, I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-276-58cac02224e8> in <module>()
      1 learn = get_tabular_learner(data, layers = [1000,500], emb_szs = {'Date':50, 'Month':6, 'Month_Year':34},
----> 2                           metrics = [exp_rmspe], ps = [0.0, 0.0])
      3 
      4 # learn = get_tabular_learner(data, layers = [1000,500])

/usr/local/lib/python3.6/dist-packages/fastai/tabular/data.py in get_tabular_learner(data, layers, emb_szs, metrics, ps, emb_drop, y_range, use_bn, **kwargs)
     95     emb_szs = data.get_emb_szs(ifnone(emb_szs, {}))
     96     model = TabularModel(emb_szs, len(data.cont_names), out_sz=data.c, layers=layers, ps=ps, emb_drop=emb_drop,
---> 97                          y_range=y_range, use_bn=use_bn)
     98     return Learner(data, model, metrics=metrics, **kwargs)
     99 

/usr/local/lib/python3.6/dist-packages/fastai/tabular/models.py in __init__(self, emb_szs, n_cont, out_sz, layers, ps, emb_drop, y_range, use_bn)
     20         layers = []
     21         for i,(n_in,n_out,dp,act) in enumerate(zip(sizes[:-1],sizes[1:],[0.]+ps,actns)):
---> 22             layers += bn_drop_lin(n_in, n_out, bn=use_bn and i!=0, p=dp, actn=act)
     23         self.layers = nn.Sequential(*layers)
     24 

/usr/local/lib/python3.6/dist-packages/fastai/layers.py in bn_drop_lin(n_in, n_out, bn, p, actn)
     31     layers = [nn.BatchNorm1d(n_in)] if bn else []
     32     if p != 0: layers.append(nn.Dropout(p))
---> 33     layers.append(nn.Linear(n_in, n_out))
     34     if actn is not None: layers.append(actn)
     35     return layers

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py in __init__(self, in_features, out_features, bias)
     46         self.in_features = in_features
     47         self.out_features = out_features
---> 48         self.weight = Parameter(torch.Tensor(out_features, in_features))
     49         if bias:
     50             self.bias = Parameter(torch.Tensor(out_features))

TypeError: new() received an invalid combination of arguments - got (NoneType, int), but expected one of:
 * (torch.device device)
 * (torch.Storage storage)
 * (Tensor other)
 * (tuple of ints size, torch.device device)
      didn't match because some of the arguments have invalid types: (!NoneType!, !int!)
 * (object data, torch.device device)
      didn't match because some of the arguments have invalid types: (!NoneType!, !int!)

Am I doing anything wrong, or should I revert to v0.7 for the time being?

@jeremy

Just FYI, best practice is to only @-mention Jeremy if he’s the only one who can answer the question.

I’ve never seen that error before, but it looks like the output feature size is None while the input size is an int, and nn.Linear expected (int, int). So I would investigate potential issues with your data. You could also specify an out_sz and y_range for your data and inspect their values and shapes.
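
For example, a quick sanity check (just a sketch, assuming your DataBunch is still called data as in the first post):

print(data.c)           # this is what get_tabular_learner uses as out_sz; a None here
                        # is exactly the NoneType that reaches nn.Linear in the traceback
print(data.cont_names)  # the continuous columns the model will see

# For a regression target you can also pass an explicit range to the learner,
# e.g. y_range=(0, train_df['Price'].max()) as a keyword to get_tabular_learner.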

Just a random guess, but maybe use metrics=exp_rmspe rather than metrics = [exp_rmspe], because I see an example in the docs without the brackets around the metric:

learn = get_tabular_learner(data, layers=[200,100], emb_szs={'native-country': 10}, metrics=accuracy)

It has something to do with out_sz (the size of the network’s final output) in TabularModel being set to data.c. It seems that if the target is continuous (a regression task rather than classification), data.c is not being set when you call TabularDataBunch. I will dig a bit deeper and see what I can find.

Edit: I managed to get training going by adding a c=1 argument to TabularDataBunch.from_df(). Not sure if this is the long-term solution, but it’s at least a workaround for now.
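
For reference, the call with the workaround looked roughly like this at the time (same tfms and cat_vars as in the first post; c=1 just gets passed through so the model ends up with a single output):

tfms = [Categorify]
data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var='Price',
                                tfms=tfms, cat_names=cat_vars, bs=32,
                                c=1)  # forces a single continuous output (out_sz=1)
learn = get_tabular_learner(data, layers=[1000, 500], metrics=exp_rmspe)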


Ooo, I had the same problem last night. My dependent variable was a column of floats even though it was binary. Adding the dep-var to cat_names or leaving it continuous didn’t change anything: the DataBunch’s task type became Regression and data.c was None.

Changing the dtype of the dataframe’s dep-var column to np.int64 got it treated as classification with data.c equaling 2.
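
In case it helps anyone else, the cast is a one-liner in pandas (assuming dep_var holds the name of your target column and df is your dataframe):

import numpy as np

# Making the binary target an integer column lets fastai infer a classification
# task (data.c == 2) instead of a regression where data.c ends up None.
df[dep_var] = df[dep_var].astype(np.int64)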


I have an aside question on tabular: has anyone seen test-set accuracy drop off a cliff when removing the dependent variable from a DataBunch’s test set? Like 99% → 69%.

That is to say: having your test dataframe include the labels column vs. holding that column out as a separate array.


update:

Kind of a duh moment: it looks like the indices of the 0/1 classes are just encoded in (I’m guessing) descending alphabetical or numeric order.

In other words:

learn.data.train_ds.class2idx gives:
{1: 0, 0: 1}.

So class ‘1’ gets index 0, and class ‘0’ gets index 1.
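
If you need the original labels back from predicted indices, you can invert that mapping yourself (a small sketch; here preds stands for the probability tensor returned by learn.get_preds):

class2idx = learn.data.train_ds.class2idx        # e.g. {1: 0, 0: 1}
idx2class = {v: k for k, v in class2idx.items()}

pred_idx = preds.argmax(dim=1)                   # class indices from the model's outputs
labels = [idx2class[int(i)] for i in pred_idx]   # back to the original 0/1 labels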


update 2:

Never mind. Despite this being the case, if the dependent variable’s column is present in the test set, accuracy is great; when it isn’t, accuracy drops.

This also happens if the learner was trained with a TabularDataBunch that didn’t contain a test set at all. If you set a new .data with a test dataframe containing the dep-var and run predictions, then do the same thing without the dep-var, you get the same result.
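
Roughly what I’m doing, for context (a sketch rather than exact code: build_data is a hypothetical stand-in for however you construct the DataBunch with a test set attached, and I’m assuming learn.data can simply be reassigned before calling get_preds):

# Same learner, two test sets: one where the test dataframe still contains the
# dependent variable, one where that column is dropped first. build_data is a
# stand-in for however you build the TabularDataBunch with a test set.
learn.data = build_data(test_df)                           # dep-var column present
preds_with, _ = learn.get_preds(ds_type=DatasetType.Test)

learn.data = build_data(test_df.drop(columns=[dep_var]))   # dep-var column held out
preds_without, _ = learn.get_preds(ds_type=DatasetType.Test)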

I think what’s going on is that I allowed my dependent variable to be in the list of cat_names…

I just tested this now, making sure the dep-var isn’t in cat_names, and… moderate accuracy. Wow, the model was literally learning to peek at the back of the book.
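
If anyone else trips over this, the fix is just to keep the target out of the feature lists before building the DataBunch (assuming dep_var is your target column name):

# The dependent variable must not appear in either feature list, otherwise the
# model can simply read the answer straight off its input.
cat_names  = [c for c in cat_names  if c != dep_var]
cont_names = [c for c in cont_names if c != dep_var]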

This method worked yesterday, but as of right now it no longer works: setting c = 1 or classes = 1 still gives an output of length len(y_train).

Can you show your code of an example that doesn’t work with c=1? That should work OK.

When I tried the method (Nov 13), the code was as follows:

data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var = 'Log_Price',  tfms=tfms, cat_names=cat_vars, bs = 32, c = 1)

And when I tried to rerun the code on Nov 14, it stopped working. I also tried:

data = TabularDataBunch.from_df(path, concat_clean, dep_var = 'Log_Price', valid_idx=val_idx,  procs=tfms, cat_names=cat_vars, cont_names = cont_vars, bs = 32, classes = 1 )

I do realize the fastai library underwent several changes during that time, though, so it might just have been a version difference.

@maxmatical I am able to run a tabular regression model on fastai 1.0.25.dev0 with the following code:

# df declared with pd.read_csv()

test = TabularList.from_df(df.iloc[:-2000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)

valid_idx = range(len(df)-10000, len(df))

data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(valid_idx)
                           .label_from_df(cols=dep_var, label_cls=FloatList)
                           .add_test(test, label=0)
                           .databunch())

learn = get_tabular_learner(data, layers=[100,100], metrics=exp_rmspe)
learn.fit_one_cycle(8)

You might also want to cast your dependent variable to a float in the dataframe, if it isn’t already. fastai will assume that if you have a single float dependent, you must want a regression model.
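
The cast itself is just pandas, e.g. (assuming dep_var names your target column):

# A single float dependent variable makes fastai treat this as regression.
df[dep_var] = df[dep_var].astype('float64')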


Hello. Has anyone tried fastai tabular on small data (around 5,000 samples in the training set)? What was your experience? Would you recommend trying it for data of that size?