Tabular regression issue

Hi, I’m running into odd problems while testing a tabular learner on the House Prices dataset on Kaggle.

The dataset is a mix of continuous and categorical columns, and the target variable is continuous. The first problem involves the learner’s metrics: training crashes after the first epoch. If I remove all metrics it doesn’t crash, but the results don’t make sense. Here’s what I’m working with:

# Making sure all categorical columns are strings
# and all continuous columns are floats
for col in cats:
    df[col] = df[col].astype('str')

for col in conts:
    df[col] = df[col].astype('float32')

# Making the standard databunch
dep_var = 'SalePrice'
procs = [FillMissing, Categorify, Normalize]

data = (TabularList.from_df(df_train[:-100], path=path, cat_names=cats, cont_names=conts, procs=procs)
                           .split_by_rand_pct()
                           .label_from_df(cols=dep_var, label_cls=FloatList)
                           .databunch())

data.show_batch(rows=10)
# Everything looks good. Categorical columns are strings
# and continuous ones have been converted

Here’s where I get the first error, after training for one epoch:

learn = tabular_learner(data, layers=[200,100], metrics=RMSE)
learn.fit_one_cycle(3, 1e-3)

This is the error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-261-33f5129af380> in <module>
----> 1 learn.fit_one_cycle(3, 1e-3)

/opt/conda/lib/python3.6/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     21                                        final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23 
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, wd:float=None):

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    198         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
    199         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 200         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    201 
    202     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
    104             if not cb_handler.skip_validate and not learn.data.empty_val:
    105                 val_loss = validate(learn.model, learn.data.valid_dl, loss_func=learn.loss_func,
--> 106                                        cb_handler=cb_handler, pbar=pbar)
    107             else: val_loss=None
    108             if cb_handler.on_epoch_end(val_loss): break

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     61             if not is_listy(yb): yb = [yb]
     62             nums.append(first_el(yb).shape[0])
---> 63             if cb_handler and cb_handler.on_batch_end(val_losses[-1]): break
     64             if n_batch and (len(nums)>=n_batch): break
     65         nums = np.array(nums, dtype=np.float32)

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in on_batch_end(self, loss)
    306         "Handle end of processing one batch with `loss`."
    307         self.state_dict['last_loss'] = loss
--> 308         self('batch_end', call_mets = not self.state_dict['train'])
    309         if self.state_dict['train']:
    310             self.state_dict['iteration'] += 1

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    248         "Call through to all of the `CallbakHandler` functions."
    249         if call_mets:
--> 250             for met in self.metrics: self._call_and_update(met, cb_name, **kwargs)
    251         for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
    252 

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in _call_and_update(self, cb, cb_name, **kwargs)
    239     def _call_and_update(self, cb, cb_name, **kwargs)->None:
    240         "Call `cb_name` on `cb` and update the inner state."
--> 241         new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
    242         for k,v in new.items():
    243             if k not in self.state_dict:

/opt/conda/lib/python3.6/site-packages/fastai/callback.py in on_batch_end(self, last_output, last_target, **kwargs)
    342         if not is_listy(last_target): last_target=[last_target]
    343         self.count += first_el(last_target).size(0)
--> 344         val = self.func(last_output, *last_target)
    345         if self.world:
    346             val = val.clone()

TypeError: object() takes no parameters

So if I just remove the metrics from the learner, it no longer crashes after an epoch, but the results are bogus:

learn = tabular_learner(data, layers=[200,100])
learn.fit_one_cycle(3, 1e-3)

epoch  train_loss          valid_loss          time
0      38677737472.000000  41577746432.000000  00:01
1      38573531136.000000  41577512960.000000  00:00
2      38618013696.000000  41577345024.000000  00:00

# Add test data and predict
learn.data.add_test(TabularList.from_df(df_train[-100:], path=path, cat_names=cats, cont_names=conts, procs=procs))
pred, y = learn.get_preds(ds_type=DatasetType.Test)
print(y)

[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, …]

Any idea where I’m screwing up?

For the results, you want to do something like the following to preds, not y. The ys are labels, and the test set is never labeled, so they’re all set to zero.
This is from the Rossmann notebook:

test_preds = learn.get_preds(DatasetType.Test)
test_df["Sales"] = np.exp(test_preds[0].data).numpy().T[0]
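
For House Prices that would be SalePrice rather than Sales. A hypothetical adaptation (assuming, as in your code above, the labels aren’t log-transformed yet, so no np.exp):

# Hypothetical adaptation for House Prices: get_preds on the test set
# returns (predictions, zero labels); take the prediction tensor.
test_preds = learn.get_preds(DatasetType.Test)
sale_price = test_preds[0].numpy().T[0]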

On the metrics, I’m not 100% sure. I don’t have experience with this dataset yet.
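
One guess, though: the traceback dies in val = self.func(last_output, *last_target), which is fastai wrapping the metric in AverageMetric and calling it like a plain function. TypeError: object() takes no parameters usually means a bare class was passed where a function (or an instance) was expected. A minimal sketch of the usual fix, assuming fastai v1 and that you meant the built-in RMSE metric:

# Sketch: pass the metric *function* from fastai.metrics rather than
# an uninstantiated class, so AverageMetric can call it on each batch.
from fastai.metrics import root_mean_squared_error

learn = tabular_learner(data, layers=[200,100], metrics=root_mean_squared_error)
learn.fit_one_cycle(3, 1e-3)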


Ah thanks. The results make a tad more sense but are still way off. I just now realized this dataset only has 1460 observations. Perhaps that’s too few for a neural network to produce meaningful results.

I don’t believe so; you could always oversample the training set. I’m not sure about others’ experience with it, though. I’ll try to play around with it today or tomorrow and see how close I can get.
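
A minimal sketch of what I mean by oversampling, with plain pandas and a made-up target size:

# Hypothetical sketch: enlarge a small training set by sampling rows
# with replacement before building the DataBunch; n=5000 is arbitrary.
df_train_big = df_train[:-100].sample(n=5000, replace=True, random_state=42)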


Very helpful, thanks a lot. I couldn’t find any similar fastai kernels in Kaggle’s public kernel list for this competition.

Ah really now? Well that’s an easy fix :wink: I’ll update when that’s available.

And no problem! :slight_smile:


@eljas1 here is my very basic kernel. I haven’t done any data manipulation, and there are a few variables I didn’t use because they are NaN in the test set (I need to figure out a solution to that later and improve on it). My public RMSE was 0.19505.

Hope it helps :slight_smile:


Thank you! After comparing your kernel to mine, I shamelessly copied a couple of bits:

Added "log=True" at the end of

.label_from_df(cols=dep_var, label_cls=FloatList, log=True)

Defined the y_range for the learner:

max_log_y = np.log(np.max(df[dep_var])*1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)

And it started working! I’m not fully clear on what these things do exactly. Do they instruct the databunch and learner to work with log-transformed data, and is y_range the log-space minimum and maximum the target variable can take?

Correct. I’d watch lesson 6 again; Jeremy goes over this regression-style problem and what those settings do :slight_smile:
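
Roughly, and this is only a sketch of the idea, not fastai’s exact code: log=True stores log(SalePrice) as the label, so the network trains in log space and predictions have to be exponentiated back; y_range squashes the raw output through a sigmoid and rescales it so predictions can never leave the given bounds.

import torch

# Sketch of the y_range trick: sigmoid bounds the raw model output to
# (0, 1), then it is rescaled into [lo, hi] (log space in this case).
def apply_y_range(raw_output, y_range):
    lo, hi = y_range
    return lo + (hi - lo) * torch.sigmoid(raw_output)

# Because the labels are log(SalePrice), map predictions back with
# np.exp, as in the Rossmann snippet earlier in the thread.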


Cheers! Will do a re-watch of the Part 1 lessons.