Tabular - Issue w/ training after .get_tabular_learner()

Hi everyone,

I’m using fastai v1 to learn about applying deep learning to tabular data. I have tried to resolve my issue with the docs and previous questions, but I’m still stuck.

infrastructure: google colab
pytorch: 1.0.0.dev20181029
fastai: 1.0.18

My approach:

# Read the data in from the CSV file (the dep_var I want to predict is continuous between 0 and 1)
play_data = pd.read_csv(path+'train_V2.csv')
tfms = [FillMissing, Categorify]

# Set the dependent variable
dep_var = 'winPlacePerc'

# Set the category variables
cat_names = ['Id', 'groupId', 'matchId', 'matchType']

# Create the data object from the TabularDataBunch class, using a sample of my data because I was using too much RAM
data = TabularDataBunch.from_df(path, sample_t_df, sample_v_df, dep_var, tfms=tfms, cat_names=cat_names)

# Define the mean absolute error, since this is the metric required by the Kaggle comp
def MAE(pred:Tensor, targ:Tensor) -> Rank0Tensor:
  return abs(targ-pred).mean()
  
# Instantiate my learner
learn = get_tabular_learner(data, layers=[200,100], metrics=MAE)

# Train for a single epoch
learn.fit_one_cycle(1, 1e-2)

My error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-46-3ea49add0339> in <module>()
----> 1 learn.fit_one_cycle(1, 1e-2)

/usr/local/lib/python3.6/dist-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     21                                         pct_start=pct_start, **kwargs))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23 
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     87             if hasattr(data,'valid_dl') and data.valid_dl is not None:
     88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
---> 89                                        cb_handler=cb_handler, pbar=pbar)
     90             else: val_loss=None
     91             if cb_handler.on_epoch_end(val_loss): break

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     52             if not is_listy(yb): yb = [yb]
     53             nums.append(yb[0].shape[0])
---> 54             if cb_handler and cb_handler.on_batch_end(val_losses[-1]): break
     55             if n_batch and (len(nums)>=n_batch): break
     56         nums = np.array(nums, dtype=np.float32)

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in on_batch_end(self, loss)
    236         "Handle end of processing one batch with `loss`."
    237         self.state_dict['last_loss'] = loss
--> 238         stop = np.any(self('batch_end', not self.state_dict['train']))
    239         if self.state_dict['train']:
    240             self.state_dict['iteration'] += 1

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    184     def __call__(self, cb_name, call_mets=True, **kwargs)->None:
    185         "Call through to all of the `CallbakHandler` functions."
--> 186         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
    187         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    188 

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in <listcomp>(.0)
    184     def __call__(self, cb_name, call_mets=True, **kwargs)->None:
    185         "Call through to all of the `CallbakHandler` functions."
--> 186         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
    187         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    188 

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in on_batch_end(self, last_output, last_target, train, **kwargs)
    268     def on_batch_end(self, last_output, last_target, train, **kwargs):
    269         self.count += last_target.size(0)
--> 270         self.val += last_target.size(0) * self.func(last_output, last_target).detach().item()
    271 
    272     def on_epoch_end(self, **kwargs):

/usr/local/lib/python3.6/dist-packages/fastai/metrics.py in accuracy(input, targs)
     35     "Compute accuracy with `targs` when `input` is bs * n_classes."
     36     n = targs.shape[0]
---> 37     input = input.argmax(dim=1).view(n,-1)
     38     targs = targs.view(n,-1)
     39     return (input==targs).float().mean()

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in argmax(self, dim, keepdim)
    238     def argmax(self, dim=None, keepdim=False):
    239         r"""See :func:`torch.argmax`"""
--> 240         return torch.argmax(self, dim, keepdim)
    241 
    242     def argmin(self, dim=None, keepdim=False):

/usr/local/lib/python3.6/dist-packages/torch/functional.py in argmax(input, dim, keepdim)
    529     if dim is None:
    530         return torch._argmax(input.contiguous().view(-1), dim=0, keepdim=False)
--> 531     return torch._argmax(input, dim, keepdim)
    532 
    533 

RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

What I have already tried:

  1. Using the default metrics=accuracy instead of my custom one.
  2. Setting learn.loss_fn = F.mse_loss (after reading this post, though after reading the source code I realised that wasn’t going to be the problem).

Grateful for assistance.

Your problem comes from the metric, so maybe try without a metric first. Then you’ll probably need to debug your custom one, as accuracy doesn’t seem to be working here.
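For example, here is a minimal debugging sketch (not from the original thread). It assumes the fastai v1 tabular convention that a batch comes out as ([x_cat, x_cont], y); the point is simply to see what shapes a metric would receive, since they may not match (e.g. [bs, 1] vs [bs]). Variable names follow the original post:

# Instantiate the learner with no metric while debugging
learn = get_tabular_learner(data, layers=[200,100])

# Grab one validation batch and run a forward pass to see what a metric would receive
xb, yb = next(iter(data.valid_dl))
preds = learn.model(*xb)
print(preds.shape, yb.shape)

# If the shapes differ (e.g. [64, 1] vs [64]), abs(targ - pred) can silently broadcast,
# so a flattened version of the custom metric is safer:
def MAE(pred, targ):
  return abs(targ.view(-1) - pred.view(-1)).mean()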

Is your dependent variable continuous or categorical? I’m trying to sort out the same problem in my code. My dep_var is continuous.

Thanks for the pointers. I’ll have another crack and see how I go.

My dependent variable is continuous. (Or at least that is my intention).

If I examine df.dtypes, the type is set to float32. I’m not 100% sure I have converted the categorical columns correctly; could this error stem from the structure/layout of my data frame?
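For reference, one quick way to inspect this (a sketch using the column names from the original post; the astype call is only needed if the target isn’t already a float column):

# Check the dtypes of the dependent variable and the declared categorical columns
print(play_data.dtypes[['winPlacePerc', 'Id', 'groupId', 'matchId', 'matchType']])

# Coerce the target to float32 before building the DataBunch, if it isn't already
play_data['winPlacePerc'] = play_data['winPlacePerc'].astype('float32')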

Please let me know if you work it out in your code and I’ll do the same.

My guess is that you are using a metric which is not compatible with your data. For example, metrics like accuracy are made for assessing whether the model correctly or incorrectly classifies your data (ergo, accuracy is good for a classification problem).

I would have thought MAE would work fine if your dependent variable is continuous. In your original post, you mentioned you switched to accuracy, so I would recommend not using that one if your dependent variable is continuous.
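To see why accuracy produces the dimension error in the traceback above, here is a minimal reproduction (assuming, as the error message implies, that the output reaching the metric is one-dimensional):

import torch

preds = torch.rand(64)   # a 1-D batch of continuous predictions
targs = torch.rand(64)   # continuous targets in [0, 1]

# fastai's accuracy calls input.argmax(dim=1), which expects a [bs, n_classes] tensor;
# on a 1-D tensor this raises: "Dimension out of range (expected to be in range of [-1, 0], but got 1)"
# preds.argmax(dim=1)

# Mean absolute error on the same tensors is well defined:
print(abs(targs - preds).mean())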

Thanks Aaron, I tried again using the built-in exp_RMSE metric (since I was worried I had implemented MAE incorrectly). It worked! (Or at least it appears to; numbers come out.)

I reset the environment and tried again with MAE (same code as my original post, except I used a small sample of the data to speed up training times). It worked as well!

So I repeated with MAE and the full dataset; about 15% of the way in, the output changes to NaN and stays that way.

Now I’m more confused than ever: my code will now run all the way through and produce a number, but only for small datasets. And I didn’t do anything to fix the last error I was getting (unless it was something as silly as changing metrics=accuracy to metrics=MAE in the code block but never executing it, so the error came from trying to use accuracy on a continuous variable).

Hi Patrick, I was looking for examples of others using MAE and saw your post.

For the sudden NaN, if it’s not due to the usual causes of the loss going to NaN (exploding gradients, too high a learning rate), and it seems to correct itself with small datasets, then my guess is that one of your rows has a null or erroneous dependent variable, which causes the metric to break down in the middle of training. To correct it, consider something like the following, where you replace depvar with the dependent variable in your code:

# Drop any rows where the dependent variable is missing
df_train.dropna(axis=0, subset=['depvar'], inplace=True)
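After dropping, it may also be worth confirming that nothing slipped through before rebuilding the DataBunch (a quick sketch; depvar is the same placeholder as above):

# Should print 0 if every remaining row has a target
print(df_train['depvar'].isna().sum())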