Difficulties training on a large tabular dataset

Hi,

I am currently having issues training on a large dataset and have some questions about the best approach.

Each of my datasets is about 11GB in .feather format.

I read the first dataset:
# assumes the usual imports: import gc, pandas as pd; from fastai.tabular import *
df = pd.read_feather(r'/HDD1/new_model_input/df100_0.feather')
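
For reference, the in-memory footprint of the dataframe can be noticeably larger than the 11GB feather file on disk; a quick way to check it (plain pandas, nothing fastai-specific) is:
print(df.memory_usage(deep=True).sum() / 1024**3, 'GiB in RAM')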

Define my process:
dep_var = 'WIN_LOSE'
cat_var1 = [c for c in df.columns if df[c].dtypes==object and c!=dep_var]
cat_var2 = [c for c in df.columns if df[c].unique().shape[0]<30 and c!=dep_var]
cat_names = list(set(cat_var1+cat_var2))
cont_names = [c for c in df.columns if c not in cat_names+[dep_var]]
procs = [FillMissing, Categorify, Normalize]
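
Since cat_names and cont_names are re-derived from whichever dataframe is currently loaded, I would like to pin them down once and reuse the exact same lists for every file. A minimal sketch of what I mean, using only the standard json module (the columns.json path is just a placeholder):
import json
# after deriving the lists from the first dataframe, write them out once
with open('/HDD1/new_model_input/columns.json', 'w') as f:
    json.dump({'cat_names': cat_names, 'cont_names': cont_names, 'dep_var': dep_var}, f)
# for every later dataset, reload the same lists instead of re-deriving them
with open('/HDD1/new_model_input/columns.json') as f:
    cols = json.load(f)
cat_names, cont_names, dep_var = cols['cat_names'], cols['cont_names'], cols['dep_var']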

Create my data object:
df_len = len(df)  # total number of rows; the last 10,000 are used as the validation set

data = (TabularList.from_df(df, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
                   .split_by_idx(list(range(df_len-10000, df_len)))
                   .label_from_df(cols=dep_var)
                   .databunch(bs=4096))

Create a learner object and train it:
learn = tabular_learner(data, layers=[200,100,50], metrics=accuracy, wd=0.11)
learn.fit(1, 5e-5)

Export the model (I have tried both save and export):
learn.export(f'/home/Desktop/model_iteration_1/model_0')
learn.save(f'/home/Desktop/model_iteration_1/model_0')

At this point, I have tried many ways to free my RAM, unsuccessfully:
del df
learn.purge()
learn.destroy()
del learn
gc.collect()

However, most of the RAM is still in use afterwards. If anyone knows how I can actually free that memory, it would be greatly appreciated.
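
For anyone who wants to reproduce the measurement, something like this should show how much resident memory the process is still holding after the deletions (a sketch using psutil, which is a separate package, not part of fastai):
import os, psutil
rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
print(f'resident memory after cleanup: {rss_gb:.1f} GiB')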

To free the memory, I restart the kernel and load my second dataframe. I create a data object with the new dataframe, then load my existing model:
df_len = len(df)  # recomputed for the new dataframe

data = (TabularList.from_df(df, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
                   .split_by_idx(list(range(df_len-10000, df_len)))
                   .label_from_df(cols=dep_var)
                   .databunch(bs=4096))

learn = load_learner(f'/home/model_iteration_1/',
                     file=f'model_{j-1}')

And replace the data attribute with my new data:
learn.data = data
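
Before fitting, a rough sanity check I can run on one batch is to see whether the categorical codes produced from the new dataframe still fit inside the embedding tables of the loaded model. This is only a sketch: it assumes the fastai v1 TabularModel keeps its embedding layers in learn.model.embeds (as the traceback below suggests) and that data.one_batch() returns the categorical codes as the first element of x:
xb, yb = data.one_batch()  # xb should be [categorical codes, continuous values]
x_cat = xb[0]
for i, emb in enumerate(learn.model.embeds):
    max_code = int(x_cat[:, i].max())
    if max_code >= emb.num_embeddings:
        print(f'categorical column {i}: code {max_code} >= embedding size {emb.num_embeddings}')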

I attempt to fit my existing model on the new dataset:
learn.fit(1, 5e-5)

However, the fit method fails:

RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>
18 file=f'model_{j-1}')
19 learn.data = data
---> 20 learn.fit(1, 5e-5)
21 learn.export(f'/home/Desktop/model_iteration_1/model_{j}')
22 learn.purge()

~/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
197 callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
198 if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 199 fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
200
201 def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
99 for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
100 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 101 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
102 if cb_handler.on_batch_end(loss): break
103

~/anaconda3/lib/python3.6/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
24 if not is_listy(xb): xb = [xb]
25 if not is_listy(yb): yb = [yb]
---> 26 out = model(*xb)
27 out = cb_handler.on_loss_begin(out)
28

~/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

~/anaconda3/lib/python3.6/site-packages/fastai/tabular/models.py in forward(self, x_cat, x_cont)
29 def forward(self, x_cat:Tensor, x_cont:Tensor) -> Tensor:
30 if self.n_emb != 0:
---> 31 x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
32 x = torch.cat(x, 1)
33 x = self.emb_drop(x)

~/anaconda3/lib/python3.6/site-packages/fastai/tabular/models.py in <listcomp>(.0)
29 def forward(self, x_cat:Tensor, x_cont:Tensor) -> Tensor:
30 if self.n_emb != 0:
---> 31 x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
32 x = torch.cat(x, 1)
33 x = self.emb_drop(x)

~/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

~/anaconda3/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
115 return F.embedding(
116 input, self.weight, self.padding_idx, self.max_norm,
--> 117 self.norm_type, self.scale_grad_by_freq, self.sparse)
118
119 def extra_repr(self):

~/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1504 # remove once script supports set_grad_enabled
1505 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1506 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1507
1508

RuntimeError: index out of range at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:193

One thing to note: I made sure that all of my datasets contain the same set of unique values for each categorical feature, although the order in which those values appear may differ.
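
To be concrete about the kind of comparison I mean (a sketch; df_a and df_b are placeholder names for two of my dataframes, and the second file name is hypothetical), this only compares the raw values, not whatever integer codes Categorify ends up assigning:
df_a = pd.read_feather(r'/HDD1/new_model_input/df100_0.feather')
df_b = pd.read_feather(r'/HDD1/new_model_input/df100_1.feather')  # placeholder for a later file
for c in cat_names:
    a, b = set(df_a[c].dropna().unique()), set(df_b[c].dropna().unique())
    if a != b:
        print(c, 'differs:', a.symmetric_difference(b))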

My questions are:

  1. Is there a better way for me to train on a dataset that doesn't fit in my RAM?
  2. Is there a way for me to free the used RAM after deleting my df and learn objects?
  3. Does anyone know why my second fit attempt fails, given that the categorical values from the new dataset should already be accounted for in the embeddings of the initial model?

Thank you very much for taking the time to read through this; sorry for the length, but I thought I should be explicit. Hopefully it helps someone else in the future.
