TextList: CUDA out of memory

In my text classification exercises I can’t get the Data Block approach to work.

Using the following code I get an ‘out of memory’ error when using learn.fit_one_cycle():

data = (TextList.from_df(df, PATH, cols='comment_text')
           .use_partial_data(0.2)
           .split_by_rand_pct(0.2)
           .label_from_df(cols=1)
           .databunch(bs = 8))

The data is from the Kaggle Jigsaw competition. The dataframe contains 1,804,874 rows, of which I use 20%. In another text application I encountered the same error with the Data Block approach, and at that time TextLMDataBunch worked fine. Does it have to do with the Data Block approach for text applications? How can I solve this?

TextLMDataBunch doesn’t work either in this case: when loading data with from_df(), it asks for the validation set, which in my data is encoded in one of the columns.
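Something like the following sketch is what I’d presumably have to do (assuming fastai v1; is_valid is a hypothetical column name standing in for whichever column encodes my split):

# 'is_valid' is a hypothetical column marking the validation rows
train_df = df[~df['is_valid']]
valid_df = df[df['is_valid']]

data_lm = TextLMDataBunch.from_df(PATH, train_df, valid_df,
                                  text_cols='comment_text', bs=8)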

Ultimately I’d like to use the Data Block approach because it allows for better customization. For now any advice is welcome!

You will normally get the out of memory error when the data in the batch is too big to fit in the memory of your GPU. Try reducing your batch size (bs) to 4 or even 2:

data = (TextList.from_df(df, PATH, cols='comment_text')
           .use_partial_data(0.2)
           .split_by_rand_pct(0.2)
           .label_from_df(cols=1)
           .databunch(bs = 4))

Thanks. Still, I’m not sure this is the solution; I tried this already.
Why would the Data Block approach give a CUDA out of memory error, while the preset approach with TextLMDataBunch does work?

Just a guess at what might be happening: after a CUDA out of memory error, the memory tends to stay allocated. So even if you lower your batch size and try again, you’ll get the error (because the memory wasn’t freed up first).

To see if this is the problem, after getting this error, execute a cell that intentionally triggers an exception: 1/0, for example. In recent versions of the Fastai library, this will free up the GPU memory. Or, you can restart your kernel in Jupyter, although it’s more of a nuisance that way.
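If you’d rather not rely on the exception trick, a minimal sketch of clearing the memory by hand (plain PyTorch, nothing fastai-specific) is:

import gc
import torch

gc.collect()              # drop Python references to dead tensors first
torch.cuda.empty_cache()  # hand cached, unused blocks back to the CUDA driver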

Good luck!

Thanks again. Sorry, tried this as well. Should have made this clear in the opening post.

One line that might cause an issue is the following:

data.save('data_lm.pkl')
data = load_data(PATH, 'data_lm.pkl')

If I leave out the load_data part, the memory issue disappears. Instead I get the following error when calling .fit_one_cycle():

ValueError: Expected input batch_size (820) to match target batch_size (2).

So one error down, another one to go.

It might be that loading the data again creates a duplicate of the data variable in memory, or that some other temporary variable gets created inside the load_data function.

It might be worth manually calling the garbage collector after load_data, like this:

import gc
gc.collect()

That should return a number other than 0 if it was able to clean up some objects.
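If that duplicate reference is indeed the problem, it might also help to drop the old databunch before reloading it; roughly (just mirroring your snippet):

data.save('data_lm.pkl')

del data        # drop the old reference so the garbage collector can reclaim it
gc.collect()

data = load_data(PATH, 'data_lm.pkl', bs=4)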

As for the batch size error, have you tried setting the batch_size when loading the data?

data = load_data(PATH, 'data_lm.pkl', bs=2)

Hi @sinsji – I think your actual label/target is in cols=0.
If I read your code right, your label_from_df points to the “comment_text” column right now (which is cols=1), but you want it to point to the False/True target (which is cols=0).


Thanks for all the help. Still not working though.

Here is the data:

Here is the code:

bs = 4

data = (TextList.from_df(df, PATH, cols='comment_text')
           .use_partial_data(0.2)
           .split_by_rand_pct(0.2)
           .label_from_df(cols=0)
           .databunch(bs = bs))

data.save('data_lm.pkl')

data = load_data(PATH, 'data_lm.pkl', bs = 4)

gc.collect()

learn = language_model_learner(data, AWD_LSTM, drop_mult=0.3)

learn.fit_one_cycle(2, 1e-2, moms=(0.8,0.7))

Here is the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-92ddd2cbbe29> in <module>
----> 1 learn.fit_one_cycle(2, 1e-2, moms=(0.8,0.7))

/opt/anaconda3/lib/python3.7/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     21                                        final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23 
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, wd:float=None):

/opt/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    197         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
    198         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 199         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    200 
    201     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/opt/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     99             for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
    100                 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 101                 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
    102                 if cb_handler.on_batch_end(loss): break
    103 

/opt/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     28 
     29     if not loss_func: return to_detach(out), yb[0].detach()
---> 30     loss = loss_func(out, *yb)
     31 
     32     if opt is not None:

/opt/anaconda3/lib/python3.7/site-packages/fastai/layers.py in __call__(self, input, target, **kwargs)
    265         if self.floatify: target = target.float()
    266         input = input.view(-1,input.shape[-1]) if self.is_2d else input.view(-1)
--> 267         return self.func.__call__(input, target.view(-1), **kwargs)
    268 
    269 def CrossEntropyFlat(*args, axis:int=-1, **kwargs):

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
    902     def forward(self, input, target):
    903         return F.cross_entropy(input, target, weight=self.weight,
--> 904                                ignore_index=self.ignore_index, reduction=self.reduction)
    905 
    906 

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   1968     if size_average is not None or reduce is not None:
   1969         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 1970     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   1971 
   1972 

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   1786     if input.size(0) != target.size(0):
   1787         raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'
-> 1788                          .format(input.size(0), target.size(0)))
   1789     if dim == 2:
   1790         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

ValueError: Expected input batch_size (1640) to match target batch_size (4).

I’m going to try the same without the data block. Just the quick fix like in the documentation examples.

When labeling the data by the first column in the dataframe (the target, which seems to be either True or False), you end up building a databunch for classification, which cannot be passed to language_model_learner, since it expects a TextLMDataBunch. That mismatch is also where the error comes from: the language model produces one prediction per token in the batch (hence the 1640), while the classification labels supply only one target per sample (hence the 4).

Using .label_for_lm() instead should work, and you should not need to reload the data or call garbage collection. The following code should work:

bs = 4

data = (TextList.from_df(df, PATH, cols='comment_text')
           .use_partial_data(0.2)
           .split_by_rand_pct(0.2)
           .label_for_lm()
           .databunch(bs = bs))

data.save('data_lm.pkl')

learn = language_model_learner(data, AWD_LSTM, drop_mult=0.3)

learn.fit_one_cycle(2, 1e-2, moms=(0.8,0.7))
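
Once the language model trains, the classification step would then use .label_from_df(cols=0) together with text_classifier_learner rather than language_model_learner. A rough sketch (the encoder name 'ft_enc' and the hyperparameters are just placeholders):

learn.save_encoder('ft_enc')  # keep the fine-tuned encoder for the classifier

# reuse the language model's vocab so the encoder weights line up
data_clas = (TextList.from_df(df, PATH, cols='comment_text', vocab=data.vocab)
           .use_partial_data(0.2)
           .split_by_rand_pct(0.2)
           .label_from_df(cols=0)
           .databunch(bs = bs))

learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.3)
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))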