Fastai v2 text

Another approach suggested for NER without a CRF layer is the following:

They are using an AWD-LSTM language model plus a knowledge base.


Edit: I figured out how to deal with this: increase the cache parameter when creating the dataloader. Now the GPU no longer goes idle during training. So I will leave this post here so others can see it if they face the same issue. Is there any rule of thumb for defining the optimal cache size?
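For reference, here is roughly what I mean (a minimal sketch, assuming the dataloaders call forwards the cache kwarg on to LMDataLoader; variable names are from the original post below):

# Sketch: a larger LMDataLoader cache keeps more processed items in memory,
# so the GPU is not starved waiting on lazy CPU-side processing each epoch.
dbunch_lm = dsrc.dataloaders(bs=bs, seq_len=sl, num_workers=8,
                             cache=2048)  # much larger than the small default; tune to your RAM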

Original post:

Hi @sgugger. Is there an option to preprocess the texts up front rather than lazily loading them with Datasets and Dataloaders?

For a huge dataset of small documents the lazy-loading approach is perfect, because the whole Dataset can be processed without having to fit in memory. However, a Dataset of big documents that does fit in memory ends up taking much more time per epoch during training.

I am trying to train a language model on a Dataset of big text documents. In fastai v1, this approach takes 1h 50m per epoch with 95% GPU utilization:

data_src = (TextList.from_df(df, processor=[SPProcessor.load(save_data,'spm15kpt','spm15kpt')])
        .split_by_rand_pct(0.1, seed=42)
        .label_for_lm())

data = data_src.databunch(bs=bs, num_workers=1, bptt=72)

learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=1., pretrained=False, 
                               pretrained_fnames = [weights, vocab_lm],
                               metrics=[error_rate, accuracy, perplexity]).to_fp16()

In fastai2, the same approach takes around 10 hours per epoch, with GPU usage varying between 0% and 70% most of the time:

Fastai2:

SentencePiece = SentencePieceTokenizer(sp_model='spm15kpt/spm15kpt.model')

tfms = [attrgetter("lm_text"), Tokenizer(tokenizer=SentencePiece), Numericalize(vocab)]
splits = RandomSplitter(valid_pct=0.1, seed=42)(tp)

dsrc = Datasets(tp, [tfms], splits=splits, dl_type=LMDataLoader)

dbunch_lm = dsrc.dataloaders(bs=bs, seq_len=sl, num_workers=8, pin_memory=True, shuffle_train=False)

learn_lm = language_model_learner(dbunch_lm,
                                  AWD_LSTM,
                                  config,
                                  pretrained=False,
                                  drop_mult=1.,
                                  pretrained_fnames=[weights, vocab_lm],
                                  metrics=[error_rate, accuracy, Perplexity()],
                                  path=path
                                  ).to_fp16()

P.S.: With shuffle_train=True and fewer or no workers, it is even slower.

I suspect the DataLoader is taking too much time to process each text on the CPU before sending a batch to the GPU. Is there an option to force early processing of the whole Dataset? Or can I do something to speed up the processing of the batch?
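For example, something like this is what I have in mind (a rough sketch, not tested; it assumes tokenize_df accepts the tokenizer class via tok_func and writes the tokens to a text column):

# Sketch: tokenize the whole DataFrame once, up front, so only the cheap
# numericalization step remains in the lazy per-item pipeline.
tok_df, counts = tokenize_df(tp, text_cols='lm_text',
                             tok_func=SentencePieceTokenizer,
                             sp_model='spm15kpt/spm15kpt.model')
tfms = [attrgetter('text'), Numericalize(vocab)]
dsrc = Datasets(tok_df, [tfms], splits=splits, dl_type=LMDataLoader)  # then dataloaders as before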

Edit: It does indeed seem to be related to document size and the lazy-loading approach. As an experiment, I capped my texts at 128 raw (not yet tokenized) tokens and now GPU usage is steady at 85% with no long periods at 0%. The lazy-loading approach is very useful: I have been using it for a huge (100MM records) dataset of small texts and it works like a charm without OOM issues. But for medium or small Datasets of big texts (> 256 tokens) it starts to lose performance during training because it underutilizes the GPU.

tp['capped_text'] = tp['lm_text'].apply(lambda x: " ".join(x.split()[:128]))


@much_learner did you manage to get this working? From your posts I'm guessing you're doing the Kaggle Google Quest comp? I hit some of the same problems as you, getting the error about the targets now. I need to look a little closer at why the y is returning a tuple…

No, I haven't. Yes, that's the Quest comp.

I think there's no ready-made API for multi-label regression on text yet, so we need to add our own Block class and a show method.

If you guys want some help on how to get started with building something custom, or run into trouble, let me know :slight_smile:


Thanks! Yeah, I'm starting to bang my head against the wall here :smiley:

The objective is: given a piece of text, predict 30 different scores for that text. Essentially, how people rated it on different traits.

I used RegressionBlock(c_out=len(label_cols)) as it gave the correct number of outputs (30) in the final layer of the model. Using TransformBlock gave a final-layer output of 21000…

Below is the text block I built. You can ignore mmgTokenizer: I had to tweak something (wrap one thing in str()) in Tokenizer in order to get it to work with the SentencePiece tokenizer.

class mmgTextBlock(TransformBlock):
    def __init__(self, tok_tfm, vocab=None, is_lm=False, seq_len=72):
        return super().__init__(type_tfms=[tok_tfm, Numericalize(vocab)],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})
    @classmethod
    @delegates(Tokenizer.from_df, keep=True)
    def from_df(cls, text_cols, vocab=None, is_lm=False, seq_len=72, **kwargs):
        return cls(mmgTokenizer.from_df(text_cols,       # <--- Changed here, added mmgTokenizer so can use mmgSentencePieceTokenizer
                                        tok_func=mmgSentencePieceTokenizer,
                                        model_type='bpe', 
                                        special_toks=all_special_toks,
                                        **kwargs), 
                   vocab=vocab, is_lm=is_lm, seq_len=seq_len)  


dbch = DataBlock(blocks=(mmgTextBlock.from_df(vocab=lm_vocab, text_cols="doc"),
                         RegressionBlock(c_out=len(label_cols))),  #TransformBlock
                 get_x=ColReader('text'),
                 get_y=ColReader(list(label_cols)),
                 splitter=RandomSplitter(0.2, seed=42),
                 dl_type=SortedDL).dataloaders(quest_trn, verbose=True)

And the learner:

learn=text_classifier_learner(dbch, AWD_LSTM, loss_func=torch.nn.BCEWithLogitsLoss(),
                              cbs=[SaveModelCallback()],
                              opt_func=opt_func, cb_funcs=cb_funcs,
                              seq_len=72, config=awd_lstm_clas_config, pretrained=True, 
                              drop_mult=0.5, n_out=None,
                              lin_ftrs=None, ps=None, max_len=72*20)

Any help would be amazing; right now the above gives me:

Target size (torch.Size([64, 30])) must be the same as input size (torch.Size([30]))

So I don't think the dataloader is assembling the data correctly.

The first thing I'd look at is the raw batches that come out after your transforms. For a hint on how to do this, look at the code for dblock.summary() and what it's doing to transform everything. This way you can also see what your data looks like at every step :slight_smile: (I've also talked about how to do this on the forums if you look for that :wink:)
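For example, a minimal sketch, assuming you keep a reference to the DataBlock itself before calling dataloaders on it:

# Sketch: build the DataBlock first, call summary() on the raw source,
# then create the dataloaders from the same block.
block = DataBlock(blocks=(mmgTextBlock.from_df(vocab=lm_vocab, text_cols="doc"),
                          RegressionBlock(c_out=len(label_cols))),
                  get_x=ColReader('text'),
                  get_y=ColReader(list(label_cols)),
                  splitter=RandomSplitter(0.2, seed=42))
block.summary(quest_trn)             # prints each transform's output on one sample
dbch = block.dataloaders(quest_trn)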

Edit: @morgan just realized it was actually in a conversation. Here's the bit to look at:

for f in dls.train.after_item:
    name = f.name
    x = f(x)
    print(x[1])

So that just returns ToTensor (I commented out the x)

for i, v in enumerate(dbch.train):
    print(i)
    print(v[0].size())
    print(v[1][0].size())
    print(v[1][1].size())

OUTPUT:

0
torch.Size([64, 2049])
torch.Size([30])
torch.Size([30])
1
torch.Size([64, 3783])
torch.Size([30])
torch.Size([30])

Printing the size of the batch outputs, it looks like it's only serving up 1 y tensor value instead of 64…

Sorry, I meant to show this entire thing:

for i in range(len(dsets.train)):
  x = dsets.train[i]
  for f in dls.train.after_item:
    name = f.name
    x = f(x)
    print(name, x)

The goal here is to print each transform and its respective output (I also recommend just doing the first item in dsets.train). You can then check the size etc. too; x[0] is your input, x[1] is your y.
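You can also grab a collated batch directly and compare; a quick sketch with the dbch from above:

# Sketch: inspect one collated batch; compare these shapes to the per-item ones.
b = dbch.one_batch()
print(len(b))   # 2 for (x, y); more pieces suggests y came out split
print([o.shape if hasattr(o, 'shape') else type(o) for o in b])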


Gotcha, thanks. Each item looks OK, so it must be something with how the batch is constructed… will keep looking.

ToTensor
torch.Size([754])
torch.Size([30])

ToTensor
torch.Size([2120])
torch.Size([30])

ToTensor
torch.Size([883])
torch.Size([30])

EDIT: I was using the RNNRegularizer callback twice by mistake. It comes built in with text_classifier_learner, but then I added it to my learner as another callback… don't do this :smiley:


PROGRESS!

Does anyone know why the "after_pred" callback in the one_batch function below in learner.py might be called twice?

I think it's the reason why I'm only getting a single set of predictions passed to the loss function instead of predictions for the entire batch, resulting in the "Target size (torch.Size([32, 29])) must be the same as input size (torch.Size([29]))" error.

def one_batch(self, i, b):
    self.iter = i
    try:
        self._split(b);                                  self('begin_batch')
        self.pred = self.model(*self.xb);                self('after_pred')  # <--- gets called twice
        if len(self.yb) == 0: return
        self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')

RNNRegularizer

This is where the after_pred callback is called. You can see it modifies self.pred to take only the value at the first index. So far so good: this returns our 32 predictions for each of the 29 classes, giving self.learn.pred a shape of torch.Size([32, 29]).

class RNNRegularizer(Callback):
    "`Callback` that adds AR and TAR regularization in RNN training"
    def __init__(self, alpha=0., beta=0.): self.alpha,self.beta = alpha,beta

    def after_pred(self):
        self.raw_out,self.out = self.pred[1],self.pred[2]
        self.learn.pred = self.pred[0]

However, this then gets called a second time on the new predictions, which reduces self.pred to torch.Size([29]).

It would mean the callback is present twice. You can check learn.cbs to be sure.
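For example, a quick sketch:

# Sketch: count callback types on the learner; a count of 2 for
# RNNRegularizer means it was registered twice.
from collections import Counter
print(Counter(type(cb).__name__ for cb in learn.cbs))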

Ah yes, that's it, thank you! I had added it as an additional callback when I copied the plain Learner from the wikitext tutorial, but it looks like text_classifier_learner already has it built in :man_facepalming:

Man I learned far more about the guts of fastai2 this week than I had planned :smiley:


Note that you have a method to see what happens during the training loop, to help you debug those kinds of situations; it's called learn.show_training_loop()
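A minimal usage example:

learn.show_training_loop()

It prints each event of the training loop along with the callbacks that run at that event, so you can spot a callback firing where you did not expect it.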


That is excellent to know, thanks! So much goodness being built; I feel this must be yer motto.


This one is actually a user contribution, so not from Jeremy and me :wink:


EDIT: SOLVED

Sorry, I missed a commit from today: I was missing ModelReseter from my cbs (it resets the model's hidden state at the start of training and validation). For reference, this is my code that now works for training a language model:

lm_model = get_language_model(AWD_LSTM, vocab_sz=len(lm_vocab), config=awd_lstm_lm_config)

lm_opt_func = partial(Adam, wd=0.1, eps=1e-7)

lm_cbs = [MixedPrecision(clip=0.1), ModelReseter, RNNRegularizer(alpha=2, beta=1)]

lm_learn = Learner(dls=lmdls, model=lm_model, loss_func=CrossEntropyLossFlat(), 
                opt_func=lm_opt_func, cbs=lm_cbs,
                metrics=[accuracy, Perplexity()])

It looks like a recent change might have altered something in the language-model setup? I trained an LM a few days ago with the same code, but now when I run lr_find() I get the error below:

Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu

I checked, and it appears that my batches from the dataloader are on the GPU, so I think the RNN's hidden state doesn't get moved to the GPU as expected. This is how I have defined the model:
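This is roughly how I checked (a quick sketch using the names from the setup below):

# Sketch: confirm where one batch and the model parameters live.
xb, yb = lmdls.one_batch()
print(xb.device)                                 # cuda:0 for me
print(next(lm_learn.model.parameters()).device)  # also cuda:0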

lm_model = get_language_model(AWD_LSTM, vocab_sz=len(lm_vocab), config=awd_lstm_lm_config)

lm_opt_func = partial(Adam, wd=0.1, eps=1e-7)
lm_cbs = [SaveModelCallback(), RNNRegularizer(alpha=2, beta=1)]

lm_learn = Learner(dls=lmdls, model=lm_model, loss_func=CrossEntropyLossFlat(), 
                opt_func=lm_opt_func, cbs=lm_cbs,
                metrics=[accuracy, Perplexity()]).to_fp16()

And this is the stacktrace:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-39-800e454c6490> in <module>
----> 1 lm_learn.lr_find()

~/fastai2/fastai2/callback/schedule.py in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggestions)
    196     n_epoch = num_it//len(self.dls.train) + 1
    197     cb=LRFinder(start_lr=start_lr, end_lr=end_lr, num_it=num_it, stop_div=stop_div)
--> 198     with self.no_logging(): self.fit(n_epoch, cbs=cb)
    199     if show_plot: self.recorder.plot_lr_find()
    200     if suggestions:

~/fastai2/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    292                     try:
    293                         self.epoch=epoch;          self('begin_epoch')
--> 294                         self._do_epoch_train()
    295                         self._do_epoch_validate()
    296                     except CancelEpochException:   self('after_cancel_epoch')

~/fastai2/fastai2/learner.py in _do_epoch_train(self)
    267         try:
    268             self.dl = self.dls.train;                        self('begin_train')
--> 269             self.all_batches()
    270         except CancelTrainException:                         self('after_cancel_train')
    271         finally:                                             self('after_train')

~/fastai2/fastai2/learner.py in all_batches(self)
    245     def all_batches(self):
    246         self.n_iter = len(self.dl)
--> 247         for o in enumerate(self.dl): self.one_batch(*o)
    248 
    249     def one_batch(self, i, b):

~/fastai2/fastai2/learner.py in one_batch(self, i, b)
    251         try:
    252             self._split(b);                                  self('begin_batch')
--> 253             self.pred = self.model(*self.xb);                self('after_pred')
    254             if len(self.yb) == 0: return
    255             self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')

~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

~/fastai2/fastai2/text/models/awdlstm.py in forward(self, inp, from_embeds)
    100         new_hidden,raw_outputs,outputs = [],[],[]
    101         for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
--> 102             raw_output, new_h = rnn(raw_output, self.hidden[l])
    103             new_hidden.append(new_h)
    104             raw_outputs.append(raw_output)

~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

~/fastai2/fastai2/text/models/awdlstm.py in forward(self, *args)
     49             #To avoid the warning that comes because the weights aren't flattened.
     50             warnings.simplefilter("ignore")
---> 51             return self.module.forward(*args)
     52 
     53     def reset(self):

~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
    562             return self.forward_packed(input, hx)
    563         else:
--> 564             return self.forward_tensor(input, hx)
    565 
    566 

~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/rnn.py in forward_tensor(self, input, hx)
    541         unsorted_indices = None
    542 
--> 543         output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
    544 
    545         return output, self.permute_hidden(hidden, unsorted_indices)

~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/rnn.py in forward_impl(self, input, hx, batch_sizes, max_batch_size, sorted_indices)
    524         if batch_sizes is None:
    525             result = _VF.lstm(input, hx, self._get_flat_weights(), self.bias, self.num_layers,
--> 526                               self.dropout, self.training, self.bidirectional, self.batch_first)
    527         else:
    528             result = _VF.lstm(input, batch_sizes, hx, self._get_flat_weights(), self.bias,

RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu

> /home/morgan/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/rnn.py(526)forward_impl()
    524         if batch_sizes is None:
    525             result = _VF.lstm(input, hx, self._get_flat_weights(), self.bias, self.num_layers,
--> 526                               self.dropout, self.training, self.bidirectional, self.batch_first)
    527         else:
    528             result = _VF.lstm(input, batch_sizes, hx, self._get_flat_weights(), self.bias,

Has anyone else encountered this?


EDIT: yes it is expected :slight_smile:

~ ~ ~ ~ ~ ~ ~

Expected behaviour? SortedDL returns the samples in reversed order

I've been getting terrible results on my first submissions to Kaggle (Google QUEST) using fastai2; it turns out it's because my test submissions are backwards :frowning: I am using SortedDL in my dataloader and am just wondering if this is by design? The original order is restored when I set reverse=False in get_idxs in SortedDL.

@delegates(TfmdDL)
class SortedDL(TfmdDL):
    def __init__(self, dataset, sort_func=None, res=None, **kwargs):
        super().__init__(dataset, **kwargs)
        self.sort_func = _default_sort if sort_func is None else sort_func
        if res is None and self.sort_func == _default_sort: res = _get_lengths(dataset)
        self.res = [self.sort_func(self.do_item(i)) for i in range_of(self.dataset)] if res is None else res
        self.idx_max = np.argmax(self.res)

    def get_idxs(self):
        idxs = super().get_idxs()
        if self.shuffle: return idxs
        return sorted(idxs, key=lambda i: self.res[i], reverse=True)  # <- should this default to True?

Dataloaders setup here:

bs = 32
sl = 72
tst_splits = [list(range(len(tst_df))),list(range(len(tst_df)))]
x_tfms = [attrgetter('doc'), tok_fn, Numericalize(vocab=lm_vocab)]

tst_cls_dsets = Datasets(tst_df, splits=tst_splits, tfms=[x_tfms], dl_type=SortedDL)

tst_cls_dls = tst_cls_dsets.dataloaders(bs=bs, sl=sl, shuffle_train=False, device='cuda', 
                                    before_batch=pad_input_chunk, dl_type=SortedDL)

The reverse=True means the samples are sorted by length from biggest to smallest. This is because we need the biggest batch first, so CUDA memory is allocated for the worst case up front.

You will need to unsort them after you get the results of your get_preds, yes (as you can see, the permutation is given by dl.get_idxs(); you just need to invert it).
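For example, a minimal sketch, assuming preds come back in the sorted order given by get_idxs (tst_cls_dls as defined above):

import numpy as np, torch

# Sketch: invert the length-sorted permutation to restore the original row order.
preds, _ = learn.get_preds(dl=tst_cls_dls.train)
idxs = np.array(tst_cls_dls.train.get_idxs())      # position k holds original row idxs[k]
preds_in_order = preds[torch.from_numpy(np.argsort(idxs))]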


Ah understood, thanks!