Another approach suggested for NER without CRF layer is as below
They are using an AWD LSTM LM + a Knowledge Base.
Edit: I figured out how to deal with this. We should increase the cache parameter when creating the dataloader. Now the GPU does not go idle during training. So I will leave this post here so others can see it if they face the same issue. Is there any rule of thumb for defining the optimal cache size?
Original post:
Hi @sgugger. Is there an option to preprocess the texts and not use lazy loading using Datasets and Dataloaders?
For a huge dataset of small documents the lazy loading approach is perfect because the whole Dataset can be processed without needing to fit in memory. However, a Dataset of big documents that does fit in memory ends up taking much more time per epoch during training.
I am trying to train a language model using a Dataset of big text documents. In fastai v1, this approach takes 1h 50m per epoch with 95% GPU utilization:
data_src = (TextList.from_df(df, processor=[SPProcessor.load(save_data,'spm15kpt','spm15kpt')])
.split_by_rand_pct(0.1, seed=42)
.label_for_lm())
data = data_src.databunch(bs=bs, num_workers=1, bptt=72)
learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=1., pretrained=False,
pretrained_fnames = [weights, vocab_lm],
metrics=[error_rate, accuracy, perplexity]).to_fp16()
In fastai2, the same approach takes around 10 hours each epoch with the GPU usage varying between 0 and 70% most of the time:
Fastai2:
SentencePiece=SentencePieceTokenizer(sp_model='spm15kpt/spm15kpt.model')
tfms = [attrgetter("lm_text"), Tokenizer(tokenizer=SentencePiece), Numericalize(vocab)]
splits = RandomSplitter(valid_pct=0.1, seed=42)(tp)
dsrc = Datasets(tp, [tfms], splits=splits, dl_type=LMDataLoader)
dbunch_lm = dsrc.dataloaders(bs=bs, seq_len=sl, num_workers=8, pin_memory=True, shuffle_train=False)
learn_lm = language_model_learner(dbunch_lm,
AWD_LSTM,
config,
pretrained=False,
drop_mult=1.,
pretrained_fnames = [weights, vocab_lm],
metrics=[error_rate, accuracy, Perplexity()],
path=path
).to_fp16()
PS: With shuffle_train=True and fewer or no workers, it is even slower.
I suspect the Dataloader is taking too much time to process each text on the CPU before sending a batch to the GPU. Is there an option to force the early processing of the whole Dataset? Or can I do something to speed up the processing of the batch?
Edit: It indeed seems to be related to document size and the lazy loading approach. To experiment, I capped my text to 128 raw tokens (not yet tokenized) and now the GPU use is steady at 85% with no long periods at 0%. The lazy loading approach is very useful. I have been using it for a huge (100MM records) dataset of small texts and it works like a charm without OOM issues. But for medium or small Datasets of big texts (> 256 tokens) it starts to lose performance while training because it underutilizes the GPU.
tp['capped_text'] = tp['lm_text'].apply(lambda x: " ".join([i for i in x.split()][:128]))
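For anyone wondering why a bigger cache helps with long documents, here is a minimal pure-Python sketch. It is not fastai code: it assumes the loader keeps recently processed documents in an LRU cache and re-runs the transforms on a miss, and it uses a hypothetical `count_tokenizations` helper with a toy access pattern. The point is only that a cache smaller than the set of documents being revisited can miss on every single access.

```python
from functools import lru_cache

def count_tokenizations(docs, access_pattern, cache_size):
    """Count how often the (expensive) tokenizer really runs for a given cache size.

    Hypothetical helper for illustration; `docs` is a list of strings and
    `access_pattern` is the order in which document indices are requested.
    """
    calls = {"n": 0}

    @lru_cache(maxsize=cache_size)
    def tokenize(doc_idx):
        calls["n"] += 1                      # only incremented on a cache miss
        return tuple(docs[doc_idx].split())  # stand-in for tokenize + numericalize

    for idx in access_pattern:
        tokenize(idx)
    return calls["n"]

docs = ["a b c d"] * 4
# Toy pattern: four long documents, each revisited three times in interleaved order.
access = [0, 1, 2, 3] * 3

small = count_tokenizations(docs, access, cache_size=2)
large = count_tokenizations(docs, access, cache_size=4)
print(small, large)
```

With the cache too small for the working set, every access evicts a document that is needed again shortly after, so the tokenizer runs on every request. Once the cache covers the documents in flight, each one is processed once.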
@much_learner did you manage to get this working? From your posts I'm guessing you're doing the Kaggle Google Quest comp? I hit some of the same problems as you, and I'm getting the error about the targets now. I need to look a little closer at why the y is returning a tuple…
No, I haven't. Yes, that's the Quest comp.
I think there's no ready API for multi-labelled regression in text yet, so we need to add our own Block class, and also a show method.
If you guys want some help getting started with building something custom, or run into trouble, let me know.
Thanks! Yeah, I'm starting to bang my head against the wall here.
The objective is, given a piece of text, to predict 30 different scores for that text: essentially how people rated it on different traits.
I used RegressionBlock(c_out=len(label_cols)) as it gave the correct number of outputs (30) in the final layer of the model. Using TransformBlock gave a final layer output of 21000…
Below is the text block I built. You can ignore mmgTokenizer; I had to tweak something (wrap one thing in str()) in Tokenizer to get it to work with the SentencePiece tokenizer.
class mmgTextBlock(TransformBlock):
    def __init__(self, tok_tfm, vocab=None, is_lm=False, seq_len=72):
        return super().__init__(type_tfms=[tok_tfm, Numericalize(vocab)],
                                dl_type=LMDataLoader if is_lm else SortedDL,
                                dls_kwargs={} if is_lm else {'before_batch': partial(pad_input_chunk, seq_len=seq_len)})

    @classmethod
    @delegates(Tokenizer.from_df, keep=True)
    def from_df(cls, text_cols, vocab=None, is_lm=False, seq_len=72, **kwargs):
        return cls(mmgTokenizer.from_df(text_cols,  # <--- Changed here, added mmgTokenizer so can use mmgSentencePieceTokenizer
                                        tok_func=mmgSentencePieceTokenizer,
                                        model_type='bpe',
                                        special_toks=all_special_toks,
                                        **kwargs),
                   vocab=vocab, is_lm=is_lm, seq_len=seq_len)
dbch = DataBlock(blocks=(mmgTextBlock.from_df(vocab=lm_vocab,
text_cols="doc"),
RegressionBlock(c_out=len(label_cols))), #TransformBlock ),
get_x=ColReader('text'),
get_y=ColReader(list(label_cols)),
splitter=RandomSplitter(0.2, seed=42),
dl_type=SortedDL).dataloaders(quest_trn, verbose=True)
And the learner:
learn=text_classifier_learner(dbch, AWD_LSTM, loss_func=torch.nn.BCEWithLogitsLoss(),
cbs=[SaveModelCallback()],
opt_func=opt_func, cb_funcs=cb_funcs,
seq_len=72, config=awd_lstm_clas_config, pretrained=True,
drop_mult=0.5, n_out=None,
lin_ftrs=None, ps=None, max_len=72*20)
Any help would be amazing, right now the above gives me:
Target size (torch.Size([64, 30])) must be the same as input size (torch.Size([30]))
So I don't think the dataloader is assembling the data correctly.
First thing I'd look at is the raw batches that come out after your transforms. For a hint of how to do this, look at the code for dblock.summary() and what it's doing to transform everything. This way you can also see what your data looks like at every step (I've also talked about how to do this on the forums if you look for that).
Edit: @morgan just realized it was actually in a conversation. Here's the bit to look at:
for f in dls.train.after_item:
    name = f.name
    x = f(x)
    print(x[1])
So that just returns ToTensor (I commented out the x)
for i,v in enumerate(dbch.train):
    print(i)
    print(v[0].size())
    print(v[1][0].size())
    print(v[1][1].size())
OUTPUT:
0
torch.Size([64, 2049])
torch.Size([30])
torch.Size([30])
1
torch.Size([64, 3783])
torch.Size([30])
torch.Size([30])
Printing the size of the batch outputs, it looks like it's only serving up 1 y tensor value instead of 64…
Sorry I meant to show this entire thing:
for i in range(len(dsets.train)):
    x = dsets.train[i]
    for f in dls.train.after_item:
        name = f.name
        x = f(x)
        print(name, x)
The goal here is to print each transform and its respective output (I also recommend just doing the first item in dsets.train). You can then check size, etc. too; x[0] is your input, x[1] is your y.
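The same idea works outside fastai too, so here is a self-contained sketch of it in plain Python. Everything in it is made up for illustration (the `trace_transforms` helper, the toy `tokenize` and `numericalize` functions, and the tiny `VOCAB`); it only shows the pattern of pushing one item through each transform in turn and recording what comes out.

```python
def trace_transforms(item, transforms):
    """Apply each transform in order and record (name, output) after every step.

    Hypothetical helper, not part of fastai.
    """
    trace = []
    for f in transforms:
        item = f(item)
        name = getattr(f, "__name__", type(f).__name__)
        trace.append((name, item))
    return trace

# Toy stand-ins for real item transforms; each takes and returns an (x, y) pair.
VOCAB = {"good": 0, "answer": 1}

def tokenize(sample):
    text, y = sample
    return text.split(), y

def numericalize(sample):
    tokens, y = sample
    return [VOCAB[t] for t in tokens], y

trace = trace_transforms(("good answer", [0.5, 1.0]), [tokenize, numericalize])
for name, out in trace:
    # out[0] is the input after this step, out[1] is the target
    print(name, out)
```

If the per-item outputs look right at every step, as they did here, the problem is more likely in how items are collated into a batch or in something that touches the predictions afterwards.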
Gotcha, thanks. Each item looks ok, so it must be something with how the batch is constructed… will keep looking.
ToTensor
torch.Size([754])
torch.Size([30])
ToTensor
torch.Size([2120])
torch.Size([30])
ToTensor
torch.Size([883])
torch.Size([30])
EDIT: I was using the RNNRegularizer callback twice by mistake. It comes built into text_classifier_learner, but then I added it to my learner as another callback… don't do this.
PROGRESS!
Does anyone know why the 'after_pred' callback in the one_batch function below in learner.py might be called twice?
I think it's the reason why I'm only getting a single set of predictions passed to the loss function instead of predictions for the entire batch, resulting in the "Target size (torch.Size([32, 29])) must be the same as input size (torch.Size([29]))" error.
def one_batch(self, i, b):
    self.iter = i
    try:
        self._split(b); self('begin_batch')
        self.pred = self.model(*self.xb); self('after_pred')  # <--- HERE: gets called twice
        if len(self.yb) == 0: return
        self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')
RNNRegularizer
This is where the after_pred callback is called. You can see it modifies self.pred to only take the value at the first index. So far so good: this returns our 32 predictions for each of the 29 classes, giving self.learn.pred a shape of torch.Size([32, 29])
class RNNRegularizer(Callback):
    "`Callback` that adds AR and TAR regularization in RNN training"
    def __init__(self, alpha=0., beta=0.): self.alpha,self.beta = alpha,beta
    def after_pred(self):
        self.raw_out,self.out = self.pred[1],self.pred[2]
        self.learn.pred = self.pred[0]
However, this then gets called a second time on the new predictions, which reduces self.pred to torch.Size([29]).
It would mean the callback is present twice. You can check learn.cbs to be sure.
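To make the failure mode concrete, here is a small pure-Python sketch. It does not use fastai: `TakeFirst` is a made-up stand-in for a callback whose after_pred keeps only pred[0], and `duplicated_callbacks` is a hypothetical helper that does the same kind of check you would do by eye on learn.cbs. Nested lists stand in for tensors.

```python
from collections import Counter

class TakeFirst:
    "Hypothetical stand-in for a callback whose after_pred keeps only pred[0]."
    def after_pred(self, pred):
        return pred[0]

def run_after_pred(cbs, pred):
    """Run every callback's after_pred in order, as a training loop would."""
    for cb in cbs:
        pred = cb.after_pred(pred)
    return pred

def duplicated_callbacks(cbs):
    """Return the class names that appear more than once in a callback list."""
    counts = Counter(type(cb).__name__ for cb in cbs)
    return [name for name, n in counts.items() if n > 1]

# Model output as a tuple: (predictions of shape 32 x 29, raw outputs, outputs).
batch_preds = [[0.0] * 29 for _ in range(32)]
model_output = (batch_preds, "raw_outputs", "outputs")

once = run_after_pred([TakeFirst()], model_output)
twice = run_after_pred([TakeFirst(), TakeFirst()], model_output)

print(len(once), len(once[0]))   # full batch: 32 rows of 29 scores
print(len(twice))                # only the first row is left: 29 scores
print(duplicated_callbacks([TakeFirst(), TakeFirst()]))
```

Applied once, the indexing unpacks the tuple and keeps the whole batch. Applied twice, the second pass indexes into the batch itself and keeps only the first row, which matches the [32, 29] versus [29] mismatch in the error.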
Ah yes, that's it, thank you! I had added it as an additional callback when I copied the plain Learner from the wikitext tutorial, but it looks like text_classifier_learner already has it built in.
Man, I learned far more about the guts of fastai2 this week than I had planned.
Note that there is a method that shows what happens during the training loop, to help you debug those kinds of situations. It's called learn.show_training_loop().
This one is a user contribution actually, so not from Jeremy and me.
EDIT : SOLVED
Sorry I missed a commit from today, I was missing ModelReseter from my cbs. For reference this is my code that works now for training a language model:
lm_model = get_language_model(AWD_LSTM, vocab_sz=len(lm_vocab), config=awd_lstm_lm_config)
lm_opt_func = partial(Adam, wd=0.1, eps=1e-7)
lm_cbs = [MixedPrecision(clip=0.1), ModelReseter, RNNRegularizer(alpha=2, beta=1)]
lm_learn = Learner(dls=lmdls, model=lm_model, loss_func=CrossEntropyLossFlat(),
opt_func=lm_opt_func, cbs=lm_cbs,
metrics=[accuracy, Perplexity()])
Looks like a recent change might have changed something in the language model setup? I trained a LM a few days ago with the same code, but now when I run lr_find() I get the error below:
Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu
I checked and it appears that my batches from the dataloader are on the GPU, so I think the RNN weights don't get moved to the GPU as expected. This is how I have defined the model:
lm_model = get_language_model(AWD_LSTM, vocab_sz=len(lm_vocab), config=awd_lstm_lm_config)
lm_opt_func = partial(Adam, wd=0.1, eps=1e-7)
lm_cbs = [SaveModelCallback(), RNNRegularizer(alpha=2, beta=1)]
lm_learn = Learner(dls=lmdls, model=lm_model, loss_func=CrossEntropyLossFlat(),
opt_func=lm_opt_func, cbs=lm_cbs,
metrics=[accuracy, Perplexity()]).to_fp16()
And this is the stacktrace:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-39-800e454c6490> in <module>
----> 1 lm_learn.lr_find()
~/fastai2/fastai2/callback/schedule.py in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggestions)
196 n_epoch = num_it//len(self.dls.train) + 1
197 cb=LRFinder(start_lr=start_lr, end_lr=end_lr, num_it=num_it, stop_div=stop_div)
--> 198 with self.no_logging(): self.fit(n_epoch, cbs=cb)
199 if show_plot: self.recorder.plot_lr_find()
200 if suggestions:
~/fastai2/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
292 try:
293 self.epoch=epoch; self('begin_epoch')
--> 294 self._do_epoch_train()
295 self._do_epoch_validate()
296 except CancelEpochException: self('after_cancel_epoch')
~/fastai2/fastai2/learner.py in _do_epoch_train(self)
267 try:
268 self.dl = self.dls.train; self('begin_train')
--> 269 self.all_batches()
270 except CancelTrainException: self('after_cancel_train')
271 finally: self('after_train')
~/fastai2/fastai2/learner.py in all_batches(self)
245 def all_batches(self):
246 self.n_iter = len(self.dl)
--> 247 for o in enumerate(self.dl): self.one_batch(*o)
248
249 def one_batch(self, i, b):
~/fastai2/fastai2/learner.py in one_batch(self, i, b)
251 try:
252 self._split(b); self('begin_batch')
--> 253 self.pred = self.model(*self.xb); self('after_pred')
254 if len(self.yb) == 0: return
255 self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')
~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
539 result = self._slow_forward(*input, **kwargs)
540 else:
--> 541 result = self.forward(*input, **kwargs)
542 for hook in self._forward_hooks.values():
543 hook_result = hook(self, input, result)
~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
90 def forward(self, input):
91 for module in self._modules.values():
---> 92 input = module(input)
93 return input
94
~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
539 result = self._slow_forward(*input, **kwargs)
540 else:
--> 541 result = self.forward(*input, **kwargs)
542 for hook in self._forward_hooks.values():
543 hook_result = hook(self, input, result)
~/fastai2/fastai2/text/models/awdlstm.py in forward(self, inp, from_embeds)
100 new_hidden,raw_outputs,outputs = [],[],[]
101 for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
--> 102 raw_output, new_h = rnn(raw_output, self.hidden[l])
103 new_hidden.append(new_h)
104 raw_outputs.append(raw_output)
~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
539 result = self._slow_forward(*input, **kwargs)
540 else:
--> 541 result = self.forward(*input, **kwargs)
542 for hook in self._forward_hooks.values():
543 hook_result = hook(self, input, result)
~/fastai2/fastai2/text/models/awdlstm.py in forward(self, *args)
49 #To avoid the warning that comes because the weights aren't flattened.
50 warnings.simplefilter("ignore")
---> 51 return self.module.forward(*args)
52
53 def reset(self):
~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
562 return self.forward_packed(input, hx)
563 else:
--> 564 return self.forward_tensor(input, hx)
565
566
~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/rnn.py in forward_tensor(self, input, hx)
541 unsorted_indices = None
542
--> 543 output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
544
545 return output, self.permute_hidden(hidden, unsorted_indices)
~/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/rnn.py in forward_impl(self, input, hx, batch_sizes, max_batch_size, sorted_indices)
524 if batch_sizes is None:
525 result = _VF.lstm(input, hx, self._get_flat_weights(), self.bias, self.num_layers,
--> 526 self.dropout, self.training, self.bidirectional, self.batch_first)
527 else:
528 result = _VF.lstm(input, batch_sizes, hx, self._get_flat_weights(), self.bias,
RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu
> /home/morgan/anaconda3/envs/fastai2_me/lib/python3.7/site-packages/torch/nn/modules/rnn.py(526)forward_impl()
524 if batch_sizes is None:
525 result = _VF.lstm(input, hx, self._get_flat_weights(), self.bias, self.num_layers,
--> 526 self.dropout, self.training, self.bidirectional, self.batch_first)
527 else:
528 result = _VF.lstm(input, batch_sizes, hx, self._get_flat_weights(), self.bias,
Anyone else encountered this?
EDIT: yes it is expected
~ ~ ~ ~ ~ ~ ~
Expected Behaviour? - SortedDL Transform returns reversed order
I've been getting terrible results for my first submissions to Kaggle (Google QUEST) using fastai2. It turns out it's because my test submissions are backwards. I am using SortedDL in my dataloader and was just wondering if this is by design? The original order is restored when I set reverse=False in get_idxs in SortedDL.
@delegates(TfmdDL)
class SortedDL(TfmdDL):
    def __init__(self, dataset, sort_func=None, res=None, **kwargs):
        super().__init__(dataset, **kwargs)
        self.sort_func = _default_sort if sort_func is None else sort_func
        if res is None and self.sort_func == _default_sort: res = _get_lengths(dataset)
        self.res = [self.sort_func(self.do_item(i)) for i in range_of(self.dataset)] if res is None else res
        self.idx_max = np.argmax(self.res)

    def get_idxs(self):
        idxs = super().get_idxs()
        if self.shuffle: return idxs
        return sorted(idxs, key=lambda i: self.res[i], reverse=True)  # <- SHOULD IT BE DEFAULT TRUE?
Dataloaders setup here:
bs = 32
sl = 72
tst_splits = [list(range(len(tst_df))),list(range(len(tst_df)))]
x_tfms = [attrgetter('doc'), tok_fn, Numericalize(vocab=lm_vocab)]
tst_cls_dsets = Datasets(tst_df, splits=tst_splits, tfms=[x_tfms], dl_type=SortedDL)
tst_cls_dls = tst_cls_dsets.dataloaders(bs=bs, sl=sl, shuffle_train=False, device='cuda',
before_batch=pad_input_chunk, dl_type=SortedDL)
The reverse=True means the samples are sorted by length from longest to shortest. This is because we need the biggest batch first for CUDA optimization of memory.
You will need to unsort them after you get the results of get_preds, yes (as you can see, the permutation is given by dl.get_idxs(); you just need to invert it).
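For reference, undoing the sort can be sketched in plain Python like this. It assumes the sorted index list has the same meaning as in get_idxs above (position k of the sorted output came from original item idxs[k]); the helper names `sort_indices_by_length` and `unsort` are made up for the example and are not fastai functions.

```python
def sort_indices_by_length(lengths):
    """Indices ordered from longest to shortest, mirroring sorted(..., reverse=True)."""
    return sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

def unsort(sorted_preds, idxs):
    """Put predictions back in original order.

    sorted_preds[k] is the prediction for original item idxs[k].
    """
    restored = [None] * len(idxs)
    for position, original_index in enumerate(idxs):
        restored[original_index] = sorted_preds[position]
    return restored

texts = ["aa", "a", "aaaa", "aaa"]
idxs = sort_indices_by_length([len(t) for t in texts])

# Pretend predictions, produced in length-sorted order as the dataloader would serve them.
sorted_preds = [f"pred({texts[i]})" for i in idxs]

restored = unsort(sorted_preds, idxs)
print(idxs)
print(restored)
```

After this step, restored[i] lines up with row i of the original test dataframe again, which is what a submission file needs.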
Ah understood, thanks!