I have looked through the forums and wasn’t able to find any solution to my problem.
To provide context, I am trying to build an abstractive summarization model, so I thought about fine-tuning BART from the huggingface transformers library. The problem is that when creating my dataloaders I am facing two issues:
- When passing the tokenizer to the DataBlock, it takes a very long time to run. Running the tokenizer on the text separately took me about 2 min, whereas reading through the DataBlock has been going for 45 min and it's still running:
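For reference, the standalone timing was done roughly like this (a sketch: I've swapped the real BartTokenizer for a dummy whitespace tokenizer so the snippet runs on its own — in the actual run, `encode` is `tokenizer(text)['input_ids']` and `texts` is the 'paper' column of my DataFrame):

```python
import time

# Dummy stand-in for the huggingface tokenizer; the real call is
# tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
# followed by tokenizer(text)['input_ids'].
def encode(text):
    return text.split()

texts = ["some paper text"] * 1000  # stand-in for df['paper']

start = time.perf_counter()
token_ids = [encode(t) for t in texts]
elapsed = time.perf_counter() - start
print(f"tokenized {len(token_ids)} docs in {elapsed:.2f}s")
```

Timed this way, the tokenizer alone finishes quickly, which is why the DataBlock run time surprises me.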
```python
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

dblock = DataBlock(
    blocks=(TextBlock.from_df('paper', seq_len=sl, tok=tokenizer),
            TextBlock.from_df('summary', seq_len=sl, tok=tokenizer)),
    get_x=ColReader('text'),
    get_y=ColReader('text'),
    splitter=RandomSplitter(0.2))
```
- When declaring a separate class (for both encoding and decoding) or a function (for encoding only), it uses the standard fastai tokenizer instead of mine:
```python
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        return x if isinstance(x, Tensor) else tokenize(x)
    def decodes(self, x):
        return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

txtblock = TextBlock(TransformersTokenizer(tokenizer), seq_len=sl)

dblock = DataBlock(
    blocks=(txtblock.from_df('paper', seq_len=sl),
            txtblock.from_df('summary', seq_len=sl)),
    get_x=ColReader('text'),
    get_y=ColReader('text'),
    splitter=RandomSplitter(0.2))
```
Any help would be appreciated!