Hey everyone,
I have looked through the forums and wasn’t able to find any solution to my problem.
To provide context, I am trying to build an abstractive summarization model, so I thought about fine-tuning BART from the Hugging Face library. The problem is that when creating my dataloaders I am running into two issues:
- When I declare the tokenizer in the DataBlock, it takes a very long time to run. Running the tokenizer on the text by itself took about 2 minutes, whereas going through the DataBlock it has been 45 minutes and it's still running:
```python
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
dblock = DataBlock(
    blocks=(TextBlock.from_df('paper', seq_len=sl, tok=tokenizer),
            TextBlock.from_df('summary', seq_len=sl, tok=tokenizer)),
    get_x=ColReader('text'), get_y=ColReader('text'),
    splitter=RandomSplitter(0.2))
```
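To make concrete what I mean by the overhead: my understanding is that fastai's tokenization pipeline calls the tokenizer once per row (plus its own rules), while calling the Hugging Face tokenizer directly can process the whole column in one batched call. Here is a minimal sketch with a stub tokenizer (not the real BartTokenizer; the call counts are just illustrative):

```python
# Stub tokenizer standing in for a Hugging Face tokenizer: maps words to ids.
# Only meant to contrast one-call-per-row vs one-call-per-column.

class StubTokenizer:
    def __init__(self):
        self.vocab = {}
        self.calls = 0  # count how many times we cross into the tokenizer

    def encode(self, text):
        self.calls += 1
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]

    def batch_encode(self, texts):
        self.calls += 1
        return [[self.vocab.setdefault(w, len(self.vocab)) for w in t.split()]
                for t in texts]

texts = ["a b c", "b c d", "c d e"]

tok1 = StubTokenizer()
per_item = [tok1.encode(t) for t in texts]  # one call per row, like the pipeline

tok2 = StubTokenizer()
batched = tok2.batch_encode(texts)          # one call for the whole column

print(tok1.calls, tok2.calls)  # → 3 1
```

Both paths produce the same ids here; the difference is only in how many times the tokenizer boundary is crossed, which is where I suspect my 2 min vs 45 min gap comes from.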
- When I instead declare a separate Transform class (for both encoding and decoding) or a function (encoding only), the DataBlock falls back to the standard fastai tokenizer instead of mine:
```python
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        # pass already-encoded tensors through; otherwise encode with the HF tokenizer
        return x if isinstance(x, Tensor) else tensor(self.tokenizer.encode(x))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

txtblock = TextBlock(TransformersTokenizer(tokenizer), seq_len=sl)
dblock = DataBlock(
    blocks=(txtblock.from_df('paper', seq_len=sl),
            txtblock.from_df('summary', seq_len=sl)),
    get_x=ColReader('text'), get_y=ColReader('text'),
    splitter=RandomSplitter(0.2))
```
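For reference, this is the round-trip behaviour I expect from `TransformersTokenizer`, sketched with a stub in place of the real tokenizer and without the fastai `Transform` machinery (names like `StubTokenizer` are mine, just for illustration):

```python
# Stub tokenizer: fixed word -> id vocabulary, plus the inverse mapping.
class StubTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab                            # token -> id
        self.inv = {i: t for t, i in vocab.items()}   # id -> token

    def encode(self, text):
        return [self.vocab[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)

class TransformersTokenizer:
    """encodes: str -> ids; decodes: ids -> str (mirrors my fastai Transform)."""
    def __init__(self, tokenizer): self.tokenizer = tokenizer

    def encodes(self, x):
        # already-encoded inputs pass through unchanged
        return x if isinstance(x, list) else self.tokenizer.encode(x)

    def decodes(self, x):
        return self.tokenizer.decode(x)

tok = TransformersTokenizer(StubTokenizer({"the": 0, "paper": 1, "summary": 2}))
ids = tok.encodes("the paper summary")
print(ids)               # → [0, 1, 2]
print(tok.decodes(ids))  # → the paper summary
```

This round-trip works fine on its own, which is why I think the problem is in how I wire the Transform into `TextBlock`/`DataBlock`, not in the Transform itself.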
Any help would be appreciated!