Hey everyone,
I have looked through the forums and wasn’t able to find any solution to my problem.
To provide context, I am trying to build an abstractive summarization model, so I thought about fine-tuning BART from the Hugging Face library. The problem is that when creating my dataloaders I am running into two issues:
- When I declare the tokenizer in the DataBlock, it takes a very long time to run. Running the tokenizer on the text by itself took about 2 minutes, whereas going through the DataBlock it has been 45 minutes and it's still running:
```python
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
dblock = DataBlock(
    blocks=(TextBlock.from_df('paper', seq_len=sl, tok=tokenizer),
            TextBlock.from_df('summary', seq_len=sl, tok=tokenizer)),
    get_x=ColReader('text'), get_y=ColReader('text'),
    splitter=RandomSplitter(0.2))
```
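To make concrete what I mean by the overhead: my understanding is that fastai's tokenization pipeline calls the tokenizer once per row (plus its own rules), while calling the Hugging Face tokenizer directly can process the whole column in one batched call. Here is a minimal sketch with a stub tokenizer (not the real BartTokenizer; the call counts are just illustrative):

```python
# Stub tokenizer standing in for a Hugging Face tokenizer: maps words to ids.
# Only meant to contrast one-call-per-row vs one-call-per-column.

class StubTokenizer:
    def __init__(self):
        self.vocab = {}
        self.calls = 0  # count how many times we cross into the tokenizer

    def encode(self, text):
        self.calls += 1
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]

    def batch_encode(self, texts):
        self.calls += 1
        return [[self.vocab.setdefault(w, len(self.vocab)) for w in t.split()]
                for t in texts]

texts = ["a b c", "b c d", "c d e"]

tok1 = StubTokenizer()
per_item = [tok1.encode(t) for t in texts]  # one call per row, like the pipeline

tok2 = StubTokenizer()
batched = tok2.batch_encode(texts)          # one call for the whole column

print(tok1.calls, tok2.calls)  # → 3 1
```

Both paths produce the same ids here; the difference is only in how many times the tokenizer boundary is crossed, which is where I suspect my 2 min vs 45 min gap comes from.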
- When I instead declare a separate Transform class (for both encoding and decoding) or a function (encoding only), the DataBlock falls back to the standard fastai tokenizer instead of mine:
```python
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        # pass already-encoded tensors through; otherwise encode with the HF tokenizer
        return x if isinstance(x, Tensor) else tensor(self.tokenizer.encode(x))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

txtblock = TextBlock(TransformersTokenizer(tokenizer), seq_len=sl)
dblock = DataBlock(
    blocks=(txtblock.from_df('paper', seq_len=sl),
            txtblock.from_df('summary', seq_len=sl)),
    get_x=ColReader('text'), get_y=ColReader('text'),
    splitter=RandomSplitter(0.2))
```
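For reference, this is the round-trip behaviour I expect from `TransformersTokenizer`, sketched with a stub in place of the real tokenizer and without the fastai `Transform` machinery (names like `StubTokenizer` are mine, just for illustration):

```python
# Stub tokenizer: fixed word -> id vocabulary, plus the inverse mapping.
class StubTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab                            # token -> id
        self.inv = {i: t for t, i in vocab.items()}   # id -> token

    def encode(self, text):
        return [self.vocab[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)

class TransformersTokenizer:
    """encodes: str -> ids; decodes: ids -> str (mirrors my fastai Transform)."""
    def __init__(self, tokenizer): self.tokenizer = tokenizer

    def encodes(self, x):
        # already-encoded inputs pass through unchanged
        return x if isinstance(x, list) else self.tokenizer.encode(x)

    def decodes(self, x):
        return self.tokenizer.decode(x)

tok = TransformersTokenizer(StubTokenizer({"the": 0, "paper": 1, "summary": 2}))
ids = tok.encodes("the paper summary")
print(ids)               # → [0, 1, 2]
print(tok.decodes(ids))  # → the paper summary
```

This round-trip works fine on its own, which is why I think the problem is in how I wire the Transform into `TextBlock`/`DataBlock`, not in the Transform itself.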
Any help would be appreciated!