I have some trouble with my data loading for the classifier. It works, but it seems to repeat tokenization (already done for the language model). Currently I do:
data_label_blocks = (TextBlock.from_folder(path=unsupervised_folder, vocab=self.vocab),
MultiCategoryBlock(vocab=my_classes))
dsrc = DataBlock(blocks=data_label_blocks,
splitter=RandomSplitter(),
get_x=lambda x: unsupervised_folder / f'{x[0]}.txt',
get_y=lambda x: x[1].split(' '),
dl_type=SortedDL)
But I already have the data tokenized in another folder! Very naively I tried to pass that folder but then Fastai created a tokenized version of the tokenized version (which is mostly right, but not quite, and still repeating work). I suppose there is some option to say “skip tokenization”? I tried some silly ideas like TextBlock(None, vocab=self.vocab)
but no luck so far.
Update: We have been exploring the code and perhaps this is already fixed by design?
From the Tokenizer class in fastai2/text/core.py we have:
@delegates(tokenize_folder, keep=True)
def from_folder(cls, path, tok_func=SpacyTokenizer, **kwargs):
path = Path(path)
output_dir = Path(ifnone(kwargs.get('output_dir'), path.parent/f'{path.name}_tok'))
if not output_dir.exists(): tokenize_folder(path, **kwargs)
res = cls(get_tokenizer(tok_func, **kwargs), counter=(output_dir/fn_counter_pkl).load(),
lengths=(output_dir/fn_lengths_pkl).load(), mode='folder')
res.path,res.output_dir = path,output_dir
return res
if not output_dir.exists(): tokenize_folder(path, kwargs)
So it seems like no double work is being done