Hello.
I’ve been trying to implement a custom tokenizer to work with `DataBlock` and `TextBlock` (see below). I keep getting an error when trying to generate the `DataLoaders`. Does anyone have experience with custom tokenizers who can point me in the right direction? Here is a Jupyter Notebook with the code in question and the error I’m seeing.
```python
class CharacterTokenizer():
    def __init__(self, split_char=' ', **kwargs):
        self.split_char = split_char

    def __call__(self, items):
        # List where I temporarily store the tokens ['xxbos', 'h', 'e', 'l', 'l', 'o', 'xxeos']
        # as they are being parsed.
        final_list = []
        # We don't want to mess with the special fastai tokens.
        special_chars = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj']
        # Break up the string into words; if a word is in special_chars, don't touch it.
        # Otherwise break each word up into its individual characters.
        for words in items:
            for word in words.split():
                if word not in special_chars:
                    for char in word:
                        final_list.append([char])
                else:
                    final_list.append([word])
        # Return a generator?? I'm not sure why we need to do this...
        return (x for x in final_list)
```
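For reference, here is what the tokenizer produces when I call it directly (`char_tok` below is just a standalone copy of the `__call__` logic so the snippet runs on its own). It yields one single-character list per character rather than one token list per input string, which I suspect may be related to the error:

```python
# Standalone repro of CharacterTokenizer.__call__ from above.
special_chars = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj']

def char_tok(items):
    final_list = []
    for words in items:
        for word in words.split():
            if word not in special_chars:
                for char in word:
                    final_list.append([char])
            else:
                final_list.append([word])
    return (x for x in final_list)

print(list(char_tok(['xxbos hello xxeos'])))
# → [['xxbos'], ['h'], ['e'], ['l'], ['l'], ['o'], ['xxeos']]
```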
```python
logs = DataBlock(
    blocks=(
        TextBlock.from_df('from_txt', is_lm=False, tok=CharacterTokenizer()),
        TextBlock.from_df('to_txt',   is_lm=False, tok=CharacterTokenizer())),
    # The TextBlock tokenization process puts tokenized inputs into a column called `text`.
    # The ColReader for get_x will always reference `text`, even if the original text inputs
    # were in a column with another name in the dataframe.
    get_x=ColReader('text'),
    get_y=ColReader('text'),
    # The dataframe needs to have an `is_valid` column for this to work.
    splitter=ColSplitter()
)
```
```python
dls = logs.dataloaders(df)
```

This fails with:

```
Could not do one pass in your dataloader, there is something wrong in it
```