Custom Tokenizer - Text Block

Hello.

I’ve been trying to implement a custom tokenizer to work with DataBlock and TextBlock (see below). I keep getting an error when trying to generate a DataLoaders. Does anyone have experience with custom tokenizers and can point me in the right direction? Here is a Jupyter Notebook with the code in question and the error I’m seeing.

class CharacterTokenizer():
    
    def __init__(self, split_char=' ', **kwargs): 
        self.split_char=split_char
        
    def __call__(self, items):
        
        # List where I temporarily store the tokens ['xxbos', 'h', 'e', 'l', 'l', 'o', 'xxeos'] as
        # they are being parsed.
        final_list = []
        
        # We don't want to mess with the special fastai tokens
        special_chars = ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj']
        
        # Break the string up into words; if a word is in special_chars, don't touch it. Otherwise
        # break each word up into its characters.
        for words in items:
            for word in words.split():
                if word not in special_chars:
                    for char in word:
                        final_list.append([char])
                else:
                    final_list.append([word])

        # Return a generator?? I'm not sure why we need to do this...        
        return (x for x in final_list)
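For reference, calling the tokenizer directly on a toy batch runs without errors, so the problem seems to be in how the output is shaped rather than in the call itself:

tok = CharacterTokenizer()
print(list(tok(['hello'])))
# [['h'], ['e'], ['l'], ['l'], ['o']]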

logs = DataBlock(
        
    blocks=(
        TextBlock.from_df('from_txt', is_lm=False, tok=CharacterTokenizer()),
        TextBlock.from_df('to_txt'  , is_lm=False, tok=CharacterTokenizer())),
    
    # The TextBlock tokenization process puts tokenized inputs into a column called text.
    # The ColReader for get_x will always reference text, even if the original text inputs
    # were in a column with another name in the dataframe.
    get_x=ColReader('text'),
    get_y=ColReader('text'),
    
    # The dataframe needs to have a is_valid column for this to work.
    splitter=ColSplitter()

)
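For context, df is a dataframe shaped like this toy example (the column names match the block above; the row values are made-up placeholders):

import pandas as pd

df = pd.DataFrame({
    'from_txt': ['hello world', 'good morning'],
    'to_txt':   ['dlrow olleh', 'gninrom doog'],
    'is_valid': [False, True],
})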



dls = logs.dataloaders(df)

This fails with:

Could not do one pass in your dataloader, there is something wrong in it

I figured it out.

TLDR: For a tokenizer to work within fastai, its __call__ needs to return a Python generator that yields one list of tokens per input item. The way I implemented this was by first building a plain Python list with one token list per input, and then wrapping that outer list in a generator expression to produce the final generator.
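Stripped of the special-token handling, the return shape fastai wants looks like this (a minimal sketch, not the full class below):

def char_tokenize(items):
    # Build one list of character tokens per input string...
    final_list = [list(item.replace(' ', '')) for item in items]
    # ...then hand them back wrapped in a generator expression
    return (t for t in final_list)

print(list(char_tokenize(['hello', 'hi'])))
# [['h', 'e', 'l', 'l', 'o'], ['h', 'i']]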

Below is my custom tokenizer class and here is a Jupyter Notebook with a working example on a toy dataset.

class CharacterTokenizer():
        
    def __call__(self, items):
        
        # List where I temporarily store the tokens ['xxbos', 'h', 'e', 'l', 'l', 'o', 'xxeos'] as
        # they are being parsed.
        final_list = []
        
        # We don't want to mess with the special fastai tokens
        special_chars = ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj']
        
        # Break the string up into words; if a word is in special_chars, don't touch it. Otherwise
        # break each word up into its characters.
        for words in items:
            tmp = []
            for word in words.split():
                if word not in special_chars:
                    for char in word:
                        tmp.append(char)
                else:
                    tmp.append(word)
            # tmp now holds each token: ['xxbos', 'xxmaj', 'h', 'e', 'l', 'l', 'o', ',', 'w', 'h', ...]
            # We collect each tmp list into final_list so we can wrap it all in a generator below
            final_list.append(tmp)
        
        # Returns a generator
        return (t for t in final_list)
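A quick sanity check on a toy batch (special tokens pass through whole, everything else is split into characters):

tok = CharacterTokenizer()
print(list(tok(['xxbos hello', 'hi'])))
# [['xxbos', 'h', 'e', 'l', 'l', 'o'], ['h', 'i']]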