CharTokenizer in fastai2

Hi everyone,

I’d like to implement a Char-LSTM in fastai2. I’m struggling a bit with how to wrap the CharTokenizer into the Tokenizer class for use on dataframes. The code below results in the error "Could not do one pass in your dataloader, there is something wrong in it". I’ve tried playing with the text.default settings, but I haven’t gotten it to work yet. Does anyone see where it’s going wrong?

Thanks!

from fastai.text.all import *
class CharTokenizer():
    def __init__(self, lang='en', special_tokens=None):
        self.lang = lang
        self.special_tokens = special_tokens

    def __call__(self, seq):
        # split each string into a list of its characters
        return (list(s) for s in seq)

tok = CharTokenizer()
txt = ['testing', 'anothertest']
# test tokenization
first(CharTokenizer()(txt))

#Wrap into Tokenizer class
tok = Tokenizer(CharTokenizer, rules=[])

#Make test dataframe
df1 = pd.DataFrame({'test_text' : ['Some_text']*50 + ['Some_other_text']*50,
                    'test_value' : ['No']*50 + ['Yes']*50})

dls1 = TextDataLoaders.from_df(df1, text_col='test_text', tok=tok)


Did you try something like the below?

tok = CharTokenizer()
dls1 = TextDataLoaders.from_df(df1, text_col='test_text', tok=tok)

This won’t generate a vocab for you though (if you need it). You’d have to use Numericalize (or pull some code from it) to do that.
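
Something like this rough sketch might work for building a vocab, assuming Numericalize’s setup accepts a plain collection of already-tokenized texts (min_freq=1 here so every character is kept; the names are just for illustration):

toks = L(list(s) for s in ['testing', 'anothertest'])
num = Numericalize(min_freq=1)
num.setup(toks)           # builds the vocab from the character counts
num.vocab                 # special tokens followed by the characters seen
num(list('test'))         # TensorText of character ids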

Hi Morgan, thanks for your reply. Just tried that, but it fails with the same error unfortunately.

It seems to work when I use a DataBlock instead of directly calling TextDataLoaders.from_df.

dblock = DataBlock(blocks=TextBlock(tok_tfm=tok),
                   getters=[ColReader('test_text')],
                   splitter=RandomSplitter())
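
The DataLoaders can then be built from the DataBlock as usual (the batch size is just an example):

dls1 = dblock.dataloaders(df1, bs=8)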
