CharTokenizer in fastai2

Hi everyone,

I’d like to implement a Char-LSTM in fastai2. I’m struggling a bit with how to wrap the CharTokenizer into the Tokenizer class for use on dataframes. The code below results in the error "Could not do one pass in your dataloader, there is something wrong in it". I’ve tried playing with the text.default settings, but I haven’t gotten it to work yet. Does anyone see where it’s going wrong?

Thanks!

from fastai.text.all import *
class CharTokenizer():
    def __init__(self, lang='en', special_tokens=None):
        self.lang = lang
        self.special_tokens = special_tokens

    def __call__(self, seq):
        # split each string into a list of its characters
        return (list(s) for s in seq)

tok = CharTokenizer()
txt = ['testing', 'anothertest']
# test tokenization
first(CharTokenizer()(txt))

#Wrap into Tokenizer class
tok = Tokenizer(CharTokenizer, rules=[])

#Make test dataframe
df1 = pd.DataFrame({'test_text' : ['Some_text']*50 + ['Some_other_text']*50,
                    'test_value' : ['No']*50 + ['Yes']*50})

dls1 = TextDataLoaders.from_df(df1, text_col='test_text', tok=tok)


Did you try something like the below?

tok = CharTokenizer()
dls1 = TextDataLoaders.from_df(df1, text_col='test_text', tok=tok)

This won’t generate a vocab for you though (if you need it). You’d have to use Numericalize (or pull some code from it) to do that.
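
Something like this rough sketch might work for building a vocab, assuming Numericalize’s setup accepts a plain collection of already-tokenized texts (min_freq=1 here so every character is kept; the names are just for illustration):

toks = L(list(s) for s in ['testing', 'anothertest'])
num = Numericalize(min_freq=1)
num.setup(toks)           # builds the vocab from the character counts
num.vocab                 # special tokens followed by the characters seen
num(list('test'))         # TensorText of character ids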

Hi Morgan, thanks for your reply. Just tried that, but it fails with the same error unfortunately.

It seems to work when I use a DataBlock instead of directly calling TextDataLoaders.from_df.

dblock = DataBlock(blocks=TextBlock(tok_tfm=tok),
                   getters=[ColReader('test_text')],
                   splitter=RandomSplitter())
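
The DataLoaders can then be built from the DataBlock as usual (the batch size is just an example):

dls1 = dblock.dataloaders(df1, bs=8)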
