Changing the tokenizer default language in text data

iskode · March 4, 2021, 2:53am

Hi,
I’m trying to train a language model. By default the tokenizer uses English.
To set it to French, I first defined a custom tokenizer:

spacy_fr = partial(WordTokenizer, lang='fr')

Then I tried the following, without success:

dls1 = TextDataLoaders.from_df(df, text_col='review_body', is_lm=True, tok_tfm=spacy_fr)

I got this error: AttributeError: 'Series' object has no attribute 'text'

My second attempt used a DataBlock:

lm = DataBlock(blocks=TextBlock.from_df('review_body', is_lm=True, tok=spacy_fr),
               get_x=ColReader('text'),
               splitter=RandomSplitter())
dls = lm.dataloaders(df, bs=16, seq_len=72)

This time the error is: TypeError: __init__() got multiple values for argument 'lang'

Could you please share any ideas on how to make this work?

iskode · March 4, 2021, 6:24am

Hi, I found one mistake I made.
I shouldn’t use partial. So:

spacy_fr = WordTokenizer(lang='fr')

This works for the second case (the DataBlock) but not for the first (TextDataLoaders.from_df), where the error is still the same.
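For anyone wondering why partial triggers the "multiple values for argument 'lang'" error, here is a minimal sketch in plain Python. It assumes (I have not traced this through the fastai source) that the pipeline expects a ready-made tokenizer instance and calls it on the texts. FakeWordTokenizer and run_tokenizer are hypothetical stand-ins, not fastai code.

```python
from functools import partial


class FakeWordTokenizer:
    """Hypothetical stand-in for a tokenizer class whose constructor takes `lang` first."""

    def __init__(self, lang="en"):
        self.lang = lang

    def __call__(self, items):
        # Split each text on whitespace; a real tokenizer does far more.
        return [text.split() for text in items]


def run_tokenizer(tok, items):
    """Assumed pipeline behavior: `tok` is called directly on the texts."""
    return tok(items)


texts = ["bonjour le monde"]

# An instance works: calling it tokenizes the texts.
tokens = run_tokenizer(FakeWordTokenizer(lang="fr"), texts)

# A partial of the class does not: calling it runs __init__(texts, lang="fr"),
# so `texts` lands in the `lang` slot and collides with the keyword argument.
try:
    run_tokenizer(partial(FakeWordTokenizer, lang="fr"), texts)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(tokens)
print(error_message)
```

If that assumption holds, it would explain why passing WordTokenizer(lang='fr') (an instance) works where partial(WordTokenizer, lang='fr') (a factory) does not.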

meanpenguin · March 6, 2021, 12:26am

Hi,

I’m not sure exactly where the error is coming from.

However, it is likely related to the transform writing the tokenized data to a new column called “text”. See the tok_text_col argument: https://docs.fast.ai/text.data.html#TextDataLoaders.from_df

Note that in your second attempt, get_x reads from the “text” column: ColReader('text')

iskode · March 8, 2021, 10:43pm

You’re right. I’ll have to dig deeper, as it seems to be a bug.
I tried passing in a function that raises an exception when it’s called.
But when executing:

TextDataLoaders.from_df(df, text_col='review_body', is_lm=True, tok_tfm=raise_exception_function)

nothing happens. My first guess is that the new tokenizer is not called at all; otherwise I would have seen the exception my function raises!