Changing the tokenizer default language in text data

I’m trying to train a language model. By default the tokenizer uses English.
To set it to French, I first defined a custom tokenizer:

spacy_fr = partial(WordTokenizer, lang='fr')

Then I tried the following, without success:

  1. dls1 = TextDataLoaders.from_df(df, text_col='review_body', is_lm=True, tok_tfm=spacy_fr)

I got this error: AttributeError: 'Series' object has no attribute 'text'

  2. lm = DataBlock(blocks=TextBlock.from_df('review_body', is_lm=True, tok=spacy_fr), ...)
    dls = lm.dataloaders(df, bs=16, seq_len=72)
    This time the error is: TypeError: __init__() got multiple values for argument 'lang'

Could you please share any ideas on how to make this work?

Hi, I found one mistake I made.
I shouldn’t have used partial. So:

spacy_fr = WordTokenizer(lang='fr')

works for case 2 but not for case 1, where the error is still the same.
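
For anyone hitting the same TypeError, here is a minimal sketch of what probably goes wrong with partial. It uses a made-up stand-in class, ToyTokenizer, rather than the real WordTokenizer, and it assumes the library ends up calling the tok object on a batch of texts. Under that assumption, a partial is still a constructor, so the texts land in the lang slot:

```python
from functools import partial


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer class; not the fastai implementation."""

    def __init__(self, lang="en"):
        self.lang = lang

    def __call__(self, texts):
        # Split on whitespace; a real tokenizer does much more.
        return [t.split() for t in texts]


texts = ["bonjour le monde"]

# An instance is callable on a batch of texts.
tok_instance = ToyTokenizer(lang="fr")
tokens = tok_instance(texts)

# A partial is still a constructor: calling it on texts passes them
# positionally as `lang`, which clashes with the lang="fr" keyword.
tok_partial = partial(ToyTokenizer, lang="fr")
try:
    tok_partial(texts)
    error_message = None
except TypeError as err:
    error_message = str(err)

print(tokens)
print(error_message)
```

If that assumption holds, passing an already constructed instance is the right fix, since the instance is what gets called on the texts.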


Not sure where the error is coming from exactly.

But this is likely related to the transform writing the tokenized data to a new column called "text" (see the tok_text_col argument).

Note that in #2, get_x reads from the "text" column: ColReader('text')
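
The AttributeError message itself can be reproduced with pandas alone, which may help narrow things down. This is only a sketch: it assumes the reader looks the column up by attribute on each row, which I have not confirmed in the fastai source, and the DataFrame contents are made up.

```python
import pandas as pd

# Made-up data with only the raw text column, as before tokenization.
df = pd.DataFrame({"review_body": ["très bon produit", "mauvaise qualité"]})
row = df.iloc[0]  # a single row is a pandas Series

# Attribute lookup fails because there is no "text" column yet.
try:
    getattr(row, "text")
    message = None
except AttributeError as err:
    message = str(err)

# Once a "text" column exists, the same lookup succeeds.
df["text"] = df["review_body"].str.split()
tokens = getattr(df.iloc[0], "text")

print(message)
print(tokens)
```

So if the tokenization step never runs, or writes to a different column, a reader that expects "text" would fail in exactly this way.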

You’re right. I’ll have to dig deeper, as it seems to be a bug.
I tried passing in a function that raises an exception when it’s called.
But when I execute:

TextDataLoaders.from_df(df, text_col='review_body', is_lm=True, tok_tfm=raise_exception_function)

nothing happens. My first guess is that the new tokenizer is not called at all; otherwise I would have seen the exception my function raises!