[SOLVED] Problem with TextList.from_folder - Reading the file paths instead of the file content


I am trying to create a language model in Portuguese from scratch based on the notebook in this link:


If I follow the complete tutorial without changing anything, it works well. However, since no tokenizer is passed to TextList.from_folder, I assume the default “en” tokenizer is used, which would be odd for Portuguese text. Is this assumption correct?

data = (TextList.from_folder(dest)
            .split_by_rand_pct(0.1, seed=42)
            .databunch(bs=bs, num_workers=6))

which gets me this data.batch:

I then tried to pass the correct tokenizer using the processor argument:

data = (TextList.from_folder(dest, processor=[TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor(vocab=60000)])
            .split_by_rand_pct(0.1, seed=42)
            .databunch(bs=bs, num_workers=6))

But when I check my data.batch I get this:

This means that the script is reading the file paths but not the file contents.

This error seems to be similar to what happened in this thread:

TextLMDataBunch.from_folder seems to be broken #1578

I tried to debug the problem myself, but I could not solve it. Can anyone help?


You need to pass an OpenFileProcessor() before your tokenize processor.
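To illustrate why the order matters, here is a minimal plain-Python sketch of a processor chain. It is not fastai's implementation; the names open_files, tokenize and run_pipeline are hypothetical stand-ins, and the tokenizer is a naive whitespace split. The point is only that, without an "open file" step first, the tokenizer receives the path itself rather than the text stored at that path.

```python
import tempfile
from pathlib import Path

def open_files(items):
    """Stand-in for an 'open file' step: replace each path with the file's text."""
    return [Path(p).read_text(encoding="utf-8") for p in items]

def tokenize(items):
    """Stand-in for a tokenize step: naive whitespace split of whatever it receives."""
    return [str(item).split() for item in items]

def run_pipeline(items, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        items = step(items)
    return items

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "doc.txt"
        path.write_text("olá mundo", encoding="utf-8")
        # No open step: the "text" being tokenized is the path string.
        without_open = run_pipeline([path], [tokenize])
        # Open step first: the tokenizer sees the file content.
        with_open = run_pipeline([path], [open_files, tokenize])
    return str(path), without_open, with_open

path_str, without_open, with_open = demo()
print(without_open)
print(with_open)
```

The first printed list contains the pieces of the file path, which matches the symptom described in the question; the second contains the tokens of the file content.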

I tried that before, but now I found my error. I was using the following code:

data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor(vocab=60000)])
            .databunch(bs=bs, num_workers=6))

and getting this error:

AttributeError: 'int' object has no attribute 'numericalize'

If I don’t pass the vocab argument, i.e. if I use NumericalizeProcessor(), then the code works fine. I guess I had been working too much on other things and did not notice that.
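The error message is consistent with vocab expecting a vocabulary object rather than an integer: the processor calls vocab.numericalize(...), and an int has no such method. As far as I know, in fastai v1 the size limit is a separate max_vocab parameter, e.g. NumericalizeProcessor(max_vocab=60000). The plain-Python sketch below reproduces the same failure; TinyVocab and numericalize_with are hypothetical stand-ins, not fastai classes.

```python
class TinyVocab:
    """Hypothetical stand-in for a vocabulary object: maps tokens to integer ids."""
    def __init__(self, tokens):
        # Index 0 is reserved for unknown tokens.
        self.itos = ["xxunk"] + sorted(set(tokens))
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def numericalize(self, tokens):
        return [self.stoi.get(tok, 0) for tok in tokens]

def numericalize_with(vocab, tokens):
    # Mirrors the failing call pattern: whatever was passed as `vocab`
    # must provide a numericalize method.
    return vocab.numericalize(tokens)

tokens = ["olá", "mundo", "olá"]

# Passing a vocabulary object works.
ids = numericalize_with(TinyVocab(tokens), tokens)

# Passing an integer (as in vocab=60000) raises AttributeError.
try:
    numericalize_with(60000, tokens)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(ids)
print(error_message)
```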

Thanks for the help!

Hey, did you find the solution? I tried this with fastai 2.0.16 and I am getting
name ‘TextList’ is not defined. What should I do?