[SOLVED] Problem with TextList.from_folder - Reading the file paths instead of the file content

Mendes · July 27, 2019, 12:19am

Hello,

I am trying to create a language model in Portuguese from scratch based on the notebook in this link:

https://github.com/fastai/course-nlp/blob/master/nn-vietnamese.ipynb

If I follow the complete tutorial without changing anything, it works well. However, since no tokenizer is passed to TextList.from_folder, I assume the default English (“en”) tokenizer is used, which seems odd for a non-English corpus. Is this assumption correct?

data = (TextList.from_folder(dest)
            .split_by_rand_pct(0.1, seed=42)
            .label_for_lm()           
            .databunch(bs=bs, num_workers=6))

which gets me this data.batch:

I then tried to pass the correct tokenizer using the processor argument:

data = (TextList.from_folder(dest, processor=[TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor(vocab=60000)])
            .split_by_rand_pct(0.1, seed=42)
            .label_for_lm()
            .databunch(bs=bs, num_workers=6))

But when I check my data.batch I get this:

This means the script is reading the file paths rather than the file contents.

This error seems similar to what happened in this thread:

TextLMDataBunch.from_folder seems to be broken #1578

I tried to debug the problem myself, but I could not solve it. Can anyone help?

Thanks

sgugger · July 27, 2019, 7:04pm

You need to pass an OpenFileProcessor() before your tokenize processor.
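
A minimal sketch of why the order matters, using plain-Python stand-ins rather than fastai's own classes (open_file_step, tokenize_step, and run_pipeline are hypothetical names made up for illustration). The assumption is that from_folder collects file paths and each processor transforms the items in turn, so a custom processor list without a file-opening step hands the path strings straight to the tokenizer:

```python
import tempfile
from pathlib import Path


def open_file_step(items):
    """Stand-in for a file-opening processor: replace each path with the file's text."""
    return [Path(p).read_text(encoding="utf-8") for p in items]


def tokenize_step(items):
    """Stand-in for a tokenizer: whitespace-split whatever strings it receives."""
    return [str(item).split() for item in items]


def run_pipeline(items, steps):
    """Apply each step to the output of the previous one, in order."""
    for step in steps:
        items = step(items)
    return items


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "doc.txt"
    doc.write_text("olá mundo", encoding="utf-8")
    paths = [doc]

    # No file-opening step: the "tokens" are pieces of the path itself.
    without_open = run_pipeline(paths, [tokenize_step])

    # File-opening step first: the tokenizer sees the file content.
    with_open = run_pipeline(paths, [open_file_step, tokenize_step])

print(without_open)
print(with_open)
```

In the toy pipeline, the first result contains the path to doc.txt and the second contains the words from the file, which mirrors the symptom described in the first post.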

Mendes · July 27, 2019, 8:21pm

I had tried that before, but now I have found my error. I was using the following code:

data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor(vocab=60000)])
            .split_none()
            .label_for_lm()
            .databunch(bs=bs, num_workers=6))

and getting the error:

AttributeError: 'int' object has no attribute 'numericalize'

If I don’t pass the vocab argument, i.e. I use NumericalizeProcessor() with no arguments, the code works fine. I guess I was working too much on other things before and did not notice that.
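
The error message is consistent with vocab expecting a vocabulary object rather than a size. As far as I can tell, in fastai v1 the size cap for NumericalizeProcessor is the separate max_vocab argument, so NumericalizeProcessor(max_vocab=60000) is probably what was intended. The sketch below reproduces the same AttributeError with toy stand-ins (ToyVocab and ToyNumericalizer are hypothetical classes, not fastai code):

```python
from collections import Counter


class ToyVocab:
    """Hypothetical stand-in for a vocabulary object; not fastai's Vocab."""

    def __init__(self, itos):
        self.itos = list(itos)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def numericalize(self, tokens):
        # Unknown tokens map to index 0 ("<unk>").
        return [self.stoi.get(tok, 0) for tok in tokens]


class ToyNumericalizer:
    """Hypothetical numericalize step: `vocab` is an object, `max_vocab` is a size."""

    def __init__(self, vocab=None, max_vocab=60000):
        self.vocab, self.max_vocab = vocab, max_vocab

    def process(self, docs):
        if self.vocab is None:
            # Build the vocabulary from the data, capped at max_vocab tokens.
            counts = Counter(tok for doc in docs for tok in doc)
            most_common = [tok for tok, _ in counts.most_common(self.max_vocab)]
            self.vocab = ToyVocab(["<unk>"] + most_common)
        return [self.vocab.numericalize(doc) for doc in docs]


docs = [["olá", "mundo"], ["olá"]]

# Passing the size through max_vocab works.
ids = ToyNumericalizer(max_vocab=60000).process(docs)

# Passing the size through vocab fails, because an int has no numericalize method.
try:
    ToyNumericalizer(vocab=60000).process(docs)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(ids)
print(error_message)
```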

Thanks for the help!

anishjain · October 31, 2020, 10:07am

Hey, did you find a solution? I tried this with fastai 2.0.16 and I am getting the error
name ‘TextList’ is not defined. What should I do?