Custom NLP data set - getting StopIteration - encoding issue?

tk777 · November 24, 2019, 4:59am

Hello, all! This is my first post. Please be gentle if I’m doing something obviously stupid.

I’m attempting to adapt the IMDB classifier covered in Part 1, lessons 3 and 4, by replacing the text in the IMDB folder with text that I scraped.

It’s giving me this error when I get to the show_batch() step:

I remember seeing/hearing something about setting text encodings. I’m not sure if that’s the solution here. I’m sure someone’s done this before. Can someone gently point me in the right direction?

Thank you!

StopIteration Traceback (most recent call last)
in ()
----> 1 data_lm.show_batch()

1 frames
/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py in one_batch(self, ds_type, detach, denorm, cpu)
166 w = dl.num_workers
167 dl.num_workers = 0
--> 168 try: x,y = next(iter(dl))
169 finally: dl.num_workers = w
170 if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)

StopIteration:

tk777 · November 24, 2019, 7:52pm

I dug into it a bit more. This does appear to have been a text encoding issue. I used the fastai doc() function on TextLMDataBunch.from_folder. After digging through the code and comments, it appears that the text is expected to be UTF-8 encoded.

This post showed me how to check the encoding of my files and of the IMDB files downloaded by fastai.

(If you can’t find the files fastai downloaded, add the dest parameter to untar_data, e.g. path = untar_data(URLs.IMDB, dest="/content").)

Here’s how to check text encoding types on a Linux filesystem (on Colab, I had to run apt-get install file before I could use the file command):
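For anyone who would rather do the check from Python, here is a minimal sketch that walks a folder and lists the files that fail to decode as UTF-8. The helper name find_non_utf8 and the "*.txt" pattern are my own assumptions for illustration, not anything from fastai.

```python
from pathlib import Path


def find_non_utf8(folder, pattern="*.txt"):
    """Return the files under `folder` whose bytes are not valid UTF-8.

    `find_non_utf8` is a hypothetical helper, not part of fastai.
    """
    bad = []
    for path in sorted(Path(folder).rglob(pattern)):
        try:
            # Decoding raises UnicodeDecodeError on invalid UTF-8 bytes.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Calling find_non_utf8 on the folder holding the scraped text should return an empty list once every file is saved as UTF-8. Note that this only proves the bytes are valid UTF-8; it cannot tell which other encoding a failing file actually uses.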

Sure enough, my scraper wasn’t saving UTF-8. I needed to add a parameter to the file open line in my scraper.
https://www.tutorialspoint.com/How-to-read-and-write-unicode-UTF-8-files-in-Python
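Roughly what that kind of fix looks like, as a sketch: I am assuming the parameter is encoding="utf-8" passed to Python's built-in open(), and save_text is just a hypothetical stand-in for whatever the scraper's write step is called.

```python
def save_text(path, text):
    """Write `text` to `path` as UTF-8 (hypothetical scraper helper)."""
    # Without an explicit encoding, open() falls back to the platform's
    # default, which is not guaranteed to be UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

Files that were already scraped with the wrong encoding still need to be re-saved or re-scraped; changing the open() call only affects new writes.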

That took me about 2 weeks of poking at it, but I finally got it! On to the next issue!