Error encountered in Lesson4-imdb Part Sentiment

sam2 · March 12, 2018, 3:59pm

For some reason when I execute the second cell in Sentiment:

IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

I see a message that says downloading aclImdb_v1.tar.gz

Eventually I see an error that says UnicodeDecodeError

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 803: character maps to <undefined>

Has anyone else experienced the same thing? Any way to resolve this?

hafidz · March 12, 2018, 4:42pm

Got the same error as above. It only appeared when I was running the notebook in Windows 10. Running the same code in Ubuntu/AWS results in no error.
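
A likely reason for the Windows-only behaviour: when open() is called without an encoding, Python uses the locale's preferred encoding, which is typically cp1252 on Windows and UTF-8 on Ubuntu. Byte 0x9d has no mapping in cp1252 but is a valid continuation byte in UTF-8. The sketch below uses a right double quotation mark purely as an illustrative character (an assumption; the thread does not say which character is at position 803), and try_decode is a hypothetical helper:

```python
import locale

# U+201D (right double quotation mark) encoded as UTF-8 ends in byte 0x9d.
raw = "\u201d".encode("utf-8")


def try_decode(data, encoding):
    """Decode bytes, returning the text or a short failure description."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


# The encoding open() falls back to when none is given.
print(locale.getpreferredencoding(False))

print(try_decode(raw, "utf-8"))
print(try_decode(raw, "cp1252"))
```

If the first print shows cp1252 on your machine, any file containing such bytes will fail to read unless an encoding is passed explicitly.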

sam2 · March 12, 2018, 5:03pm

Good to know Hafidz,

My error was from line 32 in my local imdb.py

with open(fname, 'r') as f:

I noticed the latest version of imdb.py on the torchtext GitHub repository has changed this line to:

with open(fname, 'r', encoding="utf-8") as f:

I guess the error should go away with this update.
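
To see the effect of the encoding argument in isolation, here is a minimal sketch that does not need torchtext. It writes a small UTF-8 file and reads the first line back the way imdb.py does. The helper read_first_line, the file name, and the sample sentence are all made up for illustration:

```python
import os
import tempfile


def read_first_line(fname, encoding="utf-8"):
    """Read the first line of a text file with an explicit encoding."""
    with open(fname, "r", encoding=encoding) as f:
        return f.readline()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "review.txt")
    # Write UTF-8 bytes directly so the content is the same on every platform.
    with open(path, "wb") as f:
        f.write("A \u201cgreat\u201d film.\n".encode("utf-8"))

    # Explicit UTF-8 works regardless of the system default.
    text = read_first_line(path)

    # Forcing cp1252 mimics what a default Windows setup would do.
    try:
        read_first_line(path, encoding="cp1252")
        cp1252_ok = True
    except UnicodeDecodeError:
        cp1252_ok = False

print(repr(text))
print(cp1252_ok)
```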

hafidz · March 12, 2018, 5:08pm

Interesting. Did you manage to test the updated files?

sam2 · March 12, 2018, 5:11pm

Sorry Hafidz. A different error after the update:

TypeError Traceback (most recent call last)
in ()
1 IMDB_LABEL = data.Field(sequential=False)
----> 2 splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

c:\anaconda3\envs\mxnet-vc14\lib\site-packages\torchtext\datasets\imdb.py in splits(cls, text_field, label_field, root, train, test, **kwargs)

TypeError: super(type, obj): obj must be an instance or subtype of type

It seems I should restart the kernel…as said here:
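
One common way this TypeError shows up in a notebook is when a module is changed or reloaded while the kernel still holds the old class object: super(IMDB, cls) looks up the name IMDB again and finds the new class, which the old class is not a subtype of. Whether that is exactly what happened here is an assumption. The sketch below simulates it with toy classes (Base and this IMDB are stand-ins, not the torchtext classes):

```python
def reproduce():
    """Simulate a class being redefined while an old reference is still in use."""

    class Base:
        @classmethod
        def splits(cls):
            return "ok"

    class IMDB(Base):
        @classmethod
        def splits(cls):
            # The name IMDB is resolved each time this method runs.
            return super(IMDB, cls).splits()

    stale = IMDB  # reference taken before the "reload"

    class IMDB(Base):  # stands in for the reloaded module's class
        @classmethod
        def splits(cls):
            return super(IMDB, cls).splits()

    try:
        stale.splits()  # old class, but IMDB now names the new class
        return None
    except TypeError as exc:
        return str(exc)


print(reproduce())
```

Restarting the kernel removes the stale class objects, which would explain why a restart makes the error go away.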

sam2 · March 13, 2018, 12:57am

@hafidz and all others who may read this.

To fix this error, clone this repository i.e. https://github.com/pytorch/text

and install torchtext from here (python setup.py install --force)

It has updates that release 2.0.1 does not cover. (updates about encoding are not limited to imdb.py but involve dataset.py, field.py etc.)

I can confirm that after this install all my errors went away, both the encoding error and TypeError: super(type, obj): obj must be an instance or subtype of type.

hafidz · March 13, 2018, 3:31am

Great stuff @sam2. Will test them out later tonight (GMT+8).

Hit the same error as you mentioned earlier ( TypeError: super(type, obj): obj must be an instance or subtype of type) - didn’t go through with finding the root cause though. Glad you found a solution.

neves · March 17, 2018, 6:11pm

After cloning and installing, I’ve started to get the following error at the beginning of the notebook:

NameError: name 'spacy_tok' is not defined

I don’t know exactly if this is the cause of the problem, since I’ve also made a pull and a conda update. I’ve executed this point of the notebook without problems before.

neves · March 17, 2018, 7:06pm

There was no problem with the install of pytorch/text. The problem was introduced in commit 3e4b5a9, so I checked out the commit before it. There’s already an issue on GitHub about it.

neves · March 18, 2018, 3:57am

SOLVED! See the update at the end of the message.

It didn’t work for me. It looks like the problem only happens on Windows 10 machines.

Here is what I did, step by step. Please see if you can spot anything different.

I cloned the repository (master branch),
opened a cmd prompt and activated the fastai env,
ran python setup.py install --force,
started the notebook,
ran the first 2 cells and the Sentiment section.
At the second step of the Sentiment section, I got the same error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-7-30850761a448> in <module>()
      1 IMDB_LABEL = data.Field(sequential=False)
----> 2 splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

D:\Anaconda3\envs\fastai\lib\site-packages\torchtext\datasets\imdb.py in splits(cls, text_field, label_field, root, train, test, **kwargs)
     52         return super(IMDB, cls).splits(
     53             root=root, text_field=text_field, label_field=label_field,
---> 54             train=train, validation=None, test=test, **kwargs)
     55 
     56     @classmethod

D:\Anaconda3\envs\fastai\lib\site-packages\torchtext\data\dataset.py in splits(cls, path, root, train, validation, test, **kwargs)
     70             path = cls.download(root)
     71         train_data = None if train is None else cls(
---> 72             os.path.join(path, train), **kwargs)
     73         val_data = None if validation is None else cls(
     74             os.path.join(path, validation), **kwargs)

D:\Anaconda3\envs\fastai\lib\site-packages\torchtext\datasets\imdb.py in __init__(self, path, text_field, label_field, **kwargs)
     31             for fname in glob.iglob(os.path.join(path, label, '*.txt')):
     32                 with open(fname, 'r') as f:
---> 33                     text = f.readline()
     34                 examples.append(data.Example.fromlist([text, label], fields))
     35 

D:\Anaconda3\envs\fastai\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 803: character maps to <undefined>

I’m also using autoreload 2. I’m stuck.

UPDATE: I’ve just managed to fix it. It was necessary to uninstall the older version with pip uninstall torchtext. The older version was installed directly in site-packages and was taking precedence over the newer one installed with python setup.py install. SOLVED!
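
A quick way to check which copy of a package Python would actually pick up is to ask the import system for its location. This is a minimal sketch using the standard library's importlib.util.find_spec; module_location is a hypothetical helper name, and the printed path will depend on your environment:

```python
import importlib.util


def module_location(name):
    """Return the file Python would import `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Run this in the notebook's environment to see which torchtext is found first.
print(module_location("torchtext"))
```

If the path points at an old copy in site-packages rather than the freshly installed one, uninstalling the old copy as described above should resolve the conflict.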

sam2 · March 18, 2018, 6:11pm

@neves,
good for you!
In fact I may also have uninstalled torchtext instinctively before installing from source (GitHub).

utkb · September 2, 2018, 7:38am

Might be unrelated to your Unicode error, but I managed to skip downloading the tar.gz file, as detailed here. Might be helpful to someone else. Thanks.