Error encountered in Lesson4-imdb Part Sentiment

For some reason when I execute the second cell in Sentiment:

IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

I see a message that says downloading aclImdb_v1.tar.gz

Eventually I see an error that says UnicodeDecodeError

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 803: character maps to <undefined>

Has anyone else experienced the same thing? Any way to resolve this?

Got the same error as above. It only appeared when I was running the notebook in Windows 10. Running the same code in Ubuntu/AWS results in no error.
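
A likely explanation for the Windows-only failure: when open() is called in text mode without an encoding argument, Python falls back to the locale's preferred encoding, which is commonly cp1252 on Windows and UTF-8 on Ubuntu. A minimal check you can run on each machine (default_text_encoding is a hypothetical helper name, not something from the notebook or torchtext):

```python
import locale


def default_text_encoding():
    """Return the encoding open() uses in text mode when none is given."""
    # open() falls back to locale.getpreferredencoding(False)
    # when no encoding argument is passed.
    return locale.getpreferredencoding(False)


enc = default_text_encoding()
print(enc)
```

If this prints cp1252 (or another non-UTF-8 code page), any file read without an explicit encoding will be decoded with that code page.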

Good to know Hafidz,

My error was from line 32 in my local imdb.py

with open(fname, 'r') as f:

I noticed the latest version of imdb.py on torchtext github has some updates as:

with open(fname, 'r', encoding="utf-8") as f:

I guess the error should go away with this update
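
For anyone who wants to see the difference outside of torchtext, here is a small sketch that reproduces the same kind of error. It assumes the reviews are UTF-8 and contain a curly closing quote, whose last UTF-8 byte is 0x9d. The file name and sample text are made up for illustration.

```python
import os
import tempfile

# UTF-8 bytes for text with curly quotes; U+201D ends in byte 0x9d.
raw = "He said \u201chello\u201d".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for one IMDB review file.
    fname = os.path.join(tmp, "review.txt")
    with open(fname, "wb") as f:
        f.write(raw)

    # Forcing cp1252 mimics a Windows default; byte 0x9d has no mapping there.
    try:
        with open(fname, "r", encoding="cp1252") as f:
            f.readline()
        cp1252_error = None
    except UnicodeDecodeError as err:
        cp1252_error = err

    # An explicit UTF-8 encoding, as in the updated imdb.py, reads the line cleanly.
    with open(fname, "r", encoding="utf-8") as f:
        text = f.readline()

print(type(cp1252_error).__name__)
print(ascii(text))
```

The first read fails the same way the notebook does, and the second succeeds, which is why adding encoding="utf-8" should help.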

Interesting. Did you manage to test the updated files?

Sorry Hafidz. A different error after the update:

TypeError Traceback (most recent call last)
in ()
1 IMDB_LABEL = data.Field(sequential=False)
----> 2 splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

c:\anaconda3\envs\mxnet-vc14\lib\site-packages\torchtext\datasets\imdb.py in splits(cls, text_field, label_field, root, train, test, **kwargs)

TypeError: super(type, obj): obj must be an instance or subtype of type

It seems I should restart the kernel…as said here:

@hafidz and all others who may read this.

To fix this error, clone this repository i.e. https://github.com/pytorch/text

and install torchtext from here (python setup.py install --force)

It has updates that release 2.0.1 does not cover. (updates about encoding are not limited to imdb.py but involve dataset.py, field.py etc.)

I can confirm that after this install all my encoding errors, as well as the TypeError: super(type, obj): obj must be an instance or subtype of type, went away.

Great stuff @sam2. Will test them out later tonight (GMT+8).

Hit the same error you mentioned earlier (TypeError: super(type, obj): obj must be an instance or subtype of type) but didn’t get as far as finding the root cause. Glad you found a solution.
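
One plausible root cause, offered as an assumption rather than a confirmed diagnosis: if a module is edited and reloaded (for example by autoreload) while the notebook still holds the old class object, super(IMDB, cls) looks up the new IMDB by name but receives the old class. The sketch below reproduces the same message with hypothetical Base and Child classes standing in for the torchtext ones.

```python
class Base:
    @classmethod
    def splits(cls):
        return "ok"


class Child(Base):
    @classmethod
    def splits(cls):
        # The name Child is looked up in module globals at call time.
        return super(Child, cls).splits()


old_child = Child  # a reference still held somewhere, e.g. by the notebook


# Simulate a reload: the name Child is rebound to a brand-new class object.
class Child(Base):
    @classmethod
    def splits(cls):
        return super(Child, cls).splits()


try:
    old_child.splits()  # old class passed to super() with the new Child
    reload_error = None
except TypeError as err:
    reload_error = err

print(reload_error)
print(Child.splits())  # the fresh class works
```

Restarting the kernel clears the stale references, which would fit with the advice above.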

After cloning and installing, I’ve started to get the following error in the beginning of the notebook:

NameError: name 'spacy_tok' is not defined

I’m not sure the new torchtext install is the cause, since I also did a pull and a conda update. I had run this part of the notebook without problems before.

The problem was not with the install of pytorch/text. It was introduced in commit 3e4b5a9, so I checked out the commit before it. There’s already an issue on GitHub about it.

SOLVED! See the update at the end of the message.

It didn’t work for me. It looks like the problem only happens on Windows 10 machines.

Here is what I did step by step. Please see if you can spot anything different.

  1. I cloned the repository (master branch),
  2. opened a cmd prompt and activated the fastai env
  3. ran python setup.py install --force
  4. started the notebook
  5. ran the first 2 cells and the Sentiment section.
  6. At the second step of the Sentiment section, I got the same error:
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-7-30850761a448> in <module>()
      1 IMDB_LABEL = data.Field(sequential=False)
----> 2 splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

D:\Anaconda3\envs\fastai\lib\site-packages\torchtext\datasets\imdb.py in splits(cls, text_field, label_field, root, train, test, **kwargs)
     52         return super(IMDB, cls).splits(
     53             root=root, text_field=text_field, label_field=label_field,
---> 54             train=train, validation=None, test=test, **kwargs)
     55 
     56     @classmethod

D:\Anaconda3\envs\fastai\lib\site-packages\torchtext\data\dataset.py in splits(cls, path, root, train, validation, test, **kwargs)
     70             path = cls.download(root)
     71         train_data = None if train is None else cls(
---> 72             os.path.join(path, train), **kwargs)
     73         val_data = None if validation is None else cls(
     74             os.path.join(path, validation), **kwargs)

D:\Anaconda3\envs\fastai\lib\site-packages\torchtext\datasets\imdb.py in __init__(self, path, text_field, label_field, **kwargs)
     31             for fname in glob.iglob(os.path.join(path, label, '*.txt')):
     32                 with open(fname, 'r') as f:
---> 33                     text = f.readline()
     34                 examples.append(data.Example.fromlist([text, label], fields))
     35 

D:\Anaconda3\envs\fastai\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 803: character maps to <undefined>

I’m also running with autoreload 2. I’m stuck.

UPDATE: I’ve just managed to fix it. I had to uninstall the older version with pip uninstall torchtext. The older version was installed directly in site-packages and was taking precedence over the newer one installed with python setup.py install. SOLVED!
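
If anyone else suspects a stale copy is shadowing the new install, a quick way to see which file Python would actually import is sketched below. module_origin is a hypothetical helper name. It only uses the standard library, so it works whether or not torchtext is installed.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Shows the path of the torchtext that would be imported, or None if absent.
print(module_origin("torchtext"))
```

If the path is not the one you just installed, that is a hint that an older copy still needs to be uninstalled.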

@neves,
good for you!
In fact I may also have uninstalled torchtext instinctively before installing from source (GitHub).

Might be unrelated to your Unicode error, but I managed to skip downloading the tar.gz file, as detailed here. Might be helpful to someone else. Thanks.