10_nlp - UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1757

In chapter 10_nlp, running the code below raises an error:

txts = L(o.open().read() for o in files[:2000])

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1757: character maps to <undefined>

So I changed the code to:

txts = L(o.open(encoding="utf8").read() for o in files[:2000])

and it runs fine.
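
For anyone who wants to see why the explicit encoding matters, here is a minimal sketch that reproduces the decode error outside fastai. It assumes the local default is cp1252 (a common Windows code page, and my guess at what 'charmap' refers to here); the file name and sample text are made up for illustration.

```python
import tempfile
from pathlib import Path

# A right double quotation mark (U+201D) is stored as bytes E2 80 9D in UTF-8.
text = "She said \u201chello\u201d."

with tempfile.TemporaryDirectory() as tmp:
    review = Path(tmp) / "review.txt"  # hypothetical file name
    review.write_bytes(text.encode("utf8"))

    # Assumption: the platform default is cp1252, which has no character
    # assigned to byte 0x9d, so decoding the UTF-8 bytes fails.
    try:
        review.read_text(encoding="cp1252")
        decode_failed = False
    except UnicodeDecodeError as err:
        decode_failed = True
        print(err)

    # Naming the encoding explicitly gives the same result on every platform.
    roundtrip = review.read_text(encoding="utf8")

print(decode_failed, roundtrip == text)
```

Passing encoding="utf8" to o.open() in the notebook does the same thing as the last read_text call here.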

However, executing the next line raises another error:

subword(1000)

UnicodeEncodeError: 'charmap' codec can't encode character '\x96' in position 799: character maps to <undefined>

And I don't know how to fix this.

Note that this runs fine on Colab but fails when run locally in Jupyter.
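
A quick way to compare the two environments is to print the encoding Python falls back to when open() is called without an encoding argument. This is only a diagnostic sketch; default_text_encoding is a helper name I made up, and I am assuming the Colab/local difference comes from the default encoding (typically UTF-8 on Linux, often a legacy code page on Windows).

```python
import locale
import sys

def default_text_encoding() -> str:
    """Return the locale encoding used for text files when none is given."""
    # Hypothetical helper: wraps the standard-library call for readability.
    return locale.getpreferredencoding(False)

print("platform:", sys.platform)
print("default text encoding:", default_text_encoding())
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))
```

If the local run prints something like cp1252 while Colab prints UTF-8, that would explain why the same notebook behaves differently.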


I worked around the second problem (UnicodeEncodeError) by modifying fastai\text\core.py: in SentencePieceTokenizer.setup(), I added encoding='utf8' to the open() call:

...
        with open(raw_text_path, 'w', encoding='utf8') as f:
            for t in progress_bar(maps(*rules, items), total=len(items), leave=False):
                f.write(f'{t}\n')
...

BTW, I encountered the issue on Windows, so it might be Windows-specific.
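
To reproduce the write-side failure without touching fastai, here is a minimal sketch. It assumes the default encoding is cp1252, and the file name and sample text are hypothetical; it only illustrates why adding encoding='utf8' to open() helps.

```python
import tempfile
from pathlib import Path

# U+0096 is the character reported in the traceback; cp1252 cannot encode it.
text = "1990\x962000"

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "texts.out"  # hypothetical file name

    # Assumption: this mimics open(path, 'w') on a machine defaulting to cp1252.
    try:
        with open(out, "w", encoding="cp1252") as f:
            f.write(f"{text}\n")
        encode_failed = False
    except UnicodeEncodeError as err:
        encode_failed = True
        print(err)

    # With an explicit UTF-8 encoding the same write succeeds.
    with open(out, "w", encoding="utf8") as f:
        f.write(f"{text}\n")
    written = out.read_text(encoding="utf8")

print(encode_failed, written == text + "\n")
```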

I think we also need to pass encoding=encoding in the same fastai/text/core.py file, inside the _tokenize_files function:

...
    for i,tok in parallel_tokenize(files, tok, rules, n_workers=n_workers):
        out = func(i,output_dir)
        out.mk_write(' '.join(tok), encoding=encoding)
        lengths[str(files[i].relative_to(path))] = len(tok)
        counter.update(tok)
...

Update:

  • Python 3.7 and later use UTF-8 as the default encoding if you set the environment variable PYTHONUTF8=1. This might solve the problem without code changes. (I haven't tested it. Details: PEP 540, "Add a new UTF-8 Mode", on python.org.)
  • I also added an encoding='utf8' in Tokenizer.encodes():
    def encodes(self, o:Path):
        if self.mode=='folder' and str(o).startswith(str(self.path)):
            tok = self.output_dir/o.relative_to(self.path)
            return L(tok.read_text(encoding='utf8').split(' '))
        else: return self._tokenize1(o.read_text())
  • Make sure you have plenty of free disk space in your home directory (~) or on your C: drive; otherwise the following call may hang:
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

I can confirm that setting the environment variable PYTHONUTF8=1 worked for me without making any other code changes.
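
If you want to check that the variable is actually being picked up, here is a small sketch that starts a child interpreter with PYTHONUTF8 set and reads back sys.flags.utf8_mode. The helper name utf8_mode_in_child is made up for illustration. Note that Python reads the variable at startup, so it needs to be set before Jupyter is launched; setting os.environ inside an already running kernel will not change that kernel.

```python
import os
import subprocess
import sys

def utf8_mode_in_child(value: str) -> str:
    """Start a child Python with PYTHONUTF8=value and report its utf8_mode flag."""
    # Hypothetical helper: copies the current environment and overrides one variable.
    env = dict(os.environ, PYTHONUTF8=value)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

print("utf8_mode with PYTHONUTF8=1:", utf8_mode_in_child("1"))
```

Running print(sys.flags.utf8_mode) in a notebook cell is the equivalent check for the kernel itself.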