Lesson 10: UnicodeEncodeError

wufeng · January 5, 2021, 7:32am

Hello!
When I run this code from lesson 10:

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

I get the following error:

UnicodeEncodeError: 'gbk' codec can't encode character '\x96' in position 749: illegal multibyte sequence

I think the cause may be the mismatch between the Windows 10 system encoding (GBK) and the default Python encoding (UTF-8). However, fastai does not seem to offer an interface for changing the encoding, so I had to follow the traceback.
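The mismatch can be reproduced without fastai. This is a minimal sketch using only the standard library: it takes the character from the error message, '\x96', and tries to encode it with both codecs.

```python
# The character reported in the traceback (U+0096).
ch = "\x96"

# UTF-8 can encode any Unicode code point.
utf8_bytes = ch.encode("utf-8")

# GBK has no mapping for U+0096, so encoding raises UnicodeEncodeError.
try:
    ch.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError as err:
    gbk_failed = True
    gbk_error = err

print(utf8_bytes, gbk_failed)
```

So the failure seems to depend only on which codec ends up being used for writing, not on the IMDb data being corrupt.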

In the file fastai\text\core.py, I can see the default parameter encoding='utf8' in the function _tokenize_files. However, that parameter does not seem to be passed on to mk_write:

def mk_write(self:Path, data, encoding=None, errors=None, mode=511):
    "Make all parent dirs of self"
    self.parent.mkdir(exist_ok=True, parents=True, mode=mode)
    self.write_text(data, encoding=encoding, errors=errors)

So the encoding parameter here is None, which means Python uses the system default encoding (GBK on my computer).
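Here is a small sketch of that behaviour, again using only the standard library. The file names are hypothetical, and it assumes Python is not running in UTF-8 mode. With encoding=None, Path.write_text falls back to the locale's preferred encoding, which locale.getpreferredencoding(False) reports. An explicit encoding='utf-8' does not depend on the system setting.

```python
import locale
import tempfile
from pathlib import Path

# Encoding Python uses when encoding=None is passed to open()/write_text().
# On my Windows 10 machine this resolves to GBK; elsewhere it is often UTF-8.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

text = "a\x96b"  # contains the character from the traceback

with tempfile.TemporaryDirectory() as tmp:
    # Explicit UTF-8 succeeds regardless of the system locale.
    utf8_file = Path(tmp) / "utf8.txt"  # hypothetical file name
    utf8_file.write_text(text, encoding="utf-8")
    round_trip = utf8_file.read_text(encoding="utf-8")

    # Forcing GBK mimics what encoding=None does on a GBK system.
    try:
        (Path(tmp) / "gbk.txt").write_text(text, encoding="gbk")
        gbk_write_failed = False
    except UnicodeEncodeError:
        gbk_write_failed = True

print(round_trip == text, gbk_write_failed)
```

The forced GBK write stands in for the encoding=None call on my system; on a machine whose default is already UTF-8, the same call would not fail.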

If I change the default value of the encoding parameter to UTF-8, it works well.

So my question is: how can I solve this encoding problem without changing the internal code? Does fastai provide any tools for this?