Hello!
When I run this code from lesson 10:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
I get the following error:
UnicodeEncodeError: 'gbk' codec can't encode character '\x96' in position 749: illegal multibyte sequence
I think the cause is a mismatch between the Windows 10 system encoding (GBK) and the UTF-8 encoding the rest of the code assumes. However, fastai does not seem to expose an option for changing the encoding here, so I had to follow the traceback.
In the file fastai\text\core.py, I can see the default parameter encoding='utf8' in the function _tokenize_files. Somehow that parameter is not passed on to mk_write.
def mk_write(self:Path, data, encoding=None, errors=None, mode=511):
    "Make all parent dirs of `self`"
    self.parent.mkdir(exist_ok=True, parents=True, mode=mode)
    self.write_text(data, encoding=encoding, errors=errors)
So the encoding parameter here is None, which means Python falls back to the system default encoding (GBK on my computer).
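To check that reading, here is a minimal sketch of my own (it is not fastai code, and the `can_encode` helper and the temporary file name are made up for illustration). It prints the encoding Python falls back to when `encoding=None`, tests whether the character from the error message can be encoded in GBK and in UTF-8, and then writes it with an explicit `encoding="utf-8"`. It assumes Python's UTF-8 mode is not enabled; with it on, the fallback would be UTF-8 regardless of the system setting.

```python
import locale
import tempfile
from pathlib import Path

text = "\x96"  # the character named in the error message

# The encoding Path.write_text()/open() use when encoding=None
# (the system locale encoding, e.g. a GBK code page on a Chinese Windows install).
print("default encoding:", locale.getpreferredencoding(False))


def can_encode(s: str, encoding: str) -> bool:
    """Hypothetical helper: True if `s` can be encoded with `encoding`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print("gbk:  ", can_encode(text, "gbk"))    # GBK has no mapping for U+0096
print("utf-8:", can_encode(text, "utf-8"))  # UTF-8 can encode any code point

# Passing the encoding explicitly avoids the system default entirely.
out = Path(tempfile.mkdtemp()) / "sample.txt"
out.write_text(text, encoding="utf-8")
print(out.read_bytes())
```

On my reading, the GBK check fails for the same reason the traceback does, while the explicit UTF-8 write succeeds on any system.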
If I change the default value of the encoding parameter to UTF-8, it works well.
So my question is: how can I solve this encoding problem without changing the internal code? Does fastai provide any tools for this?