I worked around that 2nd problem (UnicodeEncodeError) by modifying fastai\text\core.py: in class SentencePieceTokenizer, I added encoding='utf8' in setup():
...
with open(raw_text_path, 'w', encoding='utf8') as f:
    for t in progress_bar(maps(*rules, items), total=len(items), leave=False):
        f.write(f'{t}\n')
...
BTW, I encountered the issue on Windows; it might be a Windows-specific issue.
Python 3.7+ automatically uses UTF-8 encoding if you set the environment variable PYTHONUTF8=1. This might solve the problem without code changes. (I haven't tested it. Details: PEP 540 – Add a new UTF-8 Mode | Python.org)
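To check what that environment variable would change, you can launch a child interpreter with UTF-8 mode forced on and ask it which encoding open() will default to (a quick sanity check I put together, not part of the original patch; "-X utf8" is the command-line equivalent of PYTHONUTF8=1):

```python
import subprocess, sys

# Run a child Python with UTF-8 mode enabled and print the encoding
# that open() would use by default inside it.
code = "import locale; print(locale.getpreferredencoding(False))"
out = subprocess.run([sys.executable, "-X", "utf8", "-c", code],
                     capture_output=True, text=True).stdout.strip()
print(out)  # in UTF-8 mode this reports UTF-8 regardless of the OS locale
```

If this prints a UTF-8 variant, files opened without an explicit encoding= will be read and written as UTF-8, which is exactly what the code patches above force manually.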
I also added an encoding='utf8' in Tokenizer.encodes():
def encodes(self, o:Path):
    if self.mode=='folder' and str(o).startswith(str(self.path)):
        tok = self.output_dir/o.relative_to(self.path)
        return L(tok.read_text(encoding='utf8').split(' '))
    else: return self._tokenize1(o.read_text())
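For context, the crash can be reproduced outside fastai: on Windows, open()/read_text() without encoding= falls back to the ANSI code page (often cp1252), which cannot represent characters such as SentencePiece's '▁' word-boundary marker. A minimal repro, not fastai code:

```python
text = "▁token"  # '▁' (U+2581) is the SentencePiece word-boundary marker
try:
    # This is what an un-forced open() effectively attempts on a
    # cp1252-locale Windows machine, and it raises UnicodeEncodeError.
    text.encode("cp1252")
except UnicodeEncodeError as e:
    print(e)
# Round-tripping through UTF-8 works fine, which is why adding
# encoding='utf8' in the snippets above fixes the issue:
assert text.encode("utf8").decode("utf8") == text
```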
Make sure you have plenty of disk space left on your ~ or C: drive; otherwise you may encounter a hang when you call:
I just bumped into this problem, and after hours and hours the solution I found and used was: go to Windows Settings > Time & language > Language & region > Administrative language settings > Change system locale, and check "Beta: Use Unicode UTF-8 for worldwide language support". Then reboot the PC for the change to take effect.
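After the reboot you can verify from any Python prompt that the system now reports UTF-8 as the default text encoding (a quick check I would suggest, assuming no per-process override like PYTHONUTF8 is in effect):

```python
import locale, sys

# With the "Beta: Use Unicode UTF-8" locale enabled, the ANSI code page
# becomes 65001 (UTF-8), so the preferred encoding should be a UTF-8
# variant rather than a legacy code page like cp1252.
print(locale.getpreferredencoding(False))
print(sys.getfilesystemencoding())
```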