The code that raises the error in the notebook lm3.french.ipynb (the one you copied and pasted in your first post) is SPProcessor(max_vocab_sz=15000)
in the data block.
If you look up the class SPProcessor in the fastai v1 documentation, you get the link to its source code, where you can see that SPProcessor(max_vocab_sz=15000) imports SentencePieceTrainer and SentencePieceProcessor, and creates a temporary folder tmp (in order to store the model and vocabulary produced by training the SentencePiece tokenizer).
By searching the fastai v1 GitHub repository (or by reading the post of @nandakumar212), you'll find that SentencePieceTrainer is used in the file data.py in the folder https://github.com/fastai/fastai/tree/master/fastai/text.
In data.py, at line 431, you'll find the following code:
SentencePieceTrainer.Train(" ".join([
f"--input={quotemark}{raw_text_path}{quotemark} --max_sentence_length={max_sentence_len}",
f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
f"--user_defined_symbols={','.join(spec_tokens)}",
f"--model_prefix={quotemark}{cache_dir/'spm'}{quotemark} --vocab_size={vocab_sz} --model_type={model_type}"]))
As @nandakumar212 said, you can try updating this code (that is, editing the file data.py) by removing {quotemark} on both sides of {cache_dir/'spm'} in the line f"--model_prefix={quotemark}{cache_dir/'spm'}{quotemark} --vocab_size={vocab_sz} --model_type={model_type}". You'll get:
SentencePieceTrainer.Train(" ".join([
f"--input={quotemark}{raw_text_path}{quotemark} --max_sentence_length={max_sentence_len}",
f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
f"--user_defined_symbols={','.join(spec_tokens)}",
f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
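To make the difference between the two versions concrete, here is a small stand-alone sketch that needs neither fastai nor sentencepiece. The helper model_prefix_arg is hypothetical (it is not part of fastai), and I am assuming that quotemark is a double quote character, which I did not verify in data.py. It only shows what string ends up in the --model_prefix argument with and without {quotemark}:

```python
from pathlib import PurePosixPath

def model_prefix_arg(cache_dir, quotemark=""):
    """Build the --model_prefix argument the way the f-string above does.

    Hypothetical helper for illustration only, not part of fastai.
    """
    return f"--model_prefix={quotemark}{PurePosixPath(cache_dir) / 'spm'}{quotemark}"

# Assumption: quotemark is a double quote in the original code.
quoted = model_prefix_arg("tmp", quotemark='"')
unquoted = model_prefix_arg("tmp")

print(quoted)    # original version: the path is wrapped in quote characters
print(unquoted)  # edited version: the bare path
```

If the installed sentencepiece version does not strip those quote characters (my guess, not something I checked), they would become part of the output path, which could explain why removing them helps.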
I did not try it myself, but @nandakumar212 did and it worked.
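If you want to try the same edit, you first need to find the copy of data.py that your Python environment actually loads. A minimal sketch, assuming the module name fastai.text.data (matching the folder fastai/text in the repository); the helper module_source_path is hypothetical and only uses the standard library:

```python
import importlib.util

def module_source_path(module_name):
    """Return the file a module would be loaded from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent package (here: fastai) is missing or fails to import.
        return None
    return spec.origin if spec is not None else None

# Assumption: fastai v1 is installed and data.py lives in the module fastai.text.data.
print(module_source_path("fastai.text.data"))
```

This prints None when fastai is not installed in the current environment. Keep a backup of the file before editing it, since a later update of fastai will overwrite the change.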
(PS: when I created my notebook lm3.french.ipynb one year ago, I did not run into any problem. I guess the code of the file data.py was changed after that.)
@nandakumar212: you should open an issue on the fastai v1 GitHub repository with your solution, as it could help a lot of people.