Looking at the SentencePiece params here:
Wouldn't it be better to assign the UNK, PAD, BOS, and EOS tokens/ids explicitly, like this:
```python
SentencePieceTrainer.Train(" ".join([
    f"--input={raw_text_path} --max_sentence_length={max_sentence_len}",
    f"--character_coverage={ifnone(char_coverage, 1 if lang in full_char_coverage_langs else 0.99)}",
    f"--unk_id=0 --pad_id=1 --bos_id=2 --eos_id=3",
    f"--unk_piece={text.transform.UNK} --pad_piece={text.transform.PAD} --bos_piece={text.transform.BOS} --eos_piece={text.transform.EOS}",
    f"--user_defined_symbols={','.join(defaults.text_spec_tok[1:])}",
    f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
```
I think this would help the tokenization line up better with fastai, especially in cases where a pad_idx has to be specified (it defaults to 1 in most places). It would also use fastai's own custom tokens instead of the ones prefixed by '\u2581'.