Question on SentencePiece params in latest fastai codebase

Looking at the SP params here

Wouldn’t it be better to explicitly assign UNK, PAD, BOS, and EOS tokens/ids as such:

    SentencePieceTrainer.Train(" ".join([
        f"--input={raw_text_path} --max_sentence_length={max_sentence_len}",
        f"--character_coverage={ifnone(char_coverage, 1 if lang in full_char_coverage_langs else 0.99)}",
        f"--unk_id=0 --pad_id=1 --bos_id=2 --eos_id=3",
        f"--unk_piece={text.transform.UNK} --pad_piece={text.transform.PAD} --bos_piece={text.transform.BOS} --eos_piece={text.transform.EOS} ",
        f"--user_defined_symbols={','.join(defaults.text_spec_tok[1:])}",
        f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))

I think this would help the tokenization line up better with fastai, especially in cases where a pad_idx has to be specified (it defaults to 1 in most cases). It would also use the same fastai custom tokens rather than ones prefixed by '\u2581'.
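
To make the intent concrete, here’s the kind of check I’d want to pass after training (a minimal sketch; the spm.model path and the xxunk/xxpad/xxbos/xxeos values for fastai’s UNK/PAD/BOS/EOS are assumptions on my part):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load("spm.model")  # hypothetical path to the model written under cache_dir/'spm'

    # with the explicit ids above, the special tokens line up with fastai's conventions
    assert sp.unk_id() == 0 and sp.PieceToId("xxunk") == 0
    assert sp.pad_id() == 1 and sp.PieceToId("xxpad") == 1  # matches the usual pad_idx=1
    assert sp.bos_id() == 2 and sp.PieceToId("xxbos") == 2
    assert sp.eos_id() == 3 and sp.PieceToId("xxeos") == 3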

Nope, doesn’t work since sentencepiece has its own names for those tokens that are incompatible with the fastai ones (and they don’t let us change them).

Doesn’t the addition of unk_piece, pad_piece, bos_piece, and eos_piece allow us to change the tokens used for these entities?

See https://github.com/google/sentencepiece/releases/tag/v0.1.7

Oh nice find, I hadn’t seen those. @piotr.czapla @mkardas were you aware of those? Did you discard their use because they don’t work as we need them?

The whitespaces are still going to be needed in any case, otherwise sentencepiece will add one whitespace token before each special token.

I actually saw this in @Kaspar’s notebook.

The only problem I can potentially see is in how SP uses them. For example:

    print(sp.EncodeAsPieces("xxbos xxfld 1 xxmaj domincan xxmaj republic"))
    # ['▁', 'xxbos', '▁', 'xxfld', '▁1', '▁', 'xxmaj', '▁do', 'min', 'can', '▁', 'xxmaj', '▁republic']

    print(sp.EncodeAsIds("xxbos xxfld 1 xxmaj domincan xxmaj republic"))
    # [9, 2, 9, 4, 51, 9, 5, 376, 732, 1455, 9, 5, 978]

Seems like it might be better to use ▁xxbos instead of just xxbos to eliminate some of the additional tokens required to make up a special fastai default token.

That’s why we passed them as ▁xxbos, ▁xxmaj and so on to sentencepiece.


I’ve updated my wiki-preparation gist here to include the appropriate SP-friendly tokens (see my forum post here for my rationale).

Any feedback would be great.

I didn’t know about these settings, thanks for mentioning them. The options were not present when we were integrating sentencepiece with fast.ai, and it seems that in order to learn about them one would have to track changes to the protobufs, as the options are not documented. Or track the fast.ai forum, which I prefer :wink:


I notice it’s still adding the ▁ token either way.

After updating everything so that all special tokens are prefixed with ▁:

    print(sp.EncodeAsPieces("\u2581xxbos \u2581xxfld 1 \u2581xxmaj domincan \u2581xxmaj republic"))
    # ['▁', '▁xxbos', '▁', '▁xxfld', '▁1', '▁', '▁xxmaj', '▁do', 'min', 'can', '▁', '▁xxmaj', '▁republic']

So I’m not sure what the benefit of prefixing these tokens is?

No, you should pass the usual fastai text with xxunk (no \uXXXX prefix) and so forth.

Got it. Thanks.

Updated wiki-prep code for SP training again … here

Results look right …

    # all European langs
    full_char_coverage_langs = [
        "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr", "hu",
        "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv"
    ]

    spec_tokens = ['\u2581' + s for s in defaults.text_spec_tok]

sp_params = f"--input={txt_files} "  \
            f"--max_sentence_length={max_sentence_len} " \
            f"--character_coverage={ifnone(char_coverage, 1 if LANG in full_char_coverage_langs else 0.99988)}" \
            f"--unk_id=0 " \
            f"--pad_id=1 " \
            f"--bos_id=2 " \
            f"--eos_id=3 " \
            f"--unk_piece=\u2581{text.transform.UNK} " \
            f"--pad_piece=\u2581{text.transform.PAD} " \
            f"--bos_piece=\u2581{text.transform.BOS} " \
            f"--eos_piece=\u2581{text.transform.EOS} " \
            f"--user_defined_symbols={','.join(spec_tokens[1:])} " \
            f"--model_prefix={model_prefix} " \
            f"--vocab_size={vocab_size} " \
            f"--model_type={model_type}" 

Results look like this …

    print(sp.EncodeAsPieces("xxbos xxfld 1 xxmaj domincan xxmaj republic"))
    # ['▁xxbos', '▁xxfld', '▁1', '▁xxmaj', '▁do', 'min', 'can', '▁xxmaj', '▁republic']

    sp.EncodeAsIds("xxbos xxfld 1 xxmaj domincan xxmaj republic")
    # [2, 4, 51, 5, 270, 1026, 1183, 5, 1181]
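
And as a sanity check, the ▁-prefixed special pieces sit on the reserved ids (the expected output is my assumption, based on the params above and on fastai’s UNK/PAD/BOS/EOS being xxunk/xxpad/xxbos/xxeos):

    # ids 0-3 are the unk/pad/bos/eos pieces set via --unk_id/--pad_id/--bos_id/--eos_id
    for i in range(4):
        print(i, sp.IdToPiece(i))
    # 0 ▁xxunk
    # 1 ▁xxpad   (lines up with fastai's usual pad_idx of 1)
    # 2 ▁xxbos
    # 3 ▁xxeos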

See sp’s user_defined_symbols to define your special tokens. I also use them to keep numbers and other characters like " and () from merging with words.

Thanks, I know about them; we use user_defined_symbols for similar purposes as yours. I was talking here specifically about the unk, pad, bos and eos tokens, which couldn’t be overwritten when I last checked (now they can be, since sentencepiece v0.1.7, as pointed out by @wgpubs).

Can you explain what you mean by this? I noticed it in your notebook and was wondering about it.

To see for yourself, you have to open the vocab file as a text doc.
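
Something like this, for example (a sketch; it assumes the model_prefix from the training snippet above, and that the .vocab file is plain text with one tab-separated piece/score pair per line):

    # peek at the first few entries of the generated vocab file
    with open(f"{model_prefix}.vocab", encoding="utf-8") as f:
        for line in f.readlines()[:8]:
            piece, score = line.rstrip("\n").split("\t")
            print(piece, score)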