Question on SentencePiece params in latest fastai codebase

(WG) #1

Looking at the SP params here

Wouldn’t it be better to explicitly assign UNK, PAD, BOS, and EOS tokens/ids as such:

    SentencePieceTrainer.Train(" ".join([
        f"--input={raw_text_path} --max_sentence_length={max_sentence_len}",
        f"--character_coverage={ifnone(char_coverage, 1 if lang in full_char_coverage_langs else 0.99)}",
        f"--unk_id=0 --pad_id=1 --bos_id=2 --eos_id=3",
        f"--unk_piece={text.transform.UNK} --pad_piece={text.transform.PAD} --bos_piece={text.transform.BOS} --eos_piece={text.transform.EOS} ",
        f"--user_defined_symbols={','.join(defaults.text_spec_tok[1:])}",
        f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))

I think this would help the tokenization line up better with fastai, especially in cases where a pad_idx has to be specified (and in most cases defaults to 1). It would also use the same fastai custom tokens rather than the ones prefixed by '\u2581'.
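For context on why `--pad_id=1` matters: fastai pads batches with index 1 by default, so if the SentencePiece pad piece also gets id 1, nothing has to be remapped afterwards. Here's a minimal pure-Python sketch of that padding convention (not fastai's actual implementation; the left-padding is just for illustration):

```python
def pad_batch(seqs, pad_idx=1):
    """Left-pad id sequences to equal length; pad_idx=1 mirrors fastai's default."""
    max_len = max(len(s) for s in seqs)
    return [[pad_idx] * (max_len - len(s)) + s for s in seqs]

pad_batch([[2, 5, 3], [2, 3]])
# → [[2, 5, 3], [1, 2, 3]]
```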

#2

Nope, doesn’t work since sentencepiece has its own names for those tokens that are incompatible with the fastai ones (and they don’t let us change them).

(WG) #3

Doesn’t the addition of unk_piece, pad_piece, bos_piece, and eos_piece allow us to change the tokens used for these entities?

See https://github.com/google/sentencepiece/releases/tag/v0.1.7

#4

Oh nice find, I hadn’t seen those. @piotr.czapla @mkardas were you aware of those? Did you discard their use because they don’t work as we need them?

The whitespaces are still going to be needed in any case, otherwise sentencepiece will add one whitespace token before each special token.

(WG) #5

I actually saw this in @Kaspar’s notebook.

The only problem I can potentially see is in how SP uses them. For example:

print(sp.EncodeAsPieces("xxbos xxfld 1 xxmaj domincan xxmaj republic"))
# ['▁', 'xxbos', '▁', 'xxfld', '▁1', '▁', 'xxmaj', '▁do', 'min', 'can', '▁', 'xxmaj', '▁republic']

print(sp.EncodeAsIds("xxbos xxfld 1 xxmaj domincan xxmaj republic"))
# [9, 2, 9, 4, 51, 9, 5, 376, 732, 1455, 9, 5, 978]

Seems like it might be better to use ▁xxbos instead of just xxbos, to eliminate some of the additional tokens required to make up a single fastai default token.

#6

That’s why we passed them as ▁xxbos, ▁xxmaj and all to sentencepiece.

(WG) #7

I’ve updated my wiki-preparation gist here to include the appropriate SP friendly tokens (see my forum post here as to my rationale for why I’m doing what I’m doing).

Any feedback would be great.

(Marcin Kardas) #8

I didn’t know about these settings, thanks for mentioning them. The options were not present when we were integrating sentencepiece with fast.ai, and it seems that in order to learn about them one would have to track changes to the protobufs, as the options are not documented. Or track the fast.ai forum, which I prefer :wink:

(WG) #9

I notice it’s still adding the '▁' token either way.

After updating everything so that all special tokens are prefixed with '▁':

print(sp.EncodeAsPieces("\u2581xxbos \u2581xxfld 1 \u2581xxmaj domincan \u2581xxmaj republic"))
# ['▁', '▁xxbos', '▁', '▁xxfld', '▁1', '▁', '▁xxmaj', '▁do', 'min', 'can', '▁', '▁xxmaj', '▁republic']

So I’m not sure what the benefit is to prefixing these tokens?

#10

No, you should pass the usual fastai text with xxunk (no \uXXXX) and so forth.

(WG) #11

Got it. Thanks.

Updated wiki-prep code for SP training again … here

Results look right …

# all European langs
full_char_coverage_langs = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

spec_tokens = ['\u2581'+s for s in defaults.text_spec_tok]

# note: each fragment must end with a space, otherwise adjacent flags fuse together
sp_params = f"--input={txt_files} " \
            f"--max_sentence_length={max_sentence_len} " \
            f"--character_coverage={ifnone(char_coverage, 1 if LANG in full_char_coverage_langs else 0.99988)} " \
            f"--unk_id=0 " \
            f"--pad_id=1 " \
            f"--bos_id=2 " \
            f"--eos_id=3 " \
            f"--unk_piece=\u2581{text.transform.UNK} " \
            f"--pad_piece=\u2581{text.transform.PAD} " \
            f"--bos_piece=\u2581{text.transform.BOS} " \
            f"--eos_piece=\u2581{text.transform.EOS} " \
            f"--user_defined_symbols={','.join(spec_tokens[1:])} " \
            f"--model_prefix={model_prefix} " \
            f"--vocab_size={vocab_size} " \
            f"--model_type={model_type}"
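One pitfall with building the flag string from f-string fragments is that a missing trailing space silently fuses two flags (e.g. `...0.99988--unk_id=0`), and the trainer then never sees the second one. A quick parse-back check catches that before training; this is a pure-Python sketch, not part of the sentencepiece API:

```python
def parse_sp_params(params: str) -> dict:
    """Parse a space-separated '--key=value' flag string back into a dict."""
    out = {}
    for tok in params.split():
        if tok.startswith("--") and "=" in tok:
            key, _, val = tok[2:].partition("=")
            out[key] = val
    return out

flags = parse_sp_params("--unk_id=0 --pad_id=1 --bos_id=2 --eos_id=3")
assert flags["pad_id"] == "1"  # a fused flag would be missing from the dict
```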

Results look like this …

print(sp.EncodeAsPieces("xxbos xxfld 1 xxmaj domincan xxmaj republic"))
# ['▁xxbos', '▁xxfld', '▁1', '▁xxmaj', '▁do', 'min', 'can', '▁xxmaj', '▁republic']

sp.EncodeAsIds("xxbos xxfld 1 xxmaj domincan xxmaj republic")
# [2, 4, 51, 5, 270, 1026, 1183, 5, 1181]

(Kaspar Lund) #12

See sp’s user_defined_symbols to define your special tokens. I also use them to stop numbers and other characters like “ and () from merging with words.

(Marcin Kardas) #13

Thanks, I know about them; we use user_defined_symbols for similar purposes as yours. I was talking here specifically about the &lt;unk&gt;, &lt;s&gt;, and &lt;/s&gt; tokens, which couldn’t be overwritten when I last checked (now they can be, since sentencepiece v0.1.7, as pointed out by @wgpubs).

(WG) #14

Can you explain what you mean by this? I noticed this in your notebook and was wondering about this.

(Kaspar Lund) #15

To see for yourself, you have to open the vocab file as a text doc.
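For example, a minimal sketch for peeking at the trained vocab (the .vocab file is tab-separated text, one piece and its score per line, with the line number being the piece id; the filename `spm.vocab` is an assumption based on the model_prefix used earlier):

```python
from pathlib import Path

def read_vocab(path):
    """Return the pieces of a SentencePiece .vocab file, in id order."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # each line is "<piece>\t<score>"; keep just the piece
    return [line.split("\t")[0] for line in lines]

# pieces = read_vocab("spm.vocab")
# pieces[:4] should be the four special tokens if the id flags took effect
```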
