From the paper … “We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens.”
1. What is a “shared source-target vocabulary”? I’m assuming this means a single subword vocabulary trained on the concatenation of text from both languages … but I’m not sure.
2. In what cases is a “shared source-target vocabulary” preferable to a dedicated vocabulary for each language?
“… For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary.”
3. Did they use a “shared source-target vocabulary” here as well?
… and one other question that came up after reading the paper:
4. Under what circumstances would one choose a subword tokenizer like SentencePiece over a word-level tokenizer like spaCy’s, and vice versa?
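To make question 1 concrete, here’s my mental model of a “shared source-target vocabulary”: a single set of BPE merges learned from the English and German text pooled together, so both sides of the translation pair are encoded with the same subword units. This is only a toy sketch to check my understanding, not the paper’s actual pipeline, and the sample sentences are made up:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a list of sentences (toy implementation)."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in corpus:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# "Shared" vocabulary: train one merge table on both languages at once.
english = ["the cat sat on the mat", "the dog ran"]
german = ["die katze sass auf der matte", "der hund lief"]
merges = train_bpe(english + german, num_merges=10)
```

If my reading is right, the alternative (dedicated vocabularies) would just mean calling `train_bpe(english, …)` and `train_bpe(german, …)` separately, so related languages couldn’t share subwords like common stems or cognates.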