From the paper … “We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens.”
1. What is a “shared source-target vocabulary”? I’m assuming this means a single subword vocabulary trained on the concatenation of text from both languages … but I’m not sure.
2. In what cases is a “shared source-target vocabulary” preferable to a dedicated vocabulary for each language?
“… For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary.”
3. Did they use a “shared source-target vocabulary” here as well?
… and one other question that came up after reading the paper:
4. Under what circumstances would one choose a subword tokenizer like SentencePiece over a word-level tokenizer like spaCy’s, and vice versa?
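To make question 1 concrete, here’s my mental model of a “shared source-target vocabulary”: a single set of BPE merges learned from the English and German text pooled together, so both sides of the translation pair are encoded with the same subword units. This is only a toy sketch to check my understanding, not the paper’s actual pipeline, and the sample sentences are made up:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a list of sentences (toy implementation)."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in corpus:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# "Shared" vocabulary: train one merge table on both languages at once.
english = ["the cat sat on the mat", "the dog ran"]
german = ["die katze sass auf der matte", "der hund lief"]
merges = train_bpe(english + german, num_merges=10)
```

If my reading is right, the alternative (dedicated vocabularies) would just mean calling `train_bpe(english, …)` and `train_bpe(german, …)` separately, so related languages couldn’t share subwords like common stems or cognates.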