What is the logic behind SentencePiece and vocabularies for text classification?

I’m a bit confused about how vocabularies work with SentencePiece.
If I compare the course-nlp notebooks (https://github.com/fastai/course-nlp/blob/master/nn-vietnamese.ipynb, https://github.com/fastai/course-nlp/blob/master/nn-turkish.ipynb), which build a Wikipedia language model and carry it through to the classifier stage with and without SentencePiece, I see two different ways of creating the classifier data.
Without SentencePiece, it is done like this:

data_clas = (TextList.from_df(train_df, path, vocab=data_lm.vocab, cols='comment')
    .split_by_rand_pct(0.1, seed=42)
    .databunch(bs=bs, num_workers=1, backwards=True))

The vocabulary comes from the fine-tuned language model, which makes sense.
However, with SentencePiece:

data_clas = (TextList.from_df(df, path_clas, cols='text', processor=SPProcessor.load(dest))
    .split_by_rand_pct(0.1, seed=42)
    .databunch(bs=bs, num_workers=1))

Here, no vocab parameter is specified; instead we load the SPProcessor that was created on the original Wikipedia dataset, which suggests we use the vocabulary created before fine-tuning?
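My current guess at why this might be OK: a trained subword model defines a *fixed* piece inventory, so any new text (even with words unseen during pretraining) can still be segmented into known pieces, and the vocabulary never needs rebuilding between pretraining and fine-tuning. Here is a toy sketch of that idea (this is purely illustrative pure Python, not the fastai or SentencePiece API; the piece set and the greedy longest-match rule are my own simplification):

```python
# A small, fixed subword inventory, as a SentencePiece-style model would define.
# (Hypothetical pieces for illustration only.)
PIECES = {"un", "break", "able", "re", "view", "er", "s"}

def segment(word, pieces):
    """Greedy longest-match segmentation into known pieces,
    falling back to single characters for anything uncovered."""
    out, i = [], 0
    while i < len(word):
        # Try the longest substring starting at i that is a known piece.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])  # character fallback: no OOV tokens needed
            i += 1
    return out

# Words from a "new" corpus segment fine with the old inventory:
print(segment("unbreakable", PIECES))  # ['un', 'break', 'able']
print(segment("reviewers", PIECES))    # ['re', 'view', 'er', 's']
```

If that intuition is right, passing vocab= would be redundant with SentencePiece, since the loaded processor already pins down the exact same id-to-piece mapping used during fine-tuning, but I’d like to confirm that.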

Do I understand correctly what happens here, and if so, why is it done this way?