I’m trying to build a text classifier based on a language model.
I’m wondering how the words in the vocab are chosen. Are all word frequencies calculated first, and then the top max_vocab words with frequency >= min_freq selected? Or is each word’s frequency checked as it is encountered, and the word added to the vocab if it is >= min_freq, until max_vocab is reached, so that words appearing later are left out even if their frequency is >= min_freq?
Pretty sure that unless you are using SPProcessor, it does this: https://github.com/fastai/fastai/blob/cdcebcdab8520c790fd90afaa97cbf54013c92c0/fastai/text/transform.py#L148
So it should be picking the most frequent ones.
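To make the "most frequent ones" behaviour concrete, here is a minimal standalone sketch of that selection logic as I understand it. It is not the fastai code itself: `build_vocab` and the toy `docs` are made-up names for illustration, and it leaves out anything else the linked function may do. The point is that all tokens are counted first, so a word that only shows up late in the corpus still gets in if it is frequent enough.

```python
from collections import Counter

def build_vocab(docs, max_vocab, min_freq):
    """Hypothetical helper: keep up to max_vocab tokens, most frequent first."""
    # Count every token across the whole corpus before selecting anything.
    freq = Counter(tok for doc in docs for tok in doc)
    # Take the max_vocab most frequent tokens, then drop those below min_freq.
    return [tok for tok, count in freq.most_common(max_vocab) if count >= min_freq]

# Toy corpus: "late" first appears in the second document but is the most frequent token.
docs = [
    ["early", "filler", "common"],
    ["common", "late", "late"],
    ["late", "common", "late"],
]

vocab = build_vocab(docs, max_vocab=2, min_freq=2)
print(vocab)
```

With this toy data, "early" and "filler" are seen first but occur only once, so they are left out, while "late" and "common" are kept.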
Got it, thank you!