How vocab words are chosen

nancyC · October 3, 2019, 3:24pm

Hi ~
I’m trying to build a text classifier based on a language model.
I’m wondering how the words in the vocab are chosen. Are the frequencies of all words calculated first, and then the top max_vocab words with frequency >= min_freq kept? Or are words processed in order, with each word added to the vocab if its frequency is >= min_freq until max_vocab is reached, so that words appearing later are left out even if their frequency is >= min_freq?

juvian · October 3, 2019, 4:33pm

Pretty sure that unless you are using SPProcessor, it does this: https://github.com/fastai/fastai/blob/cdcebcdab8520c790fd90afaa97cbf54013c92c0/fastai/text/transform.py#L148

So it should be picking the most frequent ones.
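
Here is a rough sketch of that behavior, assuming the linked line counts every token in the corpus first and only then keeps the most frequent ones. `build_vocab` and the toy `docs` are hypothetical names for illustration, not fastai's code, and the sketch leaves out any special tokens the library adds.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_vocab, min_freq):
    """Hypothetical helper: choose vocab words by corpus-wide frequency."""
    # Count every token across the whole corpus before choosing anything.
    freq = Counter(tok for doc in tokenized_docs for tok in doc)
    # Take the max_vocab most frequent tokens, then drop those below min_freq.
    return [tok for tok, count in freq.most_common(max_vocab) if count >= min_freq]

# Toy corpus: "zebra" only shows up in the last document.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "rare", "zebra", "zebra"],
]
vocab = build_vocab(docs, max_vocab=3, min_freq=2)
print(vocab)
```

Under this scheme "zebra" still makes it into the vocab even though it appears late, because its total count is compared against all other words rather than against whatever has been seen so far.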

nancyC · October 9, 2019, 5:51pm

Got it, thank you!