How are `<unk>` tokens assigned in language models?

Hi,

I wonder how the `<unk>` tokens are assigned when creating a torchtext field, because I don't see a vocabulary size argument in `data.Field` (I don't think one is set by default).

Thanks :)

I figured it out myself: pass either `max_size` or `min_freq` when calling `LanguageModelData.from_text_files` (or `LanguageModelData.from_dataframes`). These arguments come from the `Vocab` class in torchtext; any word that falls outside the resulting vocabulary is mapped to `<unk>`.
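
For anyone who wants to see the idea in isolation, here is a rough pure-Python sketch of how a frequency cutoff and a size cap decide which words end up as `<unk>`. It is not torchtext's or fastai's actual code: `build_vocab` and `numericalize` are hypothetical helpers written for illustration, and the assumption that `max_size` does not count the special tokens is mine.

```python
from collections import Counter


def build_vocab(tokens, max_size=None, min_freq=1, specials=("<unk>", "<pad>")):
    """Hypothetical helper: build index<->word maps from a token list.

    Assumption: max_size counts only regular words, not the special tokens.
    """
    counts = Counter(tokens)
    # Special tokens get fixed slots at the front, so drop them from the counts.
    for special in specials:
        counts.pop(special, None)

    # Most frequent words first; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    itos = list(specials)
    for word, freq in ordered:
        if freq < min_freq:
            break  # every remaining word is rarer than the cutoff
        if max_size is not None and len(itos) - len(specials) >= max_size:
            break  # the vocabulary is full
        itos.append(word)

    stoi = {word: index for index, word in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi, unk="<unk>"):
    """Map tokens to ids, falling back to the <unk> id for unseen words."""
    return [stoi.get(token, stoi[unk]) for token in tokens]


corpus = "the cat sat on the mat the cat ran".split()
itos, stoi = build_vocab(corpus, min_freq=2)
ids = numericalize("the dog sat".split(), stoi)
print(itos)
print(ids)
```

With `min_freq=2` only "the" and "cat" survive, so both "dog" (never seen) and "sat" (seen once) get the `<unk>` id. Without either argument every training word is kept, which is why `<unk>` would only show up for words that are new at validation or test time.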
