Vocab maybe too small (lots of xxunk)

Hi, I’m trying to train a language model (so I can transfer the encoder to a classifier later), but when I create the DataLoaders and look at the batches, they are full of xxunk tokens. I don’t have much data (~1800 samples), so I’m wondering whether this is normal or if I’m doing something wrong.

Making the DataLoader

lm_dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True),  # from_df takes the text column name(s), not the df itself
    get_x=ColReader('text')
).dataloaders(lmdf, bs=64)

Data sample (screenshot of a batch showing many xxunk tokens omitted)

By default, the vocab built on top of the spaCy word tokenizer is capped at a max vocab of 60,000 tokens, with a min frequency of 3. All remaining tokens are mapped to xxunk.

So either many of your tokens appear fewer than 3 times in the corpus, or the vocab size cap has been reached. You can tweak either of the two and see what happens.
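To see how those two parameters interact, here is a minimal sketch of the vocab-building rule (keep at most `max_vocab` tokens that occur at least `min_freq` times) using plain Python — the `build_vocab` helper and the toy token list are just illustrations, not fastai code:

```python
from collections import Counter

def build_vocab(tokens, min_freq=3, max_vocab=60000):
    # Keep at most max_vocab tokens, each seen at least min_freq times;
    # everything else would be mapped to xxunk.
    counts = Counter(tokens)
    return [t for t, c in counts.most_common(max_vocab) if c >= min_freq]

tokens = ["the"] * 5 + ["cat"] * 2 + ["sat"]
print(build_vocab(tokens))               # → ['the']
print(build_vocab(tokens, min_freq=2))   # → ['the', 'cat']
```

With a small corpus like ~1800 samples, a lot of words will fall under the default min_freq of 3, which is what fills the batches with xxunk; lowering `min_freq` (it is a keyword argument of `TextBlock.from_df`) keeps more of them.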

`DataLoaders.vocab` gives you your vocab info.
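Once you have the vocab, a quick sanity check is to measure what fraction of your tokens falls outside it, since those are exactly the ones that become xxunk. This is a hand-rolled sketch (the `xxunk_rate` helper and the toy vocab/batch are made up for illustration; in practice you would pass `lm_dls.vocab` and your own tokens):

```python
def xxunk_rate(tokens, vocab):
    # Fraction of tokens not in the vocab, i.e. the ones shown as xxunk.
    vocab_set = set(vocab)
    unknown = sum(1 for t in tokens if t not in vocab_set)
    return unknown / len(tokens)

vocab = ["xxunk", "xxbos", "the", "cat"]
batch = ["xxbos", "the", "dog", "sat", "the", "cat"]
print(f"{xxunk_rate(batch, vocab):.0%}")  # → 33%
```

If that rate stays high even after lowering `min_freq`, the corpus is probably just too small to build a rich vocab, and transferring from a pretrained model becomes all the more important.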

Thanks, I will definitely take a look!