Vocab maybe too small (lots of xxunk)

Hi, I’m trying to train a language model (so I can transfer the encoder to a classifier later), but when I create the DataLoaders and look at the batches, they are mostly filled with xxunk tokens. I don’t have much data (~1800 samples), so I’m wondering whether this is normal or whether I’m doing something wrong.

Making the DataLoaders

from fastai.text.all import *

lm_dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True),  # 'text' = name of the column in lmdf holding the raw text
    get_x=ColReader('text'),                       # the tokenized text lands in the 'text' column
    splitter=RandomSplitter(0.15)
).dataloaders(lmdf)

Data sample

[screenshot of a batch: most tokens show as xxunk]
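For context, this is roughly how a batch can be previewed (show_batch just prints a few rows of input/target text):

lm_dls.show_batch(max_n=2)  # shows a couple of (text, text_) rows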

By default, the vocab built on top of the spacy word tokenizer is capped at a max vocab of 60,000 tokens with a min frequency of 3. Tokens that don’t make it into the vocab are set to xxunk.

So either you have tokens appearing fewer than 3 times in the corpus, or the vocab size limit is being reached. With only ~1800 samples, the 60,000 cap is unlikely to be the problem, so min_freq is the more likely culprit. You can tweak either of the two and see what happens, as in the sketch below.
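A minimal sketch of passing those knobs through TextBlock.from_df (assuming your text sits in a column named 'text' and lmdf is the dataframe from your post):

from fastai.text.all import *

# min_freq=1 keeps every token in the vocab; max_vocab left at its default
lm_dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, min_freq=1, max_vocab=60000),
    get_x=ColReader('text'),  # the tokenized text ends up in the 'text' column
    splitter=RandomSplitter(0.15)
).dataloaders(lmdf)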

DataLoaders.vocab provides your vocab info, so you can check how big the vocab actually ended up.
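For example, to see the resulting vocab size and what fraction of a batch gets mapped to xxunk (assuming the lm_dls built above):

vocab = lm_dls.vocab
print(len(vocab))  # number of tokens that made it into the vocab

# fraction of token ids in one batch that are xxunk
unk_idx = list(vocab).index('xxunk')
xb, yb = lm_dls.one_batch()
print((xb == unk_idx).float().mean().item())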

Thanks, I will definitely take a look!