Question about how vocab is built in LanguageModelData

LanguageModelData appears to build its vocab from the training dataset:

field.build_vocab(self.trn_ds, **kwargs)

If we are creating a validation dataset from the original training corpus, wouldn’t this mean we are potentially missing words? And if we are doing something like k-fold cross-validation, wouldn’t our vocab shift with each fold?

It seems to me that the vocab should be a product of both the training and validation datasets.
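To make the concern concrete, here’s a minimal toy sketch (plain Python, not the actual LanguageModelData/torchtext code — the `build_vocab` helper and the sample sentences are made up for illustration) showing how a vocab built only from the training split can’t index words that appear only in the validation split:

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1):
    """Build a word -> index mapping from tokenized documents (toy version)."""
    counts = Counter(tok for doc in token_lists for tok in doc)
    itos = ["<unk>"] + sorted(w for w, c in counts.items() if c >= min_freq)
    return {w: i for i, w in enumerate(itos)}

train = [["the", "cat", "sat"], ["the", "dog", "ran"]]
valid = [["the", "ferret", "slept"]]  # "ferret" and "slept" never appear in train

vocab = build_vocab(train)
# Validation-only words the train-built vocab has no entry (and no embedding) for
missing = {tok for doc in valid for tok in doc} - vocab.keys()
```

(For what it’s worth, torchtext’s `Field.build_vocab` does accept multiple datasets, so passing both splits is mechanically possible — the question in this thread is whether you’d want to.)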

Maybe it’s because there isn’t a separate test set here for learning the sequence of words. The training set itself needs to be divided into both training and validation/test. Ideally, you would be missing vocab from the test set anyway.

You can’t really use words that aren’t in the training set, so there’s no point creating embeddings for them.

So does this preclude doing k-fold cross validation, where the training set is changing with each fold?

And depending on the size of the dataset, could it be better not to use a validation dataset at all, so that all the words are captured and learned? I’m thinking a good strategy might be to get a good-enough model with a validation dataset and then build a final model without one for this purpose.

I don’t think so - you’d just have to rebuild the vocab each time.
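“Rebuild the vocab each time” can be sketched as follows — a toy illustration (the `kfold_splits` helper and sample docs are hypothetical, not any library’s API) where each fold rebuilds its vocab from that fold’s training split, so the vocab always matches what’s actually trained on:

```python
docs = [["a", "b"], ["b", "c"], ["c", "d"], ["d", "e"]]

def kfold_splits(items, k):
    """Yield (train, valid) splits for k-fold CV (simple striding, toy version)."""
    for i in range(k):
        valid = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, valid

fold_vocabs = []
for trn, val in kfold_splits(docs, k=2):
    vocab = {w for doc in trn for w in doc}          # rebuilt per fold
    oov = {w for doc in val for w in doc} - vocab    # that fold's unseen words
    fold_vocabs.append((vocab, oov))
```

Note that the vocab (and the set of out-of-vocab validation words) genuinely differs between folds, which is exactly the shifting the question is about.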

Yup that’s the best approach. We’ve talked about it for image datasets elsewhere on the forum - same idea here.


But in the end, if we’re using the model for sentiment analysis, we would need a single vocab.

Maybe the right process if using k-fold would be to build a vocab for each fold, and then merge/average the embeddings for each word. That way we capture words that are in one training set but not in any of the others, and get whatever benefits may come from ensembling the vocab.
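The merge/average idea could look something like this — a sketch with made-up per-fold embeddings (word → vector dicts; no real model involved), averaging each word over whichever folds it appeared in:

```python
def merge_embeddings(fold_embs):
    """Average each word's vectors across the folds it appears in."""
    acc = {}
    for emb in fold_embs:
        for word, vec in emb.items():
            acc.setdefault(word, []).append(vec)
    # Words seen in one fold keep their single vector; shared words get averaged
    return {w: [sum(dim) / len(vecs) for dim in zip(*vecs)]
            for w, vecs in acc.items()}

fold_a = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
fold_b = {"cat": [3.0, 2.0], "bird": [5.0, 5.0]}
merged = merge_embeddings([fold_a, fold_b])  # "cat" is averaged, others kept as-is
```

Whether averaging embeddings trained in different runs is meaningful is debatable (the vector spaces aren’t aligned between folds), so treat this purely as a sketch of the mechanics being proposed.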


I think of k-fold as being for validating your model. Once done, I’d retrain on the whole dataset. That handles your vocab issue, I believe.


I feel like that puts you at risk of overfitting.

Shouldn’t do so at all, as long as you use identical process and params you used in CV.

Aren’t there times when you want to hold out? Like the fisheries competition, for example. Their hidden test set contained images of boats that weren’t in the training data.

Great point! Yes absolutely. You can’t use CV for that. For NLP, if you have totally different words in your test set to training set then you’re in big trouble, because you can’t learn an embedding for them… Not sure if that happens in practice though - if you come across a real world example, let us know, and we can figure it out.

For word-based LSTM I feel like phrases are more similar to boats than words. In practice, there are often new contexts. Social media or cutting edge tech would add new words. I don’t have any specific problem in mind.

For NLP I am guessing the best thing to do if that happens is to just map those words to the <unk> category and hope the other words help your model determine the meaning of the sentence.
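The <unk> fallback is simple to sketch — a toy numericalization step (the vocab and helper below are hypothetical, though this is essentially what torchtext-style pipelines do) where any out-of-vocab token gets the <unk> index instead of breaking the lookup:

```python
UNK_IDX = 0
stoi = {"<unk>": UNK_IDX, "the": 1, "cat": 2, "sat": 3}  # hypothetical vocab

def numericalize(tokens, stoi, unk_index=UNK_IDX):
    """Map tokens to indices, sending any out-of-vocab token to <unk>."""
    return [stoi.get(tok, unk_index) for tok in tokens]

ids = numericalize(["the", "ferret", "sat"], stoi)  # "ferret" falls back to <unk>
```

The model then learns a single shared embedding for <unk>, and the surrounding in-vocab words carry the signal.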