Question about how vocab is built in LanguageModelData

LanguageModelData appears to build its vocab from the training dataset:

field.build_vocab(self.trn_ds, **kwargs)

If we are creating a validation dataset from the original training corpus, wouldn’t this mean we are potentially missing words? And if we are doing something like k-fold cross-validation, wouldn’t this mean our vocab would be shifting with each fold?

It seems to me that the vocab should be a product of both the training and validation datasets.
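
To make the question concrete, here’s a rough plain-Python sketch of what building the vocab from the training split alone implies (the helper is just for illustration, not fastai’s actual code):

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1, unk='<unk>'):
    # Count tokens and assign indices; anything not counted here
    # will later fall back to the <unk> index.
    counter = Counter(tok for text in tokenized_texts for tok in text)
    itos = [unk] + [tok for tok, c in counter.most_common() if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

train = [['the', 'cat', 'sat'], ['the', 'dog', 'ran']]
valid = [['the', 'giraffe', 'sat']]  # 'giraffe' never appears in the training split

itos, stoi = build_vocab(train)
print([stoi.get(tok, stoi['<unk>']) for tok in valid[0]])
# 'giraffe' maps to index 0 (<unk>) because the vocab only ever saw the training set
```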

Maybe it’s because there isn’t a separate test set here for learning the sequence of words. The training set itself needs to be divided into both training and validation/test. Ideally, you would be missing vocab from the test set anyway.

You can’t really use vocab that’s not in the training set, so there’s no point in creating embeddings for those words.

So does this preclude doing k-fold cross validation, where the training set is changing with each fold?

And depending on the size of the dataset, could it be a better solution not to use a validation dataset at all, so that all the words are captured and learned? I’m thinking a good strategy might be to tune a good-enough model with a validation dataset and then build a final model without one for this purpose.

I don’t think so - you’d just have to rebuild the vocab each time.
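
Just to sketch what that rebuilding could look like with scikit-learn’s KFold (reusing the toy build_vocab helper from the sketch above; all_tokenized_texts is a placeholder for however your corpus is stored):

```python
from sklearn.model_selection import KFold
import numpy as np

# all_tokenized_texts is a placeholder: one list of tokens per document
texts = np.array(all_tokenized_texts, dtype=object)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (trn_idx, val_idx) in enumerate(kf.split(texts)):
    # vocab rebuilt from this fold's training split only
    itos, stoi = build_vocab(texts[trn_idx])
    # ... numericalize texts[trn_idx] / texts[val_idx] with stoi,
    #     train a fresh model, and record the fold's score ...
    print(f'fold {fold}: vocab size {len(itos)}')
```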

Yup that’s the best approach. We’ve talked about it for image datasets elsewhere on the forum - same idea here.

But in the end, if we’re using the model for sentiment analysis, we would need a single vocab.

Maybe the right process if using k-fold would be to build a vocab for each fold, and then merge/average the embeddings for each word. That way we capture words that are in one training set but not in any of the others, and get whatever benefits may come from ensembling the vocab.
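
Roughly what I have in mind (just a sketch; fold_stois and fold_embeddings are placeholders for whatever each fold’s training run produced):

```python
import numpy as np

def merge_fold_embeddings(fold_stois, fold_embeddings):
    # fold_stois: one {word: index} dict per fold
    # fold_embeddings: one (vocab_size, emb_dim) array per fold, from the trained models
    all_words = sorted(set().union(*(s.keys() for s in fold_stois)))
    merged = np.zeros((len(all_words), fold_embeddings[0].shape[1]))
    for i, word in enumerate(all_words):
        # average the word's vector over the folds whose training split contained it
        vecs = [emb[stoi[word]] for stoi, emb in zip(fold_stois, fold_embeddings) if word in stoi]
        merged[i] = np.mean(vecs, axis=0)
    return all_words, merged
```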

Thoughts?

I think of k-fold as being for validating your model. Once done, I’d retrain on the whole dataset. That handles your vocab issue, I believe.

I feel like that puts you at risk of overfitting.

It shouldn’t at all, as long as you use the same process and params you used in CV.

Aren’t there times when you want to hold out a test set? Take the fisheries competition, for example: their hidden test set contained images of boats that weren’t in the training data.

Great point! Yes, absolutely. You can’t use CV for that. For NLP, if you have totally different words in your test set from your training set then you’re in big trouble, because you can’t learn an embedding for them… Not sure if that happens in practice though - if you come across a real-world example, let us know, and we can figure it out.

For a word-based LSTM, I feel like phrases are more analogous to the boats than individual words are. In practice, though, there are often new contexts: social media or cutting-edge tech would add new words. I don’t have any specific problem in mind.

For NLP, I’m guessing the best thing to do if that happened is to map those words to the <unk> category and hope the other words will help your model make sense of the sentence.
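
Something like this when numericalizing the test set, for example (a sketch, assuming a stoi dict like the one the training vocab produced):

```python
from collections import defaultdict

# make lookups fall back to the <unk> index for words the training vocab never saw
stoi_unk = defaultdict(lambda: stoi['<unk>'], stoi)

test_sentence = ['this', 'blockchain', 'rocks']  # 'blockchain' not in the training vocab
ids = [stoi_unk[tok] for tok in test_sentence]
```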