Should train/valid splits be the same for NLP language and classification models with ULMFiT?

From the NLP docs on fine-tuning a language model, the data loaders for the language model and for the classification model can use different training and validation sets (because they are split randomly), with the classification model's data loader using the vocab from the language model.

In this case it’s likely that the training sets for the language model and the classification model will contain different vocabulary. However, we only train the classification model on the vocabulary present in the language model. Wouldn’t it be better if both training sets were the same, so that the classification model does not have its vocabulary restricted?

I realize my explanation is confusing, so hopefully an example will explain this better.
Take a small data set where the training set for the language model has a vocab of, say, 1,000 words, and the training set for the classification model has a vocab of, say, 1,500. My understanding is that if we create the classification model's data loader using the vocab from the language model, the classifier will be reduced to using only 1,000 embeddings to make its decision. My question is: is this a good idea? Wouldn’t it be better for the classification model to use the same training set as the language model, so it can use all 1,500 embeddings?
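To make the restriction concrete, here is a minimal Python sketch (my own illustration, not fastai's actual implementation; the `xxunk` name just follows fastai's convention for the unknown token) of what happens when the classifier's texts are numericalized with the language model's vocab: every token the language model never saw collapses to a single unknown index, so the classifier effectively only ever sees the LM's embeddings.

```python
# Hypothetical LM vocab; index 0 is reserved for the unknown token.
LM_VOCAB = ["xxunk", "the", "movie", "was", "great", "boring"]
LM_INDEX = {tok: i for i, tok in enumerate(LM_VOCAB)}
UNK = LM_INDEX["xxunk"]

def numericalize(tokens, index, unk):
    """Map tokens to ids; out-of-vocab tokens all collapse to the unk id."""
    return [index.get(tok, unk) for tok in tokens]

# "a" and "masterpiece" appear in the classifier's training text but not in
# the LM vocab, so both are collapsed to xxunk (id 0):
ids = numericalize(["the", "movie", "was", "a", "masterpiece"], LM_INDEX, UNK)
print(ids)  # → [1, 2, 3, 0, 0]
```

Whatever distinguishes "masterpiece" from "a" is lost before the classifier ever sees the text, which is exactly the restriction the question is about.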


I have wondered about the same thing, and in the end I decided that no, they should not be the same. The two tasks are actually different, so you don’t risk real data leakage: classifying a validation text that was in the language model’s training set should not be any easier. The only effect is that you are sure your encoder has seen all the vocabulary before, while the classification head itself is untouched.

For the problem you pose, the vocabulary size is fixed, and, as far as I understand, the encoder gets trained during the fine-tuning stages of your classification, so it should pick up vocabulary that is particularly relevant to the classification task.

Keep in mind also that you are not just using single words: you are building language models with some degree of context, so individual words are unlikely to make a big difference.

Is it simply that, when the splits are performed randomly, words which appear in both vocabularies matter much more than words which appear in only one, so the restriction should have little to no effect?

If that is correct, what is the disadvantage of using the same split to increase the size of the vocabulary available to the classification model when you have a really small data set?
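A quick toy simulation (my own illustration, assuming a Zipf-like word-frequency distribution, which natural text roughly follows) supports the intuition that a random split mostly hurts rare words: frequent words land in both halves, so the classifier tokens lost to `xxunk` are almost all rare ones.

```python
import random

random.seed(0)

# Toy corpus with Zipf-like frequencies: word_i appears roughly 200/i times,
# so word1 is very common and words past ~100 occur only once.
corpus = [f"word{i}" for i in range(1, 201) for _ in range(200 // i)]
random.shuffle(corpus)

half = len(corpus) // 2
lm_vocab = set(corpus[:half])   # vocab the language model gets to see
clf_tokens = corpus[half:]      # tokens in the classifier's split

covered = sum(tok in lm_vocab for tok in clf_tokens)
coverage = covered / len(clf_tokens)
print(f"{coverage:.0%} of classifier tokens are in the LM vocab")
```

In runs like this, the large majority of classifier token occurrences are covered by the LM vocab; the uncovered remainder is dominated by words that occur once or twice, which the embeddings could not have learned much about anyway.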

There is no disadvantage IMO. In the case of IMDB we were using a different validation set for the language model because the official validation set is way too big (the same size as the training set).

In my specific case (and I believe this to be a common one), I usually have many more unlabeled texts than labeled ones. So I can use the former for the language model, but not for classification. It would actually have been a disadvantage to use the same split, as I would have been forced to use less data.

Yes, of course it’s different if the datasets you can use for language modelling and for classification are different.