Fine-tuning a language model with some data you will later predict

Curious to get everyone’s thoughts on this. We were having a discussion at work. We trained a language model on 1.5 million responses and we are now trying to use it to make predictions on roughly 10k of them. Keep in mind that no labels were ever seen as part of the language model training. Someone has concerns that, because 10k of our 1.5 million responses contain the text we will later try to predict ratings for, the “predictors” have been seen and thus data leakage is possible. I’m having trouble seeing how this could be remotely impactful data leakage if it does happen, but it started a pretty big argument, so I figured I’d come here to get the experts’ perspective. I seem to remember Jeremy doing exactly this when teaching the sections on ULMFiT, so I’m curious why he would do it if it potentially causes leakage that could lead to over-fitting, as my colleagues claim.

Disclaimer: not an expert in NLP, mainly do CV

Isn’t training a language model a different task from training a classifier? Embeddings learn relationships between words; they don’t know whether something is good or bad, just that these groups of words tend to appear near each other. I think training a language model and then training a classifier are different enough tasks that this is not data leakage. The classifier stage is where you instill direct relationships, e.g. positive/negative sentiment, into those groups of words.

It’s fine to train a language model on the 1.5 million responses and then transfer the model into a classifier. You can then use the 10k labeled examples drawn from the 1.5 million. (I think this is your case, since you said the LM training was unlabeled.)
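For concreteness, here’s a minimal sketch of that two-stage ULMFiT-style setup with fastai v2. The dataframe names (`unlabeled_df` with a `text` column, `labeled_df` with `text` and `label` columns) are just placeholders for illustration, not anything from this thread:

```python
from fastai.text.all import *

# Stage 1: fine-tune the language model on the full unlabeled corpus
# (labels are never seen here, only next-token prediction).
dls_lm = TextDataLoaders.from_df(unlabeled_df, text_col='text',
                                 is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
learn_lm.fine_tune(1)
learn_lm.save_encoder('ft_encoder')

# Stage 2: train the classifier only on the labeled subset,
# reusing the LM's vocabulary and encoder weights.
dls_clf = TextDataLoaders.from_df(labeled_df, text_col='text', label_col='label',
                                  text_vocab=dls_lm.vocab, valid_pct=0.2)
learn_clf = text_classifier_learner(dls_clf, AWD_LSTM, metrics=accuracy)
learn_clf.load_encoder('ft_encoder')
learn_clf.fine_tune(3)
```

The point of the split is that the labels only ever enter in stage 2.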

It’s not fine if you trained a classifier on 1.5 million responses, and then used 10k of those responses as validation or test data. That is data leakage.
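In other words, the split that matters for leakage is the one around the labels. A tiny sketch of that check, again with a hypothetical `labeled_df` that has a `label` column:

```python
from sklearn.model_selection import train_test_split

# The classifier's validation/test rows must be disjoint from its training rows;
# whether either set also appeared in the unlabeled LM corpus is a separate question.
train_idx, valid_idx = train_test_split(
    labeled_df.index, test_size=0.2,
    stratify=labeled_df['label'], random_state=42)

assert set(train_idx).isdisjoint(valid_idx)  # no shared labeled rows
```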

As a side note, why not just try running some experiments and see? You could probably use a smaller subset of the responses.

Those are my thoughts exactly.

The main problem with running experiments is that we included most of the labeled data in the original LM. We do have a small set that was not included, and I told him that if he’s that concerned I’d point him to all of the data he needs to test it, but that I wasn’t all that concerned. That didn’t sit well with him. At the end of the day I have product deadlines to meet and this is literally one of the last things I’m worried about. I just wanted to check with some of the fastai practitioners to see if I was missing something here.
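If he does want to run that check, one lightweight version is to score the classifier separately on labeled examples that were in the LM corpus and on the small held-out set, then compare. A rough sketch, assuming an `eval_df` with `text`, `label`, and a boolean `in_lm_corpus` flag, plus a `predict_fn` that returns predicted labels (all hypothetical names, not from this thread):

```python
from sklearn.metrics import accuracy_score

eval_df['pred'] = predict_fn(eval_df['text'].tolist())

# If LM pre-training leaked anything useful, rows the LM saw should score
# noticeably better than rows it never saw.
for seen, group in eval_df.groupby('in_lm_corpus'):
    acc = accuracy_score(group['label'], group['pred'])
    print(f"seen by LM={seen}: n={len(group)}, accuracy={acc:.3f}")
```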

I wouldn’t completely disregard the possibility, although it does seem very unlikely. I would guess there are situations where language-model transfer learning could have a data-leakage problem if the tasks are similar enough.

I’d guess the person raised the concern to try to help, not to derail the product. In my experience, there is a lot of value in being able to explain concepts simply and convince others, even if it takes extra effort.

*Added
One indirect way is to list examples of other people/companies doing the same thing. For example, if Google, Facebook, and Amazon all do this on a certain benchmark, then it’s probably OK; otherwise, everyone is wrong (which could be true). This way you don’t need to re-run anything and you get some confidence fast. I think you’re already doing this by getting feedback here, but maybe there’s some well-regarded NLP benchmark where people do it all the time. If someone said Geoffrey Hinton does this, I probably wouldn’t say more.

Yeah, in this case the 1.5 million responses the LM was trained on are extremely diverse, and any given thing we’re trying to label might represent 1–2% of that data. Also, given the simplicity of many of the responses (within the label sets), it’s entirely possible this form of “leakage” would have occurred regardless of whether or not we held out our “test set”, because there are already so many very similar responses within the specific labels.

It’d be like training an LM on descriptions of food in a grocery store and holding out all references to milk because milk was mentioned in the test set.

Either way, I told him I’d help him find the data he could use to test it, but that I wasn’t concerned about it and didn’t have time to look into it right now. I’d honestly be way more concerned about our relatively small labeled datasets than about this.

I guess the more simplistic way I look at it is: if you trained a word2vec model on all of your data and then fed those embeddings into your predictive model, and some of the corpus being predicted was included in your word2vec training, would you consider that data leakage? Effectively, an LM is just a much more robust word2vec model. Personally I’ve always considered data leakage to be specifically about labeled data, in the way you described it above, not really something that applies to transfer learning.
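To make that analogy concrete, here is roughly what the word2vec version looks like as a sketch with gensim and scikit-learn (`all_responses`, `train_texts`, `train_labels`, and `test_texts` are placeholder names): the unsupervised embedding model sees the whole corpus, including the texts being predicted, but the classifier only ever fits against labels from its own training split.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Unsupervised step: embeddings trained on the full corpus, labels never seen.
tokenized = [t.lower().split() for t in all_responses]
w2v = Word2Vec(sentences=tokenized, vector_size=100, min_count=2, epochs=5)

def embed(text):
    toks = [w for w in text.lower().split() if w in w2v.wv]
    return (np.mean([w2v.wv[w] for w in toks], axis=0)
            if toks else np.zeros(w2v.vector_size))

# Supervised step: the classifier only sees labels from its own training split.
X_train = np.stack([embed(t) for t in train_texts])
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
preds = clf.predict(np.stack([embed(t) for t in test_texts]))
```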