Lesson 10 - IMDB ULMFIT - pretraining on the test set?

Hey guys, thanks for the course. I have been using the lesson 10 notebook to get familiar with the ULMFiT model. Something seems a bit amiss to me. In the ‘Standardize format’ section we have the lines

trn_texts,val_texts = sklearn.model_selection.train_test_split(
    np.concatenate([trn_texts,val_texts]), test_size=0.1)

…

df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':[0]*len(val_texts)}, columns=col_names)

df_trn.to_csv(LM_PATH/'train.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test.csv', header=False, index=False)

This means you are pretraining the language model on about 90% of the test set - this is surely not good practice, and it is brushed over in the video. The whole point of a test set is that it is supposed to represent data that is ‘unseen’ by the model - how else can you reasonably evaluate what will happen if you show it something completely new? I know this is standard practice in Kaggle competitions, but it’s not something that should be encouraged for people looking to build real-world ML applications.
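
For concreteness, this is roughly what I would have expected instead - a minimal sketch reusing the notebook’s variable names (trn_texts, col_names, LM_PATH), not something I’ve run - where the language model only ever sees the original training texts:

import pandas as pd
import sklearn.model_selection

# split only the labelled training texts for the LM; the test texts stay unseen
lm_trn, lm_val = sklearn.model_selection.train_test_split(trn_texts, test_size=0.1)

df_trn = pd.DataFrame({'text': lm_trn, 'labels': [0]*len(lm_trn)}, columns=col_names)
df_val = pd.DataFrame({'text': lm_val, 'labels': [0]*len(lm_val)}, columns=col_names)

df_trn.to_csv(LM_PATH/'train.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test.csv', header=False, index=False)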

Also - I hate to ask - was this done in the ULMFiT paper?

We just use the unlabeled texts of the test set for training the language model. We don’t access the test set labels for classification. We are always given the texts in the test set at test time, so there is no reason we can’t use them for training.

Disagree. In real life you can’t always just retrain a model every time a new piece of data comes along. Bear in mind, the goal is normally not just ‘get the highest score you can on the test set’; it’s to build a model that can actually perform the task on new data. For this reason, the test score should be computed on completely unseen data, since the test set is there to simulate new data. See, for example, this link for more.

Secondly, on a more pedantic note, you actually don’t always have access to the test set, partly for this very reason - see, for example, this Kaggle contest.

So there is a difference between the engineering constraints of time and resources and the mathematical constraints around making a strong model.

There is (usually) no mathematical reason not to do unsupervised training on new, unlabelled data.

There may be engineering reasons not to do it, but at the same time it isn’t as uncommon as you seem to believe. We built systems that rebuilt unsupervised embeddings nightly to capture changes in word relationships.

I read that Quora answer, and it’s true to some extent, but incomplete. For example, there is a long history of using things like clustering as part of a prediction pipeline, and those clusters are rebuilt at prediction time.
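
To make that concrete, here is the rough shape of such a pipeline - a scikit-learn sketch with hypothetical arrays X_train, y_train and X_new, not code from the notebook - where the clusters are refit on everything available but the classifier only ever sees the training labels:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# unsupervised step: clusters are rebuilt over all feature vectors, including new unlabelled ones
kmeans = KMeans(n_clusters=10, n_init=10).fit(np.vstack([X_train, X_new]))

# supervised step: only labelled rows train the classifier, with distances to cluster centres as extra features
clf = LogisticRegression(max_iter=1000).fit(
    np.hstack([X_train, kmeans.transform(X_train)]), y_train)

preds = clf.predict(np.hstack([X_new, kmeans.transform(X_new)]))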

OK, say I have more than one model to choose from - how do I mathematically evaluate which model is the best? Well, it depends what I mean by ‘best’, right? If I want to know which model will perform best on completely new, completely unseen data, I clearly can’t pretrain on the test set without adding unfair bias. If I just want to know how good a score I can get on the test set without using the test labels, it is clearly OK.

The first one should be the default in my opinion, as it has a clear, concrete meaning and covers more use cases. FWIW, it is also standard in academic papers (I looked up the code to reproduce the result in the ULMFiT paper, and they do not pretrain on the test set).

I’ll just quote the ULMFiT paper here:

If we allow ULMFiT to also utilize unlabeled examples … we match the performance of training from scratch.

Not sure where you are seeing the ULMFiT reproduction code, since the evaluation code isn’t there.

Also

If I want to know which model will perform best on completely new, completely unseen data

There is no way to know that. You can show which performs best on a known test set, but performance may be different on other data, especially if it comes from a different distribution. That’s the whole idea of unsupervised language model training (or even word embeddings): it allows one to easily capture patterns in unlabelled data.


That is the code I am referring to. There is a test set, and they don’t pretrain on it.

There is no way to know that. You can show which performs best on a known test set, but performance may be different on other data, especially if it comes from a different distribution. That’s the whole idea of unsupervised language model training (or even word embeddings): it allows one to easily capture patterns in unlabelled data.

I don’t follow - you seem to be saying that test set scores are not useful because on other data you may be drawing from a different distribution? That’s throwing the baby out with the bathwater a little, isn’t it? And how does fine-tuning on the test set help us with this? If you’re drawing from a different distribution later on, fine-tuning potentially hurts you.

Just so I’m sure we’re on the same page - I am not at all saying that unsupervised training is a bad idea. I’m saying that you shouldn’t pretrain on your held-out test set. True, evaluation might not accurately reflect what will happen when completely new data comes in, but surely you want to get as close to that as you can.

The trick is that one could use test set features to make the model generalize better and not overfit. You don’t want to train directly on the validation and test sets, because then the model will overfit immediately. What you want to do is extract as much useful information as you can from absolutely everywhere to make your model generalize better. That’s the hope, at least.
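
As a toy illustration of that split between the unsupervised and supervised steps (a scikit-learn sketch with hypothetical trn_texts, trn_labels and test_texts - not ULMFiT itself): the vectorizer is fitted on every text, labelled or not, while the classifier only ever sees the training labels.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# unsupervised step: the vectorizer sees all texts, no labels involved
vec = TfidfVectorizer()
vec.fit(list(trn_texts) + list(test_texts))

# supervised step: only the training labels are used
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.transform(trn_texts), trn_labels)

preds = clf.predict(vec.transform(test_texts))  # test labels are only touched when scoring afterwards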

Test scores are useful to a point, but you need to understand exactly what they are showing.

On a static dataset with the test data drawn appropriately, they are very useful. In a production system, especially in something like social media analysis, they rapidly become outdated.

A couple of specific examples:

  • covfefe was a nonsense string of characters until May 31 2017, when it suddenly became one of the most common words on Twitter.

  • “trump” has almost zero similarity with “president” in the default word2vec models from 2014.

In both these cases unsupervised techniques can help combat this.
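
For instance, with gensim you can keep updating an existing word2vec model on newly collected, unlabelled text - a rough sketch, with the model path and sentences purely hypothetical:

from gensim.models import Word2Vec

model = Word2Vec.load('embeddings.model')        # some previously trained model
new_sentences = [
    ['despite', 'the', 'constant', 'negative', 'press', 'covfefe'],
]                                                # freshly collected, tokenised, unlabelled text
model.build_vocab(new_sentences, update=True)    # pick up newly seen words (given enough occurrences to clear min_count)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)
model.save('embeddings.model')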

There are circumstances where you can’t pretrain, so yes it is good to know what kind of performance you would expect then.

But the whole point of using unsupervised techniques is that they work on unseen data, so it seems silly to restrict yourself from doing so.

Yes, if you are publishing you need to explain this, and you often want to publish both numbers. But it is entirely justified to do this.