Kaggle test set AUC is lower than local validation set AUC

(Malcolm McLean) #1

I’d like to understand why my Kaggle test set score (.94) is so much lower than the area under the ROC curve calculated on the validation set (.99). I’m using fastai 1.0.34.

  1. Are the results returned by learn.get_preds(ds_type=DatasetType.Test) guaranteed to be in the same order as those in data.test_ds.to_df()? Here is my code in context:

    data = ImageDataBunch.from_csv(csv_labels='train_labels.csv', suffix='.tif', path=DATA, folder='train', test='test', ds_tfms=None, bs=BATCH_SIZE, size=SIZE2).normalize(imagenet_stats)

    testprobs,val_labels = learn.get_preds(ds_type=DatasetType.Test)
    testdf = data.test_ds.to_df()
    testdf.columns = ['id','label']
    testdf['label'] = testprobs[:,1]
    testdf['id'] = testdf['id'].apply(lambda fp: Path(fp).stem)
    testdf.to_csv(SUBM/'rn34s2x.csv', index=False, float_format='%.7f')

  2. Is this the best way (correct, clear, not fragile) to prepare a test set submission using fastai 1.0?
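One way to make the submission step less fragile is to pair each prediction with its file path explicitly and fail loudly if the two lists disagree. The following is only a minimal sketch using the standard library: `write_submission` and its arguments are hypothetical names, and it assumes you already hold the test file paths and the positive-class probabilities in matching order. It checks lengths and duplicate ids, but it cannot by itself prove that fastai returned the predictions in that order.

```python
import csv
from pathlib import Path

def write_submission(file_paths, pos_probs, out_path):
    """Write an id,label CSV, pairing each test file with its probability.

    Hypothetical helper: assumes file_paths and pos_probs are already
    in the same order.
    """
    file_paths = list(file_paths)
    pos_probs = [float(p) for p in pos_probs]
    if len(file_paths) != len(pos_probs):
        raise ValueError(
            f"{len(file_paths)} files but {len(pos_probs)} predictions"
        )
    # Kaggle ids here are the file names without directory or extension.
    ids = [Path(fp).stem for fp in file_paths]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate ids in test file list")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        for image_id, prob in zip(ids, pos_probs):
            writer.writerow([image_id, f"{prob:.7f}"])
    return len(ids)
```

A cheap sanity check on ordering is to predict a handful of test images one at a time and compare those probabilities with the corresponding rows of the written file.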

My local AUC is calculated from the validation set by:

from sklearn.metrics import roc_auc_score

def auc_score(y_score,y_true):
    return torch.tensor(roc_auc_score(y_true,y_score[:,1])) # use as metric

probs,val_labels = learn.get_preds()
auc_score(probs,val_labels) # .99
accuracy(probs,val_labels) # .985

As said above, Kaggle’s AUC score on their test set is .94. The problem is not so much that the Kaggle score is low (though I’d certainly like a higher one), but that I do not have a reliable way to measure various experiments.

Thanks so much for any hints.

P.S. The local AUC and accuracy scores remain about the same even when refreshing the DataBunch (train/validation split).



I don’t know the details of that particular competition, but in general, one’s validation performance does not necessarily match Kaggle’s leaderboard performance. There is often a gap between the two (I have experienced this almost every time). One possible reason is that the validation set comes from the training set, while the training and test sets do not necessarily have the same distribution. I’ve read that some people spend significant time building a validation set that resembles the test set, rather than simply using a random sample of the training set as validation.

(Malcolm McLean) #3

I’m getting a handle on the cause of this issue and would appreciate some help with it.

When it’s time to quit for a meal, appointment, or sleep, I sometimes save the partially trained model with learn.save() to come back to later.

On returning, I recreate the model with
learn = create_cnn(data, arch, metrics=[accuracy,auc_score])

create_cnn() requires making and passing to it a DataBunch. This new ImageDataBunch.from_csv() internally calls random_split_by_pct, making a different training/validation split than in the previous session. Therefore some images that were originally in the training set leak into the validation set. Because the reloaded model has already trained on those images, the validation measure is erroneously inflated, causing the large discrepancy between the validation and the Kaggle test sets.
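The size of this leak can be illustrated without fastai at all. The sketch below uses a stand-in `random_split` helper (a hypothetical name, not fastai's `random_split_by_pct`) and assumes a 20% validation fraction; it compares the training set of one session with the validation set of another.

```python
import random

def random_split(items, valid_pct, rng):
    """Shuffle a copy of items and return (train, valid) as sets."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return set(shuffled[n_valid:]), set(shuffled[:n_valid])

def leaked_fraction(old_train, new_valid):
    """Fraction of the new validation set that was in the old training set."""
    return len(old_train & new_valid) / len(new_valid)

names = [f"img_{i:05d}" for i in range(10_000)]

# Two different random states stand in for two kernel sessions.
train_1, valid_1 = random_split(names, 0.2, random.Random(1))
train_2, valid_2 = random_split(names, 0.2, random.Random(2))
print(f"leaked: {leaked_fraction(train_1, valid_2):.0%}")

# Same random state in both sessions: identical split, so nothing leaks.
train_3, valid_3 = random_split(names, 0.2, random.Random(1))
print(f"leaked: {leaked_fraction(train_1, valid_3):.0%}")
```

With independent splits, the leaked share should sit close to the training fraction itself, so most of the "new" validation set consists of images the saved model has already seen.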

Searching the forums, I see seed=42 used to circumvent this problem. However, in my tests, passing num_workers=0, seed=42 to ImageDataBunch.from_csv() does not help: the train/validation split is different after each kernel restart.

FYI, this sequence DOES yield the same split when used along with num_workers=0. (I don’t know which calls are critical.)

import random
import numpy as np
import torch

np.random.seed(seed_value) # cpu vars
torch.manual_seed(seed_value) # cpu vars
random.seed(seed_value) # Python
if use_cuda:
    torch.cuda.manual_seed_all(seed_value) # gpu vars

@sgugger, would you please advise on the best way to deal with this problem? I need to be able to take a break and resume experimenting with the same training and validation sets (same DataBunch). Then the training regime is stable and the validation measures remain valid.

Thanks so much for your help!

(Malcolm McLean) #4

Thanks for responding, I appreciate hearing your insights about Kaggle. In this case, the predictions on the test set have the same distribution as those on training and validation, at least as checked visually by histogram. But I think I have found the actual cause of the mismatch; see my other post.


Don’t create a random validation set :wink: Or only create it once randomly, then save the names you picked and reuse those.
For a Kaggle competition you really need to carefully build your validation set in any case. Then, you should use the data block API to create your DataBunch since the factory method won’t work anymore. Just copy paste the code of the factory method and change the line for splitting.
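The "pick once, save the names, reuse them" advice can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not fastai code: `load_or_create_valid_names` and the split file are hypothetical, and the 20% default is arbitrary.

```python
import json
import random
from pathlib import Path

def load_or_create_valid_names(all_names, split_file, valid_pct=0.2, seed=None):
    """Return the validation names, creating and saving them on first use.

    Hypothetical helper: once split_file exists, it is the only source of
    truth, so later calls ignore valid_pct and seed.
    """
    split_file = Path(split_file)
    if split_file.exists():
        return set(json.loads(split_file.read_text()))
    names = sorted(all_names)  # sort so the draw does not depend on input order
    n_valid = int(len(names) * valid_pct)
    valid = random.Random(seed).sample(names, n_valid)
    split_file.write_text(json.dumps(sorted(valid)))
    return set(valid)
```

The returned set can then drive whichever name-based splitting step you put in place of the random split when building the DataBunch, so every session resumes with the same training and validation images.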


Would you mind talking a little more about the necessity of a bespoke validation set for Kaggle? I can see why conceptually, but this goes against all the practices I’ve ever learned about ML.


Carefully creating a validation dataset so that it matches the distribution of your data is pretty much best practice in ML, so I don’t understand what you’re referring to. Note that it might be a fully random split, but it depends on your data.
And you specifically need to be careful in a Kaggle competition because otherwise you’ll have lower results on the test set.


Gotcha – I guess I’ve always thought of taking a sufficiently large random sample of all the classes in classification as the best way to achieve that.

Rather – anything I might do by hand is likely to introduce bias that a large random sample won’t, I guess.