Overfitting validation set?

I’m seeing results I can’t quite explain…

I’ve got 300,000 medical images with binary labels (positive/negative). I’ve set aside around 30,000 for validation and another 30,000 as a “blind” validation set to be used only after I feel comfortable that the model is performing well. I chose the validation data specifically to avoid any correlations between the training and validation sets.

My network consists of four convolutional layers and three fully connected layers with no layer wider than 128 nodes.
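For concreteness, it’s roughly this kind of thing (a minimal Keras sketch; the filter counts, kernel sizes, and 64x64 single-channel input are placeholders I’ve picked for illustration, and the batch norm and dropout I mention further down are left out for brevity):

```python
# Rough sketch only: 4 conv layers + 3 fully connected layers, nothing wider than 128.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary positive/negative output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```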

I’m seeing my training accuracy climb above 95% after training on only around a third of the data, and the validation accuracy reflects this too. Excellent, right? Not quite…

Long story short, from my tests, I think I’m overfitting my training set and it’s somehow reflected in my validation data set even though the validation set was created from separate data.

So here’s my question: how is it possible to overfit all 300,000 training images after training on only 100,000, and have this reflected in the validation data too? My intuition tells me that if a model trained on 100,000 images can correctly predict the remaining 200,000 plus the 30,000 validation images, then it’s genuinely finding a predictive pattern. The accuracy holds up even when I add an image generator that horizontally and vertically flips the validation images. The definition of overfitting is the inability of a model to accurately predict data it has not seen. How can this be overfitting?

Any shared intuition would be greatly appreciated.

Working on the DREAM challenge too, eh?

Are you randomizing on images, exams, or subjects? Anything less than subject-level randomization will cause information leakage.
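If it’s not obvious how to do that, a subject-level split is just something like this (a sketch using scikit-learn’s GroupShuffleSplit; the arrays are toy placeholders for however you index your images, labels, and patient IDs):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy placeholders: in practice these come from your dataset index.
image_paths = np.array([f"img_{i}.png" for i in range(1000)])
labels = np.random.randint(0, 2, size=1000)
subject_ids = np.repeat(np.arange(100), 10)  # e.g. 10 images per subject

# Split at the subject level so no patient appears in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, val_idx = next(splitter.split(image_paths, labels, groups=subject_ids))

# Sanity check: no subject leaks across the split.
assert set(subject_ids[train_idx]).isdisjoint(subject_ids[val_idx])
```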

You don’t say why you think you’re overfitting. You mention you’re using data augmentation to reduce overfitting; what about the other techniques he describes, i.e. dropout, batch norm, and ensembling?

Nope, not familiar with that one, but from what you’ve said I can see how there’s a danger of correlation there.

My running theory is that there’s an edge-effect artifact from how I generate the images, and that it differs between positive and negative images. I added a progressive “vignette” to remove edge information, and that seems to flatline the accuracy at around 85%, although I may actually be removing so much information from the small images that the model is simply failing to find a pattern.
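In case it helps to see it, the vignette is basically a radial mask along these lines (a NumPy sketch; the falloff power and the 64x64 patch size are placeholder values I’ve been tuning by eye):

```python
import numpy as np

def apply_vignette(img, power=2.0):
    """Fade a 2D patch toward zero at the borders to suppress edge artifacts."""
    h, w = img.shape
    y = np.linspace(-1.0, 1.0, h)[:, None]
    x = np.linspace(-1.0, 1.0, w)[None, :]
    r = np.sqrt(x ** 2 + y ** 2) / np.sqrt(2.0)  # 0 at the center, 1 in the corners
    mask = np.clip(1.0 - r, 0.0, 1.0) ** power   # progressive falloff toward the edges
    return img * mask

patch = np.random.rand(64, 64)  # stand-in for a sampled training patch
vignetted = apply_vignette(patch)
```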

Sorry, I’ve gone off on a rant a bit here…

The short of it is that the images are very large, so I randomly sample smaller images from them for training and validation. When I take an image that occurs in the validation set, fully sample it into smaller images, and send those through the model for prediction, I get bad results: nearly everything rails to positive or negative. When I repeat this with a separate (and lower-accuracy) retrained VGG network, I don’t see this railing behavior, so it’s unlikely that I’m doing something stupid with my full sampling. Unlikely, though not impossible :)
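The full sampling is just a sliding-window tiling along these lines (a sketch; the 64-pixel patch size and stride are placeholders, and `model` stands in for the trained network from the sketch above):

```python
import numpy as np

def tile_patches(full_img, patch=64, stride=64):
    """Cut a large image into a grid of smaller patches."""
    tiles = []
    for y in range(0, full_img.shape[0] - patch + 1, stride):
        for x in range(0, full_img.shape[1] - patch + 1, stride):
            tiles.append(full_img[y:y + patch, x:x + patch])
    return np.stack(tiles)

full_img = np.random.rand(1024, 1024)              # stand-in for one large validation image
patches = tile_patches(full_img)[..., None]        # add a channel axis for the model
probs = model.predict(patches, verbose=0).ravel()  # per-patch probabilities
print(probs.mean(), (probs > 0.99).mean())         # mean score and fraction railed high
```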

I’m using augmentation (horizontal and vertical flips), batch normalization, and a fair amount of dropout, progressively increasing between the convolutional layers and maxing out at 0.5 between the fully connected layers.
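The augmentation is just flips, along these lines (a Keras ImageDataGenerator sketch; the array shapes and batch size are placeholders):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Flip-only augmentation, as described above.
augmenter = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)

# Stand-in arrays; in practice these are the sampled training patches and labels.
x_train = np.random.rand(256, 64, 64, 1)
y_train = np.random.randint(0, 2, size=256)

train_flow = augmenter.flow(x_train, y_train, batch_size=32)
# model.fit(train_flow, epochs=..., validation_data=...)
```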

Without data-level correlation, how can one get overfitting that is reflected in the validation set?