Accuracy and Validation Accuracy are Good, Test is no better than Random

I have been working with the Keras VGG16 model and I am running into some weird issues. I get decent acc and val_acc numbers, but when I go to actually predict on my test set, the results are no better than random. I’ve tried this with the Dogs vs. Cats competition and Deep Learning Challenge #1 and get similar results in both. When I predict on the validation set, everything looks good, but as soon as I predict on test my model acts like it’s never seen anything like these images before. Has anyone seen this, and am I missing some easy gotcha?

Here is what I am using:

%matplotlib inline
import keras
from keras import backend as K
from keras import applications
from keras.models import Sequential, Model
from keras.applications import VGG16
from keras.layers.core import Dense, Flatten, Dropout
from keras.preprocessing import image
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from glob import glob
import os
import sys
from keras.applications.vgg16 import preprocess_input
batch_size = 64
sys.tracebacklimit = 2

idg = image.ImageDataGenerator()  # preprocessing_function=preprocess_input is applied manually further down instead
# First pass just to find out how many images are in each directory...
trn_batches = idg.flow_from_directory("label_train_img", target_size=(224,224))
val_batches = idg.flow_from_directory("label_valid_img", target_size=(224,224))
# ...then pull every image back in a single batch
trn_batches = idg.flow_from_directory("label_train_img", target_size=(224,224), batch_size=trn_batches.n)
val_batches = idg.flow_from_directory("label_valid_img", target_size=(224,224), batch_size=val_batches.n)
trn_batch, trn_labels = trn_batches.next()
val_batch, val_labels = val_batches.next()

vgg16 = VGG16()
vgg16.layers.pop()                                 # drop the original 1000-way ImageNet softmax
for layer in vgg16.layers: layer.trainable=False   # freeze everything that came with the pretrained model
m = Dropout(0.5)(vgg16.layers[-1].output)
m = Dense(25, activation='softmax')(m)             # new 25-way softmax head for this challenge
vgg16 = Model(vgg16.input, m)
vgg16.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['accuracy'])

# Apply VGG16's ImageNet preprocessing (channel reordering / mean subtraction) to every image
for i in range(trn_batch.shape[0]):
    trn_batch[i] = preprocess_input(trn_batch[i])
for i in range(val_batch.shape[0]):
    val_batch[i] = preprocess_input(val_batch[i])

vgg16.fit(trn_batch, trn_labels, epochs=10, validation_data=(val_batch, val_labels))

test_batches = idg.flow_from_directory("label_test_img", target_size=(224,224))
test_batches = idg.flow_from_directory("label_test_img", target_size=(224,224),batch_size=test_batches.n)
test_batch, test_labels = test_batches.next()

test_filenames=test_batches.filenames
legend = trn_batches.class_indices          # class name -> index
legend = {y:x for x,y in legend.items()}    # invert to index -> class name

submission = pd.DataFrame(columns=['image_id','label'])
for i in range(0,test_batches.n):
    im = test_batch[i]
    im = preprocess_input(im)
    im_name = test_filenames[i].split("/")[1].split(".")[0]
    prediction = legend[np.argmax(vgg16.predict(np.array(im,ndmin=4)))]
    submission = submission.append({'image_id':im_name, 'label':prediction},ignore_index=True)
submission.to_csv("Submission_ImageClass.csv")

I am just at a loss because everything I’m doing seems to give really good results, but then I run it on the test dataset and get what looks like random output.

Accuracy Readout:

Train on 2591 samples, validate on 620 samples
Epoch 1/10
2591/2591 [==============================] - 11s - loss: 2.8741 - acc: 0.3350 - val_loss: 1.9315 - val_acc: 0.4661
Epoch 2/10
2591/2591 [==============================] - 11s - loss: 1.4932 - acc: 0.5901 - val_loss: 1.5955 - val_acc: 0.5468
Epoch 3/10
2591/2591 [==============================] - 11s - loss: 1.1251 - acc: 0.6708 - val_loss: 1.4350 - val_acc: 0.5935
Epoch 4/10
2591/2591 [==============================] - 11s - loss: 0.8492 - acc: 0.7240 - val_loss: 1.5399 - val_acc: 0.5806
Epoch 5/10
2591/2591 [==============================] - 11s - loss: 0.6500 - acc: 0.7862 - val_loss: 1.4338 - val_acc: 0.6097
Epoch 6/10
2591/2591 [==============================] - 11s - loss: 0.5653 - acc: 0.8163 - val_loss: 1.4801 - val_acc: 0.6016
Epoch 7/10
2591/2591 [==============================] - 11s - loss: 0.4419 - acc: 0.8530 - val_loss: 1.5429 - val_acc: 0.6081
Epoch 8/10
2591/2591 [==============================] - 11s - loss: 0.4050 - acc: 0.8680 - val_loss: 1.4104 - val_acc: 0.6371
Epoch 9/10
2591/2591 [==============================] - 11s - loss: 0.3518 - acc: 0.8842 - val_loss: 1.5256 - val_acc: 0.6226
Epoch 10/10
2591/2591 [==============================] - 11s - loss: 0.3148 - acc: 0.8927 - val_loss: 1.4675 - val_acc: 0.6452

Results on Test Data:

Also, side question about this post: Should it go in a different area of the forums since it isn’t technically part of Part 1? I decided to put it here because it is using the skills from Part 1 and it could potentially help somebody that is doing Part 1 work.

If you compare your submission CSV to the example submission, do they have the same format? One way to get bad scores on Kaggle is to submit a CSV that’s in a different format than they expect.
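For example, something like this is a quick way to compare the two files (I'm guessing at the sample file's name; use whatever the competition actually provides):

import pandas as pd

sample = pd.read_csv("sample_submission.csv")    # hypothetical filename
mine = pd.read_csv("Submission_ImageClass.csv")
print(list(sample.columns), list(mine.columns))  # column names should match exactly
print(len(sample), len(mine))                    # row counts should match
print(mine.head())                               # eyeball a few rows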

If that all looks fine, you could try creating your own test set (with labels) and run your model against that. Just make sure to never train or validate on those particular images. It’s possible you somehow “trained” on the validation images and that your validation set is no longer representative of the “real” data. (Although that seems unlikely since your validation accuracy is lower than the training accuracy.)

One way to get such a test set is to hand-label a bunch of the test set images.
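Since flow_from_directory only needs one subfolder per class, hand-labeling can be as simple as copying each labeled image into the right subfolder. A rough sketch (the folder names and the hand_labels dict here are made up):

import os, shutil

hand_labels = {"img_001.jpg": "cat", "img_002.jpg": "dog"}   # hypothetical labels
for fname, label in hand_labels.items():
    dst = os.path.join("my_labeled_test", label)
    os.makedirs(dst, exist_ok=True)
    shutil.copy(os.path.join("test_img", fname), dst)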


I think that might be what I have to do. Thanks for looking at it.

So I manually labeled 101 of the test images and ran my model against them, and it got 69/101 correct, so the model itself is fine and I can stop banging my head against that part. I must somehow be bringing down the wrong index from what I am training up top to what I am using as my legend below. I will keep updating this as I figure out the issue, but here is how I tested against the manually labeled images.

test_batches = idg.flow_from_directory("label_test_img", target_size=(224,224)) #These are manually labeled.
test_batches = idg.flow_from_directory("label_test_img", target_size=(224,224),batch_size=test_batches.n)
test_batch, test_labels = test_batches.next()

test_filenames=test_batches.filenames 
legend = trn_batches.class_indices 
legend = {y:x for x,y in legend.items()} 

total = 0
correct = 0
submission = pd.DataFrame(columns=['image_id','label_prediction', 'label_actual'])
for i in range(0,test_batches.n):
    im = test_batch[i]
    im = preprocess_input(im)
    im_name = test_filenames[i].split("/")[1].split(".")[0]
    prediction = legend[np.argmax(vgg16.predict(np.array(im,ndmin=4)))]
    actual = legend[np.argmax(test_labels[i])]
    total += 1
    if prediction == actual:
        correct += 1
    submission = submission.append({'image_id':im_name, 'label_prediction':prediction, 'label_actual':actual},ignore_index=True)
submission.to_csv("Submission_ImageClass_testknown.csv")
print(correct/total)
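As a quick sanity check on the "wrong index" theory, the class-to-index mappings from the two generators can also be compared directly. flow_from_directory builds class_indices alphabetically from the subfolder names, so they should agree as long as both directories contain the same class subfolders:

print(trn_batches.class_indices)
print(test_batches.class_indices)
assert trn_batches.class_indices == test_batches.class_indices  # same name -> index mapping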

Now that I’ve looked into it further, it almost has to be something with my submission file, but I have no idea what I could be doing incorrectly. It’s strange because I had a decent score in this competition using Jeremy’s VGG16, but now that I have switched to Keras’ VGG16 I must have done something to mess up the file I am submitting. If anyone has any gotchas on submitting files like this, or better ways to test them, I would be interested in hearing them. I compared my submission to the sample submission and, aside from the actual predictions, they are identical: no extra spaces or weird whitespace. I am at a complete loss. I’m almost wondering if something is wrong on their end, but then I remember that I had a submission that scored well just 18 days ago. I will keep looking into it, and I would love to hear where other people have had submissions score badly after getting good results locally.
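For anyone debugging the same thing, these are the extra checks I can think of to run on the file itself (the sample_submission.csv name is a guess; use whatever the competition provides):

import pandas as pd

sample = pd.read_csv("sample_submission.csv")             # hypothetical filename
mine = pd.read_csv("Submission_ImageClass.csv")
print(set(mine['image_id']) == set(sample['image_id']))   # do I cover exactly the expected ids?
print(mine['image_id'].duplicated().any())                # any repeated ids?
print(mine['label'].value_counts())                       # is one class suspiciously dominant?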

They are using F1 scoring, which I think works out here to just the number correct / total number (for a single-label multi-class problem, micro-averaged F1 is the same as plain accuracy).
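If you want to convince yourself of that locally, a quick toy check with scikit-learn shows micro-averaged F1 coming out the same as accuracy (the labels below are made up):

from sklearn.metrics import f1_score, accuracy_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
print(f1_score(y_true, y_pred, average='micro'))  # 0.666...
print(accuracy_score(y_true, y_pred))             # 0.666..., same number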

New plan is to re-download the datasets just in case they have changed. I’m going to be really frustrated if that’s the issue, but at this point I’d just be happy to find the cause. At least it doesn’t appear to be anything flawed with how I am training my model, which would be a lot more demoralizing.

Finally was able to figure out my issues and got a 0.68255 score!

My issue was all in my submission file. I was using flow_from_directory and trying to match my predictions back to the filenames, and that is where I was going wrong.

from keras.preprocessing import image
idg_tst = image.ImageDataGenerator()

test_batches = idg_tst.flow_from_directory("unlabel_test_img", target_size=(224,224))
test_batches = idg_tst.flow_from_directory("unlabel_test_img", target_size=(224,224),batch_size=test_batches.n) #this line is incorrect
test_batch, test_labels = test_batches.next()

This pulls all of the images into test_batch; test_labels is meaningless here since these are unlabeled images.

What I did next was

test_filenames=test_batches.filenames

This was my mistake. I assumed the filenames would line up with test_batch from above, but flow_from_directory defaults to shuffle=True, so the images come back in a different order than test_batches.filenames. I needed to explicitly pass shuffle=False on my second flow_from_directory call above. Huge lesson learned. I’m not sure this is how flow_from_directory should work, but it is how it does work, so watch out if you use it this way!
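For anyone hitting the same thing, the working version of that block is the same code with shuffle=False on the call I actually draw the batch from, so that test_batch lines up with test_batches.filenames:

test_batches = idg_tst.flow_from_directory("unlabel_test_img", target_size=(224,224))
test_batches = idg_tst.flow_from_directory("unlabel_test_img", target_size=(224,224),
                                           batch_size=test_batches.n, shuffle=False)  # no shuffling, so order matches filenames
test_batch, test_labels = test_batches.next()
test_filenames = test_batches.filenames  # now aligned with test_batch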