Low accuracy but overfitting?

I’ve just gone through the first three lessons of the first course, and am now trying to practice my skills by applying them to the first Data Science Bowl competition, the one on plankton classification. I’ve built a VGG-like model and am trying to train it from scratch on the plankton data. While the model has an okay training accuracy (23% first epoch, 31% second epoch, etc.), the validation accuracy is very low - 0.5% after the first epoch, and not improving.

Initially, I thought this was a sign of overfitting and tried some of the techniques mentioned in lesson 3 - adding data augmentation and dropout layers - but this did not seem to help much. My model has a 50% dropout layer before each dense layer (it has 3) and a 20% dropout halfway through the 4 convolution layers, and the validation accuracy has not significantly changed. I’ve also tried simplifying the model, but with 3 convolution layers and 2 dense layers the training accuracy drops to 6% first epoch, 8% second epoch, etc. and the validation accuracy is still below 1%. Is the problem still overfitting? What can I do to increase the validation accuracy?

I think maybe I should rephrase what I asked. Basically, I have two questions:

  1. Does the training accuracy of a model have to be higher than a certain point for overfitting to be a problem? If the training accuracy is 6% and the validation accuracy is 0.6%, am I overfitting or underfitting (or neither)? What about at 30% and 0.6%?

  2. If overfitting is the problem, what can I do to avoid it beyond what I have already done?

If you need some more information about my model: I am using the Kaggle Data Science Bowl competition dataset (the one on plankton classification) with a 90:10 split of training to validation data. I am using the following data augmentations:

rotation_range=360,
width_shift_range=0.05,
height_shift_range=0.05,
shear_range=0.10,
zoom_range=0.3,
rescale=1./255,
horizontal_flip=True,
vertical_flip=True

And the following model (which does have some random stuff in it I don’t fully understand, added in an attempt to increase accuracy and reduce overfitting):

model = Sequential([
    Convolution2D(32,3,3, border_mode='same', init='he_uniform', input_shape=(1,224,224)), # convolution layers; input is 1x224x224 grayscale
    PReLU(),
    BatchNormalization(axis=1),
    Convolution2D(32,1,1, border_mode='same', init='he_uniform'), # depthwise separable?
    PReLU(),
    BatchNormalization(axis=1),
    Convolution2D(16,3,3, border_mode='same', init='he_uniform'),
    MaxPooling2D((4,4)), # max pooling
    PReLU(),
    BatchNormalization(axis=1),
    Dropout(0.2),
    Convolution2D(64,3,3, border_mode='same', init='he_uniform'),
    PReLU(),
    BatchNormalization(axis=1),
    Convolution2D(128,3,3, border_mode='same', init='he_uniform'),
    PReLU(),
    BatchNormalization(axis=1),
    Convolution2D(128,1,1, border_mode='same', init='he_uniform'), # depthwise separable?
    PReLU(),
    BatchNormalization(axis=1),
    Convolution2D(64,3,3, border_mode='same', init='he_uniform'),
    PReLU(),
    BatchNormalization(axis=1),
    
    Flatten(),
    Dropout(0.5),
    Dense(512, init='he_uniform'),
    PReLU(),
    BatchNormalization(),
    Dropout(0.5),
    Dense(512, init='he_uniform'),
    PReLU(),
    BatchNormalization(),
    Dropout(0.5),
    Dense(121, activation='softmax', init='he_uniform')
])

and getting the following results using a learning rate of 0.02 and a batch size of 64 (notice that val_acc does not change):

Epoch 1/1
25786/25786 [==============================] - 966s - loss: 3.5803 - acc: 0.2331 - val_loss: 15.8057 - val_acc: 0.0055
Epoch 1/1
25786/25786 [==============================] - 966s - loss: 2.7237 - acc: 0.3129 - val_loss: 16.0225 - val_acc: 0.0055
Epoch 1/1
25786/25786 [==============================] - 966s - loss: 2.5801 - acc: 0.3360 - val_loss: 15.9979 - val_acc: 0.0055

I am quite new to this and would appreciate any pointers on relevant material to read or just a general direction on how to improve these results.

If training accuracy is much higher than validation accuracy, then what the neural net has learned from your training data does not accurately represent your validation data.

I would first verify that your validation data is similar to your training data.

In other words, if your training data is images of a certain size, does your validation set consist of similar images? If preprocessing happens on the training images (subtracting mean RGB etc), does that also happen on the validation images? And so on…

So first make sure your data is OK before you do anything else.

Thanks for the advice!

I’ve visually verified about a hundred images from the training and validation sets (with the scaling turned off), and they appear similar enough to me - the augmentations look realistic, shuffling shouldn’t make a difference, and all other preprocessing is the same. They are all grayscale 224x224 images scaled by a factor of 1/255. The only possible problem I can think of with regards to the data is the split. The validation set is a random 10% of the original training data from Kaggle, so the distribution of categories could be slightly different (edit: just checked, the biggest difference in category representation is about 0.3%). Is this a problem? And could 4550 validation images be too few for reliable results?
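Something like this sketch is enough for that check (assuming the usual one-subdirectory-per-class folder layout; train_path and valid_path are the directory variables from my code further down):

import os

def class_counts(path):
    # number of images per class subdirectory
    return {c: len(os.listdir(os.path.join(path, c)))
            for c in sorted(os.listdir(path))}

train_counts = class_counts(train_path)
valid_counts = class_counts(valid_path)
n_train = float(sum(train_counts.values()))
n_valid = float(sum(valid_counts.values()))

# largest absolute difference in per-class representation (as a fraction)
print(max(abs(train_counts[c]/n_train - valid_counts.get(c, 0)/n_valid)
          for c in train_counts))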

What I would do to make 100% sure is print out some of the training images after preprocessing has been applied to them, and also some of the validation images, and make sure the numbers are all in the same range (similar min/max/mean/std values).

Maybe you missed something and the validation images get preprocessed slightly differently than the training images (also, make sure no data augmentation is done on the validation images).

A 10% random split should work fine. Even if you “only” have 4550 validation images, your validation accuracy should still be close to the training accuracy – not the huge difference you’re seeing here.

Hmm, not sure if I am missing something here. I am checking the training and validation images after preprocessing/augmentation, using the save_to_dir parameter of the flow_from_directory method in Keras. I’m not doing any augmentation on the validation set. I’m not sure how to check the min/max/mean/stddev values of the image dataset, but visually they look identical to my eyes. My concern with only 4550 validation images is that some categories have only 1 image in the validation set (the 121 categories are not evenly distributed), but perhaps this is not a problem.

You can call next(your_generator) and it will give you a new batch of data.

That is exactly what I’ve been doing to generate preprocessed images to inspect. I’m just not sure how to get the summary statistics from the generator. You can see my code below - the shuffle=True is temporary, so I get most of the plankton types in the first 3 batches.

def get_batches(dirname, temp_dir=None, augment=True, shuffle=True):
    if augment:
        gen = image.ImageDataGenerator(rotation_range=360,
                                       width_shift_range=0.05,
                                       height_shift_range=0.05,
                                       shear_range=0.10,
                                       zoom_range=0.3,
                                       rescale=1./255,
                                       horizontal_flip=True,
                                       vertical_flip=True)
    else:
        gen = image.ImageDataGenerator(rescale=1./255)

    return gen.flow_from_directory(dirname,
                                  target_size=(224,224),
                                  class_mode='categorical',
                                  color_mode='grayscale',
                                  shuffle=shuffle,
                                  save_to_dir=temp_dir,
                                  batch_size=batch_size)

batches = get_batches(train_path, temp_dir=temp_path)
val_batches = get_batches(valid_path, temp_dir=temp_path2, shuffle=True, augment=False)

for i in range(3):
    batch = batches.next() # save 3 batches of images
    val_batch = val_batches.next()


plots([image.load_img(os.path.join(temp_path, img)) for img in random.sample(os.listdir(temp_path), 8)])
plots([image.load_img(os.path.join(temp_path2, img)) for img in random.sample(os.listdir(temp_path2), 8)])

Since batch is just a numpy array (or probably two arrays, one for the data, one for the labels) you can do things like:

X, y = batch
print(np.min(X), np.max(X), np.mean(X), np.std(X))

Your data is probably fine since you were able to visually confirm that the images look OK, and I see nothing weird in the code above, but by printing these statistics you can be sure. For example, if the training images have min 0 and max 255 but the validation images have min -1 and max +1, then something is up there.
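For instance, with the batches and val_batches generators from your code above, a quick sketch like this compares the two sets directly:

import numpy as np

# pull one batch from each generator and compare summary statistics
X_train, y_train = batches.next()
X_val, y_val = val_batches.next()

for name, X in [('train', X_train), ('valid', X_val)]:
    print(name, np.min(X), np.max(X), np.mean(X), np.std(X))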

Once you’ve verified that the images really are 100% correct, my next step would be to create a very basic model with just two Dense layers and nothing else, and see how that behaves while training.

I’ve checked the statistics between the sets and they appear to match reasonably well - there’s no consistent pattern of difference between the mins, maxes, means, and stddevs. I went ahead and tested the idea of a very basic model with just two dense layers, as follows:

# super-simple model for testing
model = Sequential([
    BatchNormalization(axis=1, input_shape=(1,224,224)),
    Flatten(),
    Dense(512, init='he_uniform'),
    PReLU(),
    BatchNormalization(),
    Dense(121, activation='softmax', init='he_uniform')
])

and somehow got the following (with 0.02 learning rate):

Epoch 1/1
24269/24269 [==============================] - 127s - loss: 3.4677 - acc: 0.2222 - val_loss: 3.6337 - val_acc: 0.2100
Epoch 1/1
24269/24269 [==============================] - 127s - loss: 2.9663 - acc: 0.2755 - val_loss: 2.6406 - val_acc: 0.3221
Epoch 1/1
24269/24269 [==============================] - 127s - loss: 2.7689 - acc: 0.3072 - val_loss: 2.4368 - val_acc: 0.3610

I suppose that fixes the problem I was having with validation accuracy - clearly the 3 convolution layers and 3 dense layers I tried weren’t simple enough to avoid overfitting in this case! I’m surprised at how well this model works even without any convolution layers, though it seems to max out at around 40% accuracy.

I suppose my only problem now is getting above 40% with a more complex model, which is proving rather difficult - adding more layers seems to only decrease the accuracy (at least in the first epoch), for some reason.

I’m still not convinced the issue is overfitting. Even if you’re massively overfitting, it should still predict some of the validation images correctly.

I suspect there is an issue with the design of the network, or how you’re training it. But it’s not immediately obvious to me from the code you’ve posted.

What I would do next is keep adding layers one-by-one to the simplified network, train for a few batches, and see at what point you start getting these really bad validation accuracies.
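For example (just a sketch, reusing the layer types from your models above), the first step could be the working two-Dense-layer model plus a single small convolutional block; if the validation accuracy still tracks the training accuracy, add the next block and repeat:

from keras.models import Sequential
from keras.layers.core import Dense, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU

# step 1: the known-good Dense model with one small conv block in front
model = Sequential([
    BatchNormalization(axis=1, input_shape=(1,224,224)),
    Convolution2D(16,3,3, border_mode='same', init='he_uniform'),
    PReLU(),
    BatchNormalization(axis=1),
    MaxPooling2D((4,4)),
    Flatten(),
    Dense(512, init='he_uniform'),
    PReLU(),
    BatchNormalization(),
    Dense(121, activation='softmax', init='he_uniform')
])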

Have you tried using a simpler CNN model, like 64,3,3 -> MaxPool(2,2) -> 128,3,3? I’m quite new to CNNs, but the architecture you have looks unusual to me. I’ve noticed they usually have an increasing number of filters the deeper they go, to make up for the data loss from the max-pooling layers. Also, have you checked your val_acc without using data augmentation?
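In Keras 1 syntax, what I have in mind is roughly something like this (just a sketch, not tuned for this dataset; note how the filter count grows after each max-pooling step):

from keras.models import Sequential
from keras.layers.core import Dense, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization

# small CNN where the number of filters increases with depth
model = Sequential([
    BatchNormalization(axis=1, input_shape=(1,224,224)),
    Convolution2D(64,3,3, border_mode='same', init='he_uniform', activation='relu'),
    MaxPooling2D((2,2)),
    Convolution2D(128,3,3, border_mode='same', init='he_uniform', activation='relu'),
    MaxPooling2D((2,2)),
    MaxPooling2D((4,4)), # extra pooling so the final Dense layer stays a reasonable size
    Flatten(),
    Dense(121, activation='softmax', init='he_uniform')
])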

Hi
You could try lowering the learning rate.
I believe in lesson 4 Jeremy says something about a high learning rate making the loss function approximation take huge jumps and basically never converge.
I actually improved the val_acc by 10% today on a CNN model with 36 classes by lowering the LR from 0.001 to 0.0005.
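In code that’s just a matter of compiling with a smaller lr, e.g. something like this (the Adam optimizer and categorical crossentropy loss here are just examples, not necessarily what you’re using):

from keras.optimizers import Adam

# recompile with a smaller learning rate before continuing training
model.compile(optimizer=Adam(lr=0.0005),
              loss='categorical_crossentropy',
              metrics=['accuracy'])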

I did end up using a simpler model with maxout layers, as shown below:

model = Sequential([
    BatchNormalization(axis=1, input_shape=(1,48,48)),
    
    Convolution2D(256,4,4,init='he_uniform'), # convolution layers
    PReLU(),
    BatchNormalization(axis=1),
    MaxPooling2D(strides=(2,2)),
    Dropout(0.1),
    
    Convolution2D(384,3,3,init='he_uniform'),
    PReLU(),
    BatchNormalization(axis=1),
    MaxPooling2D(strides=(2,2)),
    Dropout(0.2),
    
    Convolution2D(1024,3,3,init='he_uniform'),
    PReLU(),
    BatchNormalization(axis=1),
    MaxPooling2D(strides=(2,2)),
    #Lambda(Maxout),
    Dropout(0.3),
    
    Flatten(),
    
    MaxoutDense(1024, nb_feature=2, init='he_uniform'),
    PReLU(),
    BatchNormalization(),
    Dropout(0.5),
    
    MaxoutDense(1024, nb_feature=2, init='he_uniform'),
    PReLU(),
    BatchNormalization(),
    Dropout(0.5),
    
    Dense(121, activation='softmax', init='he_uniform')
])

It seems to have maxed out at just under 75% accuracy, and Kaggle says my submission would have been in the top 20% of submitted results. I used the Adam optimizer with an lr of 0.000001 for about 200 epochs. I’m sure there are some non-optimal things going on with my architecture (especially since the competition submissions are 2 years old), but I’m reasonably happy with it for now. The val_acc is checked without any data augmentation (besides rescaling to the same resolution), and I don’t think the lr is too high anymore, so the remaining gap is probably mostly down to the architecture.
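For completeness, the training setup was roughly along these lines (a sketch - the loss and the exact fit_generator call are reconstructed rather than copied, and the generators here would use target_size=(48,48) to match this model’s input):

from keras.optimizers import Adam

# very low learning rate, trained for many epochs; the validation generator
# uses no augmentation (only the 1/255 rescale), as in get_batches earlier
model.compile(optimizer=Adam(lr=0.000001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit_generator(batches,
                    samples_per_epoch=batches.nb_sample,
                    nb_epoch=200,
                    validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)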