New dense layers not training at all, resulting in 0.5 accuracy [SOLVED]

MarkD · December 8, 2017, 7:18pm

I’m working through the notebooks and I’m currently around lessons 2 and 3. I’m experimenting with layers in Keras and want to try training different dense layer configurations. To keep this clean, I thought I would export the convolutional part of the VGG net and then, for each experiment, load that part of the net, create completely new dense layers and train (in the example below, to keep things simple, I’m just using the same architecture of dense layers). My problem is that the training does not even start to converge and in fact seems to settle on exactly 50% error. Clearly I must be missing something in my code or understanding… but I’m not seeing it!

import utils; reload(utils)
from utils import *
%matplotlib inline

from vgg16 import Vgg16
vgg = Vgg16()
model = vgg.model
layers = model.layers

Get the index of the first dense layer…

first_dense_idx = [index for index,layer in enumerate(layers) if type(layer) is Dense][0]

Paths for the data and for saving the model and weights

path = "data/redux/sample/"
models_path = "data/redux/models/"
train_path = path + "train/"
valid_path = path + "valid/"

Drop this and all subsequent layers

num_del = len(layers) - first_dense_idx
for _ in range(num_del): model.pop()

Set all layers to non-trainable (these are the conv layers)

for layer in model.layers: layer.trainable=False

Serialize the model to JSON and save it to a file

model_json = model.to_json()
with open(models_path + "vgg16_conv.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights(models_path + "vgg16_conv.h5")

Now load the model (I usually do this in a different notebook)

from keras.models import model_from_json

Note: I had to move this line inside vgg_preprocess in vgg16.py for the saving/loading to work properly

vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3,1,1))

Load the JSON and create the model

with open(models_path + "vgg16_conv.json", "r") as json_file:
    loaded_model_json = json_file.read()
model = model_from_json(loaded_model_json)

Load the weights into the new model

model.load_weights(models_path + "vgg16_conv.h5")
print("Loaded model from disk")

# get the batches
batch_size = 64
train_batches = image.ImageDataGenerator().flow_from_directory(train_path, target_size=(224,224),
    class_mode='categorical', shuffle=True, batch_size=batch_size)

valid_batches = image.ImageDataGenerator().flow_from_directory(valid_path, target_size=(224,224),
    class_mode='categorical', shuffle=True, batch_size=batch_size)

num_target_classes = train_batches.nb_class

the model is now loaded and the weights loaded into it

add dense layers… in this example i’m just adding the same architecture of layers as the original vgg

model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_target_classes, activation='softmax'))

As I understand it, I should now have a model whose convolutional layers are the same as VGG’s, with the same weights, and these layers should be non-trainable. The dense layers have the same architecture as VGG’s but fresh random weights, ready to be trained.

I use SGD with a fairly large learning rate, as the weights are completely random.

model.compile(optimizer=SGD(lr=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

now retrain the model:

model.fit_generator(train_batches, samples_per_epoch=train_batches.nb_sample, nb_epoch=50,
                    validation_data=valid_batches, nb_val_samples=valid_batches.nb_sample)

I get the following output:

Epoch 1/50
200/200 [==============================] - 7s - loss: 7.2976 - acc: 0.5200 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 2/50
200/200 [==============================] - 6s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 3/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 4/50
200/200 [==============================] - 6s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 5/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 6/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 7/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 8/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 9/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 10/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 11/50
200/200 [==============================] - 6s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 12/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 13/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 14/50
200/200 [==============================] - 6s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600
Epoch 15/50
200/200 [==============================] - 7s - loss: 7.9785 - acc: 0.5050 - val_loss: 8.7038 - val_acc: 0.4600

… etc
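A side note on the numbers in this log: the frozen loss values are what one would expect if the softmax had saturated, meaning the network outputs one class with probability of about 1 for every image. The sketch below is only an illustration. It assumes Keras clips predicted probabilities to [1e-7, 1 - 1e-7] before taking the log (its default backend epsilon), and the function name `saturated_crossentropy` is made up for this example.

```python
import math

EPSILON = 1e-7  # assumed clipping value (Keras default backend epsilon)


def saturated_crossentropy(accuracy, eps=EPSILON):
    """Mean cross-entropy when every prediction is a clipped one-hot vector.

    Correctly classified samples get probability 1 - eps on the true class,
    misclassified samples get probability eps on the true class.
    """
    loss_correct = -math.log(1.0 - eps)
    loss_wrong = -math.log(eps)
    return accuracy * loss_correct + (1.0 - accuracy) * loss_wrong


# Accuracies taken from the training log above (train and validation).
for acc in (0.5050, 0.4600):
    print("acc=%.4f -> loss=%.4f" % (acc, saturated_crossentropy(acc)))
```

Under that assumption, an accuracy of 0.5050 gives a loss of about 7.9785 and an accuracy of 0.4600 gives about 8.7038, which are the values in the log. That suggests the predictions are pinned at 0/1 rather than hovering near 0.5.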

model.summary gives the following output

Layer (type) Output Shape Param # Connected to

lambda_3 (Lambda) (None, 3, 224, 224) 0 lambda_input_6[0][0]

zeropadding2d_27 (ZeroPadding2D) (None, 3, 226, 226) 0 lambda_3[0][0]

convolution2d_27 (Convolution2D) (None, 64, 224, 224) 0 zeropadding2d_27[0][0]

zeropadding2d_28 (ZeroPadding2D) (None, 64, 226, 226) 0 convolution2d_27[0][0]

convolution2d_28 (Convolution2D) (None, 64, 224, 224) 0 zeropadding2d_28[0][0]

maxpooling2d_11 (MaxPooling2D) (None, 64, 112, 112) 0 convolution2d_28[0][0]

zeropadding2d_29 (ZeroPadding2D) (None, 64, 114, 114) 0 maxpooling2d_11[0][0]

convolution2d_29 (Convolution2D) (None, 128, 112, 112) 0 zeropadding2d_29[0][0]

zeropadding2d_30 (ZeroPadding2D) (None, 128, 114, 114) 0 convolution2d_29[0][0]

convolution2d_30 (Convolution2D) (None, 128, 112, 112) 0 zeropadding2d_30[0][0]

maxpooling2d_12 (MaxPooling2D) (None, 128, 56, 56) 0 convolution2d_30[0][0]

zeropadding2d_31 (ZeroPadding2D) (None, 128, 58, 58) 0 maxpooling2d_12[0][0]

convolution2d_31 (Convolution2D) (None, 256, 56, 56) 0 zeropadding2d_31[0][0]

zeropadding2d_32 (ZeroPadding2D) (None, 256, 58, 58) 0 convolution2d_31[0][0]

convolution2d_32 (Convolution2D) (None, 256, 56, 56) 0 zeropadding2d_32[0][0]

zeropadding2d_33 (ZeroPadding2D) (None, 256, 58, 58) 0 convolution2d_32[0][0]

convolution2d_33 (Convolution2D) (None, 256, 56, 56) 0 zeropadding2d_33[0][0]

maxpooling2d_13 (MaxPooling2D) (None, 256, 28, 28) 0 convolution2d_33[0][0]

zeropadding2d_34 (ZeroPadding2D) (None, 256, 30, 30) 0 maxpooling2d_13[0][0]

convolution2d_34 (Convolution2D) (None, 512, 28, 28) 0 zeropadding2d_34[0][0]

zeropadding2d_35 (ZeroPadding2D) (None, 512, 30, 30) 0 convolution2d_34[0][0]

convolution2d_35 (Convolution2D) (None, 512, 28, 28) 0 zeropadding2d_35[0][0]

zeropadding2d_36 (ZeroPadding2D) (None, 512, 30, 30) 0 convolution2d_35[0][0]

convolution2d_36 (Convolution2D) (None, 512, 28, 28) 0 zeropadding2d_36[0][0]

maxpooling2d_14 (MaxPooling2D) (None, 512, 14, 14) 0 convolution2d_36[0][0]

zeropadding2d_37 (ZeroPadding2D) (None, 512, 16, 16) 0 maxpooling2d_14[0][0]

convolution2d_37 (Convolution2D) (None, 512, 14, 14) 0 zeropadding2d_37[0][0]

zeropadding2d_38 (ZeroPadding2D) (None, 512, 16, 16) 0 convolution2d_37[0][0]

convolution2d_38 (Convolution2D) (None, 512, 14, 14) 0 zeropadding2d_38[0][0]

zeropadding2d_39 (ZeroPadding2D) (None, 512, 16, 16) 0 convolution2d_38[0][0]

convolution2d_39 (Convolution2D) (None, 512, 14, 14) 0 zeropadding2d_39[0][0]

maxpooling2d_15 (MaxPooling2D) (None, 512, 7, 7) 0 convolution2d_39[0][0]

flatten_3 (Flatten) (None, 25088) 0 maxpooling2d_15[0][0]

dense_17 (Dense) (None, 4096) 102764544 flatten_3[0][0]

dropout_13 (Dropout) (None, 4096) 0 dense_17[0][0]

dense_18 (Dense) (None, 4096) 16781312 dropout_13[0][0]

dropout_14 (Dropout) (None, 4096) 0 dense_18[0][0]

dense_19 (Dense) (None, 2) 8194 dropout_14[0][0]

Total params: 119554050
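As a quick arithmetic check on the summary above, the parameter count of a dense layer is inputs × outputs plus one bias per output. The helper name `dense_params` is just for illustration; the layer sizes are read off the summary.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


flatten_size = 512 * 7 * 7  # output of the last max-pooling layer, flattened

dense_17 = dense_params(flatten_size, 4096)
dense_18 = dense_params(4096, 4096)
dense_19 = dense_params(4096, 2)

print(dense_17, dense_18, dense_19)
print("total:", dense_17 + dense_18 + dense_19)
```

The three dense layers alone add up to the reported total of 119554050, which is consistent with the convolutional rows in this summary listing 0 parameters.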

ramesh · December 8, 2017, 7:38pm

Interesting. Are you using the Theano backend (channels first: channels, height, width)? With the TF backend (channels last), the input layer and images should be 224x224x3.

MarkD · December 8, 2017, 7:55pm

It’s Theano. I’m doing cats and dogs redux, so there is very little difference. The only difference should be that the dense layers are completely new with fresh weights.

ramesh · December 8, 2017, 8:02pm

Take a look here, it might give you ideas on this problem - https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

Basically your model is not learning. If you want to normalize the values, you should pass a preprocessing function to ImageDataGenerator (using the preprocessing_function argument, as per the example - https://keras.io/applications/#vgg16).

Don’t do any Dropout yet until your model learns. Also set the learning rate to 0.1 and run for a few epochs to see if the training loss changes.

MarkD · December 8, 2017, 8:20pm

Ramesh, thanks for the link etc. I’m really trying out something that should be absolutely basic here. Am I missing something fundamental? With the conv part of VGG, some fresh dense layers and only 200 input images, I had expected I would get rapid convergence and probably overfitting. Instead I seem to be rapidly converging on an accuracy of about 50%, i.e. as good as a random guess. I tried increasing the learning rate to 0.1 and removing the dropout but still get the following (note that after the second epoch nothing changes):

Epoch 1/50
200/200 [==============================] - 7s - loss: 6.5257 - acc: 0.5200 - val_loss: 7.4143 - val_acc: 0.5400
Epoch 2/50
200/200 [==============================] - 7s - loss: 8.1396 - acc: 0.4950 - val_loss: 7.4143 - val_acc: 0.5400
Epoch 3/50
200/200 [==============================] - 7s - loss: 8.1396 - acc: 0.4950 - val_loss: 7.4143 - val_acc: 0.5400
Epoch 4/50
200/200 [==============================] - 6s - loss: 8.1396 - acc: 0.4950 - val_loss: 7.4143 - val_acc: 0.5400

ramesh · December 8, 2017, 8:23pm

Yeah…That’s very strange. But definitely happened to me before. Can you share your notebook at https://gist.github.com/ and I can try to take a look.

MarkD · December 8, 2017, 8:46pm

thank you. here is the file:

https://gist.github.com/markddesimone/3501f9811f00562eb1ba831ced2c90f5

Simple-Train-of-New-Dense-Layers.ipynb

ramesh · December 8, 2017, 9:59pm

Thanks. I tried to replicate it, but it looks like I don’t have an environment set up for Keras 1.2 and Theano. One thing I would suggest: change the loss function to binary_crossentropy and replace softmax with sigmoid, since there are only two classes.

Also, can you run model.summary() in your notebook? Maybe others who know can chime in as well.

MarkD · December 8, 2017, 10:33pm

Ramesh, thanks for trying. I simplified the code by eliminating the save to disk and reload of the model and weights. So all it does is load vgg, pop the dense layers and re-add new dense layers. I updated the version on gist to this super simple version. I tried binary_crossentropy and sigmoid but same result unfortunately. Hopefully someone can shed some light on this. I’m sure it must be something stupid.

ramesh · December 8, 2017, 11:01pm

@MarkD - You may want to set class_mode to binary in the ImageDataGenerator.flow_from_directory call if you are using binary_crossentropy. It would be good if you could share your findings here once you figure it out.

MarkD · December 8, 2017, 11:28pm

changing to binary doesn’t help and introduces a new problem.
using class_mode=‘categorical’ and loss=‘categorical_crossentropy’ it runs as described above with accuracy of 0.5.
changing to class_mode=‘binary’ and loss=‘binary_crossentropy’ i get the following exception:

Exception: Error when checking model target: expected dense_6 to have shape (None, 2) but got array with shape (64, 1)

This is weird because model.summary shows dense_6 as:
dense_6 (Dense) (None, 2) 8194 dense_5[0][0]

I’ll certainly share if I get a solution!
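A possible reading of that exception, offered as a sketch rather than a confirmed diagnosis: class_mode='binary' produces one 0/1 label per image, while a final Dense layer with two units expects two target values per image. The plain-Python example below only illustrates the two target shapes; the helper names and the alternating batch of 64 labels are made up for the illustration.

```python
def to_one_hot(binary_labels, num_classes=2):
    """Convert 0/1 class indices into one-hot rows."""
    return [[1.0 if cls == label else 0.0 for cls in range(num_classes)]
            for label in binary_labels]


def shape(rows):
    """Return (n_rows, n_cols) for a list of equal-length rows."""
    return (len(rows), len(rows[0]))


# Hypothetical batch of 64 class indices (alternating cats and dogs).
labels = [i % 2 for i in range(64)]

binary_targets = [[float(label)] for label in labels]  # one column per image
categorical_targets = to_one_hot(labels)               # two columns per image

print(shape(binary_targets))       # (64, 1)
print(shape(categorical_targets))  # (64, 2)
```

If that reading is right, a binary setup would need a single-unit output, e.g. Dense(1, activation='sigmoid'), rather than Dense(2, ...). This was not tried in the thread, so treat it as an assumption.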

MarkD · December 9, 2017, 1:01am

When I remove and replace the last dense layer it works as expected, converges to about 99% accuracy (overfitting)
When I remove and replace the last two dense layers it works as above (overfitting)
When I remove and replace all three dense layers it gets reported problem behavior, accuracy 50%

So the mystery is why does replacing all three dense layers have this problem whereas just replacing the last two behaves as expected.

ramesh · December 9, 2017, 4:11am

That’s awesome. Maybe when you replace all three last layers, the weights are completely random and it’s not updating with large enough steps to get to a better place. Maybe you could try increasing the learning rate from 0.01 to 0.1 and see if that helps when you replace all three layers.

One other idea - Change optimizer from SGD to rmsprop or adam

SALu · December 9, 2017, 12:04pm

The following are my thoughts on debugging your problem:

Try decreasing the learning rate to, say, 1e-3.
Look at the exact prediction outputs. Are they all 1s/0s?
Check whether the dense layer weights are actually updated (with something like model.get_weights()).
For using model.pop(), this post about fine-tuning might be useful: https://flyyufelix.github.io/2016/10/08/fine-tuning-in-keras-part2.html
6~7 sec per epoch on the cats-vs-dogs dataset with VGG16 is unreasonably fast, if I remember correctly.
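The second and third checks in the list above can be sketched in plain Python. The helper names are hypothetical. The inputs are assumed to be nested or flat lists of floats, for example a batch from model.predict(...) converted with .tolist(), and one weight array from model.get_weights() flattened before and after an epoch.

```python
def saturated_fraction(predictions, tol=1e-6):
    """Fraction of prediction rows whose probabilities are all ~0 or ~1."""
    flags = [all(p <= tol or p >= 1.0 - tol for p in row)
             for row in predictions]
    return sum(flags) / float(len(flags))


def max_abs_change(weights_before, weights_after):
    """Largest absolute difference between two flat weight lists."""
    return max(abs(after - before)
               for before, after in zip(weights_before, weights_after))


# Made-up example values, only to show how the helpers are called.
preds = [[1.0, 0.0], [1.0, 0.0], [0.7, 0.3]]
print(saturated_fraction(preds))

before = [0.10, 0.20, -0.30]
after = [0.10, 0.25, -0.30]
print(max_abs_change(before, after))
```

A saturated fraction close to 1 would mean the network answers 0/1 for almost every image, and a maximum change of exactly 0 would mean the layer is not being updated at all.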

MarkD · December 9, 2017, 4:44pm

Thanks SALu, the problem was that my learning rate was too high. With it reduced to 1e-3, the model with all three dense layers replaced converges properly. Note that I’m doing this on a sample of dogscats, not the whole set (hence the 7-second epoch times). My expectation, since I am just experimenting, was that I should get convergence and overfitting on this small dataset. Indeed, with lr=0.001 this is what I now get, as shown below:

Even with 0.001 the convergence jumps around, so I also tried 0.0001, which gives the following results:

thanks ramesh and SALu for your help
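The thread does not establish why lr=0.01 stalled while 1e-3 worked. One plausible mechanism is that large, unnormalised activations turn a single big SGD step into a saturated output. The toy below is only a sketch of that idea on a single logistic unit, not on VGG; the input values and function names are made up for the illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def prob_after_one_step(x, y, lr):
    """Take one SGD step from zero weights and return the new P(y=1 | x)."""
    w = [0.0] * len(x)
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # 0.5 at the start
    grad = [(p - y) * xi for xi in x]                  # d(cross-entropy)/dw
    w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


x = [10.0] * 40  # hypothetical unnormalised activations feeding the unit
for lr in (0.01, 0.001, 0.0001):
    print(lr, prob_after_one_step(x, 1.0, lr))
```

In this toy, the step with lr=0.01 pushes the output past 1 - 1e-7 after a single update, while the smaller rates leave it well inside (0, 1). If probabilities are clipped before the loss is computed, an output beyond the clipping threshold may receive no further gradient, which would match a loss that never changes after the first epoch.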