Why are my MNIST results so bad?

I’m running through the basic MNIST example that Jeremy gives at the end of lesson 3. The linear model looks reasonably good at the start (except that accuracies hover around 60% rather than his 80%), but when I come to run the “Single Dense Layer” example I get some strange results.

The initial single-epoch run with the LR left at its default:

fc.fit_generator(batches, batches.n, nb_epoch=1, 
                    validation_data=test_batches, nb_val_samples=test_batches.n)
Epoch 1/1
60000/60000 [==============================] - 11s - loss: nan - acc: 0.3430 - val_loss: nan - val_acc: 0.0965

Then 4 epochs with the LR set to 0.1:

fc.fit_generator(batches, batches.n, nb_epoch=4, 
                    validation_data=test_batches, nb_val_samples=test_batches.n)

Epoch 1/4
60000/60000 [==============================] - 11s - loss: nan - acc: 0.0983 - val_loss: nan - val_acc: 0.0987
Epoch 2/4
60000/60000 [==============================] - 11s - loss: nan - acc: 0.0985 - val_loss: nan - val_acc: 0.0988
Epoch 3/4
60000/60000 [==============================] - 11s - loss: nan - acc: 0.0991 - val_loss: nan - val_acc: 0.0976
Epoch 4/4
60000/60000 [==============================] - 11s - loss: nan - acc: 0.0988 - val_loss: nan - val_acc: 0.0974

It looks like it is overfitting in the initial single-epoch run, as the training accuracy is much higher than the validation accuracy. The higher learning rate then seems to make it unstable, like the graphical example he gave of overshooting the local minimum.

Is this correct? What have I missed? Apart from having to change the batches.N property to batches.n, I haven’t made any other changes from the original notebook.

Versions of software are:
keras 1.2.2
numpy 1.13.1
python 2.7.13
theano 0.9.0
Ubuntu 16.04 on Azure N series with Nvidia K80 GPU.

Thanks for any help.

UPDATE

I changed the definition of the ‘Single Dense Layer’ to use a BatchNormalization initial step instead of the Lambda step and added a Dropout after the Dense(512) layer, so:

def get_fc_model():
    model = Sequential([
        # Lambda(norm_input, input_shape=(1,28,28)),
        BatchNormalization(axis=1, input_shape=(1,28,28)),
        Flatten(),
        Dense(512, activation='softmax'),
        Dropout(0.5),
        Dense(10, activation='softmax')
        ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

This improved the initial single-epoch results:

Epoch 1/1
60000/60000 [==============================] - 13s - loss: 0.5787 - acc: 0.4920 - val_loss: 0.3468 - val_acc: 0.8120

But it still led to weird results when the LR is changed to 0.1 and 4 epochs are run instead of 1:

fc.optimizer.lr=0.1

fc.fit_generator(batches, batches.n, nb_epoch=4, 
                    validation_data=test_batches, nb_val_samples=test_batches.n)
Epoch 1/4
60000/60000 [==============================] - 12s - loss: nan - acc: 0.3916 - val_loss: nan - val_acc: 0.1004
Epoch 2/4
60000/60000 [==============================] - 12s - loss: nan - acc: 0.0974 - val_loss: nan - val_acc: 0.0991
Epoch 3/4
60000/60000 [==============================] - 12s - loss: nan - acc: 0.0994 - val_loss: nan - val_acc: 0.0979
Epoch 4/4
60000/60000 [==============================] - 12s - loss: nan - acc: 0.0991 - val_loss: nan - val_acc: 0.1004

Still not sure what is going on though.

@mistakenot
I tried training your model for one epoch (without data augmentation) and I got the same results as you (about 80% for validation accuracy).
I would first advise you to change the first Dense layer’s activation to relu instead of softmax:
=> Dense(512, activation='relu')
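
In case it helps, this is what your get_fc_model from above would look like with only that line changed (just a sketch of the suggested tweak, assuming the same imports and input shape as your post):

def get_fc_model():
    model = Sequential([
        BatchNormalization(axis=1, input_shape=(1,28,28)),
        Flatten(),
        # relu on the hidden layer; softmax only on the final 10-way output
        Dense(512, activation='relu'),
        Dropout(0.5),
        Dense(10, activation='softmax')
        ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model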

That gives me a far better validation accuracy after training for one epoch:

Train on 60000 samples, validate on 10000 samples
Epoch 1/1
60000/60000 [==============================] - 3s - loss: 0.2772 - acc: 0.9150 - val_loss: 0.1179 - val_acc: 0.9637

If you still have bad results when using data augmentation, I would advise looking at the image generator. There might be a bug there.

Yeah, you want ReLU on everything apart from the last layer, which for a classification problem should be softmax.
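
A quick way to see why softmax hurts on a wide hidden layer (just a toy numpy illustration of my own, not something from the lesson): softmax forces the 512 outputs to sum to 1, so each unit carries almost no signal, whereas ReLU leaves the scale of the activations alone.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

z = np.random.randn(512)           # pretend pre-activations of the 512-unit layer
a = softmax(z)
print(a.sum(), a.mean(), a.max())  # sums to 1, so each unit averages ~1/512
print(np.maximum(z, 0.0).mean())   # relu keeps the activations at a useful scale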

Thanks, changing the activation functions to ReLU and sigmoid has resolved it. Looks like I need to read more about activation functions. Thanks!

@mistakenot
You should have a look at this page https://github.com/Kulbear/deep-learning-nano-foundation/wiki/ReLU-and-Softmax-Activation-Functions
It gives a pretty clear explanation of the ReLU, sigmoid and softmax activations.

The problem isn’t only with the choice of model. In your 4-epoch run the loss is reported as NaN. This could mean the learning rate is too high, leading to instability, or it could be a problem with the data and labels, or something else entirely.
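
If you want to rule out the data side, a quick sanity check on one batch from the generator might help (a sketch; it assumes batches yields (images, one-hot labels) tuples the way the notebook’s generator does):

import numpy as np

imgs, labels = next(batches)      # grab one batch from the training generator
print(imgs.shape, labels.shape)   # expecting something like (64, 1, 28, 28) and (64, 10)
print(np.isnan(imgs).any())       # any NaNs already in the inputs?
print(imgs.min(), imgs.max())     # unscaled pixel ranges can also blow up the loss
print(labels.sum(axis=1)[:5])     # one-hot rows should each sum to 1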

Good luck!

I haven’t changed anything in the image-loading code from what was originally posted in the repository, so hopefully it’s not that. Part of the article posted by @bennnun had this about ReLU:

Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.
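
To make that concrete for myself, here is a toy numpy sketch (mine, not from the article) of a single ReLU unit whose bias has been knocked far negative by one oversized update: it outputs zero for every input, so no gradient ever flows back to revive it.

import numpy as np

w = np.random.randn(784) * 0.01   # small random weights for one unit
b = -50.0                         # bias pushed far negative by a huge update
x = np.random.rand(100, 784)      # 100 fake "images" with pixels in [0, 1]
pre = x.dot(w) + b                # pre-activation is well below 0 for every input
out = np.maximum(pre, 0.0)        # relu output
print((out == 0).all())           # True: the unit never fires...
print((pre > 0).sum())            # ...and with pre < 0 everywhere the gradient is 0, so it stays dead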

So I think the issue was due to me using learning rates incorrectly. After some playing with a basic model, I’ve got the hang of stopping it veering off into garbage. I’m still getting about 20% lower accuracy than the example notebook does with the same model, but at least it is stable.
Here’s what I ended up with that got me to around 77% validation accuracy:

def get_new_model():
    model = Sequential([
        Lambda(norm_input, input_shape=(1,28,28)),
        Flatten(),
        Dense(512, activation='softmax'),
        Dense(10, activation='softmax')
    ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

fc = get_new_model()

def fit(lr, epoch=1):
    fc.optimizer.lr=lr
    fc.fit_generator(batches, batches.n, nb_epoch=epoch, 
                    validation_data=test_batches, nb_val_samples=test_batches.n)
fit(0.00001, 1)
fit(0.1, 1)
fit(0.00001, 1)
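
One thing I’m still unsure about (so treat this as a guess rather than a confirmed fix): assigning a plain Python float to fc.optimizer.lr may not change the rate used by the training function Keras has already compiled, since the optimizer’s lr is a backend variable in Keras 1.x. A version of the helper that updates that variable in place would look something like this:

from keras import backend as K

def fit(lr, epoch=1):
    # update the existing backend variable rather than replacing it with a float
    K.set_value(fc.optimizer.lr, lr)
    fc.fit_generator(batches, batches.n, nb_epoch=epoch, 
                    validation_data=test_batches, nb_val_samples=test_batches.n)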

For reference, for anyone else getting bad results on this: I also downgraded the VM’s CUDA version to 8 and re-downloaded the git repo. Now up to > 90%. Weird. Thanks for everyone’s suggestions.