I’m running through the basic MNIST example that Jeremy gives at the end of lesson 3. Everything looks reasonable at the start for the linear model (except that accuracies hover around 60% rather than his 80%), but when I come to run the “Single Dense Layer” example I get some strange results.

The initial single epoch run with LR set to default:

```
fc.fit_generator(batches, batches.n, nb_epoch=1,
validation_data=test_batches, nb_val_samples=test_batches.n)
Epoch 1/1
60000/60000 [==============================] - 11s - loss: nan - acc: 0.3430 - val_loss: nan - val_acc: 0.0965
```

This is then followed by 4 epochs with the LR set to 0.1:

```
fc.fit_generator(batches, batches.n, nb_epoch=4,
validation_data=test_batches, nb_val_samples=test_batches.n)
Epoch 1/4
60000/60000 [==============================] - 11s - loss: nan - acc: 0.0983 - val_loss: nan - val_acc: 0.0987
Epoch 2/4
60000/60000 [==============================] - 11s - loss: nan - acc: 0.0985 - val_loss: nan - val_acc: 0.0988
Epoch 3/4
60000/60000 [==============================] - 11s - loss: nan - acc: 0.0991 - val_loss: nan - val_acc: 0.0976
Epoch 4/4
60000/60000 [==============================] - 11s - loss: nan - acc: 0.0988 - val_loss: nan - val_acc: 0.0974
```

It looks like it is overfitting in the initial single-epoch run, since the training accuracy is well above the validation accuracy. The higher learning rate then seems to make it unstable, like the graphical example he gave of overshooting the local minimum.
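A toy sketch of that overshoot effect (my own illustration, nothing from the notebook): plain gradient descent on f(w) = w². With a small step the iterate shrinks towards the minimum at 0; once the step is too large, each update overshoots and grows until the float overflows and the iterate ends up as nan, much like the `loss: nan` above.

```python
import math

def descend(lr, steps=1200, w=3.0):
    # Minimise f(w) = w**2 by gradient descent; gradient of w**2 is 2*w.
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

small = descend(0.1)   # each step multiplies w by 0.8, so it decays to ~0
large = descend(1.5)   # each step multiplies w by -2, so it blows up to nan
```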

Is this correct? What have I missed? Apart from changing the batches.N property to batches.n, I haven’t made any other changes from the original notebook.

Versions of software are:

keras 1.2.2

numpy 1.13.1

python 2.7.13

theano 0.9.0

Ubuntu 16.04 on Azure N series with Nvidia K80 GPU.

Thanks for any help.

UPDATE

I changed the definition of the ‘Single Dense Layer’ model to use a BatchNormalization layer in place of the initial Lambda step, and added a Dropout after the Dense(512) layer, so:

```
def get_fc_model():
    model = Sequential([
        # Lambda(norm_input, input_shape=(1,28,28)),
        BatchNormalization(axis=1, input_shape=(1,28,28)),
        Flatten(),
        Dense(512, activation='softmax'),
        Dropout(0.5),
        Dense(10, activation='softmax')
    ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```
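For reference, here is a rough numpy sketch (my own, assuming norm_input is the usual mean/std standardisation from the notebook) of what that initial normalisation step is there to do: centre and rescale the raw 0–255 pixel values so the first Dense layer sees well-scaled inputs, which an initial BatchNormalization layer achieves in a similar learned way.

```python
import numpy as np

# Fake batch of 100 single-channel 28x28 "images" with raw 0-255 values.
x = np.random.RandomState(0).rand(100, 1, 28, 28) * 255.0

# Standardise: subtract the mean and divide by the std of the data,
# which is what a norm_input-style Lambda step typically computes.
mean, std = x.mean(), x.std()
normed = (x - mean) / std   # now zero mean, unit variance
```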

This improved the poor initial single-epoch results:

```
Epoch 1/1
60000/60000 [==============================] - 13s - loss: 0.5787 - acc: 0.4920 - val_loss: 0.3468 - val_acc: 0.8120
```

But it still leads to weird results when the LR is raised to 0.1 and 4 epochs are run instead of 1:

```
fc.optimizer.lr=0.1
fc.fit_generator(batches, batches.n, nb_epoch=4,
validation_data=test_batches, nb_val_samples=test_batches.n)
Epoch 1/4
60000/60000 [==============================] - 12s - loss: nan - acc: 0.3916 - val_loss: nan - val_acc: 0.1004
Epoch 2/4
60000/60000 [==============================] - 12s - loss: nan - acc: 0.0974 - val_loss: nan - val_acc: 0.0991
Epoch 3/4
60000/60000 [==============================] - 12s - loss: nan - acc: 0.0994 - val_loss: nan - val_acc: 0.0979
Epoch 4/4
60000/60000 [==============================] - 12s - loss: nan - acc: 0.0991 - val_loss: nan - val_acc: 0.1004
```
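As an aside on why the accuracies freeze near 0.098 (which is just chance level for 10 classes), here is my own toy illustration: once a single nan reaches a weight, every subsequent update keeps it nan, so the loss never recovers.

```python
import math

# nan propagates through all arithmetic, so one bad gradient step
# poisons the weight for good; later well-behaved updates can't fix it.
w = 0.5
w = w - 0.1 * float('nan')   # one nan gradient -> weight becomes nan
w = w - 0.1 * 0.3            # a sane update afterwards changes nothing
```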

Still not sure what is going on though.