Bugged cross-entropy loss calculation in Keras?

I am working on dogsvscats, redoing what @jeremy presented in lecture 3.

I get this strange result that I cannot find explanation for. What I do is the following:

  1. I save the features and train the VGG16 FC layers with dropout at 0.5 (not changing that at all).
  2. I create a model of just the FC layers with dropout set to 0, adjusting the weights (sketched just below).
  3. I do no training at all - as a sanity check I run evaluate_generator on both models with the validation data.
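For reference, here is roughly what step 2 looks like in code - a sketch only, where fc_model stands for the trained FC model from step 1, the layer sizes and optimizer are illustrative, and the (1 - p) rescaling at the end is exactly the adjustment I am describing:

from keras.models import Sequential
from keras.layers import Dense, Dropout

p = 0.5  # dropout rate the FC layers were trained with

# same architecture as the trained FC model, but with dropout set to 0
no_dropout_model = Sequential([
    Dense(4096, activation='relu', input_shape=(25088,)),
    Dropout(0.0),
    Dense(4096, activation='relu'),
    Dropout(0.0),
    Dense(2, activation='softmax'),
])
no_dropout_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                         metrics=['accuracy'])

# copy the trained weights across, rescaling everything that get_weights()
# returns by the keep probability (1 - p)
for new_layer, old_layer in zip(no_dropout_model.layers, fc_model.layers):
    new_layer.set_weights([w * (1 - p) for w in old_layer.get_weights()])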

Old FC model with Dropout gives me:
[0.14896876902467668, 0.98760000000000003]
(the first value is the categorical crossentropy, the second the accuracy).
With the new model without Dropout (p=0), I get:
[0.045561284964613716, 0.98760000000000003]
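For context, the sanity check itself is just an evaluation call along these lines (a sketch assuming the Keras 1-style evaluate_generator signature, with val_batches and nb_val_samples standing in for the validation generator and its sample count, and the model names as in the sketch above):

# no training at all - just evaluate both FC models on the same validation data
print(fc_model.evaluate_generator(val_batches, nb_val_samples))
print(no_dropout_model.evaluate_generator(val_batches, nb_val_samples))
# each call returns [categorical_crossentropy, accuracy]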

How is it possible that the loss values are different and yet the accuracy is exactly the same? I cannot find an explanation for this.

It gets even weirder: when I repeat the same steps with values of 0.1 and 0.2 for p, I get the results below:
p = 0.1: [0.057281951369761212, 0.98760000000000003]
p = 0.2: [0.072925884064820298, 0.98760000000000003]

It seems that evaluate_generator is behaving as if it were running in training mode for the log loss but not for the accuracy?

And what is even stranger, if I have

  • p = predictions of the model with dropout
  • pd = predictions of the model without dropout

I get:

np.mean(p - pd) # => -5.58385e-10

but:

np.sort(p.reshape((1, -1)) - pd.reshape((1, -1)))
# => array([[-0.36295992, -0.36100584, -0.36085624, ..., 0.36085624,
#            0.3610059 , 0.36295998]], dtype=float32)
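In case someone wants to reproduce this, here is a slightly more direct comparison of the two prediction arrays than sorting the flattened difference:

import numpy as np

diff = p - pd
print(np.abs(diff).max())   # largest per-element disagreement (~0.36 here)
print(np.abs(diff).mean())  # mean absolute difference, unlike the signed mean above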

I am sorry - I just realized this is hard to read. I will try a different way of posting going forward - either linking to GitHub for the code or something else.

If anyone has any idea what might be happening here, I would really appreciate the help.

BTW I cut off the FC layers at the bottom of the first FC layer, and not where @jeremy did it in the lecture (below the Flatten layer, IIRC), but I do not think that should make a difference.

Hello radek,

I think that getting the same accuracy makes sense, because you don't train again and just predict - correct?

about:

  • p = predictions of the model with dropout
  • pd = predictions of the model without dropout

np.mean(p - pd) # => -5.58385e-10

If the predictions are the “probabilities” for each model, I think this should always be 0. If you predict only one image, then your vectors will look like
p = (0.3, 0.7) (sum is 1)
pd = (0.5, 0.5) (sum is 1)

then p - pd = (-0.2, 0.2) and the mean over this is 0. Independent of the number of predictions it should look the same, because each set of probabilities sums up to 1.
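A quick numerical check of this reasoning (made-up probabilities, just to illustrate):

import numpy as np

# every row of predictions sums to 1, so every row of (p - pd) sums to 0
p  = np.array([[0.3, 0.7], [0.9, 0.1]])
pd = np.array([[0.5, 0.5], [0.6, 0.4]])
print((p - pd).mean())  # 0.0 (up to floating point error)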

About the different loss values:
Maybe look at 5 predictions for p and pd and print the resulting vectors (the probability values), without subtracting first?

Bests,
Benedikt


Thank you very much for looking into this @benediktschifferer, really appreciate it.

Great point about the predictions summing to 1, which explains the mean being 0. I completely missed this.

Here, for example, are two predictions that differ significantly:
p[34] # => array([ 0.00118797, 0.99881208], dtype=float32)
pd[34] # => array([ 0.3641479, 0.6358521], dtype=float32)

According to the dropout paper (I think it is also referenced from the Keras code), we do not subject the bias to dropout, which intuitively makes sense.

Now, I have been scaling everything that layer.get_weights() returns. It returns a list, and as it turns out:

layer.get_weights # => [dense_13_W, dense_13_b]

and

layer.b # => dense_13_b

OK, so that is one issue with my code that I will fix. But still, this means that the loss should be better with dropout than without, as I am messing up the biases via scaling when removing dropout.
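For reference, the fix I have in mind at this point is to scale only the weight matrix and leave the bias alone - a sketch, assuming get_weights() returns [W, b] for a Dense layer (though, as it turns out further down, in Keras no rescaling is needed at all):

def scale_dense_for_dropout_removal(layer, p):
    # scale only the kernel by the keep probability; the bias is not
    # subject to dropout and should be copied as-is
    W, b = layer.get_weights()
    layer.set_weights([W * (1 - p), b])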

However, looking at the code for the Dropout layer - I have no clue how Keras works internally, but there is nothing in the code for the layer that, to me, would indicate that we are scaling the weights at test time.

Could it be that Keras simply hasn't implemented the test-time behavior correctly? It seems quite implausible; still, the VGG16 that comes with Keras doesn't seem to have dropout.

Found an issue touching on this. Will see if I can produce a simplified example (assuming the incorrect behavior indeed exists).

Hey @radek,

I didn’t run the same sanity checks you did, but I’ve got an implementation of dropout removal that works. I integrated it into the existing vgg16.py (and vgg16bn.py) with the amount of dropout as a parameter.

To instantiate it with no dropout call:

vgg = Vgg16(dropout=0.0)

I’m fairly certain it’s correct because of the accuracy results I’m getting.

Hope this helps.
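The gist of the change is roughly this - a sketch only, not the actual vgg16.py diff, with illustrative layer sizes:

from keras.models import Sequential
from keras.layers import Dense, Dropout

def fc_layers(dropout, n_classes=2):
    # the dropout rate is a parameter instead of a hard-coded 0.5
    return [
        Dense(4096, activation='relu', input_shape=(25088,)),
        Dropout(dropout),
        Dense(4096, activation='relu'),
        Dropout(dropout),
        Dense(n_classes, activation='softmax'),
    ]

# what Vgg16(dropout=0.0) boils down to for the fully-connected part:
model = Sequential(fc_layers(dropout=0.0))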

I figured this out thanks to your help, and also thanks to a very helpful comment on GitHub. :slight_smile:

It turns out that with the rescaled weights the model was outputting less confident predictions, which helped the loss (a clipping effect), and I also think there was some luck involved in the new parameters producing a loss that was so much better. There might be something else going on here, but mostly it was the clipping effect.

Nonetheless, none of this is really important - the important bit is that I learned that you shouldn't rescale the weights when you change the p value of dropout in Keras! It turns out Keras does something rather smart - instead of scaling the weights at test time, it does the scaling at train time (on the activations, i.e. inverted dropout)! This way, regardless of what the dropout value was, layer.get_weights() will always return weights of the correct scale. I tested this and indeed it is the case.
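A quick way to convince yourself of this (a sketch with a toy architecture): copy the weights from a model built with dropout into one built without, with no rescaling at all, and confirm the test-time predictions are identical.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

def make_model(p):
    # identical architecture, only the dropout rate differs (sizes are toy)
    return Sequential([
        Dense(32, activation='relu', input_shape=(16,)),
        Dropout(p),
        Dense(2, activation='softmax'),
    ])

with_dropout = make_model(0.5)
without_dropout = make_model(0.0)

# copy the weights straight across - no (1 - p) rescaling anywhere
without_dropout.set_weights(with_dropout.get_weights())

x = np.random.rand(8, 16).astype('float32')
# Dropout is a no-op at test time, and because Keras does its scaling during
# training, the stored weights are already on the right scale
print(np.allclose(with_dropout.predict(x), without_dropout.predict(x)))  # True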

Let me cc @jeremy and @rachel, as this seems like a very useful tidbit to know.


This is a great point - and something I recently discovered myself! I’ll add a note to the appropriate video, and will also discuss it in part 2.


That’s really strange. I get consistently better performance when I do the scaling myself, both in the initial-epoch test and validation predictions and in the final results.

It’s beyond my capabilities math-wise, but it makes me suspect there is some underlying function or mathematical principle at play here related to the ReLU activation. By reducing the weights you make room for other input activations, possibly because of the way it forces a positive activation. That would be really strange though, and it may just be this example.