Why do we halve all the FC layer weights to remove Dropout, instead of just the layers after the Dropouts?

It would be great if some of you could write up a note about inverted dropout and the errors in the lesson which I could put on the wiki and also link to from an annotation on youtube. If any of you get a chance to do this, please shoot me a message so I don’t miss it!

I’d be glad to.

In the meantime though, do you know of any reason why removing Dropout from the VGG16BN model would require a different process than the one you illustrated for the VGG16 model in Lesson 3?

I’m trying to follow the same steps but get memory errors (which I’m pretty sure are a result of breaking up the convolutional and FC layers and using the output of the convolutional layers as input to the FC model).

I have a lot of RAM in my box - so you may not be able to follow the exact same process. Feel free to share your code and I can try to find time to take a look.

Thanks much Jeremy! Knowing how busy you are with the courses I appreciate the time to look at this. I’ve been fighting with this for the past few days and I’m hoping it’s just something stupid I’m missing. Anyhow, here is the relevant code …

train_batches = get_batches(train_path, shuffle=False, batch_size=1)
val_batches = get_batches(val_path, shuffle=False, batch_size=1)
test_batches = get_batches(test_path, shuffle=False, batch_size=1)

train_classes = train_batches.classes
train_labels = onehot(train_classes)
train_filenames = train_batches.filenames

val_classes = val_batches.classes
val_labels = onehot(val_classes)
val_filenames = val_batches.filenames

test_filenames = test_batches.filenames

Here is my finetune function using VGG16BN:

def finetune(num_outputs):
    model = Vgg16BN().model
    model.pop()
    
    for l in model.layers: l.trainable = False
    model.add(Dense(num_outputs, activation='softmax'))
    
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

I fit the training data to the model returned from finetune() over several iterations and use the weights from the best iteration to move forward with

ft_model = finetune(2)
ft_model.load_weights(best_weights_f)
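
The fit-and-save step itself looked roughly like this (just a sketch - the exact loop isn’t shown here; train_data/val_data are the arrays I use further below, and the per-epoch weight filenames are made up):

# Sketch of the fit-and-save step (assumed): fit one epoch at a time and
# save the weights after each epoch, then set best_weights_f to the file
# from the epoch with the best validation accuracy.
ft_model = finetune(2)
for epoch in range(8):
    ft_model.fit(train_data, train_labels, nb_epoch=1, batch_size=4,
                 validation_data=(val_data, val_labels))
    ft_model.save_weights('ft_model_epoch_%d.h5' % epoch)  # hypothetical filename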

I then split up the convolutional and FC layers

last_conv_idx = [i for i,l in enumerate(ft_model.layers) if type(l) == Convolution2D][-1]

conv_layers = ft_model.layers[:last_conv_idx+1]
conv_model = Sequential(conv_layers)

fc_layers = ft_model.layers[last_conv_idx+1:]

and then use the “conv_model” to generate the “features” to use as input to a model built from just the FC layers

train_features_conv = conv_model.predict(train_data, batch_size=4)
val_features_conv = conv_model.predict(val_data, batch_size=4)

print(train_features_conv.shape, val_features_conv.shape)

I build the FC model as such. You can see that I’m not even trying to change the Dropout at this point as I get the “memory” error either way.

def build_fc_model(p):
    model = Sequential([
            MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
            Flatten(),
            Dense(4096, activation='relu'),
            BatchNormalization(),
            Dropout(p),
            Dense(4096, activation='relu'),
            BatchNormalization(),
            Dropout(p),
            Dense(2, activation='softmax')
        ])

    for l1,l2 in zip(model.layers, fc_layers): l1.set_weights(l2.get_weights())
    
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

fc_model = build_fc_model(0.5)

When I try to fit the fc_model as shown below, I get the “MemoryError: ('Error allocating 411041792 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY) …” exception.

fc_model.fit(train_features_conv, train_labels, nb_epoch=8, 
             batch_size=4, validation_data=(val_features_conv, val_labels))

Would you be able to try this with a smaller batch size? I am not sure how resource-intensive batchnorm is (I have not even looked at the derivations), so maybe a lower batch size will help. Still, if you only have 2 GB of RAM on your GPU, sooner or later, as you continue to increase the complexity of your model, you will hit a wall with any batch size.

In order to troubleshoot this properly you might need to download a diagnostic tool for your GPU to monitor used/available memory. If you are using Linux, nvidia-smi is a great place to start.

Not sure if this is useful - please feel free to alter in any way you see fit or don’t use it at all :slight_smile:

Dropout is a regularization technique. Like any other such method, it trades away some of your model’s ability to fit the training data in the hope that what it learns will generalize better to data it has not seen. Beyond that, dropout also has some very nice unique properties that you can read more about here (section 2 is very interesting and can give you great intuition on why it works!)

Classical dropout is achieved by randomly disregarding some subset of nodes in a layer during training. The nodes which don’t participate in producing a prediction, and consequently don’t take part in calculating the gradient, are selected at random and vary from one training example to another. At test time we want to use the full predictive capacity of our network, so all nodes are active - essentially, we are averaging over the contributions of all nodes. If during training we set nodes to be inactive with probability p = 0.5, then at test time we have to make the weights half as big as they were during training. The intuition behind this is very simple - if during training we taught our network to predict ‘1’ in the subsequent layer using only 50% of its weights, now that it has all the weights at its disposal, the contribution of each weight only needs to be half as big!
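
To make that concrete, here is a tiny numeric sketch of the classical scheme (the shapes and layer are made up purely for illustration):

import numpy as np

p = 0.5                      # probability of dropping a node
W = np.random.randn(4, 3)    # weights of some layer, made up for illustration
x = np.random.randn(4)       # inputs to that layer

# Training time (classical dropout): a random subset of inputs is zeroed out.
mask = np.random.binomial(1, 1 - p, size=x.shape)
train_out = (x * mask).dot(W)

# Test time (classical dropout): all nodes are active, so the weights are
# scaled by the retain probability (1 - p) = 0.5 to keep the expected
# contribution the same as it was during training.
test_out = x.dot(W * (1 - p))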

This was the classical dropout. What Keras does is slightly different - it uses something that can be referred to as inverted dropout. The activations are rescaled during training so that no rescaling needs to take place at test time! This also has the nice property that you can move weights around by calling get_weights and set_weights on a layer with ease and without any manipulation of their scale.

Thus, to summarize: regardless of whether you apply dropout to a layer, in Keras the weights will always be at the correct scale. This is not something that is evident from the lesson video - Jeremy assumed that Keras would apply dropout in the classical way. Everything in the lesson still applies, and the rescaling of weights would still be 100% accurate if we were applying classical dropout, but given the inner workings of Keras this step should be skipped (if you do the rescaling you will end up with weights that are either too small or too large!)
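
A matching sketch of the inverted variant that Keras uses (again with made-up shapes - in Keras this scaling happens inside the Dropout layer, you never write it yourself):

import numpy as np

p = 0.5
W = np.random.randn(4, 3)    # made-up layer weights
x = np.random.randn(4)
mask = np.random.binomial(1, 1 - p, size=x.shape)

# Training time (inverted dropout): the surviving activations are scaled up
# by 1 / (1 - p) on the spot ...
train_out = (x * mask / (1 - p)).dot(W)

# ... so at test time the weights are used exactly as stored, with no
# rescaling step - which is why get_weights/set_weights can be moved
# between models as-is.
test_out = x.dot(W)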


Even with batch_size = 1 I get the same error … and this is just against the sample dataset as well.

SOLVED: It was the output dimension of the Dense layers at 4,096.

I changed each of the two middle Dense layers to Dense(512, activation='relu') and everything worked fine. I’m not sure what the impact of reducing the output shape from 4,096 to 512 will be (or even why 4,096 to begin with???), but at least it’s working.
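
For anyone hitting the same wall, the change was just those two Dense layers - roughly like this (a sketch; note that with 512 units the original 4,096-wide VGG FC weights no longer fit, so the weight-copying loop has to be dropped and these layers train from scratch):

# Sketch of the change that fixed the memory error: smaller Dense layers.
# The original 4,096-wide VGG FC weights can't be copied into 512-wide
# layers, so no set_weights loop here - they start from fresh weights.
def build_fc_model(p):
    model = Sequential([
            MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
            Flatten(),
            Dense(512, activation='relu'),
            BatchNormalization(),
            Dropout(p),
            Dense(512, activation='relu'),
            BatchNormalization(),
            Dropout(p),
            Dense(2, activation='softmax')
        ])

    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model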

Thanks to all for their comments/advice. Hopefully my pain will reduce that of others who come across this same issue; at least I’ll remember this :slight_smile:


That’s great! I’ve added it to http://wiki.fast.ai/index.php/Lesson_3_Notes#Updating_Weights_with_Dropout . I’ll add a note to the lesson page too.

Does that mean we don’t need to re-scale weights when using Keras?

Yes, IIRC that is correct - you never have to worry about rescaling weights because Keras (at least it used to in 1.x ;)) rescales the activations during training for dropout. Say you apply dropout of 0.5: it will then select a random subset of 50% of the nodes in a layer and multiply the surviving activations by 2 during training, so nothing needs to change at test time.

Quite a smart trick imho :slight_smile:

Hi Jeremy, I don’t have access to edit the wiki, or else I would just fix it myself. The link to the dropout paper provided by radek (http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf) is missing from the wiki on this page: http://wiki.fast.ai/index.php/Lesson_3_Notes#Updating_Weights_with_Dropout. Thanks for a great course!