Why do we halve all the FC layer weights to remove Dropout instead of just the layers after Dropouts?

In the Lesson 3 notebook, we halve all the FC layer weights before fitting, after setting dropout to 0.

But if Dropout(0.5) removes half of the activations, shouldn’t the only layers that need their weights halved be those that directly follow a Dropout?


If you have dropout in layer n, what it does is remove, or zero out, some portion of the nodes in the preceding layer (n - 1). 0.5 is a nice value to consider since it is easy to imagine half of the nodes in the previous layer getting removed from a given calculation.

If you remove half the nodes, this doesn’t do anything to the target. Meaning, if in layer n + 1, directly following the dropout layer, you have just a single node, and the target value of that node for a given example is 1, then the network will be learning to produce 1 on that example with just half the nodes in the previous layer available. If, on average, we have half the nodes in n - 1, then the weights going from n - 1 to n + 1 will have to be twice as big in order to output the 1 we are after.

If we then remove the dropout, for any given example we will have twice as many nodes available in layer n - 1. That means that, on average, each weight connecting to n + 1 will only need half its original magnitude to achieve a result equivalent to the one under dropout.
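To make the argument concrete, here is a minimal sketch of the classic (non-inverted) scheme with p = 0.5. It is not the notebook's code: the activations and weights are made-up numbers, and instead of sampling it enumerates every possible dropout mask, so the average is exact.

```python
from itertools import product

def weighted_sum(acts, weights, mask):
    # Input to a single node in layer n + 1; mask entries are 0 (dropped) or 1 (kept).
    return sum(a * w * m for a, w, m in zip(acts, weights, mask))

# Hypothetical activations of layer n - 1 and weights into one node of layer n + 1.
acts = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, -0.25, 0.75, 1.0]

# With p = 0.5 every 0/1 mask is equally likely, so averaging over all masks
# gives the exact expected input seen during training.
masks = list(product([0, 1], repeat=len(acts)))
train_mean = sum(weighted_sum(acts, weights, m) for m in masks) / len(masks)

# Dropout removed: every node is present, so halve the weights to compensate.
halved = [w / 2 for w in weights]
no_dropout = weighted_sum(acts, halved, [1] * len(acts))

print(train_mean, no_dropout)  # 3.125 3.125
```

The two numbers agree only on average; for any single mask the training-time input will differ from the halved-weight value.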

That is the reasoning, and it holds assuming dropout is implemented as in what I believe was the original paper on it. However, it turns out that this is not exactly how dropout is implemented in Keras. Because Keras rescales the surviving activations at train time (inverted dropout), no further changes need to happen if you increase / decrease dropout. Quite convenient if you have a need for moving weights around - you simply don’t have to worry about the rescaling.
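A rough sketch of why the rescaling stops being necessary under inverted dropout. This is an illustration and not Keras source code: the numbers are made up, and the only thing assumed about Keras is that kept activations are divided by the retain probability (1 - p) during training.

```python
from itertools import product

def inverted_dropout_sum(acts, weights, mask, p):
    # Kept activations are scaled up by 1 / (1 - p) at train time.
    scale = 1.0 / (1.0 - p)
    return sum(a * m * scale * w for a, w, m in zip(acts, weights, mask))

# Hypothetical activations and weights, for illustration only.
acts = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, -0.25, 0.75, 1.0]
p = 0.5

# Exact training-time average over all equally likely masks.
masks = list(product([0, 1], repeat=len(acts)))
train_mean = sum(inverted_dropout_sum(acts, weights, m, p) for m in masks) / len(masks)

# Dropout switched off, weights left untouched.
no_dropout = sum(a * w for a, w in zip(acts, weights))

print(train_mean, no_dropout)  # 6.25 6.25
```

Since the expected input already matches the no-dropout input, changing the dropout rate (including to 0) needs no change to the weights under this scheme.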

I experimented with this a little bit and it took me a while to get to the bottom of this - if you feel you would like to read about this a bit more here is the original thread.


Thanks for the reply!

I’m looking at Jeremy’s lesson 3 notebook, where he removes Dropout from his finetuned model, for reference. It looks like this:

Dense (4096)
Dropout (0.5)
Dense (4096)
Dropout (0.5)
Dense (2)

Given your explanation, shouldn’t we halve the weights for just the last two Dense layers? The code in the notebook halves the weights for ALL these layers.
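Put as a small sketch (plain strings standing in for the layers, and `layers_fed_by_dropout` is a hypothetical helper, not something from the notebook): under the classic dropout reasoning, only a Dense layer whose input comes straight out of a Dropout would need its weights rescaled.

```python
def layers_fed_by_dropout(layer_names):
    # Return the indices of Dense layers that directly follow a Dropout layer.
    fed = []
    for i in range(1, len(layer_names)):
        follows_dropout = layer_names[i - 1].startswith("Dropout")
        if follows_dropout and layer_names[i].startswith("Dense"):
            fed.append(i)
    return fed

# The FC block as listed above.
fc_block = ["Dense(4096)", "Dropout(0.5)", "Dense(4096)", "Dropout(0.5)", "Dense(2)"]

to_halve = layers_fed_by_dropout(fc_block)
print(to_halve)                         # [2, 4]
print([fc_block[i] for i in to_halve])  # ['Dense(4096)', 'Dense(2)']
```

The first Dense(4096) is not in the result because nothing upstream of it is dropped.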


Yes, I think you are right :slight_smile: Kudos for noticing this.

I can’t check atm but I don’t think that MaxPooling or Flatten have any weights (pretty sure they don’t) and so I am guessing calling set_weights on them has no effect. But we should not be doing any scaling to the weights of the first Dense layer as - as I understand it - the weights go from the flattened layer to it and there is no Dropout happening along the way.

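def get_fc_model():
    model = Sequential([
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'),
        Dropout(0.),
        Dense(4096, activation='relu'),
        Dropout(0.),
        Dense(2, activation='softmax')
        ])

    # proc_wgts halves every weight array of the matching pre-trained layer
    for l1,l2 in zip(model.layers, fc_layers): l1.set_weights(proc_wgts(l2))

    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model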

Just in case someone else stumbles upon this thread, above is the code I believe you are referring to.

Maybe someone else could also chime in and confirm, but now that you pointed this out I am quite convinced your reasoning is correct :slight_smile:

Yah that is the code I’m referencing.

I’m also wondering if the same applies to removing Dropout from the VGG16BN model, or if having the BatchNormalization layers in there further complicates removing Dropout. There we have:


In that model, which layers need to have their weights halved and how does having BatchNormalization in there change the way we remove Dropout, if at all?

I don’t think there is any concept of halving the weights because of dropout. I have seen a constant 4096 neurons used in the FC layers and it still results in very high accuracy. Dropout removes a random x percent of neurons and their associated weights, to force the other neurons to learn features with duplicates; in this way it reduces overfitting and bias at the same time, to help generalize learning.

Given the VGG16BN FC layers as stated above … assuming its input are the features learned from a model created from just the Convolutional layers, how would you change Dropout(0.5) to Dropout(0.0)?

Not sure what is going on … but even when I don’t change the Dropout, I get the same error when attempting to break VGG16BN apart so that the convolutional layers’ output is used as input to a model built with the same structure as the FC layers:

last_conv_idx = [i for i,l in enumerate(model.layers) if type(l) == Convolution2D][-1]
conv_layers = model.layers[:last_conv_idx+1]
conv_model = Sequential(conv_layers)
fc_layers = model.layers[last_conv_idx+1:]

train_features_conv = conv_model.predict(train_data, batch_size=batch_size)
val_features_conv = conv_model.predict(val_data, batch_size=batch_size*2)

def build_fc_model():
    model = Sequential([
            Dense(4096, activation='relu'),
            Dense(4096, activation='relu'),
            Dense(2, activation='softmax')
            ])

    # Copy weights across only where the new layer is a Dense layer
    for l1,l2 in zip(model.layers, fc_layers): 
        if (type(l1) == Dense): l1.set_weights(l2.get_weights())

    # Such a finely tuned model needs to be updated very slowly!
    opt = Adam(lr=0.00001)

    model.compile(opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

fc_model = build_fc_model()

fc_model.fit(train_features_conv, train_labels, nb_epoch=8, 
             batch_size=batch_size, validation_data=(val_features_conv, val_labels))

The error I get here: MemoryError: ('Error allocating 411041792 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).', "you might consider using 'theano.shared(..., borrow=True)'")

What am I missing here?

@Gkarmakar the point of dropout is not necessarily to duplicate features, though I guess this is one way to think of reducing the capacity of the model. Ideally we would like (and that is what I believe happens to a large extent) the neurons to learn slightly different ways of arriving at the answer - this way we automatically get an ensemble over exponentially many models and also (maybe I am wrong on this one) something like Bayesian estimation of the parameters given the data. In general (and this is explained via a beautiful analogy in the original paper, section 2) we want to remove the dependency of what one neuron does on the others, not increase it.

@wgpubs this is a memory error. I found that sometimes I need to restart the kernel as I have some memory that is not getting garbage collected on the GPU - maybe there is still some reference to it from the kernel or the notebook, or maybe it doesn’t get cleaned up properly when you assign a new model to the same variable, not sure. You might need to restart the kernel and just run the cells you absolutely need for the calculation - and if that doesn’t help you might need to reduce the batch_size. One way the batch_size influences the calculations is that for each batch the necessary amount of memory is allocated on the GPU - I suspect the needed memory grows linearly with the batch_size and is also influenced by how many layers of the model you are training. Hence, as you make changes to your model, you might need to change that parameter or else the GPU will run out of memory when allocating the first batch.

@wgpubs @radek I had the same problem. I’d create a model twice, and the first model would stay in the GPU memory and I’d have to restart the kernel.
But not any more! Remove all ipython references to the model and invoke the garbage collector, and the memory should be freed up. Like so:

import gc
%xdel model
for i in range(3): gc.collect()
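For anyone outside a notebook, here is a small plain-Python sketch of the same idea. `FakeModel` is a hypothetical stand-in for a real model, and the weak reference is only there to show when the object is really gone. Whether the GPU memory itself is handed back at that point depends on Theano, so treat this as an illustration of the Python side only.

```python
import gc
import weakref

class FakeModel(object):
    """Hypothetical stand-in for a large model object."""
    def __init__(self):
        # A reference cycle: reference counting alone cannot free this object.
        self.me = self

model = FakeModel()
probe = weakref.ref(model)  # lets us check later whether the object still exists

del model      # drop our own name for it (what %xdel also does, plus IPython's caches)
gc.collect()   # the cycle collector can now reclaim it

print(probe() is None)  # True
```

In a notebook, plain `del` is often not enough because IPython keeps its own references to outputs, which is the part `%xdel` takes care of.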

No matter what I do I still get the out of memory error any time I try to remove Dropout from VGG16BN

I’ve restarted the notebook’s kernel a bazillion times, I’ve tried slav’s garbage collector trick several times (btw, thanks as that is still a good tip), and can’t get this to work. I’m following the same general process Jeremy uses from lesson 3 to remove Dropout from VGG16 BUT can’t get it to work at all with VGG16BN.

What am I missing?

Is the problem with having BatchNorm in there?

I even tried to not change Dropout and just separate the training of the FC layers (using the output of the conv. layers as the features fit in the FC only model) … just like in lesson 3 notebooks. I get the same error!

You need smaller batch_size, you can set it in get_batches. It defaults to 64. Are you using the GPU? How much memory do you have? Try setting it to a smaller value - the training will take longer the smaller it is but smaller batch size is not necessarily a bad thing.

Batch size is set to 4

Yes, using the GPU

How much RAM on the GPU do you have? There are two types of memory error - one due to failing to allocate enough memory on the GPU, one for the kernel not being able to reserve enough RAM / swap. Post your error pls


Here is the exception in all its gory details:

MemoryError                               Traceback (most recent call last)
<ipython-input-21-991689d366a4> in <module>()
      1 fc_model.fit(train_features_conv, train_labels, nb_epoch=8, 
----> 2              batch_size=batch_size, validation_data=(val_features_conv, val_labels))

C:\Development\_tools\Anaconda3\envs\ml_py2\lib\site-packages\keras-1.2.1-py2.7.egg\keras\models.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, **kwargs)
    670                               class_weight=class_weight,
    671                               sample_weight=sample_weight,
--> 672                               initial_epoch=initial_epoch)
    674     def evaluate(self, x, y, batch_size=32, verbose=1,

C:\Development\_tools\Anaconda3\envs\ml_py2\lib\site-packages\keras-1.2.1-py2.7.egg\keras\engine\training.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch)
   1167         else:
   1168             ins = x + y + sample_weights
-> 1169         self._make_train_function()
   1170         f = self.train_function

C:\Development\_tools\Anaconda3\envs\ml_py2\lib\site-packages\keras-1.2.1-py2.7.egg\keras\engine\training.pyc in _make_train_function(self)
    759             training_updates = self.optimizer.get_updates(self._collected_trainable_weights,
    760                                                           self.constraints,
--> 761                                                           self.total_loss)
    762             updates = self.updates + training_updates

C:\Development\_tools\Anaconda3\envs\ml_py2\lib\site-packages\keras-1.2.1-py2.7.egg\keras\optimizers.pyc in get_updates(self, params, constraints, loss)
    427         shapes = [K.get_variable_shape(p) for p in params]
    428         ms = [K.zeros(shape) for shape in shapes]
--> 429         vs = [K.zeros(shape) for shape in shapes]
    430         self.weights = [self.iterations] + ms + vs

C:\Development\_tools\Anaconda3\envs\ml_py2\lib\site-packages\keras-1.2.1-py2.7.egg\keras\backend\theano_backend.pyc in zeros(shape, dtype, name)
    158     if dtype is None:
    159         dtype = floatx()
--> 160     return variable(np.zeros(shape), dtype, name)

C:\Development\_tools\Anaconda3\envs\ml_py2\lib\site-packages\keras-1.2.1-py2.7.egg\keras\backend\theano_backend.pyc in variable(value, dtype, name)
     85     else:
     86         value = np.asarray(value, dtype=dtype)
---> 87         variable = theano.shared(value=value, name=name, strict=False)
     88     variable._keras_shape = value.shape
     89     variable._uses_learning_phase = False

c:\development\_lib\theano\theano\compile\sharedvalue.pyc in shared(value, name, strict, allow_downcast, **kwargs)
    245             try:
    246                 var = ctor(value, name=name, strict=strict,
--> 247                            allow_downcast=allow_downcast, **kwargs)
    248                 utils.add_tag_trace(var)
    249                 return var

c:\development\_lib\theano\theano\sandbox\cuda\var.pyc in float32_shared_constructor(value, name, strict, allow_downcast, borrow, broadcastable, target)
    240         # type.broadcastable is guaranteed to be a tuple, which this next
    241         # function requires
--> 242         deviceval = type_support_filter(value, type.broadcastable, False, None)
    244     try:

MemoryError: ('Error allocating 411041792 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).', "you might consider using 'theano.shared(..., borrow=True)'")

2GB is very little - there is some memory that gets allocated when the model (computation graph?!) is loaded into the gpu (when you do model.add, model.compile, etc) and there also seems to be more memory needed for the calculations. With trainable conv layers I believe the model occupies ~ 1.6 GB in the memory of the GPU, and the biggest batch size I was able to set with 11GB of RAM on the GPU is 64 for the training data.
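The size of the failing allocation can be decoded with a bit of arithmetic. A rough sketch, assuming float32 weights and assuming the first Dense layer takes the flattened 7 x 7 x 512 output of the convolutional stack (the usual VGG16 shape for 224 x 224 images; the thread does not print the shapes, so this is a guess that happens to match the byte count exactly):

```python
BYTES_PER_FLOAT32 = 4

def dense_weight_bytes(n_in, n_out):
    # Memory taken by the weight matrix of one Dense layer (bias ignored).
    return n_in * n_out * BYTES_PER_FLOAT32

flat_features = 7 * 7 * 512  # assumed size of the flattened conv output
first_fc_bytes = dense_weight_bytes(flat_features, 4096)
print(first_fc_bytes)  # 411041792, the number in the MemoryError

# The traceback fails inside the optimizer's get_updates while creating `vs`,
# i.e. after `ms` was already allocated. So for this one layer there is at
# least the weight matrix plus two same-sized optimizer buffers.
three_copies_gib = 3 * first_fc_bytes / float(2 ** 30)
print(three_copies_gib)  # 1.1484375
```

If that reading is right, roughly 1.15 GiB is needed for the first Dense layer alone before any batch is processed, and none of it depends on batch_size. The original model's own copy of these weights may also still be on the GPU.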

You might want to download an application for displaying stats on your GPU to learn more. On Linux with NVIDIA there is the nvidia-smi application, which gives you insight into power consumption, memory usage, etc., but depending on your platform you might need to use something else.

Well, my 960 has been working fine thus far through the class with batch_size = 4. I’ve been able to use it for both the sample and full dogscats datasets at least without any issues.

I still think having BatchNormalization in there is causing the problem. I have received the out of memory error when I didn’t set the weights correctly in other models … and I think that is what is happening here as well.

Thanks for trying to help btw.

@wgpubs BatchNorm uses very few weights compared to the connections between Dense layers. I believe it has 4 weights per neuron.
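A quick back-of-the-envelope comparison, as a sketch. It assumes the 4 values per neuron are the scale, the shift, and the running mean and variance, and uses a 4096-to-4096 Dense layer as the example:

```python
def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

def batchnorm_params(n_units):
    # Assumed: scale, shift, running mean and running variance per unit.
    return 4 * n_units

dense = dense_params(4096, 4096)
bn = batchnorm_params(4096)

print(dense)              # 16781312
print(bn)                 # 16384
print(dense / float(bn))  # 1024.25
```

So a BatchNorm layer sitting after a 4096-unit Dense layer adds about a thousandth of that layer's parameters.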
Also you might want to try training on the CPU, since probably having such a small batch might negatively affect training (don’t take my word for it though - test it).

Have you tried to remove Dropout from the VGG16BN model? If so, I’d be curious to see your code and how it compares to mine.

I did remove it by dividing the previous layer (Batch Norm or Dense) weights by a factor.
I’m not sure this was the right decision since, as @radek pointed out, Keras implements Inverted Dropout.