Dense vs convolutional vs fully connected layers

Hi there,

I’m a little fuzzy on what is meant by the different layer types. I’ve seen a few different words used to describe layers:

  • Dense
  • Convolutional
  • Fully connected
  • Pooling layer
  • Normalisation

There’s some good info on this page but I haven’t been able to parse it fully yet. Some things suggest a dense layer is the same as a fully connected layer, but other things tell me that a dense layer performs a linear operation from the input to the output and a fully connected layer doesn’t, so I’m kinda confused.



Dense and fully connected are two names for the same thing.

Did you have any questions or want any clarification about any of the other types of layer?


I’d love some clarification on all of the different layer types. Here’s my understanding so far:

Dense/fully connected layer: A linear operation on the layer’s input vector.
Convolutional layer: A layer that consists of a set of “filters”. The filters take a subset of the input data at a time, but are applied across the full input (by sweeping over the input). The operations performed by this layer are still linear/matrix multiplications, but they go through an activation function at the output, which is usually a non-linear operation.
Pooling layer: We utilise the fact that deeper layers of the network are activated by “higher” or more complex features that are exhibited by a larger area of the network’s input data. A pooling layer effectively downsamples the output of the prior layer, reducing the number of operations required for all the following layers, but still passing on the valid information from the previous layer.
Normalisation layer: Used at the input for feature scaling, and in batch normalisation at hidden layers.


Those are pretty good definitions. Here’s my own version:

Dense layer: A linear operation in which every input is connected to every output by a weight (so there are n_inputs * n_outputs weights - which can be a lot!). Generally followed by a non-linear activation function.
Convolutional layer: A linear operation using a subset of the weights of a dense layer. Nearby inputs are connected to nearby outputs (specifically, a convolution). The weights for the convolutions at each location are shared. Due to the weight sharing, and the use of a subset of the weights of a dense layer, there are far fewer weights than in a dense layer. Generally followed by a non-linear activation function.
Pooling layer: Replace each patch in the input with a single output, which is the maximum (can also be the average) of the input patch.
Normalisation layer: Scale the input so that the output has close to zero mean and unit standard deviation, to allow for faster and more resilient training.
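To make the size difference between dense and convolutional layers concrete, here’s a rough parameter count in plain Python (a sketch; the layer sizes below are made up for illustration, not taken from any particular network):

```python
# Rough parameter-count comparison between a dense and a conv layer.

def dense_params(n_inputs, n_outputs):
    # Every input connects to every output, plus one bias per output.
    return n_inputs * n_outputs + n_outputs

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    # Each filter is a kernel_h x kernel_w x in_channels block of shared
    # weights, plus one bias per filter - independent of the input size.
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

# A 224x224 RGB image flattened into a dense layer with 64 outputs:
print(dense_params(224 * 224 * 3, 64))   # 9633856
# The same input through a conv layer with 64 3x3 filters:
print(conv_params(3, 3, 3, 64))          # 1792
```

The weight sharing is what makes the conv layer several thousand times smaller here, even though both are linear operations.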


Regarding the convolutional layer - there is frequently the usage of the term “filters”. Is the goal of the neural network to compute the correct value for the filter, and thus the term “filter” can be replaced with the term “weights”?


Yes, although ‘filter’ refers to a set of weights for a single convolution operation. For example, in the convolution intro notebook I showed 8 filters (an edge detector for each vertical, horizontal, and diagonal direction).


So if I call each filter a neuron in the network, would one neuron be initialized with weights to make it a vertical edge detector, and another neuron initialized as a horizontal detector, etc.? And then these weights are adjusted during training?

There is one activation per filter, for each location in the input grid. So if the output is height 224 × width 224 × 64 filters, that’s 224 × 224 × 64 = 3,211,264 activations in the next layer.
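As a quick sanity check of that arithmetic (using the same example sizes):

```python
# One activation per filter per spatial location (assuming the output
# grid matches the 224x224 input, e.g. with 'same' padding).
height, width, n_filters = 224, 224, 64
activations = height * width * n_filters
print(activations)  # 3211264
```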

So the weights of each filter are fixed? As opposed to the weights in a linear regression, where the objective is to adjust the weights until the right function approximation is achieved. Or is there something I’m missing about the activation function? Is something being adjusted in the activation function?

No, all of the weights in each filter are optimized using SGD.


In the spreadsheet created for lesson 4, on convolutions, why does the second layer have 2 filter matrices for each input matrix?

For the convs from column AH, there are two matrices here since we’re creating 2 filters (there’s no particular reason we chose 2 - the first parameter to keras’ Convolution2D layer is the number of filters you want, so this example assumes we had asked for 2 of them). For the convs from column BM, each of the 2 filters we’ve created (and we could have chosen a different # of filters here too) has a 3x3x2 input, since the previous layer has 2 filters. So each filter is shown as 2 matrices, although it’s better to think of it as a single 3x3x2 3-dimensional array.
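To see why each second-layer filter is really one 3x3x2 array rather than two separate matrices, here’s a minimal NumPy sketch of multi-channel convolution (the loop-based implementation and the 8x8 input size are mine, for illustration only; real libraries like Keras do this far more efficiently):

```python
import numpy as np

def conv2d(x, filters):
    """Valid cross-correlation: x is (H, W, C_in), filters is
    (n_filters, kh, kw, C_in). Returns (H-kh+1, W-kw+1, n_filters)."""
    H, W, C = x.shape
    n, kh, kw, _ = filters.shape
    out = np.zeros((H - kh + 1, W - kw + 1, n))
    for f in range(n):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                # Each filter spans ALL input channels at once.
                out[i, j, f] = np.sum(x[i:i + kh, j:j + kw, :] * filters[f])
    return out

x = np.random.rand(8, 8, 1)       # single-channel input
w1 = np.random.rand(2, 3, 3, 1)   # layer 1: 2 filters
h = conv2d(x, w1)                 # -> (6, 6, 2): 2 feature maps
w2 = np.random.rand(2, 3, 3, 2)   # layer 2: each filter is one 3x3x2 array
y = conv2d(h, w2)                 # -> (4, 4, 2)
print(h.shape, y.shape)
```

Because the second layer’s input has 2 channels, each of its filters must be 3x3x2 - which is why the spreadsheet shows 2 matrices per filter.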

Here’s a wonderful in-depth look at convolutions, as they apply to deep learning:


Why do we need a dense layer in a CNN? Is it that dense layers combine the ‘features’ learnt by the filters from previous convolution layers to predict the output? Can we skip dense layers, have just convolution layers, and still get it to work?

‘Dense’ is a name for a fully connected / linear layer in Keras.

You are raising ‘dense’ in the context of CNNs so my guess is that you might be thinking of the densenet architecture. Those are two different things.

A CNN, in the convolutional part, will not have any linear (or in Keras parlance, dense) layers. As an input we have 3 channels with RGB images, and as we run convolutions we get some number of ‘channels’ or feature maps as a result. Feeding this to a linear layer directly would be impossible (you would first need to change it into a vector by calling view on it in PyTorch).

In some newer CNN architectures they do go without a dense layer by just downsampling using pooling layers until the size is say 7x7 and then using a 7x7 average pool in place of a dense layer. I believe GoogLeNet did something like this.
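A rough NumPy sketch of the two routes from feature maps to class scores (the 7x7x512 feature-map size and 10 classes are made-up numbers for illustration):

```python
import numpy as np

feature_maps = np.random.rand(7, 7, 512)     # output of the conv part

# (a) Classic route: flatten to a vector, then a big dense layer.
flat = feature_maps.reshape(-1)              # shape (25088,)
w_dense = np.random.rand(flat.size, 10)
scores_dense = flat @ w_dense                # shape (10,)

# (b) GoogLeNet-style route: a 7x7 average pool collapses each feature
# map to one number, then a much smaller dense layer is enough.
pooled = feature_maps.mean(axis=(0, 1))      # shape (512,)
w_small = np.random.rand(512, 10)
scores_pooled = pooled @ w_small             # shape (10,)

print(scores_dense.shape, scores_pooled.shape)
```

Route (b) needs 512 × 10 weights in its final layer instead of 25088 × 10, which is part of why global average pooling became popular.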


Hi,
Reading Jeremy’s xls, I was able to follow up to the max pooling, but I still can’t get my head around the last part: the dense layer (or fully connected layer - the classic/traditional matrix product). I understand that we flatten the previous layer’s matrix into a 1D array and feed it to a fully connected (dense) layer.

  1. So how are we keeping the features intact after flattening? How does the next fully connected layer know what the values it receives represent, and how does it retain what the previous layers have done?
  2. To flatten it, how was it done in the xls - which matrix is multiplied with which matrix?
  3. What is the math behind it?
  4. What is the reason for doing it, and how is it helpful?

I understood the final softmax layer and the one-hot encoding after it. Softmax is not good for multi-label classification (multiple labels in a single picture), yet we still used softmax and then applied one-hot encoding - why is that?

After reading the wiki for lesson 1, I got this link, and it explains the above: why we need a fully connected layer and what it does.

For the second question: I think we later switched from softmax to the sigmoid function for multi-label classification.
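A small NumPy sketch of why sigmoid fits multi-label classification better than softmax (the logits are invented for illustration - imagine two classes that are both strongly present in the picture):

```python
import numpy as np

logits = np.array([2.0, 2.0, -1.0])  # two equally strong classes

# Softmax forces the outputs to compete: they must sum to 1, so two
# present classes each get squeezed to below 0.5.
softmax = np.exp(logits) / np.exp(logits).sum()

# Sigmoid scores each class independently, so both strong classes can
# be near 1 at the same time - what multi-label classification needs.
sigmoid = 1 / (1 + np.exp(-logits))

print(softmax.round(2))  # [0.49 0.49 0.02]
print(sigmoid.round(2))  # [0.88 0.88 0.27]
```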

I read all these links to get more understanding.

Some points (quite important imho) to underline regarding CNNs:

  • The presence of a fully connected layer at the end of any traditional (i.e. non-fully conv) CNN.
    It serves the purpose of doing the actual classification. Without it, a traditional CNN would be unable to spit out the predicted classes.

  • Dense layers should not be called linear to differentiate them from conv layers, since convolution is itself a linear operator (and discrete convolution, in particular, is just a matrix product).

  • When discrete convolution is explained, people generally make use of filters like the ones used by Photoshop and the like, that is, pre-established filters (small matrices already filled with meaningful numbers) which apply some effect to the image at hand (e.g., making the edges more evident). This makes people think that a NN just applies this kind of filter to extract features like edges, corners, etc. (see for example Roderick’s question).
    In our case, the filters are definitely NOT pre-established. The entire point (and beauty) of a CNN is that those filters are learned during training: if a CNN wants to minimize the loss, it is forced to learn the correct filters (via SGD and backprop). At the beginning of training, the filters are initialized from some random distribution (see Xavier init, for example).
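For instance, a Glorot/Xavier-style uniform initialization of a filter bank might look like this (a sketch; the shapes are illustrative, and frameworks provide this built in):

```python
import numpy as np

# Xavier/Glorot uniform init: draw weights from U(-limit, limit) with
# limit = sqrt(6 / (fan_in + fan_out)). The filters start as small
# random numbers and only become edge/corner detectors through SGD.
n_filters, kh, kw, in_channels = 8, 3, 3, 3
fan_in = kh * kw * in_channels
fan_out = kh * kw * n_filters
limit = np.sqrt(6.0 / (fan_in + fan_out))
filters = np.random.uniform(-limit, limit,
                            size=(n_filters, kh, kw, in_channels))
print(filters.shape)  # (8, 3, 3, 3)
```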


I have the same question as you about the dense layer, i.e. why initialize it with random numbers, and intuitively, how does the dense layer help? The video does not explain why the dense layer is initialized with random numbers.