Understanding fully convolutional networks

maxim.pechyonkin · March 17, 2018, 10:58am

Hello everyone!

I was looking for a good resource to learn about fully convolutional neural nets but couldn’t find any. Jeremy mentioned in one of the Part 1 v2 lectures that fully convolutional networks can accept input images of any size. I got curious about how that works, but so far I have not found a good resource that describes it. The forums also do not have a topic dedicated only to fully convolutional networks, which is why I created this thread.

Please share any links related to fully convolutional nets!

svaisakh · March 17, 2018, 11:49am

Fully Connected/Linear/Dense layers just do a matrix multiply.

To do that, they need to know the size of the input in advance, say, the number of pixels in the image (you would also have to flatten the image tensor first).

And so, for a linear layer, the input --> output shapes look like:

(-1, i) \rightarrow (-1, o)

The first dimension is a variable batch-size.

Note that i and o are pre-defined.
That is, you need to know them ahead of time.
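
As a minimal sketch of this constraint, assuming PyTorch and a flattened 28 \times 28 input (so i = 784 and o = 10, both chosen here for illustration), a linear layer accepts any batch size but rejects a different feature count:

```python
import torch
from torch import nn

# A linear layer fixes its input width (i) and output width (o) at construction.
linear = nn.Linear(in_features=784, out_features=10)  # 784 = 28 * 28, flattened

# The batch dimension is free to vary.
out_shape = tuple(linear(torch.randn(4, 784)).shape)

# A flattened 28 x 16 image has 448 features, which does not match i = 784.
try:
    linear(torch.randn(4, 448))
    mismatch_rejected = False
except RuntimeError:
    mismatch_rejected = True

print(out_shape, mismatch_rejected)
```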

A convolutional layer, on the other hand, uses filters.
One needs to pre-define the size of the filters, the number of filters (which is the number of output channels) and the number of channels in the input.

In this case, the shapes look like the following:

(-1, c_{in}, l_{in}) \rightarrow (-1, c_{out}, l_{out})

This is, of course, the case for 1-D signals. The idea is the same for a 2-D image.

Note that, of these dimensions, only c_{in} and c_{out} need to be predefined.

l_{out} is related to l_{in} depending on factors such as the filter size, dilation, stride and any padding.
However, l_{in} is variable.
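
The relation between l_{in} and l_{out} can be sketched in a few lines of plain Python. The formula is the standard one for a convolution along a single axis (it matches the shape formula in the PyTorch `Conv1d` documentation); the helper name `conv_out_len` is made up for this example.

```python
def conv_out_len(l_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one axis."""
    # A dilated filter spans dilation * (kernel_size - 1) + 1 input positions.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (l_in + 2 * padding - effective_kernel) // stride + 1

# The same layer hyperparameters work for any input length.
same_28 = conv_out_len(28, kernel_size=3, padding=1)
same_100 = conv_out_len(100, kernel_size=3, padding=1)
halved = conv_out_len(28, kernel_size=3, stride=2, padding=1)
print(same_28, same_100, halved)
```

Nothing in the function depends on a fixed l_{in}, which is the point: the layer's parameters are set by the filter size and channel counts only.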

This means that a convolutional layer accepts an image of any size, provided the (padded) input is at least as large as the filter.

Let’s take the example of an MNIST classification task. Pretend for a moment that the digits have shape 28 \times 28 \times 1.

Let’s say our convolutions are chosen to halve the spatial size at each layer (rounding down).

Then, the sequence of sizes looks as follows:

(-1, 1, 28, 28) \rightarrow (-1, 32, 14, 14) \rightarrow (-1, 64, 7, 7) \rightarrow (-1, 128, 3, 3)

All this is good, but what if, in production, I get an MNIST image of size 28 \times 16?

Well, our fully-convolutional sequence of layers can handle this…

(-1, 1, 28, 16) \rightarrow (-1, 32, 14, 8) \rightarrow (-1, 64, 7, 4) \rightarrow (-1, 128, 3, 2)
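
The two shape sequences above can be reproduced with a small plain-Python trace. The post does not say which convolutions are used, so this sketch assumes a kernel size of 2, a stride of 2 and no padding, which is one choice that halves each dimension and rounds down (7 becomes 3). The function names are made up for the example.

```python
def halve(size, kernel_size=2, stride=2):
    # Output size of an unpadded convolution along one axis.
    return (size - kernel_size) // stride + 1

def trace_shapes(height, width, channels=(32, 64, 128)):
    """Return the (channels, height, width) shape after each layer."""
    shapes = [(1, height, width)]
    for c in channels:
        height, width = halve(height), halve(width)
        shapes.append((c, height, width))
    return shapes

train_shapes = trace_shapes(28, 28)
prod_shapes = trace_shapes(28, 16)
print(train_shapes)
print(prod_shapes)
```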

In order to classify the 10 MNIST digits, let’s say we use a 1 \times 1 convolution with 10 channels, to get (-1, 10, 3, 3) in training and (-1, 10, 3, 2) in production.

Now, we need to somehow squeeze these last two dimensions in a size-agnostic manner.

A good way to do it is through Global Average Pooling, which just takes the mean of the last two dimensions.

Now, for both cases, we have a (-1, 10) shaped tensor, which we can treat as the class-scores.
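
Putting the pieces together, here is a minimal PyTorch sketch of such a network. The class name `TinyFCN`, the kernel size 2 / stride 2 convolutions and the ReLU activations are assumptions made for this example (the post only fixes the channel counts and the halving behaviour), so treat it as an illustration rather than a recommended architecture.

```python
import torch
from torch import nn

class TinyFCN(nn.Module):
    """Hypothetical fully convolutional classifier for 1-channel images."""

    def __init__(self, num_classes=10):
        super().__init__()
        # Each convolution halves the spatial size (rounding down).
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=2, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=2, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=2, stride=2), nn.ReLU(),
        )
        # 1 x 1 convolution: one output channel per class.
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        x = self.classifier(self.features(x))  # (N, 10, H', W')
        # Global average pooling: mean over the two spatial dimensions.
        return x.mean(dim=(2, 3))              # (N, 10)

model = TinyFCN()
train_scores = model(torch.randn(8, 1, 28, 28))
prod_scores = model(torch.randn(8, 1, 28, 16))
print(train_scores.shape, prod_scores.shape)
```

The same `model` object handles both input sizes because none of its parameters depend on the height or width of the image.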

This is an instance of a fully-convolutional net, so named because, well…
there are only convolutional layers involved (you could count the Average Pooling as an activation function, since it has no learned parameters).

Notes:

It is never a good idea to have training and test data that differ, even if only in shape. This was just a hypothetical example; usually, the training set itself would contain variable-sized images/inputs. That said, I suppose a difference in shape alone shouldn’t do much harm.

Fully-convolutional networks are by no means the only architectures that accept variable-sized inputs. Seq-to-seq models like WaveNet and Tacotron 2 have other layers like LSTMs and even Dense layers! Look into these other exotic architectures to understand why they don’t have trouble with varying sizes.
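
As a generic illustration of that last point (not a reproduction of WaveNet or Tacotron 2), the PyTorch sketch below runs an LSTM followed by a dense layer on sequences of two different lengths. The layer sizes and the helper name `per_step_scores` are arbitrary choices for the example. It works because the LSTM reuses the same weights at every time step and `nn.Linear` is applied to the last dimension only.

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 4)  # dense layer applied independently at each time step

def per_step_scores(seq):
    hidden, _ = lstm(seq)   # (N, T, 16), for any sequence length T
    return head(hidden)     # (N, T, 4)

short_out = per_step_scores(torch.randn(2, 5, 8))
long_out = per_step_scores(torch.randn(2, 50, 8))
print(short_out.shape, long_out.shape)
```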

maxim.pechyonkin · March 17, 2018, 3:57pm

@svaisakh Thank you! This makes it much clearer! Essentially, for a classification task, the last convolutional layer will have a number of convolutional kernels equal to the number of classes, followed by global average pooling to reduce each spatial dimension to one, and then an activation function. Am I correct?

svaisakh · March 18, 2018, 3:06am

Yes.

iirc, you could do it the other way around too.

Generally speaking, I just look at the ways (sequences of layers) in which I can go from my input tensor shape to my output tensor shape, within logical constraints (e.g., convolutions for spatially coherent data).

alwc · May 16, 2018, 3:17am

Do you mean 10 filters of dimension 1 \times 1 \times 128?

svaisakh · May 16, 2018, 6:15am

Yes.
By 10 channels, I meant the channels in the output.
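
A quick way to check this in PyTorch (assumed here as the framework) is to look at the weight tensor of such a layer. PyTorch stores it as (out_channels, in_channels, kernel_height, kernel_width), so the 10 filters of size 1 \times 1 \times 128 appear as a (10, 128, 1, 1) tensor. The sketch also checks that a 1 \times 1 convolution is the same as applying one 128-to-10 linear map at every spatial position.

```python
import torch
from torch import nn

conv1x1 = nn.Conv2d(in_channels=128, out_channels=10, kernel_size=1)
weight_shape = tuple(conv1x1.weight.shape)  # (out, in, kH, kW)

x = torch.randn(2, 128, 3, 2)               # the "production" feature map
out = conv1x1(x)
out_shape = tuple(out.shape)

# Same computation written as a per-position matrix multiply.
w = conv1x1.weight.view(10, 128)
per_pixel = torch.einsum("oc,nchw->nohw", w, x) + conv1x1.bias.view(1, 10, 1, 1)
matches = torch.allclose(out, per_pixel, atol=1e-4)

print(weight_shape, out_shape, matches)
```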