Understanding fully convolutional networks

Hello everyone!

I was looking for a good resource to learn fully convolutional neural nets but couldn’t find any. Jeremy mentioned in one of the Part 1 v2 lectures that fully convolutional networks can accept input images of any size. I got curious about how that works, but so far I have not found a good resource that describes it. The forums also do not have a topic dedicated only to fully convolutional networks, which is why I created this thread.

Please share any links related to fully convolutional nets!

Fully Connected/Linear/Dense layers just do a matrix multiply.

In order to do that, they require the size of the input in advance, say, the size of the image (hypothetically; in practice you’d have to flatten the image tensor first).

And so, for a linear layer, the input --> output shapes look like:

(-1, i) \rightarrow (-1, o)

The first dimension is a variable batch-size.

Note that i and o are pre-defined.
That is, you need to know them ahead of time.
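As a quick sanity check, here is a minimal PyTorch sketch (PyTorch is an assumption; the specific sizes are illustrative) showing that a Linear layer is locked to a fixed input size:

```python
import torch
import torch.nn as nn

# A Linear layer stores a weight matrix of shape (o, i),
# so i must be chosen ahead of time.
linear = nn.Linear(in_features=784, out_features=10)  # i = 784, o = 10

x = torch.randn(5, 784)   # batch of 5 flattened 28x28 images
y = linear(x)
print(y.shape)            # the batch dimension (-1) is free

# A different input size breaks the matrix multiply:
try:
    linear(torch.randn(5, 448))  # e.g. a flattened 28x16 image
except RuntimeError:
    print("shape mismatch")
```

Only the first (batch) dimension is variable; the feature dimension must match `in_features` exactly.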

A convolutional layer on the other hand, uses filters.
One needs to pre-define the size of the filters, how many output channels there should be, and how many channels there are in the input.

In this case, the shapes look like the following:

(-1, c_{in}, l_{in}) \rightarrow (-1, c_{out}, l_{out})

This is, of course, the case for 1-D signals. The idea is the same for a 2-D image.

Note that, here, only c_{in} and c_{out} need to be predefined.

l_{out} is determined by l_{in} together with factors such as the filter size, dilation, stride and any padding.
However, l_{in} is variable.

This means that with a convolutional layer, we could use any size image and it would go through.
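A short sketch of that, again in PyTorch (the channel counts and padding are just illustrative choices): the same 1-D convolution happily accepts inputs of different lengths.

```python
import torch
import torch.nn as nn

# Only c_in and c_out are fixed; the spatial size l_in is free.
conv = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=3, padding=1)

for l_in in (28, 16, 100):
    x = torch.randn(5, 1, l_in)  # (-1, c_in, l_in)
    y = conv(x)                  # (-1, c_out, l_out)
    print(y.shape)               # with padding=1 here, l_out == l_in
```

The same weights are slid over whatever length arrives, which is exactly why no input size needs to be fixed in advance.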

Let’s take the example of an MNIST classification task. Pretend for a moment that the digits have shape 28 \times 28 \times 1.

Let’s say our convolutions are chosen to halve the input-size.

Then, the sequence of sizes looks as follows:

(-1, 1, 28, 28) \rightarrow (-1, 32, 14, 14) \rightarrow (-1, 64, 7, 7) \rightarrow (-1, 128, 3, 3)

All this is good, but what if, in production, I get an MNIST image of size 28 \times 16?

Well, our fully-convolutional sequence of layers can handle this…

(-1, 1, 28, 16) \rightarrow (-1, 32, 14, 8) \rightarrow (-1, 64, 7, 4) \rightarrow (-1, 128, 3, 2)

In order to classify the 10 MNIST digits, let’s say we use a 1 \times 1 convolution with 10 channels, to get (-1, 10, 3, 3) in training and (-1, 10, 3, 2) in production.

Now, we need to somehow squeeze these last two dimensions in a size-agnostic manner.

A good way to do it is through Global Average Pooling, which just takes the mean of the last two dimensions.

Now, for both cases, we have a (-1, 10) shaped tensor, which we can treat as the class-scores.
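Putting the whole pipeline together, here is a minimal sketch of such a net (a kernel_size=2, stride=2 convolution halves each spatial dimension, rounding down, which reproduces the size sequences above; the layer widths and activations are assumptions):

```python
import torch
import torch.nn as nn

body = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=2, stride=2),    # 28 -> 14
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=2, stride=2),   # 14 -> 7
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=2, stride=2),  # 7 -> 3
    nn.ReLU(),
    nn.Conv2d(128, 10, kernel_size=1),            # 1x1 conv: 10 class channels
)

def classify(x):
    scores = body(x)                   # (-1, 10, h, w); h, w depend on the input
    return scores.mean(dim=(-2, -1))   # Global Average Pooling -> (-1, 10)

print(classify(torch.randn(4, 1, 28, 28)).shape)  # torch.Size([4, 10])
print(classify(torch.randn(4, 1, 28, 16)).shape)  # torch.Size([4, 10])
```

Both the "training" 28 x 28 input and the "production" 28 x 16 input come out as (-1, 10) class scores, with no Linear layer anywhere.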

This is an instance of a fully-convolutional net, named so because, well…
there are only convolutional layers involved (you could count the Average Pooling as something like an activation function, since it has no learnable parameters).


  1. It is never a good idea to have differently distributed data for training and testing (even if it’s only the shapes that differ). This was just a hypothetical example. Usually, the training set itself would contain variable-sized images/inputs.
    Although, I suppose a shape difference alone shouldn’t do much harm.

  2. Fully-convolutional networks are by no means the only architectures that accept variable-sized inputs.
    Sequence-to-sequence models like WaveNet and Tacotron 2 have other layers like LSTMs and even Dense layers!
    Look into these architectures to understand why they don’t have trouble with varying input sizes.


@svaisakh Thank you! This makes it much clearer! Essentially, for a classification task, the last convolutional layer will have a number of convolutional kernels equal to the number of classes, followed by global average pooling to reduce the spatial dimensionality to one, and then an activation function. Am I correct?


iirc, you could do it the other way around too.

Generally speaking, I just look at the ways (sequences of layers) in which I can go from my input tensor shape to my output tensor shape, within logical constraints (e.g. convolutions for spatially coherent data).


Do you mean 10 filters of dimension 1 \times 1 \times 128?

By 10 channels, I meant the channels in the output.
