Number of outputs to inputs | understanding clarification

In Chap 13 it discusses:
Neural networks will only create useful features if they're forced to do so—that is, if the number of outputs from an operation is significantly smaller than the number of inputs.

In Chap 14, the _resnet_stem first layer is created as ConvLayer(3, 32, 3, 2).
Is it correct to understand that this is mapping 27 values (3 channels x 3x3 kernel) to 288 (32 channels x 3x3)?

It appears to me that the outputs are significantly higher than the inputs, but the model is able to adjust these weights to reduce the cost function, so I think I'm missing something :slight_smile: Or, if I'm not, what characteristics of the model in Chap 14 make it able to map to more outputs than inputs?

1 Like

Hi Brent. A couple of points.

First, I think you are mixing up activations with weights. The inputs and outputs of a layer are activations. Weights and biases are internal to the layer and learned.

So for a 100x100x3 input image, the number of inputs is 30000. The output of the ConvLayer with stride 2 has 50x50x32 = 80000 activations, unrelated to the kernel size. So yes at this layer the number of outputs is larger than the number of inputs.
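If you want to sanity-check those counts yourself, here is a quick plain-Python sketch (the helper name is mine, and it assumes fastai's ConvLayer default padding of ks//2):

```python
def conv_out_size(n, ks=3, stride=2):
    # Spatial output size of a conv with padding ks//2 (fastai's default):
    # out = floor((n + 2*pad - ks) / stride) + 1
    pad = ks // 2
    return (n + 2 * pad - ks) // stride + 1

h = w = 100
in_acts = h * w * 3                                  # 30000 input activations
out_acts = conv_out_size(h) * conv_out_size(w) * 32  # 50 * 50 * 32
print(in_acts, "->", out_acts)                       # 30000 -> 80000
```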

The first layer of a convolutional net will typically extract a large number of low-level features, like edges, colors, and simple textures. (Here I am using “feature” to mean a single activation, not a whole spatial feature map.) At some deeper layer, the model will combine these many features into a set far fewer than 80000. This means it has learned a smaller set of features, or abstractions, that characterize any given image. Like “an eye” or “a corner”. It can then use these features to classify the image.

Without the feature “bottleneck”, the model would never be forced to derive the relevant features (abstractions) that allow the training set to be classified. Bad things may happen: the GPU will run out of memory, or the model may memorize the training set (overfitting).

To sum up, your features calculation was wrong, but right in its conclusion, and the smaller number of features will be found deeper in the model.

HTH, Malcolm
:slightly_smiling_face:

P.S. You can use learn.summary() to see the number of input and output activations for each layer in a model.

2 Likes

Hi Malcolm,

Thank you for correcting my calculation! And yes, I follow that there is a reduction of width, height, and channels through the model.

I’m still not clear on how to understand the paragraph in Chap 13 (https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb, the "A Simple Baseline" section).

The model defined earlier in the chapter is

simple_cnn = sequential(
    conv(1 ,4),            #14x14
    conv(4 ,8),            #7x7
    conv(8 ,16),           #4x4
    conv(16,32),           #2x2
    conv(32,2, act=False), #1x1
    Flatten(),
)

So it also has a reduction of width, height, and channels through the model. However, when discussing increasing the number of output channels, the paragraph still states: "That means it isn't really learning much at all: the output size is almost the same as the input size. Neural networks will only create useful features if they're forced to do so—that is, if the number of outputs from an operation is significantly smaller than the number of inputs." It then redefines the model to have a larger convolution height/width:

return sequential(
        conv(1 ,8, ks=5),        #14x14
        conv(8 ,16),             #7x7
        conv(16,32),             #4x4
        conv(32,64),             #2x2
        conv(64,10, act=False),  #1x1
        Flatten(),
    )

As I understand from the much deeper ResNet architecture, the rule "if the number of outputs from an operation is significantly smaller than the number of inputs" is only applicable in certain scenarios, so I'm wondering what those are. Or… which bits of the above I'm misunderstanding, which is possibly more likely.

All thoughts most appreciated

I am really resisting having to read Chapter 13.

From what you listed, it looks like the comment is the spatial output size and conv(m,n) means m channels in, n channels out, with a stride of 2.

Then at each layer, the image is halved (1/4 the activations), and the channels are doubled, meaning the number of activations (features) is halved. This reduction of activations, from the whole image down to ten, forces the model to learn relevant features.
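Here is a rough sketch of that reduction for the first model, assuming a 28x28 single-channel MNIST input and stride-2 convs with padding ks//2 (the chapter's defaults; the helper name is mine):

```python
def conv_out(n, ks=3, stride=2):
    # spatial output size with padding ks//2
    return (n + 2 * (ks // 2) - ks) // stride + 1

size, channels = 28, 1
print(f"input: {size}x{size}x{channels} = {size * size * channels}")
for c_out in (4, 8, 16, 32, 2):
    size, channels = conv_out(size), c_out
    print(f"conv -> {size}x{size}x{channels} = {size * size * channels}")
```

This prints totals of 784, 784, 392, 256, 128, 2: after the first layer (where the channels quadruple rather than double, so the count stays flat), the activation count roughly halves at each step, down to the two final outputs.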

The second model uses more convolutions at the first layer, not more height and width. I think you may be confused about the height, width, and features of a convolution. There are lots of good animations available that show visually exactly what each parameter means.
:slightly_smiling_face:

The 2nd model starts with conv(1 ,8, ks=5), where ks=5 is the kernel size of the convolution, so it is an increase in the width/height of the convolution, and then yes, also an increase in the number of channels.

My question is specifically about the chapter and I think this is the correct forum to post it on? I'm quite curious to know whether "the number of outputs from an operation is significantly smaller than the number of inputs" is an actual rule, or if it's only applicable to certain models.

Hi Brent. I suspect we are not communicating well, so I will do my best to comment and you can take it from there.

The 2nd model starts with conv(1 ,8, ks=5), where ks=5 is the kernel size of the convolution; so it is an increase in the width,height of the convolution, and then yes also an increase in the number of channels.

The convolution’s output size is unrelated to the kernel size, except for some minor edge effects. The output size is determined by the input size, the strides, and the number of output channels. I have never heard the terms width and height applied to a convolution operation and do not know what they would mean.
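A tiny sketch of that point, again assuming padding of ks//2 as fastai's conv helper uses: with that padding, the spatial output size is the same whether the kernel is 3x3 or 5x5.

```python
def conv_out(n, ks, stride=2):
    pad = ks // 2   # "same"-style padding, fastai's default
    return (n + 2 * pad - ks) // stride + 1

# Same 28-pixel input, same stride: kernel size doesn't change the output size.
print(conv_out(28, ks=3), conv_out(28, ks=5))   # 14 14
```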

Yes, the first layer increases the number of activations, and deeper layers reduce them. This structure is typical for the first layer or layers of a model that processes images.

My question is specifically about the chapter and I think this is the correct forum to post it on? I'm quite curious to know whether "the number of outputs from an operation is significantly smaller than the number of inputs" is an actual rule, or if it's only applicable to certain models.

Models with a high number of inputs, like images or time series, typically take the huge dimensionality of the input (e.g. the image size) and reduce it to a much smaller number of activations we call features. (This is exactly what the two models you cite above are already doing.) Then those features are typically run through a fully connected “head” that reduces the features to the classes or numbers or whatever. There’s no actual rule (I’ll call it a principle instead), but consider this: to do anything useful a model has to reduce a big thing to a small thing. Otherwise, you may as well just use the big thing directly! So the principle of reduction is true both for classification models in general, and inside the typical structure of such a model: image (big) -> features (small) -> classes (smaller).

I don’t know whether the principle applies to every model there is. But I imagine it does. It’s a great, clarifying question to ask. Any machine learning task I can think of involves extracting a small number of relevant features and reassembling them into classes or numbers or maps or altered images or decisions or actions, etc. Maybe the reduction of inputs is a general unifying principle to keep in mind when designing any type of model.

HTH, Malcolm
:slightly_smiling_face:

Thank you for all your input; I hugely appreciate your time in sharing your insights. Perhaps reading just the "A Simple Baseline" section in Chap 13 could help smooth over any communication troubles we may be having? Although I feel we are in agreement on all points, I'm just asking a question very specific to that chapter/paragraph.

The quote I shared in my first question can be read as mapping from more to fewer through the whole network, but as I read the entire paragraph, it seems to imply that every layer should map from more activations to fewer - and it strikes me as odd that it spends a whole paragraph going over that idea when it doesn't seem to apply to the ResNet model.

Quote- bold & italics mine:
But there is a subtle problem with this. Consider the kernel that is being applied to each pixel. By default, we use a 3×3-pixel kernel. That means that there are a total of 3×3 = 9 pixels that the kernel is being applied to at each location. Previously, our first layer had four output filters. That meant that there were four values being computed from nine pixels at each location. Think about what happens if we double this output to eight filters. Then when we apply our kernel we will be using nine pixels to calculate eight numbers. That means it isn’t really learning much at all: the output size is almost the same as the input size. Neural networks will only create useful features if they’re forced to do so—that is, if the number of outputs from an operation is significantly smaller than the number of inputs.

Hi Brent. I did look over that section (at last!) and see where we are both confused. I have been talking about the total number of activations or features at each layer. Certainly, an image must be distilled into a smaller set of features in order to extract its class. Jeremy and Sylvain are extending this principle to say that the number of pixels seen by the kernel should be reduced into the number of channels.

Therefore the resnet first layer you cite originally would map a patch of 3x3x3=27 pixels into 32 channels, violating this principle, at least by a little.
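To put numbers on the ratio that paragraph is about (the helper name is mine), here is each first layer discussed in this thread, comparing the values the kernel reads at one location against the channels it produces there:

```python
def kernel_ratio(c_in, c_out, ks):
    # values the kernel reads at one location vs. values it produces there
    seen = c_in * ks * ks
    return seen, c_out, round(seen / c_out, 2)

print(kernel_ratio(1, 4, 3))    # (9, 4, 2.25)   simple_cnn's original first layer
print(kernel_ratio(1, 8, 3))    # (9, 8, 1.12)   doubled channels: almost 1:1
print(kernel_ratio(1, 8, 5))    # (25, 8, 3.12)  the ks=5 fix restores the reduction
print(kernel_ratio(3, 32, 3))   # (27, 32, 0.84) the ResNet stem: ratio below 1
```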

I really don’t know what to say! The authors have much much more experience than I, and I would not be quick to question their advice. I personally have never developed this intuition regarding pixels to channels, and could point out that many of the 5x5 pixels in the improved convolution are used again to derive adjacent pixels in the output channel map.

Zooming out, maybe Jeremy and Sylvain are sharing a generally effective principle for designing models, not posing a fixed, definite rule. It seems you have found a case where it is violated, and are right to ask for clarification. But machine learning, as I have come to understand it, is not a definite set of design rules. Rather it is an always changing fuzzy bunch of ideas that have proved to be effective in practice.

Maybe someday you’ll test in practice whether the pixels to channels principle really works and improve our bunch of ideas.

Thanks for bringing up an important point.
:slightly_smiling_face:

1 Like

Thank you for sticking with me through the discussion; and clarifying my mistakes & incorrect terminology. If I had managed to phrase my initial question as clearly as your summary above - that would certainly have made the conversation a fair bit more efficient, so truly appreciate your effort in separating the forest from the trees.

Onward to experiments!!

2 Likes

learn.recorder.plot_loss() looks similar with the first kernel size of either 5 or 3

Kernel size 5: [training/validation loss plot]

Kernel size 3: [training/validation loss plot]

Hi Brent. You are welcome for any efforts. I learned a lot from our discussion of your question, so thanks to you too!

You ran some experiments and it looks like ks=5 gives a little but not a lot of improvement. IMHO this kind of empirical testing is exactly what needs to happen to investigate any rule or principle.

A few points. (These are not parting shots, but suggestions for further directions.)

  • To see whether a model is better, typically you look at the best, final accuracy or loss on the validation set. The ultimate accuracy is used because researchers are interested in the best accuracy a method can achieve and in (bragging about) pushing the SOTA. Validation set because your model may not generalize well to new examples and/or may memorize the training set’s classes. You can’t determine this by looking at the training loss only.

  • Factors besides ultimate accuracy are also important, such as computation speed (time to train), memory use (fits in GPU, bigger batches), reproducibility of results.

  • You might want to take a look at “receptive fields” in relation to choosing kernel sizes.
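For instance, here is the standard receptive-field recursion (the field grows by (ks - 1) * jump at each layer, and the jump multiplies by the stride), sketched in plain Python with a helper name of my choosing, showing how a ks=5 first layer slightly enlarges the patch of input that each final activation "sees":

```python
def receptive_field(layers):
    # layers: (kernel_size, stride) pairs, listed from input to output
    rf, jump = 1, 1
    for ks, stride in layers:
        rf += (ks - 1) * jump   # widen by the kernel's reach at this depth
        jump *= stride          # each step now covers more input pixels
    return rf

print(receptive_field([(3, 2)] * 5))              # 63: five 3x3 stride-2 convs
print(receptive_field([(5, 2)] + [(3, 2)] * 4))   # 65: ks=5 first layer
```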

Best wishes for your studies of machine learning!

:slightly_smiling_face:

1 Like