Lesson 8 official topic

No problem! Some advice that Jeremy always gives is to break everything out into separate cells and check that the inputs and outputs match what you expect. Then, once everything is working, you combine them back together again.

I found 2 issues. The first is that img.resize is not an in-place operation; you have to assign the result of that operation back to the img variable, e.g. img = img.resize(…). The second, which seems like a potential bug in fastai, is that the result of img.resize is a plain PIL image, not a PILImageBW. You have to convert it back to the correct type before passing it to the model.
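A minimal sketch of those two fixes (assuming the image comes from the upload widget as in the notebook, and assuming PILImageBW.create also accepts a plain PIL image; otherwise go through a NumPy array first):

from fastai.vision.all import PILImageBW

img = PILImageBW.create(btn_upload.data[-1])

# resize is not in-place: assign the result back to img
img = img.resize((28, 28))

# the result of resize is a plain PIL image, so convert it back
# to PILImageBW before passing it to the model (assumption: create
# accepts a PIL image directly)
img = PILImageBW.create(img)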


3 Likes

You could also use fastai's Resize function, which doesn’t change the class:

img = PILImageBW.create(btn_upload.data[-1])
img_small = Resize((28,28))(img)
type(img_small)
fastai.vision.core.PILImageBW
3 Likes

I’m trying to wrap my head around embeddings in the tabular_learner.

I understand I can obtain the embedding sizes of the categorical features with get_emb_sz(dls.train_ds).

And the first category’s embedding size is (5282, 194).

So this is a lookup table of 5282 rows (as there are 5281 unique categories in this feature, plus 1) and each category in this feature can be represented by 194 values (a vector), which we discover by training the model. These 194 values “represent” a particular category.

Are these 194 values referred to as embeddings or latent factors?

And I’m guessing that you could use PCA on these 5282 rows and 194 values to interpret them in relation to the target? (e.g. after training pulling learn.model.embeds[index of feature].weight)
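Concretely, I imagine something like this after training (a rough sketch; the feature index 0 and the 3 components are arbitrary, and I'm using torch.pca_lowrank for the PCA step):

import torch
from fastai.tabular.all import get_emb_sz

emb_szs = get_emb_sz(dls.train_ds)            # e.g. first entry is (5282, 194)

# learned embedding matrix for the first categorical feature
emb = learn.model.embeds[0].weight.detach()   # shape: (5282, 194)

# project the 194-dim vectors onto the top 3 principal components
U, S, V = torch.pca_lowrank(emb, q=3)
emb_pca = (emb - emb.mean(0)) @ V             # shape: (5282, 3)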

Any help would be much appreciated :smile: !

Hi, I have not had time to reach this part of the course yet, but I was browsing, clicked on the link to the HF pets space, and there is a message that says “This space has one file that has been marked as unsafe.”

I have read about it in the HF docs and it sounds scary:
Pickle Scanning (huggingface.co)

What does that mean and is it really dangerous? TIA.

I’m having problems with the last task, which is to implement CrossEntropyLoss.

I uploaded my code, including the full error, to Kaggle: https://www.kaggle.com/code/maxwe000/notebook330407f119

RuntimeError: 0D or 1D target tensor expected, multi-target not supported

I really don’t know how to fix this. I have tried many different things, but I feel like the more I change, the more confusing it gets. Can someone help me with this error?

Hi,

I suggest you use CrossEntropyLossFlat instead of just a simple CrossEntropyLoss.
And remove the Softmax from the model: CrossEntropyLoss is basically softmax + log (i.e. log-softmax) + NLLLoss combined, so applying Softmax yourself would do it twice.

So, you have 5 outputs after the last linear layer. The problem is that the target values range from 1 to 5 instead of 0 to 4, so you can either subtract 1 from them or create 6 outputs instead. This tweak is necessary because of NLLLoss, which is basically an index operation on the output with the target value.
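To make the indexing point concrete, here is a small sketch with made-up numbers, showing that cross entropy is just log-softmax followed by NLLLoss, and why the targets have to be in the range 0 to 4 for 5 outputs:

import torch
import torch.nn.functional as F

preds = torch.randn(4, 5)              # 4 samples, 5 raw model outputs each
targets = torch.tensor([1, 3, 5, 2])   # labels in 1..5, as in the dataset

# NLLLoss indexes the log-probabilities with the target, so a target of 5
# is out of range for 5 outputs -> shift the labels to 0..4
targets = targets - 1

log_probs = F.log_softmax(preds, dim=1)
loss_nll = F.nll_loss(log_probs, targets)
loss_ce = F.cross_entropy(preds, targets)
print(torch.isclose(loss_nll, loss_ce))  # True: CE = log-softmax + NLLLoss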

Hopefully this helps.

2 Likes

Hey,

Thank you so much, it works now. Where did you learn all this? For example, what NLLLoss is and how it works together with the outputs? Or when to use CrossEntropyLossFlat instead of CrossEntropyLoss? I couldn’t find much in my research.

Hi, glad it helped.

I learned it from chapter 5 of the fastbook. There is a section about CrossEntropyLoss, and subsections have Softmax, Log Likelihood, Negative Log Likelihood, and so on. I just played around with them in my blog to get the same output as nn.CrossEntropyLoss().

For using CrossEntropyLossFlat vs. CrossEntropyLoss, I just used the flat version because I got the error and felt like it would work. I have to dig into details on why it works and why the normal version does not work. I am still learning. :slight_smile:

1 Like

Thanks, I‘ll look into it. Your blog looks interesting too, I‘ll check out some of your articles.

1 Like

Yes, it is a well-known safety issue with pickle files. Unfortunately, pickle is the default serialization format for PyTorch and Python:

Warning

The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

Consider signing data with hmac if you need to ensure that it has not been tampered with.

Safer serialization formats such as json may be more appropriate if you are processing untrusted data. See Comparison with json.

https://docs.python.org/3/library/pickle.html

The HF warning only indicates that the file is in an unsafe format. It does not mean that the specified file actually contains an unsafe payload. The same caution is generally required for any code or files you download or execute.

The Pickle Scanning page linked above covers some of the background that led to this warning being added to HF repos.

There are moves towards safer formats. GitHub - huggingface/safetensors: Simple, safe way to store and distribute tensors
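For example, a minimal sketch of saving and loading raw tensors with safetensors instead of pickle (using the library's torch helpers):

import torch
from safetensors.torch import save_file, load_file

# save a dict of named tensors without going through pickle
tensors = {"weight": torch.randn(10, 10), "bias": torch.zeros(10)}
save_file(tensors, "model.safetensors")

# loading only parses tensor data, so it cannot execute arbitrary code
loaded = load_file("model.safetensors")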

1 Like

Hi there,

I am currently looking into convolutions using the book (chapter 13) and looking at lecture 8 of the 2022 course (part1) and I am having a difficulty in understanding the following passage of the book in relation to what I’m seeing for the VGG architecture:

Context for the passage: we are training on MNIST_SAMPLE data, so we only have 3’s and 7’s, and we create a network in which the very first layer is a conv2d layer with input channels = 1 (MNIST images are black and white), output channels = 4, and stride = 2. So in the first layer we go from

activations [bs, 1, 28, 28] to [bs, 4, 14, 14]

Then the story goes that we want to switch to the full MNIST data (all digits from 0 to 9) and therefore would like to use a bigger network, with more conv features. The idea is, instead of having 4 output channels in the first layer, to increase this to 8 output channels.

Then the next passage comes:

But there is a subtle problem with this. Consider the kernel that is being applied to each pixel. By default, we use a 3×3-pixel kernel. That means that there are a total of 3×3 = 9 pixels that the kernel is being applied to at each location. Previously, our first layer had four output filters. That meant that there were four values being computed from nine pixels at each location. Think about what happens if we double this output to eight filters. Then when we apply our kernel we will be using nine pixels to calculate eight numbers. That means it isn’t really learning much at all: the output size is almost the same as the input size. Neural networks will only create useful features if they’re forced to do so—that is, if the number of outputs from an operation is significantly smaller than the number of inputs.

And this makes sense to me: if we go from 9 activations to 8 activations, the model can basically just copy the input activations straight into the output activations, which is not “learning” but more “memorizing”. (BTW: I am not fully sure a conv layer could do this the same way a Linear layer could, due to the structure of the conv multiplication [weight tying and the zero values], but still…)

However, when we look at the VGG architecture we see the following:

[VGG architecture diagram]

So in the very first layer we go from [bs, 224, 224, 3] to [bs, 224, 224, 64]. I looked at Papers with Code to see the details of VGG, and they also use 3x3 kernels. So if I apply the logic from the previous passage, we are using 3x3 (kernel size) x 3 (channels) = 27 pixels to compute 64 activations in the feature map. So the output size is even larger than the input size!

How do these things relate to one another?

1 Like

3x3 (kernel size) x 3 (channels) = 27 pixels to compute 64 activations in the feature map

I think that could be rephrased to: 3x3 (kernel size) = 9 pixels in each of the 3 channels to compute 1 value in each of the 64 channels in the next feature map. There’s a distinction to be made because the values are separated across the dimension of the channels, not aggregated.
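You can check this directly in PyTorch; a quick sketch of a VGG-style first layer (stride 1, padding 1, as in the original VGG configuration):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 224, 224)
print(conv(x).shape)       # torch.Size([1, 64, 224, 224])

# each filter looks at 3x3x3 = 27 input values to produce one output value,
# and there are 64 filters, i.e. 64 output values per spatial location
print(conv.weight.shape)   # torch.Size([64, 3, 3, 3])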


About the intuition of “How do these things relate to one another?”: I think you’re referring to the growing and shrinking aspects?

Shrinking is found in the dot product of the kernel with the pixels of the input layer (1): e.g. a 3x3 kernel applied to 9 pixels produces one value in the output layer. The layers also shrink in their two spatial dimensions (224x224 → 112x112) whenever the resolution is halved (2) (in the book’s example via a stride-2 convolution; in VGG the halving comes from pooling).

Growing can be seen in the other dimension, the number of channels (3 → 64 → 128) which grows at each layer (3).

So, how do 1, 2, 3 relate to each other and the problem of “learning and not memorizing”?

  1. Does not relate much to learning because it is constant. As long as the kernel is not smaller than or equal to the stride, it shouldn’t make too much of a difference (but I didn’t look into it).
  2. Relates to learning because the model is forced to choose which information to keep and which to discard, as half of it (with stride 2) is gone after the layer is created (it only remains encoded via (1)).
  3. Relates to learning because it allows the model to hold more information about the input data (though too many channels can lead to overfitting).

In the end, (1) encodes the information that remains after (2), and (2) controls what information is available to (3).

Does that help? I just completed the chapter as well, so I’m still learning, if you find out more you can let me know too.

That’s right - it’s quite inefficient and isn’t a good use of resources! Generally more recent models try to avoid this kind of thing, by having a series of layers with smaller increases in # channels.

2 Likes

Great, thanks so much for your reply. Good to know that this is indeed suboptimal and no longer the way things are done nowadays.

Thanks for your reply @mw00 and for sharing your thoughts.

About the intuition of “How do these things relate to one another?”: I think you’re referring to the growing and shrinking aspects?

What I meant is: in the very first conv layer we map 27 input activations (a 3x3 grid x 3 RGB channels) → 64 activations. In the book passage, a conv layer that maps 9 input activations to 8 output activations is described as suboptimal. So the “problem” described in the book is very much happening in the VGG architecture; in fact, it’s even worse than described in the book. I was wondering (1) whether I understood the passage in the book correctly, (2) whether my math was correct, and (3) whether VGG is indeed doing things suboptimally.

Thanks for clarifying

Hi there,

I have a question about combining convolutional layers and dropout.

In fastai’s ConvLayer construct, we don’t get any Dropout layers; instead we get this:

> ConvLayer(3, 64)

ConvLayer(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
)

So: a convolution, followed by batchnorm and ReLU. But in lecture 8 Jeremy discusses dropout in the context of convolutions.

I did some searching on the web and the answer I got from ChatGPT summarized it pretty well in my opinion:

Convolutional blocks in deep learning models often consist of a sequence of convolutional layers, batch normalization, and activation functions. Dropout layers are also commonly used in deep learning models, but they are typically added after the convolutional blocks, rather than within the blocks.

The reason for this is that dropout is a regularization technique that helps to prevent overfitting by randomly dropping out some of the neurons during training. This can be particularly effective in fully connected layers, which can have a large number of parameters that may be prone to overfitting. However, convolutional layers typically have fewer parameters than fully connected layers, and the weight sharing in convolutional layers can also help to prevent overfitting.

Moreover, the use of batch normalization in convolutional blocks can also act as a regularizer by reducing internal covariate shift and improving the generalization of the network. Batch normalization reduces the effects of small changes in input distributions on the outputs of the layer, thus reducing the risk of overfitting.

Therefore, while dropout layers can be effective for regularization in deep learning models, they may not be necessary or as effective in convolutional blocks due to the properties of convolutional layers and the use of batch normalization. Nonetheless, adding a small amount of dropout after the convolutional blocks may still provide additional regularization benefits in some cases.
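For concreteness, here is a minimal sketch of what “a small amount of dropout after the convolutional blocks” could look like (the layer sizes and the dropout probability are arbitrary):

import torch.nn as nn
from fastai.layers import ConvLayer

model = nn.Sequential(
    ConvLayer(3, 64, stride=2),    # conv + batchnorm + ReLU, no dropout inside
    ConvLayer(64, 128, stride=2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(0.1),               # small amount of dropout after the conv blocks
    nn.Linear(128, 10),
)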

And another argument against dropout in conv layers: as far as I understand, ResNets also do not incorporate dropout layers.

Are there other opinions on this? Or perhaps resources that discuss the usage of dropout for conv nets?

Hey,

Today I tried to implement the CNN from the book on another dataset. Again, I get to the part where the loss function (CrossEntropyLoss) is applied, but another error occurs:

ValueError: Expected input batch_size (64) to match target batch_size (1920).

I don’t understand this error because (1) I’m using CrossEntropyLossFlat to flatten input and output, and (2) the input has shape 64x30 and the output has shape 64x2x15, both of which equal 1920 values when flattened. It only works when I pass is_2d=False as a parameter to CrossEntropyLossFlat, but then I get an increasingly negative loss when training, which can’t be correct. Also, flattening the output tensor to 64x30 doesn’t work either. I am perplexed, as both tensors have the same number of values and should be compatible in my understanding… Other loss functions work fine.

Here’s the notebook with the error: notebookba6bdc3e8e | Kaggle

@mw00 try to use Flatten(full=True) in your model. By default Flatten doesn’t flatten out the batch dimension (neither does torch.nn.Flatten()).
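For reference, a small sketch of the difference, using the shapes from your error:

import torch
from fastai.layers import Flatten

x = torch.randn(64, 2, 15)

print(Flatten()(x).shape)            # torch.Size([64, 30])  - batch dim kept
print(Flatten(full=True)(x).shape)   # torch.Size([1920])    - fully flattened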

As a tip for debugging these kinds of things, I can highly recommend getting familiar with the Python debugger. In a notebook, after you receive an error like this you can simply go to the next cell and enter %debug; then you enter the debugger and can see what’s going on, because you have access to all the variables in the context.

Some practical commands:

c: continue
n: go to the next line
u: move up the stack
d: move down the stack

This really helps me personally to see what’s going on.

3 Likes

Thank you, I will try it out soon.

1 Like