Lesson 3 official topic

Thanks!
It all falls into place now.
It’s quite strange but interesting that, despite setting everything up correctly, it all came down to the loss function, and taking the log is what did the trick.
Of course, the log makes it easier for the computer to handle computations with very large or very small numbers.
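
To make that concrete, here is a small illustration (my own, not from the lesson) of why working in log space helps with very small numbers:

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point...
probs = [1e-4] * 100
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value, 1e-400, is too small for a float

# ...but summing their logs stays comfortably in range
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -921.03
```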

There is a lot to learn in this chapter. I thought I would be done in a day or two, but it has taken more than a week, and only now am I starting to get some grasp of what might be happening.

Hi, just putting this out here.

I was trying to understand what torch.where does (sadly the examples in the PyTorch docs don’t cover the case I was interested in). Here is a quick link, if anyone is interested - Torch.where | Musings of Learning Machine Learning
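
For quick reference, the case I was after: torch.where picks elementwise between two tensors based on a condition (the names below are just illustrative):

```python
import torch

preds = torch.tensor([0.9, 0.2, 0.7, 0.4])
targets = torch.tensor([1, 0, 0, 1])

# Where the target is 1, take the prediction; where it is 0, take 1 - prediction
selected = torch.where(targets == 1, preds, 1 - preds)
print(selected)  # tensor([0.9000, 0.8000, 0.3000, 0.4000])
```

The condition broadcasts like any other elementwise operation.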


Hi all, I am training a model on a dataset with 10 classes and around 120 images per category. I kept track of the changes I was making and put them in a table.
As I changed parameters and swapped pre-trained models, it felt like I was just trying things at random.
Is there a better approach to training and measuring improvements? Any pointers would be appreciated.

| Model type | Learner | Data preprocessing | Loss | Data loading | Valid provided | Opt | Batch size | Other transformations | Item transform image size | Max accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| resnet34 | Vision learner | | default | Data block | Entire data in path, then random split | Default | | Data augmentation, im_size = 224 | 300 | 55 |
| convnext_small_in22k | Vision learner | | default | Image data loaders | Train/test separate in data loader | Default | 64 | Data aug, im_size = 224 | 300, squish | 60 |
| swin_s3_small_224 | Vision learner | | default | Image data loaders | Train/test separate in data loader | Default | 32 | | 224, squish | 58 |
| convnext_tiny_hnf | Vision learner | | default | Image data loaders | Train/test separate in data loader | Default | 32 | Data aug, im_size = 300 | 400, squish | 59 |
| convnext_base | Vision learner | | default | Image data loaders | Train/test separate in data loader | Default | 32 | Data aug, size not provided | 128 | 61 |
| convnext_small_in22k | Vision learner | | default | Image data loaders | Train/test separate in data loader | Default | 32 | Data aug, size not provided | 224 | 56 |
| convnext_small_in22k | Vision learner | | default | Image data loaders | Train/test separate in data loader | Default | 64 | Data aug, im_size = 224, ImageNet normalise | 300 | 59 |
| convnext_small_in22k | Vision learner | Preprocessed data, removed lossy class | default | Image data loaders | Train/test separate in data loader | Default | 32 | Data aug, size not provided | 128 | 70 |
| convnext_small | Vision learner | Preprocessed data, removed lossy class | Cross-entropy (flattened) | Image data loaders | Train/test separate in data loader | Default | 32 | Data aug, size not provided | 224 | 72.5 |
| convnext_tiny | Vision learner | Preprocessed data, removed lossy class | Focal loss, gamma 1.5 | Image data loaders | Train/test separate in data loader | Default | 32 | Data aug, im_size = 128 | 224 | 71.5 |

Hi, I am doing something perhaps more basic, but I wanted to get to grips with the inner workings of deep learning.

Following the Excel example that Jeremy shows, I am trying to explore the inner workings of neural networks. I added a third row of parameters, ReLU3, to the sheet, updated the loss function, and ran the solver. The loss was reduced to 0.131.

My question is: by adding yet another set of parameters and one more ReLU computation, does this correspond to having two intermediate layers before the output is produced? I got the impression that the NN internally could look like:

dot(Input, Params1) -> Layer 1 -> dot(Layer1Output, Params2) -> Layer 2 -> dot(Layer2Output, Params3) -> output

where by dot above I mean “dot product”. Is this anywhere close to how it actually works? And when we add a new set of parameters, does that count as a new “epoch”, or does the system at that point simply not know yet how to adapt the parameters to minimize the loss?

My gut tells me that we are not adapting the parameters the way we did with `abc -= abc.grad*0.01`, but I wanted to cross-check :blush:.
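
In code, the structure I have in mind would look something like this (just a sketch; the shapes are made up for illustration):

```python
import torch

torch.manual_seed(0)

x = torch.randn(5, 4)     # 5 samples, 4 input features

w1 = torch.randn(4, 8)    # Params1
w2 = torch.randn(8, 8)    # Params2
w3 = torch.randn(8, 1)    # Params3

h1 = torch.relu(x @ w1)   # dot(Input, Params1) -> ReLU -> Layer 1 output
h2 = torch.relu(h1 @ w2)  # dot(Layer1Output, Params2) -> ReLU -> Layer 2 output
out = h2 @ w3             # dot(Layer2Output, Params3) -> output
print(out.shape)          # torch.Size([5, 1])
```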

Thank you in advance!

I am facing the same issue when running predictions with the model locally:
`AttributeError: 'GELU' object has no attribute 'approximate'`
Have you found a solution for this?

You’ll need to downgrade to an older version of PyTorch.


Thank you for this! I was very confused about why the results wouldn’t converge towards the correct values even when I tried increasing the number of steps or changing the learning rate.

Here is a notebook I wrote to help me understand the gradient descent code in the first part of the lesson:

Lesson 3: Gradient Descent Function Notebook on Google Colab
Lesson 3: Gradient Descent Function Notebook on Kaggle

I have packaged the gradient descent code into a single reusable function and tested it with various target functions.

It displays the learning process in a graph, and the notebook has a lot of interactive sliders that help illustrate how the parameters work.

It might be helpful to other students. Note, though, that I am a complete Python and AI noob, so please forgive me if anything is strange or nonstandard.

Here’s what the graphs look like once you run it:


Above, [a, b, c] was [3, 2, 1], and gradient descent came up with [2.8793, 1.9189, 1.1948] after 30 steps with a learning rate of 0.05.
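
For reference, the core of the function looks roughly like this (a simplified sketch, not the exact notebook code; the quadratic and settings mirror the example above):

```python
import torch

def gradient_descent(f, params, x, y, steps=30, lr=0.05):
    """Fit `params` so that f(x, params) approximates y, via plain gradient descent."""
    params = params.clone().requires_grad_(True)
    for _ in range(steps):
        loss = ((f(x, params) - y) ** 2).mean()  # mean squared error
        loss.backward()
        with torch.no_grad():
            params -= params.grad * lr  # the abc -= abc.grad*0.01 step from the lesson
            params.grad.zero_()
    return params.detach()

def quad(x, p):
    a, b, c = p
    return a * x**2 + b * x + c

x = torch.linspace(-2, 2, 50)
y = quad(x, torch.tensor([3.0, 2.0, 1.0]))  # true [a, b, c]
fitted = gradient_descent(quad, torch.ones(3), x, y, steps=30, lr=0.05)
print(fitted)  # approaches [3, 2, 1]
```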

Same graph as above but I tweaked the parameters:

Even tried it with relu and double relu!:



Hi - I watched the video first and now I am trying to work through the notebooks Jeremy uses in the video. However, I am unable to find the exact notebook used for the Gradio pets classifier. The Hugging Face Spaces pets classifier code seems different from the one in the video, and I’ve spent hours reading through the forum and am still stuck. Could one of you please point me to the right place?

@Santhosh every lesson has a list of links from the lesson:

Thanks Jeremy. I went through the links, but I couldn’t find the code you used in the video around the 8-minute mark (the pet breed detector), so I wanted to make sure I wasn’t missing something.

IIRC it’s this one:

Binary Cross Entropy.

I have gone through the whole video course multiple times and now I am going through the whole textbook. I am on Chapter 6, and I noticed this formula for binary cross-entropy, used to handle the loss for multi-label classification:

```python
def binary_cross_entropy(inputs, targets):
    inputs = inputs.sigmoid()
    return -torch.where(targets==1, inputs, 1-inputs).log().mean()
```

My question is: how do we handle the case of log(0), which is -inf? For instance, a case where the target is 1 and the input (the prediction) is 0.

Consider that the output of the sigmoid can never be exactly 0, so log(0) can never occur.
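
A quick numerical check of that (my own illustration): even for a strongly negative activation, the sigmoid output is tiny but strictly positive, so its log stays finite:

```python
import torch

acts = torch.tensor([-50.0, -10.0, 0.0])
p = acts.sigmoid()
print(p)        # tensor([1.9287e-22, 4.5398e-05, 5.0000e-01])
print(p.log())  # tensor([-50.0000, -10.0000, -0.6931]) -- large, but finite
```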