What foundational topics would you like to hear more about?

I think it would be, wouldn’t it? Because it’s the activations that we’re making have mean 0. So using leaky relu might be a good idea…

A more in depth discussion of seq2seq learning would be great. I know it was partially covered in DL part 1, but a Jeremy in-depth discussion would be helpful.

  1. normalization makes training less sensitive to the scale of the features. For instance, regularization behaves differently under different scalings. Moreover, when using gradient descent, or a variant thereof, the speed of convergence depends on the scaling of the features, so normalization makes the problem better conditioned, improving the convergence rate of gradient descent (see the sketch at the end of this post).

  2. it is hard to give a concise answer, since the amount of training data needed depends on many different aspects of the task at hand:

  • degree of difference among the classes,
  • possible augmentation of the training data,
  • transfer learning,
  • batch normalization.

you should always do some research to see how other people approach the problem :wink:
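To make point 1 concrete, here is a minimal sketch of standardizing features to mean 0 and standard deviation 1 (the data and shapes are made up for illustration):

```python
import torch

x_train = torch.randn(100, 5) * 10 + 3           # hypothetical raw features
mean, std = x_train.mean(dim=0), x_train.std(dim=0)

x_train_norm = (x_train - mean) / std            # each feature now has mean ~0, std ~1

# Important: reuse the *training* statistics for validation/test data,
# e.g. x_valid_norm = (x_valid - mean) / std, to avoid leakage.
```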


We’ll absolutely be doing that - this topic, however, is asking what foundational stuff we used in lessons 8 & 9 that you’d like to learn more about, i.e. so people don’t get left behind in the next lessons.


Thanks for the clarification. Callbacks are what I have been struggling with, but there are a lot of resources people have put under the callback topic, so that might suffice.

Some matrix calculus fundamentals would be helpful. I’m trying to wrap my head around the derivative of a matrix-vector multiplication (i.e. y = Wx, calculate \frac{dy}{dx}). Half the sources I find say \frac{dy}{dx} = W, the other half say \frac{dy}{dx} = W^T, and I don’t know what to believe or whether the difference even matters.


I would say it doesn’t matter. It is really a matter of matching the dims in your matrix multiply.

I think these matrix derivative things are a lot easier to think about in terms of individual components (e.g. https://forums.fast.ai/t/lesson-8-2019-discussion-wiki/41323/502?u=cqfd), since components are just scalars. For example, the vector equation y = Wx corresponds to the component equation(s) y_i = \sum_j W_{ij} x_j. So regular old non-matrix calculus then says that

\frac{\partial y_i}{\partial x_j} = W_{ij}

This has two indices, i and j, so you can stick it into a matrix—you just need to decide which index you want to be the row index and which the column index. If i is the row index, you get W as your matrix, but if j is the row index, you get W^T.

That said, there’s a nice special case where you don’t really need to decide anything: when the thing you’re differentiating is the loss \mathcal{L}. The losses we differentiate are (always?) scalars, so they don’t have any indices! That means that, for example, something like \partial \mathcal{L}/\partial W_{ij} has the same number of indices as the thing you’re differentiating with respect to, W_{ij}, so you might as well use them the same way: i for the rows and j for the columns.

One example: out.g from class will always have the same shape as that layer’s output, since it’s the derivative of the loss with respect to the output.
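Both claims are easy to check with autograd; here’s a quick sketch (the shapes are made up):

```python
import torch

W = torch.randn(3, 4)
x = torch.randn(4, requires_grad=True)

# dy_i/dx_j = W_ij: with i as the row index, the Jacobian is W itself
J = torch.autograd.functional.jacobian(lambda v: W @ v, x)
print(torch.allclose(J, W))        # True

# The gradient of a scalar loss w.r.t. a layer's output (out.g in the lesson)
# has the same shape as that output, since the loss has no indices of its own
out = W @ x
loss = out.pow(2).mean()
g_out, = torch.autograd.grad(loss, out)
print(g_out.shape == out.shape)    # True
```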


It might be interesting to hear about strategies for handling huge data sets, both for image and tabular data. For instance, as in LANL Earthquake Prediction or VSB Power Line Fault Detection.

Deployment. Specifically, deploying PyTorch models built with fastai. Since we are learning to build our own layers in this part of the course, there may be an opportunity to learn about converting from PyTorch to other libraries or formats. For example, ONNX export does not support all the layers used by fastai, and even some fairly standard ones like nn.ELU tend to cause problems. Maybe writing our own converter could help, and it might make for a nice course project as well.
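For reference, a minimal export sketch (the model here is a stand-in, not a fastai one; fastai’s learn.model is a plain nn.Module, so it would be passed the same way):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2)).eval()
dummy_input = torch.randn(1, 10)   # a sample input with the shape the model expects

# This is where unsupported layers surface: export traces the model and
# raises if an op has no ONNX equivalent for the chosen opset.
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
```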


Definitely would like to learn more about callbacks; I guess I will need to go through the previous video again to get a good understanding. Details on how to use softmax appropriately would be of great help.

Difficult to choose - can we have short bits of everything?

When I learned in lesson 9 that softmax followed by negative log likelihood is the same as cross entropy loss, I realized that I haven’t been paying much attention to loss functions. If somebody were to ask me “why can’t we use accuracy as the loss function?” I can tell them “Jeremy said it’s too bumpy”, but I can’t say I actually know what that means. I would appreciate some intuition for what makes a function suitable as a loss function.
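You can verify that identity numerically; in PyTorch terms it’s log_softmax followed by nll_loss (a small sketch with random values):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 3)                   # 5 samples, 3 classes
targets = torch.tensor([0, 2, 1, 1, 0])

a = F.nll_loss(F.log_softmax(logits, dim=1), targets)  # softmax -> log -> NLL
b = F.cross_entropy(logits, targets)                   # cross entropy in one call
print(torch.allclose(a, b))                  # True
```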


Some things I struggle with:

  1. Inheritance in Python. I noticed some variables and functions inside a class (last lesson) were declared as private. Does Python support private entities like C++? And when should we use them? (See the first sketch after this list.)
  2. Debugging process: it would be great if you could guide us through an example (maybe in a future lesson, when we come across a complex network).
  3. Things to keep in mind while making a custom loss function, metric, callback, or any other ‘fastai’ class, like the numerical stability issue mentioned in the previous lecture. For example, if I want to modify the existing cross entropy loss and add extra weight to some particular classes (imbalanced dataset), what should my thought process be? Are there other ways to validate my custom loss (or a fastai/pytorch class) besides training and comparing results? (Got some clarity in the previous lesson; see the weighted-loss sketch after this list.)
  4. In the fastai library, there are a lot of new practices in the code structure compared to the previous version, like FlowField and → Optional[Figure], which make the code more readable. I want to know what exactly they are and why they were introduced. It will help me write better, more readable code.
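On question 1: Python has no true private members like C++. A single leading underscore is just a convention, and a double leading underscore only triggers name mangling; neither is enforced. A minimal sketch:

```python
class Model:
    def __init__(self):
        self.public = 1
        self._internal = 2    # convention: "treat as private", still accessible
        self.__mangled = 3    # name-mangled to _Model__mangled, not truly hidden

m = Model()
print(m.public)            # 1
print(m._internal)         # 2 -- nothing enforces privacy
print(m._Model__mangled)   # 3 -- the mangled name is still reachable
```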
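And on question 3, one common approach for imbalanced classes is to pass per-class weights to the existing loss rather than writing a new one; F.cross_entropy accepts a weight tensor (the classes and weights below are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 3)              # 8 samples, 3 classes
targets = torch.randint(0, 3, (8,))

# Hypothetical: class 2 is rare, so mistakes on it cost 5x more
class_weights = torch.tensor([1.0, 1.0, 5.0])

loss = F.cross_entropy(logits, targets, weight=class_weights)
print(loss)
```

One cheap sanity check, short of a full training run, is to feed handcrafted logits whose loss you can compute by hand and compare against the function’s output.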

I would say that you cannot use accuracy as your loss function for the same reason you cannot use other metrics like ROC: because they are not differentiable.


A loss function is used to optimize a machine learning model, whereas an accuracy metric is used to measure its performance in an interpretable way. As @axelstram already told you, an accuracy metric is easier to interpret, but it isn’t differentiable, so it can’t be used for back-propagation. We need a differentiable loss function to act as a good proxy for accuracy.


Wow, a lot of people have the same questions I have, that’s great! A couple more things:

  • By “variance” in Jeremy’s original list (2nd point), I assume it’s “why normalization?” I’d be curious if that’s the case; if not, please consider adding it.
  • How exactly do we visualize the parameter space and the loss space so that we understand exactly how things are changing?
  • I remember @jeremy linking this paper in the 2nd lesson or on Twitter, not sure where, but I would like to know how one reads and comprehends such a paper. I’m sure this is one of the strongest barriers to anyone entering ML. It would be great if you could walk through it. (tl;dr - (relu - 0.5))

Look at “Visualizing the Loss Landscape of Neural Nets”.

Thanks, @axelstram and @fabris :slight_smile:

Here is where I am so far (yes, I slept on it :laughing:):
I think a function needs just two things to be an error function:

  1. It has a gradient of zero at y = \hat y (and only there).
  2. One should be able to gradient-descend to y = \hat y, because that’s how a lot of weight updates are written.

For example, say we have a regression problem and we define our “accuracy” as \frac{\hat y}{y} (i.e. it tells you 100% if y = \hat y). Note that this is different from validation accuracy. We then take how far we are from the 100% we are shooting for by computing 1 - \frac{\hat y}{y}, because after all, it’s a loss function. Then I realized that this does not have a gradient of zero at y = \hat y. We can fix this by taking the absolute value:

\left |1 - \frac{\hat y}{y} \right |

An error function created from accuracy. I plotted it in Excel and it seems to have the same shape as RMSE. Do you think it would work?
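We can also probe it with autograd (a quick sketch with made-up numbers). Away from y = \hat y it has a well-defined, nonzero gradient; at y = \hat y itself there is a kink, like L1 loss:

```python
import torch

y = torch.tensor(3.0)                          # target
y_hat = torch.tensor(2.5, requires_grad=True)  # prediction

loss = (1 - y_hat / y).abs()                   # |1 - y_hat/y|
loss.backward()
print(loss.item(), y_hat.grad.item())          # grad is -1/3: descent pushes y_hat toward y
```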


What’s not differentiable is that for any single example, the ‘loss’ under accuracy will be either 0 or 1. In that sense, it is ‘bumpy’. Even in your calculation, if you sum up the times your model was right and divide by the number of predictions, the result will be one of a set of discrete values.

Say you have 100 examples in a batch and the model gets 89 right. The accuracy will be 89/100. There is no possible value between 89/100 and 90/100. Can you ever get 89.1/100 or 89.5/100? In that sense, one could say the metric is ‘bumpy’.

When using cross entropy, for any given example the model will produce a prediction between 0 and 1 inclusive, and all of these values are possible. Say the ground truth is 1 and the model predicts 0.874311. We can backpropagate the loss and maybe improve the weights so that on the next run the model will predict 0.89312, which is closer to the ground truth.

That is the problem: there is no way to express ‘make the output here 1 and here 0’ in a differentiable way. One can say ‘make the output here closer to 1 and here closer to 0’ (which is what MSE loss can do, for example), but that is a different way to pose the problem.

As a side note, the model will not output 0s and 1s, but values that can be thresholded to 0 and 1, by saying that everything above some threshold value should be considered 1. But there is no differentiable way to get only these two values from the model, as far as I know. Again, that would introduce a discontinuity, and if a function is not continuous it is not differentiable (not all continuous functions are differentiable, but all differentiable functions are continuous).
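A tiny demonstration of that last point (a sketch; the targets are arbitrary): thresholding cuts the tensor out of the autograd graph, while a continuous loss on the raw probabilities keeps gradients flowing:

```python
import torch

logits = torch.randn(4, requires_grad=True)
targets = torch.tensor([1., 0., 1., 1.])

probs = torch.sigmoid(logits)
hard = (probs > 0.5).float()        # thresholded 0/1 predictions
print(hard.requires_grad)           # False: the comparison severed the graph,
                                    # so 0/1 "accuracy" can't be backpropagated

soft_loss = ((probs - targets) ** 2).mean()   # a differentiable proxy (MSE)
soft_loss.backward()
print(logits.grad)                  # nonzero: gradient descent has a signal
```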
