When I learned in lesson 9 that softmax followed by negative log likelihood is the same as cross entropy loss, I realized that I haven’t been paying much attention to loss functions. If somebody were to ask me “why can’t we use accuracy as a loss function?” I could tell them “Jeremy said it’s too bumpy”, but I can’t say I actually know what that means. I would appreciate some intuition for what makes a function suitable as a loss function.

Some things I struggle with:

- Inheritance in Python. I noticed some variables and functions inside a class (last lesson) were declared as private. Does Python support private entities like C++ does? And when should we use them?
- Debugging process: it would be great if you could guide us through an example (maybe in a future lesson, when we come across a complex network).
- Things to keep in mind while making a custom loss function, metric, callback, or any other fastai class, like the numerical stability issue mentioned in the previous lecture. For example, if I want to modify the existing cross entropy loss and add extra weight to some particular classes (imbalanced dataset), what should my thought process be? Are there other possible ways to validate my custom loss (or a fastai/pytorch class) besides training and comparing results? (Got some clarity in the previous lesson.)
- In the fastai library, there are a lot of new practices in code structure compared to the previous version, like the type annotations `FlowField`, `Optional`, and `Figure`; they make the code more readable. I want to know what they are exactly and why they are needed. It will help me write better, more readable code.
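For reference, here is what I found so far about “private” members in Python (a quick experiment; the class name is made up): a single underscore is only a convention, and a double underscore just triggers name mangling rather than real C++-style privacy.

```python
class Learner:  # made-up class name, just for the experiment
    def __init__(self):
        self.lr = 0.1        # public attribute
        self._state = {}     # single underscore: "internal" by convention only
        self.__cache = []    # double underscore: mangled to _Learner__cache

learn = Learner()
learn._state             # works fine; nothing actually stops you
learn._Learner__cache    # even the "private" one is reachable via its mangled name
```

So Python leans on convention ("we are all consenting adults") rather than enforcement.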
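To make my weighted-loss question concrete, here is a rough pure-Python sketch of what I have in mind (toy logits and weights; I believe pytorch’s `nn.CrossEntropyLoss` accepts a `weight` tensor that does this for real). It also uses the log-sum-exp trick from the stability discussion:

```python
import math

def weighted_cross_entropy(logits, target, weights):
    # cross entropy for one example: -weights[target] * log_softmax(logits)[target]
    # subtracting the max logit keeps the exponentials numerically stable
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    log_prob = logits[target] - log_sum
    return -weights[target] * log_prob

base = weighted_cross_entropy([2.0, 1.0, 0.0], 0, [1.0, 1.0, 1.0])     # plain CE
boosted = weighted_cross_entropy([2.0, 1.0, 0.0], 0, [2.0, 1.0, 1.0])  # class 0 counts double
```

With all weights equal to 1 this reduces to ordinary cross entropy, and doubling a class’s weight exactly doubles its contribution to the loss.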

I would say that you cannot use accuracy as your loss function for the same reason you cannot use other metrics like ROC: they are not differentiable.

A *loss function* is used to optimize a machine learning model, whereas an *accuracy metric* is used to measure its performance in an interpretable way. As @axelstram already told you, the accuracy metric is easier to interpret, but it isn’t differentiable, so it can’t be used for back-propagation. We need a differentiable loss function to act as a good proxy for accuracy.
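A quick way to see this in code (a made-up 1-parameter “model”, checked with finite differences): nudging the weight slightly almost never flips a prediction, so the numerical gradient of accuracy is zero almost everywhere.

```python
def accuracy(w, xs, ys):
    # toy 1-parameter model: predict class 1 when w * x > 0.5
    preds = [1 if w * x > 0.5 else 0 for x in xs]
    return sum(int(p == y) for p, y in zip(preds, ys)) / len(ys)

xs = [0.2, 0.4, 0.6, 0.8]
ys = [0, 0, 1, 1]

# numerical gradient of accuracy with respect to the weight
eps = 1e-6
w = 1.0
grad = (accuracy(w + eps, xs, ys) - accuracy(w - eps, xs, ys)) / (2 * eps)
# grad is 0.0: a tiny nudge to w flips no prediction, so there is no signal
```

Zero gradient means gradient descent has nothing to follow, which is exactly why a smooth proxy is needed.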

Wow, a lot of people have the same questions I have, that’s great! A couple more things:

- By “variance” in Jeremy’s original list (2nd point), I assume it’s about why we normalize? I’d be curious to know if that’s the case; if not, please consider adding it.
- How exactly do we visualize the parameter space and the loss space so that we understand exactly how things are changing?
- I remember @jeremy linking this paper in the 2nd lesson, or on Twitter, not sure where, but I would like to know: how does one read such a paper and comprehend it? I’m sure this is one of the strongest barriers to anyone entering ML. It would be great if you could walk through it. (tl;dr: (relu - 0.5))
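My reading of that tl;dr, sketched as code (this is just my interpretation of the paper’s trick, so take it with a grain of salt):

```python
def shifted_relu(x):
    # relu zeroes out negatives, which pushes a layer's mean output upward;
    # subtracting 0.5 (the "relu - 0.5" tl;dr) roughly re-centers it
    return max(x, 0.0) - 0.5
```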

Look at “Visualizing the Loss Landscape of Neural Nets”.

Thanks, @axelstram and @fabris

Here is where I am so far (yes, I slept on it):

I think a function needs just two things to be an error function:

- It has gradient zero at $y = \hat y$ (and only there)
- One should be able to gradient-descend to $y = \hat y$, because that’s how a lot of weight updates are written.

For example, say we have a regression problem and we define our “accuracy” as $\frac{\hat y}{y}$ (i.e. it tells you 100% if $y = \hat y$). Note that this is different from validation accuracy. We then take how far we are from the 100% we are shooting for by computing $1 - \frac{\hat y}{y}$, because after all, it’s a loss function. Then I realized that this does not have a gradient of zero at $y = \hat y$. We can fix this by taking the absolute value:

$$\left| 1 - \frac{\hat y}{y} \right|$$

An error function created from accuracy. I plotted it on Excel and it seems to have the same shape as RMSE. Do you think it would work?

What makes it non-differentiable is that for any single example, the accuracy “loss” will be either 0 or 1. In that sense, it is ‘bumpy’. Even in your calculation, if you sum up the times your model was right and divide by the total number of predictions, the result will be one of a set of discrete values.

Say you have 100 examples in a batch and the model gets 89 right. The accuracy will be 89/100. There is no possible value between 88/100 and 90/100; you can never get 88.1/100 or 89.5/100. In that sense, one could say the metric is ‘bumpy’.

When using cross entropy, for any given example the model will produce a prediction between 0 and 1 inclusive, and all of these values are possible. Say the ground truth is 1 and the model predicts 0.874311. We can backpropagate the loss and maybe improve the weights so that on the next run the model predicts 0.89312, which is closer to the ground truth.

That is the problem: there is no way to express ‘make the output here 1 and here 0’ in a differentiable way. One can say ‘make the output here closer to 1 and here closer to 0’ (which is what MSE loss can do, for example), but that is a different way to pose the problem.

As a side note, the model will not output 0s and 1s, but values that can be thresholded to 0 and 1, by saying that everything above some threshold value should be considered 1. But there is no differentiable way to get only these two values from the model, as far as I know. Again, that is because it would introduce a discontinuity, and if a function is not continuous it is not differentiable (not all continuous functions are differentiable, but all differentiable functions are continuous).
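A tiny sketch of that last point (toy functions, nothing fastai-specific): the hard threshold we would like has zero gradient everywhere it is defined, while a sigmoid squashes toward 0 and 1 yet always gives a usable gradient signal.

```python
import math

def hard_threshold(x):
    # the step function we'd like the model to output: its derivative is 0
    # everywhere it exists, so backprop gets no signal from it
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    # smooth stand-in: squashes toward 0 and 1 but never jumps
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    # derivative s * (1 - s): nonzero everywhere, so there is always a signal
    s = sigmoid(x)
    return s * (1 - s)
```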

Thanks for the response!

I think I used the wrong terminology: when we say “accuracy,” we automatically think of the metric we print out. I do understand that 0/1 accuracy over a validation set will not work. But I was thinking of a situation where the output of the model is a continuous variable (say, a store’s sales this month).

If a model said $3,000 and the target is $5,000, then one could say “It is 60% accurate”. So I was thinking of something like this for a minibatch of size 1:

Well, radek explained it much better than I could, but I will try to expand it a bit if I can

First, the absolute value function is not differentiable at zero, so you have a problem there (edit: I don’t know if it’s *really* a problem; it’s basically what you have in an L1 loss, and you could use the sign in the derivative). But suppose you took the square instead of the absolute value. Then you have to see how you define $\hat y$, because your network will output real-valued numbers, and you have to decide which of those numbers you take as your actual prediction, for example by taking a max, which is another non-differentiable function. I think you can “fix” this with some hacks like splitting into cases (as is done with relu), but in a regression setting, a loss function that cannot traverse the whole (positive) real line is not a very good loss function.
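If it helps, the “use the sign in the derivative” idea is literally a one-liner (a valid subgradient of the L1 loss, choosing 0 at the kink):

```python
def l1_grad(y_hat, y):
    # subgradient of |y_hat - y| with respect to y_hat:
    # +1 above the target, -1 below it, and 0 chosen at the kink
    d = y_hat - y
    return (d > 0) - (d < 0)  # the booleans subtract to -1, 0, or 1
```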

You’re right. I have a vague recollection of this from school. Is this the reason why we square and then take the square root (even though it feels like you just come back to where you started, with the negative sign removed)?

Then how about this?

$$\sqrt{\left(1 - \frac{\hat y}{y}\right)^{2}}$$

**UPDATE**

I think I had a moment. This will become a problem when $y = 0$. And what I am trying to create is starting to look more and more like RMSE as I try to fix all those issues.
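For what it’s worth, a quick sanity check confirms both points (toy numbers): the square and the root cancel back to an absolute value, and a target of zero blows up.

```python
import math

def rel_loss(y_hat, y):
    # sqrt of the square just undoes itself: identical to abs(1 - y_hat / y)
    return math.sqrt((1 - y_hat / y) ** 2)

rel_loss(3000, 5000)   # 0.4, same as abs(1 - 3000/5000)
# rel_loss(1.0, 0.0)   # ZeroDivisionError: undefined when the target is 0
```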

Thank you for helping me work through this

Is there a chance to look into Hinton’s Capsule Networks?

Also, a deep dive into designing good reward functions for Deep Reinforcement Learning would be awesome. Specifically, RL agents learn whatever they are trained to optimize; when looking at data and AI ethics, how can one ensure not to codify unwanted higher-order effects? How can we even measure that?

After posting that I started thinking (and edited afterwards), and while it’s true that it’s not differentiable, you can use the sign to separate the two cases when taking the derivative (image attached below), so it’s doable; it’s basically an L1 loss function. But I think you still have a problem with $\hat y$, because thresholding on it constrains the values your loss function can take.

I think there are several reasons why you square and then take the root, but I haven’t thought about it deeply, so I may be wrong. (Note that when you do that, you take the square root of the *sum* of the squares; in the equation you wrote, the square and the root cancel each other, which does not happen, for example, in RMSE.) One reason I can think of is that squaring is differentiable everywhere; another is that when you square the difference between predicted and actual, you penalize errors more heavily than you do with the absolute value. But then all those squares added up can amount to a very large error, so you “bring it down” to a more reasonable value by taking the square root. It’s not clear to me when you would want to do that and when you would want just MSE. I think it’s an interesting topic to touch on in the lectures.
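A tiny numeric illustration of the “penalize big errors more” point (made-up error values): one outlier dominates MSE, and the square root just brings the result back to the original units.

```python
import math

errs = [1.0, 1.0, 10.0]                        # made-up errors with one outlier

mae = sum(abs(e) for e in errs) / len(errs)    # 4.0  : outlier counts linearly
mse = sum(e * e for e in errs) / len(errs)     # 34.0 : outlier dominates
rmse = math.sqrt(mse)                          # ~5.83: back in the original units
```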

It’s not a problem at all. It’s an infinitely small area where that happens.

Ah yes you mean l1 loss - that works just fine! Try it on a regression problem and see! (E.g. the image regression problem we did in part 1).

I have always had a problem reading probability formulas in papers; they perhaps mean something simple, but they look very intimidating. It would be great if you could give insights on how to interpret them and how not to sweat about them.

When Jeremy was showing the “why sqrt5” notebook and conducted a few experiments with numbers, he used gut feeling and common sense to decide whether something was concerning or not (i.e. his version of init vs pytorch’s).

I have often heard about the statistical significance of experiments and, as its opposite, p-hacking. What are they, and does Jeremy use p-values when doing research?

- How explicit do you need to be about what operations get executed on the GPU?
- Do you need to explicitly tell pytorch to use the GPU?
- If so, how do you decide when to do so?
- How do you know whether a given operation is happening on the GPU or not?

Not sure if it fits with what you’re getting at, but I am very hand-wavy on GPU execution generally, especially on when you have to explicitly send something to the GPU in pytorch and when you don’t. I assume that pytorch “magically” executes things on the GPU wherever possible, but I also see explicit `device='cuda'` calls, so I’m not sure when you need to make that call.

I also strongly suspect that this could be answered by a source dive plus documentation read, and might not be the kind of “fundamentals” you’re getting at.

Edit: Yep, a documentation read and a source dive cleared it up. This pytorch documentation page is very informative and makes it particularly clear that a tensor gets loaded onto a device, and by default any tensors that result from an operation on that tensor will stay on the same device. Then, fastai’s `torch_core` module tries to use the GPU by default. So most of what happens will happen on the GPU, if there is one, unless you say otherwise.

I’m sure there must be edge cases where you’d need to manually specify and I’m still curious to find out what those are, but consider this question answered
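For anyone landing here later, the behaviour described above is easy to check yourself (assumes pytorch is installed; falls back to the CPU when there is no GPU):

```python
import torch

# pytorch doesn't move things around "magically": you place a tensor on a
# device once, and results of operations on it stay on that same device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.randn(3, 3, device=device)  # explicitly placed
y = x @ x                             # no device argument needed: follows x
assert y.device == x.device
```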

We’re doing it in the lesson anyway

Use of `*args` and `**kwargs`

I have some intuition about them, but can’t fully grasp when to use which.
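Here is the minimal mental model I have been using (toy function names, not real fastai APIs): `*args` packs leftover positional arguments into a tuple, `**kwargs` packs leftover keyword arguments into a dict, and the main reason libraries use them is to forward arguments unchanged.

```python
def fit(*args, **kwargs):
    # *args collects extra positional arguments into a tuple,
    # **kwargs collects extra keyword arguments into a dict
    return args, kwargs

args, kwargs = fit(1, 2, lr=0.1)
# args == (1, 2); kwargs == {'lr': 0.1}

def fit_one_cycle(epochs, *args, **kwargs):
    # common library pattern: accept what you need, forward the rest unchanged
    return fit(*args, epochs=epochs, **kwargs)
```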