Hi all,
I am having trouble with the loss function and I’m wondering if anyone can help!
In notebook 5 there is a table that represents the probability of an image being a 3 or a 7:
|3|7|targ|idx|loss|
|---|---|---|---|---|
|0.602469|0.397531|0|0|-0.602469|
|0.502065|0.497935|1|1|-0.497935|
|0.133188|0.866811|0|2|-0.133188|
|0.99664|0.00336017|1|3|-0.00336017|
|0.595949|0.404051|1|4|-0.404051|
|0.366118|0.633882|0|5|-0.366118|
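In case it helps anyone reading, here is a minimal sketch of how I think the loss column is produced from the other columns. The values are copied from the table above; the names `sm_acts` and `targ` follow the notebook, but the plain-Python list indexing is my own stand-in for the tensor indexing, and it assumes `targ` is used as a column index (0 for the "3" column, 1 for the "7" column).

```python
# Softmax outputs copied from the table: each row is [P(3), P(7)].
sm_acts = [
    [0.602469, 0.397531],
    [0.502065, 0.497935],
    [0.133188, 0.866811],
    [0.99664, 0.00336017],
    [0.595949, 0.404051],
    [0.366118, 0.633882],
]

# Target column from the table.
targ = [0, 1, 0, 1, 1, 0]

# Assumption: targ is a column index, so for each row we pick the
# activation in column targ[i] and negate it.
loss = [-row[t] for row, t in zip(sm_acts, targ)]

for i, value in enumerate(loss):
    print(i, value)
```

Under that assumption, the computed values match the loss column of the table row by row.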
My interpretation is this: when the image is a 3, which I take to be the rows where target = 1, the loss is the negative of the probability of the image being a 7. So for the second row, the image is actually a 3 (target = 1) and the loss is -Pr(7) = -0.497935, which is the same as -(1 - Pr(3)).
We want to maximise this to get the best possible performance, so ideally it would equal zero (maximise because it’s a negative number).
Or, better put, we want to minimise the probability that the model predicts 7 in this case, so we use SGD to find the parameters that minimise it.
My question arises from the multiclass case. The notebook says:
"To see this, consider what would happen if we added an activation column for every digit (0 through 9), and then targ contained a number from 0 to 9. As long as the activation columns sum to 1 (as they will, if we use softmax), then we’ll have a loss function that shows how well we’re predicting each digit. We’re only picking the loss from the column containing the correct label. We don’t need to consider the other columns, because by the definition of softmax, they add up to 1 minus the activation corresponding to the correct label. Therefore, making the activation for the correct label as high as possible must mean we’re also decreasing the activations of the remaining columns."
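To make the quoted claim concrete for myself, I put together a small sketch in plain Python. The ten activation values and the target index are made up for illustration (they are not from the notebook), and `softmax` here is my own helper rather than the library function.

```python
import math

def softmax(acts):
    """Turn raw activations into probabilities that sum to 1."""
    m = max(acts)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in acts]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw activations for the digits 0 through 9.
acts = [0.1, -1.2, 0.4, 2.5, 0.0, -0.3, 1.1, -2.0, 0.7, 0.2]
targ = 3  # hypothetical correct label

probs = softmax(acts)

# Activation in the column of the correct label.
p_correct = probs[targ]

# Sum of the activations in all the other columns.
p_rest = sum(p for i, p in enumerate(probs) if i != targ)

# Loss picked from the correct column only, negated as in the table.
loss = -p_correct

print("P(correct):", p_correct)
print("sum of other columns:", p_rest)
print("1 - P(correct):", 1 - p_correct)
print("loss:", loss)
```

The second and third printed values agree up to floating-point rounding, which is the "they add up to 1 minus the activation corresponding to the correct label" part of the quote.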
I interpret this as follows. If we have the digits 0 - 9 and the input image is actually a 3, we use the softmax output that gives the probability of it being a 3 (softmax of course also gives the probabilities of it being each of the other digits) as the loss function. We are therefore looking to maximise this value, and hence minimise 1 - Pr(3). My question is: in the example I’ve described, is the loss function 1 - Pr(image = 3)? There’s no mention of 1 - Pr(3) in the notebook, or of the "1 -" operation in general, and that is what’s confusing me!
Thanks!