Lesson 9 Discussion & Wiki (2019)


Remember that the weights are not yet normalized before Softmax is applied. What happens when the weights of the two pixels differ by a constant offset? i.e. What happens when each pixel 2 weight is d units different from the corresponding pixel 1 weight, where d is a constant?

When you transform the weights by exponentiation, a constant offset d between two sets of weights becomes a multiplicative factor.

To see this, suppose that the weights of pixel 1 are

w_1, w_2,

and the weights of pixel 2 differ from those of pixel 1 by an offset d, so that they are

w_1+d, w_2+d.

The exponentiated weights for pixel 1 are

\exp{w_1},\exp{w_2}.

And the exponentiated weights for pixel 2 are

\exp{(w_1+d)},\exp{(w_2+d)} = k\exp{w_1},k\exp{w_2},

where k = \exp{d}.

Now for each pixel, Softmax normalizes the exponentiated weights by their sum. Since the exponentiated weights of pixel 2 are those of pixel 1 multiplied by the common factor k, that factor cancels between numerator and denominator, and the normalized exponentiated weights will be the same for the two pixels. This is exactly what happens in the example Jeremy discusses in Lesson 10.
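
For anyone who wants to check this numerically, here is a minimal sketch in plain Python. The weights and the offset d are made-up values for illustration, not the ones from the lesson spreadsheet.

```python
import math

def softmax(ws):
    # Exponentiate each weight, then normalize by the sum
    exps = [math.exp(w) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]

pixel1 = [1.0, 2.5]                # hypothetical weights w_1, w_2
d = 3.0                            # hypothetical constant offset
pixel2 = [w + d for w in pixel1]   # w_1 + d, w_2 + d

p1 = softmax(pixel1)
p2 = softmax(pixel2)
print(p1)
print(p2)
```

The two printed lists should agree up to floating point rounding, because the factor k = exp(d) appears in every numerator and in the sum.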

Hi. The problems with softmax are addressed by Jeremy in Lesson 10.

@Pomo If not too much trouble, could you please reference where in Lesson 10?

When training my model using the Runner class and trying to plot the losses I’m getting the following graph:

I know Jeremy explained at some point what’s happening when the losses start fluctuating like that and how to tackle it but I just can’t find it.

Can someone help me out? Thanks in advance.

I could not get the lesson transcript search to work. (Nothing happens.) The discussion of softmax characteristics and limitations, though, is somewhere in that Lesson.

If you go here

there are some notes on Softmax in the Lesson notes link in the right hand panel. HTH, Malcolm

Found it! @jeremy discusses Softmax starting at 44:38 in Lesson 10 video, and ending at 52:44. He’s discussing the entropy_example.xlsx spreadsheet and the section labelled Softmax in the 05a_foundations.ipynb notebook.

Two key points @jeremy makes are that Softmax operates under the assumption that each data point belongs to exactly one of the classes, and that Softmax works well when this assumption is satisfied.

However, the assumption is not satisfied for
(1) multi-class, multi-label problems where a data point can be a member of more than one class (i.e. have more than one label), or
(2) missing label problems where the identified classes do not provide a complete representation of the data, i.e. there are data points that belong to none of the classes.

So what to do about these cases?

@jeremy shows empirically that for multiclass, multilabel problems a better approach is to create a binary classifier for each of the classes.

For missing label problems, @jeremy says that some practitioners have tried
(A) adding a category for none-of-the-above, or alternately
(B) doubling the number of categories by adding categories for not(each class).

However, he says that both of these approaches are terrible, dumb and wrong, because it can be difficult to capture features that describe these ‘negative’ categories.

While I agree that the ‘negative class’ features could be hard to capture, I’m not convinced that either of the approaches (A) and (B) is wrong, since in each case, the classes satisfy the Softmax assumptions.

Case (A): I’d argue that none-of-the-aboveness can be learned with enough examples.

Case (B): if you can learn what features are present in a certain class K, you also know that when these features are absent, the data is not likely to be a member of class K. This means that learning to recognize class K is implicitly learning to recognize class not(K).

So I don’t see anything wrong with these approaches to handle the case of missing classes.

To summarize, Softmax works well when its assumptions are satisfied, and gives wrong or misleading results otherwise. An example of the former case: Softmax works well in language modeling when you are asking “what’s the next word?” An example of the latter case is when there are missing classes and you don’t account for this situation by using, say, approach A or B above; in this case the output probabilities are entirely bogus. Multiclass, multilabel problems provide another example where Softmax is the wrong approach, because the true class membership probabilities need not sum to one, while Softmax forces its outputs to do so.
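
To make the two failure cases concrete, here is a small sketch comparing Softmax with one independent sigmoid per class (the binary-classifier-per-class idea mentioned above). The activation values are invented for illustration and do not come from the lesson notebook or spreadsheet.

```python
import math

def softmax(zs):
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical final-layer activations for three classes
both_present = [4.0, 4.0, -2.0]     # strong evidence for classes 0 and 1
none_present = [-3.0, -3.5, -3.2]   # weak evidence for every class

sm_both = softmax(both_present)
sg_both = [sigmoid(z) for z in both_present]
sm_none = softmax(none_present)
sg_none = [sigmoid(z) for z in none_present]

print("both, softmax:", sm_both)   # two classes must share the probability
print("both, sigmoid:", sg_both)   # each present class can score near 1
print("none, softmax:", sm_none)   # still sums to 1, so something looks likely
print("none, sigmoid:", sg_none)   # every class can score near 0
```

With these made-up numbers, Softmax splits the probability between the two present classes in the first case and still assigns a sizeable probability to some class in the second, while the per-class sigmoids are free to be all high or all low.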

The Lesson 9 notebook 02_initializing.ipynb contains an enlightening series of experiments that empirically derive a ‘magic number’ scale factor for effective random Gaussian initialization of a linear layer followed by a ReLU.

In this annotated version of the notebook, 02b_initializing_jcat, I explore verifying the notebook’s empirical results from first principles.

The first experiment computes the first and second moments (mean, sqr) for a

Linear transformation of a scalar input, followed by a ReLU

Here is the original code cell

which calculates (mean, sqr) empirically as (0.321, 0.510)

In the section What just happened? immediately following that code cell, we compute
(mean, sqr) from first principles and show agreement with the empirical results.
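
Since the code cell itself is not reproduced in this post, here is a rough stand-alone sketch of the scalar check. It assumes the experiment draws a scalar input x and a scalar weight a independently from a standard normal and then applies a ReLU to a*x; that setup, the seed and the sample count are my assumptions, not a copy of the notebook.

```python
import math
import random

random.seed(0)    # arbitrary fixed seed so the run is repeatable
n = 100_000       # arbitrary sample count

mean = 0.0
sqr = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)   # scalar input
    a = random.gauss(0.0, 1.0)   # scalar weight
    y = max(a * x, 0.0)          # ReLU of the linear transformation
    mean += y
    sqr += y * y
mean /= n
sqr /= n

# Under the assumptions above, a*x is symmetric about zero, so the ReLU
# keeps half of E|a*x| and half of E[(a*x)^2]:
mean_exact = 1.0 / math.pi   # E|a| * E|x| / 2 = (2/pi) / 2
sqr_exact = 0.5              # E[a^2] * E[x^2] / 2

print(mean, mean_exact)
print(sqr, sqr_exact)
```

Under those assumptions the closed-form values are 1/pi (about 0.318) and 0.5, which is in line with the empirical (0.321, 0.510) quoted above.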

The original notebook then goes on to extend the scalar example to a

Linear matrix transformation of a vector input, followed by a ReLU

Here is the section that computes the empirical result, which turns out to be

(mean, sqr) = (9.035, 256.7)

When I applied the same straightforward methodology as for the scalar case, I was surprised to find that the calculation for mean did not agree with the empirical result. I’m perplexed, still trying to understand where I went wrong.

Perhaps someone can help set me straight?

They’re mathematically doable. But I think the “with enough examples” is a clue as to the issue here - in general, we want an architecture that is as easy to learn as possible. My guess is that “does not look like any of {n classes}” might be hard to learn. And doubling the number of categories at the very least doubles the # params in the last layer - but then also needs enough computation in earlier layers to drive these.

Anyhoo, this is all just intuition - so if someone wants to try some actual experiments, that would be very interesting! :slight_smile:

I don’t get how pos is being passed here. I get that 0.3 and 0.7 represent it, but how does it get to _inner?


Thanks !

0.3 and 0.7 say: use 30% of the budget to go from 0.3 to 0.6 with a cosine scheduler, and 70% of the budget to go from 0.6 to 0.2 with a cosine scheduler.
combine_scheds is a function that returns the function _inner (last line: return _inner),
so combine_scheds is returning _inner, which takes the pos argument.

Now when you say sched = combine_scheds([0.3, 0.7], [sched_cos(0.3, 0.6), sched_cos(0.6, 0.2)]),
sched is this _inner function, which takes pos.
Next you say plt.plot(a, [sched(o) for o in p]). Here sched(o) is like _inner(o), where o is the pos that is expected.
Previously p was defined as p = torch.linspace(0.01,1,100), and this is what you are passing to sched, i.e. to _inner. Hope that helps !!! :slight_smile:
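
In case it helps to see the closure in isolation, here is a stripped-down sketch in plain Python, without tensors. The names mirror the notebook, but the body is my own simplification (in particular the list-based boundaries and the clamp on idx are assumptions for this sketch, not the course code).

```python
import math

def sched_cos(start, end):
    # Returns a function of pos in [0, 1] that anneals from start to end
    def _sched(pos):
        return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2
    return _sched

def combine_scheds(pcts, scheds):
    # Cumulative boundaries: [0.3, 0.7] -> [0.0, 0.3, 1.0]
    bounds = [0.0]
    for pct in pcts:
        bounds.append(bounds[-1] + pct)

    def _inner(pos):
        # bounds and scheds are captured from the enclosing call;
        # pos only arrives later, when the returned function is called
        idx = max(i for i, b in enumerate(bounds) if pos >= b)
        idx = min(idx, len(scheds) - 1)   # guard added for this sketch
        actual_pos = (pos - bounds[idx]) / (bounds[idx + 1] - bounds[idx])
        return scheds[idx](actual_pos)

    return _inner

sched = combine_scheds([0.3, 0.7], [sched_cos(0.3, 0.6), sched_cos(0.6, 0.2)])
values = [sched(p) for p in (0.0, 0.15, 0.3, 1.0)]
print(values)
```

Nothing is passed to _inner when combine_scheds runs. The pos value is only supplied later, each time sched(...) is called.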

Helps a lot! Thanks!

Can anyone help me understand how the function finds actual_pos?

Thanks !

p = torch.linspace(0.01,1,100)
pcts=[0.3, 0.7]
pcts = tensor([0] + listify(pcts))
pcts = torch.cumsum(pcts, 0)

for i,o in enumerate(p):
    idx = (o >= pcts).nonzero().max()
    actual_pos = (o-pcts[idx]) / (pcts[idx+1]-pcts[idx])
    print(i,o, idx,actual_pos)

rewriting the same piece of code, now you can print and see the values.

scheds = [sched_cos(0.3, 0.6), sched_cos(0.6, 0.2)]

You are combining two cosine schedules smoothly, and the condition is that you want the value to start at 0.3 and go up to 0.6 (over the first 30%) and then back down to 0.2 (over the remaining 70%). These are the y values on the graph.
So you basically have those three points, but you don’t know where the corresponding x values are. That is controlled by the percentage you allocate to each of the schedulers.
pcts = [0.3,0.7] - first 30% sched_cos(0.3, 0.6) and last 70% sched_cos(0.6, 0.2)
pcts = [0.0,0.3,1.0] , because of the prepended 0 and the cumsum
starting point - sp; ending point - ep
for scheduler1: sp =0, ep =0.3
for scheduler2: sp =0.3, ep =1.0
actual_pos = (x - starting_point) / (ending_point - starting_point)
Hope that helps. :slight_smile:

i have a question on the last iteration of the loop:

for i,o in enumerate(p):
    idx = (o >= pcts).nonzero().max()
    actual_pos = (o-pcts[idx]) / (pcts[idx+1]-pcts[idx])
    print(i,o, idx,actual_pos)

It should return idx = 2, which should break the code, as pcts[3] doesn’t exist. I don’t get how it works.
I feel it is the equality comparison of floating point numbers that is causing it (but that feels odd; I think I’m wrong and missing something!).
Any help? :slight_smile:
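
Here is a small plain-Python sketch of the boundary case being asked about. It uses a hand-written list of boundaries rather than the float32 tensors from the loop above, so it only shows the indexing logic; whether the tensor version actually reaches idx = 2 depends on the exact float32 values of the last element of p and of the cumulative sum, which this sketch does not check. The clamp is one possible guard, not necessarily what the course code does.

```python
bounds = [0.0, 0.3, 1.0]   # cumulative pcts, written out by hand

def segment_index(pos, bounds):
    # Index of the last boundary that pos has reached,
    # the list analogue of (o >= pcts).nonzero().max()
    return max(i for i, b in enumerate(bounds) if pos >= b)

idx_mid = segment_index(0.5, bounds)   # inside the second segment
idx_end = segment_index(1.0, bounds)   # exactly on the final boundary

# If idx_end is the last index, bounds[idx_end + 1] would be out of range.
# Clamping to the last segment avoids that:
idx_safe = min(idx_end, len(bounds) - 2)
actual_pos = (1.0 - bounds[idx_safe]) / (bounds[idx_safe + 1] - bounds[idx_safe])

print(idx_mid, idx_end, idx_safe, actual_pos)
```

So with exact boundaries, a pos equal to the final boundary does select the last index, and the lookup of the next boundary only works if the index is clamped or the comparison happens to come out false.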

@jeremy Here is a nitpick. I noticed in the video, at 33 minutes, that the Excel version of NLL is using Log10, while PyTorch uses the natural log. I tried to replicate the functions from the spreadsheet and they didn’t match.
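
If the mismatch is only the base of the logarithm, the two losses should differ by the constant factor ln(10), about 2.303. A quick sketch with a made-up probability (not a value from the spreadsheet):

```python
import math

p = 0.25   # hypothetical predicted probability of the correct class

nll_ln = -math.log(p)         # natural log
nll_log10 = -math.log10(p)    # base-10 log
ratio = nll_ln / nll_log10    # same for every p in (0, 1)

print(nll_ln, nll_log10, ratio, math.log(10))
```

Multiplying a base-10 result by math.log(10) should therefore recover the natural-log value.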

I was going through Lesson 9 and wanted to know how the number of iterations, i.e. (n-1)//bs + 1, was derived. The expression gives the correct count in all cases, for both even and odd n, but I am trying to understand where it comes from in the first place.

>  for epoch in range(epochs):
>     for i in range((n-1)//bs + 1):
>         start_i = i*bs
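
One way to read the expression: the last valid row index is n-1, the batch containing it has index (n-1)//bs, and counting batches from zero means adding one. For n >= 1 that is the same as rounding n/bs up. A short sketch that checks this on a few arbitrary sizes:

```python
import math

def n_batches(n, bs):
    # Index of the batch holding the last row (n-1), plus one
    return (n - 1) // bs + 1

# Arbitrary (n, bs) pairs, including exact multiples and remainders
cases = [(10, 5), (11, 5), (1, 64), (64, 64), (65, 64), (50000, 64)]
counts = [n_batches(n, bs) for n, bs in cases]
ceilings = [math.ceil(n / bs) for n, bs in cases]

print(counts)
print(ceilings)
```

The two printed lists should be identical, which is why the loop also covers a final batch that is smaller than bs.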

Can someone please explain why we need super().__setattr__(k,v) in our DummyModule() class? Also, which class’s __setattr__ is it calling? Thanks guys.


I’m having trouble understanding the lines

sm_pred = log_softmax(pred)
def nll(input, target): return -input[range(target.shape[0]), target].mean()
loss = nll(sm_pred, y_train)

I’m used to thinking of “likelihood” as being the probability of the data given the parameters, yet the second line uses the true target for the calculation. Why is this? Can someone help clarify what’s going on here?

Hi @Rosst, that is a great question!

Loss functions for classification problems need the target labels and the predicted probabilities as inputs, because their computation compares the actual vs. predicted distribution of labels.

Here, the nll (negative log likelihood) loss function takes inputs sm_pred (the predicted log-probabilities for each class, i.e. the output of log_softmax) and y_train (the target labels).
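
A plain-Python sketch of what the indexing does may help. The activations and targets below are invented; the functions are list-based stand-ins for the tensor versions quoted above, not the notebook code.

```python
import math

def log_softmax(row):
    # log(exp(z_i) / sum_j exp(z_j)), computed with the max subtracted for stability
    m = max(row)
    lse = m + math.log(sum(math.exp(z - m) for z in row))
    return [z - lse for z in row]

def nll(log_probs, targets):
    # For each example, pick the log-probability the model gave to the true class
    picked = [log_probs[i][t] for i, t in enumerate(targets)]
    return -sum(picked) / len(picked)

pred = [[2.0, 0.5, -1.0],    # hypothetical activations, 2 examples x 3 classes
        [0.1, 0.2, 3.0]]
y = [0, 2]                   # hypothetical true class of each example

sm_pred = [log_softmax(row) for row in pred]
loss = nll(sm_pred, y)
print(loss)
```

Read this way, the target is not a parameter: the "data" whose likelihood is measured is the observed label, and the target index simply selects the probability that the model (with its current parameters) assigned to that observed label.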

Hi @cbenett, could you please post a snippet showing the code you are referring to? In general, super() refers to the parent class, so the code is calling the __setattr__ method of whichever class DummyModule() inherits from. But if DummyModule() doesn’t explicitly inherit from another class, I’m as confused as you, and I second your question!
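
For what it is worth, in Python 3 a class with no explicit base inherits from object, so super().__setattr__ would resolve to object.__setattr__, which is what actually stores the attribute on the instance. Below is a minimal sketch; the class body is my guess at the general shape of DummyModule, with strings standing in for real layers, and is not copied from the notebook.

```python
class DummyModule:
    def __init__(self):
        self._modules = {}       # leading underscore, so not registered below
        self.l1 = "layer one"    # stand-ins for real layers
        self.l2 = "layer two"

    def __setattr__(self, k, v):
        # Record every public attribute in the registry...
        if not k.startswith("_"):
            self._modules[k] = v
        # ...then let the default implementation store it on the instance.
        # Without this call, self.l1 and even self._modules would never be set.
        super().__setattr__(k, v)

m = DummyModule()
print(DummyModule.__mro__)   # the class itself, then object
print(m._modules)
print(m.l1)
```

Overriding __setattr__ replaces the default behaviour entirely, so the override has to hand the assignment back to the parent implementation if the attribute should still exist afterwards.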