Found it! @jeremy discusses Softmax starting at 44:38 in the Lesson 10 video, ending at 52:44. He's discussing the entropy_example.xlsx spreadsheet and the section labelled Softmax in the 05a_foundations.ipynb notebook.
Two key points @jeremy makes are that Softmax assumes each data point belongs to exactly one of the classes, and that Softmax works well when this assumption is satisfied.
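To make the "exactly one class" assumption concrete, here's a minimal sketch of the Softmax computation in plain PyTorch (my own toy logits, not values from the notebook): exponentiating and then normalizing forces the outputs to sum to one, so the formula itself assumes every data point belongs to one of the classes.

```python
import torch

def softmax(x):
    # exponentiate, then normalize so the outputs sum to 1 along the class dimension
    e = x.exp()
    return e / e.sum(dim=-1, keepdim=True)

logits = torch.tensor([2.0, 1.0, 0.1])   # toy scores for three classes
probs = softmax(logits)
print(probs)        # roughly tensor([0.66, 0.24, 0.10])
print(probs.sum())  # 1.0 -- the "exactly one class" assumption, baked into the formula
```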
However, this assumption is not satisfied for
(1) multi-class, multi-label problems, where a data point can be a member of more than one class (i.e. have more than one label), or
(2) missing-label problems, where the identified classes do not completely cover the data, i.e. there are data points that belong to none of the classes.
So what to do about these cases?
@jeremy shows empirically that for multi-class, multi-label problems a better approach is to train a separate binary classifier for each class.
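Here's a minimal sketch of what "a binary classifier for each class" can look like in PyTorch (the logits and targets below are made up for illustration): each class gets its own sigmoid output and its own binary cross-entropy term, so the per-class probabilities are independent and are not forced to sum to one.

```python
import torch
import torch.nn as nn

# Made-up raw model outputs and multi-label targets for a single sample;
# this sample belongs to classes 0 and 2 simultaneously.
logits  = torch.tensor([[ 2.0, -1.0,  0.5]])
targets = torch.tensor([[ 1.0,  0.0,  1.0]])

# One independent binary classifier per class: sigmoid + binary cross-entropy,
# instead of softmax + cross-entropy over mutually exclusive classes.
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)

per_class_probs = torch.sigmoid(logits)   # each probability is in [0, 1]
print(per_class_probs)                    # the three values need NOT sum to 1
print(loss.item())
```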
For missing-label problems, @jeremy says that some practitioners have tried
(A) adding a category for none-of-the-above, or alternatively
(B) doubling the number of categories by adding a not(class) category for each class.
However, he says that both of these approaches are terrible, dumb and wrong, because it can be difficult to capture features that describe these 'negative' categories.
While I agree that the 'negative class' features could be hard to capture, I'm not convinced that either approach (A) or (B) is wrong, since in each case the classes satisfy the Softmax assumption.
Case (A): I'd argue that none-of-the-aboveness can be learned with enough examples.
Case (B): if you can learn what features are present in a certain class K, you also know that when those features are absent the data point is unlikely to be a member of class K. This means that learning to recognize class K is implicitly learning to recognize class not(K).
So I don’t see anything wrong with these approaches to handle the case of missing classes.
To summarize, Softmax works well when its assumptions are satisfied, and gives wrong or misleading results otherwise. An example of the former: Softmax works well in language modeling when you are asking "what's the next word?", since exactly one next word is expected. An example of the latter: when there are missing classes and you don't account for this by using, say, approach (A) or (B) above, the output probabilities are entirely bogus. Multi-class, multi-label problems are another example where Softmax is the wrong approach, because it forces the outputs to sum to one even though the true per-class membership probabilities need not.
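To illustrate the "bogus probabilities" point for the missing-class case, here's a small sketch (hypothetical logits, not from the lesson): for a data point that belongs to none of the classes, all the raw scores can be low, yet Softmax still returns a plausible-looking distribution that sums to one.

```python
import torch

# Hypothetical logits for a data point that belongs to none of the three classes:
# all the scores are low (the model is "unenthusiastic" about every class).
logits = torch.tensor([-4.0, -4.2, -3.9])

probs = torch.softmax(logits, dim=-1)
print(probs)        # roughly tensor([0.34, 0.28, 0.38]) -- looks like a real distribution
print(probs.sum())  # 1.0, even though "none of the above" is the right answer
```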