Terminology of label encoding and MultiCategoryList

StatisticDean · March 6, 2019, 1:52pm

I have a question about some machine learning terminology (which affects how to interpret the fastai source code).
When you deal with multiclass classification problems, you often one-hot encode your classes. That means you build an array with one row per entry and one column per class, and put a 1 in a cell if the corresponding row belongs to the corresponding class and 0 everywhere else.

In the multi-label classification case, one entry can have multiple labels (or no label). You can adapt the encoding described above: you still get an array containing only 0s and 1s, where each entry has a 1 for each label associated with it and 0 everywhere else.
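To make the two encodings concrete, here is a minimal pure-Python sketch. The helper names `one_hot` and `multi_hot` are made up for illustration and are not fastai functions; classes are assumed to be numbered from 0.

```python
def one_hot(class_index, n_classes):
    """Single-label case: exactly one 1 per row."""
    row = [0] * n_classes
    row[class_index] = 1
    return row


def multi_hot(label_indices, n_classes):
    """Multi-label case: one 1 per label, so a row may hold zero, one, or several 1s."""
    row = [0] * n_classes
    for i in label_indices:
        row[i] = 1
    return row


# Three entries, three classes.
single = [one_hot(c, 3) for c in [2, 0, 1]]
multi = [multi_hot(labels, 3) for labels in [[0, 2], [], [1]]]

print(single)
print(multi)
```

The only structural difference is the row sum: it is always 1 in the single-label array, and can be anything from 0 to the number of classes in the multi-label one.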

I’ve seen people call that OneHotEncoding, MultiHotEncoding, or more original names. What is the standard terminology in the ML field (or at least the one used in fastai)?

Specifically, MultiCategoryList has a one_hot argument that can be passed to indicate that the items have already been encoded. Does that include the multi-label case?

sgugger · March 6, 2019, 2:15pm

Note that MultiCategoryList is for a multi-label problem, and CategoryList is for a single-label one (both handle 2 or more classes). By default MultiCategoryList expects targets in the form of lists of tags and will perform the one-hot encoding for you (like in the planet dataset), but if your labels are already one-hot encoded, you can pass one_hot=True to tell the library it doesn’t need to do the one-hot encoding.
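As a rough illustration of what "lists of tags" versus "already encoded" means, the sketch below turns tag lists into 0/1 rows by hand. It is not fastai's implementation: `encode_tags` is a hypothetical helper, the tags are made-up examples in the spirit of the planet dataset, and ordering the classes alphabetically is an assumption.

```python
def encode_tags(tag_lists):
    """Build a class vocabulary from lists of tags and encode each list as a 0/1 row."""
    classes = sorted({tag for tags in tag_lists for tag in tags})
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(classes)
        for tag in tags:
            row[index[tag]] = 1
        rows.append(row)
    return classes, rows


# Targets as lists of tags (the default form described above).
labels = [["clear", "primary"], ["haze", "primary", "water"], []]

classes, encoded = encode_tags(labels)
print(classes)
print(encoded)
```

If your targets already look like `encoded`, that is the situation where you would pass one_hot=True instead of supplying tag lists.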

StatisticDean · March 6, 2019, 2:23pm

Thanks for your answer. Two more questions:

CategoryList doesn’t have a one_hot argument. Is that because fastai looks at the data itself to determine whether or not it is already encoded?
Regarding the terminology again, Wikipedia’s definition of one-hot is "one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)". By that definition, the encoding we perform in the multi-label case doesn’t give us a one-hot array. Is the choice of one_hot as the argument name in the multi-label case a matter of convenience (most people are familiar with one-hot encoding from the multiclass case), or is it the official name for that encoding in the fastai community?

sgugger · March 6, 2019, 2:29pm

CategoryList won’t work with one-hot encoded targets, as the loss function in PyTorch (nn.CrossEntropyLoss() or F.cross_entropy) doesn’t expect one-hot encoded targets: it expects class indices (and one-hot encoding is a waste of memory if you have a lot of categories).

Conversely, you need one-hot encoded labels for a multi-label problem because the loss function requires you to provide them that way. It seems stupid to limit the name one-hot encoding to single-label problems as it’s really doing the same thing. I can’t speak for the community as a whole, but when Jeremy or I say one-hot encoding, we mean putting 0s and 1s, with 1s to indicate the target(s), be it a single-label or a multi-label problem.
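The difference in target format can be sketched directly in PyTorch. The multi-label loss is not named in this thread; the example assumes a binary cross-entropy on logits (F.binary_cross_entropy_with_logits), and the logits and targets are made-up values.

```python
import torch
import torch.nn.functional as F

# 3 samples, 4 classes (arbitrary example values).
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0],
                       [0.0, 1.5, 0.3, 0.2],
                       [0.1, 0.2, 3.0, 0.0]])

# Single-label: targets are class indices of shape (3,), not one-hot rows.
class_idx = torch.tensor([0, 1, 2])
single_label_loss = F.cross_entropy(logits, class_idx)

# The same value computed by hand from one-hot rows: the indices carry
# the same information in a fraction of the memory.
one_hot_rows = F.one_hot(class_idx, num_classes=4).float()
manual_loss = -(one_hot_rows * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Multi-label: targets are float 0/1 rows with the same shape as the logits.
# A row may contain several 1s, or none at all.
encoded_targets = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                                [0.0, 0.0, 0.0, 0.0],
                                [0.0, 1.0, 1.0, 1.0]])
multi_label_loss = F.binary_cross_entropy_with_logits(logits, encoded_targets)

print(single_label_loss.item(), manual_loss.item(), multi_label_loss.item())
```

The first two printed values match, which is why nothing is lost by passing indices in the single-label case, while the multi-label loss has no index form and needs the full 0/1 rows.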

StatisticDean · March 6, 2019, 2:35pm

Ok, this makes sense. I don’t mind the terminology as long as it is clear (it was not clear to me before I made this post, but now it is).