Understanding Softmax/Probabilities Output on a multi-class classification problem

alessa · November 22, 2017, 2:01pm

In the dogs vs. cats example the output is easy to read: if it’s closer to 0 the model predicts a cat, and if it’s closer to 1, a dog.
If it’s closer to 0.5 it means that the model is confused and has no idea what to predict.

I am trying to apply the same steps to the fisheries competition (the-nature-conservancy-fisheries-monitoring), where there are 8 classes.

So what I have so far is the following: a log_preds variable (shape: (482, 8)), where each row corresponds to a test image and each column corresponds to the score for a certain class.

So for example, line 3 (image 3) has 8 scores - where the maximum value is on column 3 which corresponds to class 2 (because classes are from 0 to 7).

log_preds[:3]
array([[-1.23495, -1.46053, -4.05357, -4.2766 , -2.55734, -1.64061, -2.68659, -2.24147],
       [-1.88049, -1.64775, -2.18399, -2.4335 , -2.20422, -2.20543, -4.69442, -1.49167],
       [-1.74321, -2.53353, -1.28885, -3.31856, -4.4335 , -1.7536 , -4.47653, -1.4377 ]], dtype=float32)

Everything is clear until now.

How to interpret the following?

probs = np.exp(log_preds[:,1])
probs[:3]
array([ 0.23211,  0.19248,  0.07938], dtype=float32)

Where probs is a variable which has only one value for each image (instead of 8). The value is between 0 and 1, because it represents a probability.

How do we interpret probs[j] = 0.07 as a probability that image j belongs to some class?

radek · November 22, 2017, 2:26pm

The reason we only have 3 numbers there is that - if I am reading the code right - we only asked for the maximum probabilities from each row. Out of 8 numbers for each row, the one that corresponds to highest probability will be our predicted class.

Softmax is just a generalization of what we had for two classes. For two classes, we had 2 numbers per row, and if we changed them into probabilities they added up to 1. Same here - there will be 8 numbers for each row and they should add up to 1. I think if you were to do np.exp(log_preds[:3]) maybe that would be helpful to see what is going on there. Off the top of my head, I think you should then also be able to do np.sum(<what you got above>, axis=1) and this should sum all the values and (hopefully) they will all sum up to one across each row!
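
A minimal sketch of that check, using the three rows of log_preds posted above (the values are copied from the first post; everything else is plain NumPy):

```python
import numpy as np

# First three rows of log_preds, copied from the output earlier in the thread.
log_preds = np.array([
    [-1.23495, -1.46053, -4.05357, -4.2766, -2.55734, -1.64061, -2.68659, -2.24147],
    [-1.88049, -1.64775, -2.18399, -2.4335, -2.20422, -2.20543, -4.69442, -1.49167],
    [-1.74321, -2.53353, -1.28885, -3.31856, -4.4335, -1.7536, -4.47653, -1.4377],
], dtype=np.float32)

probs = np.exp(log_preds)          # log-probabilities -> probabilities, shape (3, 8)
row_sums = probs.sum(axis=1)       # each row should sum to (approximately) 1
pred_class = probs.argmax(axis=1)  # predicted class = column with the highest probability
pred_prob = probs.max(axis=1)      # probability assigned to that class

print(row_sums)
print(pred_class)
```

Each row keeps all 8 probabilities, so the predicted class and its probability can both be read from the same array.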

alessa · November 22, 2017, 3:04pm

Thanks radek for your reply

I figured out what is wrong with that code.

I tried to analyse the results/look at the pictures as in lesson1
in order to display: correct/incorrect labels, most correct/incorrect labels and most uncertain labels.

Although log_preds.shape was (482,8), probs.shape was (482,), because probs = np.exp(log_preds[:,1]) takes only one column from log_preds. That code was written for a 2-class problem.
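
A quick way to see the difference between the two slices, sketched with a placeholder array that only shares the shape of log_preds from this thread (the zeros are not real predictions):

```python
import numpy as np

# Placeholder with the same shape as log_preds in the thread.
log_preds = np.zeros((482, 8), dtype=np.float32)

first_rows = log_preds[:3]    # first three images, all 8 classes
one_column = log_preds[:, 1]  # column 1 only: one value per image

print(first_rows.shape)   # (3, 8)
print(one_column.shape)   # (482,)
```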

Then this variable probs is used to compute the most uncertain and most incorrect/correct cats/dogs:

most_uncertain = np.argsort(np.abs(probs -0.5))[:4] # cell 19

idxs[np.argsort(mult * probs[idxs])[:4]] # cell 24

So the conclusion is that these functions need to be updated in order to run on a multi-class classification problem, where the number of classes is greater than 2.

jeremy · November 22, 2017, 6:11pm

Exactly right! If you do update those functions, it would be great if you could submit a pull request, or post your code here, since I think others would find that helpful too.

alessa · November 23, 2017, 11:16am

I’m working on it, @jeremy
I see a difference between the outputs of log_preds = learn.predict() and log_preds,y = learn.TTA(): the probabilities computed from learn.predict sum to 1, while the probabilities computed from learn.TTA sum to less than 1.

log_preds,y = learn.TTA()
probs = np.exp(log_preds)
probs[:3,:]
array([[ 0.24539,  0.09634,  0.00952,  0.07535,  0.08283,  0.11779,  0.01073,  0.01777],
       [ 0.39541,  0.03649,  0.02922,  0.21219,  0.01865,  0.11463,  0.02149,  0.07635],
       [ 0.49963,  0.01744,  0.14847,  0.00922,  0.10761,  0.03257,  0.03157,  0.04786]], dtype=float32)

where the sum of probs on each line is less than 1

and

log_preds = learn.predict()
probs = np.exp(log_preds)
probs[:3,:]
array([[ 0.19239,  0.10078,  0.01741,  0.02733,  0.34674,  0.26757,  0.01891,  0.02886],
       [ 0.31218,  0.04001,  0.03897,  0.32251,  0.02798,  0.0951 ,  0.04427,  0.11897],
       [ 0.61758,  0.01188,  0.18024,  0.00552,  0.15407,  0.00619,  0.01183,  0.01269]], dtype=float32)

where the sum of probs of each line is 1

In any case to display the most correct/incorrect images - we will use the learn.predict() function.
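
One possible explanation for the difference, assuming learn.TTA() averages the log-probabilities of the original and augmented images (this is an assumption about the library's internals, not something confirmed in this thread): exponentiating a mean of logs gives a geometric mean, and geometric means of several probability rows sum to at most 1. The numbers below are made up for illustration.

```python
import numpy as np

# Made-up probabilities for ONE image under 3 augmentations (3 classes each).
aug_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
])
aug_log_preds = np.log(aug_probs)

# Assumed TTA behaviour: average in log space, then exponentiate.
geo = np.exp(aug_log_preds.mean(axis=0))    # geometric mean per class

# Alternative: exponentiate first, then average.
arith = np.exp(aug_log_preds).mean(axis=0)  # arithmetic mean per class

print(geo.sum())    # below 1 unless all augmentations agree exactly
print(arith.sum())  # 1 up to rounding
```

If that is what happens, the argmax can still be used for the predicted class, and dividing each row by its sum would give probabilities that add up to 1 again.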

alessa · November 23, 2017, 2:02pm

So here it is

log_preds = learn.predict()
y = data.val_y

or

log_preds, y = learn.TTA()

and then

num_classes = len(data.classes)

preds = np.argmax(log_preds, axis=1)
probs = np.exp(log_preds)

# the following functions have an extra parameter - y which is the selected_class between (0,num_classes-1)
# y is a number in the case of displaying the most correct/incorrect classes 
# y is a vector in the case of displaying the most uncertain classes

def plot_val_with_title(idxs, title, y):
    imgs = np.stack([data.val_ds[x][0] for x in idxs])    
    if type(y) == int: title_probs = [probs[x,y] for x in idxs]
    else:    
        key = 0;
        for x in idxs:
            title_probs = [probs[x,y[key]] for x in idxs]
            key += 1
    
    print(title)
    return plots(data.val_ds.denorm(imgs), rows=1, titles=title_probs)

def plots(ims, figsize=(12,6), rows=1, titles=None):
    f = plt.figure(figsize=figsize)
    for i in range(len(ims)):
        sp = f.add_subplot(rows, len(ims)//rows, i+1)
        sp.axis('Off')
        if titles is not None: sp.set_title(titles[i], fontsize=16)
        plt.imshow(ims[i])

def load_img_id(ds, idx): return np.array(PIL.Image.open(PATH+ds.fnames[idx]))

def most_by_mask(mask, y, mult):
    idxs = np.where(mask)[0]
    return idxs[np.argsort(mult * probs[idxs,y])[:4]]

# mult = -1 when is_correct is True: to display the most correct classes we need a descending sort, and negating the probabilities makes argsort return the biggest ones first.
# When is_correct is False we want the most incorrect classes, so an ascending sort is used, since we are interested in the smallest probabilities.

def most_by_correct(y, is_correct): 
    mult = -1 if is_correct==True else 1
    return most_by_mask((preds == data.val_y)==is_correct & (data.val_y == y), y, mult)
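
One thing worth checking in the mask expression above: in Python, & binds more tightly than ==, so without extra parentheses the expression is grouped as (preds == val_y) == (is_correct & (val_y == y)). A small sketch with made-up arrays (preds, val_y and y here are toy stand-ins, not the competition data) shows how the two groupings differ:

```python
import numpy as np

preds = np.array([0, 1, 2, 2, 0])   # toy predicted classes
val_y = np.array([0, 2, 2, 1, 1])   # toy true classes
y, is_correct = 2, True

# As written in the post: parsed as (preds == val_y) == (is_correct & (val_y == y))
unparenthesized = (preds == val_y) == is_correct & (val_y == y)

# Probably intended: images of class y whose prediction is correct
intended = ((preds == val_y) == is_correct) & (val_y == y)

print(np.where(unparenthesized)[0])
print(np.where(intended)[0])
```

With these toy values the unparenthesized mask also selects images that do not belong to class y at all, while the parenthesized one keeps only the correctly predicted image of class 2.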

In order to call these functions

most_uncertain = np.argsort(np.average(np.abs(probs-(1/num_classes)), axis = 1))[:4]
idxs_col = np.argsort(np.abs(probs[most_uncertain,:]-(1/num_classes)))[:4,-1]
plot_val_with_title(most_uncertain, "Most uncertain predictions", idxs_col)

# for most correct classes with label 0
label = 0
plot_val_with_title(most_by_correct(label, True), "Most correct class 0", label) 

# for most incorrect classes with label 2
label = 2
plot_val_with_title(most_by_correct(label, False), "Most incorrect class 2", label)
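
To see what the most-uncertain ranking above does, here is a small sketch on a made-up probability matrix (4 images, 4 classes). The rows whose probabilities are closest to the uniform value 1/num_classes are ranked first:

```python
import numpy as np

# Made-up probabilities: one row per image, one column per class.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],   # very confident
    [0.23, 0.28, 0.25, 0.24],   # almost completely undecided
    [0.45, 0.30, 0.15, 0.10],   # fairly undecided
    [0.70, 0.10, 0.10, 0.10],   # fairly confident
])
num_classes = probs.shape[1]

# Mean absolute distance from the uniform probability; small = uncertain.
distance = np.average(np.abs(probs - 1 / num_classes), axis=1)
most_uncertain = np.argsort(distance)[:2]

# For each selected image, the class whose probability is furthest from uniform.
idxs_col = np.argsort(np.abs(probs[most_uncertain, :] - 1 / num_classes))[:, -1]

print(most_uncertain)
print(idxs_col)
```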

jeremy · November 24, 2017, 1:59am

Congrats! I’m looking forward to checking it out

sabzo · November 24, 2017, 3:12am

Can you explain why we call learn.predict() in the first place? What does it serve to do? Does this take images from the validation set and come up with the “prediction”?

alessa · November 24, 2017, 3:59pm

Both learn.predict() and learn.TTA() use the validation set, because for the validation set we have the ground truth (the true labels), so we can compute how accurate the model is. (For the test set in Kaggle competitions we have no labels.)

The prediction is done by looking at the sample image and returning x scores (where x = number of classes). These numbers are log-probabilities, so they lie in (-oo, 0] and are hard to read directly; this is why we turn them into probabilities with np.exp(log_preds).

learn.TTA() does prediction on the validation set + modified versions of the sample images from the validation set.

There is something else we can do with data augmentation: use it at inference time (also known as test time). Not surprisingly, this is known as test time augmentation, or just TTA.

TTA simply makes predictions not just on the images in your validation set, but also makes predictions on a number of randomly augmented versions of them too (by default, it uses the original image along with 4 randomly augmented versions). It then takes the average prediction from these images, and uses that. To use TTA on the validation set, we can use the learner’s TTA() method.

We need the output of these functions (these predictions) in order to check how good our model is. For example, to compute the accuracy:

log_preds,y = learn.TTA()
accuracy(log_preds,y)
0.99650000000000005
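
As a rough illustration of what such an accuracy figure means (a plain NumPy sketch with made-up numbers, not fastai's accuracy implementation): take the class with the highest score in each row and compare it with the true label.

```python
import numpy as np

# Made-up log-probabilities for 4 images and 3 classes.
log_preds = np.log(np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]))
y = np.array([0, 1, 2, 1])            # made-up true labels

preds = np.argmax(log_preds, axis=1)  # argmax is the same before or after np.exp
acc = (preds == y).mean()             # fraction of correct predictions
print(acc)                            # 0.75
```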

Another example is that we want to display the most incorrect/uncertain classifications - in order to have an intuition why our model doesn’t do what we expect it to do.

jeremy · November 24, 2017, 6:14pm

@alessa I packaged up your changes and also refactored it a bit into a class. I also found a bug with missing parentheses (which I suspect came from my original code - sorry!) which I fixed. It’s now in fastai, and here’s an example of it being used with the new kaggle seedlings competition:

alessa · November 25, 2017, 6:37pm

Thank you Jeremy! Next time I will try to provide directly the class.
I spent lots of time on this line title_probs = [self.probs[x,y[i]] for i,x in idxs] which was giving me errors because it didn’t like the type of the idxs vector; that is where the extra, unneeded for loop came from.
Only now, thanks to your code, do I see the way to do it: title_probs = [self.probs[x,y[i]] for i,x in enumerate(idxs)].
Thanks!
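
For anyone hitting the same error, a small sketch with made-up data: iterating over idxs yields single indices, so unpacking each one into i, x fails, while enumerate(idxs) yields (position, index) pairs.

```python
import numpy as np

probs = np.array([[0.1, 0.9], [0.6, 0.4], [0.3, 0.7]])  # made-up probabilities
idxs = np.array([2, 0])   # rows (images) to display
y = np.array([1, 0])      # class to report for each of those rows

# `for i, x in idxs` tries to unpack each scalar index into two names and fails.
try:
    [probs[x, y[i]] for i, x in idxs]
except TypeError as err:
    print("unpacking failed:", err)

# enumerate pairs each index x with its position i, so y[i] lines up with x.
title_probs = [probs[x, y[i]] for i, x in enumerate(idxs)]
print(title_probs)
```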

alessa · November 26, 2017, 11:04am

Actually I have noticed that plot_most_uncertain is missing. I will take it as an assignment: I will implement it directly in the class that you have created and push it to the git repo.

alessa · November 26, 2017, 12:13pm

@Jeremy, I am not allowed to push on the git, but here are the modifications

I added the number of classes to the init (I chose this way in order to keep the class call as simple as you proposed: ImageModelResults(data.val_ds, log_preds))

def __init__(self, ds, log_preds):
    self.ds = ds
    self.preds = np.argmax(log_preds, axis=1)
    self.probs = np.exp(log_preds)
    self.num_classes = log_preds.shape[1]

I added the following methods

def most_uncertain(self):
    return np.argsort(np.average(np.abs(self.probs-(1/self.num_classes)), axis = 1))[:4]

def most_uncertain_class(self, most_uncertain_idx):
    return np.argsort(np.abs(self.probs[most_uncertain_idx,:]-(1/self.num_classes)))[:4,-1]

def plot_by_uncertain(self):
    """
    most_uncertain() - returns the most uncertain indexes, which can belong to different classes
    most_uncertain_class() - returns the specific classes of these uncertain indexes
    we need to know the classes in order to display them on the plot along with the probability values
    """
    most_uncertain_idxs = self.most_uncertain()
    return self.plot_val_with_title(most_uncertain_idxs, self.most_uncertain_class(most_uncertain_idxs))

def plot_most_uncertain(self): return self.plot_by_uncertain()

The way to call the function

imr = ImageModelResults(data.val_ds, log_preds)
imr.plot_most_uncertain()

ecdrid · November 26, 2017, 12:29pm

We can fork and create a PR…

jeremy · November 26, 2017, 10:19pm

Thanks @alessa . I’m happy to make those changes directly, but you might enjoy learning about how to send a Pull Request with the changes yourself - it’s a great skill to have in your toolbox! If you’d like to give it a go, install this and follow the relevant steps in the readme: https://github.com/github/hub

If you’d like to learn more, have a look at https://www.atlassian.com/git/tutorials/making-a-pull-request . Before you send your pull request (PR), ensure it only has the specific changes you want to make (e.g. don’t include updated notebooks, temp files, etc).

If you’d rather not, no problem - I’ll make the changes directly.

alessa · November 27, 2017, 10:43am

Thanks Jeremy for the links, I am happy to learn to do that!

alessa · November 28, 2017, 3:51pm

Pull request done

ramesh · November 28, 2017, 8:24pm

You might want to add a screenshot from your Dog Breed or some other dataset to show the output of this change. This might help us understand how to use the API. I was expecting to pass in the class whose most uncertain examples I am interested in seeing, but it doesn’t take any parameters. You can see an example of adding screenshots to Pull requests here - https://github.com/fastai/fastai/pull/43. You can drag the screenshot into the github comments section and it will insert it there, very similar to how you insert in the forums here.

alessa · November 29, 2017, 11:13am

Thanks Ramesh for your reply. It was very useful to see how I should do a proper pull request.

The most uncertain examples are found by following the initial method, which did not consider each class separately but all classes together. So it was looking at all the probabilities which were close to 0.5 (since it was a 2-class problem).

most_uncertain = np.argsort(np.abs(probs -0.5))[:4]
plot_val_with_title(most_uncertain, "Most uncertain predictions")

You are right, it is more useful to have a specific method which will plot the most uncertain examples by class - I will make the updates for it.
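
A rough idea of how a per-class version could look, as a sketch only: most_uncertain_by_class is a hypothetical helper name (it is not part of fastai or of the pull request), and "uncertain for a class" is assumed here to mean images predicted as that class whose probabilities are closest to uniform. The data is made up.

```python
import numpy as np

def most_uncertain_by_class(probs, preds, cls, n=4):
    """Hypothetical helper: indices of the n images predicted as `cls`
    whose probabilities are closest to the uniform value 1/num_classes."""
    num_classes = probs.shape[1]
    idxs = np.where(preds == cls)[0]
    distance = np.average(np.abs(probs[idxs] - 1 / num_classes), axis=1)
    return idxs[np.argsort(distance)[:n]]

# Made-up probabilities: 4 images, 3 classes.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.20, 0.70, 0.10],
    [0.55, 0.25, 0.20],
])
preds = probs.argmax(axis=1)

result = most_uncertain_by_class(probs, preds, cls=0, n=2)
print(result)
```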

jeremy · November 29, 2017, 6:02pm

OK so would you prefer I wait for those updates before I merge your PR?