I just moved up 100 positions on the leaderboard (NCFM on Kaggle) by applying the do_clip function put forward at the end of lesson 7.

Now I am struggling to understand exactly why. I take it that clipping cuts off extreme values. So, intuitively, I’d say it behaves like a kind of regularization or at least it undoes some of the overfitting present in the prediction.

Why is it better, however, to do it by clipping instead of, say, applying regularization?

Also, I’d like to understand better how you chose the exact threshold of 0.82 (i.e. a_max parameter in np.clip). Could someone (@jeremy ?) comment on how to choose the best threshold?

I haven’t reached Lesson 7 yet, so I’m not sure this is related, but the dogs_cats_redux notebook says:

Log Loss doesn’t support probability values of 0 or 1–they are undefined (and we have many). Fortunately, Kaggle helps us by offsetting our 0s and 1s by a very small value. So if we upload our submission now we will have lots of .99999999 and .000000001 values. This seems good, right?

Not so. There is an additional twist due to how log loss is calculated–log loss rewards predictions that are confident and correct (p=.9999,label=1), but it punishes predictions that are confident and wrong far more (p=.0001,label=1). See visualization below.

[…]

So to play it safe, we use a sneaky trick to round down our edge predictions
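To put rough numbers on the asymmetry described in that quote, here is a minimal sketch using only the standard library. The helper names are made up for illustration, and the lower bound of 0.05 is an assumption (simply 1 - 0.95 for the two-class case), not something taken from the notebook text above.

```python
import math

def log_loss_single(p_true):
    """Log loss contribution of one sample, given the probability
    the model assigned to that sample's true label."""
    return -math.log(p_true)

def clip(p, lo, hi):
    """Scalar stand-in for np.clip: force p into the range [lo, hi]."""
    return max(lo, min(hi, p))

# Unclipped: being confident and right gains almost nothing,
# being confident and wrong costs a lot.
confident_right = log_loss_single(0.9999)  # roughly 0.0001
confident_wrong = log_loss_single(0.0001)  # roughly 9.21

# Clipped to [0.05, 0.95] (assumed bounds for the two-class case).
clipped_right = log_loss_single(clip(0.9999, 0.05, 0.95))  # roughly 0.051
clipped_wrong = log_loss_single(clip(0.0001, 0.05, 0.95))  # roughly 3.00

print(confident_right, confident_wrong)
print(clipped_right, clipped_wrong)
```

So clipping gives up about 0.05 on every confident correct prediction, but caps the cost of each confident mistake at about 3 instead of 9 or more.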

Thanks, @Matthew, this was helpful. Yes, this explains why clipping helps. However, I am still unsure how to choose the right threshold for a_max in np.clip(). In the cats and dogs example it was 0.95, but in the NCFM example in lesson 7 it was 0.82. Is this purely empirical or is there a rule, e.g. should the threshold be smaller when there are more classes?

My understanding is that you should use your validation accuracy as the clip threshold (rounded down, i.e. away from 1 and 0).

I have followed this on Dogs vs. Cats and have always gotten a better test score this way. (Top ~8%)
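One way to see why validation accuracy is a reasonable starting point: below is a rough sketch under a strong simplifying assumption, namely a two-class problem where the model is so overconfident that essentially every prediction ends up clipped. The function names and the accuracy of 0.95 are hypothetical, chosen only for illustration.

```python
import math

def expected_loss(clip_at, acc):
    """Average log loss when every prediction is clipped to `clip_at`
    and a fraction `acc` of the predictions is correct (two classes)."""
    return -(acc * math.log(clip_at) + (1 - acc) * math.log(1 - clip_at))

def best_clip(acc, candidates):
    """Return the candidate threshold with the lowest expected log loss."""
    return min(candidates, key=lambda c: expected_loss(c, acc))

# Candidate thresholds 0.50, 0.51, ..., 0.99
candidates = [i / 100 for i in range(50, 100)]

acc = 0.95  # hypothetical validation accuracy
print(best_clip(acc, candidates))
```

Under that assumption the best threshold on the grid coincides with the accuracy itself. Real predictions are not all saturated, so in practice it is probably still worth checking a few nearby thresholds against your validation log loss.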

I don’t quite get how the min in the clip is calculated:

def do_clip(arr, mx): return np.clip(arr, (1-mx)/7, mx)

Why /7?

It’s (number of classes - 1). In the Kaggle fish competition there are 8 classes.

Say there were 6 classes and the model gave a 100% prediction for one class, but the max you allowed (mx) was 90%. The remaining 10% then has to be spread evenly over the other 5 classes (number of classes - 1), so the minimum prediction for each of them would be 2%.
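The 6-class example above can be checked with a small sketch. Here do_clip_row and its n_classes parameter are hypothetical; it is a plain-Python generalisation of do_clip (which hard-codes 7) that mimics what np.clip does element-wise on a single row.

```python
def do_clip_row(row, mx, n_classes):
    """Clip one row of class probabilities to [(1 - mx) / (n_classes - 1), mx]."""
    lo = (1 - mx) / (n_classes - 1)  # leftover probability shared by the other classes
    return [min(max(p, lo), mx) for p in row]

# A fully confident prediction over 6 classes, with the max capped at 0.9
row = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
clipped = do_clip_row(row, mx=0.9, n_classes=6)
print(clipped)       # first entry capped at 0.9, the rest raised to about 0.02
print(sum(clipped))  # very close to 1

# The fish competition setting: 8 classes, mx = 0.82
fish_lo = (1 - 0.82) / 7  # about 0.0257
print(fish_lo)
```

With this choice of minimum, a one-hot prediction still sums to 1 after clipping, which is the point of dividing by (number of classes - 1).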