ClassificationInterpretation confusion_matrix sample count mismatch

Update: Never mind. It seems like I was using from_folder wrong: given my folder structure, I need to use from_name_re instead. I fixed that and it works as expected.

I have a single image classification dataset with 714 samples across 11 classes.

data = ImageDataBunch.from_folder(
    './data', train='train', valid_pct=0.2,
    ds_tfms=get_transforms(), size=size, bs=bs,
).normalize(imagenet_stats)

I fit the learner, get the interpretation, and plot the confusion matrix:

learner = cnn_learner(data, resnet34, metrics=error_rate)
learner.fit_one_cycle(4)
interp = ClassificationInterpretation.from_learner(learner)
interp.plot_confusion_matrix(figsize=(10, 10), dpi=60)

I notice that the confusion matrix shows far more samples than my dataset contains. For example, my first class has 11 images, but the confusion matrix shows 55 on the main diagonal for that class.

If I call get_preds and look at the shape of the predictions, I get (1496, 11):

preds, _, _ = interp.get_preds(with_loss=True)
preds.shape

Where does that 1496 number come from?

What samples are considered to plot the confusion matrix? Is there some resampling happening? Are the results being accumulated across cycles/epochs?

Thanks

It comes from your validation set, not the training set :slight_smile:
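To expand on that: ClassificationInterpretation builds the confusion matrix from the validation-set predictions only, in a single pass; nothing is resampled or accumulated across epochs. A minimal NumPy sketch of that tally, using hypothetical toy labels rather than the actual data:

```python
import numpy as np

# Hypothetical toy labels: 3 classes, 6 validation samples.
actual    = np.array([0, 0, 1, 1, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, 0])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
# Each validation sample contributes exactly one count:
# row = actual class, column = predicted class.
for a, p in zip(actual, predicted):
    cm[a, p] += 1

print(cm.sum())  # total count equals the validation-set size, here 6
```

So the total count in the matrix always equals the number of validation samples, which is why the matrix totals track get_preds.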

Thanks. I am using valid_pct=0.2, so I was expecting the number to be much smaller.
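A quick sanity check of the expected split sizes in plain Python makes the mismatch obvious:

```python
# Expected split sizes for 714 images with valid_pct=0.2.
total = 714
valid_pct = 0.2

expected_valid = round(total * valid_pct)   # about 143 held out
expected_train = total - expected_valid     # about 571 for training
print(expected_valid, expected_train)       # 143 571

# Working backwards from the 1496 validation predictions observed,
# from_folder must have been picking up far more items than 714:
implied_total = 1496 / valid_pct
print(implied_total)                        # 7480.0
```

1496 is nowhere near 143, so the DataBunch was loading many more files than the 714 intended, which points at the data-loading call rather than the interpretation step.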

Never mind: it seems like I was using from_folder wrong; I need from_name_re given my folder structure. I fixed that and it works correctly.
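For anyone hitting the same issue: from_name_re labels each image by applying a regex to its filename, rather than by its parent folder as from_folder does. A standalone sketch of that labeling logic with the `re` module (the pattern and filenames here are hypothetical, not my actual data):

```python
import re

# Hypothetical filenames of the form "<label>_<index>.jpg".
fnames = ['cat_001.jpg', 'dog_042.jpg', 'cat_007.jpg']
pat = r'^(.+)_\d+\.jpg$'  # group(1) captures the class label

labels = [re.match(pat, f).group(1) for f in fnames]
print(labels)  # ['cat', 'dog', 'cat']
```

In fastai v1 a pattern like this is what gets passed to ImageDataBunch.from_name_re(path, fnames, pat, ...), which then applies it to every filename to derive the labels.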