Calculating the Accuracy for test set

Nohaaw · February 27, 2019, 10:39am

Hello all,

I have a test set with 10 subfolders where each subfolder name = label.
My question is: how can I calculate the accuracy after the prediction, that is, how can I compare the predicted labels with the actual labels?

I couldn’t find an answer in the forum, and I have been struggling with this for weeks.
Please help me if you can.

Thanks.

marcmuc · February 27, 2019, 10:59am

In fastai the test set is expected to be unlabeled data, so you cannot calculate the accuracy on that if it is specified as “test”. All functionality in fastai is set up to use the val set for accuracy, confusion matrix etc.

So if you have a labeled test set, you could first train your model using your real train/val sets and save it. Then create a new databunch in which you define your labeled test set as your “fastai val set”, load your trained model, and do prediction, confusion matrix, accuracy etc. on that test set (now defined as the val set for fastai purposes).

https://docs.fast.ai/data_block.html#LabelLists.add_test_folder
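
Whatever tool produces the predictions, the comparison itself is just an element-wise match between predicted and actual labels. A minimal, library-free sketch follows; the function name accuracy_from_labels and the sample labels are illustrative and not part of fastai:

```python
def accuracy_from_labels(predicted, actual):
    """Return the fraction of positions where predicted == actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels for four test images
predicted = ["cat", "dog", "dog", "bird"]
actual = ["cat", "dog", "cat", "bird"]
print(accuracy_from_labels(predicted, actual))  # 3 of 4 match -> 0.75
```

In fastai the same number comes from validating against a databunch whose validation set is the labeled test set, as described above.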

Nohaaw · February 28, 2019, 1:29pm

Thanks @marcmuc for the quick response.
I followed the method you mentioned and got the accuracy for the test set, but when I tried to show the top losses or draw the confusion matrix, it showed the real validation set data, not the test set!

I don’t know why.

Here is my code:

1. main data set to train the model (train, valid):

bs = 64
np.random.seed(42)
path = Path('train').resolve()
data = (ImageItemList.from_folder(path)
.random_split_by_pct(valid_pct=0.2)
.label_from_folder()
.transform(size=224)
.databunch(bs=bs)).normalize(imagenet_stats)

and when I finished training the model I used:
2. test set

path = Path().resolve()
data_test = (ImageItemList.from_folder(path)
.split_by_folder(train='train', valid='test')
.label_from_folder()
.transform(size=224)
.databunch(bs=bs)).normalize(imagenet_stats)

step to validate the test set:
learn.validate(data_test.valid_dl)
step to show the top losses and confusion matrix (the problem):
interp = ClassificationInterpretation.from_learner(learn)
losses,idxs = interp.top_losses()
len(data_test.valid_ds)==len(losses)==len(idxs)
interp.plot_top_losses(9, figsize=(15,11), heatmap=False)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

It gives the confusion matrix of the real validation data even though I wrote ‘data_test.valid_ds’, not ‘data.valid_ds’.

I hope you will get my point and can help me fix it.

marcmuc · February 28, 2019, 3:40pm

It’s probably because you use ClassificationInterpretation.from_learner() and pass it the learner you used for training, which contains the original val set. Try saving the model, creating a new learner with your second databunch, and reloading the weights; then do the interpretation.
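
The reasoning can be illustrated with a toy model of the situation. This is not fastai code: ToyLearner and its methods are hypothetical stand-ins that only mimic the idea that an explicitly passed dataset affects a single call, while interpretation reads whatever data the learner itself holds.

```python
class ToyLearner:
    """Hypothetical stand-in for a learner that keeps a reference to its data."""

    def __init__(self, valid_labels):
        self.valid_labels = valid_labels  # fixed when the learner is created

    def validate(self, labels=None):
        # An explicitly passed dataset is used for this call only.
        used = self.valid_labels if labels is None else labels
        return len(used)

    def interpret(self):
        # Interpretation always reads the data stored on the learner.
        return len(self.valid_labels)


learn = ToyLearner(valid_labels=["a"] * 200)  # original validation set
test_labels = ["a"] * 50                      # labeled test set

print(learn.validate(test_labels))  # 50: the test set was passed in
print(learn.interpret())            # 200: still the original validation set

learn_test = ToyLearner(valid_labels=test_labels)  # new learner on the test data
print(learn_test.interpret())       # 50
```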

Nohaaw · February 28, 2019, 10:05pm

It works! Thank you so much for your help!

jordan.howell2 · June 16, 2019, 12:53am

Hello @marcmuc. I started using fast.ai a few days ago and went through lesson 1. Do I understand your comment and the docs to mean a “test” set denoted in imagedatabunch is the same as the predict function in Keras? If I save the model and want to deploy it, do I have to pass new data into the test parameter of the imagedatabunch?

rquintino · August 20, 2019, 11:42pm

Hi @marcmuc, I’m trying to do the eval (conf matrix, top losses, etc.) based on an exported model, using just load_learner. Assume we don’t know anything else about the exported model, i.e. no transforms, normalization, etc. I just want to use the same pipeline as the exported one to avoid any mismatch, since this may happen later in time, in another kernel, etc., but still be able to use all the eval functions available.
is this possible?

huge thanks!

ste · August 21, 2019, 10:37am

Late answer, but instead of:

try to use:

interp = ClassificationInterpretation.from_learner(learn, ds_type=DatasetType.Train)

This forces `interp` to work with the test set.

rquintino · August 21, 2019, 10:42am

I use it like this and it works, but I’m not really comfortable with it: I don’t know if this is the proper way of applying all the train transforms, and it also assumes that I know which normalization I need (if I drop normalization, the results become invalid):

(…)
learn=load_learner(model_path.parent,file=model_path.name)

train_val_src=ImageList.from_df(df=df,path=data_folder,cols='path').split_from_df().label_from_df(cols='label')

data_num_workers=8
data = (train_val_src.transform(learn.data.dl_tfms,size=size)
.databunch(bs=batch_size,num_workers=data_num_workers)
.normalize(imagenet_stats)
)

learn.data=data
learn.validate(data.valid_dl)

interp = ClassificationInterpretation.from_learner(learn)
(…)

thanks!

ste · August 21, 2019, 10:55am

I did a similar thing in a slightly different way to verify my results: create a new databunch with only train images and prevent it from shuffling and clipping (dropping the last incomplete batch):

il = ImageList.from_folder(path=path)
ils = il.split_none()
ll = ils.label_from_folder()
ll.transform(tfms=tfms,size=size)
new_data = ll.databunch(bs=bs);
new_data.train_dl = new_data.train_dl.new(shuffle=False, drop_last=False) # Important: prevent shuffle!
new_data.normalize(stats)

Then you substitute the “data” on your learner and compute interpretation on Train:

learn.data = new_data
interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Train)

NOTE: Sugger suggested how to prevent your Train set from shuffling here:

rquintino · August 21, 2019, 1:01pm

Thanks @ste! My main concern with this is that it needs some existing knowledge of the pipeline used (transforms/normalization). I’m trying to find a way of doing black-box inferencing (using load_learner) while still being able to use all the fastai validation functions. Thoughts?
(Test-mode inferencing on load_learner does not support labels, from what I’ve read.)

Side note: I have some doubts about naming the param “test” on load_learner if its use is supposed to be batch inferencing (e.g. production).

ste · August 21, 2019, 1:24pm

Usually you use the test set to make “batched inference” (ie: a LOT of data like kaggle competition).
If you want to make inference on single “item” at time use learn.predict.

See: great @anurag starter code for render/starlette:
prediction = learn.predict(img)[0]

The naming convention fits the use case of a Kaggle competition very well.
…Maybe the whole fastai library was initially built on top of Jeremy’s efforts in this direction

muellerzr · August 21, 2019, 1:31pm

@ste I agree with all of the above, but an easier way to do it is like so, if you don’t want to deal with a lot of that train_dl.new stuff (easier in my head)

After ll.transform:

ll.valid = ll.train
db = ll.databunch()

learn.data.valid_dl = db.valid_dl

# Create Databunch
il = ImageList.from_folder(path=data_path)
ils = il.split_none() #All data on Train Set
ll = ils.label_from_folder()
ll.valid = ll.train # @muellerzr Trick!
ll.transform(tfms=None,size=256) # Optional Transforms
data = ll.databunch(bs=32);
data.normalize(stats)

learn.data.valid_dl = data.valid_dl

# Interpret
interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Valid)
interp.plot_confusion_matrix()

rquintino · August 21, 2019, 3:09pm

What I am looking to have is something like this:
-model eval pipeline for exported models, not for production/api inference
-load model in a black box way (as we would for inference), using load_learner
-no knowledge of transforms, normalize,whatsoever like in load_learner/inference
-load a validation set (multiple images, with labels) so batch mode is preferred
-but using for evaluating a validation dataset and running all interpret functions

makes sense?
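
One library-agnostic way to frame such an evaluation is to treat the loaded model as an opaque predict callable and feed it a folder whose subfolder names are the labels. The sketch below is an assumption about how this could be wired up, not a fastai feature: labeled_files and evaluate_black_box are hypothetical helper names, it runs one item at a time rather than in batch mode, and it only yields accuracy, not the full set of interpretation plots.

```python
from pathlib import Path


def labeled_files(root):
    """Yield (file_path, label) pairs, taking each subfolder name as the label."""
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            for file_path in sorted(class_dir.iterdir()):
                if file_path.is_file():
                    yield file_path, class_dir.name


def evaluate_black_box(predict, root):
    """Run predict(path) -> label on every labeled file and return the accuracy."""
    pairs = list(labeled_files(root))
    if not pairs:
        raise ValueError(f"no labeled files found under {root}")
    correct = sum(predict(path) == label for path, label in pairs)
    return correct / len(pairs)
```

With a learner obtained from load_learner, predict could presumably wrap learn.predict and return the predicted class name as a string, so that the transforms and normalization stored in the export are reused; that wiring is untested here.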

ste · August 22, 2019, 8:43am

If this is the case, @muellerzr’s solution seems to be your best option:

This snippet should work in your case:

# Create Databunch
il = ImageList.from_folder(path=data_path)
ils = il.split_none() #All data on Train Set
ll = ils.label_from_folder()
ll.valid = ll.train # @muellerzr Trick!
ll.transform(tfms=None,size=256) # Optional Transforms
data = ll.databunch(bs=32);
data.valid_dl = data.valid_dl.new(shuffle=False, drop_last=False) # Add this to prevent drop and shuffle
data.normalize(stats);

# Interpret
interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Valid)
interp.plot_confusion_matrix()

muellerzr · August 22, 2019, 12:05pm

@ste, look at my updated post. You don’t need to do the .new(shuffle) etc for this anymore, and I also show how to override the valid_dl on learner

(I also stole that nice example code shamelessly)

ste · August 22, 2019, 12:55pm

Hi,
I agree that shuffle is optional, but drop_last is not (skip it only if you don’t mind losing at most bs-1 samples).

NOTE: The Train DataLoader always has shuffle=True and drop_last=True, as you can see in the create function.

muellerzr · August 22, 2019, 12:56pm

Drop last does not happen on the validation set, which is why we set valid to the train. This is done before databunch, so before anything is lost to shuffling and drops. You can see this if you do a show_batch on both the train and valid sets after databunching.

Minimal example, look at the Tabular problem (ADULTS).

data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_none()
                           .label_from_df(cols=dep_var))
data.valid = data.train
data = data.databunch()

If you do a show_batch() and compare the train and the validation, you will see that the train was shuffled and dropped while the validation was not.

Also, comparing len(data.train_dl) and len(data.valid_dl), you will see that there is one more batch in the validation (the last batch that was not dropped).
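
The batch-count difference follows from simple arithmetic. A small sketch, independent of fastai and PyTorch (batch_sizes is an illustrative helper, and the 100 samples with a batch size of 32 are made-up numbers):

```python
def batch_sizes(n_items, bs, drop_last):
    """Return the size of each batch a loader would yield for n_items samples."""
    full, remainder = divmod(n_items, bs)
    sizes = [bs] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # keep the final, smaller batch
    return sizes


train_like = batch_sizes(100, 32, drop_last=True)
valid_like = batch_sizes(100, 32, drop_last=False)
print(len(train_like), sum(train_like))  # 3 batches, 96 samples (4 dropped)
print(len(valid_like), sum(valid_like))  # 4 batches, 100 samples
```

When the dataset size is an exact multiple of the batch size, both settings give the same number of batches and nothing is dropped.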

aviopene · September 11, 2019, 3:08pm

Today I was trying to validate a loaded model against a labeled test set and I didn’t find an easy way (e.g. a Fast.ai builtin function) to do it, so I wrote a small function to do that. I’m posting it here because I stumbled upon this thread many times in a few hours of search.

The data passed to the function can be the original DataBunch or a newly created DataBunch with the valid set replaced by the labeled test set. Like this for instance:

tfms = get_transforms()
data_test = ImageDataBunch.from_folder(path, train='train', valid='test', bs=bs, ds_tfms=tfms, size=img_size).normalize(imagenet_stats)

This is the function definition:

def evaluate_model_from_interp(interp, data):
    # perform a "manual" evaluation of the model to take a look at predictions vs. labels and to
    # re-compute accuracy from scratch (to double check and also because I didn't find a quick way
    # to extract accuracy inside the guts of Fast.ai after a call to validate() on the test set...)
    print(f'Interp has {len(interp.y_true)} ground truth labels: {interp.y_true}')
    print(f'Interp yielded {len(interp.preds)} raw predictions. First two raw predictions are: {interp.preds[:2]}')
    print(f'The problem had {len(data.classes)} classes: {data.classes}') # data.c is just len(data.classes)
    
    print('')
    print(f'Pred -> GroundTruth = PredLabel -> GroundTruthLabel')
    
    ok_pred = 0
    
    for idx, raw_p in enumerate(interp.preds):
        pred = np.argmax(raw_p)
        if idx < 10:
            print(f'{pred} -> {interp.y_true[idx]} = {data.classes[pred]} -> {data.valid_ds.y[idx]}')
        if pred == interp.y_true[idx]:
            ok_pred += 1
    
    acc = ok_pred / len(interp.y_true)
    print(f'Overall accuracy of the model: {acc:0.5f}')

And then it can be called simply with:

evaluate_model_from_interp(interp, data_test)
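
The same predictions and labels also give per-class counts. As a rough, library-free sketch of what a confusion matrix tabulates (confusion_counts, the class list and the labels are all illustrative, not taken from fastai):

```python
from collections import Counter


def confusion_counts(predicted, actual, classes):
    """Return rows of counts: row = actual class, column = predicted class."""
    pair_counts = Counter(zip(actual, predicted))
    return [[pair_counts[(a, p)] for p in classes] for a in classes]


classes = ["cat", "dog"]  # hypothetical class list
actual = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]

matrix = confusion_counts(predicted, actual, classes)
for name, row in zip(classes, matrix):
    print(name, row)
# cat [1, 1]
# dog [1, 2]

# Accuracy is the diagonal (correct predictions) over the total.
correct = sum(matrix[i][i] for i in range(len(classes)))
print(correct / len(actual))  # 3 of 5 on the diagonal -> 0.6
```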

saivicky2015 · March 1, 2020, 7:07pm

Hi, but is interp the built in function or is it different here?