Calculating the Accuracy for test set

Hello all,

I have a test set with 10 subfolders, where each subfolder name is the label.
My question is: how can I calculate the accuracy after prediction? (That is, how can I compare the predicted labels with the actual labels?)

I couldn’t find an answer in the forum and have been struggling with this for weeks.
Please help me if you can.



In fastai the test set is expected to be unlabeled data, so you cannot calculate the accuracy on that if it is specified as “test”. All functionality in fastai is set up to use the val set for accuracy, confusion matrix etc.

So if you have a labeled test set, you could first train your model using your real train/val sets and save it. Then create a new databunch in which you define your labeled test set as your “fastai val set”, load your trained model, and do prediction, confusion matrix, accuracy etc. on that test set (now, for fastai purposes, defined as the val set).
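
Independently of fastai, the comparison asked about comes down to reading each image's true label from its parent folder name and counting how often the prediction matches. A minimal sketch, assuming a hypothetical `predict_fn` that maps a file path to a class name (it stands in for whatever your trained model does and is not a fastai function):

```python
from pathlib import Path
from typing import Callable, List, Tuple


def labeled_items(test_dir: Path) -> List[Tuple[Path, str]]:
    """Return (file, label) pairs, taking the label from the parent folder name."""
    items = []
    for class_dir in sorted(p for p in test_dir.iterdir() if p.is_dir()):
        for f in sorted(class_dir.iterdir()):
            if f.is_file():
                items.append((f, class_dir.name))
    return items


def folder_accuracy(test_dir: Path, predict_fn: Callable[[Path], str]) -> float:
    """Fraction of files whose predicted label equals the folder label."""
    items = labeled_items(test_dir)
    if not items:
        raise ValueError(f"no labeled files found under {test_dir}")
    correct = sum(1 for f, label in items if predict_fn(f) == label)
    return correct / len(items)
```

With fastai itself, the val-set approach described above does the same bookkeeping for you; the sketch only shows what "accuracy on a labeled test set" means.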


Thanks @marcmuc for the quick response.
I followed the method you mentioned and got the accuracy for the test set, but when I tried to show the top losses or draw the confusion matrix, it showed the real validation set data, not the test set.

I don’t know why.

Here is my code:

  1. main data set to train the model (train, valid ):

bs = 64
path = Path(‘train’).resolve()
data = (ImageItemList.from_folder(path)

and when I finished training the model I used:
2. test set

path = Path().resolve()
data_test = (ImageItemList.from_folder(path)
.split_by_folder(train=‘train’, valid=‘test’)
.transform( size=224)

  3. step to validate the test set:

  4. step to show the top losses and confusion matrix (the problem):
    interp = ClassificationInterpretation.from_learner(learn)
    losses,idxs = interp.top_losses()
    interp.plot_top_losses(9, figsize=(15,11), heatmap=False)
    interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

It gives the confusion matrix of the real validation data even though I wrote ‘data_test.valid_ds’, not ‘data.valid_ds’.

I hope you will get my point and can help me fix it.

It’s probably because you use ClassificationInterpretation.from_learner() and pass it the learner you used for training, which still contains the original val set. Try saving the model, creating a new learner with your second databunch, and reloading the weights; then do the interpretation.
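
The reason this helps is that the interpretation takes its data from the learner, not from any other databunch you happen to have created. A toy sketch of that relationship, using made-up `ToyLearner` and `toy_interpret` stand-ins rather than real fastai classes:

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, int]  # (feature, true label)


class ToyLearner:
    """Stand-in for a learner: it owns both the weights and the data."""

    def __init__(self, data: Dict[str, List[Sample]], threshold: float = 0.0):
        self.data = data            # {"train": [...], "valid": [...]}
        self.threshold = threshold  # the "weights" of this toy model

    def predict(self, x: float) -> int:
        return int(x > self.threshold)


def toy_interpret(learn: ToyLearner) -> float:
    """Accuracy on whatever validation set the learner holds."""
    valid = learn.data["valid"]
    return sum(learn.predict(x) == y for x, y in valid) / len(valid)


data = {"train": [(0.2, 0), (0.9, 1)], "valid": [(0.1, 0), (0.8, 1)]}
# Labeled test set placed in the "valid" slot of a second data object.
data_test = {"train": data["train"], "valid": [(0.4, 1), (0.7, 1)]}

learn = ToyLearner(data, threshold=0.5)
# Creating data_test changes nothing here: the old learner still sees
# the original validation set.
acc_valid = toy_interpret(learn)

# New learner built around data_test, reusing the trained "weights".
learn_test = ToyLearner(data_test, threshold=learn.threshold)
acc_test = toy_interpret(learn_test)
```

Only the second call evaluates the test samples, which mirrors why a new learner (or a learner whose data has been swapped) is needed before interpreting.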


It works! Thank you so much for your help! :heart_eyes:

Hello @marcmuc. I started using fastai a few days ago and went through lesson 1. Do I understand your comment and the docs correctly: is a “test” set in ImageDataBunch the equivalent of the predict function in Keras? If I save the model and want to deploy it, do I have to pass new data into the test parameter of the ImageDataBunch?

Hi @marcmuc, I’m trying to do the evaluation (confusion matrix, top losses, etc.) on an exported model using only load_learner. Assume we don’t know anything else about the exported model (no transforms, normalization, etc.); I just want to use the same pipeline that was exported, to avoid any mismatch, since this may happen later in time, in another kernel, etc. But I still want to be able to use all the available eval functions.
Is this possible?

huge thanks!

Late answer, but instead of:

try to use:

interp = ClassificationInterpretation.from_learner(learn, ds_type=DatasetType.Train)

This forces `interp` to work with the test set.

I use it like this and it works, but it is really not comfortable: I don’t know if this is the proper way of applying all the train transforms, and it also assumes that I know which normalization I need (if I drop normalization, the results become invalid):



data = (train_val_src.transform(,size=size)

interp = ClassificationInterpretation.from_learner(learn)


I did a similar thing in a slightly different way to verify my results: create a new databunch with only train images and prevent it from shuffling and from dropping the last batch:

il = ImageList.from_folder(path=path)
ils = il.split_none()
ll = ils.label_from_folder()
new_data = ll.databunch(bs=bs);
new_data.train_dl = new_data.train_dl.new(shuffle=False, drop_last=False) # Important: prevent shuffle!

Then you substitute the “data” on your learner and compute the interpretation on Train:

learn.data = new_data
interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Train)

NOTE: Sugger suggested how to prevent your Train set from shuffling here:


Thanks @ste! My main concern with this is that it needs some existing knowledge of the pipeline used (transforms/normalization). I’m trying to find a way of doing black-box inference (using load_learner) while still being able to use all the fastai validation functions. Thoughts?
(Test-mode inference with load_learner does not support labels, from what I’ve read.)

Side note: I have some doubts about naming the param “test” in load_learner if its intended use is batch inference (e.g. production).

Usually you use the test set for “batched inference” (i.e. a LOT of data, as in a Kaggle competition).
If you want to run inference on a single “item” at a time, use learn.predict.

See @anurag’s great starter code for Render/Starlette:
prediction = learn.predict(img)[0]

The naming convention fits the use case of a Kaggle competition very well.
…Maybe the whole fastai library was initially built on top of Jeremy’s efforts in this direction :wink:

@ste I agree with all of the above, but an easier way to do it is like so, if you don’t want to deal with a lot of that stuff (easier in my head)

After ll.transform:

ll.valid = ll.train
db = ll.databunch()
learn.data.valid_dl = db.valid_dl

# Create Databunch
il = ImageList.from_folder(path=data_path)
ils = il.split_none() #All data on Train Set
ll = ils.label_from_folder()
ll.valid = ll.train # @muellerzr Trick!
ll.transform(tfms=None,size=256) # Optional Transforms
data = ll.databunch(bs=32);
data.normalize(stats)
learn.data.valid_dl = data.valid_dl

# Interpret
interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Valid)

What I am looking for is this:
- a model eval pipeline for exported models, not for production/API inference
- load the model in a black-box way (as we would for inference), using load_learner
- no knowledge of transforms, normalization, or anything else, as in load_learner/inference
- load a validation set (multiple images, with labels), so batch mode is preferred
- but use it to evaluate a validation dataset and run all the interpret functions

Does that make sense? :slight_smile:

If this is the case, @muellerzr’s solution seems to be your best option:

This snippet should work in your case:

# Create Databunch
il = ImageList.from_folder(path=data_path)
ils = il.split_none() #All data on Train Set
ll = ils.label_from_folder()
ll.valid = ll.train # @muellerzr Trick!
ll.transform(tfms=None,size=256) # Optional Transforms
data = ll.databunch(bs=32);
data.valid_dl = data.valid_dl.new(shuffle=False, drop_last=False) # Add this to prevent drop and shuffle

# Interpret
interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Valid)

@ste, look at my updated post. You don’t need to do the .new(shuffle) etc for this anymore, and I also show how to override the valid_dl on learner

(I also stole that nice example code shamelessly :wink: )

I agree that shuffle is optional, but drop_last is not (skip it only if you don’t mind losing at most bs-1 samples).

NOTE: The Train DataLoader always has shuffle=True and drop_last=True, as you can see in the create function.
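
To make the cost of drop_last concrete: with `n` samples and batch size `bs`, dropping the last incomplete batch leaves `n // bs` batches and discards `n % bs` samples, which is at most `bs - 1`. A small self-contained check in plain Python (no fastai involved; `batch_stats` is just an illustrative helper):

```python
import math
from typing import Tuple


def batch_stats(n: int, bs: int, drop_last: bool) -> Tuple[int, int]:
    """Return (number of batches, number of samples actually seen)."""
    if drop_last:
        n_batches = n // bs
        return n_batches, n_batches * bs
    return math.ceil(n / bs), n


print(batch_stats(1000, 64, drop_last=True))   # (15, 960): 40 samples never evaluated
print(batch_stats(1000, 64, drop_last=False))  # (16, 1000)
```

An accuracy computed with drop_last=True is therefore an accuracy over a subset, which matters when you want to compare it against a number computed elsewhere.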


Drop last does not happen on the validation set, which is why we set valid to the train. This is done before databunch, before anything is lost to shuffle and drops :slight_smile: You can see this if you do a show_batch on both the train and valid after databunching.

Minimal example, look at the Tabular problem (ADULTS).

data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
data.valid = data.train
data = data.databunch())

If you do a show_batch() and compare the train and the validation, you will see that the train was shuffled and dropped, while the validation was not.

Also, comparing len(data.train_dl) and len(data.valid_dl), you will see that there is one more batch in the validation (the last, partial batch that was not dropped).


Today I was trying to validate a loaded model against a labeled test set and I didn’t find an easy way (e.g. a builtin function) to do it, so I wrote a small function to do that. I’m posting it here because I stumbled upon this thread many times in a few hours of search.

The data passed to the function can be the original DataBunch or a newly created DataBunch with the valid set replaced by the labeled test set. Like this for instance:

tfms = get_transforms()
data_test = ImageDataBunch.from_folder(path, train='train', valid='test', bs=bs, ds_tfms=tfms, size=img_size).normalize(imagenet_stats)

This is the function definition:

import numpy as np

def evaluate_model_from_interp(interp, data):
    # perform a "manual" evaluation of the model to take a look at predictions vs. labels and to
    # re-compute accuracy from scratch (to double check and also because I didn't find a quick way
    # to extract the accuracy after a call to validate() on the test set...)
    print(f'Interp has {len(interp.y_true)} ground truth labels: {interp.y_true}')
    print(f'Interp yielded {len(interp.preds)} raw predictions. First two raw predictions are: {interp.preds[:2]}')
    print(f'The problem had {len(data.classes)} classes: {data.classes}') # data.c is just len(data.classes)
    print('Pred -> GroundTruth = PredLabel -> GroundTruthLabel')
    ok_pred = 0
    for idx, raw_p in enumerate(interp.preds):
        pred = np.argmax(raw_p)  # index of the highest-scoring class
        if idx < 10:
            # show the first few predictions next to their ground truth
            print(f'{pred} -> {interp.y_true[idx]} = {data.classes[pred]} -> {data.valid_ds.y[idx]}')
        if pred == interp.y_true[idx]:
            ok_pred += 1
    acc = ok_pred / len(interp.y_true)
    print(f'Overall accuracy of the model: {acc:0.5f}')

And then it can be called simply with:

evaluate_model_from_interp(interp, data_test)
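
If you also want the confusion matrix as numbers rather than a plot, the same two inputs are enough. A minimal sketch in plain Python, assuming you have already converted the raw predictions to class indices (for example with argmax, as in the function above); `confusion_counts` and `accuracy_from_matrix` are hypothetical helpers, not part of fastai:

```python
from typing import List, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> List[List[int]]:
    """matrix[t][p] = number of items with true class t predicted as class p."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy_from_matrix(matrix: List[List[int]]) -> float:
    """Correct predictions sit on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


m = confusion_counts([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], n_classes=3)
print(m)                        # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
print(accuracy_from_matrix(m))  # 0.6
```

Rows are true classes and columns are predicted classes, so the off-diagonal entries show which classes get confused with which.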

Hi, but is interp the built in function or is it different here?