Calculating the Accuracy for test set

Thanks @marcmuc for the quick response.
I followed the method you mentioned and got the accuracy for the test set, but when I tried to show the top losses or draw the confusion matrix, it showed the original validation set data, not the test set.

I don’t know why.

Here is my code:

  1. main data set to train the model (train, valid ):

bs = 64
path = Path('train').resolve()
data = (ImageItemList.from_folder(path)

and when I finished training the model I used:
2. test set

path = Path().resolve()
data_test = (ImageItemList.from_folder(path)
.split_by_folder(train='train', valid='test')
.transform(size=224)

  3. step to validate the test set:

  4. step to show the top losses and confusion matrix (the problem):
    interp = ClassificationInterpretation.from_learner(learn)
    losses,idxs = interp.top_losses()
    interp.plot_top_losses(9, figsize=(15,11), heatmap=False)
    interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

It gives the confusion matrix of the original validation data, even though I wrote 'data_test.valid_ds', not 'data.valid_ds'.

I hope you will get my point and can help me fix it.


It's probably because you call ClassificationInterpretation.from_learner() and pass it the learner you used for training, which still contains the original validation set. Try saving the model, creating a new learner with your second databunch, and reloading the weights; then run the interpretation.
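The sequence described above can be sketched as a small helper. This is only an illustration written against duck-typed objects, not code from any fastai release: it assumes the learner exposes `save(name)` and `load(name)` as fastai v1's `Learner` does, that both learners resolve the same model directory, and that `make_learner` and `make_interp` are hypothetical callables you supply yourself.

```python
def interpret_on_new_data(trained_learn, data_test, make_learner, make_interp,
                          weights_name="trained-weights"):
    """Re-run interpretation on a different DataBunch using the trained weights.

    All arguments are assumptions made for illustration:
      trained_learn: the learner used for training (must offer .save(name))
      data_test:     a DataBunch whose validation set is the labeled test set
      make_learner:  callable that builds a fresh learner from a DataBunch
      make_interp:   callable that builds the interpretation from a learner
    """
    # 1. Persist the trained weights under a known name.
    trained_learn.save(weights_name)
    # 2. Build a new learner whose validation set is the test data.
    test_learn = make_learner(data_test)
    # 3. Reload the trained weights into the new learner.
    test_learn.load(weights_name)
    # 4. The interpretation now reads test_learn's data, i.e. the test set.
    return make_interp(test_learn)
```

With fastai v1, `make_learner` would wrap the same learner constructor and architecture that were used for training, and `make_interp` could be `ClassificationInterpretation.from_learner`.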


It works! Thank you so much for your help! :heart_eyes:

Hello @marcmuc. I started using fastai a few days ago and went through lesson 1. Do I understand your comment and the docs correctly that a “test” set in ImageDataBunch plays the same role as the predict function in Keras? If I save the model and want to deploy it, do I have to pass new data into the test parameter of ImageDataBunch?

Hi @marcmuc, I’m trying to run the evaluation (confusion matrix, top losses, etc.) on an exported model using only load_learner. Assume we don’t know anything else about the exported model, i.e. no transforms, normalization, etc. I just want to reuse the exported pipeline to avoid any mismatch, since the evaluation may happen later, in another kernel, etc., while still being able to use all the available evaluation functions.
Is this possible?

huge thanks!

Late answer, but instead of calling from_learner with only the learner, try to use:

interp = ClassificationInterpretation.from_learner(learn, ds_type=DatasetType.Train)

This forces `interp` to work with the dataset selected by ds_type (here the training set) instead of the default validation set.

I use it like this and it works, but it is not really comfortable. I don’t know if this is the proper way of applying all the train transforms, and it also assumes that I know which normalization I need (if I drop normalization, the results become invalid):



data = (train_val_src.transform(,size=size)

interp = ClassificationInterpretation.from_learner(learn)


I did a similar thing in a slightly different way to verify my results: create a new databunch with only train images and prevent it from shuffling and clipping:

il = ImageList.from_folder(path=path)
ils = il.split_none()
ll = ils.label_from_folder()
new_data = ll.databunch(bs=bs);
new_data.train_dl = new_data.train_dl.new(shuffle=False, drop_last=False) # Important: prevent shuffle!

Then you substitute the “data” on your learner and compute interpretation on Train:

learn.data = new_data
interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Train)

NOTE: Sugger suggested how to prevent your Train set from shuffling here:


Thanks @ste! My main concern with this is that it needs some existing knowledge of the pipeline used (transforms/normalization). I'm trying to find a way of doing black box inference (using load_learner) while still being able to use all fastai validation functions. Thoughts?
(Test mode inference with load_learner does not support labels, from what I’ve read.)

Side note: I have some doubts about naming the param “test” in load_learner if its intended use is batch inference (e.g. production).

Usually you use the test set to make “batched inference” (i.e. a LOT of data, as in a Kaggle competition).
If you want to make inference on a single “item” at a time, use learn.predict.

See: great @anurag starter code for render/starlette:
prediction = learn.predict(img)[0]

The naming convention fits the use case of a Kaggle competition very well.
…Maybe the whole fastai library was initially built on top of Jeremy’s efforts in this direction :wink:

@ste I agree with all of the above, but an easier way to do it is like so, if you don’t want to deal with a lot of that stuff (easier in my head)

After ll.transform:

ll.valid = ll.train
db = ll.databunch()
learn.data.valid_dl = db.valid_dl

# Create Databunch
il = ImageList.from_folder(path=data_path)
ils = il.split_none() #All data on Train Set
ll = ils.label_from_folder()
ll.valid = ll.train # @muellerzr Trick!
ll.transform(tfms=None,size=256) # Optional Transforms
data = ll.databunch(bs=32);
data.normalize(stats)
learn.data.valid_dl = data.valid_dl

# Interpret
interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Valid)

What I am looking for is this:
- a model eval pipeline for exported models, not for production/API inference
- load the model in a black box way (as we would for inference), using load_learner
- no knowledge of transforms, normalization, or anything else, as with load_learner/inference
- load a validation set (multiple images, with labels), so batch mode is preferred
- use it to evaluate a validation dataset and run all the interpret functions

makes sense? :slight_smile:

If this is the case, @muellerzr's solution seems to be your best option:

This snippet should work in your case:

# Create Databunch
il = ImageList.from_folder(path=data_path)
ils = il.split_none() #All data on Train Set
ll = ils.label_from_folder()
ll.valid = ll.train # @muellerzr Trick!
ll.transform(tfms=None,size=256) # Optional Transforms
data = ll.databunch(bs=32);
data.valid_dl = data.valid_dl.new(shuffle=False, drop_last=False) # Add this to prevent drop and shuffle

# Interpret
interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Valid)

@ste, look at my updated post. You don’t need to do the .new(shuffle) etc for this anymore, and I also show how to override the valid_dl on learner

(I also stole that nice example code shamelessly :wink: )


I agree that shuffle is optional, but drop_last is not (skip it only if you don’t mind losing at most bs-1 samples).

NOTE: The Train DataLoader always has shuffle=True and drop_last=True, as you can see in the create function.


Drop last does not happen on the validation set, which is why we set valid to the train. This is done before databunch, so nothing has been lost to shuffling or drops yet :slight_smile: You can see this if you do a show_batch on the train and the valid after databunching.

Minimal example, look at the Tabular problem (ADULTS).

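# dep_var (the target column name) is assumed to be defined, as in the ADULTS example
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_none()                  # all data on the train set
        .label_from_df(cols=dep_var))
data.valid = data.train
data = data.databunch()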

If you do a show_batch() and compare the train and the validation, you will see that the train was shuffled and had its last batch dropped, while the validation was not.

Also, comparing len(data.train_dl) and len(data.valid_dl), you will see that there is one more batch in the validation (the last one, which is not dropped).
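The batch arithmetic behind this can be checked without fastai. The sketch below only models how a loader counts batches with and without drop_last; the dataset size and batch size are hypothetical numbers chosen for illustration.

```python
import math

def num_batches(n_items, bs, drop_last):
    """Number of batches a loader yields for n_items samples."""
    return n_items // bs if drop_last else math.ceil(n_items / bs)

def items_seen(n_items, bs, drop_last):
    """Number of samples that actually reach the model in one pass."""
    return (n_items // bs) * bs if drop_last else n_items

n_items, bs = 1000, 64  # hypothetical dataset size and batch size
train_batches = num_batches(n_items, bs, drop_last=True)
valid_batches = num_batches(n_items, bs, drop_last=False)
dropped = n_items - items_seen(n_items, bs, drop_last=True)
print(train_batches, valid_batches, dropped)  # 15 16 40
```

The extra batch only appears when the number of samples is not a multiple of the batch size; otherwise both loaders yield the same number of batches and nothing is dropped.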


Today I was trying to validate a loaded model against a labeled test set and I didn’t find an easy way (e.g. a builtin function) to do it, so I wrote a small function to do that. I’m posting it here because I stumbled upon this thread many times in a few hours of search.

The data passed to the function can be the original DataBunch or a newly created DataBunch with the valid set replaced by the labeled test set. Like this for instance:

tfms = get_transforms()
data_test = ImageDataBunch.from_folder(path, train='train', valid='test', bs=bs, ds_tfms=tfms, size=img_size).normalize(imagenet_stats)

This is the function definition:

import numpy as np

def evaluate_model_from_interp(interp, data):
    # Perform a "manual" evaluation of the model to take a look at predictions vs. labels and to
    # re-compute accuracy from scratch (to double check and also because I didn't find a quick way
    # to extract the accuracy after a call to validate() on the test set...)
    print(f'Interp has {len(interp.y_true)} ground truth labels: {interp.y_true}')
    print(f'Interp yielded {len(interp.preds)} raw predictions. First two raw predictions are: {interp.preds[:2]}')
    print(f'The problem had {len(data.classes)} classes: {data.classes}') # data.c is just len(data.classes)
    print('Pred -> GroundTruth = PredLabel -> GroundTruthLabel')
    ok_pred = 0
    for idx, raw_p in enumerate(interp.preds):
        pred = int(np.argmax(raw_p))  # index of the highest-scoring class
        if idx < 10:  # only show the first ten samples
            print(f'{pred} -> {interp.y_true[idx]} = {data.classes[pred]} -> {data.valid_ds.y[idx]}')
        if pred == interp.y_true[idx]:
            ok_pred += 1
    acc = ok_pred / len(interp.y_true)
    print(f'Overall accuracy of the model: {acc:0.5f}')

And then it can be called simply with:

evaluate_model_from_interp(interp, data_test)
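The same bookkeeping can also produce a confusion matrix alongside the accuracy. The following is a standalone sketch on plain Python lists, with made-up predictions; the function names are hypothetical, and with real tensors you would first convert the predictions and labels to lists (for example with `.tolist()`).

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_confusion(preds, y_true, n_classes):
    """Return (accuracy, matrix) where matrix[actual][predicted] counts samples."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for scores, actual in zip(preds, y_true):
        matrix[actual][argmax(scores)] += 1
    # Correct predictions sit on the diagonal of the matrix.
    correct = sum(matrix[c][c] for c in range(n_classes))
    return correct / len(y_true), matrix

# Hypothetical raw predictions for 4 samples over 2 classes.
preds = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
y_true = [0, 1, 1, 1]
acc, cm = accuracy_and_confusion(preds, y_true, n_classes=2)
print(acc)  # 0.75
print(cm)   # [[1, 0], [1, 2]]
```

Here rows are the actual classes and columns the predicted ones, a convention chosen for this sketch.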

Hi, but is interp the built in function or is it different here?

I was looking for a way to print the overall accuracy of the model with respect to my test set (with labels). It turned out that the test set is always unlabeled, so this is not possible directly; one first has to create another DataBunch (or replace the validation set in the existing one).

Anyway, even with another DataBunch, I didn’t find a way to print the accuracy, so I came up with this function.

@muellerzr: Thank you for this solution.

I was wondering whether I can also use .to_fp16() here?

I tried the following, which did not work: learn.data.valid_dl = data.valid_dl.add_tfm(to_half)

When I split it into 2 lines, it seemed to work. Is this right?

xz =
xz = data.valid_dl.add_tfm(to_half)

Does the complete example below look right? (If it’s okay that I ask :slight_smile: - I am afraid to do something wrong and get predictions that are somehow wrong in reality).

And I also have a “is_valid” column with only "True"s in my df: It does not matter whether such a third column is still inside the test set, does it?

And is it important that batch size here and in train/valid set are the same?

Thank you for your work! :slight_smile:

#Complete example from me:
    il = ImageList.from_df(df_test, path = '/home/name_folder')
    ils = il.split_none() #All data on Train Set
    ll = ils.label_from_df(cols='label', label_cls = CategoryList)
    ll.valid = ll.train # @muellerzr Trick!
    ll.transform(tfms=get_transforms(flip_vert=True, max_zoom=1., max_warp=None),size=256) # Optional Transforms
    data = ll.databunch(bs=120);

    xz =
    xz = data.valid_dl.add_tfm(to_half)

    # Interpret
    interp = ClassificationInterpretation.from_learner(learn,ds_type=DatasetType.Valid)