How to train on the full dataset using ImageClassifierData.from_csv

With regards to the dog breed competition, @jeremy says in the lesson 2 video that he would “go back and remove the validation set and just rerun the same steps and submit that”.

… however, when I try to set val_idxs = None, it throws an exception:


Yeah it’s a bug, sorry. Try using [0] as val_idxs for now, so just one thing is in your validation set.
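For anyone confused about what the workaround does: a minimal sketch of what `val_idxs` controls, with plain Python lists standing in for the rows of the labels CSV (the `split_by_idxs` helper name is just for illustration, not the fastai API):

```python
# val_idxs picks which rows of the labels CSV become the validation
# split; passing [0] leaves a single image there instead of none.
def split_by_idxs(rows, val_idxs):
    chosen = set(val_idxs)
    val = [rows[i] for i in val_idxs]
    trn = [r for i, r in enumerate(rows) if i not in chosen]
    return trn, val

rows = ['img_0.jpg', 'img_1.jpg', 'img_2.jpg', 'img_3.jpg']
trn, val = split_by_idxs(rows, [0])
# val -> ['img_0.jpg'], trn -> the other three images
```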


Ah ok. Thanks for the quick reply.

Yea I tried this and it works. Basically you just end up with one image in your validation set instead of none.


Oh no, I used range(10) :frowning:

Also have some doubts about retraining with whole dataset:

I’m sure I am missing something (maybe obvious) here. It seems to me that all this code “needs” a validation dataset. I mean, how is Adam going to work without it? And how can we make it work the same way it did before?

If we remove the validation set when training on all the data, how do we reproduce the steps we followed previously with the “incomplete” train and validation sets? Aren’t we training blind now? Will the code work as before if we reproduce the number of epochs and learning rates, despite not having a validation set?

Adam has nothing to do with a validation set - is there something you’ve read that suggests otherwise?

Yes you’re right, we’re blind training without a validation set. So you have to complete the exact same steps as when you weren’t blind - same LR, same epochs, etc. It should then work fine.


My belief was that Adam adapts the lr as a function of how the validation loss changes, but obviously I have to find out more about how it actually adapts the lr. :thinking:
(EDIT: I had got Adam completely wrong; understood thanks to the part 1 explanation, link in the post below)

Turns out the best resource to understand Adagrad/RMSprop/Adam was… FAST.AI !!! :grinning:
(and yes, I had completely misunderstood Adam)

So, thanks to @EricPB 's excellent video timelines of version 1 of this course here Part 1: complete collection of video timelines, I found Adagrad explained with an Excel spreadsheet, just what I needed!
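For anyone following along, the per-parameter arithmetic those spreadsheets step through can be sketched in a few lines of plain Python (a sketch of the textbook Adam update, not fastai code). Note that nothing in it ever looks at a validation loss — it is driven entirely by training gradients:

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter, driven only by the
    training gradient `grad` -- no validation loss anywhere."""
    m = b1 * m + (1 - b1) * grad        # moving average of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias-correct both averages
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, grad=0.5, m=m, v=v, t=1)
# the effective step size adapts per parameter via m_hat / sqrt(v_hat)
```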

Thank you @Jeremy, this solved a big misconception I had, one less left! :grinning:


This works, but I’m curious.

I thought one of the golden rules was to always have a validation data set (which we are essentially eliminating). So does this work because we have already determined our model and process is good enough with a validation set first? Or does it work simply because the process of training on smaller image sizes and then larger sizes just generalizes well, in which case, it would make sense that we could just start with training on the full data set and forego using a validation set completely?

Another question that I can’t believe I never realized …

In your example, you specify sizes of 224 and then 299 for training … BUT, the get_data() method above resizes both to 340. So what is being gained?


Exactly this!


The transforms downsize the images to 224 or 299. Reading the jpgs and resizing is slow for big images, so resizing them all to 340 first saves time.
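The saving is easy to see if you track just the image dimensions. A sketch assuming shortest-side scaling (roughly what the resize does; the numbers and the `scaled` helper are illustrative):

```python
# Decoding and shrinking a big jpg is the slow step, so do it once down
# to 340px and let the per-epoch transforms work from that smaller copy.
def scaled(shape, target):
    # scale so the shortest side equals `target`, keeping aspect ratio
    w, h = shape
    s = target / min(w, h)
    return (round(w * s), round(h * s))

original = (1024, 768)
cached = scaled(original, 340)    # done once, stored on disk
small = scaled(cached, 224)       # done every epoch, from the 340px copy
big = scaled(cached, 299)         # ditto for the 299px runs
# cached -> (453, 340): every later resize starts from ~5x fewer pixels
```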


Thank you for the pre-resizing clarification. It was a bit confusing for me too; I thought it was a bug in the code.

What’s interesting to me is that the practice seems to apply to related architectures as well.

Case in point. I started with resnet34 and a validation data set, followed the basic training process, set the data to 299, and went through a few final iterations. Things were looking good so I submitted to kaggle and placed around 60th.

Ran through the same steps without a validation set and things improved. I moved up in the competition somewhere in the 40’s.

So I thought, “Will more complex resnet models improve without a validation dataset if I follow the steps above?”

So I ran through the same process with resnext50, and my ranking improved. Ran through it one more time using resnext101, and it put me at 14th place.

def get_data(sz, bs):
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    # folder / csv_fname / val_idxs as set up earlier in the notebook
    data = ImageClassifierData.from_csv(path=PATH, folder='train',
                                        csv_fname=label_csv, tfms=tfms,
                                        bs=bs, val_idxs=val_idxs,
                                        suffix='.jpg')
    # pre-resize to 340 only for the sizes below 300
    return data if sz > 300 else data.resize(340, 'tmp')

data_500 = get_data(500, bs)

data_224 = get_data(224, bs)
data_299 = get_data(299, bs)

Actually, the data.resize() code is still a bit unclear to me. The logic says:

  • That if the image size is > 300, resize according to tfms and nothing else, which means I get 500x500 images/generator back? That means data_500 contains 500x500 images.
  • However, if the image size is <= 300, then tfms first resizes them to 224 or 299 and THEN resizes them again to 340x340, storing the resized pics in the ‘340/tmp’ folder? So essentially data_224 and data_299 both have the same content, since they are both getting resized to 340 in the end?

Clearly I am not understanding this correctly.


I haven’t tested this, but since you aren’t doing a resize, you’ll just be using the images in the training and test folders.

If < 300, the framework is going to create a bunch of 340x340 images under /tmp/340, and then USE those images to create the 224 and 299 images for training. The idea is to save computation time by having a big enough set of cached images to handle almost all the sizes you want to train on.
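In other words, the `sz > 300` branch just decides which folder the loaders read from. A sketch of that decision (the folder layout follows the thread; the `data/dogbreeds` path and `source_folder` helper name are illustrative):

```python
import os

def source_folder(sz, path='data/dogbreeds'):
    # below the 340px cache size we read the pre-resized copies;
    # at 500 the cache would be an upscale, so read the originals
    if sz > 300:
        return path
    return os.path.join(path, 'tmp', '340')

# 224 and 299 share the same 340px cache; 500 uses the original images
folders = {sz: source_folder(sz) for sz in (224, 299, 500)}
```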


So you are using resnext101 by itself and getting to 14th, or are you combining all of these into one ensemble?

No ensembling. It’s pretty amazing.

Yikes, I have been ensembling like crazy and I’m currently in 22nd place. I might need to go back to the drawing board.