How to train on the full dataset using ImageClassifierData.from_csv

wgpubs · November 12, 2017, 6:15pm

With regards to the dog breed competition, @jeremy says in the lesson 2 video that he would “go back and remove the validation set and just rerun the same steps and submit that” (see https://youtu.be/4mwdySNmtYs?t=2h24m1s)

… however, when I try to set val_idxs = None, it throws an exception:

jeremy · November 12, 2017, 6:16pm

Yeah it’s a bug, sorry. Try using [0] as val_idxs for now, so just one thing is in your validation set.

wgpubs · November 12, 2017, 6:17pm

Ah ok. Thanks for the quick reply.

jamesrequa · November 12, 2017, 7:00pm

Yea I tried this and it works. Basically you just end up with one image in your validation set instead of none.
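
As a rough illustration of what the workaround does to the split, here is a plain-Python sketch. It is not fastai's actual code: `split_by_val_idxs` is a made-up helper and the dataset size is invented. It only shows that every index listed in val_idxs is held out of training, so [0] costs you a single image.

```python
def split_by_val_idxs(n_items, val_idxs):
    """Hypothetical helper: split item indices 0..n_items-1 into (train, val)."""
    val = sorted(set(val_idxs))
    held_out = set(val)
    train = [i for i in range(n_items) if i not in held_out]
    return train, val


n_items = 1000  # made-up dataset size, for illustration only

# val_idxs=[0]: one image held out, everything else is trained on
train, val = split_by_val_idxs(n_items, [0])
print(len(train), len(val))

# a longer list holds out correspondingly more images
train10, val10 = split_by_val_idxs(n_items, range(10))
print(len(train10), len(val10))
```

With a single held-out image the reported validation loss and accuracy are effectively meaningless, so they should be ignored during this run.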

sermakarevich · November 12, 2017, 7:45pm

Oh no, I used range(10)

miguel_perez · November 12, 2017, 10:25pm

Also have some doubts about retraining with the whole dataset:

I’m sure I am missing something (maybe obvious) here. It seems to me that all this code “needs” a validation dataset. I mean, how is Adam going to work without it? And how can we make it work in the same way it did before?

If we remove the validation set when training on all the data, how do we reproduce the steps we used previously with the “incomplete” train and validation sets? Aren’t we training blind now? Will the code just work as before if we reproduce the number of epochs, learning rates, etc., in spite of not having a validation set?

jeremy · November 13, 2017, 12:34am

Adam has nothing to do with a validation set - is there something you’ve read that suggests otherwise?

Yes you’re right, we’re blind training without a validation set. So you have to complete the exact same steps as when you weren’t blind - same LR, same epochs, etc. It should then work fine.

miguel_perez · November 13, 2017, 8:40am

My belief was that Adam adapts the lr as a function of how the validation loss changes, but obviously I need to find out more about how it actually adapts the lr.
(EDIT: I had got Adam completely wrong; I understood it thanks to the fast.ai part 1 explanation, link in the post below)
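
For anyone with the same doubt, here is a minimal single-parameter sketch of the Adam update, using the commonly cited default values for beta1, beta2 and eps. The toy loss (w - 3)^2, the learning rate and the step count are made up for illustration. The point is that the only input to the update is the gradient of the training loss; no validation loss appears anywhere.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy training loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * (w - 3.0)  # gradient of the TRAINING loss only
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)

print(w)  # should end up close to the minimum at 3
```

The per-parameter step size is scaled by the history of squared gradients, which is the "adaptive" part; it has nothing to do with how the validation loss moves.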

miguel_perez · November 13, 2017, 10:31am

Turns out that the best resource to understand Adagrad/RMSprop/Adam was… FAST.AI !!!
(and yes, I had completely misunderstood Adam)

So, thanks to @EricPB’s excellent video timelines of version 1 of this course (here: Part 1: complete collection of video timelines), I could find Adagrad here https://youtu.be/V2h3IOBDvrA?t=34m35s explained with an Excel spreadsheet, just what I needed!

Thank you @Jeremy, this solved a big misconception I had, one less left!

wgpubs · November 13, 2017, 3:53pm

This works, but I’m curious.

I thought one of the golden rules was to always have a validation data set (which we are essentially eliminating). So does this work because we have already determined, with a validation set, that our model and process are good enough? Or does it work simply because the process of training on smaller image sizes and then larger sizes just generalizes well, in which case it would make sense to start with training on the full data set and forgo using a validation set completely?

wgpubs · November 13, 2017, 4:15pm

Another question that I can’t believe I never realized …

In your example, you specify sizes of 224 and then 299 for training … BUT, the get_data() method above resizes both to 340. So what is being gained?

jeremy · November 13, 2017, 4:41pm

Exactly this!

jeremy · November 13, 2017, 4:42pm

The transforms downsize the images to 224 or 299. Reading the jpgs and resizing is slow for big images, so resizing them all to 340 first saves time.

abi · November 13, 2017, 5:25pm

Thank you for the pre-resizing clarification. It was a bit confusing for me too. Thought that was a bug in the code.

wgpubs · November 13, 2017, 5:49pm

What’s interesting to me is that the practice seems to apply to related architectures as well.

Case in point. I started with resnet34 and a validation data set, followed the basic training process, set the data to 299, and went through a few final iterations. Things were looking good so I submitted to kaggle and placed around 60th.

Ran through the same steps without a validation set and things improved. I moved up in the competition somewhere in the 40’s.

So I thought, “Will more complex resnet models improve without a validation dataset if I follow the steps above?”

So I ran through the same process with resnext50, and my ranking improved. Ran through it one more time using resnext101, and it put me at 14th place.

abi · November 13, 2017, 6:33pm

def get_data(sz, bs):
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(path=PATH,
                                        folder='train',
                                        csv_fname=label_csv,
                                        bs=bs,
                                        tfms=tfms,
                                        val_idxs=val_idxs,
                                        test_name='test',
                                        suffix = '.jpg')
    return data if sz > 300 else data.resize(340, 'tmp')


data_500 = get_data(500, bs)

data_224 = get_data(224, bs)
data_299 = get_data(299, bs)

Actually, the data.resize() code is still a bit unclear to me. The logic says:

If the image size is > 300, then resize according to tfms and nothing else, which means I get 500x500 images/generator back? That means data_500 contains 500x500 images.
However, if the image size is <= 300, then tfms first resizes them to 224 or 299 and THEN they are resized again to 340x340, with the resized pics stored in the ‘340/tmp’ folder? So essentially data_224 and data_299 both have the same content, since they are both getting resized to 340 in the end?

Clearly I am not understanding this correctly.

wgpubs · November 13, 2017, 6:39pm

I haven’t tested this, but for the 500 case, since you aren’t doing a resize, you’ll just be using the images in the training and test folders.

If < 300, the framework is going to create a bunch of 340x340 images under /tmp/340, and then USE those images to create the 224 and 299 images for training. The idea is to save computation time by having a set of images big enough to handle almost all the sizes you want to train on.
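
To make the order of operations concrete, here is a small plain-Python sketch of the decision get_data makes above. The helper names are made up and this is not fastai code; it only mirrors the `sz > 300` branch and checks whether the cached 340px copies are large enough to be downsized to the requested training size.

```python
CACHE_SZ = 340   # size passed to data.resize(...) in get_data above
THRESHOLD = 300  # get_data skips the cached copies when sz > 300


def image_source(sz):
    """Hypothetical helper mirroring the `sz > 300` branch of get_data."""
    if sz > THRESHOLD:
        return "original files"
    return "cached %dpx copies" % CACHE_SZ


def cache_is_big_enough(sz):
    """True if the cached copies can be downsized to sz without upscaling."""
    return sz <= CACHE_SZ


for sz in (224, 299, 500):
    print(sz, image_source(sz), cache_is_big_enough(sz))
```

So data_224 and data_299 read from the same cached copies, but the transforms still bring each one down to its own training size; the 340 step is only a cheaper starting point than the full-size jpgs.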

KevinB · November 13, 2017, 6:50pm

So you are using resnext101 by itself and getting to 14th, or are you combining all of these into one ensemble?

wgpubs · November 13, 2017, 6:52pm

No ensembling. It’s pretty amazing.

KevinB · November 13, 2017, 6:54pm

Yikes I have been ensembling like crazy and I’m in 22nd place currently. I might need to go back to the drawing board.