Dog Breed Identification challenge

No reason to get 0.47 when you have 0.16

Sorry if this is a simple question, but how would one “train” a model on predictions instead of the images?

My bad =D

First, predict on your training images. You will get 1000 probabilities per image; then use your favorite machine learning package (for example, sklearn) to fit a multiclass logistic regression on them.
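For concreteness, here is a minimal sketch of that second step (the file names and array shapes are hypothetical, not from this thread):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: the pretrained network's 1000 ImageNet probabilities for
# the competition's train and test images, plus the breed labels.
train_probs = np.load('train_imagenet_probs.npy')   # shape (n_train, 1000)
train_breeds = np.load('train_breeds.npy')          # shape (n_train,)
test_probs = np.load('test_imagenet_probs.npy')     # shape (n_test, 1000)

# Multiclass logistic regression on top of the 1000 probabilities.
clf = LogisticRegression(max_iter=1000)
clf.fit(train_probs, train_breeds)
breed_probs = clf.predict_proba(test_probs)         # one column per breed, for the submission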

6 Likes

Hey… Congrats on the score! Could you give a hint on how to train with all the training data? I currently trained by splitting the data between training and validation, so I'm interested to see how much it improves without splitting the data.

Thanks!

Is it as simple as moving just one image to validation? Will try that in a bit!

I don't think the current fastai code accepts 0 images for the validation set. So for now you can just use val_idxs = [0], which will put one image in the validation set and the rest in the training set. It should produce results similar to training with all images, as it's very close to what you want!
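Something like this, assuming the usual lesson setup (PATH, tfms and bs already defined):

val_idxs = [0]  # keep a single image in the validation set
data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)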

6 Likes

I have been able to achieve a score of 0.385, mostly by adjusting the learning rate. I also adjusted precompute and the cycle_len parameter to obtain the score. The cycle_mult parameter did not seem to have much influence here.
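For reference, a sketch of where those knobs live in the fastai (0.7) API from the lessons, assuming arch and data are already defined as in the notebooks; the values here are illustrative, not the ones that produced the 0.385 score:

learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 2)                              # warm up on precomputed activations
learn.precompute = False
learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)   # cycle_len / cycle_mult control the SGDR restarts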

I am not sure how to code model averaging / ensembling at this time, although I see there are posts about it in this topic.

What are some of the models you have in your ensemble? Just FYI, my best ensemble only has 3 models at the moment.

Though @jamesrequa already replied, here’s mine.
I split into training and validation with the following code:
val_idxs = get_cv_idxs(n, val_pct=1e-4)
With such a small val_pct, this leaves just a single index in val_idxs.


Hope this may help :smile:

6 Likes

Is this concept the same as what the winner of the Kaggle Planet competition describes here as “ridge regression”?
"to predict the final clear probability (from the resnet-101 model alone), I have a specific clear ridge regression model that takes in the resnet-101 model’s predictions of all 17 labels."

No, the planet winner is describing an ensembling approach, whereas @yinterian is simply describing a way to use a pretrained network.

2 Likes

Hi @sermakarevich

I have been trying to get my head around what you are doing! When you do 5-fold CV, are you training a model from each architecture 5 times, each with a randomly resampled set of images (i.e. a different train/test split each time from the full dataset, as in classical CV)?

Then, are you taking an average of the final CV scores of each of your models as the score for your ensemble?

If this is the case, how do you apply that to get an output for the Kaggle competition? I have never entered one, so this may sound like a dumb question, but I thought you would have to get an output by passing data into a model to get a set of predictions / matches. How could you do this with a bunch of models?

Or, perhaps you are doing something on a lower level and creating a new model by assembling / chaining together components from each of the architectures?

Finally, if by some fluke I have described something accurately here, even if you do have a bunch of predictions from a group of models, when applying these to a (real-world) problem, how do you choose the “correct prediction” for an item of an unknown class? Is it again the average?

Apart from solving a possible over-fitting problem, and gaining confidence about the robustness of your approach since it is an ensemble, how is this approach better than just choosing the very best individual architecture?

Those are inception, inceptionresnet and resnet with different image sizes.

OK, you might want to try resnext; that was one of my best ones.

3 Likes

Hi @Chris_Palmer.

With sklearn.model_selection.StratifiedShuffleSplit and data.from_csv
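A minimal sketch of that combination, assuming the competition's labels.csv layout (id, breed columns) and the usual PATH / tfms / bs from the lessons:

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

labels = pd.read_csv(f'{PATH}labels.csv')
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(sss.split(labels['id'], labels['breed']))  # stratified by breed

data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', val_idxs=val_idx, suffix='.jpg', tfms=tfms, bs=bs)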

That's almost right. At each CV step you make two predictions: one for the validation set and one for the test set. The test set predictions you just average; the train (validation-fold) predictions you concatenate. These out-of-fold (OOF) train predictions you can use to decide how to blend/average your test predictions: a simple average, median, weighted average, or, as @yinterian recommended, ridge or logistic regression on top of the predictions. This is also typically called stacking: using model outputs as inputs to the next level of models.
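A rough sketch of that scheme with hypothetical array names (a simple sklearn model stands in for the CNN at the first level):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

features = np.load('train_features.npy')       # shape (n_train, d) first-level inputs
y = np.load('train_labels.npy')                # shape (n_train,)
test_features = np.load('test_features.npy')   # shape (n_test, d)
n_classes = len(np.unique(y))

oof_preds = np.zeros((len(y), n_classes))
test_preds = np.zeros((len(test_features), n_classes))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(features, y):
    model = LogisticRegression(max_iter=1000)                        # stand-in for the CNN
    model.fit(features[train_idx], y[train_idx])
    oof_preds[val_idx] = model.predict_proba(features[val_idx])      # concatenate the OOF predictions
    test_preds += model.predict_proba(test_features) / skf.n_splits  # average the test predictions

# Second level ("stacking"): blend using the OOF predictions as inputs.
stacker = LogisticRegression(max_iter=1000)
stacker.fit(oof_preds, y)
final_test_preds = stacker.predict_proba(test_preds)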

Take a look at these two articles/posts:

@jamesrequa thanks man, I will try it. So many parameters, it's easy to get lost when I don't have enough intuition :wink:

@jeremy @yinterian I think I got the reason why the ImageNet network with its original layers might perform better than a fine-tuned one: it observed more dogs when it was trained, since the ImageNet competitions used a different train/test split. Is that right?

6 Likes

Great links! Thanks

My ensemble also has 3 models from resnext, inceptionresnet, and inception, just fyi.

1 Like

Hello @jamesrequa,

Sorry if the following question has already been asked in the forum.
You wrote:

Does it mean there is no way to pass the test folder name AFTER training the model with learn()?

I ask because I trained my model without test_name (by default, test_name=None), using the following code to create the data object:
data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)
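For reference, the same call with the test folder passed at creation time (assuming the competition's usual 'test' directory), which is what test_name is for:

data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs, test_name='test')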

@jeremy can you please share your approach to fine-tuning a model (maybe you can cover this in the next lecture)? I mean not the steps from lesson 1, but things like:

  • how do you choose the best dropout
  • the number of nodes in the fc layers
  • the number of fc layers
  • the number of steps with each learning rate
  • how do you deal with run-to-run variation from randomness
  • etc.
2 Likes