Thanks for all of your help on this, and yes, it’s the Dog Breed competition! I am very curious to see how far I can get, so I can test my understanding of Jeremy’s and others’ instructions!
I have just submitted this “trained on all the data” model and it does not do very well: a score of 0.22755. I am not sure that I can trust the way I have gone about it, and given how much digging around it takes to get a clear understanding of this, it’s no wonder.
I think I will just go back to the drawing board and try to follow the regular approach to see where it gets me. I was certainly getting a better score while building the model, but all of this extra work with the full data set has left me in the dark.
Is there any way to test how well we are doing on the test set before we submit to Kaggle?
Running through my learning again after setting val_idxs to a single index (i.e. just one validation file), I am still getting validation errors and predictions that look like the previous ones from when I had a full data set - is this expected? Is this validation against my one and only validation file?
Also, should I be looking at the error rate and accuracy and choosing a place to stop the learning? I can see that after the second run I am increasingly overfitting and my accuracy is getting worse, so going all the way through 3 runs may not be desirable. It’s very hard to know what is really happening without a validation set!
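One way to judge progress before submitting is to keep a genuine validation split while experimenting, and only retrain on everything at the very end. A minimal sketch of carving out a random 20% holdout with numpy (assuming a Dog Breed training set of 10222 images, which is consistent with the 8178-record training split mentioned later in the thread):

```python
import numpy as np

n_train = 10222     # Dog Breed training-set size (illustrative)
np.random.seed(42)  # make the split reproducible

# Hold out 20% of the indices as a validation set; train on the rest
val_idxs = np.random.permutation(n_train)[:int(n_train * 0.2)]

print(len(val_idxs))            # 2044 validation images
print(n_train - len(val_idxs))  # 8178 training images
```

Scoring that holdout locally with the competition metric (log loss) should then track the leaderboard reasonably well.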
I tried train_features = learn.predict(data.trn_dl), but it returns an array with shape (10357, 1000). I also tried train_features = learn.predict(data.trn_ds) and it has the same shape. data.trn_y has a shape (8178,) which is what I expected since 20% of the data is in the validation set.
This error is coming from my call to metrics.log_loss. I believe it’s because something is wrong with the y I am passing as a parameter.
If I switch the parameters around and call metrics.log_loss(probs, y), I get a different error, “ValueError: Multioutput target data is not supported with label binarization” - so I don’t think the parameters are in the wrong order.
Actually, y is just an array of zeros, which can’t be right, so maybe when we use log_preds, y = learn.TTA(is_test=True) the y we are getting back is incorrect.
I believe that my y is just zeros because I set val_idxs to a single index in order to train with the entire data set.
In summary, I believe the reported problem (“Please provide the true labels explicitly through the labels argument”) means that I need to get my labels some other way - how would this be done?
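For what it’s worth, sklearn’s log_loss expects the true labels first and the probabilities second, and its labels argument lets you supply the full class list explicitly when y itself does not contain every class. A hedged sketch with made-up breed names (for the real competition, the class list would come from the training labels file):

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical 3-class problem; the columns of `probs` follow the
# sorted order of `classes`, as sklearn's label binarizer expects.
classes = ['beagle', 'boxer', 'corgi']
y_true = ['beagle', 'corgi', 'boxer', 'beagle']
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.1, 0.7],
                  [0.1, 0.8, 0.1],
                  [0.6, 0.2, 0.2]])

# Passing labels= avoids the "provide the true labels explicitly"
# error when y_true happens to contain only a subset of the classes.
loss = log_loss(y_true, probs, labels=classes)
print(loss)  # mean negative log-probability of the true class
```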
Ah, is that the case? Apologies for missing that part. Now the error makes sense. Since you have set val_idxs to a single index, the validation set would have had very few records, and it so happens that they all share the same label. Sklearn enforces a constraint of at least two distinct labels, which is the error you see.
Anyway, if you’re training on all the data, this number will not be significant to your analysis, right? Why would you want to calculate it anyway?
Edit: I’m on my phone right now. I’ll get back on this a little while later.
You have pinpointed the problem: I have just one file in my valid set, so y contains only one label.
But now I am confused about is_test=True. I thought this meant that it is going to use the images in the test folder to test against, that there should be equivalent y labels for those images, and that the TTA function would return them. But I guess it wouldn’t be a proper test set if you knew the labels!
This line suggests that when you don’t give is_test=True in your TTA function call, the 2nd output in the return statement will contain targets from validation folder - which in your case happens to be 1 record and hence the sklearn error.
However, when you put is_test=True, you are using the images in your test folder (for which you do not know the labels) and using the model to generate predictions; in this case the library returns zeros as a placeholder for the second output, which you can consider irrelevant. You only need the first output from the TTA function call, and from those probabilities you can extract labels / probabilities as you’d like to process them downstream.
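So, schematically, only the first TTA output matters for the test set. Assuming it holds log-probabilities (as described above), turning it into class probabilities and predicted labels is just an exp and an argmax - a sketch with a stand-in array in place of the real TTA output:

```python
import numpy as np

# Stand-in for the first output of learn.TTA(is_test=True):
# log-probabilities for 2 test images over 3 classes.
log_preds = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.1, 0.8]]))

probs = np.exp(log_preds)        # back to probabilities
pred_idx = probs.argmax(axis=1)  # most likely class per image

print(probs.sum(axis=1))  # each row sums to 1
print(pred_idx)           # [0 2]
```

For a Kaggle submission you would keep the full `probs` matrix rather than the argmax, since the competition is scored on log loss over the per-class probabilities.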
nbuser@jupyter:~$ kg config -g -u ##### -p ###### -c Dog Breed Identification
usage: kg config [-h] [-u USERNAME] [-p PASSWORD] [-c COMPETITION] [-g]
kg config: error: unrecognized arguments: Breed Identification
nbuser@jupyter:~$ kg config -h -u ##### -p ##### -c Dog Breed Identification -g
usage: kg [--version] [-v | -q] [--log-file LOG_FILE] [-h] [--debug]
An unofficial Kaggle command line tool.
--version show program’s version number and exit
-v, --verbose Increase verbosity of output. Can be repeated.
-q, --quiet Suppress output except warnings and errors.
--log-file LOG_FILE Specify a file to log output. Disabled by default.
-h, --help Show help message and exit.
--debug Show tracebacks on errors.
complete print bash completion command
config Set config.
dataset Download dataset from a specific user.
download Download data files from a specific competition.
help print detailed help for another command
submissions List recent submissions.
submit Submit an entry to a specific competition.
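The “unrecognized arguments: Breed Identification” error above is plain shell word-splitting: the unquoted competition name is passed to kg as three separate arguments. Quoting it keeps it as one argument (though Kaggle competitions are usually identified by their URL slug, which for this competition I believe is dog-breed-identification). A quick demonstration of the splitting itself:

```shell
# Unquoted: the shell splits "Dog Breed Identification" into three words
set -- -c Dog Breed Identification
echo $#   # 4 arguments reach the program

# Quoted: the name travels as a single argument
set -- -c "Dog Breed Identification"
echo $#   # 2 arguments
```

So the config call would look something like kg config -g -u username -p password -c dog-breed-identification (the slug is an assumption based on the competition’s URL), after which kg download fetches the competition files, per the help text above.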
Does that mean I did everything correctly, @sermakarevich?
If yes, how should I download the dog breed dataset now?