Dog Breed Identification challenge


(Chris Palmer) #243

Hi @jamesrequa

Thanks for all of your help on this, and yes, it’s the Dog Breed competition! I am very curious to see where I can get, so I can test my understanding of Jeremy’s and others’ instructions!

I have just submitted this “trained on all the data” model and it does not do very well - a score of 0.22755. I am not sure that I can trust the way I have gone about it, and with so much digging around needed to get a clear understanding of this, it’s no wonder :expressionless:

I think I will just go back to the drawing board and try to follow the regular approach to see where it gets me, because I was certainly getting a better score while building the model, but all of this extra work with the full data set has left me in the dark.

Is there any way to test how well we are doing on the test set before we submit to Kaggle?
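One option (just a sketch, not from the course materials): since this competition is scored on multi-class log loss, a held-out validation split can be scored locally with sklearn and the number compared directly with the leaderboard. The tiny arrays below are stand-ins for real validation predictions and labels:

```python
# Sketch: estimate the Kaggle metric (multi-class log loss) locally
# before submitting. val_probs / val_y are stand-ins for the model's
# validation-set probabilities and the true class ids.
import numpy as np
from sklearn.metrics import log_loss

val_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])   # (n_images, n_classes)
val_y = np.array([0, 1, 2])               # true class ids

# The leaderboard score is this same metric, so the local number is
# directly comparable (assuming the split is representative).
score = log_loss(val_y, val_probs, labels=[0, 1, 2])
print(score)
```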


(Chris Palmer) #244

Running through my learning again after setting val_idxs = [0] (i.e. just one validation file), I am still getting validation errors and predictions that look like the results from when I had a full validation set - is this expected? Is this validation against my one and only validation file?

Also, should I be looking at the error rate and accuracy and choosing a place to stop training? I can see that after the second run I am increasingly overfitting and my accuracy is getting worse, so going all the way through 3 runs may not be desirable. It’s very hard to know what is really happening without a validation set!

# step 1
learn.fit(1e-2, 7)
[ 0.       0.22371  0.22195  0.94088]                        
[ 1.       0.20968  0.2259   0.93747]                        
[ 2.       0.20207  0.22398  0.93844]                        
[ 3.       0.20532  0.22566  0.93939]                        
[ 4.       0.18854  0.22653  0.93698]                        
[ 5.       0.20381  0.22526  0.94088]                        
[ 6.       0.21357  0.22947  0.93597]

# step 2 -- should I have stopped after this?
learn.fit(1e-2, 3, cycle_len=2, cycle_mult = 2)
[ 0.       0.17768  0.22844  0.93844]                        
[ 1.       0.17088  0.23041  0.93695]                        
[ 2.       0.16777  0.23185  0.93796]                        
[ 6.       0.17352  0.23387  0.93698]                        
[ 7.       0.16513  0.22885  0.93646]                        
[ 8.       0.16994  0.23512  0.93792]                        
[ 9.       0.16108  0.23063  0.93991]                        
[ 10.        0.15742   0.23026   0.93939]                    
[ 11.        0.14899   0.22877   0.93991]                    
[ 12.        0.14532   0.23005   0.94137]                    
[ 13.        0.16061   0.22951   0.9404 ]   

# step 3 - carry on, even though extreme overfitting???
learn.fit(1e-2, 3, cycle_len=1, cycle_mult = 2)
[ 0.       0.16628  0.23203  0.93503]                        
[ 1.       0.15619  0.23206  0.93646]                        
[ 2.       0.14303  0.23088  0.93548]                        
[ 3.       0.15428  0.23497  0.93796]                        
[ 4.       0.15449  0.23107  0.93841]                        
[ 5.       0.1584   0.23028  0.93841]                        
[ 6.       0.14592  0.2302   0.93942] 
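(Without a real validation set, the middle and right columns above don’t mean much. If it helps, here is a stand-alone sketch of what fastai’s get_cv_idxs(n) does - holding out a random subset of row indices; the 20% default and the row count of 10222 for this competition’s labels.csv are assumptions:)

```python
# Sketch of carving out a random validation split, mirroring the idea
# behind fastai's get_cv_idxs(n) (assumed default: 20% of the rows).
import numpy as np

def cv_idxs(n, val_pct=0.2, seed=42):
    np.random.seed(seed)
    # Shuffle all row indices and keep the first val_pct of them
    return np.random.permutation(n)[:int(n * val_pct)]

n = 10222                 # assumed number of rows in labels.csv
val_idxs = cv_idxs(n)
print(len(val_idxs))      # roughly 20% of the data
```

Passing an array like this as val_idxs to ImageClassifierData.from_csv, instead of [0], would make the validation loss and accuracy printed by learn.fit meaningful again.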

#245

How do I get predictions on the training set for Dog Breed competition to create a linear model?
I created a model as shown in the cifar 10 post.

def get_data(sz, bs):
    tfms = tfms_from_model(m, sz)
    data = ImageClassifierData.from_csv(PATH, "train", f'{PATH}labels.csv', test_name="test",
                                        val_idxs=val_idxs, tfms=tfms, suffix=".jpg", bs=bs)
    return data if sz > 300 else data.resize(340, "tmp")

m = resnet101(True)
bmodel = BasicModel(m.cuda(), name='resnet101')

bs = 58
sz = 299
data = get_data(sz, bs)
learn = ConvLearner(data, bmodel)

I tried train_features = learn.predict(data.trn_dl), but it returns an array with shape (10357, 1000). I also tried train_features = learn.predict(data.trn_ds) and it has the same shape. data.trn_y has shape (8178,), which is what I expected, since 20% of the data is in the validation set.


(Lucas Goulart Vazquez) #246

I’m using:
probs = model.predict_dl(data.trn_dl)


#248

Thank you - learn.predict_dl(data.trn_dl) worked.
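(For anyone following along: the array predict_dl returns can then be used as a feature matrix for the linear model. A hedged sketch with random stand-in data - in the notebook, train_features would be the output of learn.predict_dl(data.trn_dl) and train_y would be data.trn_y:)

```python
# Sketch: fit a simple linear classifier on precomputed network outputs.
# The random arrays below stand in for learn.predict_dl(data.trn_dl)
# and data.trn_y from the fastai data object.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
train_features = rng.randn(100, 16)      # (n_images, n_features)
train_y = rng.randint(0, 4, size=100)    # class ids

clf = LogisticRegression(max_iter=1000)
clf.fit(train_features, train_y)
val_probs = clf.predict_proba(train_features)  # per-class probabilities
print(val_probs.shape)
```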


(Chris Palmer) #249

Hi @binga

This error is coming from my call to metrics.log_loss. I believe it’s because something is wrong with the y I am passing as a parameter.

If I switch the parameters around and call metrics.log_loss(probs, y), then I get a different error, “ValueError: Multioutput target data is not supported with label binarization” - so I don’t think the parameters were in the wrong order to begin with.

Actually, y is just an array of zeros, which can’t be right, so maybe when we use log_preds, y = learn.TTA(is_test=True), the y we are getting back is incorrect.

I believe that my y is just zeros because I have used val_idxs = [0] in order to train with the entire data set.

In summary, I believe the problem it reports (“Please provide the true labels explicitly through the labels argument”) means that I need to get my labels some other way - how would this be done?
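(One possible reading of that message, sketched below with stand-in values: pass the full set of class ids via the labels argument so sklearn doesn’t have to infer them from a degenerate y. In the notebook, classes would presumably come from data.classes and probs from np.exp(log_preds):)

```python
import numpy as np
from sklearn import metrics

# Stand-ins: classes would be data.classes, probs = np.exp(log_preds).
classes = ['beagle', 'boxer', 'collie']
y = np.array([0, 0, 0])              # degenerate: only one class present
probs = np.array([[0.8, 0.1, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.7, 0.2, 0.1]])

# Passing labels explicitly is what the sklearn error message asks for
# when y contains fewer than two distinct classes.
score = metrics.log_loss(y, probs, labels=list(range(len(classes))))
print(score)
```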


(Phani Srikanth) #250

Ah, is that the case? Apologies for missing that part. Now the error makes sense. Since you have set val_idxs = [0], the validation set will have had just one record (at most a handful), and it so happens that all of those records have the same label. Sklearn enforces a constraint of at least two different labels, which is the error you see.

Anyway, if you’re training on all the data, this number will not be significant to your analysis, right? Why would you want to calculate it anyway?

Edit: I’m on my phone right now. I’ll get back on this a little while later.


(naveen manwani) #251

I’m trying to use kaggle-cli on Crestle. After kg config I encounter this error. Does anyone have any suggestions on what I should do to resolve this issue?


(naveen manwani) #252

Does anyone know how I can resolve this issue?


(sergii makarevych) #253

Did you try pip install lxml ?


(naveen manwani) #254

Yes, it says the requirement is already satisfied.


(sergii makarevych) #255

Some conflict of versions; I do not know how to solve such issues. Maybe try to re-install kaggle-cli?


(Chris Palmer) #256

Did you update kaggle-cli?


(naveen manwani) #257


Even when I try to reinstall kaggle-cli, it clearly mentions that the lxml requirement is already satisfied …
then the same error is thrown to me again.


(naveen manwani) #258

To upgrade I used the command pip install -U kaggle-cli, but I am still facing the same issue.


(Chris Palmer) #259

You have helped me see the problem. I just have one file in my valid set, and so y contains only one label.

But now I am confused about is_test = True. I thought this meant that it is going to use the images in the test folder to test against, that there should be equivalent y labels for those images, and that the TTA function would return those. But I guess they wouldn’t be a proper test set, would they, if you knew the labels!


(naveen manwani) #260

Actually guys, I solved the problem: instead of using pip install kaggle-cli, I used pip3 install kaggle-cli.


(sergii makarevych) #261

Sorry @naveenmanwani, I am out of good ideas. Only bad ones left :) Delete both kaggle-cli and lxml and install them once again?


(Phani Srikanth) #262


This line suggests that when you don’t pass is_test=True in your TTA function call, the 2nd output in the return statement will contain the targets from the validation folder - which in your case happens to be 1 record, hence the sklearn error.

However, when you pass is_test=True, you are using the images in your test folder (for which you do not know the labels) and using the model to generate predictions; in this case the library returns zeros as a placeholder for the second output, which you can consider irrelevant. You only need the first output from the TTA function call, and from those probabilities you can extract labels / probabilities as you’d like to process them downstream.
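(So downstream, only the first output matters. A sketch of turning it into a submission frame - the tiny arrays stand in for log_preds from learn.TTA(is_test=True), the breed names from data.classes, and the test file ids; whether log_preds holds log-probabilities, as the course notebooks suggest, is an assumption here:)

```python
import numpy as np
import pandas as pd

# Stand-ins for the first TTA output (log-probabilities), the class
# names, and the test image ids from the data object.
log_preds = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.2, 0.5, 0.3]]))
classes = ['beagle', 'boxer', 'collie']
test_ids = ['img_a', 'img_b']

probs = np.exp(log_preds)                 # back to probabilities
sub = pd.DataFrame(probs, columns=classes)
sub.insert(0, 'id', test_ids)             # id column first, per Kaggle format
sub.to_csv('submission.csv', index=False)
```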


(naveen manwani) #263

nbuser@jupyter:~$ kg config -g -u ##### -p ###### -c Dog Breed Identification
usage: kg config [-h] [-u USERNAME] [-p PASSWORD] [-c COMPETITION] [-g]
kg config: error: unrecognized arguments: Breed Identification
nbuser@jupyter:~$ kg config -h -u ##### -p ##### -c Dog Breed Identification -g
usage: kg [--version] [-v | -q] [--log-file LOG_FILE] [-h] [--debug]

An unofficial Kaggle command line tool.

optional arguments:
  --version            show program’s version number and exit
  -v, --verbose        Increase verbosity of output. Can be repeated.
  -q, --quiet          Suppress output except warnings and errors.
  --log-file LOG_FILE  Specify a file to log output. Disabled by default.
  -h, --help           Show help message and exit.
  --debug              Show tracebacks on errors.

Commands:
  complete     print bash completion command
  config       Set config.
  dataset      Download dataset from a specific user.
  download     Download data files from a specific competition.
  help         print detailed help for another command
  submissions  List recent submissions.
  submit       Submit an entry to a specific competition.
nbuser@jupyter:~$

Does that mean I did everything correctly, @sermakarevich?
If yes, how should I now download the dog breed dataset?
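(The “unrecognized arguments: Breed Identification” error is because kg treats everything after a space as a new argument; the -c flag expects the competition’s URL slug as a single word. Assuming the slug from the competition URL is dog-breed-identification, something like this should work, followed by the download command listed in the help output above:)

```shell
# Configure once, passing the competition slug (no spaces) taken from
# the URL https://www.kaggle.com/c/dog-breed-identification
kg config -g -u <username> -p <password> -c dog-breed-identification

# Then download the data files for the configured competition
kg download
```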