Planet Classification Challenge


(Kevin Bird) #1

Creating a thread similar to the Dog Breed Identification Challenge thread to discuss the Planet Classification Challenge and any insights, roadblocks, or scores. Since this competition has ended, I think we should be able to share code here as well?

Kaggle link: https://www.kaggle.com/c/planet-understanding-the-amazon-from-space


(Jeremy Howard) #2

Yup can share! :slight_smile:


(Apil Tamang) #3

@jeremy

Almost done with the Planet Kaggle competition… but kinda stuck at submission file generation.

One question: we agreed to use the sigmoid function for generating the output values at the very end. These values aren't normalized (each is an independent per-class probability, so they don't sum to 1). So, for one sample image, learn.TTA(is_test=True) might look something like:

[ 0.29374, 0.00393, 0.00965, 0.00235, 0.00266, 0.01863, 0.05588, 0.00071, 0.04019, 0.01701,
0.00452, 0.88296, 0.93706, 0.15011, 0.00272, 0.00258, 0.09983],

Given the numbers above, how do we know what cutoff to use to decide whether each of the 17 categories is predicted for this image?

I ask because I looked at the sample_submission.csv file, and it looks as follows:

file_9973,primary clear agriculture road water
file_9974,primary clear agriculture road water
file_9975,primary clear agriculture road water
file_9976,primary clear agriculture road water
file_9977,primary clear agriculture road water
file_9978,primary clear agriculture road water
file_9979,primary clear agriculture road water

Sorry… haven’t had the time to do more research. It’s 11 PM and my sleep-deprived (surprise, surprise!) brain is ready to call it a day.

Thnx.


(Kevin Bird) #4

sample_submission just repeats the same tags for every file; it's only a template showing what the submission file should look like. For yours, I'm guessing you'll take the ones that are greater than 0.5, so in this case:

[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]

And then you'd match those ones up to the labels they represent, so in this case it could be something like partly_cloudy and primary.
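
In code, that step might look something like this (just a sketch with numpy, assuming data.classes holds the 17 label names in index order):

import numpy as np

# the TTA output for the sample image above
probs = np.array([0.29374, 0.00393, 0.00965, 0.00235, 0.00266, 0.01863,
                  0.05588, 0.00071, 0.04019, 0.01701, 0.00452, 0.88296,
                  0.93706, 0.15011, 0.00272, 0.00258, 0.09983])

binary = (probs > 0.5).astype(int)                     # 1 wherever the score clears the threshold
tags = [data.classes[i] for i in np.where(binary)[0]]  # indices -> label names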


(James Requa) #5

For this competition, and for most multi-label problems like it, you'll need to choose an optimal "threshold": anything above it is considered a predicted label, and anything below it is not. Luckily, the fastai library has a function called opt_th that actually finds the threshold for us! You can find it in the planet.py file.

Like Kevin mentioned, you'll need to make sure your predictions are matched up with the right labels, since the submission file is actually looking for the label names, not the prediction values.
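
If you're curious what that kind of search looks like under the hood, here's a rough simplified sketch (my own version, not the actual planet.py code; it assumes sklearn's fbeta_score and a binary target matrix):

import numpy as np
from sklearn.metrics import fbeta_score

def find_threshold(preds, targs, start=0.1, end=0.4, step=0.01):
    # score each candidate threshold with the competition metric (F2)
    # and keep the one that does best on the validation set
    best_th, best_score = start, 0.0
    for th in np.arange(start, end, step):
        score = fbeta_score(targs, (preds > th).astype(int),
                            beta=2, average='samples')
        if score > best_score:
            best_th, best_score = th, score
    return best_th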


(Kevin Bird) #6

Wow, that is awesome. I was planning on using 0.5, but something tells me this might (100% definitely) be a better method.


(James Requa) #7

Yeah, a lot of people used 0.2 as a good fixed threshold for all labels in this comp; 0.5 would be too high, I think. Using opt_th should of course be even better, since the optimal threshold shouldn't be the same value for all labels.


(Apil Tamang) #8

@all

(Preamble: there's an issue with using the planet dataset that's being discussed here. There's also a way to side-step the exception trace as long as you're not producing the submission file:
https://github.com/fastai/fastai/issues/23#issuecomment-344795382)

Got my first submission in. At 0.8898 on the leaderboard, I think it's fairly okay. I barely fit the model once and focused more on getting the submission file done. Here's my lousy script that produces the file. Nothing fancy, and very Java-ish. I couldn't, for the life of me, get one of them fancy Python generators working.

# map each class index to its label name
mapp = {}
for i in range(17):
    mapp[i] = data.classes[i]

so, that mapp is:

{0: 'agriculture',
 1: 'artisinal_mine',
 2: 'bare_ground',
 3: 'blooming',
 4: 'blow_down',
 5: 'clear',
 6: 'cloudy',
 7: 'conventional_mine',
 8: 'cultivation',
 9: 'habitation',
 10: 'haze',
 11: 'partly_cloudy',
 12: 'primary',
 13: 'road',
 14: 'selective_logging',
 15: 'slash_burn',
 16: 'water'}

The script for producing the submission file would then be:

import re
from tqdm import tqdm

tta_test = learn.TTA(is_test=True)
predictions = tta_test[0]
files = list(learn.data.test_ds.fnames)

with open("planet_submission.csv.apil", "w") as f:
    f.write('image_name,tags\n')
    for i in tqdm(range(len(files))):

        # files[i] is of this form: test-jpg/test_xyze.jpg
        # Only want to extract the 'test_xyze' part for the submission
        match = re.search(r'test-jpg\/(.*)\.jpg', files[i])

        predLine = match.group(1) + ','
        prediction = predictions[i]

        for j in range(17):
            # only use the label if its score is greater than 0.2
            if prediction[j] > 0.2:
                predLine += mapp[j] + ' '

        f.write(predLine + "\n")

Now all I need to do is figure out how to leverage the opt_th function! Hmmm… Wife's really mad I'm doing nothing but coding since I got home. Gotta go to bed!!


(Jeremy Howard) #9

You can do that after the course finishes :wink:

Thanks for sharing that code. But gosh yes it is java-ish - we’ll try to help fix it up a little…
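
For instance, something along these lines does the same job (an untested sketch, same 0.2 threshold):

import os

mapp = dict(enumerate(data.classes))   # index -> label, without the explicit loop

preds = learn.TTA(is_test=True)[0]
# 'test-jpg/test_xyze.jpg' -> 'test_xyze', without the regex
fnames = [os.path.basename(f).split('.')[0] for f in learn.data.test_ds.fnames]

with open('planet_submission.csv', 'w') as f:
    f.write('image_name,tags\n')
    for fname, pred in zip(fnames, preds):
        tags = ' '.join(mapp[j] for j in range(len(pred)) if pred[j] > 0.2)
        f.write(fname + ',' + tags + '\n')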


(Hiromi Suenaga) #10

Dear Python/PyTorch gurus out there,

Is this just me, or is this something most people can reproduce? I ran it on an AWS instance created by a script from version 1 of this course, and also on a new one created via the new part 1 v2 AMI, and saw the same result on both. Also, I downloaded the data from Kaggle on each instance rather than copying it over.


(Sarada Lee) #11

I did "git pull" and "conda update --all". But I still get "No module named 'fast_gen'" in planet_cv.ipynb.


(Maureen Metzger) #12

@Moody, it might be an issue of where you are running the notebook from; this error often happens when the fastai directory is not where the notebook expects to find it, relative to your current working directory.


(Jeremy Howard) #13

Yeah, we haven't covered planet_cv yet, so it's not in working condition. You should change the imports to be the same as in our lesson1 and lesson2 notebooks. Probably some other changes are needed too.
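
Something like the import block from those notebooks (from memory, so double-check against your copy of lesson1):

from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *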


(Kevin Bird) #14

So it looks like opt_th is for the validation set. Once you're moving from validation to test, how do you determine what threshold to use?


(James Requa) #15

I haven't tested it extensively yet, but I think we have to go with the thresholds based on how they did on the validation set, since we don't really know in advance how they will perform on the test set. However, if you were to retrain with all of the data, have you checked whether opt_th still works? That would probably be an even better indication of which thresholds to use. Looking at the code now… for the targs parameter, couldn't you just feed it the y values from the whole dataset (training with all images, no validation set)?
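
Roughly what I have in mind (a sketch only; I'm assuming opt_th takes the predictions and true labels and returns the best threshold, as in planet.py):

# find the threshold that maximizes F2 on the validation set
val_preds, val_y = learn.TTA()
th = opt_th(val_preds, val_y)

# then apply that same threshold to the test-set predictions
test_preds = learn.TTA(is_test=True)[0]
binary_test = test_preds > th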


(Kevin Bird) #16

Yeah, you could figure out the optimal threshold on all of the known values. You'd compare actuals vs. predicted for all the training images, which would give you the threshold; then you'd use that as your testing threshold and hope it transfers. I think for my first submission I'm just going to use 0.2. The optimal threshold can be v2 for me.


(Debashish Panigrahi) #17

This may help for the predictions:

import os
import numpy as np
import pandas as pd

prob_preds, y = learn.TTA(is_test=True)      # TTA predictions on the test set
classes = np.array(data.classes, dtype=str)
# join the names of all classes whose probability exceeds 0.5
res = [" ".join(classes[np.where(pp > 0.5)]) for pp in prob_preds]
# strip the directory and extension from the test file names
test_fnames = [os.path.basename(f).split(".")[0] for f in data.test_ds.fnames]
test_df = pd.DataFrame(res, index=test_fnames, columns=['tags'])
test_df.to_csv('planet-amazon-from-the-space_Deb.csv', index_label='image_name')


(Kevin Bird) #18

That is such an elegant way to handle it. I've been hacking together a solution for the past hour, and this just gave me the last piece I needed. I especially liked this line:

res = [" ".join(classes[np.where(pp > 0.5)]) for pp in prob_preds] 

Short, but there's a lot going on there, and it's really powerful.
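
For anyone else unpacking it, here's roughly what happens for a single prediction vector pp (my reading of it):

mask = pp > 0.5          # boolean array, one entry per class
idxs = np.where(mask)    # indices of the classes above the threshold
names = classes[idxs]    # fancy-index into the array of class names
tags = " ".join(names)   # space-separated tag string for the CSV row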

The only change I'd recommend, based on the other conversations above, is lowering that 0.5 to somewhere around 0.2.


(Kevin Bird) #19

I was able to get a 0.93045 score on my first submission. Pretty happy with that as a starting point; it would have put me around 47th while the competition was running. My next step is going to be recreating this without looking at other code snippets (or at least keeping that to a minimum).


(James Requa) #20

Umm, why are you trying to be so modest, Kevin? That's a crazy good score for this competition!! LOL