Planet Classification Challenge


(Kevin Bird) #1

Creating a thread similar to the Dog Breed Identification Challenge thread to discuss the Planet Classification Challenge and any insights, roadblocks, or scores. Since this competition has ended, I think we should be able to share code here as well?

Kaggle link: https://www.kaggle.com/c/planet-understanding-the-amazon-from-space


(Jeremy Howard) #2

Yup can share! :slight_smile:


(Apil Tamang) #3

@jeremy

Almost done with the Planet Kaggle competition… but kinda stuck at submission file generation.

One question: we agreed to use the sigmoid function for generating the output values at the very end. These values aren't normalized (each is an independent per-class probability, so they don't sum to 1). So, for one sample image, learn.TTA(is_test=True) might look something like:

[ 0.29374, 0.00393, 0.00965, 0.00235, 0.00266, 0.01863, 0.05588, 0.00071, 0.04019, 0.01701,
0.00452, 0.88296, 0.93706, 0.15011, 0.00272, 0.00258, 0.09983],

Given the numbers above, how do we know what cutoff to use to decide whether each of the 17 categories is predicted for this image?

I ask because I looked at the sample_submission.csv file, and it looks as follows:

file_9973,primary clear agriculture road water
file_9974,primary clear agriculture road water
file_9975,primary clear agriculture road water
file_9976,primary clear agriculture road water
file_9977,primary clear agriculture road water
file_9978,primary clear agriculture road water
file_9979,primary clear agriculture road water

Sorry… haven’t had the time to do more research. It’s 11 PM and my sleep-deprived (surprise, surprise!) brain is ready to call it a day.

Thnx.


(Kevin Bird) #4

sample_submission just repeats the same tags for every file; it's only a template showing what the submission file should look like. For yours, I'm guessing you'll take the ones that are greater than 0.5, so in this case:

[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]

And then you'd match those ones up to the labels they represent, so in this case it could be something like partly_cloudy and primary.
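
In code, that step might look something like this (just a sketch with numpy, assuming data.classes holds the 17 label names in index order):

import numpy as np

# the TTA output for the sample image above
probs = np.array([0.29374, 0.00393, 0.00965, 0.00235, 0.00266, 0.01863,
                  0.05588, 0.00071, 0.04019, 0.01701, 0.00452, 0.88296,
                  0.93706, 0.15011, 0.00272, 0.00258, 0.09983])

binary = (probs > 0.5).astype(int)                     # 1 wherever the score clears the threshold
tags = [data.classes[i] for i in np.where(binary)[0]]  # indices -> label names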


(James Requa) #5

For this competition, and for most multi-label problems like it, you'll need to choose an optimal "threshold": anything above it is considered a predicted label, and anything below it is not. Luckily, the fastai library has a function called opt_th that actually finds the threshold for us! You can find it in the planet.py file.

Like Kevin mentioned, you'll need to make sure your predictions are matched up with the right labels, since the submission file is actually looking for the label names, not the prediction values.
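
If you're curious what that kind of search looks like under the hood, here's a rough simplified sketch (my own version, not the actual planet.py code; it assumes sklearn's fbeta_score and a binary target matrix):

import numpy as np
from sklearn.metrics import fbeta_score

def find_threshold(preds, targs, start=0.1, end=0.4, step=0.01):
    # score each candidate threshold with the competition metric (F2)
    # and keep the one that does best on the validation set
    best_th, best_score = start, 0.0
    for th in np.arange(start, end, step):
        score = fbeta_score(targs, (preds > th).astype(int),
                            beta=2, average='samples')
        if score > best_score:
            best_th, best_score = th, score
    return best_th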


(Kevin Bird) #6

Wow, that is awesome. I was planning on using 0.5, but something tells me this might (100% definitely) be a better method.


(James Requa) #7

Yeah, a lot of people used 0.2 as a good fixed threshold for all labels in this comp; 0.5 would be too high, I think. Using opt_th should of course be even better, since the optimal threshold shouldn't be the same value for all labels.


(Apil Tamang) #8

@all

(Preamble: there's an issue with using the planet dataset that's being discussed here. There's also a way to side-step the exception trace as long as you're not producing the submission file:
https://github.com/fastai/fastai/issues/23#issuecomment-344795382)

Got my first submission in. At 0.8898 on the leaderboard, I think it's fairly okay. I barely fit the model once and focused more on getting the submission file done. Here's my lousy script that produces the file. Nothing fancy, and very Java-ish. I couldn't, for the life of me, get one of them fancy Python generators working.

# map each class index to its label name
mapp = {}
for i in range(17):
    mapp[i] = data.classes[i]

so, that mapp is:

{0: 'agriculture',
 1: 'artisinal_mine',
 2: 'bare_ground',
 3: 'blooming',
 4: 'blow_down',
 5: 'clear',
 6: 'cloudy',
 7: 'conventional_mine',
 8: 'cultivation',
 9: 'habitation',
 10: 'haze',
 11: 'partly_cloudy',
 12: 'primary',
 13: 'road',
 14: 'selective_logging',
 15: 'slash_burn',
 16: 'water'}

The script for producing the submission file would then be:

import re
from tqdm import tqdm

tta_test = learn.TTA(is_test=True)
predictions = tta_test[0]
files = list(learn.data.test_ds.fnames)

with open("planet_submission.csv.apil", "w") as f:
    f.write('image_name,tags\n')
    for i in tqdm(range(len(files))):

        # files[i] is of this form: test-jpg/test_xyze.jpg
        # Only want to extract the 'test_xyze' part for the submission
        match = re.search(r'test-jpg\/(.*)\.jpg', files[i])

        predLine = match.group(1) + ','
        prediction = predictions[i]

        for j in range(17):
            # only use the label if its score is greater than 0.2
            if prediction[j] > 0.2:
                predLine += mapp[j] + ' '

        f.write(predLine + "\n")

Now all I need to do is figure out how to leverage the opt_th function! Hmmm… Wife's really mad I'm doing nothing but coding since I got home. Gotta go to bed!!


(Jeremy Howard) #9

You can do that after the course finishes :wink:

Thanks for sharing that code. But gosh yes it is java-ish - we’ll try to help fix it up a little…
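
For instance, something along these lines does the same job (an untested sketch, same 0.2 threshold):

import os

mapp = dict(enumerate(data.classes))   # index -> label, without the explicit loop

preds = learn.TTA(is_test=True)[0]
# 'test-jpg/test_xyze.jpg' -> 'test_xyze', without the regex
fnames = [os.path.basename(f).split('.')[0] for f in learn.data.test_ds.fnames]

with open('planet_submission.csv', 'w') as f:
    f.write('image_name,tags\n')
    for fname, pred in zip(fnames, preds):
        tags = ' '.join(mapp[j] for j in range(len(pred)) if pred[j] > 0.2)
        f.write(fname + ',' + tags + '\n')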


(Hiromi Suenaga) #10

Dear Python/PyTorch gurus out there,

Is this just me, or is this something most people can reproduce? I ran it on an AWS instance created by a script from version 1 of this course, and also on a new one created via the new part 1 v2 AMI, and saw the same result on both. Also, I downloaded the data from Kaggle on each instance rather than copying it over.


(Sarada Lee) #11

I did "git pull" and "conda update --all". But I still get "No module named 'fast_gen'" in planet_cv.ipynb.


(Maureen Metzger) #12

@Moody, it might be an issue of where you are running the notebook from; this error often happens when the fastai directory is not where the notebook expects to find it, relative to your current working directory.


(Jeremy Howard) #13

Yeah, we haven't covered planet_cv yet, so it's not in working condition. You should change the imports to be the same as in our lesson1 and lesson2 notebooks. Probably some other changes are needed too.
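
Something like the import block from those notebooks (from memory, so double-check against your copy of lesson1):

from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *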


(Kevin Bird) #14

So it looks like opt_th is for the validation set. Once you're moving from validation to test, how do you determine what threshold to use?


(James Requa) #15

I haven't tested it extensively yet, but I think we have to go with the thresholds based on how they did on the validation set, since we don't really know in advance how they will perform on the test set. However, if you were to retrain with all of the data, have you checked whether opt_th still works? That would probably be an even better indication of which thresholds to use. Looking at the code now… for the targs parameter, couldn't you just feed it the y values from the whole dataset (training with all images, no validation set)?
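
Roughly what I have in mind (a sketch only; I'm assuming opt_th takes the predictions and true labels and returns the best threshold, as in planet.py):

# find the threshold that maximizes F2 on the validation set
val_preds, val_y = learn.TTA()
th = opt_th(val_preds, val_y)

# then apply that same threshold to the test-set predictions
test_preds = learn.TTA(is_test=True)[0]
binary_test = test_preds > th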


(Kevin Bird) #16

Yeah, you could figure out the optimal threshold on all of the known values. You'd compare actuals vs. predicted for all the training images, which would give you the threshold; then you'd use that as your testing threshold and hope it transfers. I think for my first submission I'm just going to use 0.2. The optimal threshold can be v2 for me.


(Debashish Panigrahi) #17

This may help for the predictions:

import os
import numpy as np
import pandas as pd

prob_preds, y = learn.TTA(is_test=True)      # TTA predictions on the test set
classes = np.array(data.classes, dtype=str)
# join the names of all classes whose probability exceeds 0.5
res = [" ".join(classes[np.where(pp > 0.5)]) for pp in prob_preds]
# strip the directory and extension from the test file names
test_fnames = [os.path.basename(f).split(".")[0] for f in data.test_ds.fnames]
test_df = pd.DataFrame(res, index=test_fnames, columns=['tags'])
test_df.to_csv('planet-amazon-from-the-space_Deb.csv', index_label='image_name')


(Kevin Bird) #18

That is such an elegant way to handle it. I've been hacking together a solution for the past hour, and this just gave me the last piece I needed. I especially liked this line:

res = [" ".join(classes[np.where(pp > 0.5)]) for pp in prob_preds] 

Short, but there's a lot going on there, and it's really powerful.
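
For anyone else unpacking it, here's roughly what happens for a single prediction vector pp (my reading of it):

mask = pp > 0.5          # boolean array, one entry per class
idxs = np.where(mask)    # indices of the classes above the threshold
names = classes[idxs]    # fancy-index into the array of class names
tags = " ".join(names)   # space-separated tag string for the CSV row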

The only change I'd recommend, based on the other conversations above, is lowering that 0.5 to somewhere around 0.2.


(Kevin Bird) #19

I was able to get a 0.93045 score on my first submission. Pretty happy with that as a starting point; it would have put me around 47th while the competition was running. My next step is going to be recreating this without looking at other code snippets (or at least keeping that to a minimum).


(James Requa) #20

Umm, why are you trying to be so modest, Kevin? That's a crazy good score for this competition!! LOL