I am sorry @alessa, what graphs are you referring to? - update: found them.
The first chart is just a plt.hist of image widths, the second of image heights, and the third a plt.scatter of width vs. height. To build the fourth I use OOF predictions on the train set and calculate the score using only the images that satisfy a criterion: smaller than some size, or larger than some size.
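The four charts above can be sketched as follows, on synthetic stand-in data (the real widths, heights, OOF predictions, and labels would come from your own pipeline; the score-by-threshold logic is an illustrative reconstruction):

```python
# Illustrative reconstruction of the four charts described above.
import matplotlib
matplotlib.use("Agg")            # no display needed for this sketch
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
widths  = rng.integers(100, 1000, 500)   # stand-in image widths
heights = rng.integers(100, 1000, 500)   # stand-in image heights
oof_preds = rng.integers(0, 12, 500)     # stand-in OOF class predictions
labels    = rng.integers(0, 12, 500)     # stand-in true labels

plt.figure(); plt.hist(widths)               # chart 1: width histogram
plt.figure(); plt.hist(heights)              # chart 2: height histogram
plt.figure(); plt.scatter(widths, heights)   # chart 3: width vs height

# chart 4: OOF accuracy restricted to images below each size threshold
thresholds = [t for t in range(150, 1000, 50) if (widths < t).any()]
scores = [float((oof_preds[widths < t] == labels[widths < t]).mean())
          for t in thresholds]
plt.figure(); plt.plot(thresholds, scores)
```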
I was able to do well, with success coming down to: 1. ensembling what I found to be the strongest-performing architectures (resnet50 and nasnet), 2. spending time fine-tuning hyperparameters and image sizes, and 3. running k-fold cross-validation, more than once. I think these are good steps for any serious attempt at climbing the leaderboard in any similar competition, and it was a good starter learning experience. The competition has closed, but it remains a good one to practise these skills.
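The recipe above (k folds, multiple architectures, averaged predictions) can be sketched framework-agnostically; `train_fns` and the model's `.predict` method here are hypothetical stand-ins for your actual fastai training code:

```python
import numpy as np

def kfold_ensemble(X, y, X_test, train_fns, k=5, seed=42):
    """Average test predictions over k folds and several architectures.

    train_fns: list of hypothetical callables, one per architecture
    (e.g. resnet50, nasnet); each takes (X_tr, y_tr) and returns an
    object with a .predict(X) -> (n, n_classes) probability method.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    test_preds = []
    for fold in folds:
        tr = np.setdiff1d(idx, fold)       # train on everything but this fold
        for train_fn in train_fns:
            model = train_fn(X[tr], y[tr])
            test_preds.append(model.predict(X_test))
    return np.mean(test_preds, axis=0)     # simple probability average
```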
Here is the notebook snippet from my first attempt at this competition. I think it's a good place to start.
Just vanilla fastai tips. No cross-validation, ensembling, or segmentation of any sort. I just kept an eye on the losses, nothing more. I haven't added any documentation because I simply followed Jeremy's tips. If you need an explanation, just ping me and I'll add it.
0.988 accuracy. Around 0.97 in public leaderboard.
There were two classes in this problem that had some kind of correlation with each other, and most of the errors were due to this. I want my model to concentrate more on distinguishing these two classes. How should I approach the problem?
some intuition that I had …
So, can I train a model specifically for these two classes and ensemble it with my main predictions (updating only these two classes)? Or do you recommend another approach?
I did not do that; I just blended multiple models, each of which predicted all classes at once. But I have no idea whether your approach will work; just give it a try.
@SHAR1 Try oversampling (duplicating all the images of) Black-grass in your training set. It has half the number of samples of Loose Silky-bent, which is the other class it gets confused with.
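The duplication suggested above can be sketched like this (assumes you have lists of file paths and labels; with Black-grass at roughly half the count of Loose Silky-bent, doubling it approximately balances the two):

```python
import numpy as np

def oversample(filepaths, labels, target_class):
    """Duplicate every sample of `target_class` once, roughly doubling it.

    Returns new (filepaths, labels) arrays; the duplicated entries will be
    re-augmented independently at training time.
    """
    filepaths = np.asarray(filepaths)
    labels = np.asarray(labels)
    mask = labels == target_class
    new_paths = np.concatenate([filepaths, filepaths[mask]])
    new_labels = np.concatenate([labels, labels[mask]])
    return new_paths, new_labels
```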
This may be the wrong approach, but I got 0.98740 with vanilla Resnet50, top-down augmentation, and incorporating the validation set at the end.
My first Kaggle competition. Got 0.97858 with Resnet50. No cross-validation.
I haven't done anything special, but it's nice to get good results with so little experience. It gives me motivation to move forward, and it was fun )
In this competition, my first on Kaggle, I obtained 0.98614 (on the public leaderboard, for what that means)!! Thanks to Jeremy's tips and the fastai software.
I used only Resnet50; resnext gave me a memory error and I was not able to load nasnet.
I wanted to use all the examples but did not find any solution other than reducing val_pct. I still do not know how to train on all the examples.
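One common workaround is to hold out only a single image, so the framework still gets a validation set while you effectively train on all the data; a minimal sketch of the index split (with fastai 0.7 the same idea is usually expressed by passing a tiny validation-index list instead of a val_pct fraction):

```python
import numpy as np

def almost_all_train_split(n, n_val=1, seed=0):
    """Hold out only `n_val` of `n` samples as 'validation'.

    The reported validation metrics become meaningless, but nearly
    every example contributes to training.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return idx[n_val:], idx[:n_val]   # (train indices, val indices)
```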
I performed about 5-6 trainings, then checked the wrongly classified examples. Most of the time the errors came from misclassification between classes 0 and 6 (sorry, I don't remember the class names now), but one model was different: it worked better on 0 vs. 6 and worse on another pair. So, finally, I created an ensemble of just those two classifiers.
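A two-classifier ensemble like the one described above can be sketched in a couple of ways (the probability arrays and the class indices 0 and 6 are illustrative; a plain average is the simplest blend):

```python
import numpy as np

def blend(probs_a, probs_b, w=0.5):
    """Weighted average of two models' (n, n_classes) probabilities."""
    return w * probs_a + (1 - w) * probs_b

def selective_blend(probs_a, probs_b, special=(0, 6)):
    """Use model B's prediction only where B votes for a class it
    handles better (e.g. the 0-vs-6 pair); keep model A elsewhere."""
    out = probs_a.copy()
    use_b = np.isin(probs_b.argmax(axis=1), special)
    out[use_b] = probs_b[use_b]
    return out
```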
I would really like to know how you debug the code. I'm working with Spyder and this is really a pain.
It seems impossible to set a breakpoint, check some values, and continue.
Is there any IDE that handles debugging decently?
change the code:
from glob2 import glob --> from glob import glob
for image in glob("train/*/*.png"): --> for image in glob("{}/train/*/*.png".format(PATH)):
Did you solve your problem? I'm also trying to use an f1 metric in m.fit(lr, 3, metrics=[f1]), but it gives the following error. Any advice?
RuntimeError: invalid argument 3: sizes do not match at c:\anaconda2\conda-bld\pytorch_1519501749874\work\torch\lib\thc\generated…/THCTensorMathCompareT.cuh:65
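That error usually means the metric is comparing the raw (batch, n_classes) model outputs against (batch,) targets. A sketch of an f1 that collapses the outputs with argmax first (macro-F1 written out in NumPy so it doesn't depend on sklearn; assumes the metric receives per-class scores and integer labels, as fastai 0.7 metrics do):

```python
import numpy as np

def f1_macro(log_preds, targs):
    """Macro-averaged F1 from raw model outputs.

    log_preds: (batch, n_classes) array of (log-)probabilities
    targs:     (batch,) array of integer class labels
    """
    preds = np.argmax(log_preds, axis=1)   # collapse to class ids first
    targs = np.asarray(targs)
    f1s = []
    for c in np.unique(targs):
        tp = np.sum((preds == c) & (targs == c))
        fp = np.sum((preds == c) & (targs != c))
        fn = np.sum((preds != c) & (targs == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec  = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```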
Did anybody try to predict single images downloaded from the internet?
My model has ~96.5% accuracy. I tried 10 pictures from Google Images that were somewhat similar to the pictures in the training set (I tried to pick ones with a single leaf, soil background, etc.). But the model predicted all of them wrong!
I suspected there was something wrong with my code, so I tried several pictures from the validation set; all were predicted correctly except for the few I know were confusing for the model, which is why it has 96% accuracy.
It is weird that not a single picture from the internet was predicted well…
Does that mean the model is overfit to the Kaggle dataset, because the dataset does not have variety in backgrounds, lighting, etc.?
I followed Jeremy's dog-breeds template, and the single-file prediction follows Jeremy's code:
Should we do normalization on the pictures before making a prediction?
I thought this was already done as part of fastai, so I did not make any changes to the pictures.
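For reference, fastai's transform pipeline does normalize with the pretrained model's training statistics, which is why feeding a raw image array straight to the model misbehaves; the safest route is to run the same saved validation transforms on the new image before predicting. A framework-agnostic sketch of what that normalization does (the mean/std values are the standard ImageNet statistics a pretrained Resnet50 expects):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD  = np.array([0.229, 0.224, 0.225])

def normalize_image(img_uint8):
    """Scale an (H, W, 3) uint8 image to [0, 1], then apply the
    per-channel mean/std normalization used during training."""
    img = img_uint8.astype(np.float32) / 255.0
    return (img - IMAGENET_MEAN) / IMAGENET_STD
```

If the downloaded image skips this step (or the resize/crop the validation transforms also apply), the inputs sit in a completely different range than anything the model saw during training, which alone can produce uniformly wrong predictions.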