Kaggle Humpback Whale Identification

Hi,

I am quite new to ML and AI.
I have watched the 2018 course, mostly the image classification part. I managed to take part in the Histopathologic Cancer Detection competition on Kaggle; I'm in 128th place now.

Now I am tackling Humpback Whale Identification on Kaggle with the new fastai v1.0 library.
My GitHub is here:

Could someone please check the errors I get with interp = ClassificationInterpretation.from_learner(learn)?

Greetings,
Tom

It's not clear which platform you are using; check out show_install. I had issues on Mac.

Thanks! I use Anaconda on Ubuntu, and it seems like everything installed cleanly. But I'll cross-check :slight_smile:
I put install.log in the GitHub repo; all seems OK.

I will try once my current job finishes

Thanks!!
I forgot to mention that I ran lesson 1 of the 2019 course and it works perfectly.

New commit in the GitHub repo.
I changed the ImageDataBunch.from_csv code and I get a warning:
/home/tom/anaconda3/lib/python3.7/site-packages/fastai/data_block.py:475: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
w_39ea8fa, w_d8ae71c, w_ef62b09, w_5966e55, w_b07ec5d…
if getattr(ds, 'warn', False): warn(ds.warn)

I think those rejected items are probably greyscale…
interp.plot_top_losses works now.
Tom

This means your valid set contains labels that aren’t in the train set, which happens because this competition dataset has some labels/categories/whales with just one image.
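
You can confirm this quickly with pandas (a minimal sketch; train.csv with its Id column is the competition's standard label file):

import pandas as pd

df = pd.read_csv('train.csv')
counts = df['Id'].value_counts()
# whales with a single image can only land entirely in train or entirely in valid
print((counts == 1).sum(), 'whales have just one image')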

It’d be nice if the warning message was friendlier, but the essential info is there in the message.

The best place to get help about a Kaggle challenge is on Kaggle forums. There are plenty of fastai users there to assist. Here is an excellent place to start https://www.kaggle.com/c/humpback-whale-identification/discussion/74647


I get an exception

Exception: Your validation data contains a label that isn't present in the training set, please fix your data.

I guess we need to do some preprocessing, or is there a nice flag to skip these small categories? I've spent 10 minutes reading various threads with no solution.

Thanks! It’s clear.
Very specific competition :slight_smile:

Nearly all the non-playground Kaggle challenges are specific in some way. That's why the sponsors are crowdsourcing solutions. It's also why they are great challenges for students/researchers/practitioners, as you have to use lateral thinking and discipline as well as the latest DL techniques. Persevere and enjoy!


I just started on lesson 1 of the 2019 course. Just wondering if they are going to go over scenarios where you have labels with only one to a few images?

Just want to add to this part.
After a bit of research (understanding the fastai lib better), it seems that when you do a random split, random_split_by_pct() will actually complain if you don't have enough data to support a class.

So I dug into the source file; here is the link

It seems that if you do a 20% split and don't have 5 items for a class:

cut = int(valid_pct * len(self))
return self.split_by_idx(rand_idx[:cut])

To get cut = 1 you need len(self) = 5, since int(0.2 * 5) = 1.
Otherwise you return an empty list, because rand_idx[:0] is [].
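
A quick standalone illustration of that arithmetic (hypothetical values, same computation as the source above):

valid_pct = 0.2
for n in range(1, 7):
    cut = int(valid_pct * n)
    print(n, cut)  # n = 1..4 gives cut = 0, so rand_idx[:0] == []; n = 5 gives cut = 1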

I guess there are 2 solutions:

  1. Drop the rare cases (not applicable in this case…)
  2. Fill the gaps in the train dataset so each class has enough data (still hard: 5k+ unique labels out of 20k+ images)

Or you could find a smart way to supply the validation set (manually label some data as valid… etc.)

I have figured out an easy way if you want to get things going quickly…

src = (ImageItemList.from_csv('../input/', 'train.csv', folder='train')
       .no_split()
       .label_from_df())

data = (src.transform(get_transforms(), size=224)
        .databunch(num_workers=0)
        .normalize(imagenet_stats))

You can do no split, so now you don't have a validation set. That's not good, since you can't tell how your model is doing (bias? overfitting?).

But at least you can get things going.

:grinning: I will try to figure out a way to supply the validation set and post back.


Thank you for the interesting read! Another approach, suggested by Rob in https://www.kaggle.com/c/humpback-whale-identification/discussion/74647, is a Siamese network, where you test whether two images are of the same object.
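
For anyone curious, a minimal PyTorch sketch of the Siamese idea (hypothetical names, just a shared encoder plus a distance head, not a complete solution):

import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, encoder, emb_dim=512):
        super().__init__()
        self.encoder = encoder              # shared backbone applied to both images
        self.head = nn.Linear(emb_dim, 1)   # outputs a same/different logit

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)
        return self.head(torch.abs(e1 - e2))  # compare embeddings element-wise

You train it on pairs labelled same/different, then identify a test whale by comparing it against reference images of known whales.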

Sounds good. One thing that came to my mind when I saw this post yesterday was a Siamese network, as it can solve one-shot learning / verification smoothly (as far as I know).

I know how to implement it in keras, but I don't really know how to create layers using the fastai lib.
Thank you so much for providing such a great example :slight_smile:

@heye0507 Perhaps it is a good candidate for augmentation also

Hi Tom,
Have you solved the problem of train/validation split?
Validation should contain images that are not in train, but many whales have only one image. I have it working in keras (easy), but I am new to fastai and do not know how to make a custom validation set (I need to take 20% of the images, but only from whales that have more than 4 pictures).
Dmitry


I made one in a Kaggle kernel; if you want, you can take a look.

https://www.kaggle.com/heye0507/prepare-databunch-with-fastai-1-0

Basically, I sampled each whale with number of images * 20%; if that is more than 0.5, take 1 image for valid.
So with 3 images, 3 * 0.2 = 0.6, split 1 to valid;
with 2 images, 2 * 0.2 = 0.4, keep both in train…
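
In pandas, that rule looks roughly like this (a sketch; it assumes the standard train.csv with Image and Id columns, and is_valid / valid_idx are names I made up):

import pandas as pd

df = pd.read_csv('train.csv')
df['is_valid'] = False
for whale, grp in df.groupby('Id'):
    n_valid = round(len(grp) * 0.2)  # 3 images -> 1 goes to valid; 2 images -> 0, both stay in train
    if n_valid > 0:
        df.loc[grp.sample(n_valid, random_state=42).index, 'is_valid'] = True
valid_idx = df.index[df['is_valid']].tolist()

You can then pass valid_idx to split_by_idx instead of using no_split.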

With fastai ResNet-50 and some tuning, you can reach about 0.6+ on MAP@5, which is about top 50% on the leaderboard.

But if you want something more, you probably need to implement a Siamese network (fastai + PyTorch?).
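
For reference, a minimal sketch of the MAP@5 metric mentioned above (assuming one true label per image and up to five ordered guesses):

def map5(preds, targs):
    # preds: list of up-to-5 predicted labels per image, best guess first
    # targs: the true label for each image
    total = 0.0
    for p, t in zip(preds, targs):
        if t in p[:5]:
            total += 1.0 / (p.index(t) + 1)  # score by rank of the correct guess
    return total / len(targs)

print(map5([['w_1', 'w_2']], ['w_2']))  # 0.5: correct whale in second place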


PERFECT! Thank you

Thanks, great idea! I had failed to make progress.

I have done that competition before (the playground version), and I have many ideas about how to proceed and what works and what doesn't. I know how to do it in keras but not in fastai. I think fastai's layer-level learning rates are a big advantage: I need them in keras.