Kaggle Humpback Whale Identification

This means your valid set contains labels that aren’t in the train set, which happens because this competition dataset has some labels/categories/whales with just one image.

It’d be nice if the warning message was friendlier, but the essential info is there in the message.

The best place to get help about a Kaggle challenge is on Kaggle forums. There are plenty of fastai users there to assist. Here is an excellent place to start https://www.kaggle.com/c/humpback-whale-identification/discussion/74647

1 Like

I get an exception

Exception: Your validation data contains a label that isn't present in the training set, please fix your data.

I guess we need to do some preprocessing, or is there a nice flag to skip these small categories? I’ve spent 10 minutes reading various threads with no solution.

Thanks! It’s clear.
Very specific competition :slight_smile:

Nearly all the non-playground Kaggle challenges are specific in some way. That’s why the sponsors are crowd sourcing for solutions. And also why they are great challenges for students/researchers/practitioners, as you have to use lateral thinking and discipline as well as the latest DL techniques. Persevere and enjoy!

3 Likes

I just started on lesson 1 of the 2019 course. Just wondering, are they going to go over scenarios where you have labels with only one or a few images?

Just want to add to this part.
After a bit of research (understanding the fastai lib better), it seems that when you do a random split, random_split_by_pct() will actually complain if you don’t have enough data to support a class.

So I dug into the source file, here is the link

It seems that with a 20% split, a class with fewer than 5 images contributes nothing to the validation set:
cut = int(valid_pct * len(self))
return self.split_by_idx(rand_idx[:cut])

To have cut = 1, you need len(self) = 5, since 0.2 * 5 = 1.
Otherwise you are returning an empty list: rand_idx[:0] is [].
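That threshold can be sketched directly (treating each class in isolation, as the post does; `valid_cut` is just a hypothetical helper name, not a fastai function):

```python
# Sketch of the split logic quoted above: cut = int(valid_pct * len(self)).
# With valid_pct = 0.2, a group needs at least 5 items before the
# validation slice stops being empty.
def valid_cut(n_items, valid_pct=0.2):
    """How many items would land in the validation slice."""
    return int(valid_pct * n_items)

for n in (2, 3, 4, 5, 10):
    print(n, valid_cut(n))  # 4 items -> 0, 5 items -> 1, 10 items -> 2
```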

2 solutions I guess

  1. you drop the rare cases (not applicable in this case…)
  2. you fill the gaps in the train dataset so each class has enough data (still hard: 5k+ unique labels out of 20k+ images)

Or you could find a smart way to supply the validation set (manually label some data as valid…etc)
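A minimal sketch of option 1, assuming the competition CSV’s column layout (columns `Image` and `Id`) and a tiny made-up frame in place of the real data:

```python
import pandas as pd

# Toy stand-in for train.csv; the real file has 20k+ rows.
df = pd.DataFrame({
    'Image': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
    'Id':    ['w_1',   'w_1',   'w_2',   'w_3'],
})

# Keep only whales that appear more than once, so a random split
# can always leave at least one image of each class in train.
counts = df['Id'].value_counts()
keep = df[df['Id'].isin(counts[counts > 1].index)]
print(keep['Id'].tolist())  # singleton whales w_2 and w_3 are dropped
```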

I have figured out an easy way if you want to get things going quickly…

src = (ImageItemList.from_csv('…/input/', 'train.csv', folder='train')
       .no_split()
       .label_from_df())

data = (src.transform(get_transforms(),size=224)
.databunch(num_workers=0)
.normalize(imagenet_stats))

You can do no split, so now you don’t have a validation set. That’s not ideal, since you can’t tell how your model is doing (bias? overfitting?).

But at least you can get the things going.

:grinning:I will try to figure out a way to supply the validation set and post back.

3 Likes

Thank you for the interesting read! Another approach, suggested by Rob at https://www.kaggle.com/c/humpback-whale-identification/discussion/74647, is a Siamese network, where you test whether two images are of the same object.

Sounds good. One thing that came to my mind when I saw this post yesterday was a Siamese network, as it can handle one-shot learning / verification smoothly (as far as I know).

I know how to implement it in Keras, but don’t really know how to create the layers using the fastai lib.
Thank you so much for providing such a great example :slight_smile:
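For anyone else curious, the Siamese idea mentioned above can be sketched in plain PyTorch (this is a hedged toy sketch, not the fastai API; a real model would use a pretrained CNN body as the shared encoder):

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Embed two images with one shared encoder and compare embeddings,
    so whales unseen at train time can still be matched at test time."""
    def __init__(self, emb_dim=32):
        super().__init__()
        # Tiny stand-in encoder; substitute e.g. a ResNet body in practice.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, emb_dim),
        )

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)
        # Small distance -> probably the same whale.
        return torch.pairwise_distance(e1, e2)

net = SiameseNet()
a, b = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
dist = net(a, b)
print(dist.shape)  # one non-negative distance per image pair
```

Training would then pull same-whale pairs together and push different-whale pairs apart (e.g. with a contrastive or triplet loss).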

@heye0507 Perhaps it is a good candidate for augmentation also

Hi Tom,
Have you solved the problem of the train/validation split?
Validation should contain images that are not in train, but many whales only have one image. I have it working in Keras (easy), but I am new to fastai and do not know how to make a custom validation set (I need to take 20% of the images, but only from whales that have more than 4 pictures).
Dmitry

1 Like

I made one on kaggle kernel, if you want, you can take a look.

https://www.kaggle.com/heye0507/prepare-databunch-with-fastai-1-0

Basically I sampled each whale with (number of images × 20%); if that’s more than 0.5, I take 1.
So with 3 images, 3 × 0.2 = 0.6, split 1 to valid;
with 2 images, 2 × 0.2 = 0.4, keep in train…
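That sampling rule can be sketched like this (toy data; the real kernel works on the full train.csv, and the column names follow the competition layout):

```python
import pandas as pd

# Toy frame: three whales with 3, 2, and 1 images respectively.
df = pd.DataFrame({
    'Image': [f'img{i}.jpg' for i in range(6)],
    'Id':    ['w_1', 'w_1', 'w_1', 'w_2', 'w_2', 'w_3'],
})

valid_idx = []
for _, grp in df.groupby('Id'):
    # n_images * 0.2, rounded to nearest: a fraction >= 0.5 rounds up to 1.
    n_valid = int(len(grp) * 0.2 + 0.5)
    # Take the first images for brevity; a real split would sample randomly.
    valid_idx += list(grp.index[:n_valid])

print(valid_idx)  # only the 3-image whale contributes a validation image
```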

With fastai ResNet-50 and some tuning, you can reach about 0.6+ on MAP@5, which is about top 50% on the leaderboard.

But if you want something more, you probably need to implement a Siamese network (fastai + PyTorch?).

2 Likes

PERFECT! Thank you

Thanks, great idea! I had failed to make progress.

I have done that competition before (the playground version), and I have many ideas on how to proceed and what works and what doesn’t. I know how to do it in Keras but not in fastai. I think fastai’s layer-level learning rates are a big advantage: I miss them in Keras.

I found a solution on another thread (TabularDataBunch Error: "Your validation data contains a label that isn't present in the training set, please fix your data.")

# after loading the dataset, grab the targets and make a sorted unique list
classes = df['SalePrice'].unique()
classes.sort()

# later, pass that list so the labels are treated as categorical values
.label_from_df(cols=dep_var, classes=classes)

Are you using the Kaggle kernels for this? I tried to use the kernel for the Cancer detection one and I can’t save out my predictions because it always freezes when I try to commit my kernel after I start training.

I never managed to successfully commit the Kaggle kernel. I run them on my own machine and on Colab without any problems.

Ok, cool. That’s what I was wondering. I have my own machine, but I use it for work too, and I stupidly committed only 80 GB to my Linux partition when I split it a year ago. So downloading 10 GB of data pretty much eats up all my free space. I figured Kaggle’s kernels might be a way for me to test it out, but I guess not. I might look into the Google Colab setup for trying out Kaggle competitions. I appreciate the feedback, as I was getting extremely frustrated wasting hours of my time only to get the commit error :slight_smile:

Oh I see! Colab is quite slow, but it works. And it’s free.

Hey, Kaggle kernels have worked perfectly fine for me so far. Also if you want to alter your partitions you can use an application such as GParted. Check out its tutorials online, it is pretty easy to use.
Cheers