This means your valid set contains labels that aren’t in the train set, which happens because this competition dataset has some labels/categories/whales with just one image.
It’d be nice if the warning message was friendlier, but the essential info is there in the message.
Exception: Your validation data contains a label that isn't present in the training set, please fix your data.
I guess we need to do some preprocessing, or is there a nice flag to skip these small categories? I’ve spent 10 minutes reading various threads without finding a solution.
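For anyone hitting the same error, here is a minimal sketch of the kind of preprocessing that could help: count the images per label and see which labels have only one. The column names "Image" and "Id" and the toy rows are assumptions for illustration, not taken from this thread.

```python
import pandas as pd

# Toy stand-in for the competition's training CSV; the column names
# "Image" and "Id" are assumptions for this example.
df = pd.DataFrame({
    "Image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg"],
    "Id":    ["w_1",   "w_1",   "w_1",   "w_2",   "w_3",   "w_3"],
})

# Number of images per label.
counts = df["Id"].value_counts()

# Labels with a single image can never appear in both train and valid.
single_image_labels = sorted(counts[counts == 1].index)

# One option: keep only labels with at least `min_images` images before splitting.
min_images = 2
df_filtered = df[df["Id"].map(counts) >= min_images].reset_index(drop=True)
```

Dropping the single-image labels means the model never sees those whales at all, so an alternative is to keep them and make sure they only ever land in the training set.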
Nearly all the non-playground Kaggle challenges are specific in some way. That’s why the sponsors are crowd sourcing for solutions. And also why they are great challenges for students/researchers/practitioners, as you have to use lateral thinking and discipline as well as the latest DL techniques. Persevere and enjoy!
I just started on lesson 1 of the 2019 course. Just wondering if they are going to go over scenarios where a label has only one to a few images?
Just want to add to this part.
After a bit of research (getting to understand the fastai library better), it seems that when you do a random split, random_split_by_pct() will actually complain if you don’t have enough data to support a class.
Sounds good. One thing that came to my mind when I saw this post yesterday was a Siamese network, as it can handle one-shot learning / verification smoothly (as far as I know).
I know how to implement it in keras, but don’t really know how to create layers using fastai lib.
Thank you so much for providing such a great example
Hi Tom,
Have you solved the problem of train/validation split?
Validation should contain images which are not in train, but many whales only have one image. I have it working in keras (easy), but I am new to fastai and do not know how to make a custom validation set (I need to take 20% of the images, but only from whales that have more than 4 pictures).
Dmitry
Basically, for each whale I computed number of images * 20%; if the result is more than 0.5, I take 1 image for validation.
So with 3 images, 3 * 0.2 = 0.6, so 1 goes to valid;
with 2 images, 2 * 0.2 = 0.4, so everything stays in train…
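The per-whale rule above could be sketched roughly like this. It is only my reading of the description, not the poster’s actual code; the function name per_label_valid_mask and the toy labels are made up for illustration, and I am assuming that "more than 0.5" generalises to ordinary rounding for whales with more images.

```python
import numpy as np
import pandas as pd

def per_label_valid_mask(labels, valid_frac=0.2, seed=0):
    """Return a boolean array that is True for rows chosen for validation.

    For each label with n images, round(n * valid_frac) images are sampled
    for validation, so labels with 1 or 2 images stay entirely in train.
    (Hypothetical helper, not part of fastai.)
    """
    labels = pd.Series(list(labels))
    rng = np.random.default_rng(seed)
    is_valid = np.zeros(len(labels), dtype=bool)
    # `indices` maps each label to the integer positions of its rows.
    for _, positions in labels.groupby(labels).indices.items():
        n_valid = int(len(positions) * valid_frac + 0.5)  # round to nearest
        if n_valid:
            chosen = rng.choice(positions, size=n_valid, replace=False)
            is_valid[chosen] = True
    return is_valid

# Toy labels: whales with 1, 2, 3 and 10 images.
labels = ["a"] + ["b"] * 2 + ["c"] * 3 + ["d"] * 10
mask = per_label_valid_mask(labels)
valid_labels = {l for l, v in zip(labels, mask) if v}
train_labels = {l for l, v in zip(labels, mask) if not v}
```

With this rule every label that ends up in validation also keeps at least one image in train, which is what the exception is asking for. The mask could then be stored as a DataFrame column and handed to whatever splitting hook your fastai version offers (I believe v1 has split_from_df for this, but check the docs for your version).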
With a fastai resnet50 and some tuning, you can reach about 0.6+ on MAP@5, which is about top 50% on the leaderboard.
But if you want something more, you probably need to implement a Siamese network (fastai + PyTorch?).
I have done that competition before (playground version), and I have many ideas about how to proceed and what works and what does not. I know how to do it in keras but not in fastai. I think fastai’s layer-level learning rate is a big advantage: I need it in keras.
# After loading the dataset, grab the targets and build a sorted array of unique values
classes = df['SalePrice'].unique()
classes.sort()
# Later, pass that array so the labels are treated as categorical values
.label_from_df(cols=dep_var, classes=classes)
Are you using the Kaggle kernels for this? I tried to use the kernel for the Cancer detection one and I can’t save out my predictions because it always freezes when I try to commit my kernel after I start training.
Ok, cool. That’s what I was wondering. I have my own machine, but I use it for work too, and I stupidly only committed 80 GB of disk space to my Linux partition when I split it a year ago. So downloading 10 GB worth of data pretty much eats up all my free space. I figured Kaggle’s kernels might be a way for me to test it out, but I guess not. I might look into the Google Colab setup for trying out Kaggle competitions. I appreciate the feedback, as I was getting extremely frustrated wasting hours of my time only to get the commit error.
Hey, Kaggle kernels have worked perfectly fine for me so far. Also if you want to alter your partitions you can use an application such as GParted. Check out its tutorials online, it is pretty easy to use.
Cheers