I am quite new at ML and AI
I have watched the 2018 courses, mostly the image classification part. I managed to take part in the Histopathologic Cancer Detection competition on Kaggle; I'm in 128th place now.
Now I'm tackling Humpback Whale Identification
on Kaggle with the new fastai v1.0 library.
My GitHub is here:
Could someone please check the errors I get with interp = ClassificationInterpretation.from_learner(learn)?
It's not clear which platform you are using; check out show_install. I had issues on Mac.
Thanks! I use Anaconda on Ubuntu; it seems everything installed cleanly. But I'll cross-check.
I put install.log in the GitHub repo; all seems OK.
I will try once my current job finishes
I forgot to mention that I ran lesson 1 of the 2019 course and it works perfectly.
New commit in the GitHub repo.
I changed the ImageDataBunch.from_csv code and I get a warning:
/home/tom/anaconda3/lib/python3.7/site-packages/fastai/data_block.py:475: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the folowing unknown labels, the corresponding items have been discarded.
w_39ea8fa, w_d8ae71c, w_ef62b09, w_5966e55, w_b07ec5d…
if getattr(ds, 'warn', False): warn(ds.warn)
I think those rejects are probably greyscale…
interp.plot_top_losses works now.
This means your valid set contains labels that aren’t in the train set, which happens because this competition dataset has some labels/categories/whales with just one image.
It’d be nice if the warning message was friendlier, but the essential info is there in the message.
The best place to get help about a Kaggle challenge is on Kaggle forums. There are plenty of fastai users there to assist. Here is an excellent place to start https://www.kaggle.com/c/humpback-whale-identification/discussion/74647
I get an exception
Exception: Your validation data contains a label that isn't present in the training set, please fix your data.
I guess we need to do some preprocessing, or is there a nice flag to skip these small categories? I've spent 10 minutes reading various threads with no solution.
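I don't know of a flag for this, but one preprocessing option is to drop every class with fewer than two images before splitting, so the random split can place each remaining class in both train and valid. A minimal sketch in plain Python, using a made-up miniature stand-in for the competition's train.csv rows (the filenames and whale ids below are hypothetical):

```python
from collections import Counter

# Hypothetical miniature stand-in for train.csv rows: (filename, whale id).
rows = [
    ("img0.jpg", "w_123"), ("img1.jpg", "w_123"),
    ("img2.jpg", "w_456"),                        # only one image
    ("img3.jpg", "w_789"), ("img4.jpg", "w_789"),
]

counts = Counter(label for _, label in rows)
# Keep only classes with at least 2 images, so a random split can
# put images of the class into both train and valid.
keep = [(img, lab) for img, lab in rows if counts[lab] >= 2]
```

Here the single-image whale w_456 is dropped; in the real competition data this throws away a lot of whales, which is why people look for other splits instead.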
Thanks! It’s clear.
Very specific competition
Nearly all the non-playground Kaggle challenges are specific in some way. That's why the sponsors are crowdsourcing for solutions. And it's also why they are great challenges for students/researchers/practitioners, as you have to use lateral thinking and discipline as well as the latest DL techniques. Persevere and enjoy!
I just started on lesson 1 of the 2019 course. Just wondering: are they going to cover scenarios where labels have only one or a few images?
Just want to add to this part.
After a bit of research (understanding the fastai lib better), it seems that when you do a random split, random_split_by_pct() will actually complain if you don't have enough data for a class.
So I dug into the source file; here is the link:
It seems that with a 20% split, a class needs at least 5 items:
cut = int(valid_pct * len(self))
To get cut = 1 with valid_pct = 0.2, you need len(self) = 5, since 0.2 * 5 = 1.
Otherwise cut = 0, and the validation slice rand_idx[:0] is an empty list.
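To see the arithmetic, here is a small stand-alone sketch that mimics the quoted cut logic (this is an illustration of the truncation behaviour, not the actual fastai code):

```python
import random

def random_split_sketch(items, valid_pct=0.2, seed=0):
    """Mimic the quoted logic: cut = int(valid_pct * len(self))."""
    rng = random.Random(seed)
    rand_idx = list(range(len(items)))
    rng.shuffle(rand_idx)
    cut = int(valid_pct * len(items))
    # int() truncates, so any class with fewer than 1/valid_pct items
    # gets cut == 0 and an empty validation slice.
    return rand_idx[cut:], rand_idx[:cut]  # train indices, valid indices

# 4 items: int(0.2 * 4) == 0 -> valid slice rand_idx[:0] is empty.
_, valid4 = random_split_sketch(list("abcd"))
# 5 items: int(0.2 * 5) == 1 -> exactly one item lands in valid.
_, valid5 = random_split_sketch(list("abcde"))
```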
I guess there are two solutions:
- drop the rare cases (not applicable in this case…)
- fill the gaps in the train dataset so each class has enough data (still hard: 5k+ unique labels out of 20k+ images)
Or you could find a smarter way to supply the validation set (manually label some data as valid, etc.).
I have figured out an easy way if you want to get things going quickly…
src = ImageItemList.from_csv('…/input/', 'train.csv', folder='train').no_split().label_from_df()
data = src.transform(get_transforms(), size=224).databunch()
You can do no split, so now you don't have a validation set. That's not ideal, since you can't tell how your model is doing (bias? overfitting?).
But at least you can get things going.
I will try to figure out a way to supply the validation set and post back.
Thank you for the interesting read! Another approach from https://www.kaggle.com/c/humpback-whale-identification/discussion/74647, suggested by Rob, is a Siamese network, where you test whether two images are of the same object.
Sounds good. One thing that came to mind when I saw this post yesterday was a Siamese network, as it can handle one-shot learning / verification smoothly (as far as I know).
I know how to implement it in keras, but I don't really know how to create the layers using the fastai lib.
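For what it's worth, the data side of a Siamese setup is framework-agnostic: you need (image_a, image_b, same-whale?) pairs regardless of whether the twin network is built in keras or fastai/pytorch. A hedged sketch of the pair sampling in plain Python (the make_pairs helper and the toy rows are hypothetical, not from any library):

```python
import random
from collections import defaultdict

# Hypothetical toy data: (image filename, whale id).
rows = [
    ("a0.jpg", "w_1"), ("a1.jpg", "w_1"),
    ("b0.jpg", "w_2"), ("b1.jpg", "w_2"),
    ("c0.jpg", "w_3"),
]

def make_pairs(rows, n_pairs=4, seed=0):
    """Sample (img_a, img_b, same) pairs for Siamese-style training:
    same == 1 for two images of one whale, 0 for two different whales."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for img, lab in rows:
        by_label[lab].append(img)
    multi = [lab for lab, imgs in by_label.items() if len(imgs) >= 2]
    labels = list(by_label)
    pairs = []
    for _ in range(n_pairs):
        # Positive pair: needs a whale with at least two images.
        lab = rng.choice(multi)
        a, b = rng.sample(by_label[lab], 2)
        pairs.append((a, b, 1))
        # Negative pair: two different whales.
        la, lb = rng.sample(labels, 2)
        pairs.append((rng.choice(by_label[la]), rng.choice(by_label[lb]), 0))
    return pairs
```

Note that single-image whales (like w_3 above) can still contribute to negative pairs, which is exactly why the Siamese formulation sidesteps the rare-class split problem.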
Thank you so much for providing such a great example
@heye0507 Perhaps it is a good candidate for augmentation also
Have you solved the problem of train/validation split?
The validation set should contain images that are not in train, but many whales have only one image. I have it working in keras (easy), but I am new to fastai and don't know how to make a custom validation set (I need to take 20% of the images, but only from whales that have more than 4 pictures).
I made one in a Kaggle kernel; if you want, you can take a look.
Basically I sampled each whale with number of images × 20%; if it's more than 0.5, then take 1.
So with 3 images, 3 × 0.2 = 0.6, split 1 to valid;
with 2 images, 2 × 0.2 = 0.4, keep them in train…
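That per-whale sampling can be sketched in plain Python (a sketch of the idea described above, assuming rows of (filename, whale id); not the exact kernel code):

```python
import random
from collections import defaultdict

def split_per_whale(rows, valid_pct=0.2, seed=42):
    """Per-whale split: move round(n * valid_pct) images of each whale
    to valid, where 0.5 and above rounds up to 1."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for img, lab in rows:
        by_label[lab].append(img)
    valid_imgs = set()
    for imgs in by_label.values():
        n_valid = int(len(imgs) * valid_pct + 0.5)  # round half up
        if n_valid:
            valid_imgs.update(rng.sample(imgs, n_valid))
    train = [r for r in rows if r[0] not in valid_imgs]
    valid = [r for r in rows if r[0] in valid_imgs]
    return train, valid

# 3 images: 3 * 0.2 = 0.6 -> 1 goes to valid;
# 2 images: 2 * 0.2 = 0.4 -> all stay in train.
rows = [("a0", "w_1"), ("a1", "w_1"), ("a2", "w_1"),
        ("b0", "w_2"), ("b1", "w_2")]
train, valid = split_per_whale(rows)
```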
With fastai ResNet-50 and some tuning, you can reach about 0.6+ on MAP@5, which is about top 50% on the leaderboard.
But if you want something more, you probably need to implement a Siamese network (fastai + pytorch?).
Thanks, great idea! I failed to make progress.
I have done that competition before (the playground version), and I have many ideas about how to proceed and about what works and what doesn't. I know how to do it in keras but not in fastai. I think fastai's layer-level learning rates are a big advantage; I miss them in keras.