I am quite new to ML and AI.
I have watched the 2018 course, mostly the image classification part. I managed to take part in the Histopathologic Cancer Detection competition on Kaggle; I'm in 128th place now.
Now I'm tackling Humpback Whale Identification
on Kaggle with the new fastai v1.0 library.
My github is here:
Could someone please check the errors I get with interp = ClassificationInterpretation.from_learner(learn)?
New commit on GitHub.
I changed the ImageDataBunch.from_csv code and I get a warning:
/home/tom/anaconda3/lib/python3.7/site-packages/fastai/data_block.py:475: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
w_39ea8fa, w_d8ae71c, w_ef62b09, w_5966e55, w_b07ec5d…
if getattr(ds, 'warn', False): warn(ds.warn)
I think those rejected images are probably greyscale…
interp.plot_top_losses works now.
Tom
This means your valid set contains labels that aren’t in the train set, which happens because this competition dataset has some labels/categories/whales with just one image.
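You can see this for yourself before building a DataBunch. A quick sketch with pandas, using a hypothetical stand-in frame for the competition's train.csv (image filename mapped to whale id); classes with a single image can never appear in both the train and validation splits at once:

```python
import pandas as pd

# Hypothetical miniature of train.csv: each row maps an image to a whale id.
df = pd.DataFrame({
    "Image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "Id": ["w_123", "w_123", "w_456", "new_whale"],
})

# Count images per whale and pick out the single-image classes.
counts = df["Id"].value_counts()
singletons = counts[counts == 1].index.tolist()
print(singletons)  # whales with only one image
```

On the real train.csv the same two lines reveal how many whales are singletons, which explains the discarded validation labels.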
It’d be nice if the warning message was friendlier, but the essential info is there in the message.
Exception: Your validation data contains a label that isn't present in the training set, please fix your data.
I guess we need to do some preprocessing, or is there a nice flag to skip these small categories? I've spent 10 minutes reading various threads with no solution.
Nearly all the non-playground Kaggle challenges are specific in some way. That’s why the sponsors are crowd sourcing for solutions. And also why they are great challenges for students/researchers/practitioners, as you have to use lateral thinking and discipline as well as the latest DL techniques. Persevere and enjoy!
I just started on lesson 1 of the 2019 course. Just wondering if they are going to go over scenarios where you have labels with only one to a few images?
Just want to add to this part.
After a bit of research (understanding the fastai lib better), it seems that when you do a random split, random_split_by_pct() will actually complain if you don't have enough data to support a class.
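One workaround is to filter out the too-small classes before splitting. A sketch in pandas, assuming a train.csv-style frame with Image/Id columns (the threshold of 2 is just illustrative):

```python
import pandas as pd

# Hypothetical stand-in for train.csv.
df = pd.DataFrame({
    "Image": [f"img{i}.jpg" for i in range(6)],
    "Id": ["w_1", "w_1", "w_1", "w_2", "w_3", "w_3"],
})

# Keep only classes with at least 2 images, so a random split
# cannot strand a label entirely in the validation set.
counts = df["Id"].value_counts()
df_filtered = df[df["Id"].map(counts) >= 2].reset_index(drop=True)
```

You can then feed df_filtered to ImageDataBunch.from_df (or write it back to a csv) instead of the raw file.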
Sounds good. One thing that came to my mind when I saw this post yesterday was a Siamese network, as it can handle one-shot learning / verification smoothly (as far as I know).
I know how to implement it in keras, but don’t really know how to create layers using fastai lib.
Thank you so much for providing such a great example
Hi Tom,
Have you solved the problem of train/validation split?
Validation should contain images which are not in train, but many whales only have one image. I have it working in keras (easy), but I am new to fastai and do not know how to
make a custom validation set (I need to take 20% of the images, but only from whales that have more than 4 pictures).
Dmitry
Basically I sampled each whale with number of images × 20%; if it's more than 0.5, then take 1:
so if 3 images, 3 × 0.2 = 0.6, split 1 image to valid;
if 2 images, 2 × 0.2 = 0.4, keep both in train…
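The rule above can be sketched in a few lines of pandas. This is a hypothetical helper, not a fastai API; the resulting index list could be passed to fastai's split_by_idx:

```python
import pandas as pd

def per_whale_valid_idx(df, pct=0.2):
    """For each whale, send round(n_images * pct) images to validation
    (values above 0.5 round up to 1, as described above)."""
    valid_idx = []
    for _, grp in df.groupby("Id"):
        n_valid = int(len(grp) * pct + 0.5)  # 3*0.2=0.6 -> 1; 2*0.2=0.4 -> 0
        valid_idx.extend(grp.index[:n_valid].tolist())
    return valid_idx

# Tiny stand-in frame: whale w_a has 3 images, w_b has 2.
df = pd.DataFrame({
    "Image": [f"img{i}.jpg" for i in range(5)],
    "Id": ["w_a", "w_a", "w_a", "w_b", "w_b"],
})
valid_idx = per_whale_valid_idx(df)
```

Here w_a contributes one image to validation and w_b stays entirely in train, matching the 0.6 vs 0.4 example.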
With fastai ResNet-50 and some tuning, you can reach about 0.6+ on mAP@5, which is about top 50% on the leaderboard.
But if you want something more, you probably need to implement a Siamese network (fastai + pytorch?).
I have done that competition before (the playground version), and I have many ideas on how to proceed, and on what works and what doesn't. I know how to do it in keras but not in fastai. I think fastai's layer-level learning rates are a big advantage: I miss them in keras.
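For anyone who wants to try the Siamese route in plain PyTorch, here is a minimal sketch with a contrastive loss. Everything here is a stand-in (the tiny encoder would be replaced by a pretrained backbone, and none of these names are fastai APIs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    """One shared encoder embeds both images; the distance between
    the two embeddings drives a contrastive loss."""
    def __init__(self, emb_dim=32):
        super().__init__()
        # Tiny convolutional encoder standing in for a pretrained backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, emb_dim),
        )

    def forward(self, x1, x2):
        # Same weights applied to both inputs.
        return self.encoder(x1), self.encoder(x2)

def contrastive_loss(e1, e2, same, margin=1.0):
    # same: 1.0 if the pair shows the same whale, else 0.0.
    d = F.pairwise_distance(e1, e2)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()
```

Training then means sampling same-whale and different-whale image pairs, which sidesteps the singleton-class problem entirely: a whale with one image can still appear in "different" pairs.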