Yes, I have used .use_partial_data and it works perfectly.
Nope! It lets you use a subset of the data you have in a folder before training. It's extremely useful when you want to check that your code works.
In addition, by default it grabs a random percentage of the data, though I don't know if that can be customised.
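For anyone finding this later, here is a minimal sketch of how I've used it with the v1 data block API (the path, percentages and batch size are just placeholders; from what I can see, sample_pct and seed are the parameters that let you customise the fraction and the randomness):

```python
from fastai.vision import *

path = Path('data/images')  # hypothetical folder of labelled images

# Grab a random 10% of the items before splitting/labelling - handy for a
# quick end-to-end check that the training code runs at all.
data = (ImageList.from_folder(path)
        .use_partial_data(sample_pct=0.1, seed=42)
        .split_by_rand_pct(valid_pct=0.2)
        .label_from_folder()
        .databunch(bs=16))
```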
I tried your approach last night @larcat and am seeing good results. I used a subset of the Caltech101 images - from all classes - and added them as my 'not a sport' class. https://sportsidentifier.azurewebsites.net/ Generally it is much better, but some odd pictures are still chosen as 'Cricket'. It got my hummingbirds right, but I liked the 2nd choice on this one - truly a field of dreams! I'll try the other option when I get a chance - all sports v Caltech101 as a binary - so using two stages for the prediction.
I'd probably need to see how it worked - and move the decision point around. One thing I've noticed while training different models is that even when accuracy is improving, the model can become even more confident in its incorrect classifications. Most of my 'not a sport' pictures were in the 80s and 90s. The one above stuck out as being quite low - but this was in the 10-class classifier. With the binary it will be interesting to see how that image is treated. The Caltech images are quite broad - but probably lacking landscapes - and I might need to leave out the soccer ball class.
You should import fastai first.
But as @seb0 said, if you installed within the last month or so, you shouldn't need to worry about the previous 0.4~ version, so all you need is import fastai; fastai.__version__
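i.e. something like:

```python
import fastai
print(fastai.__version__)  # the current library should report a 1.0.x version
```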
I think the Caltech set has too many groups of very similar images to be a good training set for my binary split detector. The training went almost immediately to 100% - and I was sure I'd done something wrong - but it was working perfectly (on the training data). Once I gave it some real-world non-sports pictures it was around 85% correct (but mostly 99-100% certain of the prediction). It looks like it is 'overfitting' to my training set of non-sports pictures. I'll see if I can create my own training set that gives better results, and also repeat the previous tests. And my tulip picture was still a sport.
I had around 4500 sports pictures and used the full Caltech set. I thought it had around 6K images on my home setup - but it looks like there are around 9K in the full set. Maybe I should have balanced it better. For my previous tests I used a subset of around 650 across all categories - as I have 400-700 for each sport.
The example splits it randomly, but it is possible to split by index, folder, df etc. if you want to be more specific about which images to include in your training and validation sets. I think the sklearn library has a stratification function that handles the split for multi-class labels, but I have not applied this myself.
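For example, with the v1 data block API (the folder names, valid_idx, items and labels below are just placeholders):

```python
from fastai.vision import *

src = ImageList.from_folder(path)

# Random split, as in the example
split = src.split_by_rand_pct(valid_pct=0.2, seed=42)

# Alternatives if you want to control exactly which images go where:
# src.split_by_folder(train='train', valid='valid')  # by folder
# src.split_by_idx(valid_idx)                        # by a list of indices
# src.split_from_df(col='is_valid')                  # by a boolean column in a df

# For a stratified split, sklearn can generate the validation indices
# (I haven't tried this with fastai myself):
# from sklearn.model_selection import train_test_split
# _, valid_idx = train_test_split(range(len(items)), test_size=0.2, stratify=labels)
```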
There may be sports pics in the Caltech set – Not familiar with it.
I'm working on a binary classifier right now and having difficulty getting the accuracy above the no-information guess. In my case, however, all the images are pretty similar, so I'm not entirely surprised - it isn't entirely intuitive that there would be sufficient signal to separate them in my case.
I'm not comfortable enough with the backend of all of this to know if the class distribution really matters here, but in more traditional ML, if your classes aren't reasonably well balanced, beating the no-information baseline can get non-trivial. Just a thought.
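For instance, with roughly the numbers mentioned above - ~9K Caltech 'not a sport' images against ~4.5K sports images - always predicting the majority class already scores about 9000 / 13500 ≈ 67%, so the model needs to clearly beat that before the accuracy means much.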
Thanks @larcat - there are some soccer balls, but not much else sporty. Also, all the Caltech images are the same size and my sports pictures aren't - so it could be that some artefacts related to size, rather than the content, are being used to separate the classes, and those artefacts aren't present in my test samples (which aren't the same size as each other, nor as the Caltech images).
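If it is the size thing, one way to rule it out might be to push every image (sports, Caltech and test pictures alike) through the same resize at load time - a rough sketch, with the path and size as placeholders:

```python
from fastai.vision import *

# Resize everything to the same square size on the fly, so the original image
# dimensions can't act as a shortcut for the classifier.
data = (ImageList.from_folder(path)
        .split_by_rand_pct(valid_pct=0.2)
        .label_from_folder()
        .transform(get_transforms(), size=224)
        .databunch(bs=32))
```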
I tried a binary classifier trained on all the sports images (4800) against a set of my own photos (4300) - mostly landscapes and city shots - and it worked well when I tested it against random pictures from Flickr. It even correctly classified some sports that were not represented in my sports set - such as tennis and golf - as sports. Following this I rebuilt my 10-class classifier (9 sports + other) with 700 random shots from my set as 'other'. It now knows tulips are not a sport!
The accuracy during training was around 96% - with no 'others' classified as sport - and only a couple of sports photos identified as 'other'. Maybe it's time to introduce a few more sports - hurling, camogie, and maybe golf and tennis. I don't think there is much value in putting the binary classifier in front of the sports classifier - as the single model seems to be doing the job well enough. https://sportsidentifier.azurewebsites.net/
Thanks for sharing @Lankinen - I found similar results when training - almost perfect separation against the Caltech101 dataset - but it just didn't generalize to other random pictures that people might choose to predict against. In that case my own set of pictures, with no particularly obvious subjects, generalized better when I tested against various random pictures (just searched Flickr) that were not in my dataset. It still wasn't perfect - many pictures with a lot of well-kept lawn look like cricket! And of course it may have picked up my bias in image selection too.