Lesson 1: Crappy Mugiwara Categorizer, and how I made it less crappy

OG message: at first this was a question, but by fiddling around and watching the next lesson, I learned techniques that made this model much better. Still open to learning how I could make this model (and hopefully future ones) even better!


Hello, everyone!

So for the lesson 1 project, since I had no world-changing ideas to develop, and since at this point I am just a noob with some coding and math knowledge, I wanted to tackle a simpler project.

Given an image, identify which member of the Straw Hats (mugiwara) the image represents.

I half-expected the model to be subpar, considering all the course examples seemed to be of animals or other physical things. Still, an error rate of 50% after 10 epochs is worse than I expected. So I was wondering if there were ways to improve that, maybe through the use of pretrained models specific to anime. I looked into it a bit and found a project called Animesion that might help in that regard, but when I tried to set it up I found myself way out of my depth.
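For context, my baseline was essentially the standard fastai recipe, roughly like this (a minimal sketch; the folder path and architecture are assumptions, and the commented-out timm line is just one way to try a different pretrained backbone, not the Animesion approach):

```python
from fastai.vision.all import *

path = Path('mugiwara')  # assumed layout: one folder per crew member

dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42, item_tfms=Resize(224))

# Baseline: fine-tune a generic ImageNet-pretrained resnet
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(10)

# fastai also accepts timm architecture names as strings,
# which is one way to swap in a different pretrained backbone:
# learn = vision_learner(dls, 'convnext_tiny', metrics=error_rate)
```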

Sidenote: I figured that cleaning the dataset the model was trained on would be a good first step, but I wasn't able to get the cleaner to work in Kaggle. I am planning to do the Jupyter install on my local machine soon, and if it doesn't work there either, I'll leave it at that. But I was wondering if there is a way to fix it.
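For reference, the standard fastbook pattern for the cleaner in a local notebook is roughly the following (it relies on ipywidgets, which is likely why it misbehaves in Kaggle; `learn` and `path` are the ones from the training sketch above):

```python
import shutil
from fastai.vision.widgets import ImageClassifierCleaner

cleaner = ImageClassifierCleaner(learn)
cleaner  # displays the widget; mark images to delete or relabel

# After marking, apply the choices to the files on disk:
for idx in cleaner.delete():
    cleaner.fns[idx].unlink()
for idx, cat in cleaner.change():
    shutil.move(str(cleaner.fns[idx]), path/cat)
```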

Edit: It seems that simply adding augmentations makes the model far better, but accuracy is still far from the high nineties.
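Concretely, by "adding augmentations" I mean passing fastai's default `aug_transforms()` as batch transforms; a minimal sketch, assuming the same folder layout as above:

```python
from fastai.vision.all import *

dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42,
    item_tfms=Resize(224),
    batch_tfms=aug_transforms())  # default flips, rotations, zooms, lighting

learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(10)
```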

Edit 2: Well, I figured increasing the dataset a bit would make it better, and oh my, it did. I ran another round of downloading (about 300 images per crew member); there were certainly duplicates in the dataset, but that, coupled with the default data augmentations, got me error rates as low as 5.01%.
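For anyone curious, the bulk download can be done with fastbook's DuckDuckGo helper; a rough sketch (the crew list, query strings, and `max_images` value are assumptions, and note the results will include duplicates and junk):

```python
from pathlib import Path
from fastbook import search_images_ddg
from fastai.vision.all import download_images, verify_images, get_image_files

crew = ['Luffy', 'Zoro', 'Nami', 'Usopp', 'Sanji',
        'Chopper', 'Robin', 'Franky', 'Brook', 'Jinbe']
root = Path('mugiwara')

for member in crew:
    dest = root/member
    dest.mkdir(parents=True, exist_ok=True)
    urls = search_images_ddg(f'one piece {member}', max_images=300)
    download_images(dest, urls=urls)

# Drop files that failed to download or are corrupt
failed = verify_images(get_image_files(root))
failed.map(Path.unlink)
```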

Appreciate you adding the edits in here.

Amazing how simply increasing the data dramatically improved the error rate.

Right! YMMV, but simple augmentation and increasing the dataset seemed to give me improvements good enough to make the model actually deployable. I likely would have deployed it even if it had a success rate of 50% (just for the sake of practice), but I am much happier with this version.


A clean, representative dataset is key, particularly when working with smaller datasets.
Be careful with duplicates; they will skew your results.
In particular, avoid data leakage between the training and validation/test sets through duplicate examples appearing in both (though random augmentations may alleviate this to some extent).
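One cheap way to catch exact duplicates before splitting is to hash the file bytes; a minimal sketch (this only finds byte-identical copies; near-duplicates would need perceptual hashing, e.g. the imagehash package):

```python
import hashlib
from pathlib import Path

seen = {}
for img in sorted(Path('mugiwara').rglob('*.jpg')):
    h = hashlib.md5(img.read_bytes()).hexdigest()
    if h in seen:
        print(f'duplicate: {img} == {seen[h]}')
        img.unlink()  # remove the byte-identical copy
    else:
        seen[h] = img
```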

Thank you, this makes sense. I’ll take special care to avoid that in the future.

Are there any recommended tools to clean up your training data? I am currently still on lesson 2, so maybe this is shown further ahead, but the ImageClassifierCleaner does not seem to allow deleting/relabeling more than a handful of images at a time. Are there any other tools that would allow for a "deep" cleaning of the dataset?
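In the meantime, one workaround might be to skip the widget and pull a ranked list of the highest-loss images programmatically, then review or delete them in bulk; a rough sketch (`k=200` is an arbitrary choice):

```python
from fastai.vision.all import *

interp = ClassificationInterpretation.from_learner(learn)  # validation set by default
losses, idxs = interp.top_losses(k=200)  # the 200 worst predictions

# Map indices back to file paths and list them for bulk review
suspect = [learn.dls.valid_ds.items[int(i)] for i in idxs]
for fn, loss in zip(suspect, losses):
    print(f'{loss.item():.2f}  {fn}')
```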