Theoretical Explanation: Good Dataset

Scenario: I am trying to create a dataset of images for different animal classes. I am going to train a CNN to classify those images.

Problem: Let’s assume I don’t have the privilege of collecting many images and was only able to collect a limited number for each class. Here’s the list:

  1. Baboon: 800

  2. Fox: 1,000

  3. Hyena: 5,000

  4. Giraffe: 43

  5. Zebra: 88

  6. Hippopotamus: 233

  7. Yak: 578

  8. Polar Bear: 456

  9. Lion: 3,442

  10. Indian Tiger: 40,000
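To put numbers on the imbalance, here is a small sketch that computes each class's share of the dataset and the ratio between the largest and smallest class. The counts are the ones listed above; the variable names (`counts`, `shares`, `imbalance_ratio`) are only illustrative.

```python
# Image counts per class, copied from the list above.
counts = {
    "Baboon": 800,
    "Fox": 1000,
    "Hyena": 5000,
    "Giraffe": 43,
    "Zebra": 88,
    "Hippopotamus": 233,
    "Yak": 578,
    "Polar Bear": 456,
    "Lion": 3442,
    "Indian Tiger": 40000,
}

total = sum(counts.values())

# Fraction of the whole dataset that each class contributes.
shares = {name: n / total for name, n in counts.items()}

largest_class = max(counts, key=counts.get)
smallest_class = min(counts, key=counts.get)

# How many times more images the largest class has than the smallest.
imbalance_ratio = counts[largest_class] / counts[smallest_class]

for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13} {n:>6} {shares[name]:6.1%}")

print(f"total images: {total}")
print(f"{largest_class} vs {smallest_class}: {imbalance_ratio:.0f}x")
```

A single class holding most of the images means a model could score a high accuracy by mostly predicting that class, which is why the per-class quantities matter.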

My questions are:

  • Is this a good dataset for training a CNN model? I am worried about the number of images in each class.

  • Will it help if I augment the data? I think I am going to.

  • The dataset is going to grow in the future, so there is a chance that I will train the model again. Should I build a model that fits the data at its present size, or a bigger one in order to accommodate future data?

Thank you for your time.

I can get data from the Internet. But this question is about the approaches to take when we are limited to little data, as in the National Data Science Bowl (classifying plankton).

My approach would be the same as my 20-month-old son’s: just try things out and see what happens.

That said, I would simply copy-paste (duplicate) the images in the small classes to balance things out and train on that.
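A minimal sketch of that duplication idea, assuming the dataset is held as a list of `(path, label)` pairs. The function name `oversample_by_duplication`, the file names, and the tiny demo counts are all hypothetical; the sketch only shows the bookkeeping, not the training.

```python
import random
from collections import Counter


def oversample_by_duplication(samples, seed=0):
    """Duplicate items of the smaller classes until every class is as
    large as the biggest one. `samples` is a list of (path, label)."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for path, label in samples:
        by_class.setdefault(label, []).append((path, label))

    target = max(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        balanced.extend(items)
        # Draw the missing items at random, with replacement, from the
        # originals of this class (zero extra draws for the largest class).
        balanced.extend(rng.choices(items, k=target - len(items)))

    rng.shuffle(balanced)
    return balanced


# Tiny made-up example: 2 giraffes, 3 zebras, 6 tigers.
demo = (
    [(f"giraffe_{i}.jpg", "Giraffe") for i in range(2)]
    + [(f"zebra_{i}.jpg", "Zebra") for i in range(3)]
    + [(f"tiger_{i}.jpg", "Indian Tiger") for i in range(6)]
)

balanced_demo = oversample_by_duplication(demo)
print(Counter(label for _, label in balanced_demo))
```

Duplicates should only be added to the training split, after the validation images have been set aside; otherwise copies of the same image can end up on both sides and inflate the validation score.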

Kind regards,