How to create a toy dataset out of a large dataset for quick experimentation (in general, and for Food-101 in particular)

Hello All,

I am planning to build an image classifier using the Food-101 dataset. The full dataset consists of 101 food categories with 1,000 images in each category (750 train and 250 test).

For quick iterations while experimenting with my classifier, I would like to use a small sample of the dataset instead of using the whole dataset.

To create a small sample, I can reduce the number of classes from 101 to some x and reduce the number of training images in each category from 750 to some y.
Could someone please advise on how I should choose x and y in this case?
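
For reference, the mechanics of cutting the dataset down to x classes and y images per class could be sketched as below. This is only a minimal sketch, assuming the training set is available as a list of (image path, label) pairs; `make_toy_subset` and its parameters are hypothetical names, and the classes are picked at random here, which says nothing about how x and y should actually be chosen.

```python
import random
from collections import defaultdict


def make_toy_subset(samples, num_classes, per_class, seed=0):
    """Keep `num_classes` randomly chosen classes and up to `per_class` items of each.

    `samples` is assumed to be a list of (item, label) pairs.
    """
    rng = random.Random(seed)  # local RNG so the subset is reproducible

    # Group items by their label.
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    # Pick x classes, then y items from each picked class.
    chosen = rng.sample(sorted(by_label), num_classes)
    subset = []
    for label in chosen:
        items = by_label[label]
        k = min(per_class, len(items))  # never ask for more than the class has
        for item in rng.sample(items, k):
            subset.append((item, label))
    return subset
```

Fixing the seed means every experiment runs on the same toy subset, so results stay comparable between runs.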

In general, whenever I work on a large dataset, I would like to come up with a sample that is representative of the original dataset but small enough for quick experimentation. So, given some fixed computing power, how do you deal with the trade-off between being able to experiment quickly and preserving the essence of the original dataset?
How should I approach this toy dataset creation problem in general?

Also, how can we guarantee that a classifier that performs well on a toy dataset will also perform well on the full dataset? I mean, are there any factors other than the representativeness of the dataset that determine how well a classifier’s performance carries over when we move from a small version of the dataset to the original?

Any insights would be much appreciated :slight_smile:


Hey Anurag,

There isn’t a single recipe for creating a simpler version of a dataset for quick prototyping. For classification, a dataset that is too large for prototyping usually falls into one of two cases: either you have far more samples per class than classes, in which case you take fewer samples per class, or you have relatively few samples per class but many classes (as in your case), in which case you take fewer classes.

If I were you, I would select a subset of classes (5-10), maybe using some domain knowledge so that the classes aren’t too easy to tell apart, prototype on that, and then move on to the full dataset once prototyping is done. I wouldn’t reduce the number of samples per class or the resolution of the images, because that can affect the stability of training and how well the results generalize to the full dataset.

Taking a subset of classes is fairly easy - you just look for the indices where the label y is in your list of classes, and then keep only the inputs x and labels y at those indices.
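
As a rough illustration of that filtering step, here is a minimal sketch that assumes x and y are plain Python lists of equal length (for example image paths and class names). `keep_classes` is a hypothetical helper name, not part of any library.

```python
def keep_classes(x, y, classes):
    """Return only the (x, y) entries whose label is in `classes`."""
    keep = set(classes)  # set membership is O(1) per lookup

    # Indices where the label is one of the selected classes.
    idx = [i for i, label in enumerate(y) if label in keep]

    # Keep x and y at those indices, preserving the original order.
    return [x[i] for i in idx], [y[i] for i in idx]


# Illustrative data only; the file names are made up.
x = ["img_0.jpg", "img_1.jpg", "img_2.jpg", "img_3.jpg"]
y = ["pizza", "sushi", "ramen", "pizza"]
x_small, y_small = keep_classes(x, y, ["pizza", "ramen"])
```

If x and y are NumPy arrays instead, a boolean mask built with `np.isin(y, classes)` does the same job.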

Also feel free to share your code if you’d like :slight_smile:

Hi Oren,

Thanks for the advice. For experimentation I’ll train on a few classes, using all the training images for those classes. As I don’t have any specific knowledge about the food images in the dataset, to choose the subset of classes I may fine-tune a pre-trained resnet34 for one epoch and plot the most confused classes to find some hard ones.
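
To make the idea concrete, here is a minimal sketch of the "most confused classes" step. It assumes the fine-tuned model has already produced predictions on a validation set, so that true and predicted labels are available as two lists. The fine-tuning itself is not shown, and `most_confused_pairs` and `hard_classes` are hypothetical helper names.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top_k=5):
    """Return up to `top_k` (true, predicted, count) triples for misclassified samples."""
    counts = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return [(t, p, n) for (t, p), n in counts.most_common(top_k)]


def hard_classes(y_true, y_pred, top_k=5):
    """Classes that appear in the `top_k` most confused pairs, most confused first."""
    seen = []
    for t, p, _ in most_confused_pairs(y_true, y_pred, top_k):
        for c in (t, p):
            if c not in seen:
                seen.append(c)
    return seen
```

The classes returned by `hard_classes` would then be candidates for the toy subset, since they are the ones the one-epoch model mixes up most often.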

Also, I may be completely wrong, but I have a feeling that as we go from the toy dataset to the full dataset we may have to switch to a deeper architecture (e.g. from resnet34 to resnet50). Could you clarify this? If that is the case, then we would have to find a new set of hyperparameters for the new architecture, right? And if so, I’m not sure what purpose the toy dataset would serve.

I haven’t started working on this yet, but I will surely share once I’m done. :slight_smile:

Hey Anurag,

The trick about determining the subset of classes using a pre-trained network is brilliant! I don’t know if anyone has ever tried that before, so you should definitely let the community know how it went :slight_smile:
In my experience, switching from resnet34 to a deeper version mostly just means that training takes longer in exchange for a small increase in accuracy. You usually don’t have to change any of the major decisions you’ve made along the way (such as data augmentation), because they aren’t likely to be extremely sensitive to the architecture. You may have to change the batch size (larger models take more GPU RAM, so you might not be able to fit batches quite as large) and the learning rate (because the model is deeper), but these rarely matter much in terms of model performance.

Thanks a lot, Oren. I’ll definitely let the community know after I try this out :slight_smile:

Did you have any success with this approach, or any other tips? I am working on a similar problem at the moment, so any advice would be really helpful!