I use ImageDataLoaders.from_folder to load train and validation data. Question: What is a more efficient way to display the number of images in each category of the train/valid dataset, primarily to identify class imbalances?
This is how I currently calculate and plot the number of images per category, but it's a bit slow.
Looping through dls.train.dataset to create the mapping train_cls_distribution is slow for me as well. dls.train.dataset behaves like a list of tuples of the form (image, label), or more precisely here (PIL Image, TensorCategory). For a reason I don't fully understand, when I loop over it, Google Colab shows a load_image() function being called at the bottom of the screen; I suspect this is the cause of the slowness.
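If the slowness does come from every image being opened during the loop, one way around it is to count labels from the file paths alone, so no image is ever read. Below is a minimal sketch of that idea in plain Python. The example paths are made up; with fastai the real list of paths is, I believe, available as dls.train_ds.items, but treat that as an assumption and check it on your version.

```python
from collections import Counter
from pathlib import Path


def class_counts(paths):
    """Count items per class, taking the class from each file's parent folder name."""
    return Counter(Path(p).parent.name for p in paths)


# Hypothetical paths standing in for the dataset's file list.
example_paths = [
    "train/cat/001.jpg",
    "train/cat/002.jpg",
    "train/dog/001.jpg",
]

counts = class_counts(example_paths)
for label, n in sorted(counts.items()):
    print(label, n)
```

Since this only touches path strings, the cost should depend on the number of files rather than on image size.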
If you're using ImageDataLoaders.from_folder(), I'm assuming your data is laid out in the ImageNet style; that is, one folder per class. An alternative may be to use the Python os module and count the files in each folder directly; something along the lines of
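import os

DIR = './data/imagenette2-160/train'
# Each subfolder of DIR is one class; count the entries inside it.
for folder in sorted(os.listdir(DIR)):
    # os.path.join adds the path separator that plain DIR + folder would miss.
    print(folder, len(os.listdir(os.path.join(DIR, folder))))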
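To cover both the train and valid splits and skip any non-image files, the same idea can be wrapped in a small function. This is only a sketch: the extension set, the function name, and the throwaway directory used for the demo are my own assumptions, not part of fastai or of your dataset.

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def count_images_per_class(split_dir):
    """Return {class_name: image_count} for a folder-per-class directory."""
    split_dir = Path(split_dir)
    return {
        class_dir.name: sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTS
        )
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }


# Demo on a temporary directory tree (hypothetical class names and empty files).
with tempfile.TemporaryDirectory() as tmp:
    for split, cls, n in [("train", "cat", 3), ("train", "dog", 1), ("valid", "cat", 1)]:
        d = Path(tmp) / split / cls
        d.mkdir(parents=True)
        for i in range(n):
            (d / f"{i}.jpg").touch()
    (Path(tmp) / "train" / "cat" / "notes.txt").touch()  # non-image, not counted

    train_counts = count_images_per_class(Path(tmp) / "train")
    valid_counts = count_images_per_class(Path(tmp) / "valid")

print(train_counts)
print(valid_counts)
```

On real data you would call count_images_per_class on your own train and valid folders; the resulting dictionaries can then be passed to whatever plotting code you already use.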