Class balancing (oversampling) image data

What is the most efficient and beginner-friendly way to balance classes in Fast.ai v2 when doing multi-class classification on image data? There was a topic on this a year ago. Any news?

On tabular data I use imblearn SMOTE.

I have found a nice package for PyTorch there:

My training data class counts are:
{‘hydromedusae_shapeA_sideview_small’: 274,
‘tunicate_salp’: 236,
‘hydromedusae_typeF’: 61,
‘siphonophore_calycophoran_abylidae’: 212,
‘stomatopod’: 24,
‘trochophore_larvae’: 29,
‘copepod_cyclopoid_oithona’: 899,
‘appendicularian_straight’: 242,
‘euphausiids_young’: 38,
‘hydromedusae_bell_and_tentacles’: 75,
‘siphonophore_physonect’: 128,
‘radiolarian_colony’: 158,
‘copepod_calanoid_eggs’: 173,
‘invertebrate_larvae_other_A’: 14,
‘protist_noctiluca’: 625,
‘copepod_cyclopoid_oithona_eggs’: 1189,
‘ctenophore_cydippid_no_tentacles’: 42,
‘hydromedusae_sideview_big’: 76,
‘amphipods’: 49,
‘fish_larvae_deep_body’: 10,
‘hydromedusae_shapeB’: 150,
‘tunicate_partial’: 352,
‘euphausiids’: 136,
‘fish_larvae_thin_body’: 64,
‘copepod_calanoid_flatheads’: 178,
‘tunicate_doliolid_nurse’: 417,
‘shrimp_caridean’: 49,
‘diatom_chain_string’: 519,
‘tunicate_doliolid’: 439,
‘heteropod’: 10,
‘invertebrate_larvae_other_B’: 24,
‘trichodesmium_bowtie’: 708,
‘chaetognath_sagitta’: 694,
‘hydromedusae_shapeA’: 412,
‘copepod_cyclopoid_copilia’: 30,
‘hydromedusae_other’: 12,
‘acantharia_protist’: 889,
‘hydromedusae_narco_young’: 336,
‘tunicate_salp_chains’: 73,
‘siphonophore_calycophoran_sphaeronectes’: 179,
‘protist_fuzzy_olive’: 372,
‘appendicularian_slight_curve’: 532,
‘ctenophore_cydippid_tentacles’: 53,
‘hydromedusae_typeD’: 43,
‘unknown_sticks’: 175,
‘pteropod_butterfly’: 108,
‘tornaria_acorn_worm_larvae’: 38,
‘shrimp_zoea’: 174,
‘echinoderm_larva_seastar_bipinnaria’: 385,
‘trichodesmium_tuft’: 678,
‘jellies_tentacles’: 141,
‘unknown_unclassified’: 425,
‘fecal_pellet’: 511,
‘siphonophore_calycophoran_rocketship_adult’: 135,
‘siphonophore_calycophoran_sphaeronectes_young’: 247,
‘detritus_blob’: 363,
‘hydromedusae_narcomedusae’: 132,
‘hydromedusae_typeE’: 14,
‘trichodesmium_puff’: 1979,
‘trichodesmium_multiple’: 54,
‘siphonophore_calycophoran_sphaeronectes_stem’: 57,
‘copepod_calanoid_frillyAntennae’: 63,
‘pteropod_triangle’: 65,
‘chaetognath_non_sagitta’: 815,
‘appendicularian_s_shape’: 696,
‘artifacts’: 393,
‘acantharia_protist_big_center’: 13,
‘hydromedusae_partial_dark’: 190,
‘ctenophore_cestid’: 113,
‘copepod_calanoid_eucalanus’: 96,
‘protist_star’: 113,
‘detritus_filamentous’: 394,
‘copepod_calanoid’: 681,
‘chordate_type1’: 77,
‘hydromedusae_haliscera’: 229,
‘echinopluteus’: 27,
‘hydromedusae_solmaris’: 703,
‘ephyra’: 14,
‘acantharia_protist_halo’: 71,
‘siphonophore_physonect_young’: 21,
‘fish_larvae_leptocephali’: 31,
‘fish_larvae_medium_body’: 85,
‘hydromedusae_aglaura’: 127,
‘hydromedusae_liriope’: 19,
‘ctenophore_lobate’: 38,
‘shrimp-like_other’: 52,
‘chaetognath_other’: 1934,
‘protist_dark_center’: 108,
‘siphonophore_other_parts’: 29,
‘copepod_calanoid_small_longantennae’: 87,
‘artifacts_edge’: 170,
‘shrimp_sergestidae’: 153,
‘radiolarian_chain’: 287,
‘siphonophore_partial’: 30,
‘hydromedusae_narco_dark’: 23,
‘echinoderm_larva_pluteus_typeC’: 80,
‘copepod_calanoid_large_side_antennatucked’: 106,
‘copepod_calanoid_octomoms’: 49,
‘fish_larvae_very_thin_body’: 16,
‘appendicularian_fritillaridae’: 16,
‘hydromedusae_typeD_bell_and_tentacles’: 56,
‘copepod_other’: 24,
‘pteropod_theco_dev_seq’: 13,
‘protist_other’: 1172,
‘siphonophore_calycophoran_rocketship_young’: 483,
‘echinoderm_larva_pluteus_brittlestar’: 36,
‘unknown_blobs_and_smudges’: 317,
‘decapods’: 55,
‘copepod_calanoid_large’: 286,
‘fish_larvae_myctophids’: 114,
‘polychaete’: 131,
‘hydromedusae_h15’: 35,
‘detritus_other’: 914,
‘echinoderm_larva_pluteus_urchin’: 88,
‘echinoderm_larva_seastar_brachiolaria’: 536,
‘hydromedusae_haliscera_small_sideview’: 9,
‘echinoderm_larva_pluteus_early’: 92,
‘hydromedusae_solmundella’: 123,
‘echinoderm_seacucumber_auricularia_larva’: 96,
‘crustacean_other’: 201,
‘diatom_chain_tube’: 500}


Hi @Vytautas, here is one approach that works for multi-label classification: Oversampling for Multi-Label Classification | Kaggle

I think this should also work for multi-class, but there is probably a simpler way in that case, likely using PyTorch's WeightedRandomSampler. Do let us know what you end up doing!
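
To make the weighted-sampling idea concrete, here is a minimal sketch in plain Python of inverse-frequency sampling with replacement. It is an illustration only, not a tested fastai recipe: the four classes are a subset of the counts posted above, and `labels`, `sample_weights` and `epoch_indices` are hypothetical names. As far as I know, the same per-sample weights can be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)`.

```python
import random
from collections import Counter

# Subset of the class counts posted above (label -> number of images).
class_counts = {
    "trichodesmium_puff": 1979,
    "copepod_calanoid": 681,
    "stomatopod": 24,
    "heteropod": 10,
}

# One label per training image, in dataset order (stand-in for real labels).
labels = [name for name, n in class_counts.items() for _ in range(n)]

# Inverse-frequency weight per sample: each class ends up with the same total weight.
sample_weights = [1.0 / class_counts[label] for label in labels]

# Draw one "epoch" of indices with replacement, so images of rare classes repeat.
rng = random.Random(0)
epoch_indices = rng.choices(range(len(labels)), weights=sample_weights, k=len(labels))

# Compare the original class counts with how often each class was drawn.
drawn = Counter(labels[i] for i in epoch_indices)
for name in class_counts:
    print(f"{name}: original={class_counts[name]}, drawn={drawn[name]}")
```

Each class should be drawn roughly equally often. Note that very small classes (10 images here) are repeated many times per epoch, so augmentation matters a lot to avoid memorising those few images.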

Cheers, Darek
