Help pls! Is food categorisation a multi-label problem or a segmentation problem?

I want to build a NN to solve this problem: given an image of a food plate, identify all the food groups present on it (carbs, dairy, protein, etc., or bread/eggs/dairy/meat/vegetables, etc.). I’ve tried to get started on this (spent half a day already) but had no luck finding the right dataset.

Initially I thought this would be a simple multi-label image classification problem. However, I haven’t been able to find a food dataset with multiple labels. After scanning some papers along similar lines, it seems they either trained on single-label food images and then somehow used that, OR created a segmentation dataset from food-tray images containing only certain types of dishes.

Can someone please answer the following and help me get on track again:

  1. Should this problem be solved using an image segmentation model (and if so, is there a dataset I can use)?
  2. Or is there a multi-label food dataset I can use for this purpose?
  3. Or is there a way to train a model on single-label data, but then have it spit out a set of probabilities corresponding to a subset of those labels?

Argh, I feel so lost and dejected… I thought I understood at least 50% of lesson 3, and here I am confused about what is multi-label and what is segmentation! :sweat:


If you just want to know what’s on the plate, you can approach this as multi-label classification; segmentation is fancier than you need. I don’t know about existing datasets, but there’s Google Dataset Search, if that helps. For the 3rd point, I think yes. So it’s sort of like the planet dataset, but where each image just happens to have a single one of the 17 classes? In that case I think you just need to set it up the same way, though I do wonder whether never seeing multiple classes in the same image will affect the model.
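To make that concrete, here’s a tiny framework-free sketch of what I mean by “set it up the same way”: encode each single-label image as a multi-hot target, and at inference use an independent sigmoid per class so any number of labels can fire. The class names and threshold are made up for illustration.

```python
import math

# Hypothetical class list -- swap in whatever food groups you use.
CLASSES = ["carbs", "dairy", "protein", "vegetables"]

def to_multilabel_target(label):
    """Encode one label as a multi-hot vector (all zeros except that class)."""
    return [1.0 if c == label else 0.0 for c in CLASSES]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Independent sigmoid per class, so the model can output a *set* of labels."""
    return [c for c, z in zip(CLASSES, logits) if sigmoid(z) > threshold]
```

So `to_multilabel_target("dairy")` gives `[0, 1, 0, 0]` as the training target, and at prediction time `predict_labels` can return several classes at once even though every training image had only one.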

Of course, for data, if you can’t find what you need, you could build it yourself. One way to label images sort of quickly (once you have the actual images you need) is to write a program that displays each image and lets you enter the labels (say, 0: carbs, 1: dairy, etc.), then writes that to a csv.

I don’t know if this’ll be more confusing, but here’s a function that displays each image in a class folder and asks whether that image belongs, then returns a list of filepaths to remove based on your answers (it needs OpenCV: $ conda install opencv, and works from the command line).

Also a quick break-down:

  • Multi-label classification: each image can have one or more labels.
  • Segmentation is classification of each pixel in an image. You then color the pixels according to their class (that’s the mask) and overlay it on top of the original image.
  • Object detection is classification plus localization (where something is in an image). My guess is we’ll cover this in part 2.
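One framework-free way to see the difference is the shape of what each model outputs for a single image (the numbers and class names here are arbitrary, just to show the shapes):

```python
n_classes, height, width = 5, 224, 224

# Multi-label classifier: one score per class for the whole image.
multi_label_output = [0.0] * n_classes

# Segmentation: one score per class for *every pixel* (this is what
# gets turned into the colored mask).
segmentation_output = [[[0.0] * width for _ in range(height)]
                       for _ in range(n_classes)]

# Object detection: a class score plus a bounding box (x, y, w, h)
# for each detected object.
detection_output = [("dairy", 0.9, (40, 60, 100, 120))]
```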

All of these models start with a classifier pretrained on some large dataset (usually ImageNet), and then do some sort of fine-tuning. When people talk about a ‘convolutional feature extractor’, they mean the part that comes from that original classifier model.