Dataset curation question: How specific should the "negative" class training samples be?

jgreenemi · February 26, 2018, 8:53am

Hey all! First-time poster, enjoying the lessons so far. TL;DR is in bold below.

I’m working on a sandbox project throughout the course to put the lesson material into practice as I’m learning it, which is working out well. However, it got me thinking about best practices when curating a dataset. Here’s the situation:

I’m collecting images for binary classification, where the model detects whether a person is wearing a particular article of clothing. The classes would be “1: contains article” and “0: does not contain article”. My question is: what kind of training samples should I use for the “does not contain article” set? Images of people who are not wearing the article? Images of disembodied clothing that is not the article? A mix of people, clothing, and other general photos, none of which contain the article? Or should I simply populate the negatively labeled training set with every photo I can get my hands on that does not feature the article?

Since we’re really just checking for the presence of that article of clothing in an image, it seems to me there’s value in the last option: exposing the algorithm to many photos of things that don’t include the article should make it more robust to any kind of photo thrown at it. If I only give it images of people and images of clothing, it won’t know how to handle an image of a car with no people or clothing in it, and that image could be misclassified if minute features of the car happen to resemble the article we’re looking for.
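To make the "general negatives" idea concrete, here is a minimal sketch of how a negative set could be assembled from several pools in chosen proportions. The function name `build_negative_set`, the pool names, and the weights are all hypothetical, picked only for illustration; nothing here is a recommended ratio.

```python
import random


def build_negative_set(pools, weights, n_samples, seed=0):
    """Draw label-0 samples from several pools of image paths.

    pools:   dict mapping a (hypothetical) pool name to a list of image paths
    weights: dict mapping the same names to the fraction of n_samples to draw
    Returns a shuffled list of (path, label, pool_name) tuples, label always 0.
    """
    rng = random.Random(seed)
    negatives = []
    for name, paths in pools.items():
        # Never ask for more images than the pool actually holds.
        k = min(len(paths), round(weights[name] * n_samples))
        for path in rng.sample(paths, k):
            negatives.append((path, 0, name))
    rng.shuffle(negatives)
    return negatives
```

Keeping the pool name alongside each sample makes it possible to later check which kind of negative the model struggles with, and to rebalance the weights accordingly.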

On the flip side, will this make the algorithm more likely to produce false positives on images that share other common features with the positively labeled training set? For example, since all our positive examples show at least some kind of clothing, will the algorithm be more prone to positively classifying any image with clothing in it? To simplify: will I be training the algorithm to flag the wrong things by having such a general negative training set?
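One way this concern could be checked empirically is to measure the false-positive rate separately for each kind of negative image in a held-out set. The sketch below assumes each validation record carries a source tag (for example "other clothing" versus "general photos"); the function name and the record layout are hypothetical.

```python
from collections import defaultdict


def false_positive_rate_by_source(records):
    """Compute the false-positive rate for each source of negative images.

    records: iterable of (source, true_label, predicted_label) tuples,
             with labels 1 (contains article) or 0 (does not).
    Returns a dict mapping source -> fraction of its negatives predicted as 1.
    """
    negatives = defaultdict(int)
    false_positives = defaultdict(int)
    for source, true_label, predicted_label in records:
        if true_label != 0:
            continue  # positives cannot be false positives
        negatives[source] += 1
        if predicted_label == 1:
            false_positives[source] += 1
    return {s: false_positives[s] / negatives[s] for s in negatives}
```

If the rate for negatives that show other clothing comes out much higher than for general photos, that would suggest the model has learned "clothing" rather than the specific article, and that more clothing-only negatives are needed.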

This is not something I have tested, since collecting samples for the dataset is a little time-consuming and I want to spend my time efficiently (not try everything, but try what makes the most sense). As such, I’d appreciate your insights and the reasoning behind them. Thanks!