Multi-Class Probability - Data Loading from CSV with Fastai

farlion · February 27, 2018, 10:31pm

Ahoi there,

apologies if this has already been covered elsewhere, but I couldn’t find anything specific.

Imagine I have a training set specified via CSV, similar to planet’s (lesson 2 image models) CSV:

image_name,tags
train_0,haze primary
train_1,agriculture clear primary water
train_2,clear primary
...

…but with an exact probability wanted for each label. So the CSV would look like this:

image_name,haze,primary,agriculture,clear,water
train_0,0.383147,0.616853,0,0,0
train_1,0.616853,0.038452149,0.578400851,0.418397819,0.198455181
train_2,0,0.104752126,0.512100874,0,0.054453

…then is there already a good way to load these in via fastai’s ImageClassifierData methods?
Or would I need to code up something custom?

Thanks in advance for any hints

farlion · March 11, 2018, 10:00pm

Ended up hacking something together: https://gist.github.com/workflow/294c3cc2c202e196a2687700136e3dc2

Happy to clean that up if it turns out useful to anyone.
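
The gist itself is not reproduced here. As a rough illustration of the first step only, the sketch below parses a CSV in the per-label probability layout from the first post into image names, class names, and float targets, using just the Python standard library. The function name load_probability_labels is made up for this example and is not part of fastai or of the gist; feeding the result into a fastai dataset is left out.

```python
import csv
import io

# Sample rows copied from the CSV layout in the first post.
SAMPLE_CSV = """image_name,haze,primary,agriculture,clear,water
train_0,0.383147,0.616853,0,0,0
train_1,0.616853,0.038452149,0.578400851,0.418397819,0.198455181
train_2,0,0.104752126,0.512100874,0,0.054453
"""


def load_probability_labels(csv_file):
    """Return (image_names, class_names, targets) from a wide-format CSV.

    Hypothetical helper: the first column is the image name and every
    other column is parsed as a float target for that class.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    class_names = header[1:]
    image_names, targets = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        image_names.append(row[0])
        targets.append([float(value) for value in row[1:]])
    return image_names, class_names, targets


image_names, class_names, targets = load_probability_labels(io.StringIO(SAMPLE_CSV))
print(class_names)
print(image_names[0], targets[0])
```

With a real file, the same function could be called on an open file handle, e.g. `open(path, newline="")`, instead of the in-memory sample.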

adaptivekernel · June 6, 2018, 8:49pm

@farlion I might actually use this. Thanks.
To confirm, the assumption of this gist is that I have images denoted as a column, and multiple classes as columns in a CSV? Is it ok if the names of classes are the same even if the column is different (Left vs. Right orientations, for example)?

Can you tell me what license to use for this code? I try to only use code that has an Apache license or a more liberal one.

farlion · June 21, 2018, 6:33pm

Hey @adaptivekernel, very sorry for the late reply. I'm not exactly sure what you mean by names of classes being the same for different columns. Can you show me a brief example?
Here’s an example of the code in action for the Kaggle Galaxies competition: https://gist.github.com/workflow/60e4f586e3f7ef0aaefde091c3a488b3

Oh, and for a license, please just use http://www.wtfpl.net/. Hope this didn’t block you in any way.

adaptivekernel · June 22, 2018, 3:23pm

Hi @farlion,

Looking at that example, this might not be what I am looking for. Specifically, I have data in the form of something like:
ImageName, Type, Color
I001, Bike, Blue
I002, Pedestrian, Red

Are probabilities explicitly expected by this, or can it be a specific class? If not, my plan was to change my representation to:

ID, Values
I001, Type:Bike Color:Blue

Unless you have any suggestions. I think this is in the direction of what I am looking for, but maybe not.
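
For reference, the conversion described above (one column per attribute into a single space-separated "Column:Value" tag string, as in the planet CSV) could be sketched roughly as follows. This is only an illustration under assumptions: the helper name to_tag_rows is made up, and it assumes attribute values contain no spaces, since a space-separated tag list would otherwise be ambiguous.

```python
import csv
import io

# Sample rows in the multi-column layout described in the post.
SAMPLE_CSV = """ImageName, Type, Color
I001, Bike, Blue
I002, Pedestrian, Red
"""


def to_tag_rows(csv_file):
    """Turn each row into (image_id, "Col:Value Col:Value ...").

    Hypothetical helper; assumes the first column is the image ID and
    that no attribute value contains a space.
    """
    reader = csv.reader(csv_file, skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        values = [value.strip() for value in row]
        tags = " ".join(
            f"{column}:{value}" for column, value in zip(header[1:], values[1:])
        )
        rows.append((values[0], tags))
    return rows


tag_rows = to_tag_rows(io.StringIO(SAMPLE_CSV))
for image_id, tags in tag_rows:
    print(f"{image_id},{tags}")
```

Prefixing each value with its column name keeps identical values in different columns apart, so a "Left" in one column and a "Left" in another would become two distinct tags.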

This is perfect, thanks! I love that license. I just like clarifying these things where possible. =)

farlion · June 23, 2018, 6:11pm

Hmm, interesting! This is definitely different from what I was doing (predicting independent exact probabilities for a number of fixed classes).

I’m by no means an expert on what you’re trying to do, but if you have a small number of categories per categorical variable (i.e. only a few types and colors) then your approach to turn it into “standard” binary multi-label classification (again not sure this is idiomatic jargon) looks reasonable.
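
To make the "binary multi-label" idea concrete, here is a minimal sketch that turns each (column, value) pair into its own 0/1 target. The helper names build_vocabulary and multi_hot are made up for this example, and the two records are simply the sample rows from the post above; nothing here is fastai API.

```python
def build_vocabulary(records, columns):
    """Collect every distinct "Column:Value" label, in sorted order."""
    return sorted({f"{col}:{rec[col]}" for rec in records for col in columns})


def multi_hot(record, columns, vocab):
    """Return a 0/1 vector with a 1 for each label present in the record."""
    active = {f"{col}:{record[col]}" for col in columns}
    return [1.0 if label in active else 0.0 for label in vocab]


# The two sample rows from the post, as dictionaries.
records = [
    {"ImageName": "I001", "Type": "Bike", "Color": "Blue"},
    {"ImageName": "I002", "Type": "Pedestrian", "Color": "Red"},
]
columns = ["Type", "Color"]

vocab = build_vocabulary(records, columns)
encoded = {rec["ImageName"]: multi_hot(rec, columns, vocab) for rec in records}

print(vocab)
for image_id, vector in encoded.items():
    print(image_id, vector)
```

One caveat with this encoding: it does not force exactly one value per column, so a model trained on it could predict two colors for the same image or none at all.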

Are these independent things you are trying to predict (i.e. type does not affect color in any way) for the same image, or could there be some underlying relationships (bikes are more likely to be blue than pedestrians are)?

adaptivekernel · June 23, 2018, 10:47pm

Hi @farlion,

Thanks for continuing the discussion.

That was my thought as well. What I struggle with is that the data has enough columns that I had syntax concerns. I was hoping I could adapt what you did so that I could read the CSV with multiple columns, rather than try to make one very large, clunky one.

That is where things get complicated. To be more specific, my database describes aspects of streets (not aspects that AV folks care about). So I think generally they can be treated as independent, but there are some exceptions. For example, a bike lane buffer might only appear if there is a bike lane, etc. For now, I want to say an assumption of independence is acceptable, but not perfect for all attributes.
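
One quick way to see how far the independence assumption holds for a pair of attributes is to count how often their values occur together. The sketch below does that for the bike lane example. The column names bike_lane and bike_lane_buffer and the four rows are invented purely for illustration and are not taken from the actual database.

```python
from collections import Counter

# Made-up rows for illustration only: 1 means the attribute is present.
records = [
    {"bike_lane": 1, "bike_lane_buffer": 1},
    {"bike_lane": 1, "bike_lane_buffer": 0},
    {"bike_lane": 0, "bike_lane_buffer": 0},
    {"bike_lane": 0, "bike_lane_buffer": 0},
]


def cooccurrence(records, first, second):
    """Count how often each (first, second) value pair appears."""
    return Counter((rec[first], rec[second]) for rec in records)


counts = cooccurrence(records, "bike_lane", "bike_lane_buffer")

# Rows with a buffer but no lane would contradict "buffer implies lane".
violations = counts[(0, 1)]

print(dict(counts))
print("buffer without lane:", violations)
```

If a pair like (0, 1) never occurs in the real data, the two attributes are clearly not independent, and it may be worth treating them as one combined attribute rather than two separate labels.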

Would be curious what you think might be a good approach or way to format the data.