Why use one hot encoding instead of integer encoding?

imago · April 27, 2019, 1:06pm

Hello!

I understand what one hot encoding is, but I can’t for the life of me figure out why you’d want to do it. How does it help with anything? I’m all for categorizing data, such as creating an integer-based dictionary of the words in a text, where each integer represents a word, but why would you take this integer representation and blow it up into a one hot encoding? To me, this just adds a whole lot of pointless zeroes.
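To make the two representations concrete, here is a minimal sketch of what I mean. The sample sentence and the helper names `build_vocab` and `one_hot` are just made up for illustration, not taken from any library.

```python
def build_vocab(text):
    """Map each distinct word to an integer, in order of first appearance."""
    vocab = {}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def one_hot(index, size):
    """Return a list of `size` zeroes with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical sample text, chosen only for illustration.
text = "the cat sat on the mat"
vocab = build_vocab(text)

# Integer encoding: one small number per word.
integer_encoded = [vocab[w] for w in text.lower().split()]

# One hot encoding: one vector of length len(vocab) per word.
one_hot_encoded = [one_hot(i, len(vocab)) for i in integer_encoded]

print(integer_encoded)
for vec in one_hot_encoded:
    print(vec)
```

Six integers turn into six vectors of five entries each, almost all of them zero, which is the blow-up I’m asking about.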

knesgood · April 27, 2019, 1:12pm

This isn’t a complete answer, but one reason is that label encoding, when passed to most algorithms, implies that the order of the labels matters. This is a problem when, say, looking at days of the week. Thursday isn’t twice as much of anything as Tuesday, but linear models can’t know that. One hot encoding is less necessary with CART models (which can split on specific values) and neural networks with embedding matrices (because each value gains its own mathematical representation).
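A rough sketch of the days-of-the-week problem, assuming the labels are numbered Monday = 1 through Sunday = 7 (that numbering and the helper names are my own choice for illustration). With integer labels, the numbers carry ratios and gaps that mean nothing; with one hot vectors, every pair of days is the same distance apart.

```python
import math

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Assumed label encoding: Mon=1, Tue=2, ..., Sun=7.
label = {day: i + 1 for i, day in enumerate(days)}


def one_hot(day):
    """One vector entry per day, with a 1 in the matching position."""
    return [1 if d == day else 0 for d in days]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Label encoding: the numbers suggest Thursday is "twice" Tuesday,
# and that Sunday is much further from Monday than Tuesday is.
ratio_thu_tue = label["Thu"] / label["Tue"]
label_gap_mon_tue = abs(label["Mon"] - label["Tue"])
label_gap_mon_sun = abs(label["Mon"] - label["Sun"])

# One hot encoding: every pair of distinct days is equally far apart.
onehot_gap_mon_tue = distance(one_hot("Mon"), one_hot("Tue"))
onehot_gap_mon_sun = distance(one_hot("Mon"), one_hot("Sun"))

print(ratio_thu_tue, label_gap_mon_tue, label_gap_mon_sun)
print(onehot_gap_mon_tue, onehot_gap_mon_sun)
```

A linear model multiplies its input by a weight, so with the integer labels it is forced to treat Thursday as having twice the effect of Tuesday. With the one hot vectors each day gets its own weight instead.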

bernd.heidemann · April 27, 2019, 3:33pm

This article helped me:

https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/