How to turn text labels from test data sets into numerical representations without introducing bias

(Christopher Dossman) #1

Hello Everyone! Just Earned Basic :smile:

I have a question about how to craft features such that they lead to the best results for my Neural Networks.

Let me start with two examples I face in my current projects.

  • In one of my NNs I take wind direction as an input, measured in degrees (0-360). It was brought to my attention that, from an input perspective, this might be a source of bias: if the wind shifts from NW to NE, the value jumps from the 300s to below 50, even though the two directions are close together. This numerical representation can lead to some strange cusps in my data when training the RNN.

    • Part of my solution to this was to split wind direction into 8 binary columns: either the wind is blowing this way or it is not.
  • The other example is with some labels I have in a loan default prediction NN. If I transform text labels such as “Credit Card Refinancing” or “Car Loan” into numerical representations, wouldn’t this cause bias based on which label gets which number? For example, “Car Loan” (mapped to 9) would look somehow better than “Credit Card Refinancing” (mapped to 2).

    • I had a similar idea of turning each of these purposes into its own binary column that acts as an input.
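A minimal sketch of the 8-column wind direction idea described above, using only the standard library. The sector names, the assumption that 0 degrees means north, and the helper name `wind_sector_one_hot` are illustrative choices, not something from the original project.

```python
# Hypothetical compass sectors, each 45 degrees wide and centred on its heading.
SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def wind_sector_one_hot(degrees):
    """Return 8 binary flags, one per compass sector (assumes 0 degrees = north)."""
    # Shift by half a sector (22.5 degrees) so that N covers 337.5..22.5,
    # then wrap around so readings just below 360 land in the same bin as 0.
    index = int(((degrees % 360) + 22.5) // 45) % 8
    return [1 if i == index else 0 for i in range(len(SECTORS))]

# 350 and 10 degrees are numerically far apart but fall in the same sector.
north_a = wind_sector_one_hot(350)
north_b = wind_sector_one_hot(10)
north_west = wind_sector_one_hot(315)
```

With this encoding, readings of 350 and 10 degrees produce the same input vector, so the jump at the 360/0 boundary no longer appears in the data.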

My main question is: what is the name of this data problem, and what are some of the more recent thoughts on how to convert text labels into numerical representations that won’t inadvertently add bias to my network?

(David Gutman) #2

The term you’re looking for is “categorical variable”.

You need to one-hot encode that column (aka dummy variables).

You only need to encode n - 1 of the categories, since the last category is implied when all the others are zero (and for some methods, such as linear models with an intercept, you need to leave one out to avoid collinearity).
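As a rough illustration of the point above, here is a sketch using pandas `get_dummies`. The tiny `loans` table is made up for the example, and the “Home Improvement” label and `amount` column are assumptions added only to get more than two categories.

```python
import pandas as pd

# Made-up data for illustration; only the first two labels come from the question.
loans = pd.DataFrame({
    "purpose": ["Car Loan", "Credit Card Refinancing", "Home Improvement", "Car Loan"],
    "amount": [12000, 5000, 20000, 8000],
})

# Full one-hot encoding: one 0/1 column per category (n columns).
full = pd.get_dummies(loans, columns=["purpose"], dtype=int)

# n - 1 encoding: drop the first category; a row of all zeros then means "Car Loan".
reduced = pd.get_dummies(loans, columns=["purpose"], drop_first=True, dtype=int)
```

No category is mapped to a larger or smaller number than another, so the network cannot read an ordering into the labels.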

There’s plenty of literature on this out there.

Alternatively you can try an embedding layer if there are many categories.
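To show what an embedding does mechanically, here is a small NumPy sketch: each category gets an integer id that is used only as a row index into a table of dense vectors. The category list, the `embedding_dim` of 2, and the random initial values are all assumptions for illustration; in a real network the table would be a trainable layer provided by your framework and its values would be learned during training.

```python
import numpy as np

# Hypothetical category vocabulary; the ids carry no ordering, they are just row indices.
purposes = ["Car Loan", "Credit Card Refinancing", "Home Improvement"]
index_of = {name: i for i, name in enumerate(purposes)}

embedding_dim = 2  # assumed size; usually much smaller than the number of categories
rng = np.random.default_rng(0)
# One row of weights per category (random here, learned in a real model).
embedding_table = rng.normal(size=(len(purposes), embedding_dim))

def embed(labels):
    """Look up the dense vector for each text label."""
    ids = np.array([index_of[label] for label in labels])
    return embedding_table[ids]

batch = embed(["Car Loan", "Home Improvement", "Car Loan"])
```

The lookup is equivalent to multiplying a one-hot vector by the table, which is why embeddings are often described as a compact alternative to one-hot columns when there are many categories.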

(Christopher Dossman) #3

Thanks for the quick and informative reply! Got plenty of reading ahead of me now. :smile:

(Jeremy Howard (Admin)) #4

We cover this in lesson 4 in particular, with another good example in lesson 7.