Designing Neural Networks with different input data types

dndln · July 13, 2017, 12:20pm

hey guys, I’m wondering how to structure the input for a problem and hope I can get some tips. I have a bunch of attributes describing an entity, and the attributes have different data types (string, constant, int, float). I understand I need to one-hot encode the strings and constants, so my input for an entity would be something like [[0,0,1,0], 1359283, [1,0], 45.1, 2.1, 0.17, [0.43,0.63,0.11,0.35]].

So do I just flatten this into a single vector like [0,0,1,0, 1359283, 1,0, 45.1, 2.1, 0.17, 0.43,0.63,0.11,0.35] and feed that as input? It feels like doing it this way loses a lot of information: (0,0,1,0) is supposed to stay together as a tuple, and a tuple like (1,0) is completely independent of the other values, like 1359283.
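For reference, here is a minimal sketch of the flattening described above, in plain Python. The helper name `flatten_with_spans` is made up for illustration. It also records which positions of the flat vector came from which attribute, so the grouping is still known to you even though the network only sees one flat vector.

```python
def flatten_with_spans(parts):
    """Flatten a mix of scalars and lists into one flat list of floats.

    Also returns (start, end) index pairs, one per original attribute,
    so each attribute's slice in the flat vector can be recovered.
    Hypothetical helper, not part of any library.
    """
    flat, spans = [], []
    for part in parts:
        # Treat a scalar as a group of length one.
        values = part if isinstance(part, (list, tuple)) else [part]
        start = len(flat)
        flat.extend(float(v) for v in values)
        spans.append((start, len(flat)))
    return flat, spans


# The example entity from the post above.
entity = [[0, 0, 1, 0], 1359283, [1, 0], 45.1, 2.1, 0.17, [0.43, 0.63, 0.11, 0.35]]
flat, spans = flatten_with_spans(entity)

# Recover the first one-hot group from the flat vector.
first_group = flat[spans[0][0]:spans[0][1]]
```

Each position in `flat` always holds the same attribute for every entity, which is what a fixed-size input layer expects.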

Or do I separate them into different input layers and merge them? Basically merge([input(shape=(x,4,1)), input(shape=(x,1,), dtype='int64'), input(shape=(x,2,1)), input(shape=(x,1,), dtype='float64'), input(shape=(x,1,), dtype='float64'), input(shape=(x,1,), dtype='float64'), input(shape=(x,4,1), dtype='float64')]), x being the length of my samples.

Another big problem I’m running into is that I’m not sure how to deal with optional attributes. I remember in our word2vec class, the words that are outside the 500 word range are all assigned index 500. But for integer and float attributes that are optional, I don’t know what I should do. Do I assign them an arbitrary placeholder like 0 or -1? Or do I make each one a vector, like [1, 0.63] meaning the attribute has a value for this entity and [0, 0] meaning it doesn’t?
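To make the second option concrete, here is a minimal sketch of the [flag, value] encoding in plain Python. The function name `encode_optional` and the use of `None` for a missing value are assumptions for illustration only.

```python
def encode_optional(value):
    """Encode an optional numeric attribute as [present_flag, value].

    A missing attribute (None here, by assumption) becomes [0.0, 0.0];
    a present one becomes [1.0, value]. Hypothetical helper.
    """
    if value is None:
        return [0.0, 0.0]
    return [1.0, float(value)]


present = encode_optional(0.63)    # attribute has a value
missing = encode_optional(None)    # attribute is absent
real_zero = encode_optional(0)     # a genuine 0 stays distinct from "missing"
```

The flag is what lets a real value of 0 be told apart from a missing one, which a bare placeholder like 0 or -1 cannot do on its own.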

Hope I can get some help here, thanks a lot!

pietz · July 13, 2017, 1:42pm

on the off-chance that you’re rather new to this field: have you tried non-deep machine learning techniques? those usually perform better on tabular data and are much easier to handle. NNs mostly shine on images and/or if some form of “human intuition” seems needed.

libs like TPOT will do most of the work for you: they search over models and preprocessing steps and generate a well-performing pipeline for a given problem. if you haven’t tried stuff like random forests or SVMs, give them a shot first.

dndln · July 14, 2017, 8:44am

Thanks pietz, yes I am rather new to modelling data (currently a data analyst). I do understand the basics of random forests, SVMs, and gradient boosting (e.g. xgboost), but I wanted to approach the problem from a NN perspective. I’ve heard @jeremy talk about how deep learning has the potential to outperform the other methods, and wanted to give it a shot.

TPOT looks great, saves me from manually having to do feature selection. Thanks for the suggestion!