Dealing with sparse and tabular integer data

I have a dataset of 200K rows and 38 columns. All of them are integer values (up to 2671), and I am trying to build a NN to classify the records into two groups. Using a random forest I could reach 88% accuracy on the validation set, but my NN is not close to that. On my best attempt I reached 95% on the training set and 84% on the validation set.

I tried a model like:

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, BatchNormalization, Dropout
from keras.optimizers import Adam

model = Sequential()
# Each of the 38 integer columns is mapped to a 10-dimensional embedding
model.add(Embedding(3000, 10, input_length=38))
model.add(Flatten())
# Five repeated blocks of Dense(300) -> BN -> Dense(200) -> BN
for _ in range(5):
    model.add(Dense(300, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(200, activation='relu'))
    model.add(BatchNormalization())
model.add(Dropout(0.25))
model.add(Dense(2, activation='softmax'))
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])

The first layer is indeed an Embedding layer: each of the 38 integer values is looked up in a 10-dimensional embedding, and the result is flattened before the dense layers.

Here is some of the training output:
Epoch 8/30
180000/180000 [==============================] - 38s - loss: 0.1208 - acc: 0.9492 - val_loss: 0.6160 - val_acc: 0.8348
Epoch 9/30
180000/180000 [==============================] - 39s - loss: 0.1231 - acc: 0.9487 - val_loss: 0.6178 - val_acc: 0.8430
Epoch 10/30
180000/180000 [==============================] - 40s - loss: 0.1189 - acc: 0.9500 - val_loss: 0.6605 - val_acc: 0.8480
Epoch 11/30
180000/180000 [==============================] - 39s - loss: 0.1187 - acc: 0.9504 - val_loss: 0.6340 - val_acc: 0.8433
Epoch 12/30
180000/180000 [==============================] - 39s - loss: 0.1172 - acc: 0.9506 - val_loss: 0.7138 - val_acc: 0.8340
Epoch 13/30
180000/180000 [==============================] - 38s - loss: 0.1160 - acc: 0.9513 - val_loss: 0.6475 - val_acc: 0.8425
Epoch 14/30
180000/180000 [==============================] - 39s - loss: 0.1163 - acc: 0.9516 - val_loss: 0.7101 - val_acc: 0.8398
Epoch 15/30
180000/180000 [==============================] - 38s - loss: 0.1161 - acc: 0.9511 - val_loss: 0.7448 - val_acc: 0.8383
Epoch 16/30
180000/180000 [==============================] - 39s - loss: 0.1136 - acc: 0.9529 - val_loss: 0.7683 - val_acc: 0.8433
Epoch 17/30
180000/180000 [==============================] - 39s - loss: 0.1113 - acc: 0.9532 - val_loss: 0.7168 - val_acc: 0.84

Any ideas on how to improve the accuracy?
I also tried other approaches, such as normalizing the columns and dropping the embedding, but they didn't help.
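For reference, that no-embedding alternative was roughly along these lines (the scaler choice, layer sizes, and the X_train/X_val names are only illustrative):

from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

# Scale each of the 38 integer columns to zero mean and unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype('float32'))
X_val_scaled = scaler.transform(X_val.astype('float32'))

# Plain dense network on the scaled columns, no Embedding layer
model = Sequential()
model.add(Dense(300, activation='relu', input_shape=(38,)))
model.add(Dense(200, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(2, activation='softmax'))
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])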

You’re overfitting: training accuracy is around 95% while validation accuracy hovers around 84%, and the validation loss keeps climbing while the training loss keeps falling. Apply the usual countermeasures: shrink the network, add dropout after every hidden layer (not just the last one), add L2 weight regularization, and use early stopping on the validation loss.
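A minimal sketch of those steps, keeping the same embedding front end (the layer sizes, dropout rate, L2 strength, batch size, and the X_train/y_train names are starting points to tune, not a recipe):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout
from keras.regularizers import l2
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Embedding(3000, 10, input_length=38))
model.add(Flatten())
# A much smaller network, with dropout and L2 on every hidden layer
model.add(Dense(128, activation='relu', kernel_regularizer=l2(1e-4)))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu', kernel_regularizer=l2(1e-4)))
model.add(Dropout(0.5))
# One sigmoid unit is the usual pairing with binary_crossentropy
model.add(Dense(1, activation='sigmoid'))
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Stop when validation loss stops improving instead of training all 30 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=3)
model.fit(X_train, y_train,          # y_train as 0/1 labels, not one-hot
          validation_split=0.1, epochs=30, batch_size=256,
          callbacks=[early_stop])

As a side note, with the 2-unit softmax output in the original model, categorical_crossentropy would be the matching loss; binary_crossentropy with a single sigmoid unit as above is the more common setup for a two-class problem.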
