Dealing with sparse and tabular integer data

I have a dataset of 200K rows and 38 columns. All of them are integer values (up to 2671), and I am trying to build a NN to classify the records into two groups. Using a random forest I could reach 88% accuracy on the validation set, but my NN is not close to that. On my best attempt I reached 95% on the training set and 84% on the validation set.

I tried a model like:

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, BatchNormalization, Dropout
from keras.optimizers import Adam

model = Sequential()
# Each of the 38 integer columns is mapped to a 10-dimensional embedding
model.add(Embedding(3000, 10, input_length=38))
model.add(Flatten())
# Five repeated blocks of Dense(300) -> BN -> Dense(200) -> BN
for _ in range(5):
    model.add(Dense(300, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(200, activation='relu'))
    model.add(BatchNormalization())
model.add(Dropout(0.25))
model.add(Dense(2, activation='softmax'))
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])

The first layer is indeed an Embedding layer: each of the 38 integer values is looked up in a 10-dimensional embedding, and the result is flattened before the dense layers.

Here is some of the training output:
Epoch 8/30
180000/180000 [==============================] - 38s - loss: 0.1208 - acc: 0.9492 - val_loss: 0.6160 - val_acc: 0.8348
Epoch 9/30
180000/180000 [==============================] - 39s - loss: 0.1231 - acc: 0.9487 - val_loss: 0.6178 - val_acc: 0.8430
Epoch 10/30
180000/180000 [==============================] - 40s - loss: 0.1189 - acc: 0.9500 - val_loss: 0.6605 - val_acc: 0.8480
Epoch 11/30
180000/180000 [==============================] - 39s - loss: 0.1187 - acc: 0.9504 - val_loss: 0.6340 - val_acc: 0.8433
Epoch 12/30
180000/180000 [==============================] - 39s - loss: 0.1172 - acc: 0.9506 - val_loss: 0.7138 - val_acc: 0.8340
Epoch 13/30
180000/180000 [==============================] - 38s - loss: 0.1160 - acc: 0.9513 - val_loss: 0.6475 - val_acc: 0.8425
Epoch 14/30
180000/180000 [==============================] - 39s - loss: 0.1163 - acc: 0.9516 - val_loss: 0.7101 - val_acc: 0.8398
Epoch 15/30
180000/180000 [==============================] - 38s - loss: 0.1161 - acc: 0.9511 - val_loss: 0.7448 - val_acc: 0.8383
Epoch 16/30
180000/180000 [==============================] - 39s - loss: 0.1136 - acc: 0.9529 - val_loss: 0.7683 - val_acc: 0.8433
Epoch 17/30
180000/180000 [==============================] - 39s - loss: 0.1113 - acc: 0.9532 - val_loss: 0.7168 - val_acc: 0.84

Any ideas on how to improve the accuracy?
I also tried other approaches, such as normalizing the columns and dropping the embedding, but they didn't help.
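For reference, that no-embedding alternative was roughly along these lines (the scaler choice, layer sizes, and the X_train/X_val names are only illustrative):

from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

# Scale each of the 38 integer columns to zero mean and unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype('float32'))
X_val_scaled = scaler.transform(X_val.astype('float32'))

# Plain dense network on the scaled columns, no Embedding layer
model = Sequential()
model.add(Dense(300, activation='relu', input_shape=(38,)))
model.add(Dense(200, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(2, activation='softmax'))
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])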

You’re overfitting: training accuracy is around 95% while validation accuracy hovers around 84%, and the validation loss keeps climbing while the training loss keeps falling. Apply the usual countermeasures: shrink the network, add dropout after every hidden layer (not just the last one), add L2 weight regularization, and use early stopping on the validation loss.
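A minimal sketch of those steps, keeping the same embedding front end (the layer sizes, dropout rate, L2 strength, batch size, and the X_train/y_train names are starting points to tune, not a recipe):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout
from keras.regularizers import l2
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Embedding(3000, 10, input_length=38))
model.add(Flatten())
# A much smaller network, with dropout and L2 on every hidden layer
model.add(Dense(128, activation='relu', kernel_regularizer=l2(1e-4)))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu', kernel_regularizer=l2(1e-4)))
model.add(Dropout(0.5))
# One sigmoid unit is the usual pairing with binary_crossentropy
model.add(Dense(1, activation='sigmoid'))
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Stop when validation loss stops improving instead of training all 30 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=3)
model.fit(X_train, y_train,          # y_train as 0/1 labels, not one-hot
          validation_split=0.1, epochs=30, batch_size=256,
          callbacks=[early_stop])

As a side note, with the 2-unit softmax output in the original model, categorical_crossentropy would be the matching loss; binary_crossentropy with a single sigmoid unit as above is the more common setup for a two-class problem.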
