MultiClass Classification using Dense Layers

@jeremy I was solving a multi-class classification problem using only dense layers.
Dataset: ~115 dimensions, 10 classes to predict
Network architecture:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(117, input_dim=117, init='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train, train_label, batch_size=64, nb_epoch=25, validation_data=(test, test_label))

But I am getting very poor results.
Epoch 11/25
45377/45377 [==============================] - 5s - loss: 1.7601 - acc: 0.4009 - val_loss: 1.7170 - val_acc: 0.4108
Epoch 12/25
45377/45377 [==============================] - 5s - loss: 1.7760 - acc: 0.3999 - val_loss: 1.8254 - val_acc: 0.4095
Epoch 13/25
45377/45377 [==============================] - 6s - loss: 1.7552 - acc: 0.4006 - val_loss: 1.7814 - val_acc: 0.4322

Any suggestions to improve the model?

You are underfitting. Try:

  • Reducing the learning rate
  • Removing the 1st dropout layer
  • Making the first dense layer 512 units wide (i.e. Dense(512))

What kind of data is it?

It's a mix of categorical and continuous (numeric) features, used to predict which of 10 products a user will choose.

I reduced the learning rate and also changed the architecture, but it's giving even poorer results.
model = Sequential()

# Input layer
model.add(Dense(117, input_dim=117, init='normal', activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))

# Layer 1
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))

# Layer 2
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))

# Layer 3
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))

# Layer 4
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))

# Output layer
model.add(Dense(10, activation='softmax'))
model.summary()
model.fit(train, train_label, batch_size=100, nb_epoch=25, validation_data=(test, test_label))

Epoch 24/25
45377/45377 [==============================] - 7s - loss: 1.8421 - acc: 0.3829 - val_loss: 1.8311 - val_acc: 0.3866
Epoch 25/25
45377/45377 [==============================] - 7s - loss: 1.8386 - acc: 0.3860 - val_loss: 1.8006 - val_acc: 0.3976

You’ve made too many changes, and not the ones I suggested - and now have an awful lot of layers! The model I described would be written like so:

from keras.optimizers import Adam

model = Sequential([
  Dense(512, input_dim=117, activation='relu'),
  Dense(512,activation='relu'),
  Dropout(0.2),
  Dense(10, activation='softmax')
  ])
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])
model.fit(train, train_label, batch_size=64, nb_epoch=25, validation_data=(test, test_label))

(I also just noticed you replaced the default init with normal - I don’t think you should do that, since glorot initialization is a better idea, I believe)
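For reference, in Keras 1.x glorot_uniform is already the default init for Dense, so simply dropping the init argument gives Glorot initialization; a minimal sketch of the equivalent spellings:

# equivalent ways to get Glorot initialization in Keras 1.x (pick one):
model.add(Dense(512, input_dim=117, init='glorot_uniform', activation='relu'))  # explicit
# model.add(Dense(512, input_dim=117, activation='relu'))                       # implicit: glorot_uniform is the Dense default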

I used the model which you suggested.
model = Sequential([
  Dense(512, input_dim=117, activation='relu'),
  Dense(512, activation='relu'),
  Dropout(0.2),
  Dense(10, activation='softmax')
  ])
model.fit(train, train_label, batch_size=64, nb_epoch=25, validation_data=(test, test_label))

Results:
Epoch 22/25
45377/45377 [==============================] - 12s - loss: 1.7614 - acc: 0.4039 - val_loss: 1.7435 - val_acc: 0.4099
Epoch 23/25
45377/45377 [==============================] - 13s - loss: 1.7473 - acc: 0.4088 - val_loss: 1.7320 - val_acc: 0.4242
Epoch 24/25
45377/45377 [==============================] - 12s - loss: 1.7384 - acc: 0.4089 - val_loss: 1.7213 - val_acc: 0.4170
Epoch 25/25
45377/45377 [==============================] - 12s - loss: 1.7382 - acc: 0.4091 - val_loss: 1.9721 - val_acc: 0.4039

Any thoughts on where I might be going wrong?

Looks like you’re not overfitting any more. So you’ll have to think about your feature engineering, since it sounds like you have structured data. In general, deep learning isn’t the best tool for structured data - or at least I haven’t seen many people make it work well.

Thanks Jeremy, that echoes my view as well: “deep learning isn’t the best tool for structured data”.
I wanted to convince myself of that. Yes, this is structured data.
I experimented with standard machine learning models like a random forest on the same dataset and got an F1 score of 0.8602 (train) and 0.8138 (test); a sketch of that baseline is below.
Can I conclude that more research has to be done for structured data, and that deep learning is usually not the best tool?
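A minimal sketch of that kind of baseline, assuming scikit-learn (the integer-label variable names, hyperparameters, and weighted averaging are illustrative placeholders, not the exact setup):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# train / test: the same feature matrices fed to the Keras model
# train_label_int / test_label_int: integer class labels (placeholder names, not one-hot)
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf.fit(train, train_label_int)

print('F1 (train):', f1_score(train_label_int, rf.predict(train), average='weighted'))
print('F1 (test):', f1_score(test_label_int, rf.predict(test), average='weighted'))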

Yes I think that’s a reasonable conclusion - although I also think there’s no reason DL couldn’t turn out to be just as good as random forests if more people work on DL for structured data. It’s something I’d be interested in spending time on sometime, since I’ve been a major RF fan for a long time!

BTW in your most recent snippet you didn’t show the compile step. What learning rate did you use? Have you tried decreasing it a lot?

I used
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])

I also tried decreasing it further, with optimizer=Adam(lr=1e-06), but there wasn't much improvement:
Epoch 24/25
45377/45377 [==============================] - 12s - loss: 1.8287 - acc: 0.3665 - val_loss: 1.7992 - val_acc: 0.3839
Epoch 25/25
45377/45377 [==============================] - 12s - loss: 1.8294 - acc: 0.3665 - val_loss: 1.7990 - val_acc: 0.3839

Thanks for reporting back!

Are you one-hot encoding all the categorical variables? If not - you definitely need to. If some are very high cardinality (i.e. have many levels) use an Embedding layer for them instead.
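A minimal sketch of what that could look like for a single high-cardinality column, in Keras 1.x functional-API style (the cardinality, embedding width, and feature count below are made-up placeholders):

from keras.layers import Input, Embedding, Flatten, Dense, merge
from keras.models import Model

n_levels = 5000     # hypothetical cardinality of the high-cardinality column
emb_size = 16       # hypothetical embedding width
n_other = 100       # hypothetical number of remaining (one-hot / numeric) features

cat_in = Input(shape=(1,), dtype='int32')      # integer-encoded category
other_in = Input(shape=(n_other,))             # the rest of the features

cat_emb = Flatten()(Embedding(n_levels, emb_size, input_length=1)(cat_in))
x = merge([cat_emb, other_in], mode='concat')
x = Dense(512, activation='relu')(x)
out = Dense(10, activation='softmax')(x)

model = Model(input=[cat_in, other_in], output=out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])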

Yes, I am using one-hot encoding for the labels and also for the categorical features.
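For reference, a minimal sketch of that kind of preprocessing, assuming pandas and Keras 1.x (the dataframe, column names, and label array are placeholders):

import pandas as pd
from keras.utils.np_utils import to_categorical

# one-hot encode the categorical feature columns (df and column names are placeholders)
features = pd.get_dummies(df, columns=['cat_col_1', 'cat_col_2'])

# one-hot encode the integer class labels (0..9) for categorical_crossentropy
train_label = to_categorical(label_ints, nb_classes=10)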

Just wondering, what is the theory behind using an embedding layer for high-cardinality variables?

It’s purely a computational/memory saving. Rather than multiplying by a one-hot encoded matrix, which if high cardinality would be huge, it’s quicker and less memory intensive to simply use an integer to index into it directly. The result is, of course, identical.
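A tiny numpy sketch of that equivalence: multiplying a one-hot row by a weight matrix selects exactly the row you would get by indexing with the integer directly.

import numpy as np

n_levels, emb_size = 1000, 8
W = np.random.randn(n_levels, emb_size)   # weight / embedding matrix

idx = 42                                  # integer-encoded category
one_hot = np.zeros(n_levels)
one_hot[idx] = 1.0

# multiplying by the one-hot vector and indexing directly give the same row
assert np.allclose(one_hot.dot(W), W[idx])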

That is a very good tip for high cardinality variables.

It is! And not just for deep learning - for regression, GlmNet, etc I’ve always used this approach instead of creating dummy variables. :slight_smile:

Another technique which might be useful for categorical variables is the hashing trick, which maps each category into a fixed number of buckets (so distinct values can share an index).
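A minimal sketch of that idea using scikit-learn's FeatureHasher (the bucket count and the column=value strings are arbitrary placeholders):

from sklearn.feature_extraction import FeatureHasher

n_buckets = 256   # arbitrary; trades memory against collision rate
hasher = FeatureHasher(n_features=n_buckets, input_type='string')

# each row is a list of "column=value" strings for its categorical features
hashed = hasher.transform([['product=A', 'country=US'],
                           ['product=B', 'country=DE']])
print(hashed.shape)   # (2, 256)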

Could you explain a little bit more about using an embedding layer instead of one-hot encoding?

I have always used one-hot encoding, and of course it is very slow and memory-consuming.