MultiClass Classification using Dense Layers

@jeremy I was solving a multi-class classification problem using only dense layers.
Dataset: ~115 dimensions, 10 classes to predict
Network architecture:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(117, input_dim=117, init='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train, train_label, batch_size=64, nb_epoch=25, validation_data=(test, test_label))

But I am getting very poor results.
Epoch 11/25
45377/45377 [==============================] - 5s - loss: 1.7601 - acc: 0.4009 - val_loss: 1.7170 - val_acc: 0.4108
Epoch 12/25
45377/45377 [==============================] - 5s - loss: 1.7760 - acc: 0.3999 - val_loss: 1.8254 - val_acc: 0.4095
Epoch 13/25
45377/45377 [==============================] - 6s - loss: 1.7552 - acc: 0.4006 - val_loss: 1.7814 - val_acc: 0.4322

Any suggestions to improve the model?

You are underfitting. Try:

  • Reducing the learning rate
  • Removing the 1st dropout layer
  • Making the first dense layer 512 units wide (i.e. Dense(512))

What kind of data is it?

It's a mix of categorical and continuous (numeric) features, used to predict which of 10 products a user will choose.

I reduced the learning rate and also changed the architecture, but it's giving even poorer results.
model = Sequential()

# Input layer
model.add(Dense(117, input_dim=117, init='normal', activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))

# Layer 1
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))

# Layer 2
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))

# Layer 3
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))

# Layer 4
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))

# Output layer
model.add(Dense(10, activation='softmax'))
model.summary()
model.fit(train, train_label, batch_size=100, nb_epoch=25, validation_data=(test, test_label))

Epoch 24/25
45377/45377 [==============================] - 7s - loss: 1.8421 - acc: 0.3829 - val_loss: 1.8311 - val_acc: 0.3866
Epoch 25/25
45377/45377 [==============================] - 7s - loss: 1.8386 - acc: 0.3860 - val_loss: 1.8006 - val_acc: 0.3976

You’ve made too many changes, and not the ones I suggested - and now have an awful lot of layers! The model I described would be written like so:

from keras.optimizers import Adam

model = Sequential([
  Dense(512, input_dim=117, activation='relu'),
  Dense(512,activation='relu'),
  Dropout(0.2),
  Dense(10, activation='softmax')
  ])
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])
model.fit(train, train_label, batch_size=64, nb_epoch=25, validation_data=(test, test_label))

(I also just noticed you replaced the default init with normal - I don’t think you should do that, since glorot initialization is a better idea, I believe)
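For reference, in Keras 1.x glorot_uniform is already the default init for Dense, so simply dropping the init argument gives Glorot initialization; a minimal sketch of the equivalent spellings:

# equivalent ways to get Glorot initialization in Keras 1.x (pick one):
model.add(Dense(512, input_dim=117, init='glorot_uniform', activation='relu'))  # explicit
# model.add(Dense(512, input_dim=117, activation='relu'))                       # implicit: glorot_uniform is the Dense default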

I used the model which you suggested.
model = Sequential([
  Dense(512, input_dim=117, activation='relu'),
  Dense(512, activation='relu'),
  Dropout(0.2),
  Dense(10, activation='softmax')
  ])
model.fit(train, train_label, batch_size=64, nb_epoch=25, validation_data=(test, test_label))

Results:
Epoch 22/25
45377/45377 [==============================] - 12s - loss: 1.7614 - acc: 0.4039 - val_loss: 1.7435 - val_acc: 0.4099
Epoch 23/25
45377/45377 [==============================] - 13s - loss: 1.7473 - acc: 0.4088 - val_loss: 1.7320 - val_acc: 0.4242
Epoch 24/25
45377/45377 [==============================] - 12s - loss: 1.7384 - acc: 0.4089 - val_loss: 1.7213 - val_acc: 0.4170
Epoch 25/25
45377/45377 [==============================] - 12s - loss: 1.7382 - acc: 0.4091 - val_loss: 1.9721 - val_acc: 0.4039

Any thoughts on where I might be going wrong?

Looks like you’re not overfitting any more. So you’ll have to think about your feature engineering, since it sounds like you have structured data. In general, deep learning isn’t the best tool for structured data - or at least I haven’t seen many people make it work well.

Thanks Jeremy, that echoes my view as well: “deep learning isn’t the best tool for structured data”.
I wanted to convince myself of that. Yes, this is structured data.
I experimented with standard machine learning models like a random forest on the same dataset and got an F1 score of 0.8602 (train) and 0.8138 (test); a sketch of that baseline is below.
Can I conclude that more research has to be done for structured data, and that deep learning is usually not the best tool?
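A minimal sketch of that kind of baseline, assuming scikit-learn (the integer-label variable names, hyperparameters, and weighted averaging are illustrative placeholders, not the exact setup):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# train / test: the same feature matrices fed to the Keras model
# train_label_int / test_label_int: integer class labels (placeholder names, not one-hot)
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf.fit(train, train_label_int)

print('F1 (train):', f1_score(train_label_int, rf.predict(train), average='weighted'))
print('F1 (test):', f1_score(test_label_int, rf.predict(test), average='weighted'))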

Yes I think that’s a reasonable conclusion - although I also think there’s no reason DL couldn’t turn out to be just as good as random forests if more people work on DL for structured data. It’s something I’d be interested in spending time on sometime, since I’ve been a major RF fan for a long time!

BTW in your most recent snippet you didn’t show the compile step. What learning rate did you use? Have you tried decreasing it a lot?

I used
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])

I also tried decreasing it further, with optimizer=Adam(lr=1e-06), but there wasn't much improvement:
Epoch 24/25
45377/45377 [==============================] - 12s - loss: 1.8287 - acc: 0.3665 - val_loss: 1.7992 - val_acc: 0.3839
Epoch 25/25
45377/45377 [==============================] - 12s - loss: 1.8294 - acc: 0.3665 - val_loss: 1.7990 - val_acc: 0.3839

Thanks for reporting back!

Are you one-hot encoding all the categorical variables? If not - you definitely need to. If some are very high cardinality (i.e. have many levels) use an Embedding layer for them instead.
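A minimal sketch of what that could look like for a single high-cardinality column, in Keras 1.x functional-API style (the cardinality, embedding width, and feature count below are made-up placeholders):

from keras.layers import Input, Embedding, Flatten, Dense, merge
from keras.models import Model

n_levels = 5000     # hypothetical cardinality of the high-cardinality column
emb_size = 16       # hypothetical embedding width
n_other = 100       # hypothetical number of remaining (one-hot / numeric) features

cat_in = Input(shape=(1,), dtype='int32')      # integer-encoded category
other_in = Input(shape=(n_other,))             # the rest of the features

cat_emb = Flatten()(Embedding(n_levels, emb_size, input_length=1)(cat_in))
x = merge([cat_emb, other_in], mode='concat')
x = Dense(512, activation='relu')(x)
out = Dense(10, activation='softmax')(x)

model = Model(input=[cat_in, other_in], output=out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])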

Yes, I am using one-hot encoding for the labels and also for the categorical features.
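For reference, a minimal sketch of that kind of preprocessing, assuming pandas and Keras 1.x (the dataframe, column names, and label array are placeholders):

import pandas as pd
from keras.utils.np_utils import to_categorical

# one-hot encode the categorical feature columns (df and column names are placeholders)
features = pd.get_dummies(df, columns=['cat_col_1', 'cat_col_2'])

# one-hot encode the integer class labels (0..9) for categorical_crossentropy
train_label = to_categorical(label_ints, nb_classes=10)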

Just wondering, what is the theory behind using an embedding layer for high-cardinality variables?

It’s purely a computational/memory saving. Rather than multiplying by a one-hot encoded matrix, which if high cardinality would be huge, it’s quicker and less memory intensive to simply use an integer to index into it directly. The result is, of course, identical.
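A tiny numpy sketch of that equivalence: multiplying a one-hot row by a weight matrix selects exactly the row you would get by indexing with the integer directly.

import numpy as np

n_levels, emb_size = 1000, 8
W = np.random.randn(n_levels, emb_size)   # weight / embedding matrix

idx = 42                                  # integer-encoded category
one_hot = np.zeros(n_levels)
one_hot[idx] = 1.0

# multiplying by the one-hot vector and indexing directly give the same row
assert np.allclose(one_hot.dot(W), W[idx])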

That is a very good tip for high cardinality variables.

It is! And not just for deep learning - for regression, GlmNet, etc I’ve always used this approach instead of creating dummy variables. :slight_smile:

Another technique which might be useful for categorical variables is the hashing trick, which maps each category into a fixed number of buckets (so distinct values can share an index).
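A minimal sketch of that idea using scikit-learn's FeatureHasher (the bucket count and the column=value strings are arbitrary placeholders):

from sklearn.feature_extraction import FeatureHasher

n_buckets = 256   # arbitrary; trades memory against collision rate
hasher = FeatureHasher(n_features=n_buckets, input_type='string')

# each row is a list of "column=value" strings for its categorical features
hashed = hasher.transform([['product=A', 'country=US'],
                           ['product=B', 'country=DE']])
print(hashed.shape)   # (2, 256)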

Could you explain a little bit more about using an embedding layer instead of one-hot encoding?

I have always used one-hot encoding, and of course it is very slow and memory-consuming.