Advice on building a multi-class neural network on 50K records of short text data (up to 10 words)

Hi all,

This is a bit of a conceptual question, but I wanted to reach out to get your opinion and perhaps some pointers to good references. As noted in the title, I’m building a NN on top of short textual data and would like to classify it into 20 different categories. The dataset has roughly 50K observations with quite severe class imbalance: the smallest class has roughly 100 rows (I’m using the case weight option in keras to combat that a bit).
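For readers wondering what such weights might look like, here is a minimal Python sketch of inverse-frequency class weights. The formula (n_samples / (n_classes * class_count)) is the common "balanced" heuristic, not necessarily the weighting used in this post, and the toy label counts are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical toy labels: class 0 is common, class 2 is rare.
labels = [0] * 900 + [1] * 90 + [2] * 10
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

A dictionary of this shape, keyed by class index, is what the `class_weight` argument of Keras `fit` expects, so rare classes contribute more to the loss per example.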

I’ve tried multiple different solutions already:

  1. Started with simple dense architectures on top of a prepared document-feature matrix (DFM)
  2. Then built the same model, but with an embedding layer added
  3. Also experimented with RNNs and LSTMs, both using an embedding layer

All of those solutions obviously perform differently, but even though the model reaches 99% accuracy on both the training and validation samples during training, the test accuracy always stagnates at around 86% for any type of bootstrap from that population (I’ve tried different resamples too to verify that). I have already experimented with a lot of different settings and architectures and added regularization, and nothing has really helped me break that barrier.

With all of that in mind, I’m beginning to wonder whether there’s something I’m missing here, apart from the obvious answer, which is “get more training data”. For instance:

  • What would be the recommended number of tokens to consider for such a task?
  • What should be the size of the embedding (in the context of the number of tokens)?
  • Why isn’t my RNN/LSTM performing much better than a simple dense NN applied to a pruned DFM?
  • Is it worth pruning my dictionary before passing it into the tokenizer and applying RNN / LSTM?
  • What should be my batch size and number of epochs? Currently the accuracy of the model stops increasing at epoch = 3 with a batch size of 32

Any recommendations and advice would be really welcome! If you have any references to papers that discuss a similar topic it would be great if you can pass them along.

What classifier are you using in the final layer? Is it logistic regression? Perhaps you can try attention too.

I assume you have tried using a pretrained ULMFiT already?

My thoughts for the moment:

  1. 86% accuracy on 20 categories does not sound that bad to me, but it very much depends on the task and the kind of categories
  2. Check for mislabelings and ill-defined categories
  3. If you are categorising based on keywords, then you might try some preprocessing and BoW methods (NBSVM has worked well for me in the past, although in my case I ended up using ULMFiT)
  4. What is the loss doing? The accuracy is OK, but it sounds to me like you are overfitting
  5. Have you tried increasing dropouts?
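As a rough illustration of the bag-of-words idea in point 3, below is a minimal Python sketch of a plain multinomial Naive Bayes classifier over word counts. It is not NBSVM itself (which feeds Naive Bayes log-count ratios into a linear model), only the simplest keyword-driven baseline of that family. The toy texts, the labels, and the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on lowercased, whitespace-split text."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter(labels)      # class -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "class_counts": class_counts,
            "vocab": vocab, "alpha": alpha, "n_docs": len(labels)}

def predict_nb(model, text):
    """Return the class with the highest log posterior for one text."""
    # Words never seen in training carry no information and are dropped.
    tokens = [t for t in text.lower().split() if t in model["vocab"]]
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_docs / model["n_docs"])  # log prior
        for t in tokens:
            # Laplace-smoothed log likelihood of the word given the class
            score += math.log((counts[t] + alpha) / (total + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, two classes only
texts = ["refund my order", "order never arrived",
         "reset my password", "password login fails"]
labels = ["shipping", "shipping", "account", "account"]
model = train_nb(texts, labels)
print(predict_nb(model, "forgot password"))
```

If a baseline like this lands close to the neural models on the held-out test set, that would suggest the categories are mostly keyword-driven and the remaining errors come from the data rather than the architecture.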

Yes, with a softmax activation. Haven’t heard of attention - what is that?

Ad. ULMFiT - haven’t tried that yet, is it available for R?
Ad. 2 - yes, I’m currently in the process of that along with adding more training data
Ad. 3 - what specific preprocessing do you have in mind? I haven’t come across many recommendations to do preprocessing for embedding applications. Regarding the embeddings themselves, I tried training them as one of my final network layers, and I would still need to try using word2vec separately to train them.
Ad. 4 - I didn’t quite get the question. I also thought that I was overfitting, but regardless of what I change the output remains more or less the same (including across different resamples)
Ad. 5 - yes, training accuracy dropped across all resamples

I think part of the problem is that the entire dataset is composed of approx. 20 quite unique subsamples (pieces of text that are in some sense quite different from each other) and there is simply not enough generalisation within the set itself at the moment.

Just so you know (apologies if I am assuming wrong): you are in the forum of the courses and library, so I had assumed you had watched the videos.

In case you haven’t, I strongly recommend you do, as the NLP section will get you up to speed on ULMFiT and transfer learning for NLP.

Increasing the dropout is expected to make your training accuracy drop, but your objective is to get the validation loss down, so that’s fine.

The loss is different from your accuracy metric, which might stay level while the loss moves, but again I highly recommend you have a look at the course. You won’t regret it.
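To make the loss-versus-accuracy distinction concrete, here is a small Python sketch with made-up predicted probabilities. Both sets of predictions pick the same classes, so accuracy is identical, but the cross-entropy loss differs because one set is more confident on the rows it gets right. The numbers are purely illustrative.

```python
import math

def cross_entropy(probs, targets):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def accuracy(probs, targets):
    """Fraction of rows whose highest-probability class is the true class."""
    hits = sum(1 for p, t in zip(probs, targets) if p.index(max(p)) == t)
    return hits / len(targets)

targets = [0, 1, 2]

# Hypothetical predictions: same argmax per row, different confidence.
hesitant = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.6, 0.2, 0.2]]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.6, 0.2, 0.2]]

for name, probs in [("hesitant", hesitant), ("confident", confident)]:
    print(name, round(accuracy(probs, targets), 3),
          round(cross_entropy(probs, targets), 3))
```

This is why a flat accuracy curve does not by itself say whether the model is still improving or starting to overfit; the loss can keep moving underneath it.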

I’m almost done with part 1 (there’s 1 video to go if I remember correctly). However, I don’t think ULMFiT has been covered yet so I’ll just keep on watching.

Regarding the loss/accuracy topic, below are the prints of each epoch:

Epoch 1/3
[==============================] - 42s 2ms/step - loss: 0.3946 - acc: 0.9052 - val_loss: 0.1381 - val_acc: 0.9642
Epoch 2/3
[==============================] - 41s 2ms/step - loss: 0.0860 - acc: 0.9789 - val_loss: 0.0594 - val_acc: 0.9840
Epoch 3/3
[==============================] - 38s 2ms/step - loss: 0.0411 - acc: 0.9911 - val_loss: 0.0285 - val_acc: 0.9941

So from the training process it doesn’t seem that I’m overfitting at all. Then if I run it on the test set I get the following results:

[1] 0.8739776

[1] 0.8683386

Your output looks a bit unfamiliar to me, so I suspect you are using an older version of the course.

I am not sure whether the “loss” I am seeing is training loss or validation loss: you would want to compare the movements of the two to gauge whether you are overfitting or not.
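One simple way to compare the two movements is sketched below in Python, using the `loss` and `val_loss` values from the epoch log posted above. The rule used (training loss falling while validation loss rises) is only a rough heuristic, and the function name `overfitting_epochs` is made up for this example.

```python
# Per-epoch values copied from the Keras log posted earlier in the thread
history = {
    "loss":     [0.3946, 0.0860, 0.0411],
    "val_loss": [0.1381, 0.0594, 0.0285],
}

def overfitting_epochs(history):
    """Return 1-based epochs where training loss fell but validation loss rose."""
    flagged = []
    for i in range(1, len(history["loss"])):
        train_down = history["loss"][i] < history["loss"][i - 1]
        val_up = history["val_loss"][i] > history["val_loss"][i - 1]
        if train_down and val_up:
            flagged.append(i + 1)
    return flagged

flagged = overfitting_epochs(history)
print(flagged)
```

On these three epochs nothing is flagged, since both losses fall together. That only says the validation split looks like the training data; it does not explain the gap to the separate test set.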

The version of the course that was just made public yesterday covers the NLP part in lesson 4, so you might want to have a look at it.
Or maybe just head to the docs:
