When are character embeddings better than word embeddings and vice-versa?

wgpubs · June 9, 2017, 3:55pm

In my particular case, I need to build a classifier(s) for sentiment analysis to determine if a given piece of text represents something positive/negative, a threat, a suggestion, a complaint, and/or is nonsense.

I’m thinking about building a separate classifier for each (e.g., one for determining if the text is positive or negative, one for determining if it is a threat, etc…) since each piece of copy can have multiple characteristics.
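To make the "one classifier per characteristic" idea concrete, here is a minimal sketch of how the training targets could be laid out: each text carries a set of labels, and each label gets its own 0/1 column that a separate binary classifier would be trained on. The label names come from the question above; the example texts and the helper `to_binary_targets` are made up for illustration.

```python
# Label names taken from the question; the example texts below are invented.
LABELS = ["positive", "threat", "suggestion", "complaint", "nonsense"]


def to_binary_targets(label_sets, labels=LABELS):
    """Turn one set of labels per text into one 0/1 target list per label."""
    return {
        label: [1 if label in text_labels else 0 for text_labels in label_sets]
        for label in labels
    }


examples = [
    ("Great product, but please add dark mode", {"positive", "suggestion"}),
    ("Fix this or I will find you", {"threat", "complaint"}),
    ("asdf qwer zxcv", {"nonsense"}),
]

# One target column per label; each column would feed its own binary classifier.
targets = to_binary_targets([labels for _, labels in examples])
for label, column in targets.items():
    print(label, column)
```

Because every column is independent, a single text can be flagged by several classifiers at once, which is the multi-label behaviour described above.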

My questions are two-fold:

Specifically, should I attempt to accomplish this using character embeddings or word embeddings?
Generally, what kind of NLP problems are best suited to looking at characters one-by-one vs. looking at words, and vice-versa?

Based on what I infer from the course, I’m inclined to believe that word embeddings are more appropriate to sentiment analysis while character embeddings are more appropriate for predicting things like the next character or generating a bunch of similarly worded text based on a sample.

If anyone has more specific NLP resources they could share that would be great. Most of my work-work seems to be going in this direction and I’d really like to understand what architectures work best for one scenario over another.

Even · June 9, 2017, 4:21pm

Your intuition that you’d use word embeddings is correct. If you think about the problem space, characters don’t carry any sentiment; ‘a’ is no more threatening than ‘e’. Words, on the other hand, do carry sentiment, and tuples or sequences of words carry even more, but they explode the dimensionality of the problem.
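As a rough illustration of the dimensionality point, the sketch below counts the word n-grams observed in a tiny invented corpus and compares that with the number of n-grams that are possible in principle, which is the vocabulary size raised to the power n. The corpus and the helper `word_ngrams` are assumptions for the example, not anything from the course.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Tiny made-up corpus, split on whitespace for simplicity.
corpus = [
    "i will find you",
    "i will help you",
    "you will find help",
]
tokenized = [sentence.split() for sentence in corpus]
vocab = {word for tokens in tokenized for word in tokens}

for n in (1, 2, 3):
    observed = {gram for tokens in tokenized for gram in word_ngrams(tokens, n)}
    possible = len(vocab) ** n  # upper bound on distinct n-grams over this vocabulary
    print(f"n={n}: observed={len(observed)}, possible={possible}")
```

The upper bound grows exponentially in n, so with a realistic vocabulary of tens of thousands of words most bigrams and trigrams are never seen in training.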

Siraj Raval has a quick introductory video on how to do this using TensorFlow. His videos are a little flashy and quick for my taste, but he’s a good resource for deep learning via TensorFlow.

gilbert · June 11, 2017, 4:00pm

I’d say that it really depends on your dataset and model. For instance, if you’re looking at a dataset made of Twitter feeds, character-based or character->word embeddings make a lot of sense because of the nature of the syntax used, unlike a dataset where misspelled or custom words are highly unlikely.
That’s not to say that character-based models won’t give you an extra % gain (they might in both cases), but from a model/effort/margin point of view, sometimes it doesn’t make sense.
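A small sketch of why character-level features help with noisy text: a misspelled word is simply missing from a word-level vocabulary, but it still shares most of its character trigrams with the correct spelling. The vocabulary, the typo, and the helpers `char_ngrams` and `jaccard` are invented for this example; real subword models are more involved.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)


# Hypothetical word-level vocabulary and a tweet-style misspelling.
word_vocab = {"great", "terrible", "service"}
typo = "greaat"

# Word level: the typo is out of vocabulary, so it has no embedding to look up.
print(typo in word_vocab)

# Character level: the typo overlaps heavily with "great" and not with "terrible".
sim_great = jaccard(char_ngrams("great"), char_ngrams(typo))
sim_terrible = jaccard(char_ngrams("terrible"), char_ngrams(typo))
print(sim_great, sim_terrible)
```

On clean, well-edited text the out-of-vocabulary case is rare, which is why the extra modelling effort may not pay off there.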

Another question you should ask yourself: can you use pre-trained embeddings for your dataset? And if so, which ones?

my 2 cents.