Transfer Learning with OpenAI's Unsupervised Sentiment Neuron?

So OpenAI just announced they built a highly effective sentiment model based on training an LSTM to predict the next character in a product review.

They put a lot of compute time into the first training step which I don’t have the money or time to reproduce…

We first trained a multiplicative LSTM with 4,096 units on a corpus of 82 million Amazon reviews to predict the next character in a chunk of text. Training took one month across four NVIDIA Pascal GPUs, with our model processing 12,500 characters per second.

But their code appears to include the saved weights, so I’m interested in whether there’s an opportunity to use the learned representations for some other text task. Any thoughts on where this idea breaks down? Where the pitfalls might be?
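For concreteness, here’s roughly what I’m picturing: pull the final hidden state of their pre-trained mLSTM out as a document feature and fit a simple linear probe on top, which is what their paper describes. The `encoder.Model` / `transform` names below are from memory of their repo and may not match exactly, so treat this as a sketch rather than working code.

```python
# Rough sketch of the "features + linear probe" approach from their post.
# Assumes the repo's encoder.Model class and its transform() method, which
# returns the final mLSTM hidden state per document -- check their README,
# the exact names may differ.
from encoder import Model                      # openai/generating-reviews-discovering-sentiment
from sklearn.linear_model import LogisticRegression

mlstm = Model()                                # loads the pre-trained 4096-unit mLSTM weights

train_texts = ["great product, works as advertised",
               "arrived broken, very disappointed"]
train_labels = [1, 0]

X_train = mlstm.transform(train_texts)         # shape (n_docs, 4096)
clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")  # L1 probe, as in the paper
clf.fit(X_train, train_labels)

# At inference time: clf.predict(mlstm.transform(new_texts))
```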


Looks pretty cool, but it will probably fail outside of its training domain (they even test that and mention it in their paper).

Our work highlights the sensitivity of learned representations to the data distribution they are trained on. The results make clear that it is unrealistic to expect a model trained on a corpus of books, where the two most common genres are Romance and Fantasy, to learn an encoding which preserves the exact sentiment of a review. Likewise, it is unrealistic to expect a model trained on Amazon product reviews to represent the precise semantic content of a textual description of an image or a video.

Their model apparently doesn’t perform as well even on other review datasets (e.g. Yelp), probably because the things reviewed on Yelp are very different from those on Amazon.

Thanks for the thoughts! I totally understand what they’re saying, but I wonder if their model has learned some things that could be useful on other corpora. For example, I was thinking more about fine-tuning like we did with VGG in part 1: crop off the top layer(s) of the model, freeze the remaining weights, add some new layers on top, and then retrain on a new dataset. I guess I should just try it :slight_smile:
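Very roughly, I mean the usual Keras freeze-and-add-a-head pattern below. A plain LSTM stands in for their multiplicative LSTM (Keras doesn’t ship an mLSTM cell), so actually reusing their weights would mean porting the cell or driving their TensorFlow graph directly; all sizes here are placeholders.

```python
# Minimal Keras sketch of the VGG-style "freeze the base, add a head" pattern.
# A plain LSTM stands in for the pre-trained multiplicative LSTM here -- Keras
# has no built-in mLSTM, so reusing the OpenAI weights directly would need a
# custom cell or their TensorFlow graph.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

base = Sequential([
    Embedding(input_dim=256, output_dim=64),   # byte/character-level inputs (placeholder sizes)
    LSTM(4096),                                # imagine the pre-trained weights loaded in here
])
for layer in base.layers:
    layer.trainable = False                    # freeze the pre-trained representation

model = Sequential([
    base,
    Dense(256, activation="relu"),             # new layers trained on the target task
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_chars, y, ...) on the new corpus, then optionally unfreeze and fine-tune.
```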

I think it’s a one-layer model, isn’t it? Just a single multiplicative LSTM(4096).

Something I was wondering: could you maybe use this to create context-aware word vectors (at least for this corpus)?

i.e.

Grab the* hidden* state* at* the* asterisks*

h1 - h0 --> context vector for Grab
h2 - h1 --> context vector for the

etc

I’d expect that this would have similar vectors for words used in similar contexts but might have different vectors for the same word with different meanings (e.g. https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo).
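In rough Python, assuming you can step the pre-trained model one character at a time and read the hidden state back out (their released code only returns the final state, so the `step_fn` below is hypothetical):

```python
# Hypothetical sketch of the hidden-state-difference idea. step_fn is made up:
# it would advance the pre-trained mLSTM by one character and return the new
# 4096-d hidden state, which you'd need to expose yourself.
import numpy as np

def word_context_vectors(text, step_fn, state_size=4096):
    """Return one context vector per word: h_after_word - h_before_word."""
    state = np.zeros(state_size)
    prev_state = state.copy()
    vectors = []
    for word in text.split():
        for ch in word + " ":                  # feed the word plus its trailing space
            state = step_fn(state, ch)
        vectors.append(state - prev_state)     # e.g. h1 - h0 for "Grab", h2 - h1 for "the", ...
        prev_state = state.copy()
    return vectors

# vecs = word_context_vectors("Grab the hidden state at the asterisks", step_fn)
```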

I imagine it couldn’t hurt to add the activations from this model as an additional feature to any model you’re building, i.e. as well as your word embeddings, have the pre-trained activations as an additional input.
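e.g. something along these lines in Keras, where the document-level mLSTM activations are pre-computed and fed in as a second input; all the names and sizes are placeholders rather than anything from the OpenAI code:

```python
# Sketch of feeding pre-computed 4096-d mLSTM activations in alongside your
# own word-embedding branch.
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, concatenate

words_in = Input(shape=(100,), name="word_ids")          # your tokenised text
x = Embedding(input_dim=20000, output_dim=100)(words_in)
x = LSTM(128)(x)

mlstm_in = Input(shape=(4096,), name="mlstm_features")   # pre-computed sentiment-neuron activations
merged = concatenate([x, mlstm_in])

out = Dense(1, activation="sigmoid")(merged)
model = Model(inputs=[words_in, mlstm_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```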

You could even then try connecting up the full pre-trained model end-to-end so as to fine-tune it, although it is pretty big (4,096 hidden activations).