Objective: Discover N themes in a collection of user reviews, then link each individual review back to the discovered themes.
Data: I have a decent-sized corpus that includes collections of reviews, the human-derived themes for each collection, and which themes each review is associated with. Most reviews are associated with a single theme, and none with more than five. There are at most 20 themes per set of reviews.
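To make the data description concrete, here is one possible way to hold a single set of reviews, its themes, and the review-to-theme links, and to turn the links into multi-hot rows. This is only a sketch: the `ReviewSet` class, its field names, and the three example reviews are all invented for illustration and are not taken from the actual corpus.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewSet:
    """One set of reviews plus the human-derived themes for that set (hypothetical layout)."""
    reviews: list[str]
    themes: list[str]  # at most 20 per set, per the description above
    # review index -> indices of the themes it is associated with (1 to 5 each)
    links: dict[int, list[int]] = field(default_factory=dict)

    def multi_hot(self) -> list[list[int]]:
        """Return one 0/1 row per review, with one column per theme in this set."""
        rows = []
        for i in range(len(self.reviews)):
            linked = set(self.links.get(i, []))
            rows.append([1 if t in linked else 0 for t in range(len(self.themes))])
        return rows


# Invented toy data, only to show the shape of one training example.
example = ReviewSet(
    reviews=[
        "Shipping took three weeks",
        "Great price, slow delivery",
        "Support never replied",
    ],
    themes=["Delivery speed", "Pricing", "Customer support"],
    links={0: [0], 1: [0, 1], 2: [2]},
)
print(example.multi_hot())  # [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
```

Because the themes differ from set to set, the columns of the multi-hot matrix only mean something within a single `ReviewSet`; they are not a global label space.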
Where I’m at now:
I’m looking for recommendations on 1) how to structure the datasets for ML and 2) candidate model architectures that might prove helpful.
I was thinking a seq2seq model might be a good place to start but I’m not sure how the X, y pairs should be created in order to discover multiple themes.
Given X observations, divide them into Y groups: that sounds like clustering to me. I'm not sure whether that works for text data, though.
Based on the data, maybe a multi-class classifier with the themes as the classes (or multi-label, since a review can have more than one theme)? A custom head on the NLP model that we worked with?
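Clustering can work on text once each review is turned into a vector. As a rough, dependency-free sketch: the code below uses plain bag-of-words counts (a stand-in for TF-IDF or learned embeddings) and a simple k-means-style loop with cosine similarity. The function names, the deterministic initialisation, and the four toy reviews are all assumptions made for illustration; in practice a library implementation would be the more likely choice.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words counts; a stand-in for TF-IDF or learned embeddings."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def kmeans(vectors, k, iterations=10):
    """Group vectors into k clusters and return one cluster label per vector."""
    # Deterministic farthest-point initialisation: start from the first vector,
    # then keep adding the vector least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector goes to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        # Update step: cosine ignores scale, so the summed counts serve as the centroid.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[j] = total
    return labels


# Invented toy reviews with two obvious groups.
reviews = [
    "delivery was slow",
    "slow delivery again",
    "price too high",
    "high price for this",
]
labels = kmeans([vectorize(r) for r in reviews], k=2)
print(labels)  # [0, 0, 1, 1]
```

Note that this needs the number of groups up front, and each review lands in exactly one cluster, so it does not by itself cover reviews that belong to several themes.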
The themes will vary by set of reviews and from year to year, so I don’t think I’ll be able to set them up as fixed labels for classification. They will need to be discovered, if possible.
And as I mentioned above, I do have training data in the form of reviews, themes, and which reviews are associated with which themes.
I’m wondering whether a good approach might be to grab the hidden-state representations of each review and then apply something like LDA/PCA to discover the various clusters. From that, perhaps a model could be trained to summarize all the reviews in a given cluster … that summary would be the “theme”.
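If clusters can be found this way, the remaining step from the objective is linking each review back to one or more of them. One hedged way to do that is to compare each review vector with the cluster centroids and keep every centroid above a similarity cutoff, capped at five. The `link_to_themes` function, the `0.5` threshold, and the hand-made 3-dimensional vectors below are all assumptions for illustration; real hidden states would be much higher-dimensional and the cutoff would need tuning against the labelled data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_to_themes(review_vec, centroids, threshold=0.5, max_themes=5):
    """Return indices of the most similar centroids, best first; always at least one."""
    scored = sorted(
        ((cosine(review_vec, c), i) for i, c in enumerate(centroids)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    picked = [i for score, i in scored if score >= threshold][:max_themes]
    # Fall back to the single closest centroid if nothing clears the threshold.
    return picked or [scored[0][1]]


# Hand-made 3-d vectors standing in for real hidden-state embeddings.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(link_to_themes([0.9, 0.1, 0.0], centroids))  # [0]
print(link_to_themes([0.8, 0.6, 0.0], centroids))  # [0, 1]
```

The cap of five mirrors the maximum number of themes per review in the data; the fallback ensures every review is linked to at least one theme.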
I dunno. Just talking out loud at this point.