Objective: Discover N themes in a collection of user reviews, then link each individual review back to the discovered themes.
Data: I have a decent-sized corpus that includes a collection of reviews, the human-derived themes based on those reviews, and which themes each review is associated with. Most reviews are associated with a single theme, and none with more than five. There are at most 20 themes per set of reviews.
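To make the data description concrete, here is a rough sketch of how one set of reviews could be laid out, with the review-to-theme links turned into multi-hot label rows. The class name `ReviewSet`, its field names, and the example reviews and themes are all made up for illustration; only the limits (at most 20 themes per set, at most 5 themes per review) come from the description above.

```python
from dataclasses import dataclass

MAX_THEMES_PER_SET = 20     # limit taken from the data description
MAX_THEMES_PER_REVIEW = 5   # limit taken from the data description


@dataclass
class ReviewSet:
    """One set of reviews with its human-derived themes (hypothetical layout)."""
    reviews: list[str]
    themes: list[str]
    links: list[list[int]]  # links[i] holds indices into `themes` for reviews[i]

    def validate(self) -> None:
        """Check the set against the limits stated in the description."""
        if len(self.themes) > MAX_THEMES_PER_SET:
            raise ValueError("too many themes in this set")
        if len(self.links) != len(self.reviews):
            raise ValueError("need one list of theme indices per review")
        for idxs in self.links:
            if len(idxs) > MAX_THEMES_PER_REVIEW:
                raise ValueError("a review is linked to more than 5 themes")
            if any(i < 0 or i >= len(self.themes) for i in idxs):
                raise ValueError("theme index out of range")

    def multi_hot(self) -> list[list[int]]:
        """Return one 0/1 row per review, with one column per theme in this set."""
        n_themes = len(self.themes)
        return [[1 if j in idxs else 0 for j in range(n_themes)]
                for idxs in self.links]


# Invented example data, only to show the shape of one set.
example = ReviewSet(
    reviews=[
        "Battery dies by noon.",
        "Great screen, but the battery is weak.",
        "Shipping took three weeks.",
    ],
    themes=["battery life", "display quality", "delivery"],
    links=[[0], [0, 1], [2]],
)
example.validate()
labels = example.multi_hot()
```

Note that the columns of `labels` are only meaningful within one set, since each set of reviews has its own themes. That is part of why I'm unsure how to frame this as a standard supervised problem.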
Where I’m at now:
I’m looking for recommendations on 1) how to structure the datasets for ML and 2) candidate model architectures that might prove helpful.
I was thinking a seq2seq model might be a good place to start, but I’m not sure how the X, y pairs should be constructed so that the model can discover multiple themes.