Document clustering and keyword extraction

RiB · July 19, 2017, 1:51pm

Hi all,

I have a few days free and was thinking of applying some of the techniques learned in the course to extract keywords from documents and/or cluster documents, and possibly write a blog post about it.

I retrieved a few years of Ask Ubuntu questions (user + title + body + keywords) and I want to answer one or more of the following questions (ideally the first two, and then work out how to approach the others):

Can I extract keywords using title and/or body?
Can I cluster questions together based on content?
Can I cluster users based on the type of questions they ask?
Can I recommend questions to users to answer based on the questions they previously asked or answered?

Does anyone have experience with this and could perhaps share a working pipeline?

I wanted to start with TF-IDF plus a classifier as a baseline, then apply deep learning to see if I can improve on it. I was thinking an LSTM architecture could work for keyword extraction, and I was curious about using Mean-Shift clustering on the TF-IDF matrix, but perhaps some of you have hands-on experience or better ideas?
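
To make the TF-IDF starting point concrete, here is a rough sketch of the kind of thing I have in mind: rank each question's terms by TF-IDF and take the top few as candidate keywords. It is pure Python, so it runs without extra libraries. The function names (tokenize, tfidf_keywords), the naive regex tokenizer, the smoothed IDF variant, and the toy questions are all assumptions for illustration, not a tested pipeline, and there is no stop-word removal yet.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase, keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            # Relative term frequency times a smoothed IDF (one common variant).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


if __name__ == "__main__":
    # Toy stand-ins for question titles (made up for this example).
    questions = [
        "How do I install nvidia drivers on Ubuntu",
        "Ubuntu boot is stuck at grub after update",
        "How do I mount an NTFS partition on Ubuntu",
    ]
    for title, keywords in zip(questions, tfidf_keywords(questions)):
        print(title, "->", keywords)
```

Terms that appear in every question (like "ubuntu" here) get the lowest IDF and drop out of the top ranks, which is the behaviour I would want from a baseline. The extracted terms could then be compared against the existing keywords on each question to get a first score to beat.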

Thanks!