Predict matching questions and answers

sinsji · April 15, 2019, 2:55pm

There is a Kaggle competition from CareerVillage.org that aims to match questions from users with suitable professionals who might be interested in answering them.

The data contains over 10,000 previously posted questions. Most questions have 1 or 2 answers; a few have up to 30. Both the questions and the answers have attributes such as ‘likes’, ‘hashtags’, and date and time. There are also attributes for the users who posted the questions and the users (professionals) who answered them, such as date of joining, profession, interests, groups joined, and (for professionals) previously received emails with suggested questions.

The professionals are notified of new open questions through daily and weekly emails, and through the groups and hashtags they follow. All this information is available.

The goal is to find a sensible way to predict, for each question, whether a specific professional would be interested in answering it.

Is there a logical/mathematical way to approach such a problem? I’m not really sure where to start. For each unique question there are only a couple of observations: most professionals did not answer THAT specific question, and usually only one or two did.

PS: I’m interested in the competition because it seems like a good dataset for practicing with text. And the concept of matching questions and answers could easily be generalized to topics outside of career choice.

shawn · April 15, 2019, 7:19pm

It sounds like there’s a lot of information to work with here. Have you considered this as a collaborative filtering task? It sounds like a good fit to me - you have many questions and many people answering them, and you would like to identify latent factors that predict who will want to answer the question. Lesson 4 covers this topic.
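As a rough illustration of the latent-factor idea, here is a minimal pure-Python sketch (not the fastai implementation from the lesson). It assumes hypothetical toy data in which each row is a (professional, question, answered) triple, with the 0 rows standing in for sampled "did not answer" pairs; the IDs, the function names and the hyperparameters are all made up for the example.

```python
import math
import random

# Hypothetical toy data: (professional_id, question_id, answered).
# 1 = the professional answered the question, 0 = a sampled non-answer.
interactions = [
    ("pro_a", "q1", 1), ("pro_a", "q2", 1), ("pro_a", "q3", 0),
    ("pro_b", "q1", 1), ("pro_b", "q3", 0), ("pro_b", "q4", 0),
    ("pro_c", "q3", 1), ("pro_c", "q4", 1), ("pro_c", "q1", 0),
]

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_factors(data, n_factors=4, epochs=500, lr=0.1, reg=0.01, seed=0):
    """Learn one latent vector per professional and per question with SGD
    on a logistic loss (dot product of the two vectors -> probability)."""
    rng = random.Random(seed)
    pros = sorted({p for p, _, _ in data})
    questions = sorted({q for _, q, _ in data})
    P = {p: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for p in pros}
    Q = {q: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for q in questions}
    for _ in range(epochs):
        for p, q, y in data:
            err = sigmoid(dot(P[p], Q[q])) - y  # gradient of the log loss
            for k in range(n_factors):
                pk, qk = P[p][k], Q[q][k]
                P[p][k] -= lr * (err * qk + reg * pk)
                Q[q][k] -= lr * (err * pk + reg * qk)
    return P, Q

def score(P, Q, pro, question):
    """Predicted probability that `pro` would answer `question`."""
    return sigmoid(dot(P[pro], Q[question]))

P, Q = train_factors(interactions)
for pro, question in [("pro_a", "q1"), ("pro_a", "q3"), ("pro_b", "q2")]:
    print(pro, question, round(score(P, Q, pro, question), 3))
```

The pair ("pro_b", "q2") never appears in the toy data, so its score comes purely from the learned factors. On the real data, how you sample the non-answers would probably matter a lot, since almost every professional did not answer almost every question.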

Lesson 4 also covers tabular data (such as the likes, date/time, hashtags in your data set).

Of course, the questions and answers themselves contain important text features which could be extracted from a language model. Eventually, all of these inputs could be combined into a single model.
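To make the text-feature idea concrete without training a language model, here is a much simpler stand-in: TF-IDF vectors and cosine similarity between a new question and the text of questions each professional has answered before. This is only a sketch under assumptions; the professional IDs, the example texts and the helper names are hypothetical, and a language-model embedding would replace `tfidf_vectors` in a real attempt.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep word characters plus '#' so hashtags survive.
    return re.findall(r"[a-z0-9#]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {token: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

def cosine(a, b):
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical data: text of questions each professional answered earlier.
history = {
    "pro_nurse": "How do I become a nurse? What classes help for nursing school? #nursing #medicine",
    "pro_dev": "Which programming language should I learn first? How do I get a software internship? #computer-science",
}
new_question = "What should I study in high school to get into nursing school? #nursing"

pro_ids = list(history)
vectors = tfidf_vectors([history[p] for p in pro_ids] + [new_question])
question_vec = vectors[-1]

# Rank professionals by similarity of their history to the new question.
ranked = sorted(
    ((p, cosine(vec, question_vec)) for p, vec in zip(pro_ids, vectors)),
    key=lambda item: -item[1],
)
for pro, sim in ranked:
    print(pro, round(sim, 3))
```

A similarity score like this could then be one more input column next to the tabular features when everything is combined into a single model.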

aychang · April 19, 2019, 4:09pm

This is going a little bit off topic from the course, but this looks like it could initially be set up as a bipartite graph model, where the users are one node set, the question posts are the second node set, and the edges connecting them are the answers. The nodes can have attributes: users have their profession/interests, and questions are represented by text embeddings from a language model you train with fastai. You can then try graph network techniques to predict edges between new questions and users, which would essentially “connect” the best professional to that specific question.
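A minimal sketch of the bipartite setup, in plain Python and with a simple path-counting heuristic swapped in for real graph network techniques: a (professional, question) pair is scored by the number of three-hop paths professional -> question they answered -> another professional -> target question. The answer records and function names below are hypothetical, and node attributes are left out, so a brand-new question with no answers yet would score zero for everyone; that cold-start case is where the text embeddings would have to come in.

```python
from collections import defaultdict

# Hypothetical answer records: (professional_id, question_id).
answers = [
    ("pro_a", "q1"), ("pro_a", "q2"),
    ("pro_b", "q1"), ("pro_b", "q2"), ("pro_b", "q3"),
    ("pro_c", "q4"),
]

def build_bipartite(edges):
    """Adjacency sets for both sides of the professional-question graph."""
    pro_to_q = defaultdict(set)
    q_to_pro = defaultdict(set)
    for pro, question in edges:
        pro_to_q[pro].add(question)
        q_to_pro[question].add(pro)
    return pro_to_q, q_to_pro

def three_hop_score(pro, question, pro_to_q, q_to_pro):
    """Count paths pro -> q_mid -> other professional -> question."""
    count = 0
    for q_mid in pro_to_q.get(pro, ()):
        for other in q_to_pro.get(q_mid, ()):
            if other != pro and question in pro_to_q.get(other, ()):
                count += 1
    return count

def rank_candidates(question, pro_to_q, q_to_pro):
    """Rank professionals who have not yet answered `question`."""
    candidates = [p for p in pro_to_q if question not in pro_to_q[p]]
    scored = [(p, three_hop_score(p, question, pro_to_q, q_to_pro))
              for p in candidates]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))

pro_to_q, q_to_pro = build_bipartite(answers)
ranking = rank_candidates("q3", pro_to_q, q_to_pro)
print(ranking)
```

Here pro_a shares two answered questions with pro_b, who answered q3, so pro_a ranks above pro_c, who has no overlap with anyone.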

It seems like you are interested in text-based models in fastai, so graph networks may not be something you want to pursue, but I see it as an interesting approach to this problem/competition.

sinsji · April 20, 2019, 12:10pm

Thanks for both suggestions. I’m not particularly interested in the subject of the competition (career advice), but the problem itself is a very good challenge for me. I know a bit about statistics: enough to get confused, too little to be useful.

I will read up about both approaches:

collaborative filtering / recommendation systems
graph models

The careervillage.org competition is almost over, so I will use this as practice for future encounters.

PS: recommendations for further reading (besides the lesson 4 video) are appreciated, of course! I found that the advanced discussion for lesson 4 has more info on recommendation systems.

Thanks again.