How to build a QA system with training data and a Wikipedia corpus

mushroom · April 20, 2019, 12:52pm

Hello,
I am interested in NLP, and I’m thinking about building a true/false question-answering system. I have a training dataset of questions and labels (true / false), where the answers are based on Wikipedia. For example, for the question “Is Tom Cruise in the movie The Avengers?”, the answer is “false”. The basic idea is to treat a question as a query, do information retrieval over Wikipedia, and return true or false.

I’m thinking of using an RNN model to build this system, but I am not very familiar with RNNs. How should I feed the training dataset and the Wikipedia data into the model as inputs?

I have an idea, but I am not sure if it is correct: can I train an RNN model on the Wikipedia data ONLY, then fine-tune that trained model on my training dataset for classification (as a kind of transfer learning)?

Moreover, is there a way to return evidence in addition to answering true / false? For example, if the answer to a question is “true”, the system would return “true” along with strings from Wikipedia that support the answer.

thank you

poppingtonic · April 25, 2019, 1:36am

Hi @mushroom, the main guidance you’re looking for can be found in lesson 4 of this year’s course, available here. That’s a link to the in-class discussion when the course was live, and here’s a link to the video. It’ll show you how to build a classifier in the exact way you’ve described. In fact, you won’t even need to pretrain a model on Wikipedia first: one has already been built and published, so you’ll just fine-tune it for your task. Later, once you’ve tried the method on your own dataset, you can look in the community forum for how to extend the model to a question-answering system.
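
On the evidence part of the question: one simple option, separate from the lesson 4 classifier, is to score Wikipedia sentences against the question and return the best matches as supporting strings. Here is a minimal sketch in plain Python using TF-IDF and cosine similarity. It is only an illustration under assumptions: the three corpus sentences are made-up toy text rather than real Wikipedia extracts, and the names `STOPWORDS`, `tokenize`, `build_idf`, `tfidf`, `cosine` and `retrieve_evidence` are hypothetical helpers, not part of any library.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"is", "the", "in", "a", "an", "of", "and"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_idf(sentences):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(sentences)
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(tokenize(sentence)))
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_evidence(question, sentences, k=2):
    """Return up to k (score, sentence) pairs with non-zero similarity, best first."""
    idf = build_idf(sentences)
    query = tfidf(question, idf)
    scored = [(cosine(query, tfidf(s, idf)), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, s) for score, s in scored[:k] if score > 0.0]


# Toy stand-in for a Wikipedia corpus (made-up sentences for illustration).
corpus = [
    "The Avengers is a 2012 superhero film starring Robert Downey Jr. and Chris Evans.",
    "Tom Cruise starred in the film Top Gun.",
    "Paris is the capital of France.",
]
question = "Is Tom Cruise in the movie The Avengers?"
evidence = retrieve_evidence(question, corpus, k=2)
for score, sentence in evidence:
    print(f"{score:.3f}  {sentence}")
```

This only finds candidate evidence; it does not decide true or false. One way to combine the two, again as an assumption rather than something from the lesson, is to pair each question with its retrieved sentences and pass that combined text to the fine-tuned classifier, then show the same sentences alongside the predicted label.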

mushroom · April 25, 2019, 6:14am

Haha, I can do image tasks, but text tasks are still new to me. Thank you so much for your advice!

poppingtonic · April 25, 2019, 8:30am

Glad you’re enjoying it so far. Welcome to the forum!