How to build a QA system with training data and a Wikipedia corpus

#1

Hello,
I am interested in NLP, and I’m thinking about building a true/false question-answering system. I have a training dataset of questions and labels (true / false), where the answers are based on Wikipedia. For example, for the question “Is Tom Cruise in the movie The Avengers?”, the answer is “false”. The basic idea is to treat a question as a query, do information retrieval over Wikipedia, and return true or false.
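To make the "question as a query" idea concrete, here is a minimal sketch of the retrieval step using plain TF-IDF scoring. The `PASSAGES` list is a hypothetical stand-in for Wikipedia text, and the function names (`tokenize`, `build_index`, `retrieve`) are made up for illustration. A real system would index a Wikipedia dump with a proper search engine.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for Wikipedia passages (assumption, not real data).
PASSAGES = [
    "Tom Cruise starred in the film Top Gun and the Mission Impossible series.",
    "The Avengers is a 2012 film starring Robert Downey Jr. and Chris Evans.",
    "Paris is the capital city of France.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(passages):
    """Return per-passage term counts and a smoothed IDF table."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # document frequency: one count per passage
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return docs, idf

def score(query_tokens, doc, idf):
    """Sum TF * IDF over the distinct query terms."""
    return sum(doc[t] * idf.get(t, 0.0) for t in set(query_tokens))

def retrieve(question, passages, k=2):
    """Return the k passages that score highest against the question."""
    docs, idf = build_index(passages)
    q = tokenize(question)
    ranked = sorted(range(len(passages)),
                    key=lambda i: score(q, docs[i], idf),
                    reverse=True)
    return [passages[i] for i in ranked[:k]]

top = retrieve("Is Tom Cruise in the movie The Avengers?", PASSAGES, k=2)
```

The retrieved passages would then be handed to a classifier that decides true or false. The retrieval step by itself does not answer the question.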

I’m thinking of using an RNN model to build this system, but I am not very familiar with RNNs. How should I feed the training dataset and the Wikipedia data into the model as inputs?
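One common way to feed both pieces of text to a sequence model is to join the question and a retrieved passage into a single token sequence with a separator, then map tokens to integer ids. The sketch below shows only that input-preparation step. The special tokens, `max_len`, and the helper names are assumptions for illustration, not part of any particular library.

```python
import re

# Special tokens (names chosen for illustration).
PAD, UNK, SEP = "<pad>", "<unk>", "<sep>"

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(texts):
    """Assign an integer id to every token seen in the given texts."""
    vocab = {PAD: 0, UNK: 1, SEP: 2}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode_pair(question, passage, vocab, max_len=16):
    """Encode 'question <sep> passage' as a fixed-length list of ids."""
    tokens = tokenize(question) + [SEP] + tokenize(passage)
    ids = [vocab.get(t, vocab[UNK]) for t in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))  # right-pad

def encode_label(label):
    """Map the string labels 'true' / 'false' to 1 / 0."""
    return 1 if label.strip().lower() == "true" else 0

question = "Is Tom Cruise in the movie The Avengers?"
passage = "The Avengers is a 2012 film."
vocab = build_vocab([question, passage])
x = encode_pair(question, passage, vocab)
y = encode_label("false")
```

Each `(x, y)` pair is then one training example for a binary classifier. The RNN itself (embedding layer, recurrent layer, output layer) is not shown here.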

I have this idea, but I am not sure if it is correct: can I train an RNN model on the Wikipedia data ONLY, and then fine-tune this trained model on my training dataset for classification (as a kind of transfer learning)?

Moreover, is there a way to return evidence in addition to the true / false answer? For example, if the answer to a question is “true”, the system would return “true” together with the strings from Wikipedia that support the answer.
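One simple way to structure this is to return the label together with the sentences the decision was based on. The sketch below picks evidence sentences by word overlap with the question and passes them to a classifier. The `classify` argument is a hypothetical callable standing in for a trained model, and the `Answer` type and helper names are made up for illustration. The sentence splitter is deliberately naive and will mis-split abbreviations.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Answer:
    label: bool          # the true / false prediction
    evidence: List[str]  # supporting sentences from the passage

def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def split_sentences(text):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def select_evidence(question, passage, k=1):
    """Pick the k sentences sharing the most words with the question."""
    q = tokenize(question)
    sentences = split_sentences(passage)
    ranked = sorted(sentences,
                    key=lambda s: len(q & tokenize(s)),
                    reverse=True)
    return ranked[:k]

def answer_with_evidence(question: str, passage: str,
                         classify: Callable[[str, str], bool],
                         k: int = 1) -> Answer:
    """Classify the question against the selected evidence and return both."""
    evidence = select_evidence(question, passage, k)
    label = classify(question, " ".join(evidence))
    return Answer(label=label, evidence=evidence)

# Usage with a placeholder classifier (assumption: a real model goes here).
result = answer_with_evidence(
    "Is Tom Cruise in the movie The Avengers?",
    "The Avengers is a 2012 superhero film. Paris is the capital of France.",
    classify=lambda question, evidence: False,
)
```

Selecting the evidence before classifying means the returned strings are exactly what the model saw, which keeps the explanation consistent with the prediction.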

Thank you

0 Likes

(Brian Muhia) #2

Hi @mushroom, the main guidance you’re looking for can be found in lesson 4 of this year’s course, available here. That’s a link to the in-class discussion when the course was live, and here’s a link to the video. It’ll show you how to build a classifier in the exact way you’ve described. In fact, you won’t even need to pretrain a model on Wikipedia first. That has already been done, so you’ll just fine-tune the published model for your task. Later, once you’ve tried the method on your own dataset, you can look in the community forum for how to extend the model to a question-answering system.

0 Likes

#3

Haha, I can do image tasks, but text tasks are still new to me. Thank you so much for your advice!

1 Like

(Brian Muhia) #4

Glad you’re enjoying it so far. Welcome to the forum!

0 Likes