I have a question-answer chat corpus where each Q-A pair also has a pre-question text and a post-answer text.
The task I am trying to solve is to classify the type of answer for each question.
I am stuck on what kind of feature engineering I should be doing. How do I incorporate the pre- and post-context texts along with the question/answer?
I am attaching a sample from my dataset below.
I’m afraid there is no way to know in advance which combination of features will work best.
But maybe a few thoughts from my own experience with feature engineering in NLP can help:
- first of all I would train a baseline only with the response field as input and see how it performs
- try to label a few records yourself to get a feeling for which of the other text fields contain useful information for this classification task
- then add the text fields you think are promising in an iterative way, and see how the model with additional inputs performs compared to the baseline
- for combining features, you could simply concatenate the texts. What I usually do is add some kind of separator token between concatenated parts of the input. E.g. following the fast.ai terminology, I added tokens xxbot and xxeot for beginning and end of title.
- note: it’s important that you also think about inference. At the time the model needs to predict the answer type in production, are all features available?
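The concatenation-with-separators idea above can be sketched like this. The token names (`xxpre`, `xxq`, `xxa`, `xxpost`) are made up for this example, in the style of fast.ai's special `xx*` tokens; the real dataset's field names may differ.

```python
# Sketch: combine the four text fields into one model input, prefixing
# each field with a marker token so the model can tell the parts apart.
def combine_fields(pre, question, answer, post):
    """Join the text fields, each preceded by a boundary token."""
    return " ".join(["xxpre", pre, "xxq", question, "xxa", answer, "xxpost", post])
```

If you use a tokenizer with a fixed vocabulary, remember to register these marker tokens as special tokens so they are not split apart.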
Thank you @stefan-ai for your suggestions. I will give them a try.
For now, I am trying to find the POS tags for each block of text (question, response, precedent, subsequent) and weight the sentence embeddings of these texts with the embeddings of their respective POS tags.
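A toy sketch of that POS-weighting idea, assuming per-tag weights and an `embed()` lookup as placeholders; in practice the tags would come from a POS tagger (e.g. spaCy) and the vectors from a pretrained embedding model:

```python
import numpy as np

# Hypothetical weights per POS tag; these would need to be tuned or learned.
POS_WEIGHTS = {"NOUN": 1.5, "VERB": 1.2, "ADJ": 1.0, "DET": 0.3}

def weighted_sentence_embedding(tokens, tags, embed, dim=50):
    """Average token embeddings, each scaled by its POS-tag weight."""
    vecs = [POS_WEIGHTS.get(tag, 1.0) * embed(tok) for tok, tag in zip(tokens, tags)]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

The same function could be applied separately to each of the four text blocks, giving one weighted embedding per block to feed into the classifier.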
I think feature extraction for text is an inherently complex task, and I feel it lags far behind Computer Vision, where we have many different types of features for image data.
I will update if I find anything new.