Active learning

Hi all,

I would like to know whether there has been any research on active learning in the area of NLP. @jeremy, your thoughts could help.

I’m not aware of any, but I haven’t looked into it. Maybe @anamariapopescug would know…

hi - there is research on using active learning for various NLP tasks, yes. Do you have a specific task (or area) in mind?

Yup, classification of docs.

That’s a rather… concise description :wink: Could you provide more detail and background?

Ya sure… so the objective is to minimize the number of documents humans have to label. Which means the system needs to learn while the humans are labelling, check its prediction accuracy in parallel, and have a confidence measure that needs to be really high if humans are to accept the machine's labels.
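
Not sure what stack you're on, but here's a minimal uncertainty-sampling sketch of that loop, assuming scikit-learn and TF-IDF features. `label_fn` is a hypothetical stand-in for the human annotator, and the 0.95 auto-accept threshold is just an illustrative choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_learning_loop(texts, label_fn, n_seed=50, batch=20,
                         accept_threshold=0.95, rounds=10):
    """label_fn(text) -> label stands in for the human annotator.
    Assumes the first n_seed texts cover at least two classes."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    labels = {i: label_fn(texts[i]) for i in range(n_seed)}  # small seed set
    unlabeled = set(range(len(texts))) - set(labels)

    for _ in range(rounds):
        if not unlabeled:
            break
        labeled = sorted(labels)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[labeled], [labels[i] for i in labeled])

        idx = sorted(unlabeled)
        proba = clf.predict_proba(X[idx])
        conf = proba.max(axis=1)

        # Accept the machine's label only when confidence is very high.
        for i, c, k in zip(idx, conf, proba.argmax(axis=1)):
            if c >= accept_threshold:
                labels[i] = clf.classes_[k]
                unlabeled.discard(i)

        # Route the least-confident documents to the humans.
        for _, i in sorted(zip(conf, idx))[:batch]:
            if i in unlabeled:
                labels[i] = label_fn(texts[i])
                unlabeled.discard(i)
    return labels
```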

hi, i'm happy to check on this for you and will let you know what i can find. There's definitely active learning work you can repurpose for this; in any case, i'll dig up some things.

That's awesome, looking forward to it!

@anamariapopescug Any luck on this?

I am looking to implement active learning for text/image classification where, instead of manually retraining after a period, we allow the model to learn from manually tagged data as and when the tagging is being done. Is there a way to do this currently?
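
For the learn-while-tagging part, one option (not the only one) is an online learner that supports incremental updates, so each tag becomes a single model update rather than a full retrain. A minimal sketch with scikit-learn; the label set and callback names here are hypothetical:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

CLASSES = ["invoice", "contract", "other"]  # hypothetical labels, fixed up front
vec = HashingVectorizer(n_features=2**18, alternate_sign=False)  # stateless: no refit needed
clf = SGDClassifier(loss="log_loss")        # log loss so predict_proba is available

def on_document_tagged(text, label):
    """Call from the tagging UI each time a human labels a document:
    the model takes one incremental step instead of a full retrain."""
    clf.partial_fit(vec.transform([text]), [label], classes=CLASSES)

def suggest(text):
    """Current model suggestion plus its confidence, for the reviewer to accept."""
    proba = clf.predict_proba(vec.transform([text]))[0]
    i = int(proba.argmax())
    return clf.classes_[i], float(proba[i])
```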

@jeremy, your suggestions on this?

Hi everybody!
Active Learning is an amazing topic. I have been thinking about a possible active learning approach for text classification on data with erroneous and highly inconsistent labels. Real-life datasets are full of annotation inconsistencies. So the idea is:

  1. Train an initial model on the inconsistent data
  2. Predict on new data and let human agents review the predictions.
  3. Here it gets interesting. If a human agent marks a prediction as wrong and corrects the label to a new value corrected_pred:
    a) The reviewed datapoint is used as a seed to find similar texts in the annotated dataset. This can be done by using the model's encoder for vectorization and finding nearest neighbours through cosine similarity (see the sketch after this list).
    b) Get the top n most similar datapoints with label != corrected_pred and have them reviewed by humans as well. The goal here is to verify whether we have discovered more wrongly annotated data.
  4. Retrain the model on the improved dataset.
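
To make step 3 concrete, here's a minimal sketch of the neighbour search, assuming an encode function that returns document embeddings from the model's encoder (that function, and all names here, are stand-ins):

```python
import numpy as np

def find_suspect_labels(seed_text, corrected_pred, texts, labels, encode, n=10):
    """Return indices of the n texts most similar to the corrected seed
    whose stored label disagrees with corrected_pred."""
    emb = np.asarray(encode(texts), dtype=float)       # (N, d) embeddings
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    seed = np.asarray(encode([seed_text]), dtype=float)[0]
    seed /= np.linalg.norm(seed)
    sims = emb @ seed                                  # cosine similarities

    # Only datapoints whose current label disagrees with the human fix
    # are candidates for being wrongly annotated.
    candidates = [i for i, y in enumerate(labels) if y != corrected_pred]
    candidates.sort(key=lambda i: sims[i], reverse=True)
    return candidates[:n]                              # queue for human review
```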

What do you think? Is this a good or bad idea? @jeremy, it would be amazing to have your input!

Yes, I think approaches like this are terrific.
