Active learning

Hi all,

I would like to know whether there has been any research on active learning in the area of NLP. @jeremy, your thoughts could help.

I’m not aware of any, but I haven’t looked into it. Maybe @anamariapopescug would know…

hi - there is research on using active learning for various NLP tasks, yes. Do you have a specific task (or area) in mind?

Yup, classification of docs.

That’s a rather… concise description :wink: Could you provide more detail and background?

Ya sure… so the objective is to minimize the number of documents humans have to label. Which means the system needs to learn while the humans are labelling, check its prediction accuracy in parallel, and have a confidence measure that needs to be really high if humans are to accept the machine's labels.
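
Not sure what stack you're on, but here's a minimal uncertainty-sampling sketch of that loop, assuming scikit-learn and TF-IDF features. `label_fn` is a hypothetical stand-in for the human annotator, and the 0.95 auto-accept threshold is just an illustrative choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_learning_loop(texts, label_fn, n_seed=50, batch=20,
                         accept_threshold=0.95, rounds=10):
    """label_fn(text) -> label stands in for the human annotator.
    Assumes the first n_seed texts cover at least two classes."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    labels = {i: label_fn(texts[i]) for i in range(n_seed)}  # small seed set
    unlabeled = set(range(len(texts))) - set(labels)

    for _ in range(rounds):
        if not unlabeled:
            break
        labeled = sorted(labels)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[labeled], [labels[i] for i in labeled])

        idx = sorted(unlabeled)
        proba = clf.predict_proba(X[idx])
        conf = proba.max(axis=1)

        # Accept the machine's label only when confidence is very high.
        for i, c, k in zip(idx, conf, proba.argmax(axis=1)):
            if c >= accept_threshold:
                labels[i] = clf.classes_[k]
                unlabeled.discard(i)

        # Route the least-confident documents to the humans.
        for _, i in sorted(zip(conf, idx))[:batch]:
            if i in unlabeled:
                labels[i] = label_fn(texts[i])
                unlabeled.discard(i)
    return labels
```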

hi, i'm happy to check on this for you and will let you know what i can find. There's definitely active learning work you can repurpose for this; in any case, i'll dig up some things.

That's awesome, looking forward to it!

@anamariapopescug Any luck on this?

I am looking to implement active learning for text/image classification where, instead of manually retraining after a period, we allow the model to learn from manually tagged data as and when the tagging is being done. Is there a way to do this currently?
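
For the learn-while-tagging part, one option (not the only one) is an online learner that supports incremental updates, so each tag becomes a single model update rather than a full retrain. A minimal sketch with scikit-learn; the label set and callback names here are hypothetical:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

CLASSES = ["invoice", "contract", "other"]  # hypothetical labels, fixed up front
vec = HashingVectorizer(n_features=2**18, alternate_sign=False)  # stateless: no refit needed
clf = SGDClassifier(loss="log_loss")        # log loss so predict_proba is available

def on_document_tagged(text, label):
    """Call from the tagging UI each time a human labels a document:
    the model takes one incremental step instead of a full retrain."""
    clf.partial_fit(vec.transform([text]), [label], classes=CLASSES)

def suggest(text):
    """Current model suggestion plus its confidence, for the reviewer to accept."""
    proba = clf.predict_proba(vec.transform([text]))[0]
    i = int(proba.argmax())
    return clf.classes_[i], float(proba[i])
```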

@jeremy, your suggestions on this?

Hi everybody!
Active Learning is an amazing topic. I have been thinking about a possible active learning approach for text classification on data with erroneous and highly inconsistent labels. Real-life datasets are full of annotation inconsistencies. So the idea is:

  1. Train an initial model on the inconsistent data
  2. Predict on new data and let human agents review the predictions.
  3. Here it gets interesting. If a human agent marks a prediction as wrong and corrects the label to a new value corrected_pred:
    a) The reviewed datapoint is used as a seed to find similar texts in the annotated dataset. This can be done by using the model's encoder for vectorization and finding nearest neighbours through cosine similarity (see the sketch after this list).
    b) Get the top n most similar datapoints with label != corrected_pred and have them reviewed by humans as well. The goal here is to verify whether we have discovered more wrongly annotated data.
  4. Retrain the model on the improved dataset.
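
To make step 3 concrete, here's a minimal sketch of the neighbour search, assuming an encode function that returns document embeddings from the model's encoder (that function, and all names here, are stand-ins):

```python
import numpy as np

def find_suspect_labels(seed_text, corrected_pred, texts, labels, encode, n=10):
    """Return indices of the n texts most similar to the corrected seed
    whose stored label disagrees with corrected_pred."""
    emb = np.asarray(encode(texts), dtype=float)       # (N, d) embeddings
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    seed = np.asarray(encode([seed_text]), dtype=float)[0]
    seed /= np.linalg.norm(seed)
    sims = emb @ seed                                  # cosine similarities

    # Only datapoints whose current label disagrees with the human fix
    # are candidates for being wrongly annotated.
    candidates = [i for i, y in enumerate(labels) if y != corrected_pred]
    candidates.sort(key=lambda i: sims[i], reverse=True)
    return candidates[:n]                              # queue for human review
```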

What do you think? Is this a good or bad idea? @jeremy, it would be amazing to have your input!

Yes, I think approaches like this are terrific.
