Data labeling


I have ~350K text examples, of which around 100K are labelled. Can anyone suggest a technique for labelling the remaining examples?

I have tried basic techniques and algorithms such as clustering and ML models, but the results are not good.

Any suggestions?

A labelling tool with active learning should be the way to go. If you have access to Prodigy, you could use that. It seems that KNIME also supports this.
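For anyone unfamiliar with the idea, here is a minimal sketch of one common active learning strategy, uncertainty sampling: a model trained on the labelled examples scores the unlabelled ones, and the examples it is least sure about are sent to a human first. This is not how Prodigy or KNIME implement it; the function names and the probability values below are made up for illustration, and it assumes you already have per-class probabilities from some classifier.

```python
import math

def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predicted_probs, batch_size):
    """Return indices of the `batch_size` examples the model is least sure about."""
    ranked = sorted(
        range(len(predicted_probs)),
        key=lambda i: entropy(predicted_probs[i]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:batch_size]

# Hypothetical model outputs for four unlabelled texts over three classes.
probs = [
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # almost uniform, very uncertain
    [0.60, 0.30, 0.10],
    [0.50, 0.49, 0.01],
]
to_label = select_most_uncertain(probs, batch_size=2)
print(to_label)
```

In a real loop you would label the selected batch, retrain the model on the enlarged labelled set, and repeat until the predictions on the rest are good enough.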

There is also one other tool ->

Try the SageMaker Ground Truth labeling service on AWS.

I hope there will be better answers, because I have a few thousand images and texts waiting to be labelled, and spending days and days on labelling is just a waste of time. Some of the images require multiple labels. Does anyone have a better solution than sorting them into folders or entering more than one label per image in Excel?

Data labelling and cleaning is, IMO, one of the hardest and most time-consuming parts of the process. If you’re nifty with JS or C++ (or any other language you can build an app in), you could write a quick mini-app that goes through a directory of images and lets you select a few labels for each one.

Something like so:


I’ll add, though, that this won’t be a fun process; data labeling (for the most part) never is.
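A rough sketch of the kind of mini-app described above, written in Python rather than JS or C++ and run in the terminal rather than with a GUI. Everything here is an assumption for illustration: the `label_directory` name, the list of image suffixes, and the choice to store multiple labels in one CSV column separated by `|`. It does not display the images, so you would keep an image viewer open next to it.

```python
import csv
from pathlib import Path

# Assumed set of file extensions to treat as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def label_directory(image_dir, output_csv, ask=input):
    """Ask for labels for each image in image_dir and write them to output_csv.

    `ask` is any callable that takes a prompt string and returns the answer.
    It defaults to input() and can be replaced, e.g. for testing.
    """
    images = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    rows = []
    for path in images:
        answer = ask(f"Labels for {path.name} (comma-separated): ")
        # Split on commas and drop empty entries, so "cat, , dog" -> ["cat", "dog"].
        labels = [part.strip() for part in answer.split(",") if part.strip()]
        rows.append((path.name, labels))

    # One row per image; multiple labels share a column, joined with "|".
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "labels"])
        for name, labels in rows:
            writer.writerow([name, "|".join(labels)])
    return rows
```

Calling `label_directory("images", "labels.csv")` would prompt once per image in a hypothetical `images` folder. The resulting CSV opens in Excel, and the multi-label column can be split on `|` when loading it for training.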

I’ll add, though, that this thread started out about NLP labeling, not image labeling :)

try out

Hi Najaf,

I would recommend looking at outsourcing options. You can run your ML models inside a labeling tool and then have outsourcing partners review the resulting annotations. For example, Label Studio is a great open-source option where you can deploy your model and evaluate its predictions within the interface.
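To make the model-plus-review workflow concrete, here is a small sketch of one way to decide which pre-annotations go to reviewers. It is not Label Studio's API; the `triage_predictions` name, the 0.9 threshold, and the sample predictions are all hypothetical, and sending only low-confidence items to review is my assumption rather than something the tools require.

```python
def triage_predictions(predictions, threshold=0.9):
    """Split (item_id, label, confidence) tuples into accepted and needs-review lists."""
    accepted, review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            accepted.append((item_id, label))  # keep the model's label as-is
        else:
            review.append((item_id, label))    # send to a human reviewer
    return accepted, review

# Hypothetical model pre-annotations: (item id, predicted label, confidence).
preds = [
    ("doc-1", "billing", 0.97),
    ("doc-2", "support", 0.55),
    ("doc-3", "billing", 0.91),
]
accepted, review = triage_predictions(preds)
print(len(accepted), "accepted,", len(review), "for review")
```

It is worth spot-checking a sample of the accepted items too, since a confident model can still be wrong.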

Or just go straight to outsourcing: AWS SageMaker, Google, Scale AI and many other services can help with that.

But if you are looking for medical annotations by field-specific practitioners, look into: