Data labeling

Hi,

I have ~350K text examples, of which around 100K are labelled. Can anyone suggest which technique I should use to label the remaining examples?

I have tried basic techniques such as clustering and training an ML model, but the results are not good.

Any suggestions?

A labelling tool with active learning should be the way to go. If you have access to Prodigy, you could use that. It seems that KNIME also supports this.

https://www.knime.com/blog/labeling-with-active-learning

There is also one other tool: https://github.com/RTIInternational/SMART

Try using the SageMaker Ground Truth labeling service on AWS.
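For anyone wondering what the active learning mentioned above boils down to, here is a minimal sketch of one common selection strategy, least-confidence uncertainty sampling. It is not how Prodigy, KNIME, or SMART implement it internally; the function names and the example probabilities are made up for illustration. The idea is that a model trained on the labelled examples scores the unlabelled ones, and you hand-label the ones it is least sure about first.

```python
from typing import List, Sequence


def uncertainty_scores(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Least-confidence score per example: 1 minus the highest class probability."""
    return [1.0 - max(row) for row in probabilities]


def select_for_labeling(probabilities: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return the indices of the `batch_size` most uncertain examples."""
    scores = uncertainty_scores(probabilities)
    # Sort indices from most uncertain to least uncertain.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical class probabilities for four unlabelled texts and three classes.
probs = [
    [0.98, 0.01, 0.01],  # model is confident
    [0.40, 0.35, 0.25],  # model is unsure
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # model is very unsure
]
to_label = select_for_labeling(probs, batch_size=2)
print(to_label)  # [3, 1]
```

In practice the probabilities would come from whatever classifier you train on the labelled data (for example, the output of `predict_proba` on a scikit-learn classifier). You label the selected batch, retrain, and repeat.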

I hope there will be better answers, because I have a few thousand images and texts waiting to be labelled, and spending days and days on labelling is just a waste of time. Some of the images require multiple labels. Does anyone have a better solution than sorting them into folders or entering more than one label per image in Excel?

Data labelling and cleaning are, IMO, some of the hardest and most time-consuming parts of the process. If you’re nifty with JS or C++ (or any other language for building apps), you could write a quick mini-app that goes through a directory of images and lets you select a few labels for each one.

Something like so: https://datascience.stackexchange.com/questions/14039/tool-to-label-images-for-classification

Or labelbox.io

I’ll add, though, that this won’t be a fun process; data labeling (for the most part) never is.
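As a rough idea of what such a mini-app could look like, here is a command-line sketch in Python rather than JS or C++. It only prints each filename and asks for labels, so you would still need to open the image yourself or bolt on a viewer. The function names, the extension list, and the CSV layout (`filename`, then labels joined with `;`) are all my own assumptions, not taken from any of the tools linked in this thread.

```python
import csv
from pathlib import Path
from typing import Callable, List

# Assumed set of image extensions; adjust to your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def list_images(directory: str) -> List[Path]:
    """Return the image files in `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def parse_labels(raw: str) -> List[str]:
    """Split a comma-separated string into clean, non-empty labels."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def label_directory(directory: str, output_csv: str,
                    ask: Callable[[str], str] = input) -> int:
    """Ask for labels for every image and write one CSV row per image.

    Multiple labels for one image are stored in a single cell, joined by ';'.
    Returns the number of images processed.
    """
    images = list_images(directory)
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "labels"])
        for image in images:
            raw = ask(f"Labels for {image.name} (comma-separated): ")
            writer.writerow([image.name, ";".join(parse_labels(raw))])
    return len(images)
```

You would run it with something like `label_directory("my_images", "labels.csv")`, where both paths are placeholders. Keeping all labels for an image in one CSV cell avoids the folder-per-label problem for multi-label images.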

I’ll add though that the start of this thread was on NLP labeling, not image labeling :)

Try out https://platform.ai/

Hi Najaf,

I would recommend looking at outsourcing options. You can run your ML models inside a labeling tool and then have outsourcing partners review the annotations. For example, Label Studio is a great open-source option where you can deploy your model and evaluate its predictions within the interface.

Or just go straight to outsourcing: AWS SageMaker, Google, Scale AI, and many other services can help with that.

But if you are looking for medical annotations by field-specific practitioners, look into: https://annomed.io/

Hey @najaf

Labelling a huge batch of 350K text examples is really challenging, and standard methods like clustering or basic ML models might not be enough in your case. I recently came across a blog post by Hi-Tech BPO that covers data annotation strategies for machine learning projects. It might be what you need for your project: Effective Ways of Data Annotation for ML Projects.

Hope this turns out to be a game-changer for you!

I built Ramen AI to solve the data labeling problem without the need to train a model or pre-label data. It’s free to use.