How do you efficiently annotate your own datasets?

I work in a social science lab, and labeling data is always a bottleneck.
Mind you, this is just labeling an initial dataset to validate against crowdsourced labels.

What kind of tools, tricks, or platforms do you use to label data within your institution?

For mutually exclusive classes I send annotators a zip file of images and have them drag each image into a sub-directory for its class. What about non-mutually exclusive images (i.e. where one image has more than one label)? And what about metadata and text?
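For the multi-label case, one option I have been considering is a flat CSV that annotators fill in instead of nested folders. A rough sketch (using pandas; the file and column names are just placeholders, not any particular tool):

```python
# Sketch of a CSV-based multi-label workflow: one row per image, with a
# pipe-separated "labels" column the annotator fills in (e.g. in a spreadsheet).
from pathlib import Path
import pandas as pd

IMAGE_DIR = Path("images")    # placeholder: folder of images to be labeled
OUT_CSV = Path("labels.csv")  # placeholder: file the annotator fills in

# 1) Generate a template with one row per image and an empty labels column.
rows = [{"filename": p.name, "labels": ""} for p in sorted(IMAGE_DIR.glob("*.jpg"))]
pd.DataFrame(rows).to_csv(OUT_CSV, index=False)

# 2) After annotation, split the pipe-separated labels back into Python lists.
df = pd.read_csv(OUT_CSV)
df["labels"] = df["labels"].fillna("").str.split("|")
print(df.head())
```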

I hope the question is clear, but more importantly I hope this thread can serve as a resource for others who need to label datasets without (or before) crowd workers.

I have not used this myself, but I have heard from people who were very happy with it, although it is not free:

I am in no way affiliated with it; it is just an option to check out (especially for NLP, it seems). It is by the authors of the spaCy NLP library.
It seems to support image annotation as well (though the happy users mentioned above were from the NLP space).


If you don’t mind spending some money, you can get your labeling done through a crowdsourcing site such as Mechanical Turk.

Cancel that; this is for the initial dataset. Good question, though, and I will keep looking at it.

https://github.com/Microsoft/VoTT is OK for images.

  • Split the classes up. Have each annotator do one class only.

  • Choose what part of the dataset to label. Drago @ Waymo speaks to one of the challenges early in this video: https://www.youtube.com/watch?v=Q0nGo2-y0xY. The model won’t improve much if the new training data covers a part of the distribution it’s already nailing; see the uncertainty-sampling sketch below.
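As a rough illustration of that second point (my own sketch, not Waymo's method): uncertainty sampling sends annotators the examples the current model is least confident about. Assumes a scikit-learn classifier; the data here is random, just so it runs.

```python
# Pick the unlabeled examples the current model is least sure about,
# rather than ones from parts of the distribution it already handles well.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pick_next_batch(model, X_pool, batch_size=50):
    """Return indices of the pool rows with the lowest top-class probability."""
    probs = model.predict_proba(X_pool)
    confidence = probs.max(axis=1)  # probability of the predicted class
    return np.argsort(confidence)[:batch_size]

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_pool = rng.normal(size=(1000, 10))  # stand-in for the unlabeled pool

model = LogisticRegression().fit(X_train, y_train)
to_label = pick_next_batch(model, X_pool)
print("Send these pool indices to annotators first:", to_label[:10])
```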

I wrote an article on more general concepts regarding data quality here:

You can try using COCO:
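If that refers to the COCO annotation format, one thing it handles nicely is several annotations per image, which covers the non-mutually-exclusive case from the original question. A hypothetical minimal annotation file (all values made up for illustration):

```python
# Minimal COCO-style annotation file: each annotation points at an image_id,
# so one image can carry multiple category labels.
import json

coco = {
    "images": [{"id": 1, "file_name": "img_0001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "cat"}, {"id": 2, "name": "indoor"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 120, 200, 180], "area": 36000, "iscrowd": 0},
        {"id": 2, "image_id": 1, "category_id": 2,
         "bbox": [0, 0, 640, 480], "area": 307200, "iscrowd": 0},
    ],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f, indent=2)
```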


For video, check out Diffgram! It offers frame-level accuracy.