Dataset curation and dataset publication

Does anyone have resources on data collection best practices, in particular for image datasets?

Also, are there any articles from companies in industry on the advantages and disadvantages of publishing datasets?

Thank you!

I haven’t found a single source that covers this well, but I’d recommend reading Datasheets for Datasets (arXiv:1803.09010) if you haven’t already. It covers the publication part, but a lot of it is also relevant for collecting data that won’t be made public.

It might also be worth looking at the datasets on Hugging Face, which try to implement that paper. The quality of the documentation varies a lot, but it should give you some examples of good and less good documentation.
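
To make the datasheet idea a bit more concrete, here is a rough sketch of how you could keep a datasheet next to your data and check which parts are still unwritten. The seven section names are the ones used in the paper; the `Datasheet` class, its methods, and the "street-signs" dataset are just made up for illustration and are not from the paper or from any library.

```python
from dataclasses import dataclass, field

# The seven section headings used in the Datasheets for Datasets paper.
SECTIONS = [
    "Motivation",
    "Composition",
    "Collection Process",
    "Preprocessing/Cleaning/Labeling",
    "Uses",
    "Distribution",
    "Maintenance",
]


@dataclass
class Datasheet:
    """Hypothetical container: maps each section name to free-text answers."""

    dataset_name: str
    answers: dict = field(default_factory=dict)

    def missing_sections(self):
        # A section counts as missing if it is absent or only whitespace.
        return [s for s in SECTIONS if not self.answers.get(s, "").strip()]

    def to_text(self):
        # Render a plain-text datasheet, marking unwritten sections as TODO.
        lines = [f"Datasheet: {self.dataset_name}", ""]
        for section in SECTIONS:
            lines.append(section)
            lines.append(self.answers.get(section, "").strip() or "TODO")
            lines.append("")
        return "\n".join(lines)


# Example with an invented dataset and a single filled-in section.
sheet = Datasheet(
    dataset_name="street-signs",
    answers={"Motivation": "Collected to evaluate sign detection on phone photos."},
)
print(sheet.to_text())
print("Still to write:", sheet.missing_sections())
```

Even if the dataset never gets published, writing the answers down while you collect the data is much easier than reconstructing them later.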

There is some discussion of creating training data in the Full Stack Deep Learning course that might also be worth looking at.

I’d also love to see more resources on this. I’ve found that a lot of the discussion focuses on the technicalities and less on the broader goals and approaches.


If you are looking for a well-known source of image datasets, I think the first name that comes to mind is Kaggle. You can also curate images from freely available image resources such as LabelMe or Google’s Open Images. However, if your plan is to use clients’ image datasets for mobile applications, then it is better to go with a secure database. For my mobile applications, I usually prefer databases like Postgres, MySQL, and AWS DynamoDB for more secure data collection.

As for the pros and cons of publishing datasets: client datasets are held in RAM, which is why they are fast and let you use indexes on the fly. On the downside, I think their single-user nature could be a drawback in many cases.