Challenges while creating your own dataset

OmarAmin · November 5, 2017, 11:24am

Hi All,

I’m interested in working on a problem that doesn’t have any dataset available online.

1- Do you know how I can manually download images for this task? Google image search has a limit of 300 images or something near that, so do you know of any other source from which I could download around 10k images for fine-tuning purposes?

2- Do you have any suggestions on how I can clean near-duplicate images out of the downloaded dataset? Assume I downloaded the images from Google image search using many similar keywords (e.g. bicycle, bike, etc.), so some images will most probably appear twice. In rare cases they won’t be identical, but they will be almost the same.

3- Also, there are some problems with Google Images: almost all the images have watermarks and have most probably been edited in Photoshop, many weren’t taken in real scenarios, and some have had their backgrounds removed.

4- Also, is it very harmful to crop images from videos? (The same object appears in many frames with almost the same background.)

Thank you in advance

angryziber · November 5, 2017, 11:29am

Can you give an example of a problem where you’ve faced a lack of available datasets?

OmarAmin · November 5, 2017, 1:55pm

A while ago I was trying to collect a dataset of wheelchairs and baby strollers, and I faced all of these challenges: the available images are redundant and full of noisy text and watermarks. I did find a dataset containing around 400 images, but that is not enough for generalization.

jeremy · November 5, 2017, 5:11pm

You could try the 100M-image Flickr dataset http://yfcc100m.appspot.com/ . Or search Google Images for different date ranges to get different image sets, and pause between searches to avoid getting throttled.

To find the near-duplicates, perhaps do a nearest-neighbors search on the penultimate-layer activations?
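
A minimal sketch of that nearest-neighbors idea, assuming you have already run every image through a pretrained network and saved the penultimate-layer activations as one row per image. The function names and the 0.95 cosine-similarity threshold are illustrative assumptions, not something from this thread, and the threshold would need tuning on your own data.

```python
import numpy as np

def find_near_duplicates(features, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity is >= threshold.

    `features` is an (n_images, n_dims) array of activations (assumed input).
    """
    feats = np.asarray(features, dtype=np.float64)
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Look only at the upper triangle so each pair is reported once.
    i_idx, j_idx = np.triu_indices(len(unit), k=1)
    mask = sims[i_idx, j_idx] >= threshold
    return [(int(i), int(j)) for i, j in zip(i_idx[mask], j_idx[mask])]

def drop_near_duplicates(features, threshold=0.95):
    """Drop the later image of every near-duplicate pair; return kept indices."""
    dropped = {j for _, j in find_near_duplicates(features, threshold)}
    return [i for i in range(len(features)) if i not in dropped]
```

This brute-force version builds the full n x n similarity matrix, which is fine for a few thousand images but gets memory-hungry around 10k and beyond; an approximate nearest-neighbors index would be the usual next step at that scale.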

amritv · November 6, 2017, 8:27am

I too have been trying to get a custom dataset. I am working on Rx medication recognition, and I can get a number of good commercial datasets of Rx medications. However, I need a dataset of pills from a consumer standpoint, i.e. pictures taken with cell phones with varied backgrounds, lighting, angles, etc.

I just started a Pinterest board with the aim of getting people to pin pictures of their pills! It’s a way of crowdsourcing a dataset. Not sure how viable it will be, but I think it’s worth a shot.

wgpubs · November 7, 2017, 6:00pm

Please see here for a simple way to scrape images from a Google search using Selenium:

wgpubs · November 8, 2017, 1:27am

Anyone have any recommendations for cleaning up images with weird transparency issues?

I put together a set of images that includes .gif, .jpeg, and .png files with transparency. The .png files I can take care of by converting to RGBA, but I can’t figure out what to do with the .gifs and .jpegs. You can’t convert them to RGBA, and Python yells at me if I try to convert them to RGB. I’m beginning to think the right option is to simply remove them.

Thoughts?

s.s.o · January 24, 2018, 8:50pm

I didn’t notice when the last post was written.

GIF, PNG, and JPG are file formats, but RGB and RGBA are color spaces, so what you should convert is the color space. For instance, load a JPEG file and check its color space (it might be HSV, CIE L*a*b*, or something else), then convert that to RGB. If the RGB image has an alpha channel, just get rid of it; that removes the transparency. You can check this OpenCV link.
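
A minimal sketch of that clean-up, assuming Pillow (PIL) instead of the OpenCV route mentioned above. The helper name `to_rgb` and the white background are illustrative choices. Rather than simply discarding the alpha channel, it composites the image onto a solid background so transparent regions come out as a known color.

```python
from PIL import Image

def to_rgb(img, background=(255, 255, 255)):
    """Return an RGB copy of `img` with any transparency flattened.

    Going through RGBA first lets palette-based images (such as GIFs) with a
    transparent color be handled the same way as PNGs with an alpha channel.
    """
    rgba = img.convert("RGBA")
    # Opaque canvas in the chosen background color, same size as the image.
    canvas = Image.new("RGBA", rgba.size, tuple(background) + (255,))
    # Blend the image over the canvas, then drop the (now fully opaque) alpha.
    return Image.alpha_composite(canvas, rgba).convert("RGB")
```

Usage would look something like `to_rgb(Image.open(path)).save(out_path)`, where `path` and `out_path` are placeholders for your own files.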

ilarum · February 2, 2018, 5:49am

@wgpubs
How do I get a copy of this file onto my remote Ubuntu machine on Paperspace?

cbaumgartner · January 4, 2019, 12:14am

@jeremy are there any resources you recommend that deal with the legality of web-scraping images? If I build a dataset using web-scraped images and train a model, am I in trouble? What if I then deploy that model in a commercial setting? Do I really have to read the terms of service for the various sources of all the images? Any guidance would be much appreciated.