Tips for building large image datasets

@txuninho - I have not had CUDA memory errors when using ai_utilities; moreover, I don’t believe it uses the GPUs. If lowering your batch size doesn’t help, try restarting the Jupyter server. If that doesn’t help, please provide me with sample code and the error output.

Thanks.

I am using Google Colab and wanted to know whether we can run this script in a Colab notebook or not. If you used Anaconda on a local machine, how would you train on a large number of images without a GPU?

I entered this command:

$ googleimagesdownload -k "baseball game" -s medium -l 500 -o fastai/courses/dl1/data/baseballcricket -i train/baseball -cd /usr/local/bin/chromedriver

Error message:

Item no.: 1 --> Item name = baseball game
Evaluating…
Looks like we cannot locate the path the 'chromedriver' (use the '--chromedriver' argument to specify the path to the executable.) or google chrome browser is not installed on your machine (exception: Message: 'chromedriver' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home)
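
Side note: if chromedriver really is at the path passed to -cd, the "wrong permissions" part of that message usually just means the binary isn't executable. A minimal sketch of the fix in Python (the path is assumed from the -cd argument above, and writing to /usr/local/bin may need root):

import os, stat

# Mark chromedriver as executable for everyone (equivalent to chmod +x);
# adjust the path if yours differs.
path = "/usr/local/bin/chromedriver"
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)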

I got this UnsatisfiableError:

UnsatisfiableError: The following specifications were found to be in conflict:
  - icrawler -> python[version='>=3.6,<3.7.0a0']
  - python=3.7
Use "conda search --info" to see the dependencies for each package.

when I attempted conda install -c hellock icrawler.

How can I resolve this?

How do I get the downloaded image dataset from my PC into Google Colab?

You can do it using Google Drive:

from google.colab import drive
drive.mount('/content/gdrive')

It will show you a link and ask you for an authorization code. Click on that link to get the authorization code and you are good to go.
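
Once mounted, anything you upload to Drive shows up under /content/gdrive. A small sketch of pulling a dataset into the Colab filesystem (the source path is an example; use wherever you uploaded it):

import shutil

# Copy the uploaded dataset from the mounted Drive into the local working
# directory; the destination directory must not already exist.
shutil.copytree('/content/gdrive/My Drive/baseballcricket',
                'data/baseballcricket')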


Hey, I’m not able to download images in Google Colab using the Google Images downloader. Whenever I increase the limit beyond 100, I get an error.

Hello! I just started the 2019 course and finished watching the Lesson 1 video. I’m excited to try to build my own dataset and train the deep learning classifier on this dataset.

One question I couldn’t find an answer to in a cursory search of this forum was what guidelines exist for the size of the dataset.

I saw that the Oxford-IIIT Pet Dataset had 37 categories with roughly 200 images each.

Are there some general best practices for the size of an image classification dataset?

Not sure if you ever resolved this. I finally figured out that conda install -c hellock icrawler was trying to install into the base conda environment, which was Python 3.7, while fastai was using 3.6. So if your fastai conda environment is named, say, myFastai, you have to use:

conda install --name myFastai -c hellock icrawler

I created a clone and installed it there:
conda create --clone fastai-3.6 --prefix $CONDAFI_PATH/fastai3.6
conda activate $CONDAFI_PATH/fastai3.6
conda install --prefix $CONDAFI_PATH/fastai3.6 -c hellock icrawler
pip install python-magic

This makes me nervous; anything on Facebook actually makes me nervous, hah. But on that note, be mindful of permissions when getting images from Facebook. You can read about it here: https://developers.facebook.com/docs/graph-api/reference/photo/#permissions. For example, you can only get a user’s photos if they granted you that permission when they authenticated with your app.
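
For example, a rough sketch of what fetching a single photo through the Graph API looks like; the photo id and token are placeholders, and the call only succeeds if the user granted your app the user_photos permission:

import requests

ACCESS_TOKEN = "..."      # token obtained through your app's login flow
PHOTO_ID = "1234567890"   # placeholder photo id

# Request the available renditions of one photo; 'images' in the response
# lists source URLs at various resolutions.
resp = requests.get(
    f"https://graph.facebook.com/v3.2/{PHOTO_ID}",
    params={"fields": "images", "access_token": ACCESS_TOKEN},
)
print(resp.json())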


I found a solution in this thread and it worked for me!

Newbie here… just watched lesson 1 and trying to build an image set. I’m using the Google Cloud setup.

Has anyone used rar files on google drive? I found a dataset I’d like to use, which is shared on google drive as a rar file. Here’s the URL:
https://drive.google.com/open?id=1sRHjwTx0akh9L9EAjiKK6OE4sBnwTZJg

I was able to download the file to my laptop but cannot access it from the Jupyter notebook on GCP. (I guess if I run a web server on my laptop and share the dataset that way it would likely work, but I wonder if there is another way.)
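
One other option that might work, though I haven’t verified it on GCP: fetch the file by its Drive id with gdown and unpack it with rarfile (which needs the unrar tool installed on the instance):

# pip install gdown rarfile  (rarfile also needs the unrar binary installed)
import gdown
import rarfile

# Download the shared Drive file by id, then extract it into ./data
url = "https://drive.google.com/uc?id=1sRHjwTx0akh9L9EAjiKK6OE4sBnwTZJg"
gdown.download(url, "dataset.rar", quiet=False)
rarfile.RarFile("dataset.rar").extractall("data")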

From what I’ve heard/read so far, it depends. The example Jeremy gave in the lesson 1 video of baseball vs. cricket has 30 images, IIRC. I’d guess that with more categories you need more images (so every category has at least several), and if the risk of overfitting is higher, e.g. images with the same label share visual features by accident, more images will probably help.

Hi @lindyrock, I am not able to download more than 100 images at a time even after downloading the chromedriver. I am using Google Colab as my Jupyter environment.

I used icrawler and was able to download up to 700 images. Attaching the Colab link, with a minimal sketch of the approach below it.

https://colab.research.google.com/drive/14Zwx9Uh9p8lf-H18sLoaI3s8ByT5jGk6
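
For anyone who just wants the gist, a minimal icrawler example along those lines (the keyword, directory, and count here are illustrative):

from icrawler.builtin import GoogleImageCrawler

# Crawl Google Images for a keyword and save results under data/baseball
crawler = GoogleImageCrawler(storage={'root_dir': 'data/baseball'})
crawler.crawl(keyword='baseball game', max_num=700)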


For what it’s worth, I wrote a tiny helper script to help with gathering image data using google_images_download.

Check out https://github.com/maxlvl/dl_helper_scripts/blob/master/helper_scripts/google_image_download_helper.py

Feel free to use and alter it; as you’ll see, the keywords and time_ranges are hardcoded to loop over. I’ll be adding more helper scripts to that repository as I go along the 2019 course, so feel free to star/watch the repository (or contribute if the inspiration strikes you).

This is designed to run on a Linux VM (I run mine on GCP, but it should work anywhere else that has chromedriver and Python running in a virtualenv). A rough sketch of the looping idea is below.
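
Something like this, using google_images_download’s Python API (the keyword, time ranges, and chromedriver path are illustrative, not the exact values in the script):

from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()

# Loop over hardcoded time ranges so each query stays within the per-search
# result cap while still accumulating a larger dataset overall.
for time_range in ('{"time_min":"01/01/2017","time_max":"12/31/2017"}',
                   '{"time_min":"01/01/2018","time_max":"12/31/2018"}'):
    response.download({
        "keywords": "baseball game",
        "limit": 500,
        "time_range": time_range,
        "chromedriver": "/usr/local/bin/chromedriver",
    })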


Thanks Lindy!

Lindy/All:
Working on a p2.xlarge EC2 instance on AWS, and stuck on ‘chromedriver’!

I’ve installed (a) google_images_download and (b) chromedriver following your steps, moving ‘chromedriver’ to /usr/bin/chromedriver. Unfortunately, the ‘chromedriver’ path can’t be located:

Item no.: 1 --> Item name = shadow on highway and freeway
Evaluating…
Looks like we cannot locate the path the 'chromedriver' (use the '--chromedriver' argument to specify the path to the executable.) or google chrome browser is not installed on your machine (exception: Message: unknown error: cannot find Chrome binary (Driver info: chromedriver=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7),platform=Linux 4.4.0-1099-aws x86_64))

Tried to install google-chrome using:

curl https://intoli.com/install-google-chrome.sh | bash

I get a bunch of “command not found” messages, like:

Downloaded google-chrome-stable_current_x86_64.rpm
bash: line 70: rpm: command not found
Installing the required font dependencies.
bash: line 76: yum: command not found
bash: line 100: repoquery: command not found
http://: Invalid host name.
Extracting glibc…
bash: line 107: rpm2cpio: command not found
(the same repoquery / rpm2cpio “command not found” lines repeat for util-linux, libmount, libblkid, libuuid, libselinux, and pcre)
Finding dependency for ldd.sh
bash: line 176: repoquery: command not found
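
Those “command not found” errors suggest that install script targets RPM-based distros (rpm/yum), while the platform string above looks like an Ubuntu AWS image. A possible workaround I haven’t tested on this exact AMI, sketched in Python, is to install Chrome’s .deb directly:

import subprocess

# Fetch Google's official .deb and install it with apt, which resolves its
# dependencies; assumes an Ubuntu/Debian image with sudo available.
subprocess.run(
    ["wget", "https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb"],
    check=True,
)
subprocess.run(
    ["sudo", "apt", "install", "-y", "./google-chrome-stable_current_amd64.deb"],
    check=True,
)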


I have the same issue on paperspace/gradient :thinking:

am.sharan’s .ipynb using Colab worked:

https://colab.research.google.com/drive/14Zwx9Uh9p8lf-H18sLoaI3s8ByT5jGk6#scrollTo=gaRW9mnxSiUc


Hi,

I decided to classify mushrooms and found this website, where observers can upload photos of mushrooms along with the species they believe the mushrooms belong to (and sometimes a more or less specific denomination, e.g. an infraspecific name/stirp). The community also contributes by stating their opinions based on the photos.

Well, anyway, the maintainers explicitly ask that you not scrape the website but rather drop them an email. I did so and got a reply within 5 minutes or so; 30 minutes later I had access to 10⁶ mushroom pics. Great people.
