[SOLVED] Lesson 2: Creating your own dataset from Google Images

Hi, I’m having issues when attempting to download images. I’m using Google Cloud Platform. Using the provided code, I downloaded .csv files for three different labels:

urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));

The instructions in the notebook are a little vague, though. I’m assuming that after downloading those files I need to convert them to .txt, since the file variables have names like urls_grizzly.txt. Is there a best practice for converting them? When I try to download the images using the provided code, I get the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-11-e85756baeaa4> in <module>
----> 1 download_images(path/file, dest, max_pics=200)

/opt/anaconda3/lib/python3.7/site-packages/fastai/vision/data.py in download_images(urls, dest, max_pics, max_workers, timeout)
    192 def download_images(urls:Collection[str], dest:PathOrStr, max_pics:int=1000, max_workers:int=8, timeout=4):
    193     "Download images listed in text file `urls` to path `dest`, at most `max_pics`"
--> 194     urls = open(urls).read().strip().split("\n")[:max_pics]
    195     dest = Path(dest)
    196     dest.mkdir(exist_ok=True)

/opt/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Any help would be greatly appreciated.

IIRC the error means that there are some bytes at the beginning of your file that are not valid UTF-8 (a leading 0xff is often a UTF-16 byte-order mark). You need to get rid of them before download_images can loop over the URLs and download each one.
This might help: https://stackoverflow.com/questions/29481568/skipping-0xff-byte-when-using-pandas-read-csv
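
In case it is useful, here is a minimal sketch of one way to check for and strip those leading bytes, assuming the file was saved as UTF-16 by whatever tool converted it. The helper names read_url_list and rewrite_as_utf8 are made up for illustration and are not part of fastai.

```python
import codecs
from pathlib import Path

def read_url_list(path):
    """Return the non-empty lines of a URL file, decoding by BOM if one is present."""
    raw = Path(path).read_bytes()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Assumption: a leading FF FE / FE FF means the file is UTF-16.
        text = raw.decode("utf-16")  # the utf-16 codec consumes the BOM
    else:
        # Plain UTF-8; utf-8-sig also drops a UTF-8 BOM if there is one.
        text = raw.decode("utf-8-sig")
    return [line.strip() for line in text.splitlines() if line.strip()]

def rewrite_as_utf8(path):
    """Overwrite the file as BOM-free UTF-8, one URL per line."""
    urls = read_url_list(path)
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return urls
```

After calling rewrite_as_utf8 on the URL file, a plain open(...).read() like the one in the traceback should be able to decode it.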

I figured out the problem I was having. I should have just changed each file variable to point to the .csv file, matching what I downloaded with the script referenced above. It looks like this notebook should be modified to avoid confusion for beginners like me.
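
Roughly, the change looks like the sketch below. The folder data/bears and the helper preview_urls are placeholders for illustration, so adjust them to your own notebook. The helper reads the file the same way line 194 of the traceback does, which makes for a quick sanity check before calling download_images again.

```python
from pathlib import Path

def preview_urls(url_file, max_pics=5):
    """Read `url_file` as download_images does in the traceback and return the first URLs."""
    with open(url_file) as f:
        return f.read().strip().split("\n")[:max_pics]

# Hypothetical values; not taken from the notebook.
path = Path("data/bears")   # assumed dataset folder
file = "urls_grizzly.csv"   # the downloaded file, instead of urls_grizzly.txt

if (path / file).exists():
    print(preview_urls(path / file))
    # Then, in the notebook: download_images(path/file, dest, max_pics=200)
```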

Hi Michael,

I’m having the same issue you described above. Could you please let me know the step-by-step process you used to fix this?

Cheers,
Milad