[SOLVED] Lesson 2: Creating your own dataset from Google Images

(Michael Mullen) #1

Hi, I’m having issues when attempting to download images. I’m using Google Cloud Platform. Using the provided code, I downloaded .csv files for three different labels:

// Collect the full-resolution image URLs from the search results page
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
// Save them as a newline-separated download
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));

But the instructions in the notebook are a little vague. I’m assuming that after downloading those, I need to convert them to .txt files, since the file variables are named things like urls_grizzly.txt. Is there a best practice for converting these files to .txt? When I try to download the images using the provided code, I get the following errors:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-11-e85756baeaa4> in <module>
----> 1 download_images(path/file, dest, max_pics=200)

/opt/anaconda3/lib/python3.7/site-packages/fastai/vision/data.py in download_images(urls, dest, max_pics, max_workers, timeout)
    192 def download_images(urls:Collection[str], dest:PathOrStr, max_pics:int=1000, max_workers:int=8, timeout=4):
    193     "Download images listed in text file `urls` to path `dest`, at most `max_pics`"
--> 194     urls = open(urls).read().strip().split("\n")[:max_pics]
    195     dest = Path(dest)
    196     dest.mkdir(exist_ok=True)

/opt/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Any help would be greatly appreciated.


(Jochem) #2

IIRC, the error means there are stray bytes at the beginning of your file that you need to remove before download_images can read the list of URLs and download each one.
This might help: https://stackoverflow.com/questions/29481568/skipping-0xff-byte-when-using-pandas-read-csv


(Michael Mullen) #3

I figured out the problem I was having. I just needed to change each file variable to end in .csv so it matched the files I actually downloaded with the script referenced above; no conversion to .txt was needed. It looks like the notebook should be updated to avoid confusing beginners like me.


(Milad Dakka) #4

Hi Michael,

I’m having the same issue you described above. Could you please walk me through the step-by-step process you used to fix it?