Lesson 1 - Dataset Issues - GCP

I tried to use an image dataset I found online to build a CNN based on lessons 1 and 2, but I had trouble loading it into the Google Cloud Platform Console. I simply didn’t know how to do it.
I copied and pasted the CurlWget command to download the dataset from my terminal, but I didn’t know where the file ended up. The download completed, but the dataset was nowhere to be found (or I didn’t know where to look). How does one usually get a dataset acquired this way into a Jupyter Notebook?
Also, if I manage to find the dataset and load it successfully, can I still use the untar_data function? The dataset is in zip format and needs to be unzipped. untar_data does that perfectly (based on the course video), but it needs a URL as an argument. Where can I get this URL? CurlWget doesn’t provide it.

I think I am basically asking how to download a dataset from the web and use it in a Jupyter Notebook hosted on Google Cloud.

I tried a different approach: I retrieved the direct link to the dataset (which ends with .zip) and pasted it as the argument to the untar_data function. Problem: I get this error: "Downloaded file {fname} does not match checksum expected! Remove that file from {data_dir} and try your code again." What should I do?


Did you find a solution to the checksum issue? I am having the same issue. I will let you know if I figure it out.

For anyone else out there who has similar issues:

I never resolved the untar_data checksum issue but realized I didn’t need to use the function to get my data onto my instance. I uploaded the tar file using the notebook upload button and then un-tarred the file into the location I wanted from the terminal command line. Then the data could be used by:

data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

i.e. specifying the training files are in the current directory (train=".") and to take 20% of these files as the validation data (valid_pct=0.2). This is explained in lesson 2.
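To make the valid_pct idea concrete, here is a minimal sketch of a random hold-out split. It only illustrates the concept and is not fastai's implementation; the function name split_train_valid and the seed parameter are made up for this example.

```python
import random


def split_train_valid(items, valid_pct=0.2, seed=None):
    """Randomly hold out `valid_pct` of `items` as a validation set.

    Illustration only: this is NOT how fastai implements valid_pct internally.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    # The first n_valid shuffled items become validation, the rest training.
    return shuffled[n_valid:], shuffled[:n_valid]


train_files, valid_files = split_train_valid(range(100), valid_pct=0.2, seed=0)
print(len(train_files), len(valid_files))
```

Every file ends up in exactly one of the two lists, so no image is used for both training and validation.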



I am getting the same checksum error when trying to use the CUB_200_2011 dataset for Lesson 1 on Paperspace. I would love to know if someone has a solution, because it would make it really easy to test different datasets. It appears to download the images, but then I get this error:

KeyError Traceback (most recent call last)
----> 1 path = untar_data('https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011'); path

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/datasets.py in untar_data(url, fname, dest, data, force_download)
156 fname = download_data(url, fname=fname, data=data)
157 data_dir = Config().data_path()
--> 158 assert _check_file(fname) == _checks[url], f"Downloaded file {fname} does not match checksum expected! Remove that file from {data_dir} and try your code again."
159 tarfile.open(fname, 'r:gz').extractall(dest.parent)
160 return dest

KeyError: 'https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011'
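Reading the traceback, the exception is a KeyError raised at `_checks[url]`, which suggests the URL simply has no entry in the table of expected checksums, rather than the file having a wrong checksum. A minimal sketch of that distinction, using a hypothetical stand-in dictionary (known_checks is made up for illustration and is not fastai's actual `_checks` table):

```python
# Hypothetical stand-in for a table of expected checksums, keyed by URL.
known_checks = {"https://example.com/known_dataset": (123, "abc")}


def verify(url, actual):
    # Same shape as the line in the traceback: the dictionary lookup happens
    # while the assert condition is evaluated, so an unknown URL raises
    # KeyError before the checksum comparison is ever made.
    assert actual == known_checks[url], "Downloaded file does not match checksum expected!"


def describe_failure(url, actual):
    try:
        verify(url, actual)
    except KeyError:
        return "no checksum registered for this URL"
    except AssertionError:
        return "checksum mismatch"
    return "ok"


print(describe_failure("https://example.com/some_other_dataset", (1, "x")))
```

If that reading is right, deleting the downloaded file and retrying would not help, because the lookup fails for any URL that is missing from the table.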

It will help if you work directly on the terminal.

Let’s say the data URL is www.example.com/data.zip. From the terminal, run the following commands:

cd fastaiv3
mkdir data
cd data

Here we went inside the fastaiv3 directory, made a directory called data, and then went inside that directory. Now we’ll download the data:

wget www.example.com/data.zip

This will download data.zip to fastaiv3/data. To extract it, run the following command:

unzip data.zip

You’ll get all the contents of the data.zip in fastaiv3/data
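The same download-and-extract steps can also be run from a notebook cell. This is a minimal sketch using only the Python standard library; the function name download_and_unzip is made up for this example, and www.example.com/data.zip is still just the placeholder URL from above.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url, dest_dir):
    """Download the .zip at `url` into `dest_dir` and extract it there."""
    dest_dir = Path(dest_dir).expanduser()  # turn "~" into the real home directory
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last part of the URL, e.g. "data.zip".
    archive = dest_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
    return dest_dir


# Example call (placeholder URL, so it is left commented out):
# path = download_and_unzip("http://www.example.com/data.zip", "~/fastaiv3/data")
```

The function returns the extraction directory, so its result can be used as the path in the notebook.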

In the notebook, remove the line with untar_data and directly define the path as

path = Path("~/fastaiv3/data").expanduser()

Then just go ahead with the notebook. Let me know if anything needs clarifying.
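One thing worth double-checking with a path that starts with `~`: pathlib keeps the `~` as a literal directory name unless `expanduser()` is called, so the notebook may end up looking in a folder that does not exist. A small sketch of the check, assuming the data really lives in fastaiv3/data under your home directory:

```python
from pathlib import Path

raw = Path("~/fastaiv3/data")
expanded = raw.expanduser()

# `raw` is a relative path whose first component is literally "~".
print(raw.is_absolute(), raw.parts[0])
# `expanded` starts at the real home directory instead.
print(expanded.is_absolute(), expanded)
# Confirm the folder is actually there before handing it to the notebook.
print(expanded.exists())
```

If the last line prints False, the data was extracted somewhere else, and running pwd in the terminal where unzip was run shows where.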



It still didn’t work for me. The notebook still couldn’t find the path to the images, even after I uploaded them.



It still didn’t work for me. I had already followed exactly these steps, but the notebook simply couldn’t find the path.
I get this error: IndexError: index 0 is out of bounds for axis 0 with size 0, which basically says that the images couldn’t be found.

I uploaded the images via the terminal, unzipped them and got the path but when I run it on Jupyter it says the index 0 is out of bounds.

Does the data have to be in a specific format in order for the CNN to work?
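On the format question: ImageDataBunch.from_folder takes its labels from folder names, so it expects one subfolder per class with the images inside. A quick way to see what the notebook will find is to count image files per subfolder. This is a rough sketch; the function name images_per_class and the list of extensions are assumptions for illustration, not something fastai provides.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def images_per_class(root):
    """Return {subfolder name: number of image files found under it}."""
    root = Path(root).expanduser()
    counts = {}
    # iterdir() raises FileNotFoundError if `root` itself does not exist,
    # which is already a useful hint that the path is wrong.
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[sub.name] = sum(
            1 for f in sub.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTS
        )
    return counts


# Example call, assuming the data was extracted to ~/fastaiv3/data:
# print(images_per_class("~/fastaiv3/data"))
```

An empty result, or zeros everywhere, would mean no images were found under that path, which fits an "index 0 is out of bounds for axis 0 with size 0" error.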