Lesson 1 - Dataset Issues - GCP

Hi everyone,

I tried to use an image dataset I found online to build a CNN based on the teachings of lessons 1 and 2, but I had trouble loading it into the Google Cloud Platform Console. I simply didn’t know how to do it.
I copied and pasted the CurlWget command to download the dataset via my terminal, but I didn’t know where the file ended up. The download completed, but the dataset was nowhere to be found (or I didn’t know where to look). How does one usually get a dataset acquired this way into a Jupyter Notebook?
Also, if I manage to find where the dataset is and load it successfully, can I still use the untar_data function? The dataset is in zip format and needs to be unzipped. untar_data does that perfectly (based on the course video), but it needs a URL as an argument. Where can I get this URL? CurlWget doesn’t provide one.

I think I am basically asking how to download a dataset from the web and use it in a Jupyter Notebook hosted on Google Cloud. A little guidance would be extremely helpful.

Thank you!

I tried a different approach: I retrieved the direct link to the dataset (it ends with .zip) and passed it as the argument to untar_data. Problem: I get this error: "Downloaded file {fname} does not match checksum expected! Remove that file from {data_dir} and try your code again." What should I do?

Hi Youcef,
Did you find a solution to the checksum issue? I am having the same issue and will let you know if I figure it out.

For anyone else out there who has similar issues:

I never resolved the untar_data checksum issue, but I realized I didn’t need the function to get my data onto my instance. I uploaded the tar file using the notebook’s upload button and then un-tarred it into the location I wanted from the terminal command line. Then the data could be used by:

from fastai.vision import *  # fastai v1 imports (ImageDataBunch, get_transforms, imagenet_stats)
# path points at the directory the archive was un-tarred into
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

i.e. specifying that the training files are in the current directory (train=".") and that 20% of them should be held out as validation data (valid_pct=0.2). This is explained in lesson 2.
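
If you’d rather do the extraction from inside the notebook too, here’s a rough Python equivalent of the terminal un-tar step (the archive name and destination below are placeholders, not my actual paths):

import tarfile
from pathlib import Path

archive = Path.home()/'my_dataset.tgz'          # placeholder: whatever file you uploaded
path = Path.home()/'data'                       # placeholder: wherever you want the files
path.mkdir(parents=True, exist_ok=True)
tarfile.open(archive, 'r:gz').extractall(path)  # use 'r:' instead for an uncompressed .tar

That path is then what you hand to ImageDataBunch.from_folder above.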

Hope this helps.

I am getting the same checksum error when trying to use the CUB_200_2011 dataset for lesson 1 on Paperspace. I would love to know if someone has a solution, because it would make it really easy to test different datasets. It appears to download the images, but then I get this error:

KeyError                                  Traceback (most recent call last)
<ipython-input-…> in <module>
----> 1 path = untar_data('https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011'); path

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/datasets.py in untar_data(url, fname, dest, data, force_download)
    156     fname = download_data(url, fname=fname, data=data)
    157     data_dir = Config().data_path()
--> 158     assert _check_file(fname) == _checks[url], f"Downloaded file {fname} does not match checksum expected! Remove that file from {data_dir} and try your code again."
    159     tarfile.open(fname, 'r:gz').extractall(dest.parent)
    160     return dest

KeyError: 'https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011'

It will help if you work directly in the terminal.

Let’s say the data URL is www.example.com/data.zip. From the terminal, run the following commands:

cd fastaiv3
mkdir data
cd data

Here we went inside the fastaiv3 directory, made a directory called data, and then went inside that directory. Now we’ll download the data:

wget www.example.com/data.zip

This will download data.zip into fastaiv3/data. To extract it, run the following command:

unzip data.zip

You’ll get all the contents of data.zip in fastaiv3/data.

In the notebook, remove the line with untar_data and directly define the path as

path = Path("~/fastaiv3/data").expanduser()  # expanduser() is needed: Path does not expand "~" on its own

Then just go ahead with the notebook. Let me know if any clarifications are needed.
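
As an aside on the KeyError in the traceback above: line 158 of datasets.py compares the download against fastai’s internal _checks table, which evidently only lists fastai’s own datasets, so untar_data will fail for any outside URL. If you’d rather do the download and extraction in Python instead of the terminal, here’s a rough sketch of the same wget + unzip steps (same example URL as above; the paths are placeholders):

import zipfile
from pathlib import Path
from urllib.request import urlretrieve

url = 'http://www.example.com/data.zip'       # the example URL from above
path = Path.home()/'fastaiv3'/'data'
path.mkdir(parents=True, exist_ok=True)
fname, _ = urlretrieve(url, path/'data.zip')  # same job as wget
zipfile.ZipFile(fname).extractall(path)       # same job as unzip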


Hi Brendan,

It still didn’t work for me. The notebook still couldn’t find the path to the images, even after I uploaded them.

Hi Shivam,

It still didn’t work for me. I did the exact same steps before, but the notebook simply couldn’t identify the path.
I get this error: IndexError: index 0 is out of bounds for axis 0 with size 0, which basically means the images couldn’t be found.

I uploaded the images via the terminal, unzipped them, and got the path, but when I run it in Jupyter it says index 0 is out of bounds.
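
In case it helps anyone spot the problem, a quick sanity check is to ask fastai directly what images it can see at the path (assuming fastai v1’s get_image_files; the path below is the example one from earlier in the thread):

from pathlib import Path
from fastai.vision import get_image_files

path = Path('~/fastaiv3/data').expanduser()  # example path from above
files = get_image_files(path, recurse=True)  # an empty list means the path or folder layout is wrong
print(len(files), files[:3])

If this prints 0, the notebook is looking in the wrong place, or the images are nested differently than ImageDataBunch.from_folder expects.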

Does the data have to be in a specific format in order for the CNN to work?