I tried to use an image dataset I found online to build a CNN based on the teachings of lessons 1 and 2, but I had an issue loading it into the Google Cloud Platform Console. I simply didn’t know how to do it.
I copied and pasted the CurlWget code to download the dataset via my terminal, but I didn’t know where the file ended up. The download completed, but the dataset was nowhere to be found (or I didn’t know where to look). How does one usually get a dataset acquired this way into a Jupyter Notebook?
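One way to avoid losing track of a terminal download is to do the download from Python itself, into a directory you choose explicitly. A minimal sketch with the standard library — the URL here is a placeholder for whatever link CurlWget generated for you, and the demo uses a local `file://` URI so it runs without network access:

```python
import tempfile
import urllib.request
from pathlib import Path

def download_to(url: str, dest_dir: Path, fname: str = "data.zip") -> Path:
    """Download url into dest_dir (created if needed) and return the file path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_file = dest_dir / fname
    urllib.request.urlretrieve(url, dest_file)
    return dest_file

# Offline demo: a local file:// URI stands in for the real dataset link.
src = Path(tempfile.mkdtemp()) / "data.zip"
src.write_bytes(b"fake zip contents")
saved = download_to(src.as_uri(), Path(tempfile.mkdtemp()) / "datasets")
print(saved)  # the file now lives at a path you chose yourself
```

With a real URL you would call `download_to("https://…/data.zip", Path.home() / "data")` and know exactly where to look afterwards.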
Also, if I do figure out where the dataset is and load it successfully, can I still use the untar_data function? The dataset is in zip format and needs to be unzipped. untar_data does that perfectly (based on the course video), but it needs a URL as an argument. Where can I get this URL? CurlWget doesn’t provide one.
I think I am basically asking how to download a dataset from the web and use it in a Jupyter Notebook hosted on Google Cloud. A little guidance would be extremely helpful.
I tried a different approach: I retrieved the direct link to the dataset (which ends with .zip) and passed it as the argument to the untar_data function. Problem: I get this error: "Downloaded file {fname} does not match checksum expected! Remove that file from {data_dir} and try your code again." What should I do?
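The checksum error is expected here: untar_data verifies downloads against checksums for the datasets fastai itself ships, so an arbitrary .zip URL will fail the check. One workaround is to skip untar_data entirely and download/extract by hand with the standard library. A sketch — the demo builds a tiny zip in memory, but in practice the bytes would come from fetching your dataset’s direct link:

```python
import io
import tempfile
import zipfile
from pathlib import Path

def fetch_and_unzip(zip_bytes: bytes, dest: Path) -> Path:
    """Extract an in-memory zip archive into dest and return dest."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest)
    return dest

# Demo with a tiny zip built on the fly; for a real dataset, zip_bytes
# would be the downloaded contents of the .zip link.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("images/cat.jpg", b"not really a jpeg")
out = fetch_and_unzip(buf.getvalue(), Path(tempfile.mkdtemp()) / "data")
print(sorted(p.name for p in out.rglob("*") if p.is_file()))  # ['cat.jpg']
```

Once extracted, you can point the rest of the notebook at the destination directory and never touch untar_data.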
I never resolved the untar_data checksum issue, but I realized I didn’t need the function to get my data onto my instance. I uploaded the tar file using the notebook upload button and then un-tarred it into the location I wanted from the terminal command line. Then the data could be used by:
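The un-tarring step can also be done from a notebook cell rather than the terminal, using Python’s tarfile module. A minimal sketch, assuming a gzipped tar (the demo builds a throwaway archive so it is self-contained):

```python
import tarfile
import tempfile
from pathlib import Path

def untar_to(archive: Path, dest: Path) -> Path:
    """Extract a .tar / .tar.gz archive into dest and return dest."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tf:  # compression is auto-detected
        tf.extractall(dest)
    return dest

# Demo: build a tiny archive, then extract it where we want the data.
work = Path(tempfile.mkdtemp())
payload = work / "train"
payload.mkdir()
(payload / "img.jpg").write_bytes(b"pixels")
archive = work / "data.tar.gz"
with tarfile.open(archive, "w:gz") as tf:
    tf.add(payload, arcname="train")
out = untar_to(archive, work / "extracted")
print((out / "train" / "img.jpg").exists())  # True
```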
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
i.e. specifying that the training files are in the current directory (train=".") and that 20% of them should be used as validation data (valid_pct=0.2). This is explained in lesson 2.
I am getting the same checksum error when trying to use the CUB_200_2011 dataset for Lesson 1 on Paperspace. I would love to know if someone has a solution, because it would make it really easy to test different datasets. It appears to download the images, but then I get this error:
It will help if you work directly on the terminal.
Let’s say the data URL is www.example.com/data.zip. From the terminal, run the following commands:
cd fastaiv3
mkdir data
cd data
Here we went inside the fastaiv3 directory, made a directory called data, and then went inside it. Now we’ll download the data:
wget www.example.com/data.zip
This will download data.zip into fastaiv3/data. To extract it, run the following command:
unzip data.zip
You’ll get all the contents of data.zip in fastaiv3/data.
In the notebook, remove the line with untar_data and directly define the path as
path = Path("~/fastaiv3/data").expanduser()
Then just go ahead with the notebook. Let me know if any clarifications are needed.
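One gotcha worth spelling out, because it bites a lot of people: pathlib treats the tilde as a literal character, so a path built from "~/…" does not point at your home directory unless you call .expanduser() on it (or write the absolute path). A minimal illustration:

```python
from pathlib import Path

raw = Path("~/fastaiv3/data")        # literal "~" -- usually not a real directory
expanded = raw.expanduser()          # tilde replaced by your home directory

print(raw)                 # ~/fastaiv3/data
print(expanded)            # e.g. /home/jupyter/fastaiv3/data
print("~" in str(expanded))  # False
```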
It still didn’t work for me. I did the exact same steps as before, but the notebook simply couldn’t identify the path.
I get this error: IndexError: index 0 is out of bounds for axis 0 with size 0, which basically means the images couldn’t be found.
I uploaded the images via the terminal, unzipped them, and got the path, but when I run it in Jupyter it says index 0 is out of bounds.
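That IndexError usually means fastai found zero image files under the path, so before building the DataBunch it is worth checking the path and counting the images yourself. A sketch with plain pathlib — the demo directory here is a throwaway stand-in for your real data path:

```python
import tempfile
from pathlib import Path

def count_images(path: Path) -> int:
    """Count files with common image extensions anywhere under path."""
    exts = {".jpg", ".jpeg", ".png"}
    return sum(1 for p in path.rglob("*") if p.suffix.lower() in exts)

# Demo on a throwaway directory; point these checks at your real path instead.
root = Path(tempfile.mkdtemp())
(root / "train").mkdir()
(root / "train" / "a.jpg").write_bytes(b"x")
(root / "train" / "notes.txt").write_text("not an image")

print(root.exists())       # True -- first confirm the path itself resolves
print(count_images(root))  # 1 -- if this is 0, ImageDataBunch will fail
```

If `count_images(path)` returns 0 on your instance, the problem is the path (a typo, an unexpanded "~", or the zip extracted into a nested subdirectory), not the DataBunch call.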