How to download a dataset from a URL?

So, I was working with the week 1 notebook and I have a question: how can I download a dataset and work on it? In my case, I want to download the Devanagari dataset, which is like the MNIST dataset but for the Hindi alphabet.

I found the dataset on the UCI ML repository here: https://archive.ics.uci.edu/ml/machine-learning-databases/00389/

So, I want to know how I can download it and start working on it. I have tried the untar_data() and download_data() functions, but they are not working for me. Any help would be appreciated.

Cheers.

What’s the error you are getting?
For me it's a "follow redirects" error. Here's a related GitHub issue: https://github.com/fastai/fastai/issues/983

I have managed to download the dataset with the download_data() function. It downloaded a .zip.tgz file, and now I don't know how to decompress it into files and folders. Any ideas?

Hmm, maybe try some shell commands like:

!cd /dir/to/data && tar <some command>
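The same tar idea can be done from Python with the standard-library tarfile module; this is just a sketch, and the paths in the usage comment are placeholders, not anything from the original post:

```python
import tarfile
from pathlib import Path

def extract_archive(archive_path, dest_dir):
    """Extract a .tgz / .tar.gz archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile auto-detect the compression (gzip, bz2, xz, or none)
    with tarfile.open(archive_path, "r:*") as tf:
        tf.extractall(dest)
    return dest

# e.g. extract_archive("/root/.fastai/data/archive.tgz", "/root/.fastai/data/archive")
```

One caveat: if a file named .zip.tgz is really a zip archive that was merely saved with a .tgz suffix, tarfile will refuse to open it, and the zipfile approach is what you want instead.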

To unzip a file, try this code inside your Jupyter notebook:

import zipfile
zip_ref = zipfile.ZipFile('foldername/filename.zip', 'r')
zip_ref.extractall('foldername/')
zip_ref.close()


Thanks, @Mauro. This was really helpful.


I'm super new here and also ran into problems. The untar_data() and download_data() functions were confusing to use, and I couldn't figure them out.

I finally did it this way:

!wget -P /root/.fastai/data/ https://sid.erda.dk/public/archives/ff17dc924eba88d5d01a807357d6614c/TestIJCNN2013.zip

and

!unzip -q /root/.fastai/data/TestIJCNN2013.zip -d /root/.fastai/data/test/

The problem is that this dataset is not a good pick right after the first lesson: they don't even provide labels, or at least I couldn't find them. I'm going to create my own little toy dataset now.
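For what it's worth, the two shell steps above (download, then unzip) can also be done in one go from Python using only the standard library. This is a sketch under the same assumptions as my commands: the default data directory is the one wget used above, and you should adapt any path to your setup:

```python
import urllib.parse
import urllib.request
import zipfile
from pathlib import Path

def download_and_unzip(url, data_dir="/root/.fastai/data"):
    """Download a zip archive from `url` and extract it into a subfolder of data_dir."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    zip_name = Path(urllib.parse.urlparse(url).path).name
    zip_path = data_dir / zip_name
    urllib.request.urlretrieve(url, zip_path)   # same job as wget -P
    extract_dir = data_dir / zip_path.stem      # e.g. .../TestIJCNN2013/
    with zipfile.ZipFile(zip_path) as zf:       # same job as unzip -d
        zf.extractall(extract_dir)
    return extract_dir
```

As a side note, urlretrieve also accepts file:// URLs, which is handy for testing the unzip half without a network connection.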

I ran into problems downloading a dataset onto the virtual machine. Here are some of the ways to do it:

  1. First of all, we can use the untar_data method of fastai to download as well as untar the data. It works well for the standard datasets used in the fastai course, which are stored in the cloud in gzip format. However, this method cannot be used to download a file from Google Drive (at least I couldn't).

  2. There is Python code available that can extract any zip file; I have tested it.
    https://stackoverflow.com/questions/3451111/unzipping-files-in-python

  3. So, the problem boils down to downloading the datasets. We can use the wget command to download a dataset. It works very fast; I downloaded the BACH test dataset, which is 3 GB, in less than 5 minutes. https://zenodo.org/record/3632035/files/ICIAR2018_BACH_Challenge_TestDataset.zip
    But this does not work directly on a Google Drive shared link.

  4. To download from Google Drive, we can use the following bash script, which uses the curl command.
    (NB: Just select this code, copy it, and paste it into the terminal with Shift+Ctrl+V or the right-click paste option.)

#!/bin/bash
fileid="file id of Google drive shared file"
filename="Write your filename"
curl -c ./cookie -s -L "https://drive.google.com/uc?export=download&id=${fileid}" > /dev/null
curl -Lb ./cookie "https://drive.google.com/uc?export=download&confirm=$(awk '/download/ {print $NF}' ./cookie)&id=${fileid}" -o ${filename}

You must replace the file id with the Google Drive id of the file, and the filename is the name you want to give the file, in double quotes. Make sure the file is shared publicly (anyone with the link must be able to access it); only then does it work. I have tested it and it works fine.
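As an aside, the awk '/download/ {print $NF}' step in that script just pulls the confirm token (the last whitespace-separated field of a cookie line containing "download") out of the saved cookie file. A rough Python equivalent, assuming the same cookie-file format, might look like this:

```python
def confirm_token(cookie_text):
    """Mimic awk '/download/ {print $NF}': return the last field
    of the last line that contains the substring 'download'."""
    token = None
    for line in cookie_text.splitlines():
        if "download" in line and line.split():
            token = line.split()[-1]
    return token
```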


After downloading the zip file, you can unzip it with the unzip command or the Python method in step 2.
  5. The wget command can also be used to download a Google Drive file.
    (Files larger than 100 MB count as large files.) Also change docs.google to drive.google.

For a large file, run the following command with the necessary changes to FILEID and FILENAME:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILEID" -O FILENAME && rm -rf /tmp/cookies.txt

For a small file, run the following command on your terminal:
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O FILENAME

https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/
It works; I have tested it.

  6. In addition to Google Drive, you can use wget to download a file from any link:
    wget https://zenodo.org/record/3632035/files/ICIAR2018_BACH_Challenge_TestDataset.zip -O 'ICIAR2018_BACH_Challenge_TestDataset.zip'

  7. You can also download files with the curl command:
    curl -o ICIAR2018_BACH_Challenge_TestDataset.zip https://zenodo.org/record/3632035/files/ICIAR2018_BACH_Challenge_TestDataset.zip
    (example code for the BACH dataset)
    https://www.howtogeek.com/447033/how-to-use-curl-to-download-files-from-the-linux-command-line/

Just reposting the link to a recent answer to a similar question that uses download_data and untar_data. Tested on the OP's dataset, and it works.