For anyone experiencing slow download speeds for several of fastai’s datasets (which are hosted on AWS), I found a trick that makes the downloads 5 to 10 times faster.
Use the aws s3 CLI command to download the files instead of an HTTPS client.
- Search for the relevant fast.ai datasets bucket on the Registry of Open Data on AWS.
- List files within the S3 bucket:
aws s3 ls s3://fast-ai-imageclas/ --no-sign-request
- Copy file:
aws s3 cp s3://fast-ai-imageclas/oxford-iiit-pet.tgz oxford-iiit-pet.tgz --no-sign-request
Install the AWS CLI using one of the available options here: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html
Look up the URL of the dataset you want to download in Python, e.g.
(At the time of this writing, the values of the above will show
Take note of the last part of the URL before the file name (e.g. fast-ai-imageclas). That is the name of the dataset on AWS.
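If you'd rather not pick the name out by eye, here is a rough sketch of the same step in Python. It assumes the dataset URL has the form https://s3.amazonaws.com/<bucket>/<file>; the example URL and the helper name split_dataset_url are my own illustration, so check them against what your fastai version actually prints.

```python
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import urlparse


def split_dataset_url(url: str) -> Tuple[str, str]:
    """Return (bucket, filename) for a URL shaped like https://host/<bucket>/<file>."""
    parts = PurePosixPath(urlparse(url).path).parts  # ('/', '<bucket>', '<file>')
    if len(parts) < 3:
        raise ValueError(f"cannot find a bucket and a file name in {url!r}")
    return parts[-2], parts[-1]


# Assumed example URL; replace it with the value of URLs.<DATASET> on your machine.
url = "https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz"
bucket, filename = split_dataset_url(url)

# Print the matching AWS CLI command so it can be pasted into a terminal.
print(f"aws s3 cp s3://{bucket}/{filename} {filename} --no-sign-request")
```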
Go to Registry of Open Data on AWS and search for the dataset name (e.g.
Grab the AWS CLI command under the AWS CLI Access (No AWS account required) section and run it to list all the files available for download (e.g. aws s3 ls s3://fast-ai-imagelocal/ --no-sign-request)
This will output a list of files similar to the following:
2018-11-08 03:01:39  452316199 biwi_head_pose.tgz
2018-10-26 19:12:24  598913237 camvid.tgz
2018-10-09 00:43:31 4639722845 pascal-voc.tgz
2020-01-24 14:44:55 1637796771 pascal_2007.tgz
2020-01-24 14:45:27 2618908000 pascal_2012.tgz
2020-03-17 14:48:05   33276453 siim_small.tgz
2020-02-11 19:03:23 6601110169 skin-lesion.tgz
2020-12-22 00:43:35   14744474 tcga_small.tgz
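The third column is the file size in bytes, which is hard to read at a glance. As a small sketch (the helper name parse_s3_ls is made up for this example, and it assumes every line has the date, time, size, name layout shown in the listing), you could paste the output into Python and print the sizes in GB:

```python
from typing import List, Tuple

# A few lines copied from the listing above.
listing = """\
2018-11-08 03:01:39  452316199 biwi_head_pose.tgz
2020-01-24 14:44:55 1637796771 pascal_2007.tgz
2020-02-11 19:03:23 6601110169 skin-lesion.tgz
"""


def parse_s3_ls(text: str) -> List[Tuple[str, int]]:
    """Turn `aws s3 ls` output into (file name, size in bytes) pairs."""
    rows = []
    for line in text.splitlines():
        fields = line.split(maxsplit=3)  # date, time, size, name
        if len(fields) != 4 or not fields[2].isdigit():
            continue  # skip blank lines and anything that is not a file entry
        rows.append((fields[3], int(fields[2])))
    return rows


for name, size in parse_s3_ls(listing):
    print(f"{name:<20} {size / 1e9:5.2f} GB")
```

The aws s3 ls command also has a --human-readable flag if you only want readable sizes directly in the terminal.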
Adjust the command to download a file using the AWS CLI, e.g.
aws s3 cp s3://fast-ai-imagelocal/pascal_2007.tgz pascal_2007.tgz --no-sign-request
Once the download is finished, copy the file to fastai’s archives folder (which is usually located under
Then go back to your code and run the usual untar_data function for your dataset (e.g. path = untar_data(URLs.PASCAL_2007)). It will see that the archive is already present, skip the download, and go straight to extracting the data. That’s it!
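The copy step can also be done from Python. This is only a sketch: the helper move_into_archive is hypothetical, and the default folder ~/.fastai/archive is an assumption taken from the docker example in this post, so adjust it if your fastai config points somewhere else.

```python
import shutil
from pathlib import Path
from typing import Optional, Union


def move_into_archive(downloaded: Union[str, Path],
                      archive_dir: Optional[Union[str, Path]] = None) -> Path:
    """Move a downloaded .tgz into the folder where fastai keeps its archives."""
    downloaded = Path(downloaded)
    # Assumed default location; pass archive_dir explicitly if yours differs.
    target_dir = Path(archive_dir) if archive_dir else Path.home() / ".fastai" / "archive"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / downloaded.name
    if not target.exists():  # never overwrite an archive that is already there
        shutil.move(str(downloaded), str(target))
    return target


# Usage, after the aws s3 cp command has finished:
#   move_into_archive("pascal_2007.tgz")
#   from fastai.vision.all import untar_data, URLs
#   path = untar_data(URLs.PASCAL_2007)
```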
With wget (or by pasting the URL into a browser), my download speed was between 80 and 150 KB/s. Switching to the AWS CLI increased the download speed to about 700-1000 KB/s. The reason is that the AWS S3 buckets (where the datasets are stored) are located in Northern Virginia (the us-east-1 region), and this slows things down for users outside the US.
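To get a feel for what those speeds mean in practice, here is a quick back-of-the-envelope calculation for pascal_2007.tgz (1637796771 bytes in the listing above). It assumes 1 KB = 1000 bytes and a constant speed, so treat the results as rough estimates only.

```python
def download_hours(size_bytes: int, speed_kb_per_s: float) -> float:
    """Estimated download time in hours, assuming 1 KB = 1000 bytes."""
    seconds = size_bytes / (speed_kb_per_s * 1000)
    return seconds / 3600


PASCAL_2007_BYTES = 1637796771  # size taken from the file listing

# The speeds I measured with wget/browser (80-150 KB/s) and with the AWS CLI (700-1000 KB/s).
for label, speed in [("wget, slow", 80), ("wget, fast", 150),
                     ("aws cli, slow", 700), ("aws cli, fast", 1000)]:
    print(f"{label:<14} {download_hours(PASCAL_2007_BYTES, speed):5.2f} h")
```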
Since this is usually a one-time issue for most people, you can use docker to run a container and download the files, like so:
docker run --rm -it -v /home/user/.fastai/archive:/mnt amazon/aws-cli s3 cp s3://fast-ai-imageclas/oxford-iiit-pet.tgz /mnt/oxford-iiit-pet.tgz --no-sign-request