If you are experiencing slow download speeds for several of fastai’s datasets (which are hosted on AWS), I found a trick that makes downloads 5 to 10 times faster.
TL;DR
Use the aws s3 CLI command to download the files instead of any HTTPS client.
Check the URL of the dataset you want to download in Python, e.g. URLs.PETS or URLs.PASCAL_2007. (At the time of this writing, these evaluate to https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz and https://s3.amazonaws.com/fast-ai-imagelocal/pascal_2007.tgz.)
Take note of the last URL segment before the file name (e.g. fast-ai-imagelocal or fast-ai-imageclas). That is the name of the S3 bucket that hosts the dataset.
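As a sketch, you can also pull the bucket name out of the URL programmatically. This uses only the standard library; the helper name bucket_from_url is my own invention, not part of fastai:

```python
from urllib.parse import urlparse

def bucket_from_url(url: str) -> str:
    """Return the S3 bucket name from a path-style S3 URL.

    For https://s3.amazonaws.com/<bucket>/<key>, the bucket is the
    first path component after the host.
    """
    return urlparse(url).path.lstrip("/").split("/", 1)[0]

print(bucket_from_url("https://s3.amazonaws.com/fast-ai-imagelocal/pascal_2007.tgz"))
# fast-ai-imagelocal
```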
Grab the aws CLI command from the AWS CLI Access (No AWS account required) section and run it to list all files available for download, e.g. aws s3 ls s3://fast-ai-imagelocal/ --no-sign-request.
This will output a list of the files available in the bucket.
Adjust the command to download a file with the AWS CLI, e.g. aws s3 cp s3://fast-ai-imagelocal/pascal_2007.tgz pascal_2007.tgz --no-sign-request.
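If you would rather drive the CLI from Python, here is a hedged sketch that just builds the command; s3_cp_command is a hypothetical helper name, and actually running it assumes the AWS CLI is installed and on your PATH:

```python
import subprocess

def s3_cp_command(bucket, key, dest=None):
    """Build the unsigned `aws s3 cp` command for a public bucket."""
    return ["aws", "s3", "cp", f"s3://{bucket}/{key}", dest or key, "--no-sign-request"]

cmd = s3_cp_command("fast-ai-imagelocal", "pascal_2007.tgz")
# Uncomment to actually download (requires the AWS CLI and network access):
# subprocess.run(cmd, check=True)
```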
Once the download has finished, copy the file into fastai’s archive folder (usually located at ~/.fastai/archive).
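The copy step can be scripted too. A minimal sketch, assuming the default ~/.fastai/archive location (it may differ depending on your fastai version or config); stage_archive is a made-up helper name:

```python
import shutil
from pathlib import Path

def stage_archive(downloaded, archive_dir="~/.fastai/archive"):
    """Copy a downloaded .tgz into fastai's archive folder so that
    untar_data() finds it there and skips the download."""
    dest_dir = Path(archive_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(downloaded).name
    shutil.copy2(downloaded, dest)
    return dest
```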
Then go back to your code and run the usual untar_data function for your dataset, e.g. path = untar_data(URLs.PASCAL_2007). It will detect that the archive is already present and proceed straight to extracting the data. That’s it!
Results
With wget (or pasting the URL into a browser) my download speed was between 80 and 150 KB/s. Switching to the AWS CLI increased it to about 700-1000 KB/s. The reason is that the S3 buckets where the datasets are stored are located in Northern Virginia (the us-east-1 region), which slows things down for non-US users.
Tip:
Since this is usually a one-time issue for most people, you can also use Docker to run a container and download the files from there.
My download speed from a locally running notebook was just 500 KB/s. After I turned on a VPN, it rose to ~1.7 MB/s. And with this method, it rose to ~12 MB/s, independent of the VPN. A great result!
I think the lack of interest in this optimization can be explained by the fact that not many people run notebooks locally. Also, most downloads are probably not as slow as mine.