Tip to download datasets faster

For anyone experiencing slow download speeds for fastai’s datasets (which are hosted on AWS), I found a trick that makes them download 5 to 10 times faster.

TL;DR

Use the aws s3 CLI command to download the files instead of an HTTPS client.

  1. Search for the relevant fast.ai dataset bucket on the Registry of Open Data on AWS.
  2. List files within the S3 bucket: aws s3 ls s3://fast-ai-imageclas/ --no-sign-request
  3. Copy file: aws s3 cp s3://fast-ai-imageclas/oxford-iiit-pet.tgz oxford-iiit-pet.tgz --no-sign-request

Long description

  1. Install the AWS CLI using one of the available options here: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html

  2. Check the URL of the dataset you want to download in Python, e.g. URLs.PETS or URLs.PASCAL_2007.
    (At the time of writing, these evaluate to https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz and https://s3.amazonaws.com/fast-ai-imagelocal/pascal_2007.tgz)

  3. Take note of the last part of the URL before the file name (e.g. fast-ai-imagelocal or fast-ai-imageclas). That is the name of the S3 bucket the dataset is stored in on AWS.

  4. Go to Registry of Open Data on AWS and search for the dataset name (e.g. fast-ai-imagelocal)

  5. Grab the aws cli command under the AWS CLI Access (No AWS account required) section and run it to list all files available for download (e.g. aws s3 ls s3://fast-ai-imagelocal/ --no-sign-request)
    This will output a list of files similar to the following:

2018-11-08 03:01:39  452316199 biwi_head_pose.tgz
2018-10-26 19:12:24  598913237 camvid.tgz
2018-10-09 00:43:31 4639722845 pascal-voc.tgz
2020-01-24 14:44:55 1637796771 pascal_2007.tgz
2020-01-24 14:45:27 2618908000 pascal_2012.tgz
2020-03-17 14:48:05   33276453 siim_small.tgz
2020-02-11 19:03:23 6601110169 skin-lesion.tgz
2020-12-22 00:43:35   14744474 tcga_small.tgz

  6. Adjust the command to download a file using the AWS CLI, e.g. aws s3 cp s3://fast-ai-imagelocal/pascal_2007.tgz pascal_2007.tgz --no-sign-request

  7. Once the download has finished, copy the file to fastai’s archive folder (usually located under ~/.fastai/archive).

  8. Then go back to your code and run the usual untar_data function for your dataset (e.g. path = untar_data(URLs.PASCAL_2007)). It will see that the archive is already present, skip the download, and go straight to extracting the data. That’s it!
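
Steps 2, 3, 6 and 7 above can be sketched in Python. This is only an illustration, not part of fastai: the helper names split_dataset_url and aws_cp_command are made up for this example, and it assumes the path-style URLs shown in step 2 and the default ~/.fastai/archive location. It only builds the command; nothing is downloaded.

```python
from pathlib import Path
from typing import List, Tuple
from urllib.parse import urlparse


def split_dataset_url(url: str) -> Tuple[str, str]:
    """Split a path-style S3 URL into (bucket, key).

    Path-style URLs look like https://s3.amazonaws.com/<bucket>/<key>.
    """
    bucket, _, key = urlparse(url).path.lstrip("/").partition("/")
    if not bucket or not key:
        raise ValueError(f"not a path-style S3 URL: {url}")
    return bucket, key


def aws_cp_command(url: str, archive_dir: str = "~/.fastai/archive") -> List[str]:
    """Build the `aws s3 cp` command that saves the file into the archive folder."""
    bucket, key = split_dataset_url(url)
    # Writing straight into the archive folder makes the manual copy step unnecessary.
    dest = Path(archive_dir).expanduser() / Path(key).name
    return ["aws", "s3", "cp", f"s3://{bucket}/{key}", str(dest), "--no-sign-request"]


if __name__ == "__main__":
    # URL taken from step 2 (the value of URLs.PASCAL_2007 at the time of the post).
    url = "https://s3.amazonaws.com/fast-ai-imagelocal/pascal_2007.tgz"
    print(" ".join(aws_cp_command(url)))
```

The returned list can be passed to subprocess.run(cmd, check=True) if you prefer to start the download from Python rather than pasting the printed command into a shell.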


Results

With wget (or by pasting the URL into a browser) my download speed was between 80 and 150 KB/s. Switching to the aws cli increased it to about 700-1000 KB/s. The AWS S3 buckets where the datasets are stored are located in Northern Virginia (us-east-1 region), which slows things down for users outside the US. The CLI probably copes better because it can fetch a large file in several parts over parallel connections rather than over a single HTTPS stream.
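
To put those speeds in perspective, here is a rough estimate of how long pascal_2007.tgz (1637796771 bytes according to the listing above) would take at each end of the two ranges. It is a back-of-the-envelope sketch: it assumes a steady speed and 1 KB = 1000 bytes, and download_hours is just a name chosen for this example.

```python
def download_hours(size_bytes: int, speed_kb_per_s: float) -> float:
    """Hours needed to download size_bytes at a steady speed.

    Assumes 1 KB = 1000 bytes; real transfers fluctuate, so treat this as a rough guide.
    """
    seconds = size_bytes / (speed_kb_per_s * 1000)
    return seconds / 3600


PASCAL_2007_BYTES = 1_637_796_771  # size reported by `aws s3 ls` in the post

if __name__ == "__main__":
    for label, speed in [
        ("wget, low end", 80),
        ("wget, high end", 150),
        ("aws cli, low end", 700),
        ("aws cli, high end", 1000),
    ]:
        hours = download_hours(PASCAL_2007_BYTES, speed)
        print(f"{label:>18}: {hours:.2f} h")
```

Under these assumptions the file takes roughly 3 to 5.7 hours at the wget speeds and about 27 to 39 minutes at the aws cli speeds.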


Tip:

Since this is usually a one-time issue for most people, you can avoid installing the AWS CLI and run it in a Docker container to download the files, like so:

docker run --rm -it -v /home/user/.fastai/archive:/mnt amazon/aws-cli s3 cp s3://fast-ai-imageclas/oxford-iiit-pet.tgz /mnt/oxford-iiit-pet.tgz --no-sign-request


All download commands

For convenience, here’s a list of aws cli commands used to grab the available datasets today:

Coco related:

  • aws s3 cp s3://fast-ai-coco/annotations_trainval2017.zip annotations_trainval2017.zip --no-sign-request
  • aws s3 cp s3://fast-ai-coco/coco_sample.tgz coco_sample.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-coco/coco_tiny.tgz coco_tiny.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-coco/giga-fren.tgz giga-fren.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-coco/image_info_test2017.zip image_info_test2017.zip --no-sign-request
  • aws s3 cp s3://fast-ai-coco/image_info_unlabeled2017.zip image_info_unlabeled2017.zip --no-sign-request
  • aws s3 cp s3://fast-ai-coco/panoptic_annotations_trainval2017.zip panoptic_annotations_trainval2017.zip --no-sign-request
  • aws s3 cp s3://fast-ai-coco/stuff_annotations_trainval2017.zip stuff_annotations_trainval2017.zip --no-sign-request
  • aws s3 cp s3://fast-ai-coco/test2017.zip test2017.zip --no-sign-request
  • aws s3 cp s3://fast-ai-coco/train2017.zip train2017.zip --no-sign-request
  • aws s3 cp s3://fast-ai-coco/unlabeled2017.zip unlabeled2017.zip --no-sign-request
  • aws s3 cp s3://fast-ai-coco/val2017.zip val2017.zip --no-sign-request

NLP related:

  • aws s3 cp s3://fast-ai-nlp/ag_news_csv.tgz ag_news_csv.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/amazon_review_full_csv.tgz amazon_review_full_csv.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/amazon_review_polarity_csv.tgz amazon_review_polarity_csv.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/dbpedia_csv.tgz dbpedia_csv.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/giga-fren.tgz giga-fren.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/imdb.tgz imdb.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/sogou_news_csv.tgz sogou_news_csv.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/wikitext-103.tgz wikitext-103.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/wikitext-2.tgz wikitext-2.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/yahoo_answers_csv.tgz yahoo_answers_csv.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/yelp_review_full_csv.tgz yelp_review_full_csv.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-nlp/yelp_review_polarity_csv.tgz yelp_review_polarity_csv.tgz --no-sign-request

Image localization related:

  • aws s3 cp s3://fast-ai-imagelocal/biwi_head_pose.tgz biwi_head_pose.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imagelocal/camvid.tgz camvid.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imagelocal/pascal-voc.tgz pascal-voc.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imagelocal/pascal_2007.tgz pascal_2007.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imagelocal/pascal_2012.tgz pascal_2012.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imagelocal/siim_small.tgz siim_small.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imagelocal/skin-lesion.tgz skin-lesion.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imagelocal/tcga_small.tgz tcga_small.tgz --no-sign-request

Image classification related:

  • aws s3 cp s3://fast-ai-imageclas/CUB_200_2011.tgz CUB_200_2011.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/bedroom.tgz bedroom.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/caltech_101.tgz caltech_101.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/cifar10.tgz cifar10.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/cifar100.tgz cifar100.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/food-101.tgz food-101.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagenette-160.tgz imagenette-160.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagenette-320.tgz imagenette-320.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagenette.tgz imagenette.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagenette2-160.tgz imagenette2-160.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagenette2-320.tgz imagenette2-320.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagenette2.tgz imagenette2.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagewang-160.tgz imagewang-160.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagewang-320.tgz imagewang-320.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagewang.tgz imagewang.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagewoof-160.tgz imagewoof-160.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagewoof-320.tgz imagewoof-320.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagewoof.tgz imagewoof.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagewoof2-160.tgz imagewoof2-160.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagewoof2-320.tgz imagewoof2-320.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/imagewoof2.tgz imagewoof2.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/mnist_png.tgz mnist_png.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/mnist_var_size_tiny.tgz mnist_var_size_tiny.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/oxford-102-flowers.tgz oxford-102-flowers.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/oxford-iiit-pet.tgz oxford-iiit-pet.tgz --no-sign-request
  • aws s3 cp s3://fast-ai-imageclas/stanford-cars.tgz stanford-cars.tgz --no-sign-request

Perhaps it would be a good idea to integrate the AWS Python SDK (boto3) into the fastai library to make the downloads faster.

If people think this sounds good, I wouldn’t mind adding that functionality myself :slightly_smiling_face:. Please react with a :heart: if you would like to have this feature.
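
As a starting point for discussion, here is a minimal sketch of what an unsigned boto3 download could look like. It is not fastai code and not a proposed API: archive_target and download_public_object are hypothetical names, the us-east-1 region comes from the first post, and whether this matches the CLI's speed has not been measured here.

```python
from pathlib import Path


def archive_target(key: str, archive_dir: str = "~/.fastai/archive") -> Path:
    """Local path an S3 object would be saved to inside the archive folder."""
    return Path(archive_dir).expanduser() / Path(key).name


def download_public_object(bucket: str, key: str,
                           archive_dir: str = "~/.fastai/archive") -> Path:
    """Download a public S3 object without AWS credentials and return its local path."""
    # Imported lazily so this module still loads where boto3 is not installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    dest = archive_target(key, archive_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # UNSIGNED is the SDK counterpart of the CLI's --no-sign-request flag.
    s3 = boto3.client("s3", region_name="us-east-1",
                      config=Config(signature_version=UNSIGNED))
    s3.download_file(bucket, key, str(dest))
    return dest


# Example call (needs boto3 and network access):
# download_public_object("fast-ai-imageclas", "oxford-iiit-pet.tgz")
```

Once the file is in the archive folder, untar_data should pick it up as described in the first post.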


My download speed from a notebook running locally was just 500kb/s. After I turned on a VPN, it rose to ~1.7mb/s. With this method, it rose to ~12mb/s, with or without the VPN. Great result!

I think a lack of interest in this optimization can be explained by the fact that not a lot of people run notebooks locally. Also, most downloads are probably not as slow as mine.
