Downloading specific folders with Kaggle API

Martin2 · September 11, 2020, 9:35am

Hi there,

Is there a way to download a specific folder from a Kaggle competition directly in Colab using the Kaggle API?

For example:
The SIIM melanoma classification competition has the train and test files in 3 different file formats, for a total of 108 GB of data, while I only want to use either the JPEG or the DICOM files (not both).

Thanks in advance!

ali_baba · September 11, 2020, 11:49am

Hello Martin!
You can most certainly use the Kaggle API to pull individual files from a Kaggle competition. I’m not as familiar with Colab, but the same command should work there.

Selecting a random competition as an example: https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/data

The following command would download all of the files in the competition:
! kaggle competitions download -c the-nature-conservancy-fisheries-monitoring -p {path}
You only need to specify the competition itself and the path to deposit the files in. (If you don’t specify the path, the API will download the files into your current working directory, wherever your notebook is currently running.)

To download a specific file I normally use:
! kaggle competitions download -f sample_submission_stg1.csv.zip -p {path} -c the-nature-conservancy-fisheries-monitoring

The difference is that you need to specify the actual file you want, along with the path and the competition itself.
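
For anyone running this from a plain Python script instead of a notebook cell, the two commands above could be assembled as argument lists along these lines. This is only a sketch: `build_download_command` and `run_download` are hypothetical helper names (not part of the Kaggle package), and they use nothing beyond the `-c`, `-f` and `-p` flags already shown.

```python
import subprocess


def build_download_command(competition, file_name=None, path=None):
    """Return the kaggle CLI argument list for a competition download.

    Hypothetical helper: it only assembles the flags used in this thread.
    """
    cmd = ["kaggle", "competitions", "download", "-c", competition]
    if file_name is not None:
        # -f restricts the download to a single named file
        cmd += ["-f", file_name]
    if path is not None:
        # -p sets the destination folder (default: current working directory)
        cmd += ["-p", path]
    return cmd


def run_download(competition, file_name=None, path=None):
    """Run the command. Needs the kaggle CLI installed and API credentials set up."""
    cmd = build_download_command(competition, file_name, path)
    return subprocess.run(cmd, check=True)
```

Leaving out `file_name` gives the "download everything" form; passing it gives the single-file form.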

And here’s a useful reference: https://github.com/Kaggle/kaggle-api

Martin2 · September 11, 2020, 12:02pm

I had pretty much gotten that far: downloading single, individual files works. But the folders that I want to download contain 11k images.

What I tried:

I downloaded the train.csv and test.csv files and opened them. This gives a dataframe with the first column named ‘image_name’.

Then I built the file paths with:
train_images = 'jpeg/train/' + train_data['image_name'].sort_values()
and sent them through the API like this:
for filename in train_images:
    !kaggle competitions download -f filename -p /content/train/ -c siim-isic-melanoma-classification

But unfortunately this gives me a ‘404 - Not Found’ for every image that it tries to download.

Any suggestions?

ali_baba · September 11, 2020, 4:43pm

I just tried to extract only a folder from a Kaggle competition with the kaggle CLI and with wget, and had no luck.

You can use something like:
!kaggle competitions download -f train/xxx -p {path} siim-isic-melanoma-classification

where “xxx” is replaced by the exact file name in the training folder. Unfortunately this API call will only extract one single file at a time. I tried running it with multiple paths in the same command but it does not work. So this is super cumbersome if you want to pull out more than a few files at a time.

You could create a function that keeps calling that download api call and replaces the ‘xxx’ with the filenames included in a list that you’re iterating over. But even that will require you to manually extract the file names you want and put them into a list.
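
A rough sketch of that idea, with the list built from the CSV instead of by hand. Everything here is an assumption for illustration: the helper names are made up, the `image_name` column comes from the train.csv mentioned earlier in the thread, and the `jpeg/train` folder and `.jpg` extension are guesses about how the competition lays out its files. The actual download is passed in as a function, so the loop itself doesn't depend on how you call the API.

```python
import csv
import io


def remote_paths_from_csv(csv_text, column="image_name",
                          folder="jpeg/train", ext=".jpg"):
    """Turn one CSV column into sorted remote file paths.

    folder and ext are assumptions about the competition's file layout.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    names = sorted(row[column] for row in reader)
    return [f"{folder}/{name}{ext}" for name in names]


def download_all(remote_paths, download_one):
    """Call download_one(path) for every path and collect the failures.

    download_one is whatever performs a single download; it should raise
    on failure. Returns a list of (path, error message) tuples.
    """
    failed = []
    for remote_path in remote_paths:
        try:
            download_one(remote_path)
        except Exception as exc:
            # Keep going so one bad file doesn't stop the whole run
            failed.append((remote_path, str(exc)))
    return failed
```

Here `download_one` could be a small wrapper that runs the single-file `kaggle competitions download -f ...` command for the given path. It is still one request per file, so it doesn't get around the one-file-at-a-time limitation.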

There may be a better way to accomplish this, but I can’t seem to figure it out. Would be interested to hear if anyone knows of a more feasible option because this is a genuinely useful question. For example there is a new kaggle competition by RSNA which has a dataset that is almost 1TB(!!!)

Martin2 · September 19, 2020, 8:52am

I got the Kaggle API to download a list of specific files like this:

train_images = 'jpeg/train/' + train_data['image_name'].sort_values() + '.jpg'

for filename in train_images:
    !kaggle competitions download -f $filename -p /content/train/ -c siim-isic-melanoma-classification

Unfortunately it stops downloading after a few files and says that there are too many requests. There is a download button on the Kaggle website to download a specific subfolder, so I assume that there must be another way to do it.
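
The "too many requests" message looks like rate limiting. One generic way to cope with that is to pause and retry with growing delays, sketched below. Whether this is enough for Kaggle is an assumption; the actual limits aren't known here, and the delays are placeholder values. `download_with_backoff` is a hypothetical helper, and `download_one` stands for whatever performs a single download and raises on failure.

```python
import time


def download_with_backoff(download_one, remote_path, max_attempts=5,
                          base_delay=1.0, sleep=time.sleep):
    """Retry download_one(remote_path), doubling the wait after each failure.

    Returns the number of attempts used. Re-raises the last error if every
    attempt fails. The delay values are placeholders, not known Kaggle limits.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            download_one(remote_path)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(delay)   # wait before the next try
            delay *= 2     # exponential backoff: 1s, 2s, 4s, ...
```

With 11k images this will still be slow, since every file remains a separate request.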