Download Kaggle datasets directly to Google Compute Instance (Newbie Help)

Hi all! Just getting started with fast.ai and have enjoyed it tremendously so far.

One of the first challenges when trying to train my own model was getting a training dataset onto my Google Compute instance. I was trying to download a large (5GB+) dataset from Kaggle, but Kaggle does not provide direct URLs to its files (e.g. one I could pass to download_data(url)).

One option is to use scp to transfer local files to remote (and vice versa). Example command for newbies like me:
gcloud compute scp [local_path] jupyter@[instancename]:~/[destination_in_home_folder]

The problem is that my upload speed is slow. It also feels like a waste of bandwidth and disk space to download to my computer just so I can upload to my instance.

Luckily, Kaggle has an official API (which replaced the unofficial kaggle-cli tool) that you can use to download data directly from your instance shell. There are other posts touching on the API, but I didn’t see any about downloading datasets to a cloud instance. It seemed like a common enough situation to be worth posting a quick guide in case it saves people trouble.

Using these steps, I downloaded this 5GB dataset to my instance in <5 minutes, where it would have taken me >2hrs to download then re-upload (almost made up for the time it took me to figure this out). Hope it’s helpful!

Use Kaggle API to Download Data Directly to Cloud Instance

  1. Install Kaggle: SSH into your instance, and pip install kaggle (instructions here, but it should just work)
  2. Get API Credentials: All API requests need credentials to identify you. Go to https://www.kaggle.com/[kaggle_username]/account, scroll down and click “Create API Token”. It will download kaggle.json, which contains your username and API key.
  3. Load Credentials on Instance: Put kaggle.json into /home/jupyter/.kaggle on the instance. You can use scp for this (see above). That’s it. Now you can use all the Kaggle API commands.
  4. Download your dataset: You can just use this command:
    kaggle datasets download -d [dataset_identifier] -p [your_destination_path]
    To get dataset_identifier, you can just browse to the dataset you want on the kaggle website. There’s actually a button that helpfully gives you the API command directly.

    Take a look through the docs as the API has several useful features (example).
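
For anyone who would rather script steps 3 and 4, here is a rough sketch of two shell helpers. The function names (install_kaggle_credentials, download_dataset) are made up for illustration and are not part of the Kaggle API. It assumes kaggle is already installed via pip and that kaggle.json has already been copied to the instance. The chmod is there because the kaggle client warns when the token file is readable by other users.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are illustrative, not part of the kaggle tool.

# Copy an uploaded kaggle.json into the credentials folder (default: ~/.kaggle)
# and make it readable only by the current user.
install_kaggle_credentials() {
    local src="$1"                          # where scp put kaggle.json
    local dest_dir="${2:-$HOME/.kaggle}"    # optional override of the target folder
    mkdir -p "$dest_dir"
    cp "$src" "$dest_dir/kaggle.json"
    chmod 600 "$dest_dir/kaggle.json"       # avoids the "readable by other users" warning
}

# Download a dataset into a destination folder, creating the folder first.
download_dataset() {
    local dataset_id="$1"                   # the identifier shown on the dataset page
    local dest="$2"
    mkdir -p "$dest"
    kaggle datasets download -d "$dataset_id" -p "$dest"
}
```

After pasting these into your instance shell, something like `install_kaggle_credentials ~/kaggle.json` followed by `download_dataset [dataset_identifier] ~/data` should do it.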

Another option: once you’re logged in, go to the data you want, hover over the download button, copy that link, and download it with wget.