Kaggle data on Google Colab example

Earlier examples were not working for me. Below is my approach to downloading Kaggle competition data in Colab.
As an example I’ll use the Kannada-MNIST dataset.
It essentially follows this approach:
https://stackoverflow.com/questions/49310470/using-kaggle-datasets-in-google-colab

Steps 1 and 2 are required only once.

  1. Request a new API token on Kaggle (profile > account > API) and download JSON file to your computer.

  2. Run the following code

Upload the JSON file

from google.colab import files
files.upload()

Install Kaggle

!pip install -q kaggle

The Kaggle API client expects this file to be in ~/.kaggle, so move it there.

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

This permissions change avoids a warning on Kaggle tool startup.

!chmod 600 ~/.kaggle/kaggle.json
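
The later steps write to Google Drive, so the notebook also needs Drive mounted and a base_dir defined before creating the folder. A minimal setup sketch, assuming the same fastai-v3 folder that the download command below uses (adjust base_dir to your own Drive layout):

from pathlib import Path
from google.colab import drive

# Mount Google Drive so the competition data persists between sessions.
drive.mount('/content/gdrive')

# Assumed base folder on Drive; change it to wherever you keep your projects.
base_dir = '/content/gdrive/My Drive/fastai-v3/'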

  3. Create a folder on Google Drive for your competition data:

    path = Path(base_dir + 'data/kannada')
    path.mkdir(parents=True, exist_ok=True)
    path
    

Download that sh*t

!kaggle competitions download -c Kannada-MNIST -p /content/gdrive/My\ Drive/fastai-v3/data/kannada
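
A quick check that the zip files actually landed in the Drive folder (using the same '{path}' interpolation as the unzip commands below):

!ls -lh '{path}'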

Unzip .zip files

!unzip -q -n '{path}/Dig-MNIST.csv.zip' -d '{path}'
!unzip -q -n '{path}/train.csv.zip' -d '{path}'
!unzip -q -n '{path}/test.csv.zip' -d '{path}'
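
As a final sanity check, the unzipped CSVs can be read with pandas. A minimal sketch, assuming the standard Kannada-MNIST layout (a label column plus 784 pixel columns):

import pandas as pd

# Load one of the unzipped files to confirm the data is readable.
train_df = pd.read_csv(path/'train.csv')
print(train_df.shape)
train_df.head()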

Maybe not the smartest workaround, but it worked for me :slight_smile:


How to use Kaggle datasets in Google Colab?

A quick guide to use Kaggle datasets inside Google Colab

https://medium.com/unpackai/how-to-use-kaggle-datasets-in-google-colab-f9b2e4b5767c

Hello. I’m new to Kaggle and deep learning. I built a preprocessing dataset and a dataloader for the RSNA embolism competition (the competition is closed, but it is for my own learning). When I start fit_one_cycle on the dataset, it shows 45 hours remaining for one epoch (GPU is on, batch size 64).
Is that normal because of the huge RSNA dataset (1 million images), or is it because my preprocessing pipeline is too complicated? I don’t really know what the ‘normal’ amount of time is for training a model on this kind of huge dataset.