How to download data for Lesson 2 from Kaggle for Planet Competition

Dear All

Since we will all be using the Planet dataset for Lesson 2, I thought it would be best to write down the steps to do this on AWS. I have done this and was able to run the notebook successfully. Hope this helps.

  1. Install the Kaggle CLI (if already done, go to Step 2)
    pip install kaggle-cli

  2. Configure your kaggle account
    kg config -u <your username (most likely your email)> -p <your password> -c <competition name>
    Note:
    a. Go to the Kaggle competition website, log in, and accept the rules of the competition.
    b. If you’ve always signed into Kaggle using a linked social media account, you will get an error from the Kaggle CLI, which requires a separate Kaggle login. Fortunately, Kaggle has a solution: if you select Forgot Password?, you’ll receive an email with a few different options, the third of which lets you set up your own Kaggle username/password and connects it to your original social media account.
    c. How to find the Kaggle competition name: go to the competition page on the Kaggle website and take the name from the URL. For example, if the page is https://www.kaggle.com/c/planet-understanding-the-amazon-from-space, then the competition name is planet-understanding-the-amazon-from-space.

  3. Download the data
    kg download

  4. Extract data: zip files
    unzip -q <filename.zip>

  5. Extract data: tar.7z files
    7za x <filename.tar.7z>
    tar xf <filename.tar>
    Note: the first command extracts the 7z archive and produces <filename.tar>, which the second command then unpacks.

  6. You only need the following files to run the notebook (as far as I understand for now; @jeremy will probably explain this in the next class)
    a. train-jpg
    b. test-jpg
    c. test-jpg-additional
    d. train_v2.csv
    e. test_v2_file_mapping.csv
    f. sample_submission_v2.csv

  7. I deleted the rest of the files because the device was running out of space, but if you have space you can keep them in a separate folder under data/planet.
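
For anyone who prefers a script, steps 2 to 5 above could be sketched roughly as follows. This is only a sketch: the function names are made up for illustration, and only the kg, unzip, 7za and tar commands come from the steps themselves. It extracts every archive that kg download leaves in the current directory, so delete the ones you don't need first if space is tight.

```shell
# Note 2c: the competition name is the last path segment of the page URL.
# (competition_name_from_url is an illustrative name, not part of kaggle-cli.)
competition_name_from_url() {
    url=${1%/}                    # drop a trailing slash, if any
    printf '%s\n' "${url##*/}"    # keep everything after the last slash
}

# Steps 4 and 5: choose the extraction command from the file extension.
# Assumes the archive sits in the current directory, so that 7za writes
# <filename.tar> next to it.
extract_archive() {
    case "$1" in
        *.tar.7z) 7za x "$1" && tar xf "${1%.7z}" ;;
        *.zip)    unzip -q "$1" ;;
        *)        echo "skipping $1: not a .zip or .tar.7z file" >&2 ;;
    esac
}

# Steps 2 to 5 in order. Nothing runs until this function is called.
# $1 = Kaggle email, $2 = password, $3 = competition page URL
download_competition() {
    kg config -u "$1" -p "$2" -c "$(competition_name_from_url "$3")" || return 1
    kg download || return 1
    for archive in *.zip *.tar.7z; do
        if [ -e "$archive" ]; then
            extract_archive "$archive"
        fi
    done
}
```

Example call (with your own credentials): download_competition you@example.com 'your-password' https://www.kaggle.com/c/planet-understanding-the-amazon-from-space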

Great help! BTW to make your code here on the forums stand out, and not get formatted by markdown, do this:

```
kg download
```

This is how that looks:

kg download

Thanks Jeremy! Will do that going forward.

You can edit your current post as well; there is a pencil icon at the bottom.

Thanks Kevin! I just did that and hope that it is ok now.

I keep getting "list index out of range" errors. I’ve tried switching the competition between dog-breed-identification and planet-understanding-the-amazon-from-space. I’m pretty sure I’m using the correct username and password.

I also accepted the competition terms.

Try pip install kaggle-cli --upgrade.

Thank you! A definite improvement from the earlier error. However it now tells me that the file resolves to an html document rather than a file. I’m fairly certain I’ve accepted the competition terms…

Edit: resolved the issue. I was using my kaggle username instead of the email address I used to sign up.

Future users might try:
kg config -u <your email you signed up with> -p <your password> -c <competition name>
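
If you end up with a suspiciously small download, a quick check like the one below can tell you whether it is really a web page (for example a login page) saved under an archive name. This is a rough sketch: looks_like_html is a made-up helper name, and it assumes a saved page would start with a doctype or html tag within its first 512 bytes.

```shell
# Hypothetical helper: succeeds if the file starts like an HTML page
# rather than an archive. Only the first 512 bytes are inspected.
looks_like_html() {
    head -c 512 "$1" | LC_ALL=C grep -qiE '<!doctype html|<html'
}
```

Usage: looks_like_html <filename.zip> && echo "this is a web page, check your kg config login"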

Thanks Tom! I have edited my original post to reflect that the username is most likely your email.

Thank you very much. FYI… the planet data requires 100G (60G after cleaning up the tar files).

You probably only need the .jpg.tar.7z files for Jeremy’s notebook. They are much more reasonable in size: about 600MB each for the zipped train / test images.
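
Given those sizes, it may be worth checking free disk space before downloading or extracting anything. A minimal sketch, assuming a POSIX-style df whose second output line holds the numbers for the target filesystem (has_free_gb is just an illustrative name):

```shell
# Hypothetical helper: succeeds if the filesystem holding directory $1
# has at least $2 GB available. Reads the "Available" column (in KB)
# from the portable df output format.
has_free_gb() {
    df -Pk "$1" | awk -v need="$2" '
        NR == 2 { ok = ($4 >= need * 1024 * 1024) }
        END     { exit !ok }'
}
```

Usage: has_free_gb . 100 || echo "less than 100 GB free here"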

Yeah I don’t think anyone in the competition found the tif files useful, so don’t worry about getting them.

as I’ve been waiting 5 minutes to p7zip the tif files…

So then all we need is:

test-jpg-additional
test-jpg
test_v2_file_mapping.csv
train-jpg
train_v2.csv

Is that all the files I should need for this competition?

Are these .torrent files anything to pay attention to?

Yup, that’s it. You don’t need the torrent files - that’s just an alternative download method.
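
To double-check that the five items listed above are in place after extracting, something like this could be used. It is a sketch only: missing_planet_files is a made-up name, and the data/planet default is an assumption about where you keep the data.

```shell
# Hypothetical helper: prints each expected item that is missing from
# directory $1 (default: data/planet) and fails if any is missing.
missing_planet_files() {
    dir=${1:-data/planet}
    rc=0
    for name in train-jpg test-jpg test-jpg-additional \
                train_v2.csv test_v2_file_mapping.csv; do
        if [ ! -e "$dir/$name" ]; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: missing_planet_files data/planet && echo "all files present"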

Is it faster? What would be the advantage of those?

Probably no advantage at this stage - here’s some info about it if you’re interested: https://www.techsupportalert.com/what-is-bittorrent . Largely it’s to benefit Kaggle, but it’s only helpful when a competition is active and busy.

Ahh… I forgot… thanks…

I thought I would avoid the download issue by using the Crestle pre-loaded files, but then ran into the problem that the test images seem not to have been uploaded there.

So I got the two 7zip files loaded up, but then can’t seem to extract them with the commands they provided on the data page for the competition.

I tried re-installing 7zip but ran into some weird dependency issue – something about the version of lxml being wrong.

Is there any other unzipper that can be used to extract the tar file?

Appreciate any advice :)

Hi Maureen, Are you trying to unzip in crestle or in AWS?

Hi @pnvijay, I’m trying to unzip in Crestle.