How to download data for Lesson 2 from Kaggle for Planet Competition


(Vijay Narayanan Parakimeethal) #1

Dear All

Since we will all be using the planet dataset for Lesson 2, I thought it would be best to put down the steps to do this on AWS. I have done this and been able to run the notebook successfully. Hope this helps.

  1. Install Kaggle CLI (if done, Go to Step 2)
    pip install kaggle-cli

  2. Configure your kaggle account
    kg config -u <your username (your email most likely)> -p <your password> -c <competition name>
    Note:
    a. Go to the Kaggle competition website, log in, and accept the rules of the competition.
    b. If you’ve always signed into Kaggle using a linked social media account, you will get an error using the kaggle cli, which requires that you have a separate Kaggle login. Fortunately, Kaggle has a solution: if you select Forgot Password?, you’ll receive an email with a few different options, the 3rd of which lets you set up your own Kaggle username/password and connects it to your original social media account.
    c. How to find the Kaggle competition name: go to the competition page on the Kaggle website and take the last part of its URL. For example, if the page is https://www.kaggle.com/c/planet-understanding-the-amazon-from-space, then the competition name is planet-understanding-the-amazon-from-space.

  3. Download the data
    kg download

  4. Extract data: zip files
    unzip -q <filename.zip>

  5. Extract data: tar files
    7za x <filename.tar.7z>
    This extracts the 7z archive and produces <filename.tar>, which you then unpack:
    tar xf <filename.tar>

  6. You only need the following files for running the notebook (as per my understanding for now. @jeremy will probably explain this in the next class)
    a. train-jpg
    b. test-jpg
    c. test-jpg-additional
    d. train_v2.csv
    e. test_v2_file_mapping.csv
    f. sample_submission_v2.csv

  7. I deleted the rest of the files as the device was running out of space, but if you have space you can keep them in a separate folder under data/planet.
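
The steps above can be sketched as a few shell helpers. This is only a rough sketch: the function names (competition_name, extract_archive, missing_planet_files) are made up for illustration and are not part of kaggle-cli, and extract_archive assumes each .tar.7z unpacks to a .tar with the same base name in the current directory.

```shell
# Step 2c: the competition name is the last path segment of the competition URL.
competition_name() {
    local url="${1%/}"          # drop a trailing slash, if any
    printf '%s\n' "${url##*/}"  # keep everything after the last slash
}

# Steps 4 and 5: pick the extraction command from the file extension.
extract_archive() {
    case "$1" in
        *.zip)    unzip -q "$1" ;;
        # Assumption: 7za writes <name>.tar into the current directory.
        *.tar.7z) 7za x "$1" && tar xf "$(basename "${1%.7z}")" ;;
        *.tar)    tar xf "$1" ;;
        *)        echo "don't know how to extract $1" >&2; return 1 ;;
    esac
}

# Step 6: print every needed file or folder that is missing from directory $1.
missing_planet_files() {
    local dir="$1" item
    for item in train-jpg test-jpg test-jpg-additional \
                train_v2.csv test_v2_file_mapping.csv sample_submission_v2.csv; do
        [ -e "$dir/$item" ] || printf '%s\n' "$item"
    done
}
```

For example, after extracting everything you could run missing_planet_files data/planet; if it prints nothing, all six items from step 6 are in place.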


(Jeremy Howard (Admin)) #2

Great help! BTW to make your code here on the forums stand out, and not get formatted by markdown, do this:

```
kg download
```

This is how that looks:

kg download

(Vijay Narayanan Parakimeethal) #3

Thanks Jeremy! Will do that going forward.


(Kevin Bird) #4

You can edit your current post as well; there is a pencil icon at the bottom.



(Vijay Narayanan Parakimeethal) #5

Thanks Kevin! I just did that and hope that it is ok now.


(Tom Weber) #6

I keep getting list index out of range errors. I’ve tried switching the competition between dog-breed-identification and planet-understanding-the-amazon-from-space. I’m pretty sure I’m using the correct username and password.

I also accepted the competition terms.


(Jeremy Howard (Admin)) #7

Try pip install kaggle-cli --upgrade.


(Tom Weber) #8

Thank you! A definite improvement from the earlier error. However it now tells me that the file resolves to an html document rather than a file. I’m fairly certain I’ve accepted the competition terms…

Edit: resolved the issue. I was using my kaggle username instead of the email address I used to sign up.

Future users might try:
kg config -u <your email you signed up with> -p <your password> -c <competition name>


(Vijay Narayanan Parakimeethal) #9

Thanks Tom! I have edited my original post to reflect that the username is most likely your email.


(Debashish Panigrahi) #10

Thank you very much. FYI, the planet data requires 100G of disk space (60G after cleaning up the tar files).
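
Given those sizes, it may be worth checking free space before downloading. A minimal sketch, assuming a POSIX df and awk are available; has_free_gb is a hypothetical helper name, not a standard command:

```shell
# Succeeds if the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    # df -P prints one line per filesystem; column 4 is available space in KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

Usage would be something like has_free_gb . 100 || echo "less than 100G free here" before running kg download.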


(Ramesh Sampath) #11

You probably only need the .jpg.tar.7z files for Jeremy’s notebook. It’s much more reasonable in size: 600MB each for the zipped train / test images.


(Jeremy Howard (Admin)) #12

Yeah I don’t think anyone in the competition found the tif files useful, so don’t worry about getting them.


(Kevin Bird) #13

as I’ve been waiting 5 minutes to p7zip the tif files…

So then all we need is:

test-jpg-additional
test-jpg
test_v2_file_mapping.csv
train-jpg
train_v2.csv

Is that all the files I should need for this competition?

Are these .torrent files anything to pay attention to?


(Jeremy Howard (Admin)) #14

Yup, that’s it. You don’t need the torrent files - that’s just an alternative download method.


(Kevin Bird) #15

Is it faster, or what would be the advantage of those?


(Jeremy Howard (Admin)) #16

Probably no advantage at this stage - here’s some info about it if you’re interested: https://www.techsupportalert.com/what-is-bittorrent . Largely it’s to benefit Kaggle, but it’s only helpful when a competition is active and busy.


(Debashish Panigrahi) #17

Ahh… I forgot… thanks…


(Maureen Metzger) #18

I thought I would avoid the download issue by using the Crestle pre-loaded files, but then ran into the problem that the test images seem not to have been uploaded there.

So I got the two 7zip files loaded up, but then can’t seem to extract them with the commands they provided on the data page for the competition.

I tried re-installing 7zip but ran into some weird dependency issue – something about the version of lxml being wrong.

Is there any other unzipper that can be used to extract the tar file?

Appreciate any advice :slight_smile:


(Vijay Narayanan Parakimeethal) #19

Hi Maureen, are you trying to unzip in Crestle or in AWS?


(Maureen Metzger) #20

Hi, @pnvijay, I’m trying to unzip in Crestle.