How to download data for Lesson 2 from Kaggle for Planet Competition

pnvijay · November 10, 2017, 11:58am

Dear All

Since we will all be using the planet dataset for Lesson 2, I thought it would be best to write down the steps for doing this on AWS. I have done this and was able to run the notebook successfully. Hope this helps.

1. Install Kaggle CLI (if already done, go to Step 2)
   `pip install kaggle-cli`
2. Configure your Kaggle account
   `kg config -u <your username (most likely your email)> -p <your password> -c <competition name>`
   Note:
   a. Go to the Kaggle competition website, log in, and accept the rules of the competition.
   b. If you’ve always signed into Kaggle using a linked social media account, you will get an error using the kaggle cli, which requires that you have a separate Kaggle login. Fortunately, Kaggle has a solution: if you select Forgot Password?, you’ll receive an email with a few different options, the 3rd of which lets you set up your own Kaggle username/password and connects it to your original social media account.
   c. How to find the Kaggle competition name: go to the competition page on the Kaggle website and take the name from the URL. For example, if the page is https://www.kaggle.com/c/planet-understanding-the-amazon-from-space, then the competition name is planet-understanding-the-amazon-from-space.
3. Download the data
   `kg download`
4. Extract data: zip files
   `unzip -q <filename.zip>`
5. Extract data: tar files
   `7za x <filename.tar.7z>` (this extracts the 7z archive and outputs `<filename.tar>`)
   `tar xf <filename.tar>`
6. You only need the following files to run the notebook (as far as I understand for now; @jeremy will probably explain this in the next class):
   a. train-jpg
   b. test-jpg
   c. test-jpg-additional
   d. train_v2.csv
   e. test_v2_file_mapping.csv
   f. sample_submission_v2.csv

I deleted the rest of the files as the device was running out of space, but if you have space you can keep them in a separate folder under data/planet.
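
The steps above can be sketched as a couple of small shell helpers. This is only a sketch, not part of kaggle-cli: the function names `competition_from_url`, `run` and `extract_archive` are made up for illustration, and it assumes `unzip`, `7za` and `tar` are installed and that you run it from the directory holding the archives. Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of kaggle-cli.

# Print the competition name: the last path component of the competition URL.
competition_from_url() {
    url=${1%/}                # drop a trailing slash, if any
    printf '%s\n' "${url##*/}"
}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Extract one downloaded archive based on its extension.
# Assumes the archive sits in the current directory.
extract_archive() {
    case "$1" in
        *.tar.7z)
            run 7za x "$1"            # produces <name>.tar
            run tar xf "${1%.7z}"     # unpacks the tar itself
            ;;
        *.zip)
            run unzip -q "$1"
            ;;
        *)
            echo "unknown archive type: $1" >&2
            return 1
            ;;
    esac
}
```

For example, with `DRY_RUN=1` set, `extract_archive <filename.tar.7z>` prints the `7za x` and `tar xf` commands from step 5 without running them.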

jeremy · November 10, 2017, 4:42pm

Great help! BTW to make your code here on the forums stand out, and not get formatted by markdown, do this:

```
kg download
```

This is how that looks:

kg download

pnvijay · November 11, 2017, 2:56am

Thanks Jeremy! Will do that going forward.

KevinB · November 11, 2017, 3:46am

You can edit your current post as well; there is a pencil icon at the bottom.

pnvijay · November 11, 2017, 7:33am

Thanks Kevin! I just did that and hope that it is ok now.

tweber · November 13, 2017, 4:24am

I keep getting "list index out of range" errors. I’ve tried switching the competition between dog-breed-identification and planet-understanding-the-amazon-from-space. I’m pretty sure I’m using the correct username and password.

I also accepted the competition terms.

jeremy · November 13, 2017, 4:30am

Try `pip install kaggle-cli --upgrade`.

tweber · November 13, 2017, 4:44am

Thank you! A definite improvement from the earlier error. However it now tells me that the file resolves to an html document rather than a file. I’m fairly certain I’ve accepted the competition terms…

Edit: resolved the issue. I was using my kaggle username instead of the email address I used to sign up.

Future users might try
kg config -u <your email you signed up with> -p <your password> -c <competition name>
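
A quick sanity check along these lines could catch the username/email mix-up before running `kg config`. It is only a rough sketch: `looks_like_email` is a made-up helper, `KG_USER` is a placeholder variable, and the pattern only checks for the shape something@something.something rather than validating the address.

```shell
# Hypothetical helper: succeeds when the argument has the rough shape of an email.
looks_like_email() {
    case "$1" in
        *" "*) return 1 ;;        # no spaces allowed
        ?*@?*.?*) return 0 ;;     # text, "@", text, ".", text
        *) return 1 ;;
    esac
}

# Example use with a placeholder value (not a real account).
KG_USER="someone@example.com"
if looks_like_email "$KG_USER"; then
    echo "username looks like an email address"
else
    echo "username is not an email; the download may resolve to an HTML page" >&2
fi
```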

pnvijay · November 13, 2017, 6:13am

Thanks Tom! I have edited my original post to reflect that the username is most likely your email.

Deb · November 15, 2017, 1:45am

Thank you very much. FYI… the planet data requires 100G (60G after cleaning up the tar files).
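
If it helps, free space can be checked before starting the download. The sketch below is an illustration only: `free_gb` and `has_space` are made-up helper names, and the 100 threshold is just the figure quoted above.

```shell
# Hypothetical helper: print free space (whole GB) on the filesystem holding a directory.
free_gb() {
    # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of that line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed when at least $2 GB are free under directory $1.
has_space() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

if has_space . 100; then
    echo "enough room for the full planet download"
else
    echo "less than 100 GB free here: $(free_gb .) GB" >&2
fi
```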

ramesh · November 15, 2017, 1:49am

You probably only need the .jpg.tar.7z files for Jeremy’s notebook. They’re a much more reasonable size: 600MB each for the zipped train / test images.

jeremy · November 15, 2017, 4:34am

Yeah I don’t think anyone in the competition found the tif files useful, so don’t worry about getting them.
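
To skip those files when downloading, something like the following could work. Treat it as a sketch under assumptions: `wanted_file` and `download_wanted` are made-up names, and it assumes `kg download -f <name>` fetches a single named file, which is worth confirming against the tool's help output for your installed version.

```shell
# Hypothetical helper: succeed for files worth downloading, fail for tif and torrent files.
wanted_file() {
    case "$1" in
        *-tif*|*.tif|*.torrent) return 1 ;;
        *) return 0 ;;
    esac
}

# Download only the wanted files from a list of names.
# Assumption: `kg download -f <name>` fetches one file; check the kg help to confirm.
download_wanted() {
    for name in "$@"; do
        if wanted_file "$name"; then
            kg download -f "$name"
        else
            echo "skipping $name"
        fi
    done
}
```

Call it as `download_wanted <file1> <file2> ...`, using the file names shown on the competition's data page.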

KevinB · November 15, 2017, 4:44am

as I’ve been waiting 5 minutes to p7zip the tif files…

So then all we need is:

test-jpg-additional
test-jpg
test_v2_file_mapping.csv
train-jpg
train_v2.csv

Is that all the files I should need for this competition?

Are these .torrent files anything to pay attention to?

jeremy · November 15, 2017, 4:46am

Yup that’s it. You don’t need the torrent files - that’s just an alternative download method.
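
A small check like this can confirm that everything in that list is in place after extraction. It is a sketch only: `missing_planet_files` is a made-up helper, and `data/planet` is the folder mentioned earlier in the thread, so adjust it to your own layout.

```shell
# Hypothetical helper: list which of the needed items are absent from a data directory.
missing_planet_files() {
    for item in test-jpg-additional test-jpg test_v2_file_mapping.csv train-jpg train_v2.csv; do
        [ -e "$1/$item" ] || printf '%s\n' "$item"
    done
}

# Report on the assumed data folder.
missing=$(missing_planet_files data/planet)
if [ -z "$missing" ]; then
    echo "all needed files are present"
else
    echo "missing from data/planet:"
    echo "$missing"
fi
```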

KevinB · November 15, 2017, 4:47am

Is it faster or what would be the advantage of those?

jeremy · November 15, 2017, 4:49am

Probably no advantage at this stage - here’s some info about it if you’re interested: https://www.techsupportalert.com/what-is-bittorrent . Largely it’s to benefit Kaggle, but it’s only helpful when a competition is active and busy.

Deb · November 15, 2017, 6:12am

Ahh… I forgot… thanks…

memetzgz · November 15, 2017, 3:20pm

I thought I would avoid the download issue by using the Crestle pre-loaded files, but then ran into the problem that the test images seem not to have been uploaded there.

So I got the two 7zip files loaded up, but then can’t seem to extract them with the commands they provided on the data page for the competition.

I tried re-installing 7zip but ran into some weird dependency issue – something about the version of lxml being wrong.

Is there any other unzipper that can be used to extract the tar file?

Appreciate any advice

pnvijay · November 15, 2017, 4:58pm

Hi Maureen, Are you trying to unzip in crestle or in AWS?

memetzgz · November 15, 2017, 5:00pm

Hi, @pnvijay, I’m trying to unzip in Crestle