Lesson 3 - Can't Download Planet Data Images Tar Archive

jayess · December 17, 2019, 8:43pm

Has anyone else had any trouble downloading the train-jpg.tar.7z archive that’s shown in the lecture? I was able to download it last week, but now it doesn’t seem to exist. If I run

!kaggle competitions download -c planet-understanding-the-amazon-from-space -f train-jpg.tar.7z -p {path}

I get 404 - Not Found as a response. So, I listed all of the files available in the competition and found out that it’s not available:

!kaggle competitions files -c planet-understanding-the-amazon-from-space

name                                                size  creationDate         
-------------------------------------------------  -----  -------------------  
sample_submission_v2.csv/sample_submission_v2.csv    3MB  2019-12-15 22:14:13  
train_v2.csv/train_v2.csv                            1MB  2019-12-15 22:14:13  
test_v2_file_mapping.csv/test_v2_file_mapping.csv  600KB  2019-12-15 22:14:13  
Kaggle-planet-train-tif.torrent                      1MB  2019-12-15 22:14:13  
Kaggle-planet-test-tif.torrent                       2MB  2019-12-15 22:14:13

Has anyone else had any issues with this?
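A quick way to confirm from inside a notebook that the archive is missing is to parse the listing text. The following is only a minimal sketch: the helper name listed_files is hypothetical, and it assumes the listing has the same layout as the output pasted above (a header row, a dashed row, then one file per line).

```python
def listed_files(listing):
    """Return the file names from text printed by `kaggle competitions files`."""
    names = []
    # Skip the header row and the dashed separator row.
    for line in listing.strip().splitlines()[2:]:
        parts = line.split()
        if parts:
            names.append(parts[0])  # the first column is the file name
    return names


# Listing copied from the post above.
listing = """\
name                                                size  creationDate
-------------------------------------------------  -----  -------------------
sample_submission_v2.csv/sample_submission_v2.csv    3MB  2019-12-15 22:14:13
train_v2.csv/train_v2.csv                            1MB  2019-12-15 22:14:13
test_v2_file_mapping.csv/test_v2_file_mapping.csv  600KB  2019-12-15 22:14:13
Kaggle-planet-train-tif.torrent                      1MB  2019-12-15 22:14:13
Kaggle-planet-test-tif.torrent                       2MB  2019-12-15 22:14:13
"""

wanted = "train-jpg.tar.7z"
if wanted not in listed_files(listing):
    print(f"{wanted} is not in the competition file listing")
```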

Khoury_T · December 18, 2019, 9:23am

I have the same issue when downloading.

ZSW · December 19, 2019, 5:40am

Hi Jay & Khoury, I am facing the same issue as well

Sanwal · December 20, 2019, 7:08am

Has anyone made any progress on this? The files are now in .torrent format and I am trying to unpack them.

methodmatters · December 20, 2019, 10:38am

Hi all,

I ran into this same problem just now with the 404 error. There seems to be a way to still download the two files that we need manually. It’s a much slower process, but in theory we should get the same result in the end.

In order to download the data manually, go to the data page and accept the terms and conditions for the project: https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/data

Click on the files that you want to download. You’ll have to do them one at a time. According to the lesson 3 planet notebook, these files are:

train-jpg.tar
train_v2.csv

After you click on the file you want to download, you should see a window at the bottom of the screen with the file name, and a couple of boxes in the upper right-hand corner of the window. One of the boxes gives you the option to download the data.

Here is a picture that hopefully illustrates what you need to do where:

I was able to download the files manually, and am now uploading them to GCP. It’s going to take a while (there might be a more efficient way to do this - curious if anyone has tips), but I think this should allow you to get the data into the cloud in order to analyze it with the lesson notebook.

methodmatters · December 20, 2019, 2:21pm

Just a follow-up: I was able to work with the data I got using the method described above. I transferred it to GCP, moved it to the correct folder as described in the tutorial ('/home/jupyter/.fastai/data/planet'), and unzipped it using the conda package referenced in the notebook (eidl7zip).

Am going through the notebook and everything seems to be working!
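If the manually downloaded file is a plain train-jpg.tar rather than a .tar.7z (an assumption; the posts in this thread mention both names), it could also be unpacked with Python's standard tarfile module instead of 7zip. A minimal sketch, with a hypothetical helper name extract_tar:

```python
import tarfile
from pathlib import Path


def extract_tar(archive, dest):
    """Extract an uncompressed .tar archive into dest; return the member count."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    with tarfile.open(archive, "r:") as tar:  # "r:" reads uncompressed tar only
        members = tar.getmembers()
        # Refuse entries that would land outside dest (absolute paths or "..").
        for member in members:
            target = (root / member.name).resolve()
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(root)
    return len(members)


# Hypothetical usage in the lesson notebook, where `path` is the planet data folder:
# extract_tar(path / "train-jpg.tar", path)
```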

jayess · December 20, 2019, 9:48pm

What I’ve done is to download the train-jpg.tar archive and upload it to Google Drive. Now I can mount my Drive folder from my notebook and use the data from there. It’s a bit annoying because I had to download and upload the data on a slow connection, but once it’s in Drive it’s fine.
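For anyone trying the Drive route, here is a rough sketch of how it might look. It assumes the notebook runs in Google Colab (the post does not say which service was used), and the Drive path in the usage comment is hypothetical. Copying the archive to local disk before extracting may be faster than reading it from the mounted folder.

```python
import shutil
from pathlib import Path


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive; only works inside Google Colab."""
    from google.colab import drive  # not available outside Colab
    drive.mount(mount_point)


def stage_from_drive(drive_file, local_dir):
    """Copy a file from the mounted Drive folder to local disk; return the new path."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    target = local_dir / Path(drive_file).name
    if not target.exists():  # skip the copy on later runs
        shutil.copy2(drive_file, target)
    return target


# Hypothetical usage (paths are assumptions, adjust to your own Drive layout):
# mount_drive()
# stage_from_drive("/content/drive/My Drive/train-jpg.tar", path)
```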

anurag.bhatia · December 28, 2019, 8:50pm

I use Paperspace Gradient. After many hours, I found a rather messy way of doing things, but it does work: I first downloaded the two files from Kaggle (one CSV file and one folder of images) to my laptop. I then uploaded the zipped version to the data/planet directory.

And finally, unzipped it inside the Jupyter notebook…

unzip -q -n {path}/train-jpg.zip -d {path}

For some reason, train_11795.jpg was giving me trouble. So, I removed that single row from the csv file, as well as that single image from the images folder.

Thereafter, everything worked like magic.
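Removing the row for a problem image can be scripted rather than done by hand. This is a sketch only: drop_image is a hypothetical helper, and it assumes the labels file has an image_name column whose values omit the .jpg extension (for example train_11795), which should be checked against your copy of train_v2.csv.

```python
import csv


def drop_image(csv_in, csv_out, image_name, column="image_name"):
    """Copy csv_in to csv_out without rows whose `column` equals image_name."""
    with open(csv_in, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = [row for row in reader if row[column] != image_name]
    with open(csv_out, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)  # number of rows kept


# Hypothetical usage; writes a new file so the original labels stay untouched:
# drop_image(path / "train_v2.csv", path / "train_v2_clean.csv", "train_11795")
```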

Jonny5 · December 30, 2019, 4:11pm

Hi all!

For those of you that want to download directly on their notebook server:

This assumes you are logged in to Kaggle and have accepted the conditions of the competition.

Open the Chrome browser on your local machine
Install the cookie.txt extension from this link
Go to the Kaggle dataset
Locate the download button of the dataset you want (see example in image below)
Copy the link (right-click on that button)
Export your cookies using the newly added plugin
Go to your remote notebook
Upload the cookies.txt file (I put it in the data folder)
Run the following wget command

! wget --load-cookies data/cookies.txt PASTE_YOUR_LINK
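If wget is not available on the notebook server, roughly the same thing can be done with Python's standard library. This is a sketch, not a tested Kaggle recipe: it assumes the exported file is in the Netscape cookies.txt format (the format wget's --load-cookies reads), and the function names are hypothetical.

```python
import http.cookiejar
import shutil
import urllib.request


def opener_with_cookies(cookie_file):
    """Build a URL opener that sends the cookies stored in a cookies.txt file."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    # Keep session cookies and expired entries; the browser export may contain both.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def download(opener, url, dest):
    """Stream the response body to dest without loading it all into memory."""
    with opener.open(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


# Hypothetical usage; replace PASTE_YOUR_LINK with the copied download link:
# opener, _ = opener_with_cookies("data/cookies.txt")
# download(opener, "PASTE_YOUR_LINK", "data/train-jpg.tar.7z")
```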

Example of copying download link, downloading the cookies and the resulting script with correct naming (you can do it with the above script, but this makes it a bit easier):

Sorry for the badly composed image, but as a new user I can only upload 1 image.

Hope this helps out some of you!

Best regards!

kendrick_lamar · January 1, 2020, 6:38pm

wow that download button is… hidden…

talumbau · January 7, 2020, 2:55pm

This is basically what I did as well. It’s really odd that the big train and test tar files are no longer part of the “package” from the kaggle competition download command. That seems to be the main issue. If you just execute kaggle competitions download -c planet-understanding-the-amazon-from-space you get many of the files in the listing, but not the test/train data. Strange. Anyway, the only thing I did differently was to use the CurlWget Chrome plug-in to actually get the data onto my GCP instance. I’m just posting here to recommend that plug-in to others, since it makes it easy to get the data onto your actual cloud instance (as long as you are comfortable copy/pasting a wget command into a terminal).

xslipstream · January 20, 2020, 7:49pm

this worked! thanks

talha3111997 · January 22, 2020, 10:28am

Bundle of thanks brother,
facing this problem for a long time. Now solved Alhamdulilah

PalaashAgrawal · February 1, 2020, 9:22am

@Jonny5
Thanks for your solution. I was able to download the .tar.7z file after a long struggle.
However, when I try to unpack the file from {path} with the following command:

! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path.as_posix()}

I’m getting the following error:

ERROR: /home/jupyter/.fastai/data/planet/train-jpg.tar.7z
/home/jupyter/.fastai/data/planet/train-jpg.tar.7z
Open ERROR: Can not open the file as [7z] archive
ERRORS:
Is not archive
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors

Can anyone tell me what is wrong?
Wonder what other people did to proceed.

farid · February 3, 2020, 8:35pm

It is most likely a file permission problem. Before unzipping your file, run this

!chmod 600  /home/jupyter/.fastai/data/planet/train-jpg.tar.7z

PalaashAgrawal · February 12, 2020, 9:37am

@farid
No, the problem still exists. Same error. Please tell me anything else that might be a problem.
Thanks

PalaashAgrawal · February 12, 2020, 9:49am

@farid @Jonny5

I think I see the problem. The method of downloading the .7z file as suggested by @Jonny5 has apparently not worked for me.
When I ran the command

!wget --load-cookies data/planet/cookies.txt {url} -O {path}/train-jpg.tar.7z

I get the following message

--2020-02-12 09:10:31-- Kaggle: Your Home for Data Science
Resolving www.kaggle.com (www.kaggle.com)... 35.244.233.98
Connecting to www.kaggle.com (www.kaggle.com)|35.244.233.98|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: Kaggle: Your Home for Data Science [following]
--2020-02-12 09:10:31-- Kaggle: Your Home for Data Science
Reusing existing connection to www.kaggle.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/home/jupyter/.fastai/data/planet/train-jpg.tar.7z’

/home/jupyter/.fast [ <=> ] 8.76K --.-KB/s in 0.009s

2020-02-12 09:10:32 (961 KB/s) - ‘/home/jupyter/.fastai/data/planet/train-jpg.tar.7z’ saved [8973]

Now when I checked the size of my path directory, it’s only 1.6 MB, which is just the size of the .csv file. So apparently, the .7z file does not contain any data.
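The log above reports Length: unspecified [text/html] and a saved size of 8973 bytes, which suggests that an HTML page was written under the archive's name rather than the archive itself. One way to check is to look at the first bytes of the file: a real 7z archive starts with the signature 37 7A BC AF 27 1C. A minimal sketch, with a hypothetical helper name:

```python
SEVEN_ZIP_MAGIC = b"7z\xbc\xaf\x27\x1c"  # the 6-byte signature at the start of every 7z file


def describe_download(path):
    """Classify a downloaded file as '7z', 'html', or 'unknown' from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(512)
    if head.startswith(SEVEN_ZIP_MAGIC):
        return "7z"
    lowered = head.lower()
    if b"<html" in lowered or b"<!doctype html" in lowered:
        return "html"  # probably a login or error page saved under the archive name
    return "unknown"


# Hypothetical usage in the notebook:
# describe_download(path / "train-jpg.tar.7z")
```

If this reports html, the cookies were probably not accepted or the link was not a direct download link, and the download needs to be repeated.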

Any suggestions?

JonathanSum · February 12, 2020, 1:37pm

Did you search for a solution on the forum?
https://forums.fast.ai/search?q=planet%20category%3A20

jpenna · February 22, 2020, 10:03pm

Following the suggestion of using wget, I used this to download to the expected folder and without any Chrome plugin:

Go to the contest page
Open Chrome Developer Tools (go to the menu > More tools > Developer Tools) and go to the Network tab
On the Kaggle contest page click the “Download All” button in the Download section
Cancel the download, click the “download-all” row in the Developer Tools and look for “cookie” under “Request headers”. Copy all the content of the “cookie” header and replace “PASTE_THE_COOKIE_HERE” in the command below
Get the download link of the file by right clicking the download button for the “train-jpg.tar” file and replace “PASTE_LINK_HERE” in the command below
Paste this whole command in your jupyter notebook and it will download the set to the expected folder

wget -O {path}/train-jpg.tar.7z \
--header="Cookie: PASTE_THE_COOKIE_HERE" \
PASTE_LINK_HERE
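The same request can be sketched in Python by sending the copied cookie string as a Cookie header. The function names here are hypothetical, the placeholders are the same ones as in the wget command above, and this has not been verified against Kaggle itself.

```python
import shutil
import urllib.request


def build_request(url, cookie):
    """Create a request that carries the browser's cookie string as a header."""
    return urllib.request.Request(url, headers={"Cookie": cookie})


def fetch(url, cookie, dest):
    """Download url to dest, streaming the body to disk."""
    with urllib.request.urlopen(build_request(url, cookie)) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


# Hypothetical usage, with the same placeholders as the wget command:
# fetch("PASTE_LINK_HERE", "PASTE_THE_COOKIE_HERE", "train-jpg.tar.7z")
```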

aditya.swami · March 3, 2020, 10:46am

Hey @methodmatters! I’m a bit of a newbie here. How did you upload the file to GCP? I’m trying to figure out how to access the folder '/home/jupyter/.fastai/data/planet'. Thanks!