Lesson 3 - Can't Download Planet Data Images Tar Archive

I have the same issue while downloading the data.

Hi Jay & Khoury, I am facing the same issue as well.

Has anyone made any progress on this? The files are in .torrent format now and I am trying to unpack them.

Hi all,

I ran into this same problem just now with the 404 error. There seems to be a way to still download the two files that we need manually. It’s a much slower process, but in theory we should get the same result in the end.

In order to download the data manually, go to the data page and accept the terms and conditions for the project: https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/data

Click on the files that you want to download. You’ll have to do them one at a time. According to the lesson 3 planet notebook, these files are:

  • train-jpg.tar
  • train_v2.csv

After you click on the file you want to download, you should see a window at the bottom of the screen with the file name, and a couple of boxes in the upper right-hand corner of the window. One of the boxes gives you the option to download the data.

Here is a picture that hopefully illustrates where to click:

I was able to download the files manually, and am now uploading them to GCP. It’s going to take a while (there might be a more efficient way to do this - curious if anyone has tips), but I think this should allow you to get the data into the cloud in order to analyze it with the lesson notebook.
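(Side note on the upload: if you have the gcloud CLI installed locally, gcloud compute scp might be faster than uploading through the browser - the instance name and zone below are placeholders, replace them with your own:)

gcloud compute scp ./train-jpg.tar my-fastai-instance:~/ --zone=us-west1-b
gcloud compute scp ./train_v2.csv my-fastai-instance:~/ --zone=us-west1-b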


Just a follow-up - I was able to work with the data I got using the method described above. Transferred it to GCP, moved it to the correct folder as described in the tutorial ('/home/jupyter/.fastai/data/planet'), and unzipped it using the conda package referenced in the notebook (eidl7zip).
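For reference, the unpacking cells look roughly like this (a sketch - the conda line is the one the lesson notebook references, sys and path are as defined in the notebook, and you should adjust the archive name to match what Kaggle actually gave you):

import sys

# install the 7zip package referenced in the lesson notebook
! conda install --yes --prefix {sys.prefix} -c haasad eidl7zip

# unpack the archive into the planet data folder
! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path.as_posix()}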

Am going through the notebook and everything seems to be working!


What I’ve done is to download the train-jpg.tar archive and upload it to Google Drive. Now I can mount my Drive folder from my notebook and use the data from there. It’s a bit annoying because I had to download and upload the data on a slow connection, but once it’s in Drive it’s fine.
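If you're on Colab, the mounting step looks like this (the planet folder name is just whatever you used in your own Drive):

from google.colab import drive
from pathlib import Path

# mount Drive under /content/drive (prompts for authorization)
drive.mount('/content/drive')

# point the lesson's path at the Drive copy - hypothetical folder name
path = Path('/content/drive/My Drive/planet')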


I use Paperspace Gradient. After many hours, I found a rather messy way of doing things, but it definitely works: I first downloaded the two files from Kaggle - one CSV file and one images archive - to my laptop. I then uploaded the zipped version to the data/planet directory.

And finally, I unzipped it inside the Jupyter notebook:

! unzip -q -n {path}/train-jpg.zip -d {path}

For some reason, train_11795.jpg was giving me trouble. So I removed that row from the CSV file, as well as that single image from the images folder.
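In pandas, that cleanup looks roughly like this (a sketch - it assumes the notebook's path variable and the usual image_name column in train_v2.csv):

import os
import pandas as pd

# drop the offending row from the labels file
df = pd.read_csv(path/'train_v2.csv')
df = df[df['image_name'] != 'train_11795']   # names in the csv have no .jpg extension
df.to_csv(path/'train_v2.csv', index=False)

# and delete the image itself
os.remove(path/'train-jpg'/'train_11795.jpg')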

Thereafter, everything worked like magic.

Hi all!

For those of you who want to download directly onto your notebook server:

This assumes you are logged in to Kaggle and have accepted the conditions of the competition.

  1. Open the Chrome browser on your local machine
  2. Install the cookies.txt extension from this link
  3. Go to the Kaggle dataset
  4. Locate the download button of the dataset file you want (see example in the image below)
  5. Copy the link (right-click on that button)
  6. Export your cookies using the newly added plugin
  7. Go to your remote notebook
  8. Upload the cookies.txt file (I put it in the data folder)
  9. Run the following wget command
! wget --load-cookies data/cookies.txt PASTE_YOUR_LINK
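If you want the download to land in the lesson's data folder under the expected name, you can add an output flag (path as defined in the notebook):

! wget --load-cookies data/cookies.txt "PASTE_YOUR_LINK" -O {path}/train-jpg.tar.7z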

Example of copying the download link, exporting the cookies, and the resulting command with correct naming (you can do it with the above script, but this makes it a bit easier):

Sorry for the badly composed image, but as a new user I can only upload 1 image.

Hope this helps out some of you!

Best regards!


wow that download button is… hidden…


This is just what I did as well (basically). It's really odd that the big train and test tar files are no longer part of the “package” from the Kaggle competition download command - that seems to be the main issue. If you just execute kaggle competitions download -c planet-understanding-the-amazon-from-space, you get many of the files in the listing but not the test/train data. Strange.

Anyway, the only thing I did differently was use the CurlWget Chrome plugin to actually get the data onto my GCP instance. Just posting here to recommend that plugin to others, since it makes it easy to get the data onto your actual cloud instance (as long as you are comfortable copy/pasting a wget command into a terminal).
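For reference, the command CurlWget hands you is roughly of this form (every value here is a placeholder, not something real); you just paste it into a terminal on your instance:

wget --header="Cookie: PASTE_YOUR_COOKIES" \
     --header="User-Agent: PASTE_YOUR_USER_AGENT" \
     "PASTE_THE_DOWNLOAD_URL" -c -O train-jpg.tar.7z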

this worked! thanks

Bundle of thanks, brother. I'd been facing this problem for a long time. Now it's solved, Alhamdulillah.

@Jonny5
Thanks for your solution. I was able to download the .tar.7z file after a long struggle.
However, now when I try to unpack the file in {path} with the following command:

! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path.as_posix()}

I’m getting the following error:

ERROR: /home/jupyter/.fastai/data/planet/train-jpg.tar.7z
/home/jupyter/.fastai/data/planet/train-jpg.tar.7z
Open ERROR: Can not open the file as [7z] archive
ERRORS:
Is not archive
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors

Can anyone tell me what is wrong? I wonder what other people did to proceed.

It is most likely a file permission problem. Before unzipping your file, run this:

!chmod 600 /home/jupyter/.fastai/data/planet/train-jpg.tar.7z
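If the permissions look fine, it's also worth checking what you actually downloaded - a failed cookie handshake silently saves Kaggle's login page instead of the archive:

! file /home/jupyter/.fastai/data/planet/train-jpg.tar.7z
# "7-zip archive data" means the download is good;
# "HTML document" means you got the login page instead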

@farid
No, the problem still exists - same error. Please tell me anything else that might be the problem.
Thanks 🙂

@farid @Jonny5

I think I see the problem. The method of downloading the .7z file as suggested by @Jonny5 has apparently not worked for me.
When I ran the command

! wget --load-cookies data/planet/cookies.txt {url} -O {path}/train-jpg.tar.7z

I get the following message:

--2020-02-12 09:10:31--  https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/download-directory/fBesYSh7qE3PuxXtB1SS%2Fversions%2FDMmq3a6XjGpH6e8EUe3c%2Fdirectories%2Ftrain-jpg.tar
Resolving www.kaggle.com (www.kaggle.com)... 35.244.233.98
Connecting to www.kaggle.com (www.kaggle.com)|35.244.233.98|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.kaggle.com/account/login?ReturnUrl=%2Fc%2Fplanet-understanding-the-amazon-from-space%2Fdownload-directory%2FfBesYSh7qE3PuxXtB1SS%2Fversions%2FDMmq3a6XjGpH6e8EUe3c%2Fdirectories%2Ftrain-jpg.tar [following]
--2020-02-12 09:10:31--  https://www.kaggle.com/account/login?ReturnUrl=%2Fc%2Fplanet-understanding-the-amazon-from-space%2Fdownload-directory%2FfBesYSh7qE3PuxXtB1SS%2Fversions%2FDMmq3a6XjGpH6e8EUe3c%2Fdirectories%2Ftrain-jpg.tar
Reusing existing connection to www.kaggle.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: '/home/jupyter/.fastai/data/planet/train-jpg.tar.7z'

/home/jupyter/.fast [ <=> ] 8.76K --.-KB/s in 0.009s

2020-02-12 09:10:32 (961 KB/s) - '/home/jupyter/.fastai/data/planet/train-jpg.tar.7z' saved [8973]

Now when I checked the size of my path directory, it's only 1.6 MB, which is just the size of the .csv file. So apparently the downloaded .7z file does not contain any image data - judging from the 302 redirect to the login page and the 8.76K text/html response above, wget saved Kaggle's login page instead of the archive.

Any suggestions?

Did you search for a solution on the forum?
https://forums.fast.ai/search?q=planet%20category%3A20

Following the suggestion of using wget, I used the steps below to download to the expected folder, without any Chrome plugin:

  1. Go to the contest page
  2. Open Chrome Developer Tools (go to the menu > More tools > Developer Tools) and go to the Network tab
  3. On the Kaggle contest page click the “Download All” button in the Download section
  4. Cancel the download, click the “download-all” row in the Developer Tools, and look for “cookie” under “Request Headers”. Copy the whole content of the “cookie” header and replace “PASTE_THE_COOKIE_HERE” in the command below
  5. Get the download link of the file by right clicking the download button for the “train-jpg.tar” file and replace “PASTE_LINK_HERE” in the command below
  6. Paste this whole command into your Jupyter notebook and it will download the set to the expected folder:
wget -O {path}/train-jpg.tar.7z \
--header="Cookie: PASTE_THE_COOKIE_HERE" \
PASTE_LINK_HERE
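Once the download finishes, sanity-check the size before unpacking with the 7za pipeline from the notebook:

! ls -lh {path}/train-jpg.tar.7z   # should be hundreds of MB, not a few KB
! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path.as_posix()}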

Hey @methodmatters! I'm a bit of a newbie here. How did you upload the file into GCP? I'm trying to figure out how to access the folder '/home/jupyter/.fastai/data/planet'. Thanks! 🙂

Long story short - SSH into your virtual machine. From that window, you can choose the “cog” in the upper right-hand corner, which gives you the option to manually select a file to upload. Navigate to the file on your computer and ask it to upload.

This will upload the file into your home directory on the GCP instance. You'll then need to manually move the files to the directory referenced in the notebook. Nothing hugely complicated here: again in the SSH window you created earlier, navigate to wherever the file is and move it to the correct directory (whichever one is referenced in the course notebooks). The commands are basic Linux - e.g. cp for copy, mv for move. A quick Google search should get you the basics of how they work.
This will upload the file into your root directory in GCP. You’ll need to manually copy the files to the directory referenced in the notebook. Nothing hugely complicated here. Again in the SSH window you created earlier, you can navigate to wherever the file is and move it to the correct directory (whichever one is referenced in the course notebooks). The commands are basic linux - e.g. cp for copy, mv for move… A quick google search should get you the basics of how it works…