Lesson 3 - Can't Download Planet Data Images Tar Archive

Hi @Jonny5 Thanks, your suggested steps worked :+1:

Yes - I am having massive problems with the download process. I love the tutorials, but I must admit I lose a lot of time trying to get the requisite data files into my notebooks. So nothing to add except a feeling of extraordinary frustration!

Have you searched the forum to see whether someone has already uploaded the training file to Google Drive for you? Or do you want people to upload it to Dropbox, so you don't need to spend time checking Google Drive?

Hi Jitendra, I uploaded cookies.txt to the /content folder and then used the wget command; it worked fine for me.
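For reference, a minimal sketch of that flow in a Colab cell, assuming cookies.txt is the cookie file from @Jonny5's steps and the URL placeholder is the download link copied from your browser:

from google.colab import files
files.upload()  # pick the cookies.txt you exported from your browser; it lands in /content

! wget --load-cookies /content/cookies.txt "<copied-download-url>" -O train-jpg.tar.7z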

Thanks. This worked for me

Hi @PalaashAgrawal Did you manage to solve it? I'm facing the same problem on Colab. I did what @Jonny5 proposed and did not get any error, but it is still not working.

@PabloMC I am also getting the same error.

I have not been able to solve the problem, but you may have an additional one due to Google Drive. Notice that it says:

ERROR: no more files in /content/drive/My

I did it by uploading the cookie file as @Jonny5 suggested:

url = '<the download URL copied from the browser>'
! wget --load-cookies /content/cookies.txt "{url}" -O {path}/train-jpg.tar.7z

But as I said, that does not solve the 7z untarring step:

 ! 7za -bd -y -so x '{path}'/train-jpg.tar.7z | tar xf - -C {path.as_posix()}

and still returns the same error

You can try a simpler solution. Someone uploaded the dataset separately as a zip file. https://www.kaggle.com/nikitarom/planets-dataset
You can get this file using the wget command, or the Kaggle API command directly.
Cheers


It works! Many thanks @PalaashAgrawal
One writes, for example:

! kaggle datasets download nikitarom/planets-dataset -p "{path}"
! unzip -q -n '{path}'/planets-dataset.zip -d '{path}' 

And

(path/'planet'/'planet').ls()

will show you what files are in the folder. You may also use that solution, @adit007.


I found a solution. This applies to any dataset from Kaggle, so it is a permanent solution (used in Google Colab).

  1. Go to the competition page where you want to download the .tar from.
  2. Press F12 and go to the Network panel.
  3. Start the download and cancel it.
  4. You will see a request called train-jpg.tar.7z?..
  5. Right-click -> Copy as cURL (bash).
  6. Paste it into a notebook cell and put a ! mark in front (see the sketch after these steps).
  7. Very important: add --get at the end of the command.

  I don't know much bash, but I just experimented around. It took me 3 hours to find this. It works smoothly. After that you can use:

    !p7zip -d train-jpg.tar.7z
    !tar -xvf train-jpg.tar

  This will extract the data to your path.
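For reference, the pasted command from step 6 ends up looking roughly like this. This is only a sketch: the signed URL and any -H headers come from your own Copy as cURL output, and the output filename is the one you want.

! curl '<signed-download-url-from-the-network-panel>' -H 'referer: https://www.kaggle.com/' --compressed --get -o train-jpg.tar.7z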

Hi, I ran into the same error. Here is an approach that worked.

Somehow the file size of the downloaded train-jpg.tar.7z was smaller than the size shown on Kaggle, hence the error about being unable to open the file. So I changed the method of copying the files over to the remote machine (where I run the notebook).

Steps:

  1. Download the file train-jpg.tar.7z from Kaggle directly.
  2. Copy it over to the remote machine using the scp command:
    gcloud compute scp ~/Downloads/train-jpg.tar.7z my-instance:/home/jupyter/.fastai/data/planet/ --zone us-west1-b

If using GCP, you can find the documentation for step 2 at https://cloud.google.com/compute/docs/gcloud-compute#connecting

And this time all the contents must have been copied correctly, because the command to unpack the data worked. Hope this helps.
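Before unpacking, a quick sanity check is to confirm the copied file's size matches what the Kaggle page reports (the path below assumes the lesson's default setup):

ls -lh /home/jupyter/.fastai/data/planet/train-jpg.tar.7z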


I am unable to copy the link from the download button.
Can anyone please share it?

Thank you!

@Sunit, in your browser, open the inspect panel on the Kaggle page where the dataset resides (in Google Chrome, right-click and choose Inspect), then click on the Network tab. Then hit the download button for the dataset you want. You will see URLs appear under the Network tab. Click on one whose name starts with train-jpg.tar, and on the right side you should see a Headers tab. Click on it and copy the URL.
Remember to paste your URL between "" (double quotes) when passing it to wget --load-cookies cookies.txt.
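Put together, the final command looks roughly like this (a sketch only; the placeholder is whatever URL you copied from the Headers tab):

! wget --load-cookies cookies.txt "<copied-url>" -O train-jpg.tar.7z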

Hope that works for you.


Thanks!

[quote=“elie, post:44, topic:60309”]
Click on one whose name starts with train-jpg.tar, and on the right side you should see a Headers tab.
[/quote]Could someone explain where exactly that Headers tab appears?

Thanks for a working solution.

A slight problem with this on Paperspace (free account) is that, after downloading all the files, they cannot be unzipped: you run out of disk space while unzipping the contents of the folder.

Very similar to a post above, I followed these steps to download the Kaggle planet data.

Checking the files available for download:

! kaggle competitions files planet-understanding-the-amazon-from-space

Using cURL to download the relevant files:

  • Go to the competition page.
  • Press Ctrl+Shift+I and go to the Network tab.
  • Click on the file train-jpg.tar and start downloading. Cancel the download once it begins.
  • You will notice train-jpg.tar.7z? under the Network tab. Right-click on it and copy as cURL (bash) or (cmd), depending on your OS.
  • cd into the directory where you want to store the file and paste the cURL command.
  • At the end of the command, type “-o {desired filename with the extension}”, for example train-jpg.tar.7z and train_v2.csv.zip for this project.

cURL command for reference:

curl 'https://storage.googleapis.com/kaggle-competitions-data/kaggle-v2/6322/868312/upload/train_v2.csv.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1595223509&Signature=Jo%2FyMZoXypD3IC5xbsr%2B8YgVdvYU%2FA1qhe2mKTi%2BFh%2FS3c4PbvxEf9lJBIJBeWiWm896gt654z4iKJ3jVtHt9Cgrt81vHo9RH5vLl0Bv4EB2E8dXq1LkpsT6vOVN8tnU55453MIeZqtqhd%2Fm1RKXHdbiJZt9jRtICJLTDnhAoBn8kpADGAV9rgNmLTA2CH6Nu5TI0429cxcEQ15nEp7NIySyqxpSd6%2B7FYoNKdJvY0SFjG0y8h0RNH%2B4BWtdTc1Tzz%2BjTSM0MpP%2FCGhKNN1VCTN9z8bZatyNIoa1xwKPnmb16zu0RJZNp%2FVZLdapBn8DnQKWd691G0xmch1PZ45MlA%3D%3D&response-content-disposition=attachment%3B+filename%3Dtrain_v2.csv.zip' \
  -H 'authority: storage.googleapis.com' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'dnt: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36 Edg/83.0.478.64' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-fetch-dest: document' \
  -H 'referer: https://www.kaggle.com/' \
  -H 'accept-language: en-US,en;q=0.9' \
  --compressed -o train_v2.csv.zip

Thanks for your solution, it is working for me as well! I spent many hours before finding it and it was very frustrating; I do not know why the standard notebook does not work.

Thanks!

@cdaigneault
Glad I could help. Actually, the original Kaggle dataset was removed from the site, hence this issue!