[SOLVED] How to download large datasets when your data limit is 1.5GB/day?

So I have been struggling with dataset downloads, which take days. At the same time, Google Colab has started offering Tesla T4 GPUs, so I have come up with a slightly unusual plan for getting the data. Can someone who uses these tools guide me?

Main goal:- The data should not be downloaded over my own connection.

Solution:- I am considering upgrading my Google Drive storage to 100GB, and the plan is to make Google download the datasets for me.

How:- This is where I need help. One thing that comes to mind is to use curl or wget in Google Colab; maybe that will work, but I have not tried it yet. Has anyone faced this issue, or does anyone know a workaround I am not able to think of?

EDIT:- Getting another internet connection is not an option based on my current circumstances.

Solution discussed in the replies below.

1 Like

You may check this web page … It’s about "How to Upload large files to Google Colab and remote Jupyter notebooks"

I went through the link and it was quite informative. But I was wondering: is there any way to avoid downloading the data in the first place? The link you shared shows how to move data from Google Drive to Colab, but getting the data into Google Drive would use twice my data allowance: first to download it to my own computer, then to upload it to Drive.

I am thinking of running a wget or curl command somewhere so I don’t have to download the data myself (something similar to Kaggle kernels, where if you enable the internet connection, all the downloads are done on Kaggle’s side).

The only price you have to pay is for Google Drive storage.

How does it work?
I will show it using Google Colab, since that is where we need the data. Some data limits first: Google Colab gives you around 50GB of disk space and 12GB of RAM for free. This storage is temporary; once the kernel is terminated, the space is gone. For permanent storage we will use Google Drive, which gives 15GB for free. You can create multiple accounts if your datasets are small, as is common in NLP. But the 100GB plan for $2/month is an offer you should consider (and with a 1-year subscription you get 2 months free).

General Guideline

  • Download using wget in Google Colab
  • Move it to Google Drive

Code

  • To download use this command in Google Colab

    # You may have to experiment to get the exact command working,
    # but once it works I got around 15MB/s download speed.
    wget 'your_url'
    


    Note:- Not sharing the exact wget command due to licensing issues.

    If you run ls, you get:

    !ls
    # MRNet-v1.0.zip	sample_data
    
  • To move the downloaded zip file to Google Drive
    Run this cell and you will get a verification URL.

    from google.colab import drive
    drive.mount('/content/gdrive')
    

    All your Google Drive contents can then be found under /content/gdrive/My Drive/

    !ls /content/gdrive/My\ Drive/
    

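    The step that actually moves the archive into Drive isn’t spelled out above. A minimal Python sketch, assuming the file from the wget step (e.g. MRNet-v1.0.zip) sits in /content and Drive has already been mounted (the default paths follow Colab conventions; the helper name is my own):

    ```python
    import shutil
    from pathlib import Path

    def copy_to_drive(filename,
                      src_root="/content",
                      drive_root="/content/gdrive/My Drive"):
        """Copy a file from Colab's working directory into mounted Drive.

        The default paths follow Colab conventions; drive.mount() must
        have been run first. Both roots are parameters so the helper
        can be pointed at other directories.
        """
        src = Path(src_root) / filename
        dst = Path(drive_root) / filename
        shutil.copy(src, dst)  # Drive syncs the copy in the background
        return dst

    # In Colab, after mounting Drive:
    # copy_to_drive("MRNet-v1.0.zip")
    ```

    Once the copy finishes, the file survives kernel resets, so you only pay the download once.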
Link I found useful

3 Likes

There shouldn’t ever be a need to download data to your PC - just download it to your server (e.g. colab, crestle, gcp, etc).

Not able to afford cloud instances right now. Colab is good, but how do you deal with only 1 CPU core for data loading? Would it not cause problems (though you can leave the kernel running)? I am thinking of starting on Colab with 100GB of Google Drive. At least I would be able to run multiple tests at the same time.
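On the single-core worry: assuming a PyTorch DataLoader (the thread doesn’t name a framework, so this is an illustration), the relevant knob is num_workers. A minimal sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# On a VM with very few CPU cores, spawning worker processes can cost
# more than it saves; num_workers=0 loads batches in the main process.
ds = TensorDataset(torch.arange(12.0).unsqueeze(1))
dl = DataLoader(ds, batch_size=4, num_workers=0)

batches = [x for (x,) in dl]
print(len(batches))  # 12 samples / batch size 4 = 3 batches
```

Data loading will be slower than on a multi-core machine, but leaving the notebook tab open lets the run grind on unattended.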

GCP provides $300 credits. That’s ~1000 hours! :slight_smile:

Yeah, I had not explored that much; my debit card is still a week away from arriving. But I would definitely make use of those credits once it does. I had stayed away from learning much about the cloud for these reasons, but I will definitely spend time on this tomorrow.

Thanks for the help.

(Semi-tested approach) You could also try reaching out to the cloud providers and pitching them a project that you’re working on :slightly_smiling_face:

You could try telling them that you’re a fastai student, and that you plan to work on an idea and publish a blog/paper/video on it, you’d mention the service provider too once the project completes. Many cloud providers would be happy to give you quite a bit of breathing room :wink:

3 Likes

Interesting. Would definitely give it a shot.

@kushaj pretty sure it says 358 GB disk space if I choose a GPU runtime.

I would have to double-check that, as I have not yet moved to Colab.

No brother, it only has 35GB with a GPU runtime.

Why does it use less of my data to download anything on Colab and Kaggle?

Assuming by "data" you are not referring to physical storage space: Colab provides a remote connection to a notebook which has its own CPU/GPU and storage, and also its own internet backend. So when I give a command to download a dataset, that dataset is downloaded on that remote notebook, using the internet connection provided by Google. My own internet connection has nothing to do with the download.

Ok, I get it, thanks brother; that question was really bothering me.
Well brother, I just tried to build a model that has to predict a number (a float value) on the basis of customer reviews, but it’s the first time I have had to classify into around 1300 different numbers. Some of them appear around 250 times, but half of them appear fewer than 50 times, and some are present only once.

Ok, I just found a platform,
https://platform.peltarion.com/
where I think you get some more memory and GPU. Check it out.

1 Like