So I have been struggling with dataset downloads, which take days. At the same time, Google Colab has come out with Tesla T4 GPUs, so I have come up with a somewhat unusual plan for downloading data. Can someone who uses these tools guide me?
Main goal: I should not have to download the data over my own connection.
Solution: I am considering upgrading my Google Drive storage to 100 GB, and the plan is to make Google download the datasets for me.
How: This is where I need help. One thing that comes to mind is to use curl or wget in Google Colab, and maybe that will work; I have not tried it yet. Has anyone faced this issue, or does anyone know a workaround that I am not able to think of?
EDIT: Getting another internet connection is not an option in my current circumstances.
I went through the link and it was quite informative. But I was wondering: is there any way I could avoid downloading the data in the first place? The link you shared shows how to move data from Google Drive to Colab, but getting the data into Google Drive would use twice my data, first downloading it to my own computer and then uploading it to Drive.
I am thinking of running a wget or curl command somewhere so that I don't have to download the data myself (something similar to Kaggle kernels, where, if you enable the internet connection, all the downloads are done by Kaggle).
The only price you have to pay is for Google Drive storage.
How it works
I will show it using Google Colab, since that is where we need the data. First, some limits on storage. Google Colab gives you about 50 GB of disk space and 12 GB of RAM for free, but this storage is temporary: once the kernel is terminated, the space is gone. For permanent storage we will use Google Drive, which gives 15 GB for free. You can create multiple accounts if your datasets are small, as they often are in NLP. But the 100 GB plan for $2/month is an offer you should consider (and with a one-year subscription you get two months free).
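If you want to double-check what a particular session actually gives you (the exact numbers vary by runtime), a rough check from a Colab cell looks like this:

!df -h /content   # disk space available to the notebook
!free -h          # RAM available to the notebook
!nvidia-smi       # GPU details (e.g. Tesla T4), if a GPU runtime is selected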
General Guideline
1. Download using wget in Google Colab
2. Move it to Google Drive
Code
To download, use this command in a Google Colab cell:
# You may have to tweak this command for your particular source,
# but when it works I got around 15 MB/sec download speed.
!wget 'your_url'
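For step 2, moving the file into Google Drive so it survives the session, a minimal sketch looks like this (the file name your_file.tar.gz and the datasets folder are just placeholders, and the mount path may be 'MyDrive' instead of 'My Drive' on newer Colab versions):

from google.colab import drive
import os, shutil

drive.mount('/content/drive')              # authorise Colab to access your Drive

dest = '/content/drive/My Drive/datasets'  # placeholder folder inside your Drive
os.makedirs(dest, exist_ok=True)
shutil.copy('your_file.tar.gz', dest)      # the copy persists after the kernel is terminated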
I am not able to afford cloud instances right now. Colab is good, but how do you deal with only one CPU core for data loading? Would that not cause problems (although you can leave the kernel running)? I am thinking of starting on Colab with 100 GB of Google Drive. At least I would be able to run multiple experiments at the same time.
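For the data-loading part, a rough sketch of what I have in mind (assuming a PyTorch/fastai pipeline; my_dataset is a placeholder) is simply to set num_workers to whatever the Colab VM actually reports:

import multiprocessing
from torch.utils.data import DataLoader

n_workers = multiprocessing.cpu_count()    # free Colab runtimes usually report only 1-2 cores
loader = DataLoader(my_dataset, batch_size=64,
                    num_workers=n_workers, pin_memory=True)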
Yeah, I had not explored that much; my debit card still has a week before it arrives. But I will definitely make use of these once it does. I had stayed away from learning much about the cloud for these reasons, but I will definitely spend time on this tomorrow.
(Semi-tested approach) You could also try reaching out to the cloud providers and pitching them a project that you're working on.
You could tell them that you're a fastai student, that you plan to work on an idea and publish a blog/paper/video about it, and that you'd mention the service provider once the project is complete. Many cloud providers would be happy to give you quite a bit of breathing room.
I am assuming that by data you are not referring to physical storage space. Colab provides a remote connection to a notebook which has its own CPU/GPU and storage, and also its own internet connection. So when I give a command to download a dataset, that dataset is being downloaded on that remote machine using the internet connection provided by Google. My own internet connection has nothing to do with the download.
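If you want to convince yourself of this, one quick check (ipinfo.io is just one example of a what-is-my-IP service) is to ask the notebook for its public IP; it will belong to Google's network, not to your ISP:

# run inside a Colab cell; the IP and organisation shown belong to the remote VM
!curl -s ipinfo.io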
OK, I get it, thanks brother. This question was really bothering me.
Well, brother, I just tried to build a model that has to predict a number (a float value) based on customer reviews, but it's the first time I have had to classify into around 1,300 different numbers. Some of them appear around 250 times, but half of them appear fewer than 50 times, and some are present only once.