Loading images from Google Cloud, suggested hardware for large dataset

Hello,

May I know if there is any way to create an ImageDataBunch using a DataFrame that contains category labels and paths to a Google Cloud Storage bucket (gs://…)?

I wanted to try the exercise (retrain ResNet34) using my own dataset (~1 mil images, 70GB) but could not figure out how to do this using the fastai library. I ended up transferring my dataset into the VM/GCP instance and converting the paths to local paths.
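
For context, the DataFrame route with local paths looks roughly like the sketch below in fastai v1; the folder layout and column names are made-up placeholders, so adjust them to your own data:

```
import pandas as pd
from fastai.vision import ImageDataBunch, imagenet_stats

# Hypothetical CSV with one row per image: a relative file path and a category label
df = pd.read_csv('/home/jupyter/data/labels.csv')   # placeholder columns: 'name', 'label'

data = ImageDataBunch.from_df(
    '/home/jupyter/data',   # root folder the relative paths in 'name' are resolved against
    df,
    fn_col='name',
    label_col='label',
    valid_pct=0.2,          # hold out 20% of the images for validation
    size=224,               # resize for ResNet34
).normalize(imagenet_stats)
```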

Also, may I know if there are any suggestions as to what kind of hardware I should choose to train on such a large dataset? I chose ‘n1-highmem-4 (4 vCPUs, 26 GB memory)’ and an NVIDIA Tesla P4, but when I tried running .fit_one_cycle() on the CNN learner, it looked like each cycle would take at least 2 hours, which seems really long. I don’t mind using more of my credits if I can get faster results, but I am not sure which options to choose. Are there any suggestions from more experienced people?

Hope I didn’t confuse any of the terminology or duplicate an already existing question (I did a brief search and didn’t see any relevant topics). Would appreciate any guidance, thanks in advance!

4 Likes

Dear wxng

May I ask how you transferred the dataset into the VM/GCP instance from your local PC and converted the paths to local paths? I have ~10 GB of images and have been struggling with this.

Best

Same here; I wonder if it’s possible to create an ImageDataBunch from a Google Cloud Storage bucket that contains the images and a CSV with the labels.

1 Like

I was able to get the files into Google Cloud Storage, but then faced the same problem wxng faced, so I had to work around it: I put the files from my local machine onto the VM and then moved them from that directory into the Jupyter directory. In order to move all 8,000+ files at once, I zipped them first, so I uploaded a single zip file.
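For the zipping step, something like this works on the local machine with just the Python standard library (the source folder name is a placeholder):

```
import shutil

# Bundle the local image folder (placeholder path) into testfile.zip for upload
shutil.make_archive('testfile', 'zip', root_dir='path/to/my_images')
```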

So I pretty much followed these instructions:

Then, to move the uploaded zip file, I did:
sudo mv /home/YOURDIRECTORY/testfile.zip /home/jupyter/

where YOURDIRECTORY is whatever shows up in your SSH file-transfer window; it will be something like /home/XXXXXX.

Once I moved the file, I opened JupyterLab, started a new Python file, and executed the code below:

```
import zipfile as zf

# Open the uploaded archive and extract everything into the Jupyter home directory
files = zf.ZipFile('testfile.zip', 'r')
files.extractall('/home/jupyter/')
files.close()
```

I would recommend creating another folder to extract to, e.g. /home/jupyter/data, as in the sketch below.
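A rough version of that, reusing the zipfile code above (the data folder name is just a suggestion):

```
import os
import zipfile as zf

# Create a separate target folder (suggested name) and extract the archive into it
target = '/home/jupyter/data'
os.makedirs(target, exist_ok=True)

with zf.ZipFile('/home/jupyter/testfile.zip', 'r') as files:
    files.extractall(target)
```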

In order to remove the zip file, you will have to do it via the SSH window.
Navigate to your Jupyter directory, something like

cd …

then see what’s there

ls -la

cd your way into jupyter

and once you are in the same directory as your zip file, do

sudo rm "filename"

hope this helps

1 Like

Update —

Found a much faster way to do the above. First, create a Cloud Storage bucket and name it, then put your zipped file there… and via an SSH connection to your VM, run:

sudo gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION_IN_LOCAL]
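If you would rather do the copy from inside the notebook instead of the SSH window, the google-cloud-storage client can do the same thing. A rough sketch (bucket and object names are placeholders; it assumes the library is installed and the VM’s service account can read the bucket):

```
from google.cloud import storage   # pip install google-cloud-storage

bucket_name = 'my-bucket'                      # placeholder
object_name = 'testfile.zip'                   # placeholder
destination = '/home/jupyter/testfile.zip'

client = storage.Client()                      # picks up the VM's default credentials
blob = client.bucket(bucket_name).blob(object_name)
blob.download_to_filename(destination)         # copy the object to local disk
```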

If you face issues, here is a good thread on how to solve some:

Best of luck

1 Like

I was having the same problem: I had all my images in a GCS bucket and didn’t want to copy everything onto my VM because of the size. I found this very useful:

https://cloud.google.com/storage/docs/gcs-fuse

Cloud Storage FUSE is an open source FUSE adapter that allows you to mount Cloud Storage buckets as file systems on Linux or macOS systems

EDIT – this works perfectly, except it’s slow as hell compared to having the images on the VM…
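For what it’s worth, once the bucket is mounted (e.g. gcsfuse my-bucket /home/jupyter/gcs, with made-up names), fastai can treat the mount point like any local folder. A rough sketch, assuming one sub-folder per class:

```
from fastai.vision import ImageDataBunch, imagenet_stats

# /home/jupyter/gcs is the hypothetical gcsfuse mount point of the bucket;
# assumes the images are organised as one sub-folder per class
data = ImageDataBunch.from_folder(
    '/home/jupyter/gcs',
    valid_pct=0.2,   # random 20% validation split
    size=224,
).normalize(imagenet_stats)
```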