Lesson 1 -- The dataset I want to use is way too large -- what should I do?

In the first lesson, we’re encouraged to try a classification problem on another dataset to get a feel for how the library works. I was interested in the Google Landmarks dataset – I was thinking I could train on it and then see if the model could recognize photos from my recent vacation. It’s absolutely gigantic, though: 500GB just for the training set.

This seems like it’d be a common problem while learning, so I’m curious what you all would recommend. Can I take a random subsample of the dataset? Downscale the images? Or do I just need to bite the bullet and sign up for a pricey Paperspace plan?


Hi, I’m pretty new here too! From what I gather, I think you should try taking a small fraction of the dataset and working with that.

Maybe start with just the first tar file and get your system set up and working on that. You don’t necessarily need to have all the files downloaded at once, either: you could download one, train on it, then delete it before downloading the next.
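In case it’s useful, here’s a rough sketch of the extract → train → delete part of that loop, assuming you’ve already downloaded a shard. The `train_fn` hook and the directory names are placeholders I made up, not anything from the actual Landmarks setup:

```python
import tarfile
from pathlib import Path

def train_on_shards(tar_paths, work_dir="shard_tmp", train_fn=None, delete=True):
    """Extract each tar shard, run a training step on it, then free the disk."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    for tar_path in map(Path, tar_paths):
        out = work / tar_path.stem
        with tarfile.open(tar_path) as tf:
            tf.extractall(out)        # unpack just this one shard
        if train_fn is not None:
            train_fn(out)             # placeholder for your training step
        if delete:
            tar_path.unlink()         # delete the archive before the next one
```

You’d call it with something like `train_on_shards(["images_000.tar"], train_fn=my_training_step)` after each download finishes.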

I’d recommend trimming the dataset and the problem down to something smaller so you can iterate more quickly. The faster you can try things, the faster you’ll learn.
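One simple way to trim things down is to copy a random fraction of the images into a smaller working folder and point your training at that. This is just a sketch with only the standard library; the paths and the 10% default are my assumptions, so adjust them for your layout:

```python
import random
import shutil
from pathlib import Path

def make_subset(src_dir, dst_dir, fraction=0.1, seed=42):
    """Copy a random `fraction` of the files under src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.rglob("*") if p.is_file())
    random.Random(seed).shuffle(files)   # fixed seed -> reproducible sample
    keep = files[: max(1, int(len(files) * fraction))]
    for p in keep:
        shutil.copy2(p, dst / p.name)    # assumes a flat source folder
    return len(keep)
```

Once that works end to end on the subset, you can scale the fraction up as far as your disk and patience allow.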

Good luck with your project!