Google released two new Kaggle competitions, for landmark recognition and landmark retrieval. The training dataset is around 500GB for both competitions, and you have to download the images yourself; the data isn't even available in the kernels.
So does anyone have experience with competitions of this scale? The number of output labels is around 200K.
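For context, here's roughly what I'm planning for the download itself. This is just a sketch: it assumes the data ships as a train.csv of image URLs with "id" and "url" columns (check the competition's data page for the real schema), and the output directory name is my own choice.

```python
# Minimal parallel downloader sketch. Assumes train.csv has "id" and "url"
# columns (an assumption; check the competition data page for the schema).
import csv
import os
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

OUT_DIR = "train_images"  # hypothetical output directory
os.makedirs(OUT_DIR, exist_ok=True)

def fetch(row):
    image_id, url = row["id"], row["url"]
    path = os.path.join(OUT_DIR, f"{image_id}.jpg")
    if os.path.exists(path):  # skip finished files so the script is restartable
        return
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        with open(path, "wb") as f:
            f.write(resp.content)
    except Exception as exc:  # dead URLs are common at this scale; log and move on
        print(f"failed {image_id}: {exc}")

with open("train.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Downloading is I/O-bound, so threads (not processes) are the right tool.
with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(fetch, rows)
```

No idea yet whether this scales sensibly to 500GB, which is partly why I'm asking.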
Wow, that sounds very challenging! I had problems even with smaller datasets, like the Quick, Draw! Doodle Recognition challenge. I'd be interested to hear about the experience of people who are going to tackle this competition.
Another challenge in this comp is that there's a lot of noise: for example, pictures of flowers labeled as a landmark. And the labels are just numbers, so you can't tell what a label actually means. On top of that, it's a two-stage comp. All of that makes it a nice challenge!
Always work with representative subsets of the data in the first instance. Get something working, especially your end-to-end pipeline. Try it on another subset. Tune hyperparameters. Implement ideas from the challenge forum. Try a larger subset before attempting the full dataset.
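To make the subsetting step concrete, here's one way it might look with pandas: keep only the most frequent classes, then cap the images per class. The column name "landmark_id" and the cutoffs are assumptions, so adapt them to the actual CSV and your hardware.

```python
# Subset sketch: keep the 1,000 most frequent landmarks, then sample up to
# 100 images per class. Assumes train.csv has a "landmark_id" column.
import pandas as pd

df = pd.read_csv("train.csv")

top_classes = df["landmark_id"].value_counts().nlargest(1000).index
subset = (
    df[df["landmark_id"].isin(top_classes)]
      .groupby("landmark_id", group_keys=False)
      .apply(lambda g: g.sample(n=min(len(g), 100), random_state=42))
)
subset.to_csv("train_subset.csv", index=False)
print(f"kept {len(subset)} rows across {subset['landmark_id'].nunique()} classes")
```

Capping per-class counts also tames the long tail a bit, which otherwise dominates with ~200K labels.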
Keep in mind you don't need to use all the data. This is especially true if your goal isn't to score highly on the leaderboard but to learn. There have been challenge winners who only used a slice of the data. One strategy is to look at similar previous challenges (wasn't the same challenge run last year?), reimplement others' solutions, and transfer that knowledge to the current challenge.
Finally, there are many challenges out there, so choose wisely rather than diving straight into the first one that appeals. Some really are massive endeavours requiring hundreds of hours of brain time, let alone compute time.
Definitely good advice to start with smaller subsets of the data if one has never worked on data science competitions before. Still, I remember how many hours I spent trying to deal with doodles. Choosing your battles could be a valid approach here.
Nevertheless, if you're able to tame these datasets, I'd be glad to hear about your experience, especially how you deal with distributed and parallel computation.
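I can't speak to truly distributed setups, but on a single machine the embarrassingly parallel parts (decoding, resizing) fall out naturally with multiprocessing. A sketch, assuming the images have already been downloaded to a local directory (directory names and target size are my assumptions, not anything from the competition):

```python
# Parallel preprocessing sketch: resize downloaded images to 256x256 using
# all CPU cores. Paths and target size are assumptions for illustration.
import os
from multiprocessing import Pool

from PIL import Image  # pip install Pillow

SRC, DST = "train_images", "train_images_256"
os.makedirs(DST, exist_ok=True)

def resize(name):
    try:
        with Image.open(os.path.join(SRC, name)) as img:
            img.convert("RGB").resize((256, 256)).save(os.path.join(DST, name))
    except OSError:
        pass  # skip truncated or corrupt downloads

if __name__ == "__main__":
    with Pool() as pool:  # decode/resize is CPU-bound, so processes, not threads
        pool.map(resize, os.listdir(SRC))
```

Downloading is I/O-bound and suits threads, whereas this step is CPU-bound, which is why processes are the better fit here.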