I’m also trying to figure out how to get the whole dataset. It’s a bit complicated due to all the different versions and sources out there.
Both the ImageNet website and Academic Torrents list the 2011 release. The link on the ImageNet website is broken, but according to Academic Torrents it is a 1.31 TB file.
But is that the version everyone uses? My guess is no.
On DawnBench (https://github.com/stanford-futuredata/dawn-bench-entries#imagenet-training), they link to the ILSVRC2012 dataset (http://www.image-net.org/challenges/LSVRC/2012/).
Looks like this one is about 150GB.
According to this post by Jeremy (Imagenet training project discussion), we can train for classification with ILSVRC2017, which as far as he knows is the same as ILSVRC2012 (again, for classification).
Now, when downloading from Imagenet, I get an 18h ETA, compared to 30 minutes from Kaggle. This is using an AWS Salamander instance. I haven’t tried academic torrents, but I guess it depends on who is seeding.
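To put those two ETAs in perspective, here is a rough back-of-the-envelope calculation of the average throughput each one implies. It assumes both sources serve an archive of about 150GB (the ILSVRC2012 size mentioned above), which I haven't verified for the Kaggle copy, and it uses decimal units (1 GB = 1000 MB). The function name is just something I made up for illustration.

```python
def throughput_mb_per_s(size_gb: float, hours: float) -> float:
    """Average throughput in MB/s needed to move size_gb in the given time.

    Uses decimal units: 1 GB = 1000 MB.
    """
    return size_gb * 1000 / (hours * 3600)


# Assumption: both downloads are roughly the same ~150GB archive.
SIZE_GB = 150

imagenet_rate = throughput_mb_per_s(SIZE_GB, 18)   # 18h ETA from ImageNet
kaggle_rate = throughput_mb_per_s(SIZE_GB, 0.5)    # 30 min ETA from Kaggle

print(f"ImageNet: {imagenet_rate:.1f} MB/s")
print(f"Kaggle:   {kaggle_rate:.1f} MB/s")
print(f"Kaggle is about {kaggle_rate / imagenet_rate:.0f}x faster")
```

If the two archives are not the same size, the ratio changes accordingly, so treat this as an order-of-magnitude estimate only.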
I’m assuming Kaggle has the ILSVRC2017 version.
Edit to add: the official ILSVRC2017 download page says: “This dataset is unchanged since ILSVRC2012. There are a total of 1,281,167 images for training. The number of images for each synset (category) ranges from 732 to 1300. There are 50,000 validation images, with 50 images per synset. There are 100,000 test images. All images are in JPEG format.”
The file is 155GB.
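Whichever source you download from, one way to check that you ended up with the classification set described on the download page is to count the extracted training images and compare against the quoted numbers. Below is a minimal sketch. It assumes the training archive has been unpacked into one sub-directory per synset (for example `train/n01440764/*.JPEG`), which is the layout most training scripts expect but may not match how your copy unpacks. The function names and the `EXPECTED_TRAIN` dict are my own. The 1000-synset figure is not in the quote itself; it follows from 50,000 validation images at 50 per synset.

```python
from collections import Counter
from pathlib import Path

# Numbers quoted on the ILSVRC2017 download page (training set).
# "synsets" is derived: 50,000 validation images / 50 per synset = 1000.
EXPECTED_TRAIN = {"synsets": 1000, "total": 1_281_167, "min": 732, "max": 1300}

JPEG_SUFFIXES = {".jpeg", ".jpg"}


def count_images_per_synset(train_dir):
    """Count JPEG files in each immediate sub-directory of train_dir.

    Assumes a layout of train_dir/<synset_id>/<image>.JPEG.
    """
    counts = Counter()
    for synset_dir in Path(train_dir).iterdir():
        if synset_dir.is_dir():
            counts[synset_dir.name] = sum(
                1
                for p in synset_dir.iterdir()
                if p.is_file() and p.suffix.lower() in JPEG_SUFFIXES
            )
    return counts


def summarize(counts):
    """Reduce per-synset counts to the figures quoted on the download page."""
    values = list(counts.values())
    return {
        "synsets": len(values),
        "total": sum(values),
        "min": min(values) if values else 0,
        "max": max(values) if values else 0,
    }


def mismatches(summary, expected=EXPECTED_TRAIN):
    """Return {key: (found, expected)} for every figure that differs."""
    return {
        key: (summary[key], want)
        for key, want in expected.items()
        if summary[key] != want
    }
```

Usage would be something like `mismatches(summarize(count_images_per_synset("/path/to/train")))`, where the path is a placeholder for wherever you extracted the data. An empty dict means the counts match the quoted figures.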