Is there a good set of instructions on how to download ImageNet? Like, ELI5?
I’ve searched around a bit but it seems like “download imagenet” is supposed to be an obvious instruction. And yet on Image-Net.org, I find so many different versions, and different years, and “Download Image URLs” vs “Download Original Images”. One finds reduced versions of ImageNet out there…
Selecting “Download Original Images” and getting clearance via a .edu email and then clicking on “ImageNet Fall 2011 release” just gets one the message
“The URL is not valid.”
For the past few days the header has said
“Note: ImageNet is under maintenance. Original images outside ILSVRC are temporarily unavailable.”
So, should I select “ImageNet10K from Deng et al., ECCV2010” instead?
Looks like this one is about 150GB.
According to this post by Jeremy (Imagenet training project discussion), we can train (for classification) with ILSVRC2017 which is the same as ILSVRC2012 (again, for classification) as far as he knows.
Now, when downloading from ImageNet, I get an 18h ETA, compared to 30 minutes from Kaggle. This is using an AWS Salamander instance. I haven’t tried Academic Torrents, but I guess the speed there depends on who is seeding.
I’m assuming Kaggle has the ILSVRC2017 version.
Edit to add: the official ILSVRC2017 download page says, and I quote, “This dataset is unchanged since ILSVRC2012. There are a total of 1,281,167 images for training. The number of images for each synset (category) ranges from 732 to 1300. There are 50,000 validation images, with 50 images per synset. There are 100,000 test images. All images are in JPEG format.”
File is 155GB.
If you just want a few images/folders from ImageNet, I bet you could download the whole tar from Kaggle onto a cheap Salamander instance, untar, then pull just the parts you want down from the instance, without going over the $1 free credit Salamander gives you.
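If disk space on the instance is tight, you may not even need to untar everything. Here is a minimal sketch, assuming the archive is an ordinary tar you trust and that the folders you want show up as prefixes of the member names. The function name `extract_matching` and the example prefix are made up for illustration, and the real archive layout may differ.

```python
import os
import shutil
import tarfile


def extract_matching(tar_path, dest_dir, wanted_prefixes):
    """Extract only the regular files whose archive name starts with a wanted prefix.

    Assumes a trusted archive: member names are joined to dest_dir as-is.
    Returns the list of extracted member names, in archive order.
    """
    prefixes = tuple(wanted_prefixes)
    extracted = []
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not (member.isfile() and member.name.startswith(prefixes)):
                continue  # skip directories and everything outside the wanted folders
            target = os.path.join(dest_dir, member.name)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Stream the member to disk instead of loading it all into memory.
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(member.name)
    return extracted
```

Something like `extract_matching("train.tar", "subset", ["n01440764/"])` would then leave only that one folder under `subset/`. The tar is still read front to back, so this saves disk space rather than time.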
I wrote a tool ImageNet downloader which uses the URLs to download ImageNet images. It takes good URLs and skips the bad ones. You can specify how many classes you want and how many images per class you want.
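For anyone curious what "take the good URLs, skip the bad ones" can look like, here is a rough sketch of the idea. It is not the tool's actual code: `collect_images`, `fetch_url`, and the `urls_by_class` dictionary (class ID mapped to a list of URLs) are names assumed for illustration.

```python
import urllib.request


def fetch_url(url, timeout=5):
    """Return the raw bytes at `url`; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def collect_images(urls_by_class, num_classes, images_per_class, fetch=fetch_url):
    """Download up to `images_per_class` images for the first `num_classes` classes.

    URLs that fail to download or return nothing are skipped.
    Returns {class_id: [(url, image_bytes), ...]}.
    """
    collected = {}
    for class_id in list(urls_by_class)[:num_classes]:
        good = []
        for url in urls_by_class[class_id]:
            if len(good) >= images_per_class:
                break  # enough images for this class
            try:
                data = fetch(url)
            except Exception:
                continue  # dead link or timeout: move on to the next URL
            if data:
                good.append((url, data))
        collected[class_id] = good
    return collected
```

The `fetch` argument is there so the loop can be tried out with a fake fetcher before pointing it at real URLs. A real downloader would probably also want to check that the bytes are actually an image, since dead hosts often return an HTML placeholder page.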
I also wrote a blog post where I did a little analysis of the state of the ImageNet URLs and came to some interesting conclusions. It’s a 2011 URL list, and many of the sites are down. You can check out the plots.
I managed to figure out the problem: ImageNet uses its own class indexing, which is different from the alphabetical directory indexing that ImageFolder applies to the current dataset. I suppose this subset of ImageNet doesn’t have all the classes, and that’s why there’s a mismatch between the predictions and the directory listing. You have to use the full indexing to get adequate accuracy using a pretrained network.
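In case it helps someone else, the remapping can be sketched roughly like this. It assumes you have the full list of class IDs (wnids) in the same order as the pretrained network's outputs, which is worth checking for your particular model, and that the subset's folders are named by wnid. The function names are just for illustration.

```python
def subset_to_full_indices(subset_wnids, full_wnids):
    """For each subset class (in sorted folder order, as ImageFolder assigns labels),
    return its index in the full class list.

    `full_wnids` must be in the pretrained network's output order (an assumption).
    """
    full_position = {wnid: i for i, wnid in enumerate(full_wnids)}
    return [full_position[wnid] for wnid in sorted(subset_wnids)]


def to_subset_label(full_prediction, subset_to_full):
    """Translate a full-index prediction into the subset's label.

    Returns None if the predicted class is not part of the subset.
    """
    try:
        return subset_to_full.index(full_prediction)
    except ValueError:
        return None
```

With the mapping in hand, you can either translate each prediction with `to_subset_label`, or keep only the network outputs at the mapped positions and take the argmax over those.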