Is there a good set of instructions on how to download ImageNet? Like, ELI5?
I’ve searched around a bit but it seems like “download imagenet” is supposed to be an obvious instruction. And yet on Image-Net.org, I find so many different versions, and different years, and “Download Image URLs” vs “Download Original Images”. One finds reduced versions of ImageNet out there…
Selecting “Download Original Images”, getting clearance via a .edu email, and then clicking on “ImageNet Fall 2011 release” just gets one the message
"Oops! The URL is not valid."
For the past few days the header has said
“Note: ImageNet is under maintenance. Original images outside ILSVRC are temporarily unavailable.”
So, should I select “ImageNet10K from Deng et al., ECCV2010” instead?
Ok, new question: Why does the Kaggle download command only say “403 - Forbidden”?
I ran pip install kaggle, created a Kaggle account, authenticated using Google, generated an API token, put that in my ~/.kaggle/kaggle.json, ran the chmod 600 so no one else can read it, but still…
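For anyone hitting the same 403, here is a minimal sketch for sanity-checking the token file locally before blaming the server. The function name `check_kaggle_credentials` is made up for this post and is not part of the kaggle package; it only assumes the token file is JSON with `username` and `key` fields. As far as I know, a 403 on competition data can also mean the competition rules haven't been accepted on the website yet, which no local check will catch.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path=None):
    """Return a list of problems found with the Kaggle token file (empty if none)."""
    # Default location is the one mentioned above: ~/.kaggle/kaggle.json
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []
    # After chmod 600, no group/other permission bits should be set.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")

    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    if not isinstance(token, dict):
        return problems + ["file does not contain a JSON object"]

    # Assumed field names: "username" and "key".
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing or empty field: {field}")
    return problems
```

An empty list means the file itself looks fine, so the 403 is probably coming from the account or dataset side.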
I tried pulling a few images by url via wnid but I also get messages indicating the site is under maintenance. I’m wondering if it has been “under maintenance” since this post; i.e. 2 months.
I don’t have enough space on my machine to download the 155GB file from Kaggle, which I believe is only 200,000 images out of the 14M on ImageNet anyway.
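A quick way to check this before starting a multi-hour download is sketched below. The helper name `has_room_for` and the default `headroom` factor of 2 (room for the archive plus the extracted files) are assumptions for illustration, not measured requirements.

```python
import shutil


def has_room_for(size_gb, path=".", headroom=2.0):
    """True if free space at `path` is at least size_gb * headroom gigabytes."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= size_gb * headroom


if __name__ == "__main__":
    # 155 is the Kaggle archive size quoted in this thread.
    print("Enough space for the Kaggle archive:", has_room_for(155))
```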
I’m also trying to figure out how to get the whole dataset. It’s a bit complicated due to all the different versions and sources out there.
On the ImageNet website as well as on Academic Torrents, there is the 2011 release. The link on the ImageNet website is broken, but according to Academic Torrents it is a 1.31 TB file.
But is that the version everyone uses? My guess is no.
Looks like this one is about 150GB.
According to this post by Jeremy (Imagenet training project discussion), we can train (for classification) with ILSVRC2017 which is the same as ILSVRC2012 (again, for classification) as far as he knows.
Now, when downloading from Imagenet, I get an 18h ETA, compared to 30 minutes from Kaggle. This is using an AWS Salamander instance. I haven’t tried academic torrents, but I guess it depends on who is seeding.
I’m assuming Kaggle has the ILSVRC2017 version.
Edit to add: the official ILSVRC2017 download page says, and I quote, “This dataset is unchanged since ILSVRC2012. There are a total of 1,281,167 images for training. The number of images for each synset (category) ranges from 732 to 1300. There are 50,000 validation images, with 50 images per synset. There are 100,000 test images. All images are in JPEG format.”
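The numbers in that quote can be cross-checked with a few lines of arithmetic. This only uses the figures quoted above; the variable names are mine.

```python
# Figures quoted from the ILSVRC2017 download page.
train_images = 1_281_167
val_images = 50_000
val_per_synset = 50
test_images = 100_000

# 50,000 validation images at 50 per synset implies the number of classes.
num_synsets = val_images // val_per_synset
mean_train_per_synset = train_images / num_synsets
total_images = train_images + val_images + test_images

print(num_synsets, round(mean_train_per_synset, 1), total_images)
# prints: 1000 1281.2 1431167
```

So the classification set is 1000 classes and roughly 1.43M images in total, with the per-class average sitting inside the quoted 732 to 1300 range.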
File is 155GB.
If you just want a few images/folders from Imagenet, I bet you could download the whole tar from Kaggle onto a cheap Salamander instance, untar, then download what you want from Imagenet, without going over the $1 free credit Salamander gives you.
I wrote a tool ImageNet downloader which uses the URLs to download ImageNet images. It takes good URLs and skips the bad ones. You can specify how many classes you want and how many images per class you want.
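To give an idea of what "take the good URLs and skip the bad ones" can look like, here is a rough sketch. It is not the code of the tool mentioned above; `fetch`, `download_images`, and the JPEG magic-byte check are illustrative assumptions. The check is there because dead links often return an HTML placeholder page instead of an error.

```python
import urllib.request
from pathlib import Path


def fetch(url, timeout=10):
    """Return the raw bytes at `url`; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_images(urls, out_dir, limit, fetch_fn=fetch):
    """Save up to `limit` images from `urls` into `out_dir`, skipping bad URLs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        if len(saved) >= limit:
            break
        try:
            data = fetch_fn(url)
        except Exception:
            continue  # dead link, timeout, or HTTP error: skip it
        # Crude validity check: JPEG files start with the bytes FF D8.
        if not data.startswith(b"\xff\xd8"):
            continue
        target = out_dir / f"{len(saved):05d}.jpg"
        target.write_bytes(data)
        saved.append(target)
    return saved
```

Calling this once per class with that class's URL list and a per-class `limit` gives the "N classes, M images per class" behaviour. The `fetch_fn` parameter is only there so the loop can be tried out without touching the network.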
I also wrote a blog post where I did a little analysis of the state of the ImageNet URLs and came to interesting conclusions. It’s a 2011 URL list and many of the sites are down. You can check out the plots.
Since you don’t have space for all of it, perhaps you’d like to work with a subset of Imagenet? FastAI has you covered. http://files.fast.ai/data/imagenet-sample-train.tar.gz
Iirc there are 700 classes here, though I forgot how many images you’d get for each class.
How are the classes indexed for both suggested subsets?
I tried downloading the one from @dreambeats but I got low accuracy with PyTorch pretrained models, and I assume it’s because I loaded the images with ImageFolder instead of using the correct indexing.
@dreambeats , can I nudge you to respond to @gessha ? I’d like to know how to use the subset of ImageNet you graciously uploaded with PyTorch’s pretrained models.
I managed to figure out the problem - ImageNet uses its own class indexing, which is different from the alphabetical directory indexing that ImageFolder applies to the current dataset. I suppose this subset of ImageNet doesn’t have all the classes, and that’s why there’s a mismatch between predictions and directory listing. You have to use the full indexing to get adequate accuracy with a pretrained network.
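In case it helps someone else, the remapping can be sketched like this. It assumes the subset's directory names are wnids and that the pretrained model's class index is the position of the wnid in the sorted list of all 1000 ILSVRC2012 wnids. As far as I know that holds for torchvision's pretrained models, but please verify it on a few known images. `build_index_remap` is a hypothetical helper, and the four-wnid "full" list below is only a toy stand-in for the real 1000-entry list.

```python
def build_index_remap(subset_wnids, full_wnids):
    """Map ImageFolder-style subset indices to indices in the full class list."""
    # Assumed model indexing: position in the sorted full wnid list.
    full_index = {wnid: i for i, wnid in enumerate(sorted(full_wnids))}
    # ImageFolder numbers classes by sorted directory name, starting at 0.
    return {i: full_index[wnid] for i, wnid in enumerate(sorted(subset_wnids))}


# Toy example: a 4-class "full" list and a 2-class subset.
full = ["n01440764", "n01443537", "n01484850", "n01491361"]
subset = ["n01484850", "n01443537"]
remap = build_index_remap(subset, full)
print(remap)  # {0: 1, 1: 2}
```

With the real list, something like `target_transform=lambda t: remap[t]` passed to ImageFolder should make the labels line up with the model's outputs.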