How does one "Download ImageNet"?

Is there a good set of instructions on how to download ImageNet? Like, ELI5?
I’ve searched around a bit but it seems like “download imagenet” is supposed to be an obvious instruction. And yet on Image-Net.org, I find so many different versions, and different years, and “Download Image URLs” vs “Download Original Images”. One finds reduced versions of ImageNet out there…

Selecting “Download Original Images” and getting clearance via a .edu email and then clicking on “ImageNet Fall 2011 release” just gets one the message

“Oops! The URL is not valid.”

For the past few days the header has said

“Note: ImageNet is under maintenance. Original images outside ILSVRC are temporarily unavailable.”

So, should I select “ImageNet10K from Deng et al., ECCV2010” instead?

They don’t list any mirror sites. (?)

You can download from here: Link

Ok, new question: Why does the kaggle download command only say “403 - Forbidden”?

I ran pip install kaggle, created a Kaggle account, authenticated using Google, generated an API token, put that in my ~/.kaggle/kaggle.json, ran the chmod 600 so no one else can read it, but still…

$ kaggle competitions download -c imagenet-object-localization-challenge

403 - Forbidden

Ahhhh, got it. You have to actually “Join the Competition” on the Kaggle site first, and then the download command will work. Thanks!
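
For anyone retracing these steps, here is a minimal sketch that checks the local half of the setup described above: that the token file exists, is locked down like `chmod 600`, and has the `username` and `key` fields. The function name `check_kaggle_credentials` is made up for illustration and is not part of the Kaggle CLI. It cannot detect the actual cause of the 403 in this thread, since joining the competition happens on the website.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path):
    """Return a list of problems with a kaggle.json file (empty list if none found)."""
    path = Path(path).expanduser()
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []
    # Any group/other permission bit means the file is not locked down like chmod 600.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")

    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]

    if not isinstance(creds, dict):
        return problems + ["file does not contain a JSON object"]
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing '{field}' field")
    return problems


if __name__ == "__main__":
    for problem in check_kaggle_credentials("~/.kaggle/kaggle.json"):
        print("problem:", problem)
```

If this prints nothing and the download still returns 403, the remaining suspect is that the account has not joined the competition.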

I tried pulling a few images by url via wnid but I also get messages indicating the site is under maintenance. I’m wondering if it has been “under maintenance” since this post; i.e. 2 months.

I don’t have enough space on my machine to download the 155GB file from Kaggle, which I believe is only 200,000 images out of the 14M on ImageNet anyway.
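
A quick way to check this before starting a long download is to compare free disk space against the archive size. The sketch below uses only the standard library. The helper name `has_room_for` and the 2x headroom factor (space for the archive plus its extracted contents) are assumptions for illustration, not measured figures.

```python
import shutil


def has_room_for(path, required_gb, headroom=2.0):
    """True if the filesystem holding `path` has at least required_gb * headroom free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * headroom * 1e9


if __name__ == "__main__":
    # 155 is the archive size in GB quoted in this thread.
    print("enough space:", has_room_for(".", 155))
```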

I’m also trying to figure out how to get the whole dataset. It’s a bit complicated due to all the different versions and sources out there.

On the Imagenet website as well as on academic torrents, we have the 2011 release. The link on Imagenet website is broken, but according to academic torrents this is a 1.31 TB file.

But is that the version everyone uses? My guess is no.

On DawnBench (https://github.com/stanford-futuredata/dawn-bench-entries#imagenet-training), they link to the ILSVRC2012 dataset (http://www.image-net.org/challenges/LSVRC/2012/).

Looks like this one is about 150GB.

According to this post by Jeremy (Imagenet training project discussion), we can train (for classification) with ILSVRC2017, which is the same as ILSVRC2012 (again, for classification) as far as he knows.

Now, when downloading from Imagenet, I get an 18h ETA, compared to 30 minutes from Kaggle. This is using an AWS Salamander instance. I haven’t tried academic torrents, but I guess it depends on who is seeding.

I’m assuming Kaggle has the ILSVRC2017 version.

Edit to add: on the official ILSVRC2017 download page, I quote “This dataset is unchanged since ILSVRC2012. There are a total of 1,281,167 images for training. The number of images for each synset (category) ranges from 732 to 1300. There are 50,000 validation images, with 50 images per synset. There are 100,000 test images. All images are in JPEG format.”
The file is 155GB.

PSA: If you are using Salamander and have a big dataset, it can take 1 hour per 30GB to move your storage when turning off your instance…

Edit: It took a lot less time than that (30 minutes with just the Imagenet tar file)

If you just want a few images/folders from Imagenet, I bet you could download the whole tar from Kaggle onto a cheap Salamander instance, untar, then download what you want from Imagenet, without going over the $1 free credit Salamander gives you.

Hello,

I also faced the same set of problems.

I wrote a tool, ImageNet downloader, which uses the URLs to download ImageNet images. It takes good URLs and skips the bad ones. You can specify how many classes you want and how many images per class you want.
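
To illustrate the "take good URLs, skip bad ones" idea, here is a minimal standard-library sketch. It is not the code of the tool mentioned above. The names `download_images`, `fetch`, and `looks_like_jpeg` are hypothetical, and the JPEG signature check is one assumed way to reject dead links that return an HTML error page instead of an image.

```python
import urllib.request
from pathlib import Path


def fetch(url, timeout=10):
    """Download the raw bytes at `url`; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def looks_like_jpeg(data):
    """JPEG files start with the bytes FF D8 FF."""
    return data[:3] == b"\xff\xd8\xff"


def download_images(urls, out_dir, limit, fetcher=fetch):
    """Save up to `limit` images from `urls` into `out_dir`, skipping bad URLs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        if len(saved) >= limit:
            break
        try:
            data = fetcher(url)
        except Exception:
            continue  # dead host, timeout, 404, malformed URL, ...
        if not looks_like_jpeg(data):
            continue  # e.g. a placeholder page served in place of the image
        target = out_dir / f"{len(saved):05d}.jpg"
        target.write_bytes(data)
        saved.append(target)
    return saved
```

Calling this once per class, with that class's URL list and the desired images-per-class as `limit`, gives the per-class cap described above. The `fetcher` argument exists only so the loop can be tried without network access.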

I also wrote a blog post where I did a little analysis of the state of the ImageNet URLs and came to some interesting conclusions. It’s a 2011 URL list and many of the sites are down. You can check out the plots.

Hope it is helpful.

Since you don’t have space for all of it, perhaps you’d like to work with a subset of Imagenet? FastAI has you covered.
http://files.fast.ai/data/imagenet-sample-train.tar.gz
IIRC there are 700 classes here, though I forget how many images you’d get for each class.

You can try this tiny imageNet dataset: https://tiny-imagenet.herokuapp.com/
It’s just 2.1GB in size and has 200 classes.

How are the classes indexed for both suggested subsets?

I tried downloading the one from @dreambeats but I got low accuracy with PyTorch pretrained models, and I assume it’s because I loaded the images with ImageFolder instead of using the correct class indexing.

@dreambeats, can I nudge you to respond to @gessha? I’d like to know how to use the subset of ImageNet you graciously uploaded with PyTorch’s pretrained models.

Thanks in advance :slight_smile:

I should login here more often :sweat_smile:

I managed to figure out the problem: ImageNet uses its own class indexing, which is different from the alphabetical directory indexing that ImageFolder applies to the current dataset. I suppose this subset of ImageNet doesn’t have all the classes, and that’s why there’s a mismatch between the predictions and the directory listing. You have to use the full indexing to get adequate accuracy with a pretrained network.

I managed to find the full indexing by searching for ilsvrc_synsets.txt, and the file that I used can be found here: https://github.com/val-iisc/nag/blob/master/misc/ilsvrc_synsets.txt
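
In code, the remapping could look roughly like the sketch below. It assumes the synsets file lists one wnid per line (first whitespace-separated token) in the same order as the pretrained model's output indices, and that the subset's folder names are wnids. Both are assumptions worth verifying against your copy of the file. The function names are made up for illustration.

```python
def load_wnid_to_index(lines):
    """Map each wnid to its position in the full synset list (assumed to be model order)."""
    wnids = [line.split()[0] for line in lines if line.strip()]
    return {wnid: index for index, wnid in enumerate(wnids)}


def subset_to_full_index(subset_classes, wnid_to_index):
    """For ImageFolder's sorted class list, return the full-list index of each class.

    Position i of the result is the full index for ImageFolder label i.
    """
    return [wnid_to_index[wnid] for wnid in subset_classes]


# Toy example with three synsets; a real run would read ilsvrc_synsets.txt instead.
full = load_wnid_to_index(["n01440764", "n01443537", "n01484850"])
mapping = subset_to_full_index(["n01443537", "n01484850"], full)
print(mapping)  # ImageFolder label 0 -> full index 1, label 1 -> full index 2
```

With torchvision, `subset_classes` would be the dataset's `classes` attribute, and the mapping can be applied through ImageFolder's `target_transform` argument, for example `target_transform=lambda t: mapping[t]`.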

Hope that helps
