Help wanted (easy one!) - clean up imagenet validation set

(Jeremy Howard (Admin)) #1

Turns out there’s a blacklist of imagenet validation files here: https://github.com/dailuo/create_dataset_imagenet/blob/master/ILSVRC2014_clsloc_validation_blacklist.txt

For each blacklisted number {i}, there should be a file called ILSVRC2012_val_{i}.JPEG in one of the validation category folders, where {i} is zero-padded to eight digits, e.g. ILSVRC2012_val_00045880.JPEG.
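
As a quick illustration of the naming scheme, printf with %08d turns a blacklist number into the zero-padded file name. This is only a sketch; the function name val_name is made up for the example.

```shell
# Hypothetical helper (not part of any ImageNet tooling):
# map a blacklist number to the validation file name it refers to.
val_name() {
  printf 'ILSVRC2012_val_%08d.JPEG\n' "$1"
}

val_name 45880   # prints ILSVRC2012_val_00045880.JPEG
```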

I wonder if someone could be so kind as to create a list of path and file names for all these images, so we can easily remove them from our training? You’ll need a copy of imagenet to be able to do this, of course.


(Vikram Kalabi) #2

If 36 is blacklisted, are you expecting the full path to ILSVRC2012_val_00000036.JPEG?


(Jeremy Howard (Admin)) #3

Yes I believe that’s how it works. Thanks for clarifying.

(Note it needs to be the path assuming all the val images have been moved into per-class folders, since we use from_paths().)


(Vikram Kalabi) #4

I do not have access to the ImageNet dataset, but I recreated the use case with the dogs and cats dataset. This is the directory structure:

$ tree -d
.
|-- models
|-- sample
|   |-- train
|   |   |-- cats
|   |   `-- dogs
|   `-- valid
|       |-- cats
|       `-- dogs
|-- test1
|-- tmp
|-- train
|   |-- cats
|   `-- dogs
`-- valid
    |-- cats
    `-- dogs

I created a sample blacklist file like this:

$ find valid -name "*.jpg" | head | cut -d. -f2 > black_list.txt
$ cat black_list.txt
1288
1460
12272
4671
5162
8819
3465
2836
11503
5769

Using xargs and find, I can now get the full path of every file whose name contains one of these strings:

$ cat black_list.txt | xargs -n1 -- bash -c 'find `pwd`/valid/ -name "*$0*"'
/media/vikram/fastai_data/dogscats/valid/cats/cat.1288.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.1460.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.12272.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.4671.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.5162.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.8819.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.3465.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.2836.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.11503.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.5769.jpg

Is this what was expected?
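
One caveat with the "*$0*" pattern: it matches any file name that contains the number as a substring, so 1288 would also match a file such as cat.12880.jpg if one existed. Below is a sketch of a stricter variant, assuming the <label>.<id>.jpg naming shown in the listing above. The function name find_by_id is made up for illustration.

```shell
# Hypothetical helper: find files under directory $1 whose id field is exactly $2.
# Assumes names of the form <label>.<id>.jpg, e.g. cat.1288.jpg.
find_by_id() {
  find "$1" -type f -name "*.$2.jpg"
}

# Example use (left commented out so the snippet runs without the dataset):
# while read -r id; do find_by_id "$PWD/valid" "$id"; done < black_list.txt
```

Anchoring the pattern on the surrounding dots is what keeps 1288 from matching 12880 or 11288.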

For the actual problem, using ILSVRC2012_val_$0.JPEG as the pattern should find the files. To add the leading zeros, we can use printf formatting:


$ cat black_list.txt | xargs -n1 -- bash -c 'printf "%08d\n" $0' | xargs -n1 -- bash -c 'find `pwd`/valid/ -name "ILSVRC2012_val_$0.JPEG"'


(Jeremy Howard (Admin)) #5

Good thinking! Here’s what I ended up using (a bit faster than find, and it avoids firing up bash many times):

for i in $(cat blacklist.txt | xargs -n1 printf "%08d\n")
  do ls */ILSVRC2012_val_$i.JPEG
done
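
A sketch of the same idea wrapped in a function, so the matches can be redirected to a file and ids with no matching image are reported instead of silently skipped. The function name list_blacklisted, its arguments, and the output file name in the usage line are assumptions for illustration, not something from the posts above.

```shell
# Hypothetical helper, not from the original thread.
# Usage: list_blacklisted BLACKLIST_FILE VAL_DIR
# Prints the path of each blacklisted image found in VAL_DIR/<class>/;
# ids with no matching file are reported on stderr.
list_blacklisted() {
  blacklist=$1
  val_dir=$2
  # "|| [ -n "$id" ]" keeps the last line even without a trailing newline
  while read -r id || [ -n "$id" ]; do
    [ -n "$id" ] || continue    # skip blank lines
    name=$(printf 'ILSVRC2012_val_%08d.JPEG' "$id")
    found=0
    for path in "$val_dir"/*/"$name"; do
      # An unmatched glob stays literal in sh/bash, so test that it exists
      if [ -e "$path" ]; then
        printf '%s\n' "$path"
        found=1
      fi
    done
    [ "$found" -eq 1 ] || printf 'no match for id %s\n' "$id" >&2
  done < "$blacklist"
}
```

Run from the validation directory, it could be used as: list_blacklisted blacklist.txt . > blacklisted_paths.txt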

(Vikram Kalabi) #6

Whoa! This is a way cooler approach. Thanks!


#7

Since I stumbled upon this thread and was wondering whether to apply the blacklist or not, I’ll clarify here for the next person in my situation:

According to this discussion: https://github.com/stanford-futuredata/dawn-bench-entries/issues/36 , the DAWNBench ImageNet competition does NOT want us to remove any of those images from the validation set.
