jeremy
(Jeremy Howard)
April 17, 2018, 6:13am
1
Turns out there’s a blacklist of ImageNet validation files here: https://github.com/dailuo/create_dataset_imagenet/blob/master/ILSVRC2014_clsloc_validation_blacklist.txt
For each entry there should be a file called ILSVRC2012_val_{i}.JPEG in one of the validation category folders, where {i} is the blacklist number zero-padded to 8 digits, e.g. ILSVRC2012_val_00045880.JPEG.
I wonder if someone could be so kind as to create a list of paths and file names for all these images, so we can easily remove them from our training? You’ll need a copy of ImageNet to be able to do this, of course.
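For anyone following along, here is a minimal sketch of the naming rule described above. The function name val_name is hypothetical (it is not part of any tool mentioned in this thread), and it assumes the blacklist entries are plain decimal numbers without leading zeros.

```shell
# Hypothetical helper: map a blacklist number to the expected file name.
# Assumes the argument is a plain decimal integer such as 36 or 45880.
val_name() {
    # %08d left-pads the number with zeros to a width of 8 digits
    printf 'ILSVRC2012_val_%08d.JPEG\n' "$1"
}

val_name 45880   # prints ILSVRC2012_val_00045880.JPEG
```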
vikram
(Vikram Kalabi)
April 17, 2018, 6:21am
2
If 36 is blacklisted, are you expecting the full path to ILSVRC2012_val_00000036.JPEG?
jeremy
(Jeremy Howard)
April 17, 2018, 6:32am
3
Yes I believe that’s how it works. Thanks for clarifying.
(Note it needs to be the path assuming all the val images have been moved into per-class folders, since we use from_paths().)
vikram
(Vikram Kalabi)
April 17, 2018, 7:31am
4
I do not have access to the ImageNet dataset, but I recreated the use case with the dogs and cats dataset. This is the directory structure:
$ tree -d
.
|-- models
|-- sample
|   |-- train
|   |   |-- cats
|   |   `-- dogs
|   `-- valid
|       |-- cats
|       `-- dogs
|-- test1
|-- tmp
|-- train
|   |-- cats
|   `-- dogs
`-- valid
    |-- cats
    `-- dogs
I created a sample blacklist file like this:
$ find valid -name "*.jpg" | head | cut -d. -f2 > black_list.txt
$ cat black_list.txt
1288
1460
12272
4671
5162
8819
3465
2836
11503
5769
Using xargs and find, I can now get the full path of every file whose name contains one of these strings:
$ cat black_list.txt | xargs -n1 -- bash -c 'find `pwd`/valid/ -name "*$0*"'
/media/vikram/fastai_data/dogscats/valid/cats/cat.1288.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.1460.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.12272.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.4671.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.5162.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.8819.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.3465.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.2836.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.11503.jpg
/media/vikram/fastai_data/dogscats/valid/cats/cat.5769.jpg
Is this what was expected?
For the actual problem, using ILSVRC2012_val_$0.JPEG as the pattern should fetch the files. To add the leading zeros, we can use printf formatting:
cat black_list.txt | xargs -n1 -- bash -c 'printf "%08d\n" $0' | xargs -n1 -- bash -c 'find `pwd`/valid/ -name "ILSVRC2012_val_$0.JPEG"'
jeremy
(Jeremy Howard)
April 17, 2018, 2:16pm
5
Good thinking! Here’s what I ended up using (a bit faster than find, and it avoids firing up bash many times):
for i in $(cat blacklist.txt | xargs -n1 printf "%08d\n")
do ls */ILSVRC2012_val_$i.JPEG
done
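If you want the result as a reusable list, the loop above could be wrapped roughly as follows. This is only a sketch under assumptions: the function name list_blacklisted and the output file name are made up for illustration, the blacklist is assumed to hold one plain decimal number per line, and the validation images are assumed to already sit in per-class folders directly under the given directory.

```shell
# list_blacklisted BLACKLIST VAL_DIR   (hypothetical helper, not from the thread)
# Prints the path of every blacklisted validation image found under
# VAL_DIR/<class>/ and reports names that were not found on stderr.
list_blacklisted() {
    blacklist=$1
    val_dir=$2
    # "|| [ -n "$idx" ]" keeps a last line that has no trailing newline
    while IFS= read -r idx || [ -n "$idx" ]; do
        [ -n "$idx" ] || continue                  # skip blank lines
        name=$(printf 'ILSVRC2012_val_%08d.JPEG' "$idx")
        found=0
        # The glob expands to matching files in any class folder; if nothing
        # matches, the pattern stays literal and the -e test rejects it.
        for path in "$val_dir"/*/"$name"; do
            if [ -e "$path" ]; then
                printf '%s\n' "$path"
                found=1
            fi
        done
        [ "$found" -eq 1 ] || printf 'missing: %s\n' "$name" >&2
    done < "$blacklist"
}
```

Something like `list_blacklisted blacklist.txt valid > blacklisted_paths.txt` would then write the paths to a file you can review before moving or deleting anything.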
vikram
(Vikram Kalabi)
April 17, 2018, 7:19pm
6
Whoa! This is a way cooler approach. Thanks!
Seb
May 24, 2019, 5:45pm
7
Since I stumbled upon this thread and was wondering whether to blacklist or not, I’ll clarify here for the next person in my situation:
According to this discussion: https://github.com/stanford-futuredata/dawn-bench-entries/issues/36 , the DAWNBench ImageNet competition does NOT want us to remove any of those images from the validation set.