The Wikiart dataset

(David Pfahler) #1

I want to go for a rather challenging dataset and have stumbled upon the wikiart.org dataset. It is the biggest dataset of paintings and has been used in several papers. However, I cannot for the life of me find a way to download it quickly.

What I have tried:

  • Download the version of the Wikiart dataset used by Chan et al. in their ArtGAN paper (ICIP 2016), linked from their GitHub repository. Problem: extremely slow download speeds (~300 KB/s).
  • Use the Wikiart Retriever by Lucas David. It also seems to take forever (it ran for several hours, and I am not sure it had downloaded any images by that point).

I cannot find a way to download the dataset from wikiart.org directly. Any help would be much appreciated.

2 Likes

(Benoit) #2

For me, right now, the download speed of http://www.cs-chan.com/source/ICIP2017/wikiart.zip is around 900 kB/s, and the download will take about 7 hours.
It may depend on the time of day and the load on the server.

0 Likes

(David Pfahler) #3

So that is the best / only way to download the dataset, then?

0 Likes

(Michal Wawrzyniuk) #4

Try

wget -c <your download url>

It will resume the download from where it left off.

M

1 Like

(Cedric Chee) #5

I don’t know the best way. For me, I usually start with cURL. If the download is slow (~KiB/s), I switch to aria2. aria2 is an ultra-fast download manager. I download the file directly to my AWS VM, so the transfer uses AWS’s fast down/uplink and runs at ~20 MiB/s; the ETA for the file is about 19 minutes.

aria2c --file-allocation=none -c -x 5 -s 5 http://www.cs-chan.com/source/ICIP2017/wikiart.zip

  • Multi-connection download
  • Multi-threaded
  • Lightweight

I am not affiliated with aria2. Try it out and good luck!

6 Likes

(David Pfahler) #6

Thanks! I will give it a try. But I suspect the hosting server just doesn’t go any faster.

0 Likes

(David Pfahler) #7

Following up on the Wikiart dataset, I finally downloaded it and ran the standard fastai lesson 1 process on it. And guess what: I got 60% accuracy predicting exactly the right style. I find that very impressive. My attention was drawn to this dataset after reading the paper The Shape of Art History in the Eyes of the Machine, and as far as I can see, their best accuracy was 60% as well. (Of course, the paper goes much further than just doing classification, but I still find this very impressive.)

One lesson I took away: sometimes you need to filter out corrupt images, as my learner crashed in .fit_one_cycle when it encountered strange, truncated images; a rough way to find them is sketched below.
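The sketch below tries to fully decode every image with PIL and collects the files that fail. The path is just a placeholder for wherever the archive was extracted, and this is not the exact code I used:

from pathlib import Path
from PIL import Image

data_path = Path('wikiart')  # wherever the images were extracted

bad_images = []
for img_path in data_path.rglob('*.jpg'):
    try:
        with Image.open(img_path) as img:
            img.load()  # force a full decode; truncated files raise an OSError
    except Exception:
        bad_images.append(img_path)

print(len(bad_images), 'unreadable images found')
# for p in bad_images: p.unlink()  # delete them once you have reviewed the list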

1 Like

(Orfeas Menis) #8

Hello, may I ask where you finally downloaded the dataset from? I have been searching for a link or anything at wikiart.org, but I didn’t find anything. I was thinking about downloading the Chan et al. version of the dataset. Thanks!

0 Likes

(Jacopo Attolini) #9

Hello, can I ask what techniques you use to train your model with such large datasets? I want to use this dataset, but I am really struggling to upload it to Colab.

0 Likes

(David Pfahler) #10

Update: after I contacted Professor Chan, he was kind enough to update the links here: https://github.com/cs-chan/ArtGAN/tree/master/WikiArt%20Dataset

1 Like

#11

After I download the dataset from this URL, I always get an error when I unzip the zip file. I’m a little confused about how to solve this problem.

0 Likes

(David Pfahler) #12

I remember having the same problem, but then it somehow went away. I know that isn’t very helpful, but I can share the code I found in the notebook I was working in at the time:

base_dir = '.'
zipfile = base_dir + "/wikiart.zip"

# Download the image archive (-c resumes a partial download) and extract it
# (-n skips files that already exist). Uncomment to run.
#!wget http://web.fsktm.um.edu.my/~cschan/source/ICIP2017/wikiart.zip -O "{zipfile}" -c
#!unzip -n "{zipfile}" -d "{base_dir}"

csvzipfile = base_dir + "/wikiart_csv.zip"

# Same for the CSV label files.
#!wget http://web.fsktm.um.edu.my/~cschan/source/ICIP2017/wikiart_csv.zip -O "{csvzipfile}" -c
#!unzip "{csvzipfile}" -d "{base_dir}"

The commented parts are the lines I only ran once, so you need to uncomment them. You might also need to update the URLs (I’m not sure whether they are still correct).
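If unzip keeps reporting errors, my best guess is that the download was incomplete. Something like this (just an idea, not from my original notebook) resumes the partial file and tests the archive before extracting:

# resume the (possibly incomplete) download, then test the archive without extracting
#!wget http://web.fsktm.um.edu.my/~cschan/source/ICIP2017/wikiart.zip -O "{zipfile}" -c
#!unzip -t "{zipfile}"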

0 Likes

(Minh) #13

Interesting dataset @davidpfahler. Thanks for sharing. I just wonder: since the dataset is big (25.4 GB), it does not fit into memory. So did you run the lesson 1 process on a machine with more than 32 GB of RAM?

0 Likes

(David Pfahler) #14

I ran this on an AWS p2.xlarge machine, but I don’t remember how I handled the data loading process.
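For what it’s worth, the standard fastai (v1) lesson 1 setup does not need the whole dataset in memory, because the DataBunch reads each batch of images from disk on the fly. Here is a rough sketch of what that looks like, assuming the images are extracted into one folder per style (not guaranteed to match my original notebook):

from fastai.vision import *

path = Path('wikiart')  # one subfolder per style, e.g. wikiart/Impressionism/...

# Images are decoded from disk batch by batch, so the ~25 GB of data never
# has to fit in RAM at once.
data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224, bs=64
                                  ).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(4)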

1 Like