I want to tackle a rather challenging dataset and have stumbled upon the wikiart.org dataset. It is the biggest dataset of paintings I am aware of and has been used in several papers. However, I cannot for the life of me find a way to download it fast.
What I have tried:
- Downloading the version of the WikiArt dataset used by Chan et al. in their ArtGAN paper (ICIP 2016), linked from their GitHub repository. Problem: extremely slow download speeds (~300 KB/s).
- Using the Wikiart Retriever by Lucas David. This also seems to take forever (it ran for several hours), and I am not sure it had downloaded any images by that point.
I cannot find a way to download the dataset from wikiart.org directly. Any help would be much appreciated.
I don’t know the best way. For me, I usually start with cURL. If the download is slow (on the order of KiB/s), I switch to aria2, an ultra-fast download utility that can open multiple connections per file. I download the file directly to my AWS VM, so the transfer runs over AWS’s fast network; the download speed is ~20 MiB/s and the ETA for the file is 19 minutes.
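If it helps, this is the kind of aria2 invocation I mean (just a sketch: the URL is a placeholder, and the connection counts are a sensible starting point, not tuned values):

```python
# Sketch: call aria2 with multiple connections from Python; the URL is
# a placeholder. Equivalent shell command:
#   aria2c -x 16 -s 16 -o wikiart.zip <URL>
import subprocess

url = "https://example.com/wikiart.zip"  # placeholder -- use the real link

subprocess.run(
    [
        "aria2c",
        "-x", "16",           # up to 16 connections per server
        "-s", "16",           # download the file in 16 segments
        "-o", "wikiart.zip",  # output filename
        url,
    ],
    check=True,  # raise if aria2c exits with an error
)
```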
Following up on the WikiArt dataset: I finally downloaded it and ran the standard fastai lesson 1 process on it. And guess what: I got 60% accuracy predicting the exact right style, which I find very impressive. My attention was drawn to this dataset after reading the paper *The Shape of Art History in the Eyes of the Machine*, and as far as I can see, their best accuracy was 60% as well. (Of course the paper goes much further than just classification, but I still find this very impressive.)
One lesson I took away: you may need to filter out corrupt images first, as my learner crashed in `fit_one_cycle` when it encountered strange, truncated images.
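For anyone hitting the same crash, here is a minimal sketch of one way to drop unreadable files before training (the folder name is an assumption, and this uses plain PIL rather than whatever cleanup you end up preferring):

```python
# Sketch: delete images that PIL cannot fully decode, so training
# doesn't crash on truncated files. The data_dir path is an assumption.
from pathlib import Path
from PIL import Image

data_dir = Path("wikiart")  # assumed root of the extracted dataset

for img_path in data_dir.rglob("*.jpg"):  # adjust the glob for other formats
    try:
        with Image.open(img_path) as img:
            img.verify()  # raises if the file structure is corrupt
        # verify() can miss truncation, so force a full decode as well
        with Image.open(img_path) as img:
            img.load()  # raises OSError on truncated image data
    except Exception:
        print(f"removing corrupt image: {img_path}")
        img_path.unlink()
```

If I remember right, fastai also ships a `verify_images` helper that can do this kind of cleanup (including deletion) for you.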
Hello, may I ask where you finally downloaded the dataset from? I have been searching for a link or anything on wikiart.org, but I didn’t find anything. I was thinking about downloading the Chan et al. version of the dataset. Thanks!
Hello, can I ask what techniques you use to train your model on such a large dataset? I want to use this dataset, but I am really struggling to upload it to Colab.
I remember having the same problem, but then it somehow went away. I know that isn’t very helpful, but I can share the code I found in the notebook I was working in at the time:
The commented-out parts are the lines I only ran once, so you will need to uncomment them on the first run. You might also need to update the URLs (I’m not sure they are still correct).
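For illustration, a minimal sketch of the kind of cell being described, assuming fastai v1; the URL is a placeholder rather than the actual link, and the one-off download line is commented out as described above:

```python
# Illustrative sketch only (not the original code); fastai v1 assumed,
# and the URL below is a placeholder rather than the real link.
from fastai.vision import (
    ImageDataBunch, get_transforms, imagenet_stats, untar_data,
)

# One-off download + extraction -- uncomment on the first run.
# untar_data fetches "<url>.tgz" and extracts it:
# path = untar_data("https://example.com/wikiart")  # placeholder URL
path = "data/wikiart"  # wherever the extracted dataset ended up

data = ImageDataBunch.from_folder(
    path,
    train=".",                 # no train/valid split on disk
    valid_pct=0.2,             # hold out 20% for validation instead
    ds_tfms=get_transforms(),  # default fastai augmentations
    size=224,                  # train at 224x224, as in lesson 1
).normalize(imagenet_stats)
```

The key point for Colab is the same as above: download straight to the VM inside the notebook instead of uploading from your own machine.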
Interesting dataset, @davidpfahler. Thanks for sharing. I just wonder: since the dataset is big (25.4 GB), it does not fit into memory. So did you run the lesson 1 process on a machine that has more than 32 GB of RAM?