Data Collection Lesson Learned

As part of lesson 2/3 I’m working on a mushroom classifier for various species found in the region where I live. I learned a lesson about data collection the hard way and wanted to share it with everyone.

I started by finding a list of about 50 common species found in my area and began downloading images from Google Images. I briefly went through and cleaned up the images, removing scientific diagrams and very blatant misclassifications (I’m no expert, so I could only be certain about the blatant ones).

Well, it turns out that on websites where photos of mushrooms appear, many other photos are also present, often of mushrooms in the same or a similar genus. This meant that virtually every class ended up with badly mislabeled images.

I am now re-collecting the images, this time with a more manual process: inspecting the captions on Google Images and using a one-click downloader Chrome extension.
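
For anyone who would rather script part of that caption check, here is a minimal sketch. It assumes you already have (image URL, caption) pairs from somewhere; the function names and the example data are hypothetical, not from any particular tool. It only keeps an image if every word of the species name appears in its caption.

```python
import re


def caption_matches(caption: str, species: str) -> bool:
    """Return True if every word of the species name appears in the caption."""
    caption_words = set(re.findall(r"[a-z]+", caption.lower()))
    return all(word in caption_words for word in species.lower().split())


def keep_matching(candidates, species):
    """Keep only the URLs whose caption mentions the full species name.

    `candidates` is a list of (url, caption) pairs (an assumed input format).
    """
    return [url for url, caption in candidates if caption_matches(caption, species)]


# Hypothetical example data, for illustration only.
candidates = [
    ("https://example.com/a.jpg", "Amanita muscaria (fly agaric) in pine forest"),
    ("https://example.com/b.jpg", "Amanita pantherina near birch"),
]
print(keep_matching(candidates, "Amanita muscaria"))
```

This is only a first filter. A caption can mention the right species while the photo shows a different one (comparison pages, for example), so it does not replace looking at the images.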

Hopefully someone can learn from my mistake and watch out for cross-contamination between classes that are very similar.
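
One cheap check for a narrow form of cross-contamination is to look for the exact same file sitting in more than one class folder. The sketch below assumes a layout of one subfolder per class under a root folder (the folder name `mushrooms` in the usage note is hypothetical). It only catches byte-identical files, not different photos of a look-alike species, so it complements manual review rather than replacing it.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_cross_class_duplicates(root):
    """Map file hash -> sorted class folder names, for files found in 2+ classes.

    Assumes the layout root/<class_name>/<image file>.
    """
    classes_by_hash = defaultdict(set)
    for path in Path(root).glob("*/*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            classes_by_hash[digest].add(path.parent.name)
    # Keep only hashes that appear in more than one class folder.
    return {
        digest: sorted(classes)
        for digest, classes in classes_by_hash.items()
        if len(classes) > 1
    }
```

Calling `find_cross_class_duplicates("mushrooms")` returns an empty dict when no file is shared between classes; any entry it does return lists the class folders that contain the same image.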

I had the same problem with 24 genera of orchids. The data was such a mess, not to mention most people can’t take a decent photo of an orchid.

In the end I did something simpler in order to get something done, rather than use my own dataset, but I found an academic paper with an orchid dataset, so I’m going to try that one at some point. I figure it should be quite clean, and I’ve got their score to compare mine against. Might be a useful exercise for you as well with your mushrooms.


I should add that we shouldn’t underestimate how bad search engines are at getting this stuff right, even for much simpler things you’d think they’d be OK with.

I created a nice clowns vs scary clowns dataset, and it required a lot of cleaning.


  1. I think it’s always worth spending a minute giving your images a quick look.
  2. Don’t let DuckDuckGo organise your kids’ birthday party.

Thanks for your reply!

I did briefly see if there were any academic datasets, but since there are so many species of mushroom, it was hard to find one that would work for the species that are local to my area.