I’m doing the course with my ten-year-old daughter. We made a classifier for deep-sky Messier Objects.
Instead of using all 110 objects, I used only the list of named objects at the bottom of this page: http://www.seasky.org/astronomy/astronomy-messier.html. This way I was getting only the most commonly photographed and distinctive ones.
However, because that list only includes one globular cluster (the Great Cluster in Hercules, M13), and I really wanted to find out whether my model would be able to tell different globular clusters apart, I added M3, M4, M15, and M80. I chose these because I think they are distinct from each other: M3 is particularly brilliant. M4 has more easily resolvable stars and a distinctive bar across an egg-shaped ring of bright stars. M15 has a particularly dense and bright core. M80 also has a dense core but tapers off much more quickly. I figured that if there are any globulars the model can tell apart, these are them.
Anyways, first step was to gather the images from Google. I followed lesson 2 and downloaded about 200 images for each object. I could already see from my Google search results that this wasn’t going to be easy, and I would need lots of cleaning - the images were a mess, with most of them quite obviously wrongly labelled. Some weren’t Messier Objects at all. Some were wide sky shots, or pictures of telescopes.
Because I had so many classes of images to download, instead of running the download command manually for each one, I just created a quick loop. I made a CSV listing the Messier Objects and used pandas to take the second column (the first column was the name, the second was the M## code) and make a list out of it to iterate over:
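import pandas as pd
from fastai.vision import *  # fastai v1, as in lesson 2; provides download_images
# No header row: column 0 is the name, column 1 is the M## code
messiers_table = pd.read_csv('./images/messiers.csv', header=None)
messiers = messiers_table[1].tolist()
for m in messiers:
    print(m)
    # URL file is images/<code>/download; images are saved to images/<code>
    download_images('images/' + m + '/download', 'images/' + m, max_pics=200)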
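For anyone who wants to try the same loop without pandas, here is a minimal sketch using only the standard library. It assumes the same layout as above (a headerless CSV with the M## code in the second column, and a URL file at images/<code>/download). The helper name `messier_jobs` is made up for illustration.

```python
import csv
from pathlib import Path


def messier_jobs(lines, root="images"):
    """Yield (code, url_file, dest_dir) for each CSV line.

    `lines` is any iterable of CSV lines (an open file works).
    Assumes no header row and the M## code in the second column.
    """
    for row in csv.reader(lines):
        if len(row) < 2 or not row[1].strip():
            continue  # skip blank or malformed rows
        code = row[1].strip()
        dest = Path(root) / code
        yield code, dest / "download", dest


# Hypothetical usage with fastai v1's download_images:
#
# with open("images/messiers.csv", newline="") as f:
#     for code, url_file, dest in messier_jobs(f):
#         if url_file.exists():  # skip objects with no URL file yet
#             download_images(url_file, dest, max_pics=200)
```

Checking that the URL file exists first means one missing file doesn't stop the whole run.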
I trained my model using the same parameters as in the lesson notes: resnet34 with 20% of the data held out for validation. After the first round of training, my accuracy was only 45%. YIKES! Unfreezing and retraining didn’t help much. Clearly I was going to need to clean my data.
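In case it helps to see what a 20% validation split means in practice, here is a rough sketch in plain Python. This is not how fastai does it internally. `split_train_valid` is a hypothetical helper and the seed is arbitrary.

```python
import random


def split_train_valid(items, valid_pct=0.2, seed=42):
    """Shuffle items reproducibly and hold out valid_pct of them for validation."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed so the split is the same on every run
    rng.shuffle(items)
    n_valid = int(len(items) * valid_pct)
    return items[n_valid:], items[:n_valid]


# With roughly 5000 images, about 1000 end up in the validation set.
train, valid = split_train_valid(range(5000))
print(len(train), len(valid))
```

Keeping the seed fixed matters when cleaning and retraining repeatedly, since otherwise the accuracy numbers are measured on a different validation set each time.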
Soooo… ran ImageCleaner and started cleaning… At first, I was trying to relabel wrongly labelled images by actually trying to recognize what the images were, but it turns out that this takes a bit longer for a human than recognizing black bears and teddy bears. It was taking me several minutes per page… After doing this for a couple of hours, and several hundred images later, I decided to just delete images that were clearly not Messiers, and do the relabelling later. My dataset was over 5000 images, and I had only cleaned a few hundred, and the quality wasn’t getting much better as I was going through the top losses list.
So, retrained the model on the slightly cleaned dataset, and got the accuracy up to about 50% - not much of an improvement.
I tried a few more iterations of cleaning, retraining, recleaning, and so on, and I am now at an accuracy of about 65%. Much better, but still not good enough. But at least now the top losses are a lot more sensible:
Here are some of my most confused. At least they are sensible:
[('M76', 'M27', 7),
 ('M15', 'M13', 6),
 ('M3', 'M13', 6),
 ('M11', 'M24', 5),
 ('M81', 'M82', 5),
 ('M82', 'M81', 5),
 ('M24', 'M11', 4),
 ('M24', 'M6', 4),
 ('M44', 'M11', 4),
 ('M6', 'M11', 4),
 ('M6', 'M24', 4)]
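Several of these pairs show up in both directions (M81/M82, M11/M24, M6/M24), so it can help to add the two directions together. Here is a small sketch using the counts from the list above. `merge_directions` is just an illustrative helper name, and I'm assuming each tuple is (actual, predicted, count).

```python
from collections import Counter

# Counts copied from the most-confused list above
most_confused = [
    ("M76", "M27", 7), ("M15", "M13", 6), ("M3", "M13", 6),
    ("M11", "M24", 5), ("M81", "M82", 5), ("M82", "M81", 5),
    ("M24", "M11", 4), ("M24", "M6", 4), ("M44", "M11", 4),
    ("M6", "M11", 4), ("M6", "M24", 4),
]


def merge_directions(pairs):
    """Sum counts for A->B and B->A into one total per unordered pair."""
    totals = Counter()
    for a, b, n in pairs:
        totals[tuple(sorted((a, b)))] += n  # same key for both directions
    return totals.most_common()  # highest total first


for pair, n in merge_directions(most_confused):
    print(pair, n)
```

On this list, M81/M82 comes out on top with 10, followed by M11/M24 with 9 and M6/M24 with 8.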
The lesson for me was that it’s really hard to get a nice, clean, labelled dataset to start with. Google Images is a quick way to get images, but not very clean ones!