Kaggle Galaxy Classification


(Rob Harrand) #1

Hi all,

I’m currently on week 2 (part 1 deep learning), and have started to look at applying what’s been covered so far to the Kaggle ‘Galaxy Zoo’ challenge. The problem I have is that the training data, rather than having a binary classification, has 37 probabilities per image, reflecting different classifications and features, as per the competition guidelines. With the dogsvscats work, the training data was divided into folders that represented the class, but that’s not possible here. Has anyone tackled this issue and, if so, could you give me any hints as to how to tell Keras about these training labels, and how to then get the predict functions to output 37 different probabilities per image?

This is all new to me, so I’m sure I’m missing something pretty basic. Thanks.
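A minimal sketch of one way to get per-image probability labels out of a CSV and into batches, instead of relying on one folder per class. It assumes the labels file has an ID in the first column followed by one probability per column; the column names, IDs, values, and the `load_image` callback below are all hypothetical placeholders, not the competition's actual data.

```python
import csv
import io


def load_soft_labels(csv_file):
    """Read a labels CSV into (class_names, {image_id: [p1, ..., pN]}).

    Assumes the first column is an image ID and every remaining
    column holds one probability (37 of them in the real problem).
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    class_names = header[1:]
    labels = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        labels[row[0]] = [float(v) for v in row[1:]]
    return class_names, labels


def batches(labels, load_image, batch_size):
    """Yield (images, targets) pairs of lists, at most batch_size items each."""
    ids = sorted(labels)
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        yield [load_image(i) for i in chunk], [labels[i] for i in chunk]


# Tiny in-memory stand-in for the real file (made-up IDs and values,
# only 3 of the 37 columns shown).
sample = io.StringIO(
    "GalaxyID,Class1.1,Class1.2,Class1.3\n"
    "1001,0.4,0.6,0.0\n"
    "1002,0.8,0.1,0.1\n"
)
class_names, labels = load_soft_labels(sample)
```

A generator along these lines, with `load_image` returning pixel arrays, could plausibly stand in for the folder-based loader, paired with a final layer of 37 outputs rather than 1.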

(Florian Peter) #2

Hi @tentotheminus9,

We’re currently in the same boat. Decided to give it about 2 hours, before moving on to Week 3, and I didn’t get very far.

Found this thread with some interesting insights, and I’m also guessing that we need to replace the flow_from_directory method with something more handcoded.

One crazy out-of-whack idea I just had to make it work with our existing toolset: instead of using 37 categories, turn it into a (37*10)=370 categories problem (with subdirectories for each), approximating the “correct” probability/weight of each of the 37 categories in steps of 0.1
Might work for a basic submission, but obviously can’t be a very good solution :wink:
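To make the binning idea concrete, here is a rough sketch of how each of the 37 probabilities could be mapped to one of 10 pseudo-classes and back. The function names and the three-value example are made up for illustration; decoding to the bin centre is one possible choice, not something specified above.

```python
def to_bin(p, steps=10):
    """Map a probability in [0, 1] to one of `steps` equal-width bins."""
    return min(int(p * steps), steps - 1)


def from_bin(b, steps=10):
    """Decode a bin index back to a probability (the bin centre)."""
    return (b + 0.5) / steps


def encode(probs, steps=10):
    """One pseudo-class per answer: answer_index * steps + bin."""
    return [i * steps + to_bin(p, steps) for i, p in enumerate(probs)]


probs = [0.38, 0.62, 0.0]          # hypothetical values for 3 answers
classes = encode(probs)            # pseudo-class indices in 0..29 here
decoded = [from_bin(c % 10) for c in classes]
```

With 10 bins the decoded value can be off by up to 0.05 per answer. Note also that every image ends up with one pseudo-class per answer (37 in total), so it still would not fit a layout where each image sits in exactly one folder.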

Did you make any progress?


(Sean Lanning) #3

flow_from_directory should work fine. Change your loss function from binary cross-entropy to categorical cross-entropy, your activation function from sigmoid to softmax, and the number of dense outputs at the end from 1 to 37.


(Florian Peter) #4

Thx for your reply! How would I organize the training data folder-wise with flow_from_directory? And can you give me a hint as to why categorical cross-entropy is better than binary cross-entropy for this type of multi-label classification?
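One way to probe the softmax-versus-sigmoid question is to check what each activation can output at all. The sketch below assumes the targets are answers to several questions placed side by side, so they do not sum to 1 per image; that is an assumption about the label layout worth checking against the data, and the target vector is made up.

```python
import math


def softmax(logits):
    """Softmax: outputs are positive and always sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Element-wise sigmoid: each output is independent of the others."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def logit(p):
    """Inverse of the sigmoid, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical target: two questions' answers side by side, summing to 1.6.
target = [0.4, 0.6, 0.5, 0.1]
logits = [logit(p) for p in target]

sig_out = sigmoid(logits)    # matches the target up to floating-point error
soft_out = softmax(logits)   # forced to sum to 1, so it cannot match
```

Under that assumption, independent sigmoid outputs (with a per-output loss such as binary cross-entropy or mean squared error) can represent the targets, whereas a single softmax over all 37 outputs cannot.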


(Alex) #5

All the images are in the same package. I just don’t understand how I can classify them in order to split them into directories. How did you do that?


(Florian Peter) #6

Hey @Alexev,

Part 1 (2018) (the first 3 lessons) makes all of this a lot easier with the new fastai library.
I would love to have another take at Galaxies now, if time permits.
Let me know if you have questions!


(Florian Peter) #7

Just found the time to play with Galaxies!

The ideas from the fastai notebooks work great here as well, already in top 50 and climbing. Will share my messy code here once done. Learning lots, especially as I had to play around with the DataLoader and metrics.
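The metric itself is not shown in this thread. Assuming submissions are scored by root-mean-squared error over every predicted value, a plain-Python sketch of such a metric might look like this (the example numbers are made up):

```python
import math


def rmse(predictions, targets):
    """Root-mean-squared error over every value in two equal-shaped 2-D lists."""
    squared = [
        (p - t) ** 2
        for pred_row, target_row in zip(predictions, targets)
        for p, t in zip(pred_row, target_row)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Two images, two outputs each (37 per image in the real problem).
preds = [[0.5, 0.5], [0.0, 1.0]]
truth = [[0.0, 1.0], [0.0, 1.0]]
score = rmse(preds, truth)
```

The same computation written against the framework's tensors could serve as a custom metric during training.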

Ping me anytime if you can use some help getting started.


(John Wu) #8

I just started on this project (and I’ve completed up to lesson 4 in the first part of the course).

I tried predicting images using only the first classification question (i.e., only Class1.1, Class1.2, and Class1.3 corresponding respectively to featureless, featured/disk, artifact/star/other classes).
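For reference, a small sketch of how accuracy could be computed on the first question alone: take the largest of the three probabilities as the hard label for both target and prediction. The helper names and example rows are hypothetical.

```python
def hard_label(probs):
    """Index of the largest probability (ties go to the first index)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def accuracy(pred_rows, target_rows):
    """Fraction of rows where predicted and target hard labels agree."""
    hits = sum(
        hard_label(p) == hard_label(t)
        for p, t in zip(pred_rows, target_rows)
    )
    return hits / len(target_rows)


# Class1.1, Class1.2, Class1.3 in order.
CLASS1 = ["featureless", "featured/disk", "artifact/star/other"]

targets = [[0.4, 0.6, 0.0], [0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]  # made-up rows
preds = [[0.3, 0.7, 0.0], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]    # made-up rows
acc = accuracy(preds, targets)
```

Collapsing soft labels this way discards how confident the votes were, so accuracy on hard labels and error on the original probabilities can move differently.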

Beginning with a 32x32 image I can achieve about 83% accuracy, but when I transition to 64x64 or larger images I seem to hit a lower ceiling and I can’t get the network to learn any more. This occurs even when I cyclically anneal the learning rate. I’ve adapted code from the (working) dogbreeds data set, so I’m not sure why the network is failing here.

Do you have any tips for training on images of increasing sizes? Thank you!


(Florian Peter) #9

Hi @jwuphysics! Apologies, I’m quite bogged down and might be slow in replying these next few weeks. Here’s the code I used to get into the Top 7 a while back - very messy still :slight_smile:

Hope you find something useful, and let me know how it goes!


(John Wu) #10

Hey @farlion, I really appreciate you sharing your work! I’m learning a ton simply from observing your workflow :wink:

One of the big differences between my first attempt and yours is that I’m using a puny batch size (about 8-16 compared to your 128) while using 4 or 8 workers (whereas you use 1). I’m running my code on a GTX 780 with 3GB RAM. The small batch size seems to affect my ability to find a good learning rate, so I was trying to train my net with a learning rate 2 orders of magnitude lower than the one you selected.
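A common rule of thumb, not something taken from the shared code, is to scale the learning rate roughly in proportion to the batch size. The sketch below uses a hypothetical base learning rate purely to show the arithmetic.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling heuristic: learning rate proportional to batch size."""
    return base_lr * new_batch / base_batch


base_lr = 0.01      # hypothetical value, not the one used in the thread
base_batch = 128

lr_16 = scaled_lr(base_lr, base_batch, 16)
lr_8 = scaled_lr(base_lr, base_batch, 8)
```

By this heuristic, dropping from 128 to 8-16 images per batch suggests a learning rate about 8 to 16 times smaller, which is closer to one order of magnitude than two; a learning-rate finder run at the smaller batch size is still the more reliable check.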

Anyway, thank you for sharing this. I might prod you again if I have more substantive questions!