Kaggle Galaxy Classification

tentotheminus9 · July 24, 2017, 3:15pm

Hi all,

I’m currently on week 2 (part 1 deep learning), and have started to look at applying what’s been covered so far to the Kaggle ‘Galaxy Zoo’ challenge. The problem I have is that the training data, rather than having a binary classification, has 37 probabilities per image, reflecting different classifications and features, as per the competition guidelines. With the dogsvscats work, the training data was divided into folders that represented the class, but that’s not possible here. Has anyone tackled this issue, and if so, could you give me any hints as to how to tell Keras about these training labels, and how to get the predict functions to output 37 different probabilities per image?

This is all new to me, so I’m sure I’m missing something pretty basic. Thanks.
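For anyone landing here with the same question: since the labels are per-image probability vectors rather than class folders, one possible starting point is to read them from the competition's label CSV into a lookup keyed by image ID. This is only a sketch. It assumes the file has a `GalaxyID` column followed by the probability columns (`Class1.1`, `Class1.2`, ...); the helper name `load_labels` and the sample rows are made up for illustration.

```python
import csv
import io


def load_labels(csv_file):
    """Return (class_names, {image_id: [float, ...]}) from a label CSV.

    Assumes the first column is the image ID and every remaining
    column holds a probability in [0, 1].
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    class_names = header[1:]
    labels = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        labels[row[0]] = [float(value) for value in row[1:]]
    return class_names, labels


# Made-up sample showing only 3 of the 37 columns
SAMPLE = """GalaxyID,Class1.1,Class1.2,Class1.3
100008,0.38,0.61,0.01
100023,0.33,0.66,0.01
"""

class_names, labels = load_labels(io.StringIO(SAMPLE))
print(class_names)
print(labels["100008"])
```

With the real file you would pass `open(path, newline="")` instead of the `StringIO`. The resulting vectors can then be stacked into an array of shape (number of images, 37) and used as regression-style targets for a network whose last layer has 37 outputs.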

farlion · August 29, 2017, 7:53pm

Hi @tentotheminus9,

We’re currently in the same boat. Decided to give it about 2 hours, before moving on to Week 3, and I didn’t get very far.

Found this thread with some interesting insights, and I’m also guessing that we need to replace the flow_from_directory method with something more handcoded.

One crazy out-of-whack idea I just had to make it work with our existing toolset: instead of using 37 categories, turn it into a (37*10)=370 categories problem (with subdirectories for each), approximating the “correct” probability/weight of each of the 37 categories in steps of 0.1.
Might work for a basic submission, but obviously can’t be a very good solution.
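To make the 370-category idea concrete, here is a rough sketch of the binning it implies. All function names are hypothetical, and the choice of ten equal-width bins with the midpoint as the representative value is an assumption about what "steps of 0.1" would mean in practice.

```python
N_BINS = 10  # steps of 0.1, as in the idea above


def to_bin(p, n_bins=N_BINS):
    """Map a probability in [0, 1] to a bin index in 0..n_bins-1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {p}")
    # p == 1.0 would otherwise land in bin n_bins, so clamp to the last bin
    return min(int(p * n_bins), n_bins - 1)


def to_class_index(output_idx, p, n_bins=N_BINS):
    """Flatten (which of the 37 outputs, which bin) into one index in 0..369."""
    return output_idx * n_bins + to_bin(p, n_bins)


def bin_centre(bin_idx, n_bins=N_BINS):
    """Probability a bin stands for: the midpoint of its interval."""
    return (bin_idx + 0.5) / n_bins


print(to_class_index(36, 1.0))  # last output, last bin
print(bin_centre(to_bin(0.38)))
```

Writing it out also shows the catch: every image would belong to 37 of the 370 classes at once, so a one-folder-per-class layout still does not fit without copying each image many times.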

Did you make any progress?

sml0820 · August 29, 2017, 9:22pm

flow_from_directory should work fine. Change your loss function from binary cross entropy to categorical cross entropy, your activation function from sigmoid to softmax, and the number of dense outputs at the end from 1 to 37.

farlion · August 30, 2017, 10:44am

Thx for your reply! How would I organize the training data folder-wise with flow_from_directory? And can you give me a hint why categorical cross entropy is better than binary cross entropy for this type of multi-label classification?
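A small numeric illustration may help with the softmax versus sigmoid part of this question. Softmax forces all outputs to sum to 1, which suits problems where each image has exactly one class. Sigmoid scores each output independently. As far as I understand the Galaxy Zoo label format, the 37 values answer several different questions, so a full row does not generally sum to 1. The logits below are arbitrary made-up numbers.

```python
import math


def softmax(logits):
    """Normalise logits so the outputs are positive and sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Squash each logit into (0, 1) independently of the others."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


logits = [2.0, 1.0, -1.0, 0.5]  # arbitrary example values
soft = softmax(logits)
sig = sigmoid(logits)

print("softmax sum:", sum(soft))  # constrained to 1
print("sigmoid sum:", sum(sig))   # unconstrained
```

So a single softmax over all 37 outputs cannot reproduce a target row whose values add up to more than 1, whereas independent sigmoid outputs (or a softmax per question) can.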

Alexev · January 25, 2018, 8:12pm

All the images are in the same package. I just don’t understand how I can classify them to split them into directories. How did you do that?

farlion · February 3, 2018, 8:03pm

Hey @Alexev,

part 1 (2018) (the first 3 lessons) makes all of this a lot easier with the new fastai library.
I would love to have another take at Galaxies now, if time permits.
Let me know if you have questions!

farlion · March 12, 2018, 11:44pm

Just found the time to play with Galaxies!

The ideas from the fastai notebooks work great here as well, already in top 50 and climbing. Will share my messy code here once done. Learning lots, especially as I had to play around with the DataLoader and metrics.
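On the metrics side, the competition scores submissions by root mean squared error over all predicted probabilities, so a custom metric along these lines is probably what is needed. This is a plain-Python sketch of the formula only; the function name `rmse` is my own and it is not taken from the fastai library or from the notebook mentioned in this thread.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error over every value of every row."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets need the same number of rows")
    squared_sum = 0.0
    count = 0
    for pred_row, target_row in zip(predictions, targets):
        if len(pred_row) != len(target_row):
            raise ValueError("row lengths differ")
        for p, t in zip(pred_row, target_row):
            squared_sum += (p - t) ** 2
            count += 1
    return math.sqrt(squared_sum / count)


# Made-up example: one image, two outputs, one of them off by 1.0
print(rmse([[1.0, 0.0]], [[0.0, 0.0]]))
```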

Ping me anytime if you can use some help getting started.

jwuphysics · May 22, 2018, 8:10pm

I just started on this project (and I’ve completed up to lesson 4 in the first part of the course).

I tried predicting images using only the first classification question (i.e., only Class1.1, Class1.2, and Class1.3 corresponding respectively to featureless, featured/disk, artifact/star/other classes).
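One way to turn the three Class1 probabilities into a single label for an accuracy score is to take the largest one. The post does not say exactly how the labels were derived, so treat this as an assumption; the helper names are hypothetical and the numbers are made up.

```python
CLASS1_NAMES = ["featureless", "featured/disk", "artifact/star/other"]


def hard_label(class1_probs):
    """Index of the largest of (Class1.1, Class1.2, Class1.3)."""
    if len(class1_probs) != 3:
        raise ValueError("expected exactly three probabilities")
    return max(range(3), key=lambda i: class1_probs[i])


def accuracy(pred_probs, true_probs):
    """Fraction of images where predicted and true hard labels agree."""
    hits = sum(hard_label(p) == hard_label(t)
               for p, t in zip(pred_probs, true_probs))
    return hits / len(true_probs)


print(CLASS1_NAMES[hard_label([0.38, 0.61, 0.01])])
```

Note that this throws away how confident the volunteers were: a galaxy labelled 0.51/0.48/0.01 and one labelled 0.98/0.01/0.01 end up in the same class.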

Beginning with a 32x32 image I can achieve about 83% accuracy, but when I transition to 64x64 or larger images I seem to hit a lower ceiling and I can’t get the network to learn any more. This occurs even when I cyclically anneal the learning rate. I’ve adapted code from the (working) dogbreeds data set, so I’m not sure why the network is failing here.
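For readers unfamiliar with cyclic annealing, here is a minimal sketch of one common variant, cosine annealing with restarts. The post does not say which schedule was used, so the formula, the function name, and the example values are all assumptions.

```python
import math


def cosine_annealed_lr(step, steps_per_cycle, lr_max, lr_min=0.0):
    """Learning rate at `step`, restarting at lr_max every `steps_per_cycle` steps."""
    # Position within the current cycle, from 0.0 (start) up to just under 1.0
    pos = (step % steps_per_cycle) / steps_per_cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * pos))


# Two cycles of four steps each, with a made-up peak learning rate
schedule = [cosine_annealed_lr(s, steps_per_cycle=4, lr_max=0.01)
            for s in range(8)]
for step, lr in enumerate(schedule):
    print(step, round(lr, 5))
```

The rate falls smoothly within each cycle and jumps back to the peak at the start of the next one.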

Do you have any tips for training on images of increasing sizes? Thank you!

farlion · May 25, 2018, 11:34am

Hi @jwuphysics! Apologies, I’m quite bogged down and might be slow in replying these next few weeks. Here’s the code I used to get into the Top 7 a while back - very messy still:

https://gist.github.com/workflow/60e4f586e3f7ef0aaefde091c3a488b3 (galaxies-deepgreen.ipynb)

Hope you find something useful, and let me know how it goes!

jwuphysics · May 25, 2018, 2:39pm

Hey @farlion, I really appreciate you sharing your work! I’m learning a ton simply from observing your workflow.

One of the big differences between my first attempt and yours is that I’m using a puny batch size (about 8-16 compared to your 128) while using 4 or 8 workers (whereas you use 1). I’m running my code on a GTX 780 with 3GB of GPU memory. This seems to affect my ability to find a good learning rate, and so I was trying to train my net using a learning rate that was 2 orders of magnitude lower than what you selected.
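On the link between batch size and learning rate, one commonly cited rule of thumb is to scale the learning rate in proportion to the batch size. It is a heuristic rather than a guarantee, and the reference learning rate below is a made-up number, not the one from the notebook.

```python
def scaled_lr(reference_lr, reference_batch_size, batch_size):
    """Linear scaling heuristic: learning rate proportional to batch size."""
    if reference_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return reference_lr * batch_size / reference_batch_size


# Hypothetical reference: lr 0.1 tuned at batch size 128
for bs in (8, 16, 128):
    print(bs, scaled_lr(0.1, 128, bs))
```

By this heuristic, dropping from 128 to 8-16 suggests a rate roughly 8 to 16 times lower, which is less than two orders of magnitude, so batch size alone may not explain the whole gap.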

Anyway, thank you for sharing this. I might prod you again if I have more substantive questions!