Lesson 1 discussion

(anamariapopescug) #21

hi @leahob. same here, i was a bit shy of the second half after submitting (running more epochs seemed to improve the score, didn’t have a chance to fiddle with the optimizer parameters much)

(anamariapopescug) #22

this happened to me on some runs, but not on others (e.g. for some values of num_epochs but not for others)

(Jeremy Howard) #23

This is a wonderful resource for everyone - thanks @melissa.fabros !

(Jeremy Howard) #24

What was your validation set accuracy? It’s easier to make suggestions once we know how you’re going so far.

(Jeremy Howard) #25

Actually I can give you a strong hint - look at the equation here: https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/details/evaluation . Have a think about what minor change to your output might make a big difference to that evaluation function. Hint: I went from 105th spot to 37th spot by running two little commands in my text editor - no need to even start up my AWS instance…
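To make the idea concrete without giving the whole game away: the metric is log loss, which punishes confident mistakes brutally, so even a crude clip of the 0/1 predictions changes the score a lot. A rough numpy sketch (the 0.05/0.95 bounds are just an illustrative choice):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Kaggle's metric: mean binary cross-entropy
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1., 1., 0., 1.])
raw = np.array([1., 1., 0., 0.])        # over-confident 0/1 outputs; the last one is wrong
clipped = np.clip(raw, 0.05, 0.95)      # hedge every prediction away from 0 and 1

print(log_loss(y_true, raw))      # the one confident mistake dominates: ~8.6
print(log_loss(y_true, clipped))  # ~0.79, despite getting the same image 'wrong'
```

Same predictions, same one mistake, wildly different score.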

(jbrown81) #26

Thanks @jeremy
First, I ran the vgg model as is (with 1000 class output, without fine-tuning to dogsvscatsredux) to generate predictions on the first 4 test images like so:
from vgg16 import Vgg16
vgg = Vgg16()
batches = vgg.get_batches(path+'test', batch_size=4, shuffle=False)
imgs, labels = next(batches)
vgg.predict(imgs, True)
and the output I get is:
(array([ 0.2321, 0.5742, 0.2567, 0.5104], dtype=float32),
 array([285, 246, 229, 285]),
 [u'Egyptian_cat', u'Great_Dane', u'Old_English_sheepdog', u'Egyptian_cat'])

Next, I fine-tuned the vgg model to generate two-class predictions like so:
vgg = Vgg16()
batches = vgg.get_batches(path+'train', batch_size=batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size=batch_size)
vgg.fit(batches, val_batches, nb_epoch=1)
Training completed, then I ran prediction on the test set:
batches = vgg.get_batches(path+'test', batch_size=4, shuffle=False)
imgs, labels = next(batches)
vgg.predict(imgs, True)
and the output I get is:
(array([ 1., 1., 1., 1.], dtype=float32),
 array([0, 1, 1, 0]),
 [u'tench', u'goldfish', u'goldfish', u'tench'])

The predictions on the first 4 test images look correct (cat, dog, dog, cat). What I’m puzzled by is why the probabilities are always exactly 1. With a softmax output, I expect the class probability values to be somewhere between 0 and 1, like they are with the original vgg net.

I’m running this on a p2 instance fwiw.

(anamariapopescug) #27

Can confirm this worked :slight_smile:

(leahob) #28

I submitted after just one epoch of training, to test the mechanics of submission, after seeing that the validation accuracy was not too bad.
Thanks for the hint above regarding the evaluation details, Jeremy – given the type of prediction probabilities I submitted (mostly 0s and 1s, as also noted by @jbrown81) and the evaluation metric, there is definitely room for quick improvement.

Epoch 1/1
22500/22500 [==============================] - 629s - loss: 0.7596 - acc: 0.9504 - val_loss: 0.2705 - val_acc: 0.9824

(Jeremy Howard) #29

@jbrown81 we’ll talk more about that issue in class, but the short answer is that the model tends to be terribly over-confident; you can’t really think of them as probabilities at all. So it might well be giving ones and zeros as ‘probabilities’. If you try running predict() before you train, you should see that you do indeed get numbers between zero and one.
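You can see the saturation with a toy softmax: once the logits are even modestly far apart, the winning class rounds to exactly 1.0 at float precision. For instance (plain numpy, nothing to do with the Vgg16 code itself):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 0.5])))     # mild logits: a genuinely soft output
print(softmax(np.array([20.0, -20.0])))  # large logits: indistinguishable from [1, 0]
```

A fine-tuned net that separates cats from dogs easily will routinely produce logits in the second regime, hence the exact 1s.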


(jeff) #30

I was wondering if I can work on the digital mammography DREAM challenge data set instead.

(Jeremy Howard) #31

@jeff that would be a great project. Let us know how we can help you!

(melissa.fabros) #32

I need help debugging writing to a csv. I thought I was iterating through the batches to get the predictions and the filename ids at the same time, but the file ids are repeating with different prediction scores associated with them. I think it’s iterating through the images and predicting OK, but I’m not sure how to get beyond the first 64-item batch to the next one.

Does batches.filenames return the whole list [whether 1k or 12k items] of all files in the target directory, or only one batch’s worth at a time? I thought it was the former, and that batches get divvied up by next() before being passed to predictions.

any help would be appreciated.
Also, I’m training my model on the m4 instance and it’s been going for about 4 hours and still isn’t done. As a reference point, would training be faster on a p2 instance?

def pred_batch():
    import csv
    # collect batches for analysis
    batches = get_batches(data_path+'train', batch_size=64, shuffle=False)
    # open csv to write to
    with open('kaggle_catsdogs.csv', 'w') as csvfile:
        # assign Kaggle column names
        fieldnames = ['id', 'label']
        # instantiate DictWriter to write to csv
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        # write column names to csv
        # while loop to continue loading batches after the first
        # batch of 64 elements are analyzed
        while next(batches):
            # iterate through batches 
            imgs, labels = next(batches)
            # run images through prediction method
            preds = vgg.predict(imgs, True)
            # index values of images 
            idxs = np.argmax(preds, axis=1)

            #loop to format predictions and files id
            for i in range(len(idxs)):
                idx = idxs[i]
                # split to get file id
                filename = batches.filenames[i].split('.')[1]
                print ('{},{:.1f}'.format(filename, preds[i, idx])) 
                writer.writerow({'id': filename, 'label': preds[i, idx]})

(Jeremy Howard) #33

@melissa.fabros yes .filenames gives you the whole lot. If you come into class now we can take a look and help you out.
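For anyone following along later, here’s roughly how the loop could be restructured once you know .filenames is the full list: keep an offset by batches consumed, and iterate a fixed batch count (the generator loops forever, so `while next(batches)` never terminates, and it also silently throws away every other batch). This is a sketch under the lesson’s naming conventions (`get_batches`, a fitted `vgg`); the assumption that column 1 of the model output is the dog probability is mine:

```python
import csv
import numpy as np

def pred_batch(vgg, get_batches, data_path, batch_size=64):
    batches = get_batches(data_path + 'test', batch_size=batch_size, shuffle=False)
    filenames = batches.filenames               # the FULL file list, in batch order
    n_batches = int(np.ceil(len(filenames) / float(batch_size)))
    with open('kaggle_catsdogs.csv', 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['id', 'label'])
        writer.writeheader()                    # actually emit the column-name row
        for b in range(n_batches):              # fixed count: the generator never ends
            imgs, labels = next(batches)
            probs = vgg.model.predict(imgs)     # 2-D: one row of class probabilities
            for i in range(len(imgs)):
                # offset into filenames by the batches already consumed
                fname = filenames[b * batch_size + i]
                file_id = fname.split('.')[1]   # 'cats/cat.1234.jpg' -> '1234'
                # assumption: column 1 is the 'dog' probability
                writer.writerow({'id': file_id, 'label': probs[i, 1]})
```

The key fix is calling next() exactly once per loop iteration and indexing filenames with `b * batch_size + i` rather than just `i`.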

(ben.bowles) #34

In case it’s useful, this is how I solved this problem.

import glob
import os
import pandas as pd
from keras.preprocessing.image import load_img, img_to_array

def predict(base_train_folder, vgg):
    imags = glob.glob(os.path.join(base_train_folder, 'test', '*.jpg'))

    records = []
    for n, path_img in enumerate(imags):
        probs = vgg.model.predict(
            img_to_array(load_img(path_img, target_size=[224, 224])).reshape(1, 3, 224, 224))
        number = os.path.split(path_img)[-1][0:-4]
        records.append({'id': number, 'label': probs[0][1]})
        if n % 15 == 0:
            print float(n) / len(imags)

    df = pd.DataFrame.from_records(records)
    df['id'] = pd.to_numeric(df['id'])
    df = df.sort_values('id')
    df.to_csv('submission.csv', index=False)

Note: a few of my functions (load_img, img_to_array) came from the keras.preprocessing.image module.


(jeff) #35

Thanks. I could use your guidance on a couple of things:

  • Since the digital mammography challenge asks whether the mammogram image PLUS clinical data would improve diagnostic accuracy, how would you architect the neural network? Would it be a deep and wide model (as described in https://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html)? Is that how you solved similar problems at Enlitic (if you’re allowed to say)?

  • Since medical images are usually VERY high resolution, did you find a certain range of downsampled pixel resolutions and training batch sizes that worked well enough for you? Does it depend on the type of medical image?

(Jeremy Howard) #36

@jeff this deserves its own topic! could you create a “medical imaging” topic, and copy your question there? also, please add info about the details of the dataset you’re looking at - number of images, size of each image, etc. also give us some background on the digital mammography problem itself (eg why is it important; what kinds of things would a model need to detect; …). hopefully we can get a few people on the forum to work together on this! :slight_smile:

(Jeremy Howard) #37

Leveraging Pandas is a good idea @ben.bowles

(Tom Elliot) #38

In lesson 1 there was a quick mention of stochastic gradient descent, and a link to a stanford page talking about it. I’ve done my best to understand and summarise the key points on the wiki here - feel free to add to/update/correct the notes. I tried to keep the language pretty plain though - go to the source for more detail!

(Jeremy Howard) #39

Thanks @tom - that’s a great idea. I think you can make your page even better by adding some simpler introductory information early on. Rather than talk about how it’s different to other optimization techniques up front, since a lot of folks won’t be familiar with those other techniques, how about trying to jot down a simple explanation first of how SGD works? That would be a great test of your understanding.

For instance, you could refer to, and borrow from, the SGD intro notebook that I showed in class. If you can explain what this notebook is doing, and why, then you’ll have a nice clear explanation of SGD, I think. How does that sound? Let me know if I can help.
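As a starting point, the essence really does fit in a dozen lines: guess the parameters, measure the error on one example, and nudge the parameters downhill by the gradient times a learning rate. Something like this toy sketch (my own minimal example, not the notebook’s exact code) fitting y = 2x + 1:

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(100)
y = 2 * x + 1                 # the "true" line we want to recover

a, b = 0.0, 0.0               # initial guesses for slope and intercept
lr = 0.1                      # learning rate: how big a step to take each update
for epoch in range(100):
    for i in np.random.permutation(len(x)):   # "stochastic": one example at a time
        err = (a * x[i] + b) - y[i]           # prediction error on this one point
        # step downhill along the gradient of the squared error err**2
        a -= lr * 2 * err * x[i]
        b -= lr * 2 * err

print(a, b)   # converges close to 2.0 and 1.0
```

If the wiki page walks through what each of those lines does, the comparison with other optimizers will make much more sense afterwards.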

(Jeremy Howard) #40

Also, note that this notebook was used in lesson 2, so the lesson 2 video may be helpful here.