# Lesson 1 discussion

(anamariapopescug) #21

Hi @leahob, same here. I was a bit shy of the second half after submitting (running more epochs seemed to improve the score; I didn't have a chance to fiddle with the optimizer parameters much)

0 Likes

(anamariapopescug) #22

this happened to me on some runs, but not on others (e.g. for some values of num_epochs but not for others)

1 Like

This is a wonderful resource for everyone - thanks @melissa.fabros !

0 Likes

What was your validation set accuracy? It’s easier to make suggestions once we know how you’re going so far.

0 Likes

Actually I can give you a strong hint - look at the equation here: https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/details/evaluation . Have a think about what minor change to your output might make a big difference to that evaluation function. Hint: I went from 105th spot to 37th spot by running two little commands in my text editor - no need to even start up my AWS instance…
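For anyone who wants to see the effect numerically, here's a rough sketch of the idea (not Jeremy's actual commands, just an illustration of clipping over-confident predictions before submitting; the 0.05/0.95 bounds are arbitrary):

```python
import numpy as np

# Log loss, as defined on the Kaggle evaluation page linked above.
def log_loss(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])  # one of the four predictions is wrong
raw = np.array([1.0, 1.0, 1.0, 1.0])     # over-confident model output
clipped = np.clip(raw, 0.05, 0.95)       # hedge every prediction

print(log_loss(y_true, raw))      # huge: one confident mistake dominates
print(log_loss(y_true, clipped))  # much smaller
```

A single confident mistake costs almost nothing in accuracy terms but is punished brutally by log loss, which is why this tiny post-processing step can move you dozens of leaderboard places.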

2 Likes

(jbrown81) #26

Thanks @jeremy
First, I ran the vgg model as is (with the 1000-class output, without fine-tuning to dogsvscatsredux) to generate predictions on the first 4 test images like so:

```
from vgg16 import Vgg16
vgg = Vgg16()
batches = vgg.get_batches(path+'test', batch_size=4, shuffle=False)
imgs, labels = next(batches)
vgg.predict(imgs, True)
```

and the output I get is:

```
(array([ 0.2321, 0.5742, 0.2567, 0.5104], dtype=float32),
 array([285, 246, 229, 285]),
 [u'Egyptian_cat', u'Great_Dane', u'Old_English_sheepdog', u'Egyptian_cat'])
```

Next, I fine-tuned the vgg model to generate two-class predictions like so:

```
vgg = Vgg16()
batch_size = 128
batches = vgg.get_batches(path+'train', batch_size=batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size=batch_size)
vgg.finetune(batches)
vgg.fit(batches, val_batches, nb_epoch=1)
```

Once training completed, I ran prediction on the test set:

```
batches = vgg.get_batches(path+'test', batch_size=4, shuffle=False)
imgs, labels = next(batches)
vgg.predict(imgs, True)
```

and the output I get is:

```
(array([ 1., 1., 1., 1.], dtype=float32),
 array([0, 1, 1, 0]),
 [u'tench', u'goldfish', u'goldfish', u'tench'])
```

The predictions on the first 4 test images look correct (cat, dog, dog, cat). What I'm puzzled by is why the probabilities are always exactly 1. With a softmax output, I'd expect the class probability values to be somewhere between 0 and 1, like they are with the original vgg net.

I’m running this on a p2 instance fwiw.

4 Likes

(anamariapopescug) #27

Can confirm this worked

0 Likes

(leahob) #28

I submitted after just one epoch of training, to test the mechanics of submission, after seeing that the validation accuracy was not too bad.
Thanks for the hint regarding the evaluation details, Jeremy – given the type of prediction probabilities I submitted (mainly 0s and 1s, as noted also by @jbrown81) and the evaluation metric, there is definitely room for quick improvement.

Epoch 1/1
22500/22500 [==============================] - 629s - loss: 0.7596 - acc: 0.9504 - val_loss: 0.2705 - val_acc: 0.9824

1 Like

@jbrown81 we’ll talk more about that issue in class, but the short answer is that the model tends to be terribly over-confident; you can’t really think of its outputs as probabilities at all. So it might well be giving ones and zeros as ‘probabilities’. If you try running predict() before you train, you should see that you do indeed get numbers between zero and one.
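A quick way to see the saturation (just a toy illustration, not the actual VGG code): once the fine-tuned model's logits get far enough apart, the winning softmax entry rounds to exactly 1.0 in float32.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0])))   # moderate logits: soft, ~[0.73, 0.27]
print(softmax(np.array([20.0, 1.0])))  # confident logits: ~[1 - 6e-9, 6e-9]

# In float32 (what Keras prints), 1 - 6e-9 is indistinguishable from 1.0:
print(softmax(np.array([20.0, 1.0])).astype(np.float32)[0])
```

So "exactly 1" is just what an over-confident prediction looks like after float32 rounding, not a bug in the fine-tuning.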

0 Likes

#30

I was wondering if I can work on the digital mammography DREAM challenge data set instead.

0 Likes

@jeff that would be a great project. Let us know how we can help you!

0 Likes

(melissa.fabros) #32

I need help debugging writing to a csv. I thought I was iterating through the batches to get the predictions and the filename ids at the same time, but the file ids are repeating with different prediction scores associated with them. I think it's iterating through the images and predicting ok, but I'm not sure how to get beyond the first 64-item batch to the next.

Does batches.filenames return the whole list [whether 1k or 12k items] of all files in the target directory, or only one batch's worth at a time? I thought it was the former, and that batches get divvied up by next() before being passed to predictions.

Any help would be appreciated.
Also, I'm training my model on the m4 instance and it's been going for about 4 hours and it still isn't done. As a reference, would training models be faster on the p2 instance?

```
def pred_batch():
    import csv
    # collect batches for analysis
    batches = get_batches(data_path+'train', batch_size=64, shuffle=False)

    # open csv to write to
    with open('kaggle_catsdogs.csv', 'w') as csvfile:

        # assign Kaggle column names
        fieldnames = ['id', 'label']

        # instantiate DictWriter to write to csv
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        # write column names to csv
        writer.writeheader()

        # batches of 64 elements are analyzed
        while next(batches):

            # iterate through batches
            imgs, labels = next(batches)

            # run images through prediction method
            preds = vgg.predict(imgs, True)

            # index values of images
            idxs = np.argmax(preds, axis=1)

            # loop to format predictions and file ids
            for i in range(len(idxs)):
                idx = idxs[i]

                # split to get file id
                filename = batches.filenames[i].split('.')[1]

                print('{},{:.1f}'.format(filename, preds[i, idx]))
                writer.writerow({'id': filename, 'label': preds[i, idx]})
```
2 Likes

@melissa.fabros yes .filenames gives you the whole lot. If you come into class now we can take a look and help you out.
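For anyone hitting the same issue: since .filenames is the full list up front, you have to advance through it in step with the batches rather than indexing it with the within-batch index. A toy sketch of the pattern (plain Python stand-ins, not the actual Keras iterator):

```python
# Stand-in for a DirectoryIterator: filenames holds the complete file list,
# while images are delivered one batch-sized slice at a time.
filenames = ['test/%d.jpg' % i for i in range(1, 11)]
batch_size = 4

ids = []
for start in range(0, len(filenames), batch_size):
    batch_files = filenames[start:start + batch_size]  # slice matching this batch
    # ... run vgg.predict() on the corresponding batch of images here ...
    for name in batch_files:
        ids.append(name.split('/')[-1].split('.')[0])

print(ids)  # every file id exactly once, in order
```

The repeating-ids bug comes from using `filenames[i]` with the per-batch index `i`, which re-reads the first 64 entries for every batch; the global position is `start + i`.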

0 Likes

(ben.bowles) #34

In case it's useful, this is how I solved this problem.

```
def predict(base_train_folder, vgg):
    imags = glob.glob(os.path.join(base_train_folder, 'test', '*.jpg'))

    records = []
    for n, path_img in enumerate(imags):
        # load and resize each image (helpers from keras.preprocessing.image)
        img = img_to_array(load_img(path_img, target_size=(224, 224)))
        probs = vgg.model.predict(np.expand_dims(img, axis=0))
        number = os.path.split(path_img)[-1][0:-4]
        records.append({'id': number, 'label': probs[0][1]})
        if n % 15 == 0:
            print float(n) / len(imags)

    df = pd.DataFrame.from_records(records)
    df['id'] = pd.to_numeric(df['id'])
    df = df.sort_values('id')
    df.to_csv('submission.csv', index=False)
```

Note, a few of my functions came from the keras.preprocessing.image module

9 Likes

#35

Thanks. I could use your guidance on a couple of things:

• Since the digital mammography challenge asks whether the mammogram image PLUS clinical data would improve diagnostic accuracy, how would you architect the neural network? Would it be a deep and wide model (as described in https://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html)? Is that how you solved similar problems at Enlitic (if you're allowed to say)?

• Since medical images are usually VERY high resolution, did you find a certain range of downsampled pixel resolutions and training batch sizes that worked well enough for you? Does it depend on the type of medical image?

1 Like

@jeff this deserves its own topic! could you create a “medical imaging” topic, and copy your question there? also, please add info about the details of the dataset you’re looking at - number of images, size of each image, etc. also give us some background on the digital mammography problem itself (e.g. why is it important; what kinds of things would a model need to detect; …). hopefully we can get a few people on the forum to work together on this!

1 Like

Leveraging Pandas is a good idea @ben.bowles

0 Likes

(Tom Elliot) #38

In lesson 1 there was a quick mention of stochastic gradient descent, and a link to a stanford page talking about it. I’ve done my best to understand and summarise the key points on the wiki here - feel free to add to/update/correct the notes. I tried to keep the language pretty plain though - go to the source for more detail!
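To make the notes concrete, here's a minimal toy example of the idea (my own illustration, not from the Stanford page): each SGD step nudges the parameters against the gradient of the loss on a single randomly chosen example.

```python
import numpy as np

# Fit y = a*x + b to noiseless data generated with a=2, b=1,
# using one example per update and squared-error loss.
rng = np.random.RandomState(0)
xs = rng.uniform(-1, 1, 100)
ys = 2 * xs + 1

a, b, lr = 0.0, 0.0, 0.1
for step in range(2000):
    i = rng.randint(len(xs))
    err = (a * xs[i] + b) - ys[i]  # d(0.5*err**2)/d(prediction)
    a -= lr * err * xs[i]          # gradient w.r.t. a
    b -= lr * err                  # gradient w.r.t. b

print(a, b)  # should end up close to 2 and 1
```

Each update only looks at one example, so individual steps are noisy, but on average they move downhill, which is the whole trick that makes SGD cheap enough for large datasets.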

2 Likes