shuffle=True or False?

I took Jeremy’s dogscats_ensemble notebook and ran it up to the finetuning of the last layer. After that finetuning, what happened was surprising: I got a validation accuracy of about 50%, way less than the finetuning results in lesson 1. I then investigated, and found that the accuracy was also terrible in this notebook if I used vgg.py’s native finetune() function like in lesson 1.

The difference, it turns out, is how batches and val_batches were defined. In short, if I override the default of “shuffle=True” with “shuffle=False”, like in the provided dogscats_ensemble notebook, I get terrible accuracy. If I just change it to “shuffle=True” for the definition of both batches and val_batches, the accuracy shoots up to 95%.

But that’s confusing to me, because knowing how the accuracy responds to this parameter, I don’t understand why “shuffle=False” was set in the first place. Is there ever any reason to set “shuffle=False”? Why was it there in the dogscats_ensemble notebook?

Hi Metamich,

If you are using precomputed convolutional features, like Jeremy does in the dogscats_ensemble notebook, then the data gets generated in an order that no longer matches the labels. Say, for example, we have the images in the order below.

1.jpg -> Dog
2.jpg -> Dog
3.jpg -> Cat

After generating the convolutional features with shuffle=True, the pairing becomes something like:

2.jpg -> Dog
3.jpg -> Dog
1.jpg -> Cat

So your model is trying to learn from mismatched data. If you want to use shuffle=True, then use the batches directly (without precomputing the features).
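
Here is a minimal sketch of that mismatch (this assumes the course’s get_batches / onehot helpers and the conv_model built in the notebook; it is an illustration, not the notebook’s exact code):

# labels are built once, in directory (unshuffled) order
label_batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)
trn_labels = onehot(label_batches.classes)   # order: 1.jpg, 2.jpg, 3.jpg, ...

# a shuffled generator yields images in a random order, so the precomputed
# features come out in that random order too
shuffled_batches = get_batches(path+'train', shuffle=True, batch_size=batch_size)
trn_features = conv_model.predict_generator(shuffled_batches, shuffled_batches.nb_sample)

# trn_features[i] now belongs to a different image than trn_labels[i],
# so anything fit on (trn_features, trn_labels) learns from wrong pairs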

Hope it helps.

NOTE: Please post your questions in the respective lessons, so that it’s easy to maintain.

Thanks,
Vishnu Subramanian


Thanks @VishnuSubramanian. I see your point, but this is why I don’t get it. I wanted to use “shuffle=False”, like in Jeremy’s notebook. That’s what I did the first time. But it turns out that the validation accuracy is terrible in that case. It’s only after I changed it to “shuffle=True” that I got good results. That’s counter-intuitive. I mean, why would the model be trying to learn from the wrong data only when shuffling is turned OFF?

Can you share your code in a gist? It will help me understand.

Using the dogscats sample data only:

batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)
val_batches = get_batches(path+'valid', shuffle=False, batch_size=batch_size)
vgg = Vgg16()
vgg.finetune(batches)
vgg.fit(batches, val_batches, nb_epoch=1)

The above results in a validation accuracy of 0.5000

vgg = Vgg16()
batches = get_batches(path+'train', shuffle=True, batch_size=batch_size)
val_batches = get_batches(path+'valid', shuffle=True, batch_size=batch_size)
vgg.finetune(batches)
vgg.fit(batches, val_batches, nb_epoch=1)

The above results in a validation accuracy of 0.8250

Basically, I ran the dogscats_ensemble notebook verbatim, up to the finetuning of the last layer (removing the last layer and second-last layer, replacing them with batchnorm, dropout and the final dense layer), and the validation accuracy is a whopping 0.3710 (worse than a random guess?), which I didn’t understand. So I went back and tested the above two chunks of code, and got the above results.

There are two problems here.

  1. Using Batch Normalisation: since you are using the VGG model, which was trained without a BatchNormalisation layer, the weights work optimally for that architecture. We cannot simply introduce a BN layer into the existing model. You can find more about it in the Lesson 5 video and the notes.

  2. Why shuffle=True improves accuracy: shuffle=True should give better results, particularly when you are running more epochs. But I am not clear why there is such a huge difference in accuracy when shuffle is set to False. One thing you can try is to increase the number of epochs with shuffle=False and see if the accuracy improves (a quick sketch follows below).
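
For the second point, a quick check might look like this (a sketch reusing the earlier get_batches / Vgg16 setup, just with more epochs):

batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)
val_batches = get_batches(path+'valid', shuffle=False, batch_size=batch_size)
vgg = Vgg16()
vgg.finetune(batches)
vgg.fit(batches, val_batches, nb_epoch=8)   # more epochs than before, still shuffle=False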

Thank you @VishnuSubramanian. I can report that your second point is right on: with 8 epochs each, shuffle=True and shuffle=False both give training/val accuracy around 98%.

Regarding your first point, I’m not sure what you mean. The code I provided above didn’t contain BatchNorm. The BatchNorm is in later code in the provided dogscats_ensemble notebook. What it does is precisely introduce BN into the existing model, by removing the last two layers (dropout and dense) and replacing them with three layers (batchnorm, dropout and dense).

def get_ll_layers():
    return [
        BatchNormalization(input_shape=(4096,)),
        Dropout(0.5),
        Dense(2, activation='softmax')
        ]

Did you mean that since we’ve now introduced BatchNorm, perhaps the problem with a validation accuracy of 37% results from insufficient epochs as well? I’ve watched the lesson 5 video and read the notes before, btw: I thought I understood what the notebook is trying to do, but the result, the 37% accuracy, threw me off.

Jeremy has trained the VGG model with Batch Normalisation, and you can find it in the vgg16bn.py file. If you use that model, you will see a change in your accuracy. But try reading the notes and my previous answer again to find out why it works.

Thanks @VishnuSubramanian. I do think that I’ve done my homework and have read these before I came here and asked the question. I’m aware that vgg_bn.py exists and I could use that for the same purpose. But in this case, I’m just trying to learn how Jeremy did it with the original vgg.py only. His dogscats ensemble notebook didn’t mention vgg_bn.py.

In other words, my question is simple. Would you let me know why, if I run the original dogscats ensemble notebook, I don’t get the same results as he does? In particular, I get about 50% accuracy after running 13 epochs, instead of the above 97% when he runs it.

The code I used is as follows, same as his notebook, I believe. For ease of reading, I’ve converted it from .ipynb to .py:

# coding: utf-8    

# In[50]:    

from theano.sandbox import cuda
cuda.use('gpu0')    


# In[51]:    

get_ipython().magic(u'matplotlib inline')
import utils; reload(utils)
from utils import *
from __future__ import division, print_function    


# ## Setup    

# In[52]:    

path = "data/dogscats/sample/"
model_path = 'data/dogscats/models/'
if not os.path.exists(model_path): os.mkdir(model_path)    

batch_size=64    


# In[53]:    

batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)
val_batches = get_batches(path+'valid', shuffle=False, batch_size=batch_size)    


# In[54]:    

(val_classes, trn_classes, val_labels, trn_labels, 
    val_filenames, filenames, test_filenames) = get_classes(path)    


# In this notebook we're going to create an ensemble of models and use their average as our predictions. For each ensemble, we're going to follow our usual fine-tuning steps:
# 
# 1) Create a model that retrains just the last layer
# 2) Add this to a model containing all VGG layers except the last layer
# 3) Fine-tune just the dense layers of this model (pre-computing the convolutional layers)
# 4) Add data augmentation, fine-tuning the dense layers without pre-computation.
# 
# So first, we need to create our VGG model and pre-compute the output of the conv layers:    

# In[55]:    

model = Vgg16().model
conv_layers,fc_layers = split_at(model, Convolution2D)    


# In[56]:    

conv_model = Sequential(conv_layers)    


# In[57]:    

val_features = conv_model.predict_generator(val_batches, val_batches.nb_sample)
trn_features = conv_model.predict_generator(batches, batches.nb_sample)    


# In[58]:    

save_array(model_path + 'train_convlayer_features.bc', trn_features)
save_array(model_path + 'valid_convlayer_features.bc', val_features)    


# In the future we can just load these precomputed features:    

# In[59]:    

trn_features = load_array(model_path+'train_convlayer_features.bc')
val_features = load_array(model_path+'valid_convlayer_features.bc')    


# We can also save some time by pre-computing the training and validation arrays with the image decoding and resizing already done:    

# In[60]:    

trn = get_data(path+'train')
val = get_data(path+'valid')    


# In[61]:    

save_array(model_path+'train_data.bc', trn)
save_array(model_path+'valid_data.bc', val)    


# In the future we can just load these resized images:    

# In[62]:    

trn = load_array(model_path+'train_data.bc')
val = load_array(model_path+'valid_data.bc')    


# Finally, we can precompute the output of all but the last dropout and dense layers, for creating the first stage of the model:    

# In[63]:    

model.pop()
model.pop()    


# In[64]:    

ll_val_feat = model.predict_generator(val_batches, val_batches.nb_sample)
ll_feat = model.predict_generator(batches, batches.nb_sample)    


# In[65]:    

save_array(model_path + 'train_ll_feat.bc', ll_feat)
save_array(model_path + 'valid_ll_feat.bc', ll_val_feat)    


# In[66]:    

ll_feat = load_array(model_path+ 'train_ll_feat.bc')
ll_val_feat = load_array(model_path + 'valid_ll_feat.bc')    


# ...and let's also grab the test data, for when we need to submit:    

# In[67]:    

test = get_data(path+'test')
save_array(model_path+'test_data.bc', test)    


# In[68]:    

test = load_array(model_path+'test_data.bc')    


# ## Last layer    

# The functions automate creating a model that trains the last layer from scratch, and then adds those new layers on to the main model.    

# In[71]:    

def get_ll_layers():
    return [ 
        BatchNormalization(input_shape=(4096,)),
        Dropout(0.5),
        Dense(2, activation='softmax') 
        ]    


# In[83]:    

def train_last_layer(i):
    # a few cells ago we popped the last two layers of vgg and precomputed the model
    # output as features (ll_feat / ll_val_feat);
    # here we train the replacement layers on those features (adding batchnorm,
    # changing the number of outputs): the purpose is to finetune the last dense layer, with batchnorm
    ll_layers = get_ll_layers()
    ll_model = Sequential(ll_layers)
    ll_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    ll_model.optimizer.lr=1e-5
    ll_model.fit(ll_feat, trn_labels, validation_data=(ll_val_feat, val_labels), nb_epoch=12)
    ll_model.optimizer.lr=1e-7
    ll_model.fit(ll_feat, trn_labels, validation_data=(ll_val_feat, val_labels), nb_epoch=12)
    ll_model.save_weights(model_path+'ll_bn' + i + '.h5')    

    #here we've taken vgg 16, popped the last dense layer, the dropout above that
    #and the regular dense layer above that
    vgg = Vgg16()
    model = vgg.model
    model.pop(); model.pop(); model.pop()
    print (model.summary())
    for layer in model.layers: layer.trainable=False
    model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])    

    ll_layers = get_ll_layers()
    for layer in ll_layers: model.add(layer)
    for l1,l2 in zip(ll_model.layers, model.layers[-3:]):
        l2.set_weights(l1.get_weights())
    model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    model.save_weights(model_path+'bn' + i + '.h5')
    return model    


# In[84]:    

train_last_layer('test') 

In addition, I’d also like to ask why model.pop() is done three times in the definition of train_last_layer(i). I thought popping twice would be enough, since the precomputation was done by popping the last two, not three, layers. That’s not the reason the accuracy is so low, though: I changed the definition to pop only twice before, and the result didn’t improve.

Because Jeremy used the updated model with Batch Normalisation in the lesson, but he modified it for the MOOC. So for you to get the same result, use the BN file.

I re-thought about it and really can’t believe the dogscats_ensemble notebook is not supposed to run as it is. Every piece of evidence points to it being built on vgg, not vgg_bn.

@VishnuSubramanian you did read that notebook, right? If it’s not supposed to run, how did all these cells produce results? The high accuracy levels are shown as cell outputs.

Also, there’s a forum post here indicating that Jeremy uploaded it intending it to work. So it must be that I did something wrong and the notebook isn’t running correctly for me.

Finally, if the notebook were intended to import vgg_bn instead of vgg as the starter model, then there would be no need to pop the last 2 layers only to add 3 back, because the vgg_bn model already has the BatchNorm layer as the third-last layer. Therefore, the definition of get_ll_layers() wouldn’t be necessary. The existence of a function like get_ll_layers() made me think the intent was to build the BatchNorm layers using the vgg model as the baseline.

To say that the notebook has been modified so that it no longer produces the nice accuracy results because it’s no longer using vgg_bn is just not consistent with the observations above. This is making me even more confused. What am I missing?

Setting shuffle to False allows you to use the previously trained data. Setting it to True means that you either want to retrain or set the number of epochs to some value greater than 3 in order to learn, but this increases the chance of memorization (overfitting).

In Jeremy’s notebook “statefarm-sample”, there are 2 lines:

batches = get_batches(path+'train', batch_size=batch_size)
val_batches = get_batches(path+'valid', batch_size=batch_size*2, shuffle=False)

It shuffles the training data but keeps the validation data in order. What’s the purpose of this?


Validation will produce exactly the same score whatever the order, as the weights are fixed. Therefore it makes no difference whether you shuffle or not (a shuffle is slightly slower, but probably insignificantly so).

Training, however, is incremental. You can either teach your baby “hello” for 6 months and then “goodbye” for 6 months, or you can mix up “hello” and “goodbye” for a year. Mixing it up is intuitively better for teaching the difference between hello and goodbye; it avoids any influence of timing (e.g. hello only in summer); and it shows smoother progress, e.g. if it is easier to spot “hello” then halfway through an epoch you will see progress slow significantly.
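
A quick way to convince yourself of the first point (a sketch assuming the course’s get_batches helper and some already-trained model called model):

# with fixed weights, evaluation is just an average over per-sample losses,
# so the order of the validation samples cannot change the score
val_ordered = get_batches(path+'valid', shuffle=False, batch_size=batch_size)
val_shuffled = get_batches(path+'valid', shuffle=True, batch_size=batch_size)

print(model.evaluate_generator(val_ordered, val_ordered.nb_sample))
print(model.evaluate_generator(val_shuffled, val_shuffled.nb_sample))
# both lines should report the same loss/accuracy (up to floating-point noise)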


It sounds like there is no big difference whether I set shuffle to True or False for the train and valid data. But based on my observation, only shuffle=True for the train batches and shuffle=False for the valid batches gets a good result from the following model. The other combinations give very low accuracy for both train and valid data, and the valid accuracy shows almost no improvement after several epochs. I want to understand why this happens.

model = Sequential([
    BatchNormalization(axis=1, input_shape=(3, 224, 224)),
    Flatten(),
    Dense(10, activation='softmax', W_regularizer=l2(0.01))
    ])
model.optimizer.lr=0.001
model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches,
    nb_val_samples=val_batches.nb_sample)

Perhaps your observation is based on different starting points. It makes no difference to the validation score whether it is shuffled or not. If you want to see this, you have to keep everything else the same.

np.random.seed(0)
initialise your generators
run model
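
Concretely, something like this (a sketch assuming the lesson’s get_batches helper and the model defined above):

import numpy as np

np.random.seed(0)   # fix the RNG before the generators are created, so any shuffling is reproducible
batches = get_batches(path+'train', shuffle=True, batch_size=batch_size)
val_batches = get_batches(path+'valid', shuffle=False, batch_size=batch_size)

model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches,
    nb_val_samples=val_batches.nb_sample)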

Ok, I will try this. BTW, where do I set the seed? Before creating the batches or before creating the model?

If the batches are shuffled, then before creating the batches. Even if the batches are not shuffled, if you have already used the generator, it could be starting from different data.
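
To illustrate that second caveat (a minimal sketch, again using get_batches): a generator keeps its position, so if it has already been used, the next run does not start from the first image.

batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)
x1, y1 = next(batches)   # consumes the first batch; the iterator now points at the second batch
x2, y2 = next(batches)   # any later fit/predict call would continue from here, not from the first image

# to guarantee the same starting point, re-create the generator first
batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)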