Lesson 4 discussion

Batch Normalisation vs Dropout

According to the batch normalization paper (https://arxiv.org/pdf/1502.03167.pdf):

When we use batch normalization, Dropout is not needed and should not be used in order to get the maximum benefit from batch norm.

From the paper (section 4.2.1, page 6): "Batch Normalization fulfills some of the same goals as Dropout. Removing Dropout from Modified BN-Inception speeds up training, without increasing overfitting."

On the other hand, from the lesson 4 video it looks like adding Dropout in addition to batchnorm makes an improvement.
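
For concreteness, a minimal sketch of the kind of dense block being discussed, with BatchNormalization after the activation and an optional small Dropout that you can keep or drop (the layer sizes, dropout rate and input shape here are only illustrative assumptions, not the lesson's exact values):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Dropout

model = Sequential([
    Dense(256, activation='relu', input_shape=(4096,)),  # assumed feature size
    BatchNormalization(),
    Dropout(0.2),   # the paper suggests removing this; the lesson keeps some dropout
    Dense(256, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])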

@jeremy

I would like to reproduce the result in the lesson for State Farm, so I would like to know exactly which drivers are used for the validation set. I tried to do the split myself but I got bad results, and I could not improve them with any of the ways suggested by @jeremy.

It would be better to add the code for the split to the State Farm sample notebook, for example.
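
Something along these lines is what I mean - a rough sketch that splits by driver rather than by random image, assuming the standard Kaggle driver_imgs_list.csv (columns subject, classname, img) and a train/ folder of class subdirectories; the validation drivers chosen here are arbitrary, not the ones used in the lesson:

import os, shutil
import pandas as pd

drivers = pd.read_csv('driver_imgs_list.csv')   # columns: subject, classname, img
val_drivers = ['p012', 'p014', 'p021']          # arbitrary example of held-out drivers

for _, row in drivers[drivers.subject.isin(val_drivers)].iterrows():
    src = os.path.join('train', row.classname, row.img)
    dst_dir = os.path.join('valid', row.classname)
    os.makedirs(dst_dir, exist_ok=True)
    shutil.move(src, os.path.join(dst_dir, row.img))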

thank you!

Am I right in assuming that predict for the movie model should be passed the user INDEX and movie INDEX, and not the user ID and movie ID? Early on, the ratings are converted from UID to index to make the embedding easier. So to reverse this, either the numbers we pass to predict are indices and have to be converted back to a UID (then to a string for the movie) to get human-readable results, or, if the code in the notebook is correct, we must take the UIDs and convert them back to indices before we call predict.

I believe it is accidentally working simply because most people are putting in hardcoded numbers that happen to be valid both as IDs and as indices. Am I incorrect in this?
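
In other words, I would expect a lookup like this (using the userid2idx / movieid2idx dictionaries built in the notebook) to be required before calling predict:

import numpy as np

user_idx  = userid2idx[1]      # userId 1    -> row in the user embedding
movie_idx = movieid2idx[110]   # movieId 110 -> row in the movie embedding
pred = model.predict([np.array([user_idx]), np.array([movie_idx])])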

I have a question about the Collaborative filtering part.
The model in Excel calculates the score by: dot(two matrices) + bias_user + bias_movie
I believe that the first initial Keras model does exact same calculation as the Excel sheet (Correct me if I am wrong).
How does the architecture change, and how are the movie ratings calculated, when we add a Dense layer to the Keras model?
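
To make the question concrete, this is how I understand the two variants (a rough sketch in the Keras 1 functional API, with n_users, n_movies and n_factors defined as in the notebook; the layer sizes and dropout rate are only indicative):

from keras.layers import Input, Embedding, Flatten, Dense, Dropout, merge
from keras.models import Model

user_in  = Input(shape=(1,), dtype='int64', name='user_in')
movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
u = Embedding(n_users,  n_factors, input_length=1)(user_in)
m = Embedding(n_movies, n_factors, input_length=1)(movie_in)
ub = Flatten()(Embedding(n_users,  1, input_length=1)(user_in))   # user bias
mb = Flatten()(Embedding(n_movies, 1, input_length=1)(movie_in))  # movie bias

# 1) dot product + biases: the same calculation as the Excel sheet
x = merge([u, m], mode='dot')
x = Flatten()(x)
x = merge([x, ub, mb], mode='sum')
dot_model = Model([user_in, movie_in], x)

# 2) "neural net" version: concatenate the embeddings and let Dense layers learn the interaction
x = merge([u, m], mode='concat')
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dense(1)(x)
nn_model = Model([user_in, movie_in], x)

Is it correct that both still output a single number that gets compared to the rating with MSE, and the only change is in how the user and movie factors are combined?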

@rachel @jeremy

I'm really confused about whether this is working or not. Here are the important lines cut from the notebook:

ratings = pd.read_csv(path+'ratings.csv')  # loads ratings.csv, whose columns are userId, movieId, rating, timestamp
users = ratings.userId.unique() # creates a unique listing of users
movies = ratings.movieId.unique() # creates a unique listing of movies
userid2idx = {o:i for i,o in enumerate(users)} # creates a mapping from user id to index in users
movieid2idx = {o:i for i,o in enumerate(movies)} # creates a mapping from movie id to index in movies

### CRITICAL ###
# Converts the ratings.movieId from being UID based to index based
ratings.movieId = ratings.movieId.apply(lambda x: movieid2idx[x]) 
# Converts the ratings.userId from being UID based to index based
ratings.userId = ratings.userId.apply(lambda x: userid2idx[x]) 

Then when we train:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, epochs=1,
validation_data=([val.userId, val.movieId], val.rating))

So we train on the indices, so any call to predict, MUST use the index.

However, I wrote this function (I've renamed some of the variables in my own notebook to be a bit more self-explanatory):

def predict(user_index=0, movie_index=0, movie_uid=None, user_uid=None, movie_name=None):
    # resolve whichever identifier was given down to embedding indices
    if movie_name is not None:
        movie_uid = movie_name_to_uid[movie_name]

    if movie_uid is not None:
        movie_index = movie_uid_to_idx[movie_uid]

    if user_uid is not None:
        user_index = user_uid_to_idx[user_uid]

    result = model.predict([np.array([user_index]), np.array([movie_index])])
    # but we want to translate that into a movie id...
    print("Best Rating for user {} on movie {} is {}".format(
        user_idx_to_uid[user_index], MovieIndexToName(movie_index), result[0]))
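
(For completeness, the renamed helpers used above are built roughly like this; the name lookup assumes the standard movies.csv with movieId and title columns.)

user_uid_to_idx  = userid2idx
movie_uid_to_idx = movieid2idx
user_idx_to_uid  = {i: o for o, i in userid2idx.items()}
movie_idx_to_uid = {i: o for o, i in movieid2idx.items()}

movie_titles      = pd.read_csv(path+'movies.csv').set_index('movieId')['title']
movie_name_to_uid = {title: mid for mid, title in movie_titles.items()}

def MovieIndexToName(movie_index):
    return movie_titles[movie_idx_to_uid[movie_index]]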

However, when I then call my function:

predict( user_index = 0, movie_index = 27 )
predict( user_uid = 1, movie_name = "Braveheart (1995)")
predict( user_uid = 1, movie_uid = 110 )

# This represents what the notebook is really asking for
predict( user_index = 3, movie_index = 6 )
# This represents what it intended to ask for
predict( user_uid = 3, movie_uid = 6 )

I get this result:

# From the ratings CSV file, userId 1 with movieId 110 should score about a 1.0 rating.
Best Rating for user 1 on movie Braveheart (1995) is [ 2.84351969]
Best Rating for user 1 on movie Braveheart (1995) is [ 2.84351969]
Best Rating for user 1 on movie Braveheart (1995) is [ 2.84351969]

# The following result is closest to the notebook example of model.predict([np.array([3]), np.array([6])])
Best Rating for user 4 on movie Ben-Hur (1959) is [ 4.69232702]
# This was what it intended to ask
Best Rating for user 3 on movie Heat (1995) is [ 3.62401152]

Am I doing this right? Maybe the model is just not that good. I've been disappointed by its predictive powers.

In lesson 4, the neural net embedding uses only the userId and movieId information. So in order to generate predictions, there has to be some existing rating for the movie.
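
For example, with the notebook's mapping, an id that never appears in ratings.csv simply has no index (and therefore no learned factors in the embedding):

movieid2idx.get(999999)   # -> None for a hypothetical movieId that was never rated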

What if the test set has some new movies and new users that are not included in the train set? In that case, how can we use the Embedding layer?

Hi all, I'm totally new to ML and really getting a lot out of the course so far - thanks so much for putting it all online for free.

My issue: I'm going through collaborative filtering with MovieLens and cannot replicate the level of loss Jeremy got for the simple dot product model without bias. The lowest I can get is a validation MSE loss of around 3.4, compared to Jeremy's 1.45. I've played around with the learning rate in a similar manner to Jeremy but it just won't budge below that loss. I assume that means something has gone wrong in the way I've set it up! The main difference I can see is that I'm using a TensorFlow backend on a shared server. Might that have something to do with it?

You can find my workbook here - any help would be appreciated. Thanks!

@RiB ,

I was wondering if you ever figured out an answer to the discrepancy between your own loss results and the ones obtained through the notebook? I am in the same boat, in that the validation loss numbers I get are nowhere near what @Jeremy found.

My results match yours: 2.5 vs 1.4 and 1.14 vs 0.79.
keras.__version__ == '1.2.2'

I did go over your post (May 19) about the regularization parameter impacting the MSE results in Keras 2, but unless the same change also applies to version 1.2.2, I am not quite sure what else would explain it.

To be honest, I don't even understand why the regularization parameter would impact the MSE calculation on the validation set.
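
One thing I might try, to see how much of the reported number is the penalty rather than the fit itself, is to track a plain MSE metric next to the loss - if the weight penalties are being folded into the reported loss/val_loss, the metric should come out lower. A sketch of what I mean (on Keras 1.2.2):

# compare the regularized loss against a plain MSE metric on the same run
model.compile(Adam(0.001), loss='mse', metrics=['mean_squared_error'])
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1,
          validation_data=([val.userId, val.movieId], val.rating))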

Do you have a link to your Keras support question so I can follow up?

Thanks,
N.

@jamest ,

From what I could tell from your notebook, it seems to me that the learning rate for the optimizer is set too high when you first compile the model:

model.compile(loss='mse', optimizer=Adam(0.01), metrics=['accuracy'])

whereas @Jeremy first sets the lr to 0.001 for 1 epoch, then does one pass at 0.01 (3 epochs) and another at 0.001 (6 epochs):

model.compile(Adam(0.001), loss='mse')
...
model.optimizer.lr=0.01
...
model.optimizer.lr=0.001

I think that by taking too big a step initially you're not able to find the proper latent factors.

HTH,
N.

I went over the details of the notebook and what is shown during the class. There are several differences that I noted:

  1. For the Dot Product section, the weight regularizer in the notebook is 1e-4, whereas @Jeremy is using 1e-5:

    u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)

  2. In the class, the first run consists of 6 epochs with that regularizer value, whereas the notebook only goes through 1 epoch and then changes the model's learning rate for a new set of iterations.

By following what's done during the class I get 1.399 after the first 6 epochs. The iterations provided in the notebook after that (once the lr is modified) are fairly redundant, as we can see that the model is heavily overfitting.

Train on 79766 samples, validate on 20238 samples
Epoch 1/6
79766/79766 [==============================] - 8s - loss: 3.6372 - val_loss: 2.6259
Epoch 2/6
79766/79766 [==============================] - 7s - loss: 1.8852 - val_loss: 1.7960
Epoch 3/6
79766/79766 [==============================] - 7s - loss: 1.2993 - val_loss: 1.5317
Epoch 4/6
79766/79766 [==============================] - 6s - loss: 1.0723 - val_loss: 1.4462
Epoch 5/6
79766/79766 [==============================] - 7s - loss: 0.9653 - val_loss: 1.4056
Epoch 6/6
79766/79766 [==============================] - 7s - loss: 0.9012 - val_loss: 1.3995
  3. For the Bias section it's a little bit of the same: the weight regularizer is set to 1e-5 in the class (vs. 1e-4 in the notebook). This gives me 1.17 after 6 epochs, vs. 1.1185 in the class. This is much better than the 1.88 I get with the notebook's factor of 1e-4:

    Bias with reg factor of 1e-4
    Train on 79766 samples, validate on 20238 samples
    Epoch 1/6
    79766/79766 [==============================] - 5s - loss: 8.8063 - val_loss: 3.5732
    Epoch 2/6
    79766/79766 [==============================] - 8s - loss: 2.5907 - val_loss: 2.3323
    Epoch 3/6
    79766/79766 [==============================] - 7s - loss: 1.9997 - val_loss: 2.1234
    Epoch 4/6
    79766/79766 [==============================] - 7s - loss: 1.8357 - val_loss: 2.0305
    Epoch 5/6
    79766/79766 [==============================] - 7s - loss: 1.7387 - val_loss: 1.9515
    Epoch 6/6
    79766/79766 [==============================] - 7s - loss: 1.6577 - val_loss: 1.8832

I still can't explain the smaller discrepancies, but those numbers look a lot closer to the ones observed during the class than the ones obtained with a regularizer factor of 1e-4.

HTH,
N.

@npvisual

Thanks for getting back to me. Copying @Jeremy's learning rates and epochs yields a validation loss of 2.6. So, better than before, but still not nearly as good. Continuing on, I can improve that validation loss using the techniques given in the lesson, but it is never as low as what @Jeremy achieves. I'm curious as to why that is - is the result of SGD dependent on the backend or hardware? Those are the only differences I can see, even when I run the original lesson notebook.

Cheers,
James

@jamest are you still using TensorFlow as the backend, as is shown in your notebook?
If so, have you tried running the same model with Theano?

Just a quick update on this: I upgraded to Keras 2.0.6 and ran the numbers with TensorFlow as the backend instead of Theano. While I saw some small differences between backends, the spread was much wider when I fiddled with the weight regularizers or changed the optimizer.

For example, with TF, in the Bias section, I can get all the way down to 0.973 simply by using RMSProp instead of Adam and by lowering the weight regularizers to 1e-9. I could run more epochs as well but it bottoms out around 18~20 iterations.

Here's the part describing the model (I had to modify it for Keras 2.0.6):

x = keras.layers.dot([u,m], axes=2, normalize=False)
x = Flatten()(x)
x = keras.layers.add([x, ub])
x = keras.layers.add([x, mb])
model = Model([user_in, movie_in], x)
model.compile(optimizer=keras.optimizers.TFOptimizer(tf.train.RMSPropOptimizer(0.001)), loss='mse') 

and the output when I fit the model:

model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, epochs=18, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79906 samples, validate on 20098 samples
Epoch 1/18
79906/79906 [==============================] - 8s - loss: 11.7509 - val_loss: 9.2466
Epoch 2/18
79906/79906 [==============================] - 7s - loss: 5.8543 - val_loss: 3.8248
Epoch 3/18
79906/79906 [==============================] - 8s - loss: 2.8041 - val_loss: 2.3294
Epoch 4/18
79906/79906 [==============================] - 7s - loss: 1.8351 - val_loss: 1.7454
Epoch 5/18
79906/79906 [==============================] - 7s - loss: 1.4134 - val_loss: 1.4586
Epoch 6/18
79906/79906 [==============================] - 8s - loss: 1.1883 - val_loss: 1.2983
Epoch 7/18
79906/79906 [==============================] - 8s - loss: 1.0511 - val_loss: 1.2000
[.....]
Epoch 15/18
79906/79906 [==============================] - 7s - loss: 0.6468 - val_loss: 0.9850
Epoch 16/18
79906/79906 [==============================] - 7s - loss: 0.6206 - val_loss: 0.9790
Epoch 17/18
79906/79906 [==============================] - 8s - loss: 0.5953 - val_loss: 0.9763
Epoch 18/18
79906/79906 [==============================] - 7s - loss: 0.5730 - val_loss: 0.9730

So it still doesn't explain the differences between the "original" notebook and the class, but I believe Adam is not really the best optimizer for this type of application.

For the Dot Product section, while I kept Adam as the optimizer, I get much better results without specifying any weight regularizers:

u = Embedding(n_users, n_factors, input_length=1)(user_in)
m = Embedding(n_movies, n_factors, input_length=1)(movie_in)

which gets me (only showing the first 6 epochs):

Train on 80307 samples, validate on 19697 samples
Epoch 1/10
80307/80307 [==============================] - 5s - loss: 10.9750 - val_loss: 4.4705
Epoch 2/10
80307/80307 [==============================] - 4s - loss: 2.5244 - val_loss: 1.8511
Epoch 3/10
80307/80307 [==============================] - 4s - loss: 1.2538 - val_loss: 1.4383
Epoch 4/10
80307/80307 [==============================] - 4s - loss: 0.9073 - val_loss: 1.3123
Epoch 5/10
80307/80307 [==============================] - 4s - loss: 0.7445 - val_loss: 1.2630
Epoch 6/10
80307/80307 [==============================] - 4s - loss: 0.6430 - val_loss: 1.2449

In comparison, using a weight regularizer of 1e-5, as is done in the class, gets me around 1.41.

So I don't think there's an issue per se with the backend you're using, unless there's a bug in your specific version; however, the weight regularizer coefficient and the optimizer you choose seem to have a fairly big impact for a given model.

As a side note: adding the weight regularizers makes each epoch take about twice as long.

HTH,
N.

When I try to run

user_in = Input(shape=(1,), dtype='int64', name='user_in')
u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)
movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
m = Embedding(n_movies, n_factors, input_length=1, W_regularizer=l2(1e-4))(movie_in)

I get the error below. When I try to change W_regularizer to kernel_regularizer, it won't even run:

UserWarning: Update your Embedding call to the Keras 2 API: Embedding(9066, 50, input_length=1, embeddings_regularizer=<keras.reg...)
after removing the cwd from sys.path.

Why even use the bias and dot method, when you can run a neural network and get better results?

Can someone please explain to me why the same model architecture produces totally different results? I'm training my model with this architecture:

def conv1(batches):
    model = Sequential([
            BatchNormalization(axis=1, input_shape=(3,224,224)),
            Convolution2D(32,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Convolution2D(64,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Flatten(),
            Dense(200, activation='relu'),
            BatchNormalization(),
            Dense(10, activation='softmax')
        ])

    model.compile(Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    model.optimizer.lr = 0.001
    model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    return model

In Jeremy's results, val acc is above 0.5 by the 2nd epoch, but my model does not reach that level even after all the epochs (2+4). The only difference I have is that my val set is based on 3 drivers, which is 2424 images. I tried playing with learning rates but the accuracy won't go above 0.3. Does the difference in train/validation sets have such a huge effect on the results?

@optimusprime,

Absolutely, but that wasn't the issue.

I (and others) was concerned about the fact that we weren't getting comparable results for each of the different sections between the class (video) and the provided notebook.

Understanding what the possible differences were and finding an explanation for the gap in the results was the primary goal of the information I shared above.

You need to use embeddings_regularizer, which is the Keras 2 name for the old W_regularizer on an Embedding layer (kernel_regularizer only exists on layers like Dense and Conv, which is why the Embedding call rejects it):

u = Embedding(n_users, n_factors, input_length=1, embeddings_regularizer=l2(1e-5))(user_in)

HTH,
N.


@npvisual sorry for the slow reply - I'm still using TF as I haven't managed to get Theano working on our ML server. It threw some errors in Python when importing that I didn't have time to investigate. I'll look into whether it makes a difference when I next have a chance to sort it out!

Does anybody have any idea which paper proposes concatenating the user and item embeddings and feeding them to a neural network as an alternative to matrix factorization? Or is this a novel idea of Jeremy's?

Thanks