Lesson 4 discussion

Hello Guys

I've got a small doubt; I don't know if it's a stupid one or not.

So I was running the lesson 4 code, and since there is no way to see accuracy in the model's compile section, I added metrics=['accuracy']. But now when I fit the model I get a weird error.

Here is the error-

ImportError: ('The following error happened while compiling the node', Elemwise{Composite{EQ(i0, RoundHalfToEven(i1))}}(flatten_6_target, Reshape{2}.0), '\n', 'DLL load failed: The specified procedure could not be found.', '[Elemwise{Composite{EQ(i0, RoundHalfToEven(i1))}}(flatten_6_target, <TensorType(float32, matrix)>)]')

Here is the code which I used-

x = merge([u, m], mode='dot')
x = Flatten()(x)
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse', metrics=['accuracy'])
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1,
          validation_data=([val.userId, val.movieId], val.rating))

Am I missing anything here?

Hello, Guys!

Great and valuable course. Thank you very much for sharing your knowledge!

I have a doubt about the movielens Neural Net model.

Imagine that I also have demographic info (gender, age, etc.) about each user. Maybe this info could help enhance the model predictions.

How could the one-hot encoded new features for each user be merged with the latent factors previously obtained through the embeddings of users and movies?

I ran into a similar problem, and the sad part was that I was working with the 100M dataset. It seems the output stream starts sending back too much and the Jupyter notebook doesn't handle it well; in Firefox it hangs and displays an 'Unresponsive script' warning. I suggest using keras_tqdm for all the notebook callbacks. It worked for me!
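
In case it helps, here's roughly how I wired it up (this follows keras_tqdm's documented usage; the fit arguments are just the ones from the lesson):

    from keras_tqdm import TQDMNotebookCallback

    # verbose=0 silences Keras's own text progress bar; the callback draws a tqdm bar instead
    model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1,
              validation_data=([val.userId, val.movieId], val.rating),
              verbose=0, callbacks=[TQDMNotebookCallback()])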

All the best

@jeremy Thanks for the reply, this is exactly what I was looking for. I agree that the output dimensionality remains unchanged and that the model should mostly handle the new user.

I have a very stupid question. I trained the nn model and played around with nn.predict (results below):

for i in range(0,10):
    print(nn.predict([np.array([i]), np.array([20])]))

[[ 3.2297]]
[[ 3.2297]]
[[ 3.2297]]
[[ 3.2297]]
[[ 3.3086]]
[[ 3.084]]
[[ 3.2339]]
[[ 3.4119]]
[[ 3.0601]]
[[ 3.2256]]

Now, from the ratings dataframe, I can see that UserId 0 rated MovieId 20 as 5.0, but the prediction is fairly low!
Also, UserIds 0 to 3 always get the same prediction. I don't understand that, and it's hard to believe that all 50 latent factors for those 4 users are exactly the same.

Please help me, I’m definitely missing something.
Note: our ratings tables may differ since I used the 10M dataset.

I often see 1x1 convolutions in modern ANN architectures. Can anyone explain them (even better if with an Excel spreadsheet)? I can't get my head around them, and the explanations I've found so far aren't clear on how they decrease the number of filters.

1x1 convolutions are simply point-wise linear combinations, and the number of output filters can be chosen (just like with any size filter), so you can choose to have fewer output filters than input channels. There is a thread about it here.
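
A tiny numpy sketch might make that concrete (the shapes and values here are made up): a 1x1 convolution touches each pixel independently and just mixes its channels with a small weight matrix, so the number of output filters is whatever size you give that matrix.

    import numpy as np

    h, w, c_in, c_out = 8, 8, 64, 16        # spatial size, input channels, output filters
    x = np.random.randn(h, w, c_in)         # one stack of activation maps
    w1x1 = np.random.randn(c_in, c_out)     # the 1x1 kernel: one weight per (input, output) channel pair

    # per-pixel linear combination of channels: (h*w, c_in) @ (c_in, c_out)
    out = x.reshape(-1, c_in).dot(w1x1).reshape(h, w, c_out)
    print(out.shape)                        # (8, 8, 16) -- same spatial size, fewer filters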

You can also do:
model.predict([np.array([212]), np.array([49])])

assuming you set up your model to accept the user id first, then the movie.

I believe you could just add this as an additional input in one of the later layers. If you can’t figure out how to add the input later, you could just use the embeddings as inputs into a new neural net model that also has your extra data as inputs…

Although you usually see factorization machines used to merge CF with additional data…
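
Here is one (untested) way to sketch the "additional input" idea in the Keras 1 style the notebook uses. The sizes and the name n_extra are made up; the point is just that the one-hot demographic features come in as a third input and get concatenated with the two embedding vectors:

    from keras.layers import Input, Embedding, Flatten, Dense, merge
    from keras.models import Model
    from keras.optimizers import Adam

    n_users, n_movies, n_factors = 671, 9066, 50   # example sizes
    n_extra = 10                                   # e.g. one-hot gender/age buckets (hypothetical)

    user_in = Input(shape=(1,), dtype='int64', name='user_in')
    movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
    extra_in = Input(shape=(n_extra,), name='extra_in')   # the demographic features

    u = Flatten()(Embedding(n_users, n_factors, input_length=1)(user_in))
    m = Flatten()(Embedding(n_movies, n_factors, input_length=1)(movie_in))

    x = merge([u, m, extra_in], mode='concat')  # latent factors + demographics, side by side
    x = Dense(70, activation='relu')(x)
    x = Dense(1)(x)
    model = Model([user_in, movie_in, extra_in], x)
    model.compile(Adam(0.001), loss='mse')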

Hi all,

I was just watching the beginning of the Lesson 4 video, where Jeremy explains convolutions using Excel (which I found to be a really helpful visualization, thanks!). I have a question about the tensors that are applied between the first two convolutional layers.

I can see from the Excel equation that each 3x3 matrix of each tensor (I don’t know the correct terminology here so I’m making it up) is applied to one matrix from the previous layer. So for example, the “top filter” in filter 1 is applied to the “top matrix” output from layer 1, and the “bottom filter” in filter 1 is applied to the “bottom matrix” output from layer 1, and the two results are added together. The same goes for filter 2. My question is, why aren’t the “top” and “bottom” filters each applied to both the “top” and “bottom” matrices? Are there ever architectures where that is the case? What difference would it make in the output?
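
To make sure I'm describing the calculation correctly, here is a little numpy version of what I think the spreadsheet does for one output matrix of filter 1 (random numbers standing in for the real activations and weights):

    import numpy as np
    from scipy.signal import correlate2d

    top_map = np.random.randn(6, 6)       # "top" output matrix from layer 1
    bottom_map = np.random.randn(6, 6)    # "bottom" output matrix from layer 1

    top_slice = np.random.randn(3, 3)     # the 3x3 piece of filter 1 applied to the top map
    bottom_slice = np.random.randn(3, 3)  # the 3x3 piece of filter 1 applied to the bottom map

    # each slice only ever sees "its" matrix, and the two results are summed
    out_filter1 = (correlate2d(top_map, top_slice, mode='valid')
                   + correlate2d(bottom_map, bottom_slice, mode='valid'))
    print(out_filter1.shape)              # (4, 4)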

Thanks and I hope that made sense!

Batch Normalisation vs Dropout

According to the batch normalization paper https://arxiv.org/pdf/1502.03167.pdf

When we use batch normalization, Dropout is not needed and should not be used if we want to get the maximum benefit from batch norm.

From the paper (section 4.2.1, page 6): "Batch Normalization fulfills some of the same goals as Dropout. Removing Dropout from Modified BN-Inception speeds up training, without increasing overfitting."

On the other hand, from the lesson 4 video it looks like adding Dropout in addition to Batchnorm makes an improvement.
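
To be concrete, the kind of block I'm asking about looks roughly like this (Keras 1 style, sizes made up), where the question is whether the Dropout line still earns its keep once BatchNormalization is in place:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras.layers.normalization import BatchNormalization

    model = Sequential([
        Dense(256, activation='relu', input_shape=(784,)),
        BatchNormalization(),
        Dropout(0.5),        # keep it (lesson 4) or remove it (BN paper, section 4.2.1)?
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')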

@jeremy

I would like to reproduce the results in the lesson for State Farm, so I would like to know exactly which drivers are used for the validation set. I tried to make my own split but got bad results, and I could not improve them with any of the approaches suggested by @jeremy.

It would be better to add the code for the split to the State Farm sample notebook, for example.

thank you!

Am I right in assuming that predict for the movie should be passed the user INDEX and movie INDEX, not the user ID and movie ID? Early on, the ratings are converted from UID to index to make the embedding easier. So to reverse this, either the numbers we pass to predict are indices and have to be converted back to a UID (then to a string for the movie) to get human-readable results, or, if the code in the notebook is correct, we must take the UIDs and convert them back to indices before we call predict.

I believe it is accidentally working simply because most people are putting in hardcoded numbers that happen to be in both datasets. Am I incorrect in this?
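
In other words, if I'm right, a call to predict should first go through the notebook's mapping dictionaries, something like this (userid2idx and movieid2idx are the dicts built earlier in the notebook):

    import numpy as np

    user_uid, movie_uid = 1, 110
    pred = model.predict([np.array([userid2idx[user_uid]]),
                          np.array([movieid2idx[movie_uid]])])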

I have a question about the Collaborative filtering part.
The model in Excel calculates the score as: dot(two matrices) + bias_user + bias_movie.
I believe that the first Keras model does exactly the same calculation as the Excel sheet (correct me if I am wrong).
How does the architecture change, and how are the movie ratings calculated, when we add a Dense layer to the Keras model?
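
For reference, my (possibly faulty) understanding of the second model is roughly the following: instead of dot(u, m) + biases, the two embedding vectors are concatenated and a Dense layer learns how to combine them into a rating. Layer sizes here are from memory, so treat them as illustrative:

    from keras.layers import Input, Embedding, Flatten, Dense, Dropout, merge
    from keras.models import Model
    from keras.optimizers import Adam

    n_users, n_movies, n_factors = 671, 9066, 50   # example sizes

    user_in = Input(shape=(1,), dtype='int64')
    movie_in = Input(shape=(1,), dtype='int64')
    u = Flatten()(Embedding(n_users, n_factors, input_length=1)(user_in))
    m = Flatten()(Embedding(n_movies, n_factors, input_length=1)(movie_in))

    x = merge([u, m], mode='concat')   # concatenate instead of mode='dot'
    x = Dropout(0.5)(x)
    x = Dense(70, activation='relu')(x)
    x = Dense(1)(x)                    # the rating now comes out of the network, not a dot product
    nn = Model([user_in, movie_in], x)
    nn.compile(Adam(0.001), loss='mse')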

@rachel @jeremy

I'm really confused about whether this is working or not. Here are the important lines, cut from the notebook:

ratings = pd.read_csv(path+'ratings.csv')  # loads ratings.csv, which is UserUID, MovieUID, Rating, timestamp
users = ratings.userId.unique() # creates a unique listing of users
movies = ratings.movieId.unique() # creates a unique listing of movies
userid2idx = {o:i for i,o in enumerate(users)} # creates a mapping from user id to index in users
movieid2idx = {o:i for i,o in enumerate(movies)} # creates a mapping from movie id to index in movies

### CRITICAL ###
# Converts the ratings.movieId from being UID based to index based
ratings.movieId = ratings.movieId.apply(lambda x: movieid2idx[x]) 
# Converts the ratings.userId from being UID based to index based
ratings.userId = ratings.userId.apply(lambda x: userid2idx[x]) 

Then when we train:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, epochs=1,
          validation_data=([val.userId, val.movieId], val.rating))

So we train on the indices, which means any call to predict MUST use the index.

However, I wrote this function (I've renamed some of the variables in my own notebook to be a bit more self-explanatory):

def predict( user_index = 0, movie_index = 0, movie_uid = None, user_uid = None, movie_name = None ):
    if movie_name != None :
        movie_uid = movie_name_to_uid[movie_name]
        movie_index = movie_uid_to_idx[movie_uid]
        
    if movie_uid != None :
        movie_index = movie_uid_to_idx[movie_uid]
        
    if user_uid != None :
        user_index = user_uid_to_idx[user_uid]
        
    result = model.predict( [np.array([user_index]), np.array([movie_index])])
    # but we want to translate that into a movie id...
    print ( "Best Rating for user {} on movie {} is {}".format( user_idx_to_uid[user_index], MovieIndexToName(movie_index), result[0] ))

However, when I then call my function:

predict( user_index = 0, movie_index = 27 )
predict( user_uid = 1, movie_name = "Braveheart (1995)")
predict( user_uid = 1, movie_uid = 110 )

# This represents what the notebook is really asking for
predict( user_index = 3, movie_index = 6 )
# This represents what it intended to ask for
predict( user_uid = 3, movie_uid = 6 )

I get this result:

# From the CSV ratings file, UserID 1 with Movie ID 110 should score about a 1.0 rating.
Best Rating for user 1 on movie Braveheart (1995) is [ 2.84351969]
Best Rating for user 1 on movie Braveheart (1995) is [ 2.84351969]
Best Rating for user 1 on movie Braveheart (1995) is [ 2.84351969]

# The following result is closest to the notebook example of model.predict([np.array([3]), np.array([6])])
Best Rating for user 4 on movie Ben-Hur (1959) is [ 4.69232702]
# This was what it intended to ask
Best Rating for user 3 on movie Heat (1995) is [ 3.62401152]

Am I doing this right? Maybe the model is just not that good. I've been disappointed by its predictive powers.

In lesson 4, the neural net embedding uses only UserId and MovieId information, so in order to generate predictions there has to be some existing rating for the movie.

What if the test set has some new movies and new users that are not included in the train set? In that case, how can we use the Embedding layer?

Hi all, I’m totally new to ML and really getting a lot out of the course so far - thanks so much for putting it all online for free.

My issue: I'm going through collaborative filtering with MovieLens and cannot replicate the level of loss Jeremy displayed for the simple dot product model without bias. The lowest I can get is a validation MSE loss of around 3.4, compared to Jeremy's 1.45. I've played around with the learning rate in a similar manner to Jeremy but it just won't budge below that loss. I assume that means something has gone wrong in the way I've set it up! The main difference I can see is that I'm using a TensorFlow backend on a shared server. Might that have something to do with it?

You can find my workbook here - any help would be appreciated. Thanks!

@RiB ,

I was wondering if you ever figured out the discrepancy between your own calculations of the loss and the results obtained through the notebook? I am in the same boat, in that the numbers I get for the loss on the validation set are nowhere near what @Jeremy found.

My results match yours: 2.5 vs 1.4 and 1.14 vs 0.79
keras.__version__ == '1.2.2'

I did go over your post (May 19) about the regularization parameter impacting the MSE results in Keras 2, but unless the same applies to version 1.2.2, I am not quite sure what else would make sense.

To be honest, I don’t even understand why the regularization parameter would impact the MSE calculation on the validation set.
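
My only guess so far (unverified) is that Keras folds the penalty into whatever it reports as "loss", on the validation set as well as on the training set, i.e. something like

    reported_loss = mean((y - y_hat)**2) + reg_factor * sum_of_squared_embedding_weights

so a larger regularizer would inflate val_loss even if the underlying MSE were identical.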

Do you have a link to your Keras support question so I can follow up?

Thanks,
N.

@jamest ,

From what I could tell from your notebook, it seems to me that the learning rate for the optimizer is set too high when you first compile the model:

model.compile(loss='mse', optimizer=Adam(0.01), metrics=['accuracy'])

whereas @Jeremy first sets the lr to 0.001 to run 1 epoch, then does one pass at 0.01 (3 epochs) and another at 0.001 (6 epochs):

model.compile(Adam(0.001), loss='mse')
...
model.optimizer.lr=0.01
...
model.optimizer.lr=0.001

I think by taking too big of a step initially you’re not able to find the proper latent factors.

HTH,
N.

I went over the details of the notebook and what is shown during the class. There are several differences that I noted:

  1. For the Dot Product section, the weight regularizer in the notebook is 1e-4 whereas @Jeremy is using 1e-5

    u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)

  2. The first iterations consist of 6 epochs with that regularizer value, whereas the notebook only goes through 1 epoch and then changes the model’s learning rate for a new set of iterations.

By following what's done during the class I get 1.399 after the first 6 epochs. The iterations provided in the notebook after that (once the lr is modified) are fairly redundant, as we can see that the model is heavily overfitting.

Train on 79766 samples, validate on 20238 samples
Epoch 1/6
79766/79766 [==============================] - 8s - loss: 3.6372 - val_loss: 2.6259
Epoch 2/6
79766/79766 [==============================] - 7s - loss: 1.8852 - val_loss: 1.7960
Epoch 3/6
79766/79766 [==============================] - 7s - loss: 1.2993 - val_loss: 1.5317
Epoch 4/6
79766/79766 [==============================] - 6s - loss: 1.0723 - val_loss: 1.4462
Epoch 5/6
79766/79766 [==============================] - 7s - loss: 0.9653 - val_loss: 1.4056
Epoch 6/6
79766/79766 [==============================] - 7s - loss: 0.9012 - val_loss: 1.3995
  3. For the Bias section it's a little bit of the same: the weight regularizer is set to 1e-5 in the class (vs. 1e-4 in the notebook). This gives me 1.17 after 6 epochs vs. 1.1185 in the class. This is much better than the 1.88 I get with the notebook's factor of 1e-4:

    Bias with reg factor of 1e-4
    Train on 79766 samples, validate on 20238 samples
    Epoch 1/6
    79766/79766 [==============================] - 5s - loss: 8.8063 - val_loss: 3.5732
    Epoch 2/6
    79766/79766 [==============================] - 8s - loss: 2.5907 - val_loss: 2.3323
    Epoch 3/6
    79766/79766 [==============================] - 7s - loss: 1.9997 - val_loss: 2.1234
    Epoch 4/6
    79766/79766 [==============================] - 7s - loss: 1.8357 - val_loss: 2.0305
    Epoch 5/6
    79766/79766 [==============================] - 7s - loss: 1.7387 - val_loss: 1.9515
    Epoch 6/6
    79766/79766 [==============================] - 7s - loss: 1.6577 - val_loss: 1.8832

I still can't explain the smaller discrepancies, but those numbers look a lot closer to the ones observed during the class than the ones obtained with a regularizer factor of 1e-4.

HTH,
N.