Lesson 4 discussion

(Emilio) #65

I am struggling a bit with the concept of bias.
If we take as an example Jeremy’s recommendation engine in Excel, what does it mean to add bias?
I do understand it’s about adding a row/column that represents some sort of “importance” of a user or of a movie, but then we train (and therefore change) those values during the optimization procedure.

So my questions are:

  • What is the difference between a latent factor and a bias, beyond their initial values (respectively random and based on some sort of importance score)?
  • Is the different initialization step important enough to have an impact on the final values for those factors, or is there something I am missing?

(Even Oldridge) #66

Something must be up with your implementation or setup. I was able to get results that beat Jeremy’s in-class results (0.76 consistently) after playing around with the hyperparameters.

(Eric Perbos-Brinck) #67

In Collaborative Filtering, @Jeremy explained that “adding meta-data to collaborative filtering doesn’t improve it at all” (1:38:00 in the video).

Later (1:47:00 in the video), Jeremy adds two bias terms, described as “How good/popular is this movie” and “How movie-enthusiastic is this user”.

(1) Where’s the info coming from, and (2) isn’t that “meta-data”?


(Jeremy Howard) #68

The bias term is simply the same as adding a constant ‘1’ to every row. E.g. see http://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks
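To make the “constant ‘1’” point concrete, here is a tiny numpy sketch (with made-up numbers) showing that a learned bias weight is equivalent to appending a constant-1 column to the inputs:

```python
import numpy as np

# Toy linear model y = X @ w + b. Adding a bias is the same as
# appending a constant '1' feature to every row and learning one
# extra weight for it.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
w = np.array([0.5, -1.0, 2.0])
b = 0.7

y_with_bias = X @ w + b

# Same model, expressed via an extra '1' column:
X1 = np.hstack([X, np.ones((4, 1))])
w1 = np.append(w, b)
y_via_ones = X1 @ w1

assert np.allclose(y_with_bias, y_via_ones)
```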

(Romano) #69

Thanks, @Even. Are you using Keras 2? Did you use regularization in your embeddings?

I think it may be worth sharing with everyone here that, after having raised an issue with Keras about the MSE computations, I found out that if one adds a regularization parameter - as I had done in my implementation - the displayed loss metric includes the loss from regularization. I presume this is new in Keras 2. This explains why direct MSE evaluation did not match Keras MSE callbacks, nor model.evaluate for that matter.
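A toy illustration of the mismatch (plain numpy, not Keras itself; the numbers are made up): with an L2 penalty, the reported training loss is MSE plus the regularization term, so it will always exceed a plain MSE you compute yourself on the same predictions.

```python
import numpy as np

# Hypothetical predictions, targets, and weights
y_true = np.array([3.0, 4.0, 5.0])
y_pred = np.array([2.5, 4.5, 4.0])
weights = np.array([0.3, -0.2, 0.1])
l2 = 0.01  # regularization strength

mse = np.mean((y_true - y_pred) ** 2)
# What a framework may report as "loss" when L2 regularization is on:
reported_loss = mse + l2 * np.sum(weights ** 2)

print(mse, reported_loss)  # reported_loss > mse whenever l2 > 0
```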

@erlapi One crucial difference about the bias term with respect to the latent factors is that the bias does not undergo dot multiplication. It is just a constant term that you add to a user rating (to reflect user “generosity” in rating movies) or to a movie rating (to reflect the underlying quality of the movie). Both parameters, as you mention, are learned from data, but their impact is only additive. So to answer your questions:

  • Latent factors get multiplied (you can think of a user latent factor as, e.g. “how much a given user likes romantic scenes in a movie” and the corresponding movie latent factor as “how many romantic scenes there are in that movie”, so you can see how multiplying them together makes sense), while biases do not.
  • No, initialization should not really matter. If it does, one is doing it wrong.

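A tiny numpy sketch of the two roles (all numbers made up): factors are multiplied together, biases are simply added on top.

```python
import numpy as np

# Predicted rating = dot(user_factors, movie_factors)
#                    + user_bias + movie_bias
user_factors = np.array([0.9, 0.1])   # e.g. likes romance, dislikes action
movie_factors = np.array([0.8, 0.3])  # lots of romance, little action
user_bias = 0.2    # a slightly generous rater
movie_bias = 0.5   # a generally well-liked movie

pred = user_factors @ movie_factors + user_bias + movie_bias
print(pred)  # 0.72 + 0.03 + 0.2 + 0.5 = 1.45
```
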
I hope this helps.

(Emilio) #70

This makes a lot of sense.

(WG) #71

From lesson4.ipynb …

Why does x = merge([u, m], mode='dot') have an output shape of (None, 1, 1) instead of just (None, 1) ?

From the docs, I expected the output to be a tensor of shape (batch_size, 1). So what is that extra 1 in there?

(Pri) #72

Based on the training, aren’t the latent factors for that movie and user configured so that the rating is 0? How could we get the correct rating? I’m confused because, from what I understand, the missing movie ratings were treated as 0, so all latent factors would be optimized accordingly. How, then, will the rating for a missing movie change?

(Hui Qi) #73

I discussed the pseudo-labelling technique with a friend in the academic NLP field, and he said that, at least when you intend to publish papers, the test dataset should be used only for testing, with absolutely no other usage. Rethinking pseudo-labelling, I feel it takes some advantage of the test dataset (it includes information from the test data in the model), which may lead to overfitting on it. What do you think about this? Thanks for the fabulous and clear explanation of deep learning. @jeremy

(Emilio) #74

One more question on embeddings and bias.

In lesson 4 notebook, we first built a dot + bias ‘ad hoc’ architecture to solve the recommender system problem.
Jeremy showed how the bias term is key to improving results in this case.

Then we introduced the NN showing it even further improves the ad hoc architecture.

However, I don’t see a bias term in this case.

Is it because the NN is smart enough to derive one?
Would introducing a bias term improve our NN solution even further?

(Jimmy Jose) #76

Hello Guys

Got a small doubt. Don’t know if it is stupid or not.

So I was running the lesson 4 code, and since there is no way to see accuracy in the model’s compile section, I added metrics=['accuracy']. But now when I fit the model I am getting a weird error.

Here is the error-

ImportError: ('The following error happened while compiling the node', Elemwise{Composite{EQ(i0, RoundHalfToEven(i1))}}(flatten_6_target, Reshape{2}.0), '\n', 'DLL load failed: The specified procedure could not be found.', '[Elemwise{Composite{EQ(i0, RoundHalfToEven(i1))}}(flatten_6_target, <TensorType(float32, matrix)>)]')

Here is the code which I used-

x = merge([u, m], mode='dot')
x = Flatten()(x)
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse', metrics=['accuracy'])
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1,
          validation_data=([val.userId, val.movieId], val.rating))

Am I missing anything here?

(Ben) #77

I’ve tried to get pseudo-labelling working, following the start of lesson 4, for the state farm competition, but it is making things 10x worse!

I have a very simple model on top of the vgg convolution layers. After I’ve trained it using the training and validation set I created I get decent results:

Train on 20180 samples, validate on 2244 samples
 Epoch 1/1
20180/20180 [==============================] - 34s - loss: 0.0119 - acc: 0.9966 - val_loss: 0.0230 - val_acc: 0.9938

which gives me a score on kaggle of ~0.67. I was pretty happy with that, but I wanted to do better, so tried pseudo labelling. I first tried it with the validation set so I could check that the pseudo labels were in agreement with the real labels. They generally were, so I was comfortable that things were working well. This didn’t improve my kaggle score (made it slightly worse), but I just put that down to over fitting.

Then I wanted to pseudo label the test set. The code that does this is:

def train_model_data_and_pseudo_labelled_test(self, num_pseudo=-1, learning_rate=0.0001, epochs=1, batch_size=64):
    for iter in range(epochs):
        # Do it this way so that the pseudo labels are updated every time
        # Side effect: epoch depend features of optimiser cannot be used
        data_test_ftr = np.copy(self.data_test[0])
        pseudo_ftr = data_test_ftr[:num_pseudo]
        pseudo_labels = self.top_model.predict(pseudo_ftr, batch_size=batch_size)

        all_pseudo_ftr = np.concatenate([self.data_train[0], pseudo_ftr])
        all_pseudo_lbl = np.concatenate([self.data_train[1], pseudo_labels])

        self.top_model.fit(all_pseudo_ftr, all_pseudo_lbl, validation_data=self.data_validate, nb_epoch=1, batch_size=batch_size)

self.data_test[0] is the precalculated (on vgg) features of the test set and self.data_train is the features and labels of the precalculated training set. As an aside, I pick a random part of the test set to pseudo label and use in each epoch, is that the right way of doing things?

Anyway, I used this function to improve my CNN. I had 1 pseudo-labelled image per 2 true-labelled images, which I read somewhere was a good ratio to use.

Train on 30180 samples, validate on 2244 samples
Epoch 1/1
30180/30180 [==============================] - 51s - loss: 0.1901 - acc: 0.9516 - val_loss: 0.0652 - val_acc: 0.9769

Which is a bit worse than before, but I put that down to the previous result being overfitted. I ran this for a lot more epochs, thinking that it would need to cycle a couple of times through the whole test set with pseudo labels for it to work. 15 epochs later I had

Train on 30180 samples, validate on 2244 samples
Epoch 1/1
30180/30180 [==============================] - 51s - loss: 0.0572 - acc: 0.9847 - val_loss: 0.0094 - val_acc: 0.9982

which I was very happy with. However, when I submitted this to kaggle I got a score of over 8. That’s more than 10x worse than I had before!!

I have no idea what went wrong. I spent a few hours trying slightly different things, but it always gave similarly bad results. The training and validation results are excellent, but the kaggle score is basically worse than random. Any help is much appreciated.


Hello, Guys!

Great and valuable course. Thank you very much for sharing your knowledge!

I have a doubt about the movielens Neural Net model.

Imagine that I also have demographic info (gender, age, etc.) about each user. Maybe this info could help enhance the model predictions.

How could the one-hot encoded new features of every user be merged with the latent factors previously obtained through the embeddings of users and movies?

(Pranjal Yadav) #80

I ran into a similar problem, and the sad part was that I was working with the 100M dataset. It seems the output stream starts throwing back too much, and Jupyter notebook doesn’t handle it really well. In Firefox it hangs and displays an ‘Unresponsive script’ warning. I suggest using keras_tqdm for all the notebook callbacks. It worked for me!

All the best

(Pranjal Yadav) #81

@jeremy Thanks for the reply, I was looking for this only. I agree that the output dimensionality remains unchanged and mostly the model should handle the new user.

I have a very stupid question. I trained the nn model and played around with nn.predict (results below):

for i in range(0,10):
    print(nn.predict([np.array([i]), np.array([20])]))

[[ 3.2297]]
[[ 3.2297]]
[[ 3.2297]]
[[ 3.2297]]
[[ 3.3086]]
[[ 3.084]]
[[ 3.2339]]
[[ 3.4119]]
[[ 3.0601]]
[[ 3.2256]]

Now, from the ratings dataframe, I can see that UserId 0 rated MovieId 20 as 5.0, but the prediction is fairly low!
Also, UserIds 0 to 3 always get the same prediction. I don’t understand that, and it’s hard to believe that all 50 latent factors for those 4 users are exactly the same.

Please help me, I’m definitely missing something.
Note: Our ratings table may be different since I used 10M dataset.


I often see 1x1 convolutions in modern ANN architectures. Can anyone explain them (even better if with an Excel spreadsheet)? I can’t get my head around them, and the explanations I’ve found so far aren’t clear on how they decrease the number of filters.


1x1 convolutions are simply point-wise linear combinations, and the number of output filters can be chosen (just like with any size filter), so you can choose to have fewer output filters than input channels. There is a thread about it here.
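A minimal numpy sketch of that idea (sizes here are made up): at every spatial position, a 1x1 convolution linearly combines the input channels into the chosen number of output channels, so picking fewer output channels than input channels reduces the number of filters.

```python
import numpy as np

# Input feature map: height x width x input channels
H, W, C_in, C_out = 4, 4, 8, 3
x = np.random.default_rng(0).normal(size=(H, W, C_in))
# A 1x1 kernel is just a C_in x C_out weight matrix
kernel = np.random.default_rng(1).normal(size=(C_in, C_out))

# A 1x1 convolution is a matrix multiply applied at every pixel:
out = (x.reshape(-1, C_in) @ kernel).reshape(H, W, C_out)

print(out.shape)  # (4, 4, 3): same spatial size, fewer channels
```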

(Robert William Whelan) #85

you can also do:
model.predict([np.array([212]), np.array([49])])

assuming you set up your model to accept the user id first, then the movie.

(alex) #86

I believe you could just add this as an additional input in one of the later layers. If you can’t figure out how to add the input later, you could just use the embeddings as inputs into a new neural net model that also has your extra data as inputs…
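A hedged numpy sketch of the concatenation idea (the names, embedding size, and one-hot layouts below are all made up for illustration): the extra one-hot features can simply be stacked alongside the learned embeddings to form the input of the dense layers.

```python
import numpy as np

# Learned embeddings for one (user, movie) pair
n_factors = 50
user_emb = np.random.default_rng(0).normal(size=(n_factors,))
movie_emb = np.random.default_rng(1).normal(size=(n_factors,))

# One-hot encoded demographic metadata for the same user
gender_onehot = np.array([1.0, 0.0])           # e.g. two categories
age_bucket_onehot = np.array([0.0, 1.0, 0.0])  # e.g. three age bands

# Concatenate everything into a single input vector for the NN
nn_input = np.concatenate([user_emb, movie_emb,
                           gender_onehot, age_bucket_onehot])
print(nn_input.shape)  # (105,) = 50 + 50 + 2 + 3
```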

(alex) #87

Although you usually see factorization machines used to merge CF with additional data…