Lesson 4 discussion

(Emilio) #65

I am struggling a bit with the concept of bias.
If we take as an example Jeremy’s recommendation engine in Excel, what does it mean to add bias?
I do understand it’s about adding a row/column that represents some sort of “importance” of a user or of a movie, but then we train (and therefore change) those values during the optimization procedure.

So my questions are:

  • What is the difference between a latent factor and a bias, beyond their initial values (respectively random and based on some sort of importance score)?
  • Is the different initialization step important enough to have an impact on the final values for those factors, or is there something I am missing?

(Even Oldridge) #66

Something must be up with your implementation or setup. I was able to get results that beat Jeremy’s in-class results (0.76 consistently) after playing around with the hyperparameters.

(Eric Perbos-Brinck) #67

In Collaborative Filtering, @Jeremy explained that “adding meta-data to collaborative filtering doesn’t improve it at all” (1:38:00 in the video).

Later (1:47:00 in the video), Jeremy adds two bias terms, described as “How good/popular is this movie” and “How movie-enthusiastic is this user”.

(1) Where’s the info coming from, and (2) isn’t that “meta-data”?


(Jeremy Howard) #68

The bias term is simply the same as adding a constant ‘1’ to every row. E.g. see http://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks
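To make the “constant ‘1’” point concrete, here is a tiny numpy sketch (with made-up numbers) showing that a learned bias weight is equivalent to appending a constant-1 column to the inputs:

```python
import numpy as np

# Toy linear model y = X @ w + b. Adding a bias is the same as
# appending a constant '1' feature to every row and learning one
# extra weight for it.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
w = np.array([0.5, -1.0, 2.0])
b = 0.7

y_with_bias = X @ w + b

# Same model, expressed via an extra '1' column:
X1 = np.hstack([X, np.ones((4, 1))])
w1 = np.append(w, b)
y_via_ones = X1 @ w1

assert np.allclose(y_with_bias, y_via_ones)
```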

(Romano) #69

Thanks, @Even. Are you using Keras 2? Did you use regularization in your embeddings?

I think it may be worth sharing with everyone here that, after having raised an issue with Keras about the MSE computations, I found out that if one adds a regularization parameter - as I had done in my implementation - the displayed loss metric includes the loss from regularization. I presume this is new in Keras 2. This explains why direct MSE evaluation did not match Keras MSE callbacks, nor model.evaluate for that matter.
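A toy illustration of the mismatch (plain numpy, not Keras itself; the numbers are made up): with an L2 penalty, the reported training loss is MSE plus the regularization term, so it will always exceed a plain MSE you compute yourself on the same predictions.

```python
import numpy as np

# Hypothetical predictions, targets, and weights
y_true = np.array([3.0, 4.0, 5.0])
y_pred = np.array([2.5, 4.5, 4.0])
weights = np.array([0.3, -0.2, 0.1])
l2 = 0.01  # regularization strength

mse = np.mean((y_true - y_pred) ** 2)
# What a framework may report as "loss" when L2 regularization is on:
reported_loss = mse + l2 * np.sum(weights ** 2)

print(mse, reported_loss)  # reported_loss > mse whenever l2 > 0
```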

@erlapi One crucial difference about the bias term with respect to the latent factors is that the bias does not undergo dot multiplication. It is just a constant term that you add to a user rating (to reflect user “generosity” in rating movies) or to a movie rating (to reflect the underlying quality of the movie). Both parameters, as you mention, are learned from data, but their impact is only additive. So to answer your questions:

  • Latent factors get multiplied (you can think of a user latent factor as, e.g. “how much a given user likes romantic scenes in a movie” and the corresponding movie latent factor as “how many romantic scenes there are in that movie”, so you can see how multiplying them together makes sense), while biases do not.
  • No, initialization should not really matter. If it does, one is doing it wrong.

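A tiny numpy sketch of the two roles (all numbers made up): factors are multiplied together, biases are simply added on top.

```python
import numpy as np

# Predicted rating = dot(user_factors, movie_factors)
#                    + user_bias + movie_bias
user_factors = np.array([0.9, 0.1])   # e.g. likes romance, dislikes action
movie_factors = np.array([0.8, 0.3])  # lots of romance, little action
user_bias = 0.2    # a slightly generous rater
movie_bias = 0.5   # a generally well-liked movie

pred = user_factors @ movie_factors + user_bias + movie_bias
print(pred)  # 0.72 + 0.03 + 0.2 + 0.5 = 1.45
```
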
I hope this helps.

(Emilio) #70

This makes a lot of sense.

(WG) #71

From lesson4.ipynb …

Why does x = merge([u, m], mode='dot') have an output shape of (None, 1, 1) instead of just (None, 1) ?

From the docs, I expected the output to be a tensor of shape (batch_size, 1). So what is that extra 1 in there?

(Pri) #72

Based on the training, aren’t the latent factors for that movie and user configured so that the rating is 0? How could we get the correct rating? I’m confused because, from what I understand, the missing movie ratings were treated as 0, so all latent factors would be optimized accordingly. How, then, will the rating for a missing movie change?

(Hui Qi) #73

I discussed the pseudo-labelling technique with a friend in the academic NLP field, and he said that, at least when you intend to publish papers, the test dataset should be used only for testing, with absolutely no other usage. Rethinking pseudo-labelling, I feel it takes some advantage of the test dataset (it includes information from the test data in the model), which may lead to overfitting on it. What do you think about this? Thanks for the fabulous and clear explanation of deep learning. @jeremy

(Emilio) #74

One more question on embeddings and bias.

In lesson 4 notebook, we first built a dot + bias ‘ad hoc’ architecture to solve the recommender system problem.
Jeremy showed how the bias term is key to improving results in this case.

Then we introduced the NN showing it even further improves the ad hoc architecture.

However, I don’t see a bias term in this case.

Is it because the NN is smart enough to derive one?
Would introducing a bias term improve our NN solution even further?

(Jimmy Jose) #76

Hello Guys

Got a small doubt. Don’t know if it is stupid or not.

So I was running the lesson 4 code, and since there is no way to see accuracy in the model’s compile section, I added metrics=['accuracy']. But now when I fit the model I am getting a weird error.

Here is the error-

ImportError: ('The following error happened while compiling the node', Elemwise{Composite{EQ(i0, RoundHalfToEven(i1))}}(flatten_6_target, Reshape{2}.0), '\n', 'DLL load failed: The specified procedure could not be found.', '[Elemwise{Composite{EQ(i0, RoundHalfToEven(i1))}}(flatten_6_target, <TensorType(float32, matrix)>)]')

Here is the code which I used-

x = merge([u, m], mode='dot')
x = Flatten()(x)
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse', metrics=['accuracy'])
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1,
          validation_data=([val.userId, val.movieId], val.rating))

Am I missing anything here?

(Ben) #77

I’ve tried to get pseudo-labelling working, following the start of lesson 4, for the state farm competition, but it is making things 10x worse!

I have a very simple model on top of the vgg convolution layers. After I’ve trained it using the training and validation set I created I get decent results:

Train on 20180 samples, validate on 2244 samples
 Epoch 1/1
20180/20180 [==============================] - 34s - loss: 0.0119 - acc: 0.9966 - val_loss: 0.0230 - val_acc: 0.9938

which gives me a score on kaggle of ~0.67. I was pretty happy with that, but I wanted to do better, so tried pseudo labelling. I first tried it with the validation set so I could check that the pseudo labels were in agreement with the real labels. They generally were, so I was comfortable that things were working well. This didn’t improve my kaggle score (made it slightly worse), but I just put that down to over fitting.

Then I wanted to pseudo label the test set. The code that does this is:

def train_model_data_and_pseudo_labelled_test(self, num_pseudo=-1, learning_rate=0.0001, epochs=1, batch_size=64):
    for iter in range(epochs):
        # Do it this way so that the pseudo labels are updated every time
        # Side effect: epoch depend features of optimiser cannot be used
        data_test_ftr = np.copy(self.data_test[0])
        pseudo_ftr = data_test_ftr[:num_pseudo]
        pseudo_labels = self.top_model.predict(pseudo_ftr, batch_size=batch_size)

        all_pseudo_ftr = np.concatenate([self.data_train[0], pseudo_ftr])
        all_pseudo_lbl = np.concatenate([self.data_train[1], pseudo_labels])

        self.top_model.fit(all_pseudo_ftr, all_pseudo_lbl, validation_data=self.data_validate, nb_epoch=1, batch_size=batch_size)

self.data_test[0] is the precalculated (on vgg) features of the test set and self.data_train is the features and labels of the precalculated training set. As an aside, I pick a random part of the test set to pseudo label and use in each epoch, is that the right way of doing things?

Anyway, I used this function to improve my CNN. I had 1 pseudo-labelled image per 2 true-labelled images, which I read somewhere was a good ratio to use.

Train on 30180 samples, validate on 2244 samples
Epoch 1/1
30180/30180 [==============================] - 51s - loss: 0.1901 - acc: 0.9516 - val_loss: 0.0652 - val_acc: 0.9769

Which is a bit worse than before, but I put that down to the previous result being overfitted. I ran this for a lot more epochs, thinking that it would need to cycle a couple of times through the whole test set with pseudo labels for it to work. 15 epochs later I had

Train on 30180 samples, validate on 2244 samples
Epoch 1/1
30180/30180 [==============================] - 51s - loss: 0.0572 - acc: 0.9847 - val_loss: 0.0094 - val_acc: 0.9982

which I was very happy with. However, when I submitted this to kaggle I got a score of over 8. That’s more than 10x worse than I had before!!

I have no idea what went wrong. I spent a few hours trying slightly different things, but it always gave similarly bad results. The training and validation results are excellent, but the kaggle score is basically worse than random. Any help is much appreciated.


Hello, Guys!

Great and valuable course. Thank you very much for sharing your knowledge!

I have a doubt about the movielens Neural Net model.

Imagine that I also have demographic info (gender, age, etc.) about each user. Maybe this info could help enhance the model predictions.

How could the one-hot encoded new features of every user be merged with the latent factors previously obtained through the embeddings of users and movies?

(Pranjal Yadav) #80

I ran into a similar problem, and the sad part was that I was working with the 100M dataset. It seems the output stream starts throwing back too much, and Jupyter notebook doesn’t handle it really well. In Firefox it hangs and displays an ‘Unresponsive script’ warning. I suggest using keras_tqdm for all the notebook callbacks. It worked for me!

All the best

(Pranjal Yadav) #81

@jeremy Thanks for the reply, I was looking for this only. I agree that the output dimensionality remains unchanged and mostly the model should handle the new user.

I have a very stupid question. I trained the nn model and played around with nn.predict (results below):

for i in range(0,10):
    print(nn.predict([np.array([i]), np.array([20])]))

[[ 3.2297]]
[[ 3.2297]]
[[ 3.2297]]
[[ 3.2297]]
[[ 3.3086]]
[[ 3.084]]
[[ 3.2339]]
[[ 3.4119]]
[[ 3.0601]]
[[ 3.2256]]

Now, from the ratings dataframe, I can see that UserId 0 rated MovieId 20 as 5.0, but the prediction is fairly low!
Also, UserIds 0 to 3 always get the same prediction. I don’t understand that, and it’s hard to believe that all 50 latent factors for those 4 users are exactly the same.

Please help me, I’m definitely missing something.
Note: Our ratings table may be different since I used 10M dataset.


I often see 1x1 convolutions in modern ANN architectures. Can anyone explain them (even better if with an Excel spreadsheet)? I can’t get my head around them, and the explanations I’ve found so far aren’t clear on how they decrease the number of filters.


1x1 convolutions are simply point-wise linear combinations, and the number of output filters can be chosen (just like with any size filter), so you can choose to have fewer output filters than input channels. There is a thread about it here.
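A minimal numpy sketch of that idea (sizes here are made up): at every spatial position, a 1x1 convolution linearly combines the input channels into the chosen number of output channels, so picking fewer output channels than input channels reduces the number of filters.

```python
import numpy as np

# Input feature map: height x width x input channels
H, W, C_in, C_out = 4, 4, 8, 3
x = np.random.default_rng(0).normal(size=(H, W, C_in))
# A 1x1 kernel is just a C_in x C_out weight matrix
kernel = np.random.default_rng(1).normal(size=(C_in, C_out))

# A 1x1 convolution is a matrix multiply applied at every pixel:
out = (x.reshape(-1, C_in) @ kernel).reshape(H, W, C_out)

print(out.shape)  # (4, 4, 3): same spatial size, fewer channels
```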

(Robert William Whelan) #85

you can also do:
model.predict([np.array([212]), np.array([49])])

assuming you set up your model to accept the user id first, then the movie.

(alex) #86

I believe you could just add this as an additional input in one of the later layers. If you can’t figure out how to add the input later, you could just use the embeddings as inputs into a new neural net model that also has your extra data as inputs…
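A hedged numpy sketch of the concatenation idea (the names, embedding size, and one-hot layouts below are all made up for illustration): the extra one-hot features can simply be stacked alongside the learned embeddings to form the input of the dense layers.

```python
import numpy as np

# Learned embeddings for one (user, movie) pair
n_factors = 50
user_emb = np.random.default_rng(0).normal(size=(n_factors,))
movie_emb = np.random.default_rng(1).normal(size=(n_factors,))

# One-hot encoded demographic metadata for the same user
gender_onehot = np.array([1.0, 0.0])           # e.g. two categories
age_bucket_onehot = np.array([0.0, 1.0, 0.0])  # e.g. three age bands

# Concatenate everything into a single input vector for the NN
nn_input = np.concatenate([user_emb, movie_emb,
                           gender_onehot, age_bucket_onehot])
print(nn_input.shape)  # (105,) = 50 + 50 + 2 + 3
```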

(alex) #87

Although you usually see factorization machines used to merge CF with additional data…