Fastbook Chapter 8 questionnaire solutions (wiki)

Here are the questions:

  1. What problem does collaborative filtering solve?

It solves the problem of predicting the interests of users based on the interests of other users and recommending items based on these interests.

  1. How does it solve it?

The key idea of collaborative filtering is latent factors. The model learns what kinds of items you may like (e.g., you like sci-fi movies/books), and these factors are learned (via basic gradient descent) from which items other users like.

  1. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?

If there are not many ratings to learn from, or not enough data about the user to provide useful recommendations, then such collaborative filtering systems may not be useful.

  1. What does a crosstab representation of collaborative filtering data look like?

In the crosstab representation, the users and items are the rows and columns (or vice versa) of a large matrix with the values filled out based on the user’s rating of the item.

  1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!)

To do by the reader
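One possible way to do this is with `pd.crosstab`. The sketch below uses a tiny made-up ratings table in place of the real MovieLens files; the column names `user`, `movie`, and `rating` are assumptions, so adjust them to match your DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the MovieLens ratings table.
ratings = pd.DataFrame({
    "user":   [1, 1, 2],
    "movie":  ["A", "B", "A"],
    "rating": [5, 3, 4],
})

# Rows are users, columns are movies, cells hold the rating.
# User/movie pairs with no rating show up as NaN.
table = pd.crosstab(ratings["user"], ratings["movie"],
                    values=ratings["rating"], aggfunc="mean")
print(table)
```

With the real data, most cells will be NaN because each user has only rated a small fraction of the movies.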

  1. What is a latent factor? Why is it “latent”?

As described above, latent factors are factors that are important for predicting the recommendations but are not explicitly given to the model; they are learned instead (hence “latent”).

  1. What is a dot product? Calculate a dot product manually using pure python with lists.

A dot product is when you multiply the corresponding elements of two vectors and add them up. If we represent the vectors as lists of the same size, here is how we can perform a dot product:

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
# multiply matching elements pairwise, then add up the products
dot_product = sum(x * y for x, y in zip(a, b))  # 5 + 12 + 21 + 32 = 70

  1. What does pandas.DataFrame.merge do?

It allows you to merge two DataFrames into one by joining their rows on shared columns or indices, much like a SQL join.
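As a rough illustration, here is a minimal sketch with two made-up tables (the column names and titles are hypothetical, not the actual MovieLens files). With no arguments, `merge` performs an inner join on the columns the two DataFrames have in common, here `movie`.

```python
import pandas as pd

# Hypothetical ratings and movie-title tables sharing a "movie" id column.
ratings = pd.DataFrame({"user": [1, 1, 2], "movie": [10, 20, 10], "rating": [5, 3, 4]})
movies = pd.DataFrame({"movie": [10, 20], "title": ["Movie A", "Movie B"]})

# Each rating row picks up the title of the movie it refers to.
merged = ratings.merge(movies)
print(merged)
```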

  1. What is an embedding matrix?

It is the matrix that a one-hot encoded input is multiplied by (or, equivalently and more efficiently, indexed into) to get that input's embedding vector. In this collaborative filtering problem, it is learned through training.

  1. What is the relationship between an embedding and a matrix of one-hot encoded vectors?

An embedding layer is a computationally more efficient equivalent of multiplying a matrix of one-hot encoded vectors by the embedding matrix: it looks up the rows directly by index.

  1. Why do we need Embedding if we could use one-hot encoded vectors for the same thing?

Embedding is computationally more efficient. The multiplication with one-hot encoded vectors is equivalent to indexing into the embedding matrix, and the Embedding layer does this. However, the gradient is calculated such that it is equivalent to the multiplication with the one-hot encoded vectors.
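The equivalence can be checked by hand. The sketch below uses plain Python lists and a made-up 3x2 embedding matrix (the helper names `one_hot` and `vec_mat` are hypothetical); it only illustrates the forward lookup, not how PyTorch computes the gradient.

```python
# Hypothetical embedding matrix: 3 users, 2 latent factors each.
emb = [[0.1, 0.2],
       [0.3, 0.4],
       [0.5, 0.6]]

def one_hot(idx, n):
    """Vector of length n with a 1.0 at position idx and 0.0 elsewhere."""
    return [1.0 if i == idx else 0.0 for i in range(n)]

def vec_mat(v, m):
    """Multiply row vector v by matrix m."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

looked_up = emb[1]                          # direct indexing
multiplied = vec_mat(one_hot(1, 3), emb)    # one-hot vector times matrix
print(looked_up, multiplied)
```

Both expressions pick out the same row, but indexing avoids building the one-hot vector and doing all the multiplications by zero.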

  1. What does an embedding contain before we start training (assuming we’re not using a pretrained model)?

The embedding is randomly initialized.

  1. Create a class (without peeking, if possible!) and use it.

To do by the reader. Example in the chapter:

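class Example:
    def __init__(self, a): self.a = a
    def say(self,x): return f'Hello {self.a}, {x}.'

# using the class
ex = Example('Sylvain')
ex.say('nice to meet you')  # 'Hello Sylvain, nice to meet you.'

  1. What does x[:,0] return?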

The user IDs, i.e. the first column of the batch x.

  1. Rewrite the DotProduct class (without peeking, if possible!) and train a model with it

Code provided in chapter:

from fastai.collab import *  # provides Module, Embedding, sigmoid_range

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
    def forward(self, x):
        # column 0 of the batch holds user ids, column 1 holds movie ids
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        # dot product per row, squashed into the rating range
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)

  1. What is a good loss function to use for MovieLens? Why?

We can use Mean Squared Error (MSE), which is a perfectly reasonable loss as we have numerical targets for the ratings and it is one possible way of representing the accuracy of the model.

  1. What would happen if we used CrossEntropy loss with MovieLens? How would we need to change the model?

We would need to ensure the model outputs 5 predictions, one per possible rating. For example, with a neural network model, we need to change the last linear layer to output 5, not 1, predictions. These are then passed into the Cross Entropy loss, with the ratings treated as class labels rather than as numbers.

  1. What is the use of bias in a dot product model?

A bias will compensate for the fact that some movies are just amazing or pretty bad. It will also compensate for users who tend to give more positive or more negative ratings in general.
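To make the role of the bias concrete, here is a minimal pure-Python sketch of a single prediction before it is squashed into the rating range. All the numbers are made up for illustration.

```python
# Hypothetical latent factors for one user and one movie.
user_factors = [0.5, 1.0]
movie_factors = [1.0, 2.0]

# Hypothetical biases: this user rates generously, this movie is a bit weak.
user_bias = 0.5
movie_bias = -0.25

# Interaction term: how well the user's tastes match the movie's traits.
interaction = sum(u * m for u, m in zip(user_factors, movie_factors))

# The biases shift the score regardless of how well tastes and traits match.
score = interaction + user_bias + movie_bias
print(interaction, score)
```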

  1. What is another name for weight decay?

L2 regularization

  1. Write the equation for weight decay (without peeking!)

loss_with_wd = loss + wd * (parameters**2).sum()

  1. Write the equation for the gradient of weight decay. Why does it help reduce weights?

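We add 2*wd*parameters to the gradients. Because this term is proportional to each weight, every optimizer step pulls the weights toward zero, and larger weights shrink more. Smaller weights give shallower, less bumpy/sharp surfaces that generalize better and help prevent overfitting.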
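The gradient formula can be sanity-checked numerically. This is a small pure-Python sketch with made-up weights and a made-up `wd`; it compares 2*wd*p against a finite-difference estimate of the gradient of the penalty term wd * sum(p**2), then takes one gradient step using only that term.

```python
wd = 0.1
params = [0.5, -2.0, 3.0]   # hypothetical weights

def penalty(ps):
    """Weight decay term that is added to the loss."""
    return wd * sum(p ** 2 for p in ps)

# Analytic gradient of the penalty with respect to each weight.
analytic = [2 * wd * p for p in params]

# Central finite-difference estimate of the same gradient.
eps = 1e-6
numeric = []
for i in range(len(params)):
    up, down = params.copy(), params.copy()
    up[i] += eps
    down[i] -= eps
    numeric.append((penalty(up) - penalty(down)) / (2 * eps))

# One SGD step driven only by the weight decay gradient:
# every weight moves toward zero.
lr = 0.1
shrunk = [p - lr * g for p, g in zip(params, analytic)]
print(analytic, numeric, shrunk)
```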

  1. Why does reducing weights lead to better generalization?

This results in shallower, less sharp surfaces. If sharp surfaces are allowed, the model can very easily overfit; weight decay prevents this.

  1. What does argsort do in PyTorch?

It returns the indices that would sort the original PyTorch tensor (in ascending order by default).
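As a pure-Python analogue (not the PyTorch call itself), the same indices can be computed from a list. The bias values here are made up; calling `.argsort()` on a tensor holding the same values should give the same indices, returned as a tensor.

```python
# Hypothetical movie biases.
biases = [0.3, -1.2, 0.8, 0.0]

# Indices of the elements, ordered from smallest to largest value.
idxs = sorted(range(len(biases)), key=lambda i: biases[i])
print(idxs)

# Reading the values in that order gives the sorted list.
in_order = [biases[i] for i in idxs]
print(in_order)
```

The first few indices point at the lowest-bias movies, and the last few at the highest.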

  1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why / why not?

No, it means much more than that. The bias is learned alongside the latent factors, so it takes into account genres, actors, and other factors. For example, a movie with a low bias is one you may not like even if you usually like that type of movie (and vice versa for a movie with a high bias).

  1. How do you print the names and details of the layers in a model?

Just by typing learn.model

  1. What is the “bootstrapping problem” in collaborative filtering?

That the model / system cannot make any recommendations or draw any inferences for users or items about which it has not yet gathered sufficient information. It’s also called the cold start problem.

  1. How could you deal with the bootstrapping problem for new users? For new movies?

You could solve this by coming up with an average embedding for a user or movie. Or select a particular user/movie to represent the average user/movie. Additionally, you could come up with some questions that could help initialize the embedding vectors for new users and movies.
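The first suggestion can be sketched in a few lines of plain Python. The embedding values below are made up, and in practice the rows would come from the trained user (or movie) embedding matrix.

```python
# Hypothetical trained user embeddings: 3 users, 2 latent factors each.
user_embeddings = [[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 0.0]]

n_users = len(user_embeddings)
n_factors = len(user_embeddings[0])

# Average each latent factor over all known users.
average_user = [sum(row[j] for row in user_embeddings) / n_users
                for j in range(n_factors)]

# A brand-new user starts out with this vector until they have rated enough items.
print(average_user)
```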

  1. How can feedback loops impact collaborative filtering systems?

The recommendations may suffer from representation bias where a small number of people influence the system heavily. E.g.: Highly enthusiastic anime fans who rate movies much more frequently than others may cause the system to recommend anime more often than expected (incl. to non-anime fans).

  1. When using a neural network in collaborative filtering, why can we have different number of factors for movie and user?

In this case, we are not taking the dot product but instead concatenating the user and movie embedding vectors, so the number of factors can be different.

  1. Why is there a nn.Sequential in the CollabNN model?

This allows us to chain multiple nn.Module layers together so they can be called as one. In this case, the linear layers are chained together, and the concatenated embeddings can be passed directly into them.

  1. What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filter model?

We should use a tabular model, which is discussed in the next chapter!

@muellerzr Please wiki-fy! :slight_smile:

BTW, you could do it yourself too :wink:

How? I thought a moderator is needed to do it?

Everyone who is a part of the live course can wiki-fy.

3 dots near the post -> Create wiki

OMG I did not know that! Thank you for the clarification!

@muellerzr Sorry I kept tagging you about this! I won’t bother you from now on :slight_smile:


Please continue doing that still :grin:


Hahaha yes after all he’s a robot! :joy:

Makes me feel important :wink:

What would happen if we used CrossEntropy loss with MovieLens? How would we need to change the model?

Create a model for MovieLens which works with CrossEntropy loss, and compare it to the model in this chapter.

Is the idea here to predict an integer between 0 and 5? So, the targets will need to be rounded first?

Hi all,

This is just a request for confirmation of my understanding of bias in the dot product model.

18. What is the use of bias in a dot product model?

and related to this

  1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why / why not?

My understanding of bias in this case is that it is a latent factor that is inherent in the user or item itself.

In contrast, the other kind of latent factor is a result of the interaction of an item with a user and vice versa (i.e. the latent factors of an item are multiplied with the latent factors of a user)

Related to this, the reason why the ranking based on the movie biases can be different from the ranking based on the average movie ratings is because the average movie rating is the result of the combination of latent factors of both users and items (including user and item biases), while sorting via movie biases alone removes the user’s bias as a factor in the ranking.

What do you think of my formulation?

(BTW, this question arose around our ongoing study group re-reading discussion led by @gansme and @marii here - new participants still welcome!)

The bias I discuss in the dot product model is slightly different from the normal bias we deal with in a conventional NN. I consider the bias as a parameter associated with the neuron itself, whereas the weights are parameters associated with the connections between the inputs and the neuron.

Has anyone gone back and tried to add metadata about users and items to a collaborative filtering model like in question 31?

Can anyone explain to me how the following code optimizes the user and item factors in the Embeddings when the model is fit? Looking at the values of model.user_factors before and after training reveals that they are being updated; I just don’t understand by which mechanism they are being optimized.

class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),  # as in the chapter's version of this model
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        # concatenate user and item embeddings along the feature dimension
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)

Has anyone tried to follow up on Q.17 and the 4th assignment of the further research? I did, and here is the full notebook. I am curious to have feedback on a few aspects, so in the following there are some details and questions.

First: about the DataLoaders:
I followed exactly the same steps as in the book. When I get one batch from the data loaders to inspect the content, it of course has 2 dimensions (user, movie), but:

Screenshot 2021-02-05 at 10.55.42

I expected to find the actual titles of the movies, while instead I find numbers in both the columns of x. How is this happening? Are titles being mapped to numbers in the background? Or am I messing things up at some step?

Second: build and train the model:
I followed the same approach as in the answer above, i.e. simply change the number of outputs in the final layer of the network:

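class CollabNNCE(Module):
  def __init__(self, user_sz, movie_sz):
    self.users_factors = Embedding(*user_sz)
    self.movies_factors = Embedding(*movie_sz)
    self.layers = nn.Sequential(
        nn.Linear(user_sz[1]+movie_sz[1], 100),
        # NOTE: the rest of this nn.Sequential was lost from the post.
        # The two lines below are an assumption: a ReLU, then one output per rating class.
        nn.ReLU(),
        nn.Linear(100, 5))

  def forward(self,x):
    users = self.users_factors(x[:,0])
    movies = self.movies_factors(x[:,1])
    # concatenate user and movie embeddings along the feature dimension
    facts = torch.cat((users, movies), dim=1)
    return self.layers( facts )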

I then created a learner and since now it is a classification problem, I added the accuracy as a metric:

learn = Learner(dls, model, loss_func=myLoss, metrics=myAccuracy)

(myAccuracy and myLoss are wrappers around accuracy and CrossEntropyLossFlat respectively).
Here is the outcome of the training:
Screenshot 2021-02-05 at 11.20.42

valid_loss and myAccuracy decrease steadily, and only in the last epoch do they go up again, pointing to some overfitting, and the overall accuracy is pretty bad. I have been trying a few different things, but I couldn’t get anything better than this. Have you guys tried to implement anything similar? Do you have any feedback?

Third: compare to the book:
I was wondering how to compare this result (assuming that sooner or later I’ll find a way to have a decent one) to the result of the book. How can I compare the two models, being so different (classification vs regression)? Do you have any idea?


I am confused as to why DotProductBias works without using an Embedding layer. I thought we needed to use Embedding for gradient calculation and that indexing directly into the latent factor matrix wouldn’t work. But the training behaves nicely using just a normal PyTorch tensor.
What is the point of using Embedding here, then?


Embedding can be redefined manually, as it was in the lesson to show how it works.