If anyone is interested, I redid this lesson from scratch here, but added a few tweaks, such as fitting a model with a learner (so we can use learning rate decay/cosine annealing and reach a good score with fewer epochs), clipping the output range, using the full MovieLens dataset …
What is the intuition behind the above statement? I am having trouble understanding why we would decrease weight decay if there is a lot of variation in the gradient.
So in the part of the lecture where we went into Excel and did the x, y slope-intercept equation learning project, was that a CNN?
Also when we make the crosstab we do:
pd.crosstab(crosstab.userId, crosstab.movieId, crosstab.rating, aggfunc=np.sum)
Why do we need the np.sum? I understand it adds values, but what values are there to add? Aren't we just showing what rating a user gave a movie?
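For anyone else wondering about this, here is a minimal sketch (with a made-up tiny ratings frame, not the actual MovieLens data) showing why an aggfunc is needed once `values` is passed:

```python
import numpy as np
import pandas as pd

# Hypothetical tiny ratings frame: one row per (user, movie) rating
ratings = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating":  [4.0, 3.5, 5.0],
})

# Without `values`, crosstab just counts how often each (user, movie) pair
# occurs. Once you pass `values`, pandas requires an aggfunc because, in
# general, several rows could fall into the same cell; with at most one
# rating per (user, movie) pair, np.sum simply passes that rating through.
table = pd.crosstab(ratings.userId, ratings.movieId,
                    values=ratings.rating, aggfunc=np.sum)
print(table)
```

So np.sum isn't really adding anything here; it's just the aggregator pandas insists on when values are supplied.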
I don’t really get the part about adding bias in the EmbeddingDotBias class.
My understanding is that in the line:
res = um + self.ub(users).squeeze() + self.mb(movies).squeeze()
um is not a matrix but a vector, and
squeeze() does not do broadcasting; it removes a dimension of size 1 (the bias is an
n x 1 matrix, so it has to be converted to a vector before it can be added to um).
But in the video it’s said that
squeeze() is for broadcasting, which doesn’t make sense to me.
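A quick sketch of what's being described, using NumPy (whose squeeze has the same semantics as PyTorch's), with made-up values:

```python
import numpy as np

# The dot products come out as a vector of shape (batch,), while the
# bias lookup comes out as a matrix of shape (batch, 1).
um = np.array([3.2, 4.1, 2.7])        # dot products, shape (3,)
ub = np.array([[0.1], [0.2], [0.3]])  # user-bias lookup, shape (3, 1)

res = um + ub.squeeze()               # squeeze: (3, 1) -> (3,)
print(res.shape)                      # (3,)

# Without squeeze, broadcasting (3,) + (3, 1) silently produces a (3, 3)
# matrix, so squeeze is needed precisely so broadcasting does NOT kick in.
bad = um + ub
print(bad.shape)                      # (3, 3)
```

So "squeeze is for broadcasting" is perhaps best read as "squeeze is there to make the broadcasting come out right", not that squeeze itself broadcasts.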
An in-depth yet accessible explanation of momentum is here: https://distill.pub/2017/momentum/
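The core update that article analyzes fits in a few lines; here is a sketch minimizing a simple quadratic (all values chosen just for illustration):

```python
# Minimize f(w) = 0.5 * w**2 (gradient = w) with the classic momentum
# ("heavy ball") update:
#   v <- beta * v + grad
#   w <- w - lr * v
w, v = 5.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(200):
    grad = w
    v = beta * v + grad
    w = w - lr * v
print(abs(w))  # approaches the minimum at w = 0
```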
To help me better understand the optimization math (without using Solver or macros), I re-created the movie recommender spreadsheets (Excel + Google Sheets) and wrote a blog post about it. Hopefully this helps some of you a bit if you’re feeling stuck (like I was).
Frankly, I still feel like a little kid watching a magic trick each time I see the model learn haha!
Key differences vs. lesson spreadsheets:
1. Gradient descent done using formulas (not Solver, no macros) - Used step-by-step formulas (full derivations…) for batch gradient descent so you can see the math.
2. Added hyperparameter inputs as drop-downs - You can play around with the learning rate, L2 regularization penalty, initial weights, etc., to understand their impact on your errors.
3. Split data into training vs. test sets - This lets you see the importance of regularization vs. overfitting.
4. Added an L2 regularization penalty - Helps the model generalize better on the test data.
5. Added latent factor visualization graph in Excel
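For anyone who prefers reading the same math as code, here is a minimal NumPy sketch of batch gradient descent with an L2 penalty on a tiny made-up ratings matrix (all names and sizes are illustrative, not taken from the actual spreadsheets):

```python
import numpy as np

# Dense ratings matrix R factored into user factors U and movie factors M
# by batch gradient descent with an L2 penalty.
rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 4, 5, 2
R = rng.integers(1, 6, size=(n_users, n_movies)).astype(float)

U = rng.normal(0, 0.1, (n_users, n_factors))
M = rng.normal(0, 0.1, (n_movies, n_factors))
lr, l2 = 0.01, 0.01

initial_mse = np.mean((U @ M.T - R) ** 2)
for _ in range(2000):
    err = U @ M.T - R                  # prediction error for every cell
    # The same derivatives the spreadsheet formulas spell out cell by cell:
    # d(loss)/dU = err @ M + l2 * U, and symmetrically for M
    U -= lr * (err @ M + l2 * U)
    M -= lr * (err.T @ U + l2 * M)

final_mse = np.mean((U @ M.T - R) ** 2)
print(initial_mse, final_mse)  # the error should drop substantially
```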
Trying to understand what’s going on here:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
movie_pca = pca.fit(movie_emb.T).components_
Why do we choose to fit the transpose of the movie embedding, rather than the embedding itself? Why do we take the components from the fit PCA rather than transforming the matrix?
I’ve been playing around with different variants and looking at the dimensionality:
test1 = pca.fit(movie_emb.T).components_   # test1.shape == (3, 3000)
test2 = pca.fit(movie_emb).components_     # test2.shape == (3, 50)
test3 = pca.fit_transform(movie_emb)       # test3.shape == (3000, 3)
If someone can illuminate what’s going on here and why we use PCA the way we do, I would appreciate it.
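In case shapes help others reason about this, here is a runnable sketch reproducing those dimensionalities with a randomly generated stand-in for the embedding matrix (the real one would come from the trained model):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the (3000 movies x 50 factors) embedding matrix
movie_emb = np.random.randn(3000, 50)
pca = PCA(n_components=3)

# Fitting the transpose (shape (50, 3000)) treats each factor as a sample
# and each movie as a feature, so components_ has one column per movie:
# a 3-number coordinate for every movie, ready to plot.
test1 = pca.fit(movie_emb.T).components_
print(test1.shape)   # (3, 3000)

# Fitting the embedding directly gives components over the 50 factors instead
test2 = pca.fit(movie_emb).components_
print(test2.shape)   # (3, 50)

# fit_transform projects each movie onto 3 components: similar information
# to test1, but arranged (movies, components)
test3 = pca.fit_transform(movie_emb)
print(test3.shape)   # (3000, 3)
```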