# Lesson 7 - Official topic

Sure, you can use this code to get the 5 most similar movies:

`idx = distances.argsort(descending=True)[1:6]`

The `argsort` method returns the movie IDs sorted by descending similarity. The most similar movie is the movie itself at index 0, so the remaining movies start at index 1.
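For anyone reconstructing the whole retrieval step, here's a minimal runnable sketch with random stand-in weights. The names `movie_factors` and `idx0` are illustrative, not from the notebook:

```python
import torch

torch.manual_seed(0)
movie_factors = torch.randn(10, 4)  # stand-in for the (n_movies, n_factors) embedding weights
idx0 = 3                            # index of the query movie

# Cosine similarity of the query movie against every movie
distances = torch.nn.functional.cosine_similarity(
    movie_factors, movie_factors[idx0][None])

# Highest similarity first; skip index 0, which is the movie itself
idx = distances.argsort(descending=True)[1:6]
```

The movie at position 0 of the sorted result is always the query itself (cosine similarity 1 with itself), which is why the slice starts at 1.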


Thanks!! @johannesstutz

Hey guys. So I am doing some experimentation on the Collab Notebook.

`learn.fit_one_cycle(5, 5e-3)`

Here Jeremy used 5e-3 (5×10⁻³) as the max learning rate. I was trying to find out why he used that exact number, so I ran `lr_find` and tried a different learning rate. The suggested one was 4e-6, but when I used it the losses were far worse (13.5 instead of 0.87 with Jeremy’s learning rate).

Does anyone know why this happens? Or how to find an optimal learning rate for the DotProduct model?

I am also having the same confusion. I mean how do you determine which matrices to use?


The step you cited replaces the values in the SalePrice column (which are in absolute US dollars, I think) with the logarithm of the sale price. The reason is that the metric the competition uses is on a log scale (root mean squared log error, RMSLE). So if we just convert the dependent variable to a log scale, we can use the standard RMSE loss and we’re good.

SalesPredicted - Sales: I’m not sure what you mean by that. The loss for every row is determined by the RMSE function, which takes the predicted value and the true value from the SalePrice column as arguments.
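A quick sketch of why the log trick works: RMSE computed on log-transformed prices is exactly RMSLE computed on the raw prices (toy numbers, made up for illustration):

```python
import numpy as np

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))

def rmsle(pred, actual):
    return np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))

actual = np.array([10_000., 25_000., 60_000.])
pred = np.array([12_000., 20_000., 65_000.])

log_actual = np.log(actual)  # what the notebook stores in SalePrice
log_pred = np.log(pred)      # what the model then predicts

# RMSE on the logs equals RMSLE on the raw prices
assert np.isclose(rmse(log_pred, log_actual), rmsle(pred, actual))
```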

Let me know if that helped a little


Hi everyone, I’m working on using the entity embeddings of the neural net to improve random forest results. This is all in the chapter 09_tabular notebook with the bulldozer bluebook dataset.

The first stumbling block: I don’t quite get the dimensions of the embeddings. Every categorical variable should get its own embedding layer. This seems right:

```python
embeds = list(learn.model.embeds.parameters())
```

`len(embeds)` as well as `len(cat_nn)` is 13.

Now my understanding was that the first dimension of the embedding layer is equal to the number of levels for the variable. The other dimension is determined by a heuristic that works well in practice.
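For reference, fastai implements that heuristic as `emb_sz_rule`; a one-line sketch of it, which does reproduce the second dimensions in the output below:

```python
def emb_sz_rule(n_cat):
    # fastai's rule of thumb for embedding width, capped at 600
    return min(600, round(1.6 * n_cat ** 0.56))

emb_sz_rule(73)   # -> 18, matching torch.Size([73, 18])
emb_sz_rule(178)  # -> 29, matching torch.Size([178, 29])
```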

However, these numbers don’t match.

```python
for i in range(len(cat_nn)):
    print(embeds[i].shape, df_nn_final[cat_nn[i]].nunique())
```

Gives the following result:

```
torch.Size([73, 18]) 73
torch.Size([7, 5]) 6
torch.Size([3, 3]) 2
torch.Size([75, 18]) 74
torch.Size([4, 3]) 3
torch.Size([5242, 194]) 5281
torch.Size([178, 29]) 177
torch.Size([5060, 190]) 5059
torch.Size([7, 5]) 6
torch.Size([13, 7]) 12
torch.Size([7, 5]) 6
torch.Size([5, 4]) 4
torch.Size([18, 8]) 17
```

Where does the mismatch come from? Am I maybe using the wrong dataframes or do I have a wrong conception about embeddings?

Thank you!

Thanks johannesstutz

Yes, that helped a lot. I will continue my fumbling through the code.

Though I have hit my next error already…

`(path/'to.pkl').save(to)`

Which throws the traceback:

```
AttributeError                            Traceback (most recent call last)
----> 1 (path/'to.pkl').save(to)

AttributeError: 'PosixPath' object has no attribute 'save'
```

I did some googling and found

which seems to say that this error is raised when the path is created on a Linux system, which defaults to a `PosixPath` object that has no `save` attribute or method.

Researching more - any help appreciated.

There was a breaking change in the source code:
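If the link above is gone: I believe the current notebooks use `save_pickle(path/'to.pkl', to)` and `load_pickle(path/'to.pkl')` instead of the old `.save()` method. A stdlib sketch of what those helpers do (the real ones come from fastcore):

```python
import pickle
from pathlib import Path

def save_pickle(fn, o):
    # Stdlib equivalent of fastcore's save_pickle(fn, o)
    with open(fn, 'wb') as f:
        pickle.dump(o, f)

def load_pickle(fn):
    # Stdlib equivalent of fastcore's load_pickle(fn)
    with open(fn, 'rb') as f:
        return pickle.load(f)
```

So the failing line becomes `save_pickle(path/'to.pkl', to)`.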


Thanks, trying now.

Collaborative Filtering:

How do I predict/get all the set of movies that a user will like?

```
EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1635, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1635, 1)
)
```

Do we have to refer to the u_weight and i_weight to get all the movies recommended for a user?

Thanks
Ganesh Bhat

Well, I got through the decision tree example. Unfortunately, it does not explain how to test new data on the model. I skipped many preceding chapters, so I will need to circle back to ‘Turning your model into an online application’.

Hi Ganesh, I think you could pull the embedding of a user (one of the 944 rows) and multiply it with the i_weight embedding, which represents the movies. Add the user bias for your user and the movie biases, and you have the raw predictions. Put this through the sigmoid_range function and you should have the predicted rating for every movie! Have fun and let me know if it worked!

Thanks @johannesstutz.

I am summarizing my understanding:

Model prediction for a user = sigmoid_range(dot product of the embedding vectors (one row of u_weight with all the rows of i_weight) + user bias + item bias, *self.y_range)

sigmoid_range(u_weight * i_weight + u_bias + i_bias, *self.y_range)

Referring to the output of learn.model in the 08_collab.ipynb, I am putting it in the matrix multiplication form:
sigmoid_range( matrix(1,50) * matrix(1635, 50) + matrix(944,1) + matrix(1635,1), *self.y_range)

Regards
Ganesh Bhat

This looks good, however for the user bias you’ll only want to use the bias for your specific user, so it’s just a single value you are adding.
For the multiplication of the weight vectors you could either use elementwise multiplication and take the sum:
`(matrix(1,50) * matrix(1635, 50)).sum(dim=1)`
or just matrix multiply them, making sure the dimensions match:
`matrix(1,50) @ matrix(1635, 50).t()`
which makes the second matrix of shape (50, 1635).

I hope this helped. Just play around with it; it took me a while to get a feel for the vector and matrix stuff.
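Putting it all together, a runnable sketch with random stand-in weights (shapes taken from the model repr above; `sigmoid_range` is reimplemented here to keep the snippet self-contained):

```python
import torch

torch.manual_seed(0)
n_users, n_movies, n_factors = 944, 1635, 50
u_weight = torch.randn(n_users, n_factors)   # stand-ins for the learned weights
i_weight = torch.randn(n_movies, n_factors)
u_bias = torch.randn(n_users, 1)
i_bias = torch.randn(n_movies, 1)

def sigmoid_range(x, low, high):
    # same idea as fastai's sigmoid_range: squash x into (low, high)
    return torch.sigmoid(x) * (high - low) + low

user = 42
# (50,) @ (50, 1635) -> (1635,) raw scores, plus the single user bias
# and the per-movie biases
raw = u_weight[user] @ i_weight.t() + u_bias[user] + i_bias.squeeze(1)
preds = sigmoid_range(raw, 0, 5.5)           # predicted rating for every movie
top5 = preds.argsort(descending=True)[:5]    # five highest-predicted movies
```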

When I fit a decision tree on one categorical feature and run scikit-learn’s `plot_tree`, I get a tree diagram that shows splitting using `<=` rather than equality, which seems to contradict this bit of `09_tabular.ipynb`:

Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).

Is the passage wrong, or am I misunderstanding something?

Here’s my code:

```python
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.datasets
from sklearn.tree import DecisionTreeRegressor, plot_tree

boston = sklearn.datasets.load_boston()
X = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
X.loc[:10, "CHAS"] = 2  # adding a third level for generality
X = pd.DataFrame(pd.Categorical(X.loc[:, "CHAS"]))
y = boston['target']

dtr = DecisionTreeRegressor(max_depth=3)
dtr.fit(X, y)

plot_tree(dtr, feature_names=["CHAS"], filled=True)
```

And here’s the output:

Hey there, I’ve got an error when importing fastbook: `name 'log_args' is not defined`

Note: I’m running the notebook on Paperspace.

Can anyone explain to me the meaning of setting `max_card` equal to 1 in this code:

`cont,cat = cont_cat_split(df, 1, dep_var=dep_var)`

Does that mean all variables are treated as continuous?

Looking at the source code: every column of type float is treated as continuous. Integer columns depend on the cardinality; if `max_card` is set to 1, every integer column is treated as continuous as well. Every other column is categorical.
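A simplified sketch of that logic (a reimplementation for illustration, not fastai’s exact source):

```python
import pandas as pd

def cont_cat_split_sketch(df, max_card=20, dep_var=None):
    # floats are always continuous; ints are continuous only when they
    # have more than max_card distinct values; everything else is categorical
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue
        if pd.api.types.is_float_dtype(df[col]) or (
            pd.api.types.is_integer_dtype(df[col])
            and df[col].nunique() > max_card
        ):
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat
```

With `max_card=1`, any integer column with more than one level counts as continuous, so only truly non-numeric columns end up categorical.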

Hello everyone, please can someone help me with this? I don’t know what I am doing wrong.

I’m running into an error I can’t seem to fix. Any help will be appreciated.

```
[Errno 2] No such file or directory: '/root/.fastai/archive/bluebook'
```

Even though I am following the exact steps as the notebook, I keep on getting this error when I run this code:

```python
if not path.exists():
    path.mkdir()
```