Model prediction for a user = sigmoid_range(dot product of the embedding vectors (this user's row of user_weight multiplied with every row of item_weight) + user bias + item bias, *self.y_range)
Referring to the output of learn.model in 08_collab.ipynb, I am writing it in matrix-multiplication form:
sigmoid_range( matrix(1,50) * matrix(1635, 50) + matrix(944,1) + matrix(1635,1), *self.y_range)
This looks good; however, for the user bias you'll only want to use the bias for your specific user, so it's just a single value you are adding.
For the multiplication of the weight vectors you could either use elementwise multiplication and take the sum: (matrix(1,50) * matrix(1635, 50)).sum(dim=1)
or just matrix multiply them, making sure the dimensions match: matrix(1,50) @ matrix(1635, 50).t()
which makes the second matrix of shape (50, 1635).
I hope this helped. Just play around with it; it took me a while to get a feel for the vector and matrix stuff.
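If it helps to make this concrete, here is a minimal sketch in plain PyTorch with the shapes from the notebook. sigmoid_range is written out inline, and y_range=(0, 5.5) is an assumption (the value the notebook uses); the random tensors just stand in for the trained embeddings:

import torch

def sigmoid_range(x, lo, hi):
    # fastai's sigmoid_range: squashes x into the interval (lo, hi)
    return torch.sigmoid(x) * (hi - lo) + lo

n_items, n_factors = 1635, 50
user_w = torch.randn(1, n_factors)        # one user's embedding vector
item_w = torch.randn(n_items, n_factors)  # all item embedding vectors
user_b = torch.randn(1)                   # this user's bias: a single value
item_b = torch.randn(n_items)             # all item biases

# elementwise multiply, then sum over the factor dimension
preds1 = (user_w * item_w).sum(dim=1) + user_b + item_b

# or matrix-multiply against the transposed item matrix: (1,50) @ (50,1635)
preds2 = (user_w @ item_w.t()).squeeze(0) + user_b + item_b

print(torch.allclose(preds1, preds2))    # True: both routes give the same scores
print(sigmoid_range(preds1, 0, 5.5)[:5]) # predicted ratings for the first 5 items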
When I fit a decision tree on one categorical feature and run scikit-learn's plot_tree, I get a tree diagram that shows splits using <= rather than equality, which seems to contradict this bit of 09_tabular.ipynb:
Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).
Is the passage wrong, or am I misunderstanding something?
Here's my code:
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.datasets
from sklearn.tree import DecisionTreeRegressor, plot_tree

boston = sklearn.datasets.load_boston()
X = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
X.loc[:10, "CHAS"] = 2  # add a third level for generality
X = pd.DataFrame(pd.Categorical(X.loc[:, "CHAS"]))  # keep only CHAS, as a categorical
y = boston['target']

dtr = DecisionTreeRegressor(max_depth=3)
dtr.fit(X, y)
plot_tree(dtr, feature_names=["CHAS"], filled=True)
plt.show()
Looking at the source code, it treats every column of type 'float' as continuous. Integer columns depend on the cardinality: if max_card is set to 1, then every integer column (with more than one distinct value) is treated as continuous as well. Every other column is categorical.
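As a quick check (a toy DataFrame, not the book's data), you can see that rule in action with fastai's cont_cat_split:

import pandas as pd
from fastai.tabular.core import cont_cat_split

df = pd.DataFrame({
    'f': [0.1, 0.2, 0.3, 0.4],  # float  -> always continuous
    'i': [1, 2, 3, 4],          # int    -> depends on max_card
    's': ['a', 'b', 'a', 'b'],  # string -> always categorical
})

print(cont_cat_split(df, max_card=1))   # (['f', 'i'], ['s'])  -- 4 unique ints > 1, so 'i' is continuous
print(cont_cat_split(df, max_card=20))  # (['f'], ['i', 's'])  -- 4 unique ints <= 20, so 'i' is categorical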
I'm running into an error I can't seem to fix. Any help will be appreciated.
[Errno 2] No such file or directory: '/root/.fastai/archive/bluebook'
Even though I am following the exact steps in the notebook, I keep getting this error when I run this code:
if not path.exists():
    path.mkdir()
    api.competition_download_cli('bluebook-for-bulldozers', path=path)
    file_extract(path/'bluebook-for-bulldozers.zip')
path.ls(file_type='text')
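In case it helps anyone hitting the same thing: the full traceback isn't shown, but a plain path.mkdir() raises exactly this [Errno 2] FileNotFoundError when the parent directory (here /root/.fastai/archive) doesn't exist yet. One thing worth trying (an assumption, since I can't see your environment) is creating the parents as well:

if not path.exists():
    path.mkdir(parents=True)  # also create /root/.fastai/archive if it's missing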
It's a late reply, but in case you haven't figured this out, and for others:
If you go into the hierarchy.py file and change:
if labels and Z.shape[0] + 1 != len(labels):
to:
if (labels is not None) and (Z.shape[0] + 1 != len(labels)):
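For anyone wondering why the original line fails: when labels is a NumPy array with more than one element, truth-testing it (if labels:) raises a ValueError, which the explicit "is not None" check avoids. A tiny reproduction (a toy array, not the scipy code itself):

import numpy as np

labels = np.array([0, 1, 2])
try:
    if labels:  # the old-style truthiness check
        pass
except ValueError as e:
    print(e)  # "The truth value of an array with more than one element is ambiguous..."

if labels is not None and len(labels) != 4:  # the fixed check runs fine
    print('label count mismatch')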
Hello, I have a question about chapter 9 (lesson 7). What exactly does FillMissing do? I ask because I understand that it fills missing values with the median of the column, but in the picture I attached we still address missing values (despite using FillMissing earlier):
Very helpful reply. I had been trying different things for the last two days. I understood there was a problem with the labels, because it plotted fine without them, even though the labels are just numbers. However, I never imagined one would have to change the actual scipy function. Thank you very much!
I agree that max_card of 1 is weird, and it took me a while to figure out what was going on: setting max_card to 1 I got 51 categorical variables; setting it to 9000 I got 60. I then started investigating some of the 51 from the first case and found that they all had more than one category.
If you look at the source code for the cont_cat_split(...) function (e.g. here), you can see where the trick is: a variable is considered continuous if it has integer values with more than max_card distinct values, or if it has float values. In the case of the 51 categorical variables, they are all string-valued, so they stay categorical no matter what max_card is!
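You can verify this on your own data; here df is assumed to be the DataFrame you passed to cont_cat_split (e.g. the bulldozers one from the chapter). Printing the dtypes of the resulting categorical columns makes it obvious:

from fastai.tabular.core import cont_cat_split

cont, cat = cont_cat_split(df, max_card=1)
for name in cat:
    print(name, df[name].dtype, df[name].nunique())
# object (string) columns land in cat regardless of max_card,
# which is how 51 columns stay categorical even when max_card=1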
I'm having trouble understanding the answer to this question in the workbook 08 questionnaire:
Why do we need Embedding if we could use one-hot encoded vectors for the same thing?
Embedding is computationally more efficient. The multiplication with one-hot encoded vectors is equivalent to indexing into the embedding matrix, and the Embedding layer does this. However, the gradient is calculated such that it is equivalent to the multiplication with the one-hot encoded vectors.
I understand the first two sentences, though I really don't understand the "gradient is equivalent" part. Can someone share a concrete example of calculating the gradient of the multiplication with the one-hot encoded vectors? How can multiplication have any sort of gradient?
Lastly, how does the Embedding class know about the gradient when all it's doing is basically indexing in?
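For what it's worth, here is a small experiment (plain PyTorch, toy sizes) that shows the two gradients coming out identical:

import torch

n_items, n_factors = 5, 3
W = torch.randn(n_items, n_factors, requires_grad=True)

# route 1: multiply by a one-hot vector that selects row 2
onehot = torch.zeros(n_items)
onehot[2] = 1.
(onehot @ W).sum().backward()
grad_onehot = W.grad.clone()

# route 2: plain indexing, which is what Embedding does under the hood
W.grad = None
W[2].sum().backward()
grad_index = W.grad

print(torch.equal(grad_onehot, grad_index))  # True: ones in row 2, zeros elsewhere

So the Embedding layer doesn't need to "know" anything special: autograd records the indexing operation, and the backward pass of indexing scatters the incoming gradient into the selected row, which is exactly the gradient the one-hot matmul would produce, just without ever materializing the mostly-zero matrix.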