Lesson 7 - Official topic

Thanks for the information. It worked for the problem.

2 Likes

Lesson 7 - chapter 9 in the book - Tabular data
Is there any value/advantage to running the random forest on a GPU compared to a CPU? I can understand the value of a GPU when we do mathematical manipulation on many small objects (i.e. images) simultaneously, but for a large tabular dataset maybe a CPU could be better…
Appreciate your clarification.

I don't think that sklearn supports GPU. The new fast.ai tabular object is a way to prep data for further processing, but the classical ML sklearn library doesn't support GPU. Still, it is nice to have your data prepared for any approach in the tabular object, so you can try a deep learning model after the RF.

1 Like

https://forums.fast.ai/t/lesson-7-official-topic/69896/83

Problems uploading Kaggle data through the API. Well, I signed in at Kaggle, accepted the competition rules, and installed kaggle in the terminal. I can download the zip file manually, but the API download downloads no files. So I eventually just uploaded the zip, unzipped it in Paperspace, and continued on… lol

This makes very little sense to me…
'However, in this case Kaggle tells us what metric to use: root mean squared log error (RMSLE) between the actual and predicted auction prices. We need do only a small amount of processing to use this: we take the log of the prices, so that rmse of that value will give us what we ultimately need:

dep_var = 'SalePrice'

df[dep_var] = np.log(df[dep_var])'

'SalePrice' (dep_var) is what is being predicted, so where is the difference from the original? Where is SalesPredicted - Sales? I feel so lost again…

This code from the Notebook finds the most similar movie:

movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]

Any idea what to change to find, say, the 5 most similar movies instead of one?

1 Like

Sure, you can use this code to get the 5 most similar movies:

idx = distances.argsort(descending=True)[1:6]

The argsort method returns the movie indices sorted by similarity in descending order. The most similar movie is the movie itself at index 0, so the other movies start from index 1.
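Putting it together, something like this should give you the titles (a small sketch, reusing the distances and dls objects from the notebook code above):

top5 = distances.argsort(descending=True)[1:6]      # index 0 is the movie itself, so skip it
[dls.classes['title'][i] for i in top5.tolist()]    # titles of the 5 most similar movies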

2 Likes

Thanks!! @johannesstutz :smiley:

Hey guys. So I am doing some experimentation on the Collab Notebook.

learn.fit_one_cycle(5, 5e-3)

Here Jeremy used 5e-3 as the max learning rate. So I was trying to find out why he used that exact number, so I ran lr_find and tried a different learning rate. The suggested one was 4e-6, but when I used it, the model losses were way worse (13.5 instead of 0.87 using Jeremy's learning rate).

Does anyone know why this happens? Or how to find an optimal learning rate for the DotProduct model?

I am also having the same confusion. I mean how do you determine which matrices to use?

1 Like

The step you cited replaces the values in the SalePrice column (which are in absolute US dollars I think) with the logarithm of the sale price. The reason for this is that the metric that the competition uses is on a log scale (root mean squared log error). So if we just convert the dependent variable to a log scale, we can use the (standard) RMSE error and we're good.

SalesPredicted - Sales: I'm not sure what you mean by that. The loss for every row is determined by the RMSE function, which takes the predicted value and the true value from the SalePrice column as arguments.
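To make that concrete, here is a tiny numpy sketch with made-up prices (ignoring the +1 that some RMSLE definitions add inside the log): RMSLE on the raw prices is just the standard RMSE on the logged prices.

import numpy as np

actual = np.array([10000., 20000., 30000.])    # made-up sale prices
pred   = np.array([12000., 18000., 33000.])    # made-up predictions

rmsle = np.sqrt(np.mean((np.log(pred) - np.log(actual))**2))    # RMSLE on raw prices

log_actual, log_pred = np.log(actual), np.log(pred)             # what df[dep_var] = np.log(...) does
rmse_on_logs = np.sqrt(np.mean((log_pred - log_actual)**2))     # plain RMSE on the logged values

assert np.isclose(rmsle, rmse_on_logs)                          # same number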

Let me know if that helped a little :slight_smile:

2 Likes

Hi everyone, I'm working on using the entity embeddings of the neural net to improve random forest results. This is all in the chapter 09_tabular notebook with the bulldozer bluebook dataset.

The first stumbling block: I don't quite get the dimensions of the embeddings. Every categorical variable should get its own embedding layer. This seems right:

embeds = list(learn.model.embeds.parameters())

len(embeds) as well as len(cat_nn) is 13.

Now my understanding was that the first dimension of the embedding layer is equal to the number of levels for the variable. The other dimension is determined by a heuristic that works well in practice.
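(As far as I can tell, that heuristic looks roughly like this; this is my reading of fastai's emb_sz_rule, so treat the exact constants as an approximation:)

def approx_emb_sz(n_levels):
    # cap at 600, otherwise grow slowly with the number of levels
    return min(600, round(1.6 * n_levels**0.56))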

However, these numbers don't match.

for i in range(len(cat_nn)):
    print(embeds[i].shape, df_nn_final[cat_nn[i]].nunique())

Gives following result:

torch.Size([73, 18]) 73
torch.Size([7, 5]) 6
torch.Size([3, 3]) 2
torch.Size([75, 18]) 74
torch.Size([4, 3]) 3
torch.Size([5242, 194]) 5281
torch.Size([178, 29]) 177
torch.Size([5060, 190]) 5059
torch.Size([7, 5]) 6
torch.Size([13, 7]) 12
torch.Size([7, 5]) 6
torch.Size([5, 4]) 4
torch.Size([18, 8]) 17

Where does the mismatch come from? Am I maybe using the wrong dataframes or do I have a wrong conception about embeddings?

Thank you!

Thanks johannesstutz

Yes, that helped a lot. I will continue fumbling through the code.

Though I have hit my next error already…

Everything, even the Kaggle download, worked up until the line

(path/'to.pkl').save(to)
which throws the traceback:
AttributeError Traceback (most recent call last)
----> 1 (path/'to.pkl').save(to)

AttributeError: 'PosixPath' object has no attribute 'save'

I did some googling and found


which seems to say that this error is raised when Path is used on a Linux system, which then defaults to a PosixPath object that has no 'save' attribute or method.

Researching more - any help appreciated.

There was a breaking change in the source code:
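If I remember correctly, the Path.save patch was removed in favor of save_pickle/load_pickle, so something like this should work on recent fastai versions (a sketch, not tested against your exact install):

save_pickle(path/'to.pkl', to)     # instead of (path/'to.pkl').save(to)
to = load_pickle(path/'to.pkl')    # and load_pickle to read it back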

1 Like

Thanks, trying now.

Collaborative Filtering:

How do I predict/get all the set of movies that a user will like?

EmbeddingDotBias(
(u_weight): Embedding(944, 50)
(i_weight): Embedding(1635, 50)
(u_bias): Embedding(944, 1)
(i_bias): Embedding(1635, 1)
)

Do we have to refer to the u_weight and i_weight to get all the movies recommended for a user?

Thanks
Ganesh Bhat

Well, I got through the decision tree example. Unfortunately, it does not explain how to test new data on the model. I skipped many preceding chapters, so I will need to circle back to 'Turning your model into an online application'.

Hi Ganesh, I think you could pull the embedding of a user (one of the 944 rows) and multiply it with the i_weight embedding, which represents the movies. Add the user bias for your user and the movie biases, and you have the raw predictions. Put this through the sigmoid_range function and you should have the predicted rating for every movie! Have fun and let me know if it worked!
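In code, that could look roughly like this (a sketch only, assuming the EmbeddingDotBias model printed above and the usual y_range of (0, 5.5); adjust to whatever y_range your learner used):

import torch

user_id = 5                                                 # a hypothetical user index
m = learn.model

with torch.no_grad():
    user_emb   = m.u_weight.weight[user_id]                 # this user's 50-dim embedding
    movie_embs = m.i_weight.weight                          # all 1635 movie embeddings
    scores = (movie_embs * user_emb).sum(dim=1)             # dot product with every movie
    scores = scores + m.u_bias.weight[user_id].squeeze()    # add the user bias
    scores = scores + m.i_bias.weight.squeeze()             # add each movie's bias
    preds = torch.sigmoid(scores) * (5.5 - 0) + 0           # sigmoid_range(0, 5.5)

top10 = preds.argsort(descending=True)[:10]                 # 10 highest predicted ratings
[dls.classes['title'][i] for i in top10.tolist()]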