Lesson 7 - Official topic

Thanks for the information. It worked for the problem.

2 Likes

Lesson 7 - chapter 9 in the book - Tabular data
Is there any value/advantage to running the random forest on a GPU compared to a CPU? I can understand the value of a GPU when we do mathematical manipulation on many small objects (i.e. images) simultaneously, but for a large tabular dataset maybe a CPU could be better…
Appreciate your clarification.

I don't think that sklearn supports GPU. The new fast.ai tabular object is a way to prep data for further processing, but the classical ML sklearn library doesn't support GPU. Still, it is nice to have your data prepared for any approach in the tabular object, so you can try a deep learning model after the RF.

1 Like

https://forums.fast.ai/t/lesson-7-official-topic/69896/83

Problems uploading Kaggle data through the API. Well, I signed in at Kaggle, accepted the competition rules, and installed kaggle in the terminal. I can download the zip file manually, but the API download downloads no files. So I eventually just uploaded the zip, unzipped it in Paperspace, and continued on… lol

This makes very little sense to me…
'However, in this case Kaggle tells us what metric to use: root mean squared log error (RMSLE) between the actual and predicted auction prices. We need do only a small amount of processing to use this: we take the log of the prices, so that rmse of that value will give us what we ultimately need:

dep_var = 'SalePrice'

df[dep_var] = np.log(df[dep_var])'

'SalePrice' (dep_var) is what is being predicted, so where is the difference from the original? Where is SalesPredicted - Sales? I feel so lost again…

This code from the Notebook finds the most similar movie:

movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]

Any idea what to change to find, say, the 5 most similar movies instead of one?

1 Like

Sure, you can use this code to get the 5 most similar movies:

idx = distances.argsort(descending=True)[1:6]

The argsort method returns the movie indices sorted by similarity in descending order. The most similar movie is the movie itself at index 0, so the other movies start from index 1.
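Putting it together, something like this should give you the titles (a small sketch, reusing the distances and dls objects from the notebook code above):

top5 = distances.argsort(descending=True)[1:6]      # index 0 is the movie itself, so skip it
[dls.classes['title'][i] for i in top5.tolist()]    # titles of the 5 most similar movies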

2 Likes

Thanks!! @johannesstutz :smiley:

Hey guys. So I am doing some experimentation on the Collab Notebook.

learn.fit_one_cycle(5, 5e-3)

Here Jeremy used 5e-3 as the max learning rate. So I was trying to find out why he used that exact number, so I ran lr_find and tried a different learning rate. The suggested one was 4e-6, but when I used it, the model losses were way worse (13.5 instead of 0.87 using Jeremy's learning rate).

Does anyone know why this happens? Or how to find an optimal learning rate for the DotProduct model?

I am also having the same confusion. I mean how do you determine which matrices to use?

1 Like

The step you cited replaces the values in the SalePrice column (which are in absolute US dollars I think) with the logarithm of the sale price. The reason for this is that the metric that the competition uses is on a log scale (root mean squared log error). So if we just convert the dependent variable to a log scale, we can use the (standard) RMSE error and we're good.

SalesPredicted - Sales: I'm not sure what you mean by that. The loss for every row is determined by the RMSE function, which takes the predicted value and the true value from the SalePrice column as arguments.
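To make that concrete, here is a tiny numpy sketch with made-up prices (ignoring the +1 that some RMSLE definitions add inside the log): RMSLE on the raw prices is just the standard RMSE on the logged prices.

import numpy as np

actual = np.array([10000., 20000., 30000.])    # made-up sale prices
pred   = np.array([12000., 18000., 33000.])    # made-up predictions

rmsle = np.sqrt(np.mean((np.log(pred) - np.log(actual))**2))    # RMSLE on raw prices

log_actual, log_pred = np.log(actual), np.log(pred)             # what df[dep_var] = np.log(...) does
rmse_on_logs = np.sqrt(np.mean((log_pred - log_actual)**2))     # plain RMSE on the logged values

assert np.isclose(rmsle, rmse_on_logs)                          # same number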

Let me know if that helped a little :slight_smile:

2 Likes

Hi everyone, I'm working on using the entity embeddings of the neural net to improve random forest results. This is all in the chapter 09_tabular notebook with the bulldozer bluebook dataset.

The first stumbling block: I don't quite get the dimensions of the embeddings. Every categorical variable should get its own embedding layer. This seems right:

embeds = list(learn.model.embeds.parameters())

len(embeds) as well as len(cat_nn) is 13.

Now my understanding was that the first dimension of the embedding layer is equal to the number of levels for the variable. The other dimension is determined by a heuristic that works well in practice.
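(As far as I can tell, that heuristic looks roughly like this; this is my reading of fastai's emb_sz_rule, so treat the exact constants as an approximation:)

def approx_emb_sz(n_levels):
    # cap at 600, otherwise grow slowly with the number of levels
    return min(600, round(1.6 * n_levels**0.56))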

However, these numbers don't match.

for i in range(len(cat_nn)):
    print(embeds[i].shape, df_nn_final[cat_nn[i]].nunique())

Gives following result:

torch.Size([73, 18]) 73
torch.Size([7, 5]) 6
torch.Size([3, 3]) 2
torch.Size([75, 18]) 74
torch.Size([4, 3]) 3
torch.Size([5242, 194]) 5281
torch.Size([178, 29]) 177
torch.Size([5060, 190]) 5059
torch.Size([7, 5]) 6
torch.Size([13, 7]) 12
torch.Size([7, 5]) 6
torch.Size([5, 4]) 4
torch.Size([18, 8]) 17

Where does the mismatch come from? Am I maybe using the wrong dataframes or do I have a wrong conception about embeddings?

Thank you!

Thanks johannesstutz

Yes, that helped a lot. I will continue fumbling through the code.

Though I have hit my next error already…

Everything, even the Kaggle download, worked up until the line

(path/'to.pkl').save(to)
which throws the traceback:
AttributeError Traceback (most recent call last)
----> 1 (path/'to.pkl').save(to)

AttributeError: 'PosixPath' object has no attribute 'save'

I did some googling and found


which seems to say that this error is raised when Path is used on a Linux system, which then defaults to a PosixPath object that has no 'save' attribute or method.

Researching more - any help appreciated.

There was a breaking change in the source code:
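If I remember correctly, the Path.save patch was removed in favor of save_pickle/load_pickle, so something like this should work on recent fastai versions (a sketch, not tested against your exact install):

save_pickle(path/'to.pkl', to)     # instead of (path/'to.pkl').save(to)
to = load_pickle(path/'to.pkl')    # and load_pickle to read it back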

1 Like

Thanks, trying now.

Collaborative Filtering:

How do I predict/get all the set of movies that a user will like?

EmbeddingDotBias(
(u_weight): Embedding(944, 50)
(i_weight): Embedding(1635, 50)
(u_bias): Embedding(944, 1)
(i_bias): Embedding(1635, 1)
)

Do we have to refer to the u_weight and i_weight to get all the movies recommended for a user?

Thanks
Ganesh Bhat

Well, I got through the decision tree example. Unfortunately, it does not explain how to test new data on the model. I skipped many preceding chapters, so I will need to circle back to 'Turning your model into an online application'.

Hi Ganesh, I think you could pull the embedding of a user (one of the 944 rows) and multiply it with the i_weight embedding, which represents the movies. Add the user bias for your user and the movie biases, and you have the raw predictions. Put this through the sigmoid_range function and you should have the predicted rating for every movie! Have fun and let me know if it worked!
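In code, that could look roughly like this (a sketch only, assuming the EmbeddingDotBias model printed above and the usual y_range of (0, 5.5); adjust to whatever y_range your learner used):

import torch

user_id = 5                                                 # a hypothetical user index
m = learn.model

with torch.no_grad():
    user_emb   = m.u_weight.weight[user_id]                 # this user's 50-dim embedding
    movie_embs = m.i_weight.weight                          # all 1635 movie embeddings
    scores = (movie_embs * user_emb).sum(dim=1)             # dot product with every movie
    scores = scores + m.u_bias.weight[user_id].squeeze()    # add the user bias
    scores = scores + m.i_bias.weight.squeeze()             # add each movie's bias
    preds = torch.sigmoid(scores) * (5.5 - 0) + 0           # sigmoid_range(0, 5.5)

top10 = preds.argsort(descending=True)[:10]                 # 10 highest predicted ratings
[dls.classes['title'][i] for i in top10.tolist()]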