Lesson 4 Advanced Discussion ✅

I am wondering whether the following problem can be solved using collaborative filtering. Users answer multiple-choice Questions. If a user answers a question they may choose A, B, C or D. So the dependent variable that we would like to predict is categorical. Based on how a user has answered some questions I would like to predict which answer they would choose for a new question.

Can this be easily solved using collaborative filtering? Or would you use “Users” and “Question-Answers” instead of “Users” and “Questions”? So instead of saying “User 1233 answered Question 123213 with value C” we would say “For User 1233 and Question-Answer 123213-C give a value of 1, for User 1233 and Question-Answer 123213-A give a value of 0, etc.”.
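A quick sketch of that reshaping (my own, with hypothetical column names), turning one chosen answer per row into four 0/1-rated “question-answer” items:

import pandas as pd

# Hypothetical input: one row per (user, question, chosen answer)
responses = pd.DataFrame({
    'user':     [1233, 1233, 4567],
    'question': [123213, 555, 123213],
    'answer':   ['C', 'A', 'B'],
})

# Expand every answered question into four "question-answer" items with a 0/1 rating
choices = pd.DataFrame({'choice': ['A', 'B', 'C', 'D']})
expanded = (responses.assign(_k=1)
            .merge(choices.assign(_k=1), on='_k')
            .drop(columns='_k'))
expanded['item'] = expanded['question'].astype(str) + '-' + expanded['choice']
expanded['rating'] = (expanded['answer'] == expanded['choice']).astype(int)
expanded = expanded[['user', 'item', 'rating']]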

(Absolutely loved the Excel demo of collaborative filtering btw!)

I’ve made an app that does collaborative filtering on board game reviews: Check it out.

Basically I used the same code as in class, but afterwards explored the game embeddings with nearest neighbors (sklearn). This way you can find the most ‘similar’ games, with the idea that there might be similar games that are rated higher.

Example searches: Catan, chess, Magic, etc.
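For reference, a rough sketch of the nearest-neighbour step described above (my own reconstruction, not the app’s exact code; `learn` is the trained collab learner and `game_idx` a hypothetical internal game id):

from sklearn.neighbors import NearestNeighbors

# Item (game) embeddings learned by the collaborative filtering model
game_emb = learn.model.i_weight.weight.detach().cpu().numpy()

# Fit a nearest-neighbour index on the embeddings and query one game
nn_index = NearestNeighbors(n_neighbors=10, metric='cosine').fit(game_emb)
dist, idx = nn_index.kneighbors(game_emb[[game_idx]])
similar_games = idx[0][1:]   # drop the query game itself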

Previously I could only program Python in notebooks, but with the course you’ve pushed me to use cloud computing and make a web app with JS and HTML. It took a lot of time, but I’m really proud of the result. Thanks!


Morning, everyone
I am trying to use the “data block API” as below to load and preprocess my text files (including some characters that need to be decoded as “ISO-8859-1”). After running the script below, I get the error “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xe9 in position 951: invalid continuation byte”. Does anyone know how to fix that? Thanks!

data_lm = (TextList.from_folder(path)
           # Inputs: all the text files in path
           .filter_by_folder(include=['Jobs'])
           # We may have other temp folders that contain text files, so we only keep what's in 'Jobs'
           .split_by_rand_pct(0.1)
           # We randomly split and keep 10% for validation
           .label_for_lm()
           # We want to do a language model, so we label accordingly
           .databunch(bs=bs))
data_lm.save('data_lm.pkl')
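One workaround (a sketch of my own, not an official fix, and assuming the files end in .txt) is to re-encode the offending files to UTF-8 before building the databunch, since the error shows fastai reading the text files as UTF-8; `path` is the same folder used above:

# Re-save every text file under the 'Jobs' folder as UTF-8,
# reading it with the encoding it was actually written in (ISO-8859-1 here).
for f in (path/'Jobs').glob('**/*.txt'):
    text = f.read_text(encoding='ISO-8859-1')
    f.write_text(text, encoding='utf-8')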

@jeremy Hello Jeremy,

First of all thanks a lot for this series of classes on deep learning which actually supported me in changing the course of my career.

I have a quick comment on lesson 4, regarding the illustration of collaborative filtering in Excel. You introduce the importance of having a bias term in the user and movie embeddings in order to take into account user or movie specifics. By adding the bias term, you show that the RMSE on the small dataset in Excel drops from 0.39 to 0.32, and conclude that bias is a useful addition to these matrices.

I do not believe the conclusion, nor the reasoning, is correct. You essentially increase your model’s parameter count by 20% (going from 5 to 6 rows for both movies and users) and observe a decrease in the RMSE on your training set. That would happen for almost any model. In fact, I am wondering whether you would have seen a different effect if you had just added another row to the weight matrices (not as a bias).

My intuition is that the “bias” effect can be fully captured by having, on average, lower/higher weights for a given movie or user, so the same effect can still be achieved by a plain matrix multiplication.

This may be harder for the model to learn (and hence the benefit of bias), but I find the process to reach your conclusion arguable.
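To make that intuition concrete, here is a tiny numeric check (my own sketch, not from the lesson) showing that the two bias terms can be folded into two extra embedding factors, one of which is pinned to 1:

import numpy as np

np.random.seed(0)
u, m = np.random.randn(5), np.random.randn(5)   # 5-factor user and movie embeddings
b_u, b_m = 0.3, -0.7                            # user and movie biases

with_bias = u @ m + b_u + b_m

# Fold the biases into two extra factors: [u, 1, b_u] . [m, b_m, 1]
u6 = np.concatenate([u, [1.0, b_u]])
m6 = np.concatenate([m, [b_m, 1.0]])
without_bias = u6 @ m6

assert np.isclose(with_bias, without_bias)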

Thanks again for the course, I’ve been following and re-watching them for the past years!

As I understand it, reinforcement learning is a machine learning category alongside supervised and unsupervised learning. In the deep learning world, the major models are convolutional and recurrent. According to this link, they all have different use cases.
https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/an-executives-guide-to-ai

My understanding is that the focus next semester is on machine learning, so maybe your question will be covered then.

Another total rookie here! Yeah, it must be so. When I send a text I get a bar above my entry box that offers choices of the next word to write.

I just ditched Apple and got Android. STDS* I do not know what either one is using, but I bet it’s something like ImageNet.

*Same Thing, Different Software

It’s incredible how easy it is to fool the model from the class, which has 94% accuracy!

learn_clas.predict("This movie is awesome!")

(Category pos, tensor(1), tensor([0.0061, 0.9939]))

By adding a ‘.’ at the end (similar to a typo), the output is misclassified.

learn_clas.predict("This movie is awesome!.")

(Category neg, tensor(0), tensor([0.9154, 0.0846]))

I tried adding other symbols at the end (‘,’, ‘;’, ‘!’), but only ‘.’ causes a misclassification.
Does anyone have a reasoning/explanation for why the model can fail on such a basic example?


Hello,

I am trying to use the tabular learner to identify the owner of a group. The script completes successfully; details are below. But in the output file I don’t see any positive predictions (all the predictions come out as ‘No’, i.e. 0). Any suggestions on what to check, or how to check it?
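One thing worth checking (a sketch of my own, assuming a standard fastai v1 tabular learner with a test set added) is the raw predicted probabilities, to see whether the positive class is simply never crossing the default 0.5 threshold:

from fastai.basic_data import DatasetType

# Raw per-class probabilities instead of the thresholded 'No'/'Yes' labels
preds, _ = learn.get_preds(ds_type=DatasetType.Test)
print(preds[:10])                        # first few rows of class probabilities
print((preds[:, 1] > 0.5).sum().item())  # how many rows the model actually calls positive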

@rachel Hi Rachel,
Could you please share the pretrained viwiki LM?
Thanks!

Hello,

I have a question about the EmbeddingDotBias model used in collaborative filtering. The code in the forward() part of the model does:

dot = self.u_weight(users)* self.i_weight(items)
res = dot.sum(1) + self.u_bias(users).squeeze() + self.i_bias(items).squeeze()

As far as I know, the * on the first line does elementwise multiplication of the user and item tensors. In the IMDb movie example, these two tensors happen to have the same shape. Won’t this fail if that’s not the case? I.e. if we have more users than movies, what happens in the dot = ... line?

I don’t think this is a problem. If you have N users and M items, dot will have NxM shape. In this case, when you are doing matrix multiplication you only need to ensure that you have one side of each tensor with the same size, which corresponds to the number of factors.
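For reference, a minimal shape sketch (mine, mirroring the EmbeddingDotBias code above): in the batched forward pass each mini-batch row pairs one user index with one item index, so both embedding lookups come out with the same (bs, n_factors) shape regardless of how many users or items there are in total.

import torch
import torch.nn as nn

n_users, n_items, n_factors, bs = 1000, 50, 40, 64

u_weight = nn.Embedding(n_users, n_factors)
i_weight = nn.Embedding(n_items, n_factors)

# One mini-batch: each row is a (user, item) pair, so both index tensors have length bs
users = torch.randint(0, n_users, (bs,))
items = torch.randint(0, n_items, (bs,))

dot = u_weight(users) * i_weight(items)   # (bs, n_factors) elementwise
res = dot.sum(1)                          # (bs,) one prediction per pair
print(dot.shape, res.shape)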

The TextDataBunch creates the tokens automatically and removes most of the words from my dataset, mapping them to ‘unknown’. Since my dataset is small, I don’t want anything to be ‘unknown’: every word is important to me, and most words appear only once.

How can I use TextDataBunch but do simple tokenization and not map words to ‘unknown’?
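One option (a sketch assuming fastai v1’s processor API; adjust the data source and batch size to your own setup) is to pass a NumericalizeProcessor with min_freq=1 and a large max_vocab, so words that appear only once still get their own token instead of xxunk:

from fastai.text import TextList, TokenizeProcessor, NumericalizeProcessor

# Keep every token seen at least once, up to a generous vocab cap,
# so almost nothing is replaced by the 'unknown' (xxunk) token.
processor = [TokenizeProcessor(), NumericalizeProcessor(min_freq=1, max_vocab=100000)]

data = (TextList.from_folder(path, processor=processor)
        .split_by_rand_pct(0.1)
        .label_for_lm()
        .databunch(bs=48))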

@jeremy,

For numericalization, we call data.train_ds[0][0].data[:10]. I am unable to understand where the .data[:10] is coming from. The class of data.train_ds[0][0] is fastai.text.data.Text, and that class has no methods in it. How are we managing to call .data[:10], and which class does it belong to?
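A generic way to track attributes like this down (standard Python introspection, not fastai-specific; if I recall the fastai v1 source correctly, Text subclasses ItemBase, which sets data as a plain instance attribute in its __init__ rather than a method):

item = data.train_ds[0][0]       # the fastai.text.data.Text object
print(type(item).__mro__)        # full class hierarchy, e.g. Text -> ItemBase -> object
print('data' in vars(item))      # True if `data` is a plain instance attribute set in __init__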

Hi All,

So I’m training a sentiment classifier for a very specific task that involves English communication between a client and a team, and I have only 3,500 annotated text examples. I’m a little worried about whether this much data is enough to create a good classifier.

Did you ever get an answer on this @martijnd? I am curious to know what the answer was.

@jeremy, I guess this might no longer be on your watchlist, but I would like to ask a question that was asked by @martijnd in Jul 2018 and never answered. I hope someone will pick it up and answer it this time.
When we train the language model for our own dataset on top of the wiki103 LM, why don’t we have to align the vocab of our new dataset (IMDB) with the Wiki103 one? For example like this:
data_lm = (TextList.from_folder(path, vocab=data_lm.vocab)

Like we do when we want to initialise the Classifier.
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)

If I remember correctly, fastai internally aligns it for you (I remember wondering the same thing and looking into the source code to confirm).


Correct

The reason is that there’s a convert_weights function that maps those WT103 weights (or any LM weights) onto our new corpus, see it below:

def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model `wgts` to go with a new vocabulary."
    dec_bias, enc_wgts = wgts.get('1.decoder.bias', None), wgts['0.encoder.weight']
    wgts_m = enc_wgts.mean(0)
    if dec_bias is not None: bias_m = dec_bias.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new),enc_wgts.size(1))).zero_()
    if dec_bias is not None: new_b = dec_bias.new_zeros((len(itos_new),)).zero_()
    for i,w in enumerate(itos_new):
        r = stoi_wgts[w] if w in stoi_wgts else -1
        new_w[i] = enc_wgts[r] if r>=0 else wgts_m
        if dec_bias is not None: new_b[i] = dec_bias[r] if r>=0 else bias_m
    wgts['0.encoder.weight'] = new_w
    if '0.encoder_dp.emb.weight' in wgts: wgts['0.encoder_dp.emb.weight'] = new_w.clone()
    wgts['1.decoder.weight'] = new_w.clone()
    if dec_bias is not None: wgts['1.decoder.bias'] = new_b
    return wgts

Thanks, @vijayabhaskar and @muellerzr, I appreciate it.

Hey, I have a doubt regarding the fastai recommendation model. I have trained the model, but the problem is that for prediction the model only accepts test data in the form of a pandas DataFrame.
I want to recommend new products to users, but with 45,000 users and 4,200 products in total, the DataFrame would have 45000*4200 rows and memory runs out. Is there any other way to perform inference, like a matrix multiplication user_emb * item_emb, other than passing a DataFrame?
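One way around this (a sketch of my own, assuming the fastai v1 EmbeddingDotBias model discussed earlier; it skips any y_range sigmoid scaling, and the indices are the model’s internal ids, which still need mapping back to your original user/product ids) is to score users against all items directly from the learned weights, in chunks, rather than building a giant DataFrame:

import torch

model = learn.model.eval()   # trained collab learner

with torch.no_grad():
    user_w = model.u_weight.weight            # (n_users, n_factors)
    item_w = model.i_weight.weight            # (n_items, n_factors)
    user_b = model.u_bias.weight.squeeze(1)   # (n_users,)
    item_b = model.i_bias.weight.squeeze(1)   # (n_items,)

    top_items = []
    # Score 1024 users at a time so the (chunk, n_items) score matrix stays small
    for chunk in torch.split(torch.arange(user_w.size(0)), 1024):
        scores = user_w[chunk] @ item_w.t()               # dot products
        scores += user_b[chunk].unsqueeze(1) + item_b     # add both biases
        top_items.append(scores.topk(10, dim=1).indices)  # 10 best items per user
    top_items = torch.cat(top_items)                      # (n_users, 10)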