Lesson 4 In-Class Discussion

I’m pretty sure that mathematically it won’t make any difference for binary flags. The values of the embeddings will be multiplied by a weight in the fully connected layer, so I can’t see how there would be any difference between training the embedding value and training the weight. Although I’d be very happy to be proven wrong :slight_smile:
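
To make that concrete, here’s a tiny toy example (mine, not anything from the lesson) showing that for a binary flag, an embedding followed by a linear layer can only produce outputs of the form bias + weight * flag, i.e. exactly what you get by feeding the raw 0/1 value into the dense layer:

    import torch
    import torch.nn as nn

    flag = torch.tensor([0, 1, 1, 0])       # the binary categorical variable

    emb = nn.Embedding(2, 3)                # 2 categories -> 3-dim embedding
    lin = nn.Linear(3, 1)
    out_via_embedding = lin(emb(flag))      # shape (4, 1)

    # The composite map can only take two values, lin(emb(0)) and lin(emb(1)),
    # so it is equivalent to bias + weight * flag with weight = lin(emb(1)) - lin(emb(0)).
    with torch.no_grad():
        c0 = lin(emb(torch.tensor([0])))
        c1 = lin(emb(torch.tensor([1])))
        out_direct = c0 + (c1 - c0) * flag.float().unsqueeze(1)

    print(torch.allclose(out_via_embedding, out_direct))  # True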

I wasn’t referring to the impact on classification results; on that point I agree. I am working on something that uses features built from a combination of embeddings for clustering after the prediction is done. In one case, if I use the binary flags directly they indicate the records are different; however, represented as embeddings they may not show as much differentiation if the feature doesn’t have predictive power.

I guess my point is that I don’t understand how you would train the embedding value. The embeddings start off initialised with some arbitrary values (random or otherwise), and then I would expect SGD to correct the errors, but I suspect a lot of the training adjustments will go into the weights of your dense layer, which won’t help you when you use the values later.

All this being said, if you find something useful then I’ll be very interested to hear about it :slight_smile:

This sounds really interesting. Looking forward to seeing what you find.

Interestingly, sometimes having embeddings that are larger than the cardinality of the original variable actually helps predictions. So perhaps binary embeddings could be useful there too…


Is the maximum cardinality of the embedding matrix a learnable concept? The heuristic of min(card//2, 50) came from some research and from whether the cardinality of the categorical variable is low enough, but is there another way to decide on the maximum that also takes into account the diminishing usefulness of very large cardinalities? [Edited to add more detail to the question]

Did you figure this one out?

Just to clarify: I think you mean “largest embedding size”, not “maximum cardinality of the embedding matrix”. A categorical variable has a cardinality; an embedding matrix doesn’t. The number of rows in the embedding matrix equals the cardinality of the categorical variable it is used with, so I think above you’re referring to the number of columns in the embedding matrix.

Assuming I’ve understood correctly, then yes, it’s certainly worth trying larger embeddings. For words, for instance, researchers have found up to 600 dimensional embeddings to be useful. It depends on how complex the underlying concept is, which you can’t know (AFAIK) without trying different embedding sizes.

The max of 50 is just a rule of thumb that’s worked for me so far.
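
In code, that rule of thumb is roughly the following (the cardinalities below are just made-up examples):

    # embedding size: roughly half the cardinality, capped at 50
    cat_sz = [('Store', 1116), ('DayOfWeek', 8), ('Promo', 3)]     # (name, cardinality) pairs
    emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
    print(emb_szs)  # [(1116, 50), (8, 4), (3, 2)]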


Yes, I modified the source of nlp.py to be:

    # build train/validation/test BucketIterators, then wrap each in a fastai TextDataLoader
    trn_iter,val_iter,test_iter = torchtext.data.BucketIterator.splits(splits, batch_sizes=(bs, bs, 1))
    trn_dl = TextDataLoader(trn_iter, text_name, label_name)
    val_dl = TextDataLoader(val_iter, text_name, label_name)
    test_dl = TextDataLoader(test_iter, text_name, label_name)

There should be a fix coming through the chain soon. It’s being discussed here, along with a couple of other issues to note.

Makes sense. Thanks! :slight_smile:

As far as I can tell, other users are not facing this issue. If you can share more details of the errors you’re seeing I can take another look.


OK, thanks for your answer @anurag, that’s good news (unless it is location-related).

I will try a few more times over the next few days, and if the issue continues I will give you more details.

I hope I will be able to make it work, because Crestle seems to me a very well-conceived tool: amazing portability independent of any local setup, and such a clear interface…!


I guess that depends on the problem we are trying to solve. I believe this is an architecture decision!

Ultimately, through embeddings, we are creating a bunch of numbers that represent the categorical variables. Dropout can be applied when the embeddings are created, which isn’t required for continuous variables.
Next, once all of these numbers are fed into the neural network, dropout can be applied at each subsequent layer (to avoid overfitting).
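
As a rough sketch of that idea (a toy module of my own, not the fastai implementation): embedding dropout is applied only to the concatenated embedding outputs, the continuous inputs join afterwards, and ordinary dropout is applied after the fully connected layer.

    import torch
    import torch.nn as nn

    class TabularNet(nn.Module):
        def __init__(self, emb_szs, n_cont, emb_drop=0.04, p=0.1):
            super().__init__()
            self.embs = nn.ModuleList([nn.Embedding(c, s) for c, s in emb_szs])
            self.emb_drop = nn.Dropout(emb_drop)   # dropout on the embedding outputs only
            n_emb = sum(s for _, s in emb_szs)
            self.fc1 = nn.Linear(n_emb + n_cont, 100)
            self.fc2 = nn.Linear(100, 1)
            self.drop = nn.Dropout(p)              # dropout on the hidden layer

        def forward(self, x_cat, x_cont):
            x = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embs)], dim=1)
            x = self.emb_drop(x)                   # categorical path only
            x = torch.cat([x, x_cont], dim=1)      # continuous features skip embedding dropout
            x = self.drop(torch.relu(self.fc1(x)))
            return self.fc2(x)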

I believe that if the time series does impact our model then we must add it, e.g. predicting sales where specific details like the day or time matter.
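
In this course’s workflow that usually means expanding the date column into categorical parts, roughly like this (the tiny frame below is just an illustration):

    from fastai.structured import add_datepart   # fastai 0.7-era import used in the course notebooks
    import pandas as pd

    # add_datepart expands 'Date' into parts such as Year, Month, Week, Day,
    # Dayofweek, ... so the model can pick up day/time effects
    df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-08-01']),
                       'Sales': [5263, 6064]})
    add_datepart(df, 'Date')
    print(df.columns)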

Nope. Jeremy mentioned that it works at word level and not character level.

(sorry for the delay, things got a bit hectic around here :upside_down_face:)

Video timelines for Lesson 4

  • 00:00:04 More cool guides & posts made by Fast.ai classmates
    "Improving the way we work with learning rate", “Cyclical Learning Rate technique”,
    “Exploring Stochastic Gradient Descent with Restarts (SGDR)”, “Transfer Learning using differential learning rates”, “Getting Computers to see better than Humans”

  • 00:03:04 Where we go from here: Lesson 3 -> 4 -> 5
    Structured Data Deep Learning, Natural Language Processing (NLP), Recommendation Systems

  • 00:05:04 Dropout discussion with “Dog_Breeds”,
    looking at a sequential model’s layers with ‘learn’, Linear activation, ReLU, LogSoftmax

  • 00:18:04 Question: “What kind of ‘p’ to use for Dropout as default”, overfitting, underfitting, ‘xtra_fc=’

  • 00:23:45 Question: “Why monitor the Loss / LogLoss vs Accuracy”

  • 00:25:04 Looking at Structured and Time Series data with Rossmann Kaggle competition, categorical & continuous variables, ‘.astype("category")’

  • 00:35:50 fastai library ‘proc_df()’, ‘yl = np.log(y)’, missing values, ‘train_ratio’, ‘val_idx’. “How (and why) to create a good validation set” post by Rachel

  • 00:39:45 RMSPE: Root Mean Square Percentage Error,
    create ModelData object, ‘md = ColumnarModelData.from_data_frame()’

  • 00:45:30 ‘md.get_learner(emb_szs,…)’, embeddings

  • 00:50:40 Dealing with categorical variables
    like ‘day-of-week’ (Rossmann cont.), embedding matrices, ‘cat_sz’, ‘emb_szs’, Pinterest, Instacart

  • 01:07:10 Improving Date fields with ‘add_datepart’, and final results & questions on Rossmann, step-by-step summary of Jeremy’s approach

Pause

  • 01:20:10 More discussion on using Fast.ai library for Structured Data.

  • 01:23:30 Intro to Natural Language Processing (NLP)
    notebook ‘lang_model-arxiv.ipynb’

  • 01:31:15 Creating a Language Model with IMDB dataset
    notebook ‘lesson4-imdb.ipynb’

  • 01:39:30 Tokenize: splitting a sentence into an array of tokens

  • 01:43:45 Build a vocabulary ‘TEXT.vocab’ with ‘dill/pickle’; ‘next(iter(md.trn_dl))’

The rest of the video covers the ins and outs of the notebook ‘lesson4-imdb’; don’t forget to use ‘J’
and ‘L’ to jump 10 seconds backward/forward on YouTube videos.

  • 02:11:30 Intro to Lesson 5: Collaborative Filtering with Movielens

I’m trying to create a language model with a custom dataset.

I’ve loaded a .csv dataset in and saved it to a .txt file after performing operations on it as a pandas data frame.

Now LanguageModelData.from_text_files gives the error:
'ascii' codec can't decode byte 0xcc in position 6836: ordinal not in range(128)

The .txt file shows its encoding as UTF-8 according to Sublime Text.

Also, I’m saving the dataset to a single concatenated .txt file rather than a number of them, since I’m reading from a csv. Will this work or do I have to do something differently?
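
In case it helps, this is roughly what my concatenation step looks like (file and column names below are just placeholders):

    import pandas as pd

    # read the csv, then write all the text out as a single .txt file,
    # being explicit about the encoding rather than relying on the default
    df = pd.read_csv('my_data.csv')
    with open('train/all_text.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(df['text'].astype(str)))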

Please help!

Regards,
Sanyam Bhutani.

Can you post a screenshot of which line gives that error? Can you put your sample data somewhere for us to take a look and test?

It would be great if you could upload your notebook and sample data to a gist via gist.github.com so that we can replicate the issue and fix it.

You can find other threads on this in the forums. It’s most likely because your environment’s locale isn’t set up properly.

I believe some people using the Amazon AMI had that problem.

I didn’t have the issue so I’m sorry I can’t share the right solution, but I know it’s been discussed.
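
One quick thing worth checking is what encoding Python itself is defaulting to; if it isn’t UTF-8, that points at the locale rather than the file:

    import locale

    # on a mis-configured machine this often prints 'ANSI_X3.4-1968' (ASCII)
    # instead of 'UTF-8', which would explain the ascii codec error above
    print(locale.getpreferredencoding())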

@ramesh

Link to the Gist

Screenshot: