Lesson 4 In-Class Discussion

(Pete Condon) #329

I guess my point is that I don’t understand how you would train the embedding value. They start off being initialised with some arbitrary value (random or otherwise) and then I would expect SGD to correct the errors, but I believe that potentially a lot of the training adjustments will go into the weights of your dense layer which won’t help you when you’re using the values later.

All this being said, if you find something useful then I’ll be very interested to hear about it

(Jeremy Howard) #330

This sounds really interesting. Looking forward to seeing what you find.

Interestingly, somethings having embeddings that are larger than the cardinality of the original variable actually help predictions. So perhaps binary embeddings could be useful there too…

(Brian Muhia) #332

Is the maximum cardinality of the embedding matrix a learnable concept? The heuristic of max(card//2, 50) was reached due to some research and whether the cardinality of the categorical variable is low enough, but is there another way to decide on the maximum that also takes into account the diminishing usefulness of very large cardinalities? [Edited to add more detail to the question]

(Hiromi Suenaga) #334

Did you figure this one out?

How to interpret IMDB sentiment predictions?
(Jeremy Howard) #335

Just to clarify: I think you mean “largest embedding size”, not “maximum cardinality of the embedding matrix”. A categorical variable has a cardinality, an embedding matrix doesn’t - although the number of rows in the embedding matrix is equal to the cardinality of the categorical variable it is used with, I think above you’re referring to the number of columns in the embedding matrix.

Assuming I’ve understood correctly, then yes, it’s certainly worth trying larger embeddings. For words, for instance, researchers have found up to 600 dimensional embeddings to be useful. It depends on how complex the underlying concept is, which you can’t know (AFAIK) without trying different embedding sizes.

The max of 50 is just a rule of thumb that’s worked for me so far.

(Rob H) #336

Yes, I modified the source of nlp .py to be

``````    trn_iter,val_iter,test_iter = torchtext.data.BucketIterator.splits(splits, batch_sizes=(bs, bs, 1))
``````

There should be a fix coming through the chain soon. Being discussed here, along with a couple other issues to note.

(Hiromi Suenaga) #337

Make sense. Thanks!

(Anurag Goel) #338

As far as I can tell, other users are not facing this issue. If you can share more details of the errors you’re seeing I can take another look.

(Miguel Perez Michaus) #339

Ok thanks for your answer @anurag , that’s good news (unless it is location related) ,

I will do some more tries next days and if issue continues I will give you more details.

I hope I will be able to make it work cause Crestle seems to me a very well conceived tool, amazing local-setup independent portability and such a clear interface…!

(Vikrant Behal) #340

Guess that depends on the problem we are trying to solve. I believe this is architecture!

(Vikrant Behal) #341

Eventually, through embedding, we are creating bunch of numbers which represent categorical variables. Dropout happen during embedding creation which isn’t required for continuous variables.
Next, once we’ve all of these numbers which fed into neural network, the dropout can be applied at each such layer (to avoid overfitting).

(Vikrant Behal) #342

I believe if time-series does impact our model then we must add it. e.g. prediction of sales where specific detail like the day or time matters.

(Vikrant Behal) #343

Nope. Jeremy mentioned that it works at word level and not character level.

(Eric Perbos-Brinck) #344

(sorry for the delay, things got a bit hectic around here )

Video timelines for Lesson 4

• 00:00:04 More cool guides & posts made by Fast.ai classmates
"Improving the way we work with learning rate", “Cyclical Learning Rate technique”,
“Exploring Stochastic Gradient Descent with Restarts (SGDR)”, “Transfer Learning using differential learning rates”, “Getting Computers to see better than Humans”

• 00:03:04 Where we go from here: Lesson 3 -> 4 -> 5
Structured Data Deep Learning, Natural Language Processing (NLP), Recommendation Systems

• 00:05:04 Dropout discussion with “Dog_Breeds”,
looking at a sequential model’s layers with ‘learn’, Linear activation, ReLu, LogSoftmax

• 00:18:04 Question: “What kind of ‘p’ to use for Dropout as default”, overfitting, underfitting, ‘xtra_fc=’

• 00:23:45 Question: “Why monitor the Loss / LogLoss vs Accuracy”

• 00:25:04 Looking at Structured and Time Series data with Rossmann Kaggle competition, categorical & continuous variables, ‘.astype(‘category’)’

• 00:35:50 fastai library ‘proc_df()’, ‘yl = np.log(y)’, missing values, ‘train_ratio’, ‘val_idx’. “How (and why) to create a good validation set” post by Rachel

• 00:39:45 RMSPE: Root Mean Square Percentage Error,
create ModelData object, ‘md = ColumnarModelData.from_data_frame()’

• 00:45:30 ‘md.get_learner(emb_szs,…)’, embeddings

• 00:50:40 Dealing with categorical variables
like ‘day-of-week’ (Rossmann cont.), embedding matrices, ‘cat_sz’, ‘emb_szs’, Pinterest, Instacart

• 01:07:10 Improving Date fields with ‘add_datepart’, and final results & questions on Rossmann, step-by-step summary of Jeremy’s approach

Pause

• 01:20:10 More discussion on using Fast.ai library for Structured Data.

• 01:23:30 Intro to Natural Language Processing (NLP)
notebook ‘lang_model-arxiv.ipynb’

• 01:31:15 Creating a Language Model with IMDB dataset
notebook ‘lesson4-imdb.ipynb’

• 01:39:30 Tokenize: splitting a sentence into an array of tokens

• 01:43:45 Build a vocabulary ‘TEXT.vocab’ with ‘dill/pickle’; ‘next(iter(md.trn_dl))’

The rest of the video covers the ins and outs of the notebook ‘lesson4-imdb’, don’t forget to use 'J’
and ‘L’ for 10 sec backward/forward on YouTube videos.

• 02:11:30 Intro to Lesson 5: Collaborative Filtering with Movielens

(Sanyam Bhutani) #345

I’m trying to use the create a language model with a custom dataset.

I’ve loaded a .csv dataset in and saved it to a .txt file after performing operations on it as a pandas data frame.

Now the `LanguageModelData.from_text_files` gives the error.
`ascii' codec can't decode byte 0xcc in position 6836: ordinal not in range(128)`

The .txt file displays its encoding as utf-8 according to sublime text.

Also, I’m saving the dataset to a single concatenated .txt file rather than a number of them, since I’m reading from a csv. Will this work or do I have to do something differently?

Regards,
Sanyam Bhutani.

(Ramesh Sampath) #346

Can you put a screenshot of which line gives that error. Can you put your sample data somewhere for us to take a look and test?

It would be great if you can upload your notebook and sample data to a gist via `gist.github.com` so that we can replicate the issue and fix it.

(Rob H) #347

You can find other threads on this in the forums. It’s most likely because your environment’s locale isn’t set up properly.

I believe some people using the Amazon AMI had that problem.

I didn’t have the issue so I’m sorry I can’t share the right solution, but I know it’s been discussed.

(Sanyam Bhutani) #348

(Sanyam Bhutani) #349

I tried to search for similar posts on the forum, I’ll search again.

I’ve tried updating the conda env,
git pull
Both after activating the fasai env and navigating to the fasai directory

(Rob H) #350

see this comment and the three after it, as well as the links mentioned

edit: @guthl may know a solution