Lesson 4 In-Class Discussion

pete.condon · November 26, 2017, 12:30pm

I guess my point is that I don’t understand how you would train the embedding value. They start off being initialised with some arbitrary value (random or otherwise) and then I would expect SGD to correct the errors, but I believe that potentially a lot of the training adjustments will go into the weights of your dense layer which won’t help you when you’re using the values later.

All this being said, if you find something useful then I’ll be very interested to hear about it

jeremy · November 26, 2017, 3:26pm

This sounds really interesting. Looking forward to seeing what you find.

Interestingly, somethings having embeddings that are larger than the cardinality of the original variable actually help predictions. So perhaps binary embeddings could be useful there too…

poppingtonic · November 26, 2017, 7:23pm

Is the maximum cardinality of the embedding matrix a learnable concept? The heuristic of max(card//2, 50) was reached due to some research and whether the cardinality of the categorical variable is low enough, but is there another way to decide on the maximum that also takes into account the diminishing usefulness of very large cardinalities? [Edited to add more detail to the question]

hiromi · November 26, 2017, 7:32pm

Did you figure this one out?

jeremy · November 26, 2017, 8:15pm

Just to clarify: I think you mean “largest embedding size”, not “maximum cardinality of the embedding matrix”. A categorical variable has a cardinality, an embedding matrix doesn’t - although the number of rows in the embedding matrix is equal to the cardinality of the categorical variable it is used with, I think above you’re referring to the number of columns in the embedding matrix.

Assuming I’ve understood correctly, then yes, it’s certainly worth trying larger embeddings. For words, for instance, researchers have found up to 600 dimensional embeddings to be useful. It depends on how complex the underlying concept is, which you can’t know (AFAIK) without trying different embedding sizes.

The max of 50 is just a rule of thumb that’s worked for me so far.

rob · November 26, 2017, 8:21pm

Yes, I modified the source of nlp .py to be

    trn_iter,val_iter,test_iter = torchtext.data.BucketIterator.splits(splits, batch_sizes=(bs, bs, 1))
    trn_dl = TextDataLoader(trn_iter, text_name, label_name)
    val_dl = TextDataLoader(val_iter, text_name, label_name)
    test_dl = TextDataLoader(test_iter, text_name, label_name)

There should be a fix coming through the chain soon. Being discussed here, along with a couple other issues to note.

hiromi · November 26, 2017, 8:49pm

Make sense. Thanks!

anurag · November 27, 2017, 7:06am

As far as I can tell, other users are not facing this issue. If you can share more details of the errors you’re seeing I can take another look.

miguel_perez · November 27, 2017, 9:07am

Ok thanks for your answer @anurag , that’s good news (unless it is location related) ,

I will do some more tries next days and if issue continues I will give you more details.

I hope I will be able to make it work cause Crestle seems to me a very well conceived tool, amazing local-setup independent portability and such a clear interface…!

vikbehal · November 27, 2017, 9:20pm

Guess that depends on the problem we are trying to solve. I believe this is architecture!

vikbehal · November 27, 2017, 9:26pm

Eventually, through embedding, we are creating bunch of numbers which represent categorical variables. Dropout happen during embedding creation which isn’t required for continuous variables.
Next, once we’ve all of these numbers which fed into neural network, the dropout can be applied at each such layer (to avoid overfitting).

vikbehal · November 27, 2017, 9:29pm

I believe if time-series does impact our model then we must add it. e.g. prediction of sales where specific detail like the day or time matters.

vikbehal · November 27, 2017, 9:32pm

Nope. Jeremy mentioned that it works at word level and not character level.

EricPB · November 28, 2017, 6:36pm

(sorry for the delay, things got a bit hectic around here )

Video timelines for Lesson 4

00:00:04 More cool guides & posts made by Fast.ai classmates
"Improving the way we work with learning rate", “Cyclical Learning Rate technique”,
“Exploring Stochastic Gradient Descent with Restarts (SGDR)”, “Transfer Learning using differential learning rates”, “Getting Computers to see better than Humans”
00:03:04 Where we go from here: Lesson 3 -> 4 -> 5
Structured Data Deep Learning, Natural Language Processing (NLP), Recommendation Systems
00:05:04 Dropout discussion with “Dog_Breeds”,
looking at a sequential model’s layers with ‘learn’, Linear activation, ReLu, LogSoftmax
00:18:04 Question: “What kind of ‘p’ to use for Dropout as default”, overfitting, underfitting, ‘xtra_fc=’
00:23:45 Question: “Why monitor the Loss / LogLoss vs Accuracy”
00:25:04 Looking at Structured and Time Series data with Rossmann Kaggle competition, categorical & continuous variables, ‘.astype(‘category’)’
00:35:50 fastai library ‘proc_df()’, ‘yl = np.log(y)’, missing values, ‘train_ratio’, ‘val_idx’. “How (and why) to create a good validation set” post by Rachel
00:39:45 RMSPE: Root Mean Square Percentage Error,
create ModelData object, ‘md = ColumnarModelData.from_data_frame()’
00:45:30 ‘md.get_learner(emb_szs,…)’, embeddings
00:50:40 Dealing with categorical variables
like ‘day-of-week’ (Rossmann cont.), embedding matrices, ‘cat_sz’, ‘emb_szs’, Pinterest, Instacart
01:07:10 Improving Date fields with ‘add_datepart’, and final results & questions on Rossmann, step-by-step summary of Jeremy’s approach

Pause

01:20:10 More discussion on using Fast.ai library for Structured Data.
01:23:30 Intro to Natural Language Processing (NLP)
notebook ‘lang_model-arxiv.ipynb’
01:31:15 Creating a Language Model with IMDB dataset
notebook ‘lesson4-imdb.ipynb’
01:39:30 Tokenize: splitting a sentence into an array of tokens
01:43:45 Build a vocabulary ‘TEXT.vocab’ with ‘dill/pickle’; ‘next(iter(md.trn_dl))’

The rest of the video covers the ins and outs of the notebook ‘lesson4-imdb’, don’t forget to use 'J’
and ‘L’ for 10 sec backward/forward on YouTube videos.

02:11:30 Intro to Lesson 5: Collaborative Filtering with Movielens

init_27 · November 28, 2017, 8:38pm

I’m trying to use the create a language model with a custom dataset.

I’ve loaded a .csv dataset in and saved it to a .txt file after performing operations on it as a pandas data frame.

Now the LanguageModelData.from_text_files gives the error.
ascii' codec can't decode byte 0xcc in position 6836: ordinal not in range(128)

The .txt file displays its encoding as utf-8 according to sublime text.

Also, I’m saving the dataset to a single concatenated .txt file rather than a number of them, since I’m reading from a csv. Will this work or do I have to do something differently?

Please help!

Regards,
Sanyam Bhutani.

ramesh · November 28, 2017, 9:00pm

Can you put a screenshot of which line gives that error. Can you put your sample data somewhere for us to take a look and test?

It would be great if you can upload your notebook and sample data to a gist via gist.github.com so that we can replicate the issue and fix it.

rob · November 28, 2017, 9:21pm

You can find other threads on this in the forums. It’s most likely because your environment’s locale isn’t set up properly.

I believe some people using the Amazon AMI had that problem.

I didn’t have the issue so I’m sorry I can’t share the right solution, but I know it’s been discussed.

init_27 · November 28, 2017, 9:33pm

@ramesh

Link to the Gist

Screenshot:

init_27 · November 28, 2017, 9:35pm

I tried to search for similar posts on the forum, I’ll search again.

I’ve tried updating the conda env,
git pull
Both after activating the fasai env and navigating to the fasai directory

rob · November 28, 2017, 9:42pm

see this comment and the three after it, as well as the links mentioned

edit: @guthl may know a solution