Lesson 4 In-Class Discussion


(Vikrant Behal) #246

I guess both are embeddings. Instead of using Word2Vec embeddings, we create our own from scratch.
Per @yinterian's question, could we use both, e.g. by merging them somehow?


(Zao Yang) #247

When you use multiple languages, does this mean you have to use multiple models and select between them? Is there a way to use one big model?


(Vikrant Behal) #248

Interesting. As Jeremy said, the model learnt English. Maybe, when we have multiple languages, it'll learn to distinguish between them? Thought: start with one model, feed it data in multiple languages, and use that as a pre-trained model?


(anamariapopescug) #249

Word2vec learns an embedding matrix as well. There are multiple ways of learning embeddings.


(Even Oldridge) #250

@Jeremy, If you don’t mind going into it, how does this method of creating the language embeddings differ from those that would be created by running GloVe or word2Vec on the custom dataset?

Is this method superior because it's leveraging some of the new techniques you've covered, like gradient descent with restarts and (I'm assuming) bi-LSTMs with dropout, or is there something else going on under the hood?


(Rikiya Yamashita) #251

@jeremy Is there any rule of thumb for choosing "categorical" or "continuous" when it comes to discrete variables? What are the pros and cons of each method? For example, we could treat "year" as either categorical or continuous. Thanks in advance.


(Jeremy Howard) #252

Yes, set to zero. So no gradients are propagated back from that unit for that mini-batch.
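A minimal sketch of what Jeremy describes, using plain Python rather than the actual fastai/PyTorch implementation (this is inverted dropout, the variant modern libraries use; the scaling detail is an assumption beyond his answer):

```python
import random

random.seed(0)

def dropout(activations, p=0.5):
    """Inverted dropout sketch: zero each unit with probability p,
    scale the survivors by 1/(1-p) so the expected sum is unchanged."""
    out = []
    for a in activations:
        if random.random() < p:
            out.append(0.0)          # dropped unit: contributes nothing,
                                     # so no gradient flows back through it
        else:
            out.append(a / (1 - p))  # kept unit, rescaled
    return out

dropped = dropout([1.0] * 100, p=0.5)
```

A dropped unit's activation is exactly zero, so its contribution to the loss (and hence its gradient) for that mini-batch is zero too.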


(Jeremy Howard) #253

Not with the fastai lib, unless you make a copy of it.


(Jeremy Howard) #254

Since they are binary, it doesn't really matter where they go.


(Jeremy Howard) #255

n*1 concat 4*1 --> (n+4)*1

Make sense?
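The shape arithmetic above can be sketched with plain Python lists (the sizes here are illustrative):

```python
n = 8
emb_a = [0.1] * n         # an n-dimensional vector (n*1)
emb_b = [0.2] * 4         # a 4-dimensional vector (4*1)
combined = emb_a + emb_b  # concatenation --> (n+4)*1
```

Concatenating an n-element vector with a 4-element vector simply yields an (n+4)-element vector.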


(Jeremy Howard) #256

With different types of datasets you can need very different amounts of dropout. The normal amounts for image CNNs are very different from those for structured data, for instance.


(Jeremy Howard) #257

Not really - you have to try a few amounts. But somewhere around this number should be OK. If you have less data, you'll need fewer activations, or you'll overfit.


(Jeremy Howard) #258

Yes it works great. Remind me to show you next week if I forget!


(Jeremy Howard) #259

The point is to learn the embedding of ‘Sunday’. We don’t know up front what an appropriate encoding is.
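A hypothetical sketch of the idea: an embedding is just a lookup table with one learnable vector per category. Before training, the row for 'Sunday' is random noise; training nudges it toward a useful representation, so we never have to decide the encoding up front. (The dimensions and initialization here are illustrative, not fastai's.)

```python
import random

random.seed(0)
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
emb_dim = 4

# One small random vector per day. In a real model these values are
# parameters that gradient descent updates during training.
embedding = {d: [random.gauss(0, 0.01) for _ in range(emb_dim)] for d in days}

sunday_vec = embedding['Sun']  # this is the vector that gets learned
```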


(Jeremy Howard) #260

Conceptually, yes, you can use the same approach, but it’s not in the library yet (or any other software). Would make for a good project!


(Jeremy Howard) #261

Great point. I’d suggest adding columns for locality, area, city, county, and state, and then make them all (including zip code) separate categorical variables!
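A hypothetical sketch of that suggestion: derive the coarser geographic columns from the zip code via a lookup table, then keep each one (including the zip itself) as its own categorical variable. The lookup data and helper below are made up for illustration.

```python
# Made-up zip-to-geography lookup (in practice this would come from a
# real geographic dataset).
zip_info = {
    '94103': {'city': 'San Francisco', 'county': 'San Francisco', 'state': 'CA'},
    '10001': {'city': 'New York', 'county': 'New York', 'state': 'NY'},
}

def expand_zip(row):
    """Add city/county/state columns derived from the zip code."""
    return {**row, **zip_info.get(row['zip'], {})}

rows = [{'zip': '94103', 'sales': 120}, {'zip': '10001', 'sales': 95}]
expanded = [expand_zip(r) for r in rows]
# 'zip', 'city', 'county', and 'state' are now separate categorical columns,
# each of which would get its own embedding.
```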


(Jeremy Howard) #262

Yeah, I guess SMOTE is, strictly speaking, a data augmentation technique for structured data. But I haven't seen it used with these types of models, and I don't know if it's helpful. It seems tricky to implement with embeddings - not quite sure how it would work…


(Jeremy Howard) #263

It’s just what empirically has been found to work by NLP researchers. >600 hasn’t been found helpful so far. It seems the implicit dimensionality of words in English just isn’t higher than that.


(Phani Srikanth) #264

bump! @jeremy


(rachana) #265

I am using a Random Forest model for a project where we are analyzing URLs and content. But the features look like this:
http://kase.twemail.com.de/contact/dmB4L3686825_frIV688.iTBV516_92qwH.html 0.0,0.0,2570.0,783.0,0.0,0.0,0.0,1.0,0.0,0.0,20.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,1.0,3.0,0.0,1.0,103.0,1.0,2

Here the 2 at the end is the label and the rest are features. Valid values for most of the features are -1, 0, and 1, so the features seem to already be categorical with three categories. I'm not sure how to add more dimensions using embeddings to leverage deep learning.
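One possible way to approach this (a sketch, not an endorsed recipe): map each ternary value in {-1, 0, 1} to an index, give every feature its own small embedding table, and concatenate the looked-up vectors. The table sizes and dimensions below are illustrative.

```python
import random

random.seed(0)
n_features, emb_dim = 3, 2
value_to_index = {-1: 0, 0: 1, 1: 2}

# One 3-row embedding table per feature; in a real model these entries
# would be learnable parameters.
tables = [
    [[random.gauss(0, 0.01) for _ in range(emb_dim)] for _ in range(3)]
    for _ in range(n_features)
]

def embed(row):
    """Look up each feature's embedding and concatenate the vectors."""
    vecs = [tables[i][value_to_index[v]] for i, v in enumerate(row)]
    return [x for vec in vecs for x in vec]

vec = embed([-1, 0, 1])  # a (n_features * emb_dim)-dimensional input
```

Each -1/0/1 feature thus expands from one column into `emb_dim` learned columns, which is where the extra dimensions come from.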