Lesson 4 In-Class Discussion


(Vikrant Behal) #246

I guess both are embeddings. Instead of using Word2Vec embeddings, we create our own from scratch.
Per @yinterian's question, could we use both, e.g. by merging them somehow?


(Zao Yang) #247

When you use multiple languages, does this mean you have to use multiple models and select between them? Is there a way to use one big model?


(Vikrant Behal) #248

Interesting. As Jeremy said, the model learnt English. Maybe, when we have multiple languages, it'll learn to distinguish between them? Thought: start with one model, feed it data in multiple languages, and use that as a pre-trained model?


(anamariapopescug) #249

Word2vec learns an embedding matrix as well. There are multiple ways of learning embeddings.


(Even Oldridge) #250

@Jeremy, If you don’t mind going into it, how does this method of creating the language embeddings differ from those that would be created by running GloVe or word2Vec on the custom dataset?

Is this method superior because it's leveraging some of the new techniques you've covered, like gradient descent with restarts and (I'm assuming) bi-LSTMs with dropout, or is there something else going on under the hood?


(Rikiya Yamashita) #251

@jeremy Is there any rule of thumb for choosing "categorical" or "continuous" when it comes to discrete variables? What are the pros and cons of each method? For example, we could treat "year" as either categorical or continuous. Thanks in advance.


(Jeremy Howard) #252

Yes, set to zero. So no gradients are propagated back from that unit for that mini-batch.
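A minimal sketch of what Jeremy describes, using plain Python rather than the actual fastai/PyTorch implementation (this is inverted dropout, the variant modern libraries use; the scaling detail is an assumption beyond his answer):

```python
import random

random.seed(0)

def dropout(activations, p=0.5):
    """Inverted dropout sketch: zero each unit with probability p,
    scale the survivors by 1/(1-p) so the expected sum is unchanged."""
    out = []
    for a in activations:
        if random.random() < p:
            out.append(0.0)          # dropped unit: contributes nothing,
                                     # so no gradient flows back through it
        else:
            out.append(a / (1 - p))  # kept unit, rescaled
    return out

dropped = dropout([1.0] * 100, p=0.5)
```

A dropped unit's activation is exactly zero, so its contribution to the loss (and hence its gradient) for that mini-batch is zero too.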


(Jeremy Howard) #253

Not with the fastai lib, unless you make a copy of it.


(Jeremy Howard) #254

Since they are binary, it doesn't really matter where they go.


(Jeremy Howard) #255

n*1 concat 4*1 --> (n+4)*1

Make sense?
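The shape arithmetic above can be sketched with plain Python lists (the sizes here are illustrative):

```python
n = 8
emb_a = [0.1] * n         # an n-dimensional vector (n*1)
emb_b = [0.2] * 4         # a 4-dimensional vector (4*1)
combined = emb_a + emb_b  # concatenation --> (n+4)*1
```

Concatenating an n-element vector with a 4-element vector simply yields an (n+4)-element vector.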


(Jeremy Howard) #256

With different types of datasets you can need very different amounts of dropout. The normal amounts for image CNNs are very different from those for structured data, for instance.


(Jeremy Howard) #257

Not really - you have to try a few amounts. But somewhere around this number should be OK. If you have less data, you'll need fewer activations, or you'll overfit.


(Jeremy Howard) #258

Yes it works great. Remind me to show you next week if I forget!


(Jeremy Howard) #259

The point is to learn the embedding of ‘Sunday’. We don’t know up front what an appropriate encoding is.
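A hypothetical sketch of the idea: an embedding is just a lookup table with one learnable vector per category. Before training, the row for 'Sunday' is random noise; training nudges it toward a useful representation, so we never have to decide the encoding up front. (The dimensions and initialization here are illustrative, not fastai's.)

```python
import random

random.seed(0)
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
emb_dim = 4

# One small random vector per day. In a real model these values are
# parameters that gradient descent updates during training.
embedding = {d: [random.gauss(0, 0.01) for _ in range(emb_dim)] for d in days}

sunday_vec = embedding['Sun']  # this is the vector that gets learned
```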


(Jeremy Howard) #260

Conceptually, yes, you can use the same approach, but it’s not in the library yet (or any other software). Would make for a good project!


(Jeremy Howard) #261

Great point. I’d suggest adding columns for locality, area, city, county, and state, and then make them all (including zip code) separate categorical variables!
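A hypothetical sketch of that suggestion: derive the coarser geographic columns from the zip code via a lookup table, then keep each one (including the zip itself) as its own categorical variable. The lookup data and helper below are made up for illustration.

```python
# Made-up zip-to-geography lookup (in practice this would come from a
# real geographic dataset).
zip_info = {
    '94103': {'city': 'San Francisco', 'county': 'San Francisco', 'state': 'CA'},
    '10001': {'city': 'New York', 'county': 'New York', 'state': 'NY'},
}

def expand_zip(row):
    """Add city/county/state columns derived from the zip code."""
    return {**row, **zip_info.get(row['zip'], {})}

rows = [{'zip': '94103', 'sales': 120}, {'zip': '10001', 'sales': 95}]
expanded = [expand_zip(r) for r in rows]
# 'zip', 'city', 'county', and 'state' are now separate categorical columns,
# each of which would get its own embedding.
```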


(Jeremy Howard) #262

Yeah, I guess SMOTE is, strictly speaking, a data augmentation technique for structured data. But I haven't seen it used with these types of models, and I don't know if it's helpful. It seems tricky to implement with embeddings - not quite sure how it would work…


(Jeremy Howard) #263

It’s just what empirically has been found to work by NLP researchers. >600 hasn’t been found helpful so far. It seems the implicit dimensionality of words in English just isn’t higher than that.


(Phani Srikanth) #264

bump! @jeremy


(rachana) #265

I am using a Random Forest model for a project where we are analyzing URLs and content. But the features look like this:
http://kase.twemail.com.de/contact/dmB4L3686825_frIV688.iTBV516_92qwH.html 0.0,0.0,2570.0,783.0,0.0,0.0,0.0,1.0,0.0,0.0,20.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,1.0,3.0,0.0,1.0,103.0,1.0,2

Here the 2 at the end is the label and the rest are features. Valid values for most of the features are -1, 0, and 1, so the features seem to already be categorical with three categories. I'm not sure how to add more dimensions using embeddings to leverage deep learning.
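One possible way to approach this (a sketch, not an endorsed recipe): map each ternary value in {-1, 0, 1} to an index, give every feature its own small embedding table, and concatenate the looked-up vectors. The table sizes and dimensions below are illustrative.

```python
import random

random.seed(0)
n_features, emb_dim = 3, 2
value_to_index = {-1: 0, 0: 1, 1: 2}

# One 3-row embedding table per feature; in a real model these entries
# would be learnable parameters.
tables = [
    [[random.gauss(0, 0.01) for _ in range(emb_dim)] for _ in range(3)]
    for _ in range(n_features)
]

def embed(row):
    """Look up each feature's embedding and concatenate the vectors."""
    vecs = [tables[i][value_to_index[v]] for i, v in enumerate(row)]
    return [x for vec in vecs for x in vec]

vec = embed([-1, 0, 1])  # a (n_features * emb_dim)-dimensional input
```

Each -1/0/1 feature thus expands from one column into `emb_dim` learned columns, which is where the extra dimensions come from.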