Lesson 4 In-Class Discussion

(Rajat) #163

The difference I feel between one-hot encoding and embedding is similar to the difference between Jaccard similarity and cosine similarity.
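A toy illustration of that analogy (the vectors below are made up, not learned values): Jaccard similarity on one-hot-style indicator sets only counts exact overlaps, so distinct categories always look unrelated, while cosine similarity on dense embedding vectors can express graded closeness.

```python
import numpy as np

# Jaccard similarity on sets of active indicators (like one-hot encodings):
# distinct categories share nothing, so similarity is always 0.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({"saturday"}, {"sunday"}))  # 0.0

# Cosine similarity on dense embedding vectors: graded similarity.
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up "learned" embeddings for illustration only.
sat = np.array([0.9, 0.8, 0.1])
sun = np.array([0.8, 0.9, 0.2])
mon = np.array([-0.7, 0.1, 0.9])
print(cosine(sat, sun))  # high: the two weekend days point the same way
print(cosine(sat, mon))  # much lower
```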

(Arvind Nagaraj) #164

Say for the categorical var “day of week”: all 7 values will have the same number of dims… maybe Sat and Sun will have similar floats in some of those dims after the embeddings get learned… since they are both weekend days…
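A minimal sketch of that idea (random numbers stand in for learned weights): an embedding is just a lookup table with one same-length row per category, so every day of the week gets the same number of dims.

```python
import numpy as np

days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
n_dims = 4  # every day gets the same number of dims

# An embedding is a lookup table: 7 rows x n_dims columns.
# In a real model these values are learned; here they are random.
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(days), n_dims))

def embed(day):
    return emb[days.index(day)]

print(embed("sat").shape)  # (4,) -- same shape for every day
```

After training, the rows for "sat" and "sun" might end up with similar floats in some dims, reflecting that both are weekend days.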

(Clayton Yochum) #165

Is there a clear reason why no lagging of (continuous) features was needed? Does something about the model do that already, or were lags borrowed from the Kaggle winner’s code?

(Ankit Goila) #166

If you think of the embedded vectors as living in a 3D space, vectors that are closer to one another are semantically similar. For example, “cat” may be close to “dog” but far away from a cluster containing “Sunday” and “Saturday”, etc.

(Maureen Metzger) #167

@gerardo, it’s df.describe(), actually :slight_smile:


print(df.describe(include=['object']))

This is what I use for categorical variable description
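On recent NumPy versions `np.object` has been removed, so a version-safe way to do the same thing is `include=['object']`. A toy DataFrame (made up here for illustration) shows what the categorical summary gives you:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["m", "f", "f", "m", "f"],        # categorical
    "income": [40.0, 52.5, 61.2, 38.9, 70.1],   # continuous, excluded below
})

# Describe only the object (string) columns: count, unique, top, freq.
desc = df.describe(include=["object"])
print(desc)
print(desc.loc["unique", "gender"])  # 2
```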

(Rajat) #169

Yes, but my question is that people have already implemented it. Why did @Jeremy say it hasn’t been implemented yet if they are fundamentally the same?

(Pete Condon) #170

A time series problem is just structured data when a time value makes up part of the unique identifier for each row (it helps if the time values are evenly spaced, e.g. yearly, hourly, etc).

(Diyang Tang) #171

Why isn’t dropout applicable to the continuous variables? Do they get fed directly into a linear layer?

(Ezequiel) #172

Most of the time I only use the label of the column: if it says “gender”, for example, it’s clearly categorical. If you don’t have that info because the column names don’t have a meaning, I would rule out as categorical any column that has floating-point numbers, or things like -2, -1, -4, etc.
A priori you could treat any integer column as categorical, even if there are a lot of levels, but most of the time you have the meaning of the column, so I don’t see how the problem could arise.
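A rough sketch of that heuristic (the `max_levels` threshold is an arbitrary choice for illustration, not a rule): strings are categorical, integers only when they have few distinct values, floats never.

```python
import pandas as pd

def guess_categorical(df, max_levels=20):
    """Guess categorical columns: strings always; integers only when
    they have few distinct levels; floats are treated as continuous."""
    cats = []
    for col in df.columns:
        s = df[col]
        if s.dtype == object:
            cats.append(col)
        elif pd.api.types.is_integer_dtype(s) and s.nunique() <= max_levels:
            cats.append(col)
    return cats

df = pd.DataFrame({
    "gender": ["m", "f", "m"],
    "day": [1, 2, 7],
    "price": [9.99, 12.5, 3.0],
})
print(guess_categorical(df))  # ['gender', 'day']
```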

(WG) #173

Would the amount of data inform you as to whether to use RandomForest or a NN?

It seems like RF would be a better option if you don’t have a big dataset, whereas the NN approach works great for things like Rossmann where there are some 900k rows.

(Pete Condon) #174

Because it’s a massive assumption that changing the data doesn’t fundamentally change the result; many of the relationships in structured problems are highly non-linear.

(Clayton Yochum) #175

This matches my experience; I’ve been cursed with small datasets, where RFs and GBMs (and smart ensembles of both) tend to be tough to beat.

(Zao Yang) #176

The add_datepart function is domain-specific feature engineering, right? It just expands one feature into many features.
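Roughly what that expansion looks like (a minimal pandas sketch of the idea, not the actual fastai implementation, which adds more columns):

```python
import pandas as pd

def simple_datepart(df, col):
    """Expand one datetime column into several calendar features."""
    d = pd.to_datetime(df[col])
    df[col + "_year"] = d.dt.year
    df[col + "_month"] = d.dt.month
    df[col + "_dayofweek"] = d.dt.dayofweek
    df[col + "_is_month_end"] = d.dt.is_month_end
    return df

df = pd.DataFrame({"date": ["2017-11-30", "2017-12-01"]})
df = simple_datepart(df, "date")
print(list(df.columns))
```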

(Jordan) #177

There are ensembles of NNs. I wonder if it would be useful to create effectively a NN random forest.

Especially given that the order of items learned affects performance and augmentations are random… it seems like you could effectively apply the same idea.

(Pete Condon) #178

ResNet’s architecture effectively mimics gradient boosting, because it passes the results from one block onto the next.
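The analogy in miniature (a toy numpy sketch with random weights): each residual block adds a correction to its input, much as each boosting stage adds a correction to the current prediction.

```python
import numpy as np

def block(x, w):
    # A toy "residual block": output = input + learned correction,
    # analogous to boosting's prediction += new_stage(x).
    return x + np.tanh(x @ w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))
for _ in range(3):                      # stack blocks like boosting stages
    w = rng.normal(size=(3, 3)) * 0.1   # random stand-in for learned weights
    x = block(x, w)
print(x.shape)  # (2, 3): shape is preserved through every block
```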

(Pranjal Yadav) #179

I think he meant he hasn’t seen augmentation/bootstrapping used in exactly the same fashion as embeddings. But still, I think he will be better able to answer your question.

(Kevin Bird) #180

Is it possible the reason this works is because it has overfit to a single paper? I guess how do you know that hasn’t happened?

(anamariapopescug) #181

It also works because abstracts are highly formulaic :).

(Anand Saha) #182

Language modeling: The next best paper on neural networks will be written by a neural network :wink: