Lesson 4 In-Class Discussion

The difference I feel between one-hot encoding and embedding is similar to the difference between Jaccard similarity and cosine similarity.

Say for the categorical var “day of week”: all 7 values will have the same number of dims… maybe Sat and Sun will have similar floats in some of those dims after the embeddings get learnt, since they are both weekend days…
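A minimal PyTorch sketch of that idea (the 4-dim width is an arbitrary choice, just for illustration):

```python
import torch
import torch.nn as nn

# 7 categories (days of week), each mapped to a dense 4-dim vector.
day_emb = nn.Embedding(num_embeddings=7, embedding_dim=4)

# Look up Saturday (index 5) and Sunday (index 6); after training,
# these two rows may end up with similar values because both are weekends.
sat = day_emb(torch.tensor(5))
sun = day_emb(torch.tensor(6))
```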

1 Like

Is there a clear reason why no lagging of (continuous) features was needed? Does something about the model do that already, or were lags borrowed from the Kaggle winner’s code?
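For context, by lagging I mean adding shifted copies of a series as extra columns, something like this (tiny made-up frame, column names only loosely follow the Rossmann layout):

```python
import pandas as pd

# Tiny made-up example; one row per (Store, Date).
df = pd.DataFrame({
    'Store': [1] * 6,
    'Date': pd.date_range('2015-01-01', periods=6),
    'Sales': [100, 120, 90, 150, 130, 110],
})

# A lag is just the same series shifted back in time, per store.
df['Sales_lag_1'] = df.groupby('Store')['Sales'].shift(1)
df['Sales_lag_2'] = df.groupby('Store')['Sales'].shift(2)
```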

2 Likes

If you think of the embedded vectors as being in a 3D space, vectors that are closer to one another are semantically similar. For example, cat may be close to dog but far away from a cluster containing Sunday and Saturday, etc.
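A quick illustration of “closer” via cosine similarity, with made-up 3-d vectors:

```python
import numpy as np

def cosine_sim(a, b):
    # 1.0 = same direction, ~0 = unrelated, -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-d embedding vectors, purely for illustration.
cat    = np.array([0.9, 0.1, 0.2])
dog    = np.array([0.8, 0.2, 0.1])
sunday = np.array([-0.1, 0.9, 0.8])

cosine_sim(cat, dog)     # high -> semantically close
cosine_sim(cat, sunday)  # low  -> far apart
```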

2 Likes

@gerardo, it’s df.describe(), actually :slight_smile:

1 Like

print(df.describe(include=['object']))

This is what I use for categorical variable description

1 Like

Yes, but my question is that people have already implemented it. Why did @Jeremy say it hasn’t been implemented yet if they are fundamentally the same?

A time series problem is just structured data when a time value makes up part of the unique identifier for each row (it helps if the time values are evenly spaced, e.g. yearly, hourly, etc.).

3 Likes

Why isn’t dropout applicable to the continuous variables? Do they get fed directly into a linear layer?

Most of the time I only use the label of the column: if it says gender, for example, it’s clearly categorical. If you don’t have that info because the column names don’t have a meaning, I would rule out as categorical any column that has floating-point numbers, or things like -2, -1, -4, etc.
A priori you could treat any integer column as categorical, even if there are a lot of levels, but most of the time you have the meaning of the column, so I don’t see how the problem could arise.

Would the amount of data inform you as to whether to use a random forest or a NN?

It seems like RF would be a better option if you don’t have a big dataset, whereas the NN approach works great for things like Rossmann, where there are some 900k rows.

1 Like

Because it’s a massive assumption that changing the data doesn’t fundamentally change the result; many of the relationships in structured problems are highly non-linear.

This matches my experience; I’ve been cursed with small datasets, where RFs and GBMs (and smart ensembles of both) tend to be tough to beat.

2 Likes

The add_datepart function is domain-specific feature engineering, right? It just expands one feature into many features.
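A rough pandas sketch of the kind of expansion it does (not the actual fastai implementation; the real function adds more fields and drops the original column):

```python
import pandas as pd

# Toy frame; in Rossmann this would be the Date column of the training set.
df = pd.DataFrame({'Date': pd.date_range('2015-07-29', periods=3)})

df['Year']         = df['Date'].dt.year
df['Month']        = df['Date'].dt.month
df['Day']          = df['Date'].dt.day
df['Dayofweek']    = df['Date'].dt.dayofweek
df['Is_month_end'] = df['Date'].dt.is_month_end
df['Elapsed']      = df['Date'].astype('int64') // 10**9  # seconds since epoch
```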

There are ensembles of NNs. I wonder if it would be useful to create, effectively, an NN random forest.

Especially given that the order of items learned affects performance and augmentations are random… it seems like you could apply the same idea.
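A rough sketch of one way to do that: bag plain MLPs over bootstrap samples, i.e. the random-forest recipe with nets instead of trees (everything here is illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def bagged_nets(X, y, n_nets=10, seed=0):
    """Train several nets on bootstrap resamples and keep them all."""
    rng = np.random.default_rng(seed)
    nets = []
    for i in range(n_nets):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                           random_state=i)
        nets.append(net.fit(X[idx], y[idx]))
    return nets

def bagged_predict(nets, X):
    """Average the members' predictions, like a forest does."""
    return np.mean([net.predict(X) for net in nets], axis=0)
```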

1 Like

ResNet’s architecture effectively mimics gradient boosting, because it passes the results from one block onto the next.
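A minimal sketch of that additive structure (illustrative, not the actual ResNet block, which uses convolutions and batch norm):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """y = x + f(x): each block learns a correction on top of what the
    previous blocks produced, which is where the boosting analogy comes from
    (boosting also fits each new learner to the current residual)."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)
```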

2 Likes

I think he meant he hasn’t seen augmentation/bootstrapping used in exactly the same fashion as embeddings. But still, I think he will be better able to answer your question.

Is it possible the reason this works is that it has overfit to a single paper? I guess: how do you know that hasn’t happened?

It also works because abstracts are highly formulaic :).

5 Likes

Language modeling: The next best paper on neural networks will be written by a neural network :wink:

10 Likes