Lesson 4 In-Class Discussion

The difference I feel between one-hot encoding and embeddings is similar to the difference between Jaccard similarity and cosine similarity.
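
To make that analogy a bit more concrete, here is a minimal sketch. The embedding values are made up for illustration, not learnt from any data. With one-hot vectors, Jaccard can only say "same" or "different", while cosine on dense vectors gives a graded answer:

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two binary vectors: |intersection| / |union|."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def cosine(a, b):
    """Cosine similarity of two real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors over Mon..Sun: each day has a single 1 in its own slot.
sat_onehot = [0, 0, 0, 0, 0, 1, 0]
sun_onehot = [0, 0, 0, 0, 0, 0, 1]

# Hypothetical 2-d embeddings (assumed values, only for illustration).
sat_emb = [0.9, 0.1]
sun_emb = [0.8, 0.2]
mon_emb = [-0.7, 0.3]

print("Jaccard(Sat, Sun), one-hot:", jaccard(sat_onehot, sun_onehot))
print("Jaccard(Sat, Sat), one-hot:", jaccard(sat_onehot, sat_onehot))
print("cosine(Sat, Sun), embedding:", round(cosine(sat_emb, sun_emb), 3))
print("cosine(Sat, Mon), embedding:", round(cosine(sat_emb, mon_emb), 3))
```

Two different one-hot vectors never overlap, so their similarity is always zero no matter how related the categories are; the dense vectors have room to express "Sat is more like Sun than like Mon".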

Say for the categorical var “day of week”: all 7 values will have the same number of dims… maybe Sat and Sun will have similar floats in some of those dims after the embeddings get learnt… since they are both weekend days…
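
A rough sketch of what "all 7 values have the same number of dims" looks like, assuming an embedding size of 4 (an arbitrary choice) and random values standing in for what training would learn. It also shows that looking up a row is the same thing as multiplying a one-hot vector by the embedding table:

```python
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
EMB_DIM = 4  # assumed embedding size, chosen only for this example

random.seed(0)
# 7 x 4 table: one row of floats per day. Randomly initialised here;
# in a real model these values would be updated by gradient descent.
emb_table = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in DAYS]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_lookup(day):
    """Embedding as a plain row lookup."""
    return emb_table[DAYS.index(day)]

def embed_by_matmul(day):
    """Same result via one-hot vector times embedding table."""
    oh = one_hot(DAYS.index(day), len(DAYS))
    return [sum(oh[r] * emb_table[r][c] for r in range(len(DAYS)))
            for c in range(EMB_DIM)]

sat_lookup = embed_by_lookup("Sat")
sat_matmul = embed_by_matmul("Sat")
same = all(abs(a - b) < 1e-12 for a, b in zip(sat_lookup, sat_matmul))
print("dims per day:", len(sat_lookup))
print("lookup equals one-hot matmul:", same)
```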

Is there a clear reason why no lagging of (continuous) features was needed? Does something about the model do that already, or were lags borrowed from the Kaggle winner’s code?


If you think of the embedded vectors as living in a 3D space, vectors that are closer to one another are semantically similar. For example, cat may be similar to dog but far away from a group containing Sunday, Saturday, etc.
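
As a toy illustration of that picture, here are four hand-picked 3D vectors (assumed values, not real learnt embeddings) and a nearest-neighbour lookup by Euclidean distance:

```python
import math

# Hypothetical 3-d embeddings, chosen by hand so that the animals sit
# in one corner of the space and the weekend days in another.
embeddings = {
    "cat": (0.9, 0.8, 0.1),
    "dog": (0.8, 0.9, 0.2),
    "Saturday": (0.1, 0.2, 0.9),
    "Sunday": (0.2, 0.1, 0.8),
}

def nearest(word):
    """Return the other word whose vector is closest to `word`'s vector."""
    others = (w for w in embeddings if w != word)
    return min(others, key=lambda w: math.dist(embeddings[word], embeddings[w]))

for word in embeddings:
    print(word, "->", nearest(word))
```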


@gerardo, it’s df.describe(), actually :slight_smile:

print(df.describe(include=[object]))

This is what I use for categorical variable description

Yes, but my question is: people have already implemented it, so why did @Jeremy say it hasn’t been implemented yet if they are fundamentally the same?

A time series problem is just structured data when a time value makes up part of the unique identifier for each row (it helps if the time values are evenly spaced, e.g. yearly, hourly, etc).


Why isn’t dropout applicable to the continuous variables? Do they get fed directly into a linear layer?

Most of the time I only use the label of the column: if it says gender, for example, it’s clearly categorical. If you don’t have that info because the column names don’t have a meaning, I would rule out as categorical any column that has floating-point numbers, or values like -2, -1, -4, etc.
A priori you could treat any integer column as categorical, even if there are a lot of levels, but most of the time you have the meaning of the column, so I don’t understand how the problem could arise.
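
That rule of thumb could be sketched roughly like this. `looks_categorical` is a hypothetical helper written for this post, not a library function, and treating "floating numbers" as "has a fractional part" is my own reading of the rule:

```python
def looks_categorical(values):
    """Heuristic from the post above: strings and non-negative integers
    may be categorical; fractional or negative numbers suggest continuous."""
    for v in values:
        if v is None or isinstance(v, str):
            continue  # missing values and labels don't rule anything out
        if isinstance(v, float) and not v.is_integer():
            return False  # fractional value: treat the column as continuous
        if v < 0:
            return False  # negative value: treat the column as continuous
    return True

# Made-up columns, only to exercise the heuristic.
columns = {
    "gender": ["F", "M", "F"],
    "store_id": [1, 2, 3, 1],
    "temperature": [21.5, 19.0, 23.1],
    "delta": [-2, -1, -4],
}

guess = {name: looks_categorical(vals) for name, vals in columns.items()}
for name, is_cat in guess.items():
    print(name, "->", "categorical?" if is_cat else "continuous")
```

It is only a first guess for when the column names carry no meaning; a column like a year or a count passes the test but may still be better treated as continuous.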

Would the amount of data inform you as to whether to use RandomForest or a NN?

It seems like RF would be a better option if you don’t have a big dataset, whereas the NN approach works great for things like Rossmann where there are some 900k rows.

Because it’s a massive assumption that changing the data doesn’t fundamentally change the result; many of the results in structured problems are highly non-linear.

This matches my experience; I’ve been cursed with small datasets, where RFs and GBMs (and smart ensembles of both) tend to be tough to beat.


The add_datepart function is domain-specific feature engineering, right? It just expands one feature into many features.
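
To show what "expands one feature into many" could look like, here is a standalone sketch using only the standard library. It is not the library's add_datepart implementation; `expand_date` and the key names (including `Is_weekend`) are picked for this example:

```python
from datetime import date

def expand_date(d):
    """Turn a single date into several simpler features (illustrative only)."""
    _, iso_week, _ = d.isocalendar()
    return {
        "Year": d.year,
        "Month": d.month,
        "Week": iso_week,                     # ISO week number
        "Day": d.day,
        "Dayofweek": d.weekday(),             # Monday=0 ... Sunday=6
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_start": d.day == 1,
        "Is_weekend": d.weekday() >= 5,       # hypothetical extra flag
    }

features = expand_date(date(2015, 7, 31))
for name, value in features.items():
    print(name, "=", value)
```

Each of these can then be treated as a categorical variable with its own embedding, which lets the model pick up weekly or yearly patterns without being told about them.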

There are ensembles of NNs. I wonder if it would be useful to create effectively a NN random forest.

Especially given the order of items learned affects performance and augmentations are random… seems like you could effectively apply the same idea.

ResNet’s architecture loosely mimics gradient boosting: each block adds its output to the result passed along from the previous block, much like each boosting stage adds a correction to the running prediction.


I think he meant he hasn’t seen augmentation/bootstrapping used in exactly the same fashion as embeddings. But still, I think he will be better able to answer your question.

Is it possible the reason this works is because it has overfit to a single paper? I guess how do you know that hasn’t happened?

it also works because abstracts are highly formulaic :).


Language modeling: The next best paper on neural networks will be written by a neural network :wink: