The difference I feel between one-hot encoding and embedding is similar to the difference between Jaccard similarity and cosine similarity.
Say, for the categorical variable "day of week": all 7 values will have the same number of dims… maybe Sat and Sun will have similar floats in some of those dims after the embeddings are learnt, since they are both weekend days…
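A minimal sketch of that idea (the dimensions and day coding here are hypothetical, not from the lesson): each of the 7 day-of-week levels gets the same number of learned floats, and after training, similarity between two days could be checked with cosine similarity.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a learnable embedding for "day of week" (7 levels).
# Every level maps to the same number of dims (here 4); the values are learned.
emb = nn.Embedding(num_embeddings=7, embedding_dim=4)

days = torch.tensor([5, 6])   # e.g. Sat=5, Sun=6 (arbitrary integer coding)
vecs = emb(days)              # shape: (2, 4) -- one 4-dim vector per day

# After training, whether Sat and Sun ended up close could be measured like:
sim = torch.cosine_similarity(vecs[0], vecs[1], dim=0)
print(vecs.shape, sim)
```

With one-hot encoding, by contrast, Sat and Sun would be orthogonal by construction and could never become "close".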
Is there a clear reason why no lagging of (continuous) features was needed? Does something about the model do that already, or were lags borrowed from the Kaggle winner's code?
If you think of the embedded vectors as being in a 3D space, vectors that are closer to one another would mean that they are semantically similar. For example, "cat" may be similar to "dog", but far away from a group containing "Sunday" and "Saturday", etc.
print(df.describe(include=["object"]))
This is what I use for categorical variable description
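For anyone wanting to try it, here is a minimal self-contained example (the data is made up) showing what `describe(include=["object"])` reports for a categorical column:

```python
import pandas as pd

# Hypothetical toy data: one categorical column, one numeric column.
df = pd.DataFrame({
    "day": ["Sat", "Sun", "Sat", "Mon"],
    "sales": [100, 120, 90, 80],
})

# Only the object (string/categorical) columns are summarised:
# rows are count, unique, top (most frequent value), and freq.
desc = df.describe(include=["object"])
print(desc)
```

Numeric columns like `sales` are excluded from this summary; plain `df.describe()` would show those instead.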
Yes, but my question is that people have already implemented it. Why did @Jeremy say it hasn't been implemented yet, if they are fundamentally the same?
A time series problem is just structured data where a time value makes up part of the unique identifier for each row (it helps if the time values are evenly spaced, e.g. yearly, hourly, etc.).
Why isn't dropout applicable to the continuous variables? Do they get fed directly into a linear layer?
Most of the time I only use the label of the column: if it says "gender", for example, it's clearly categorical. If you don't have that info because the column names carry no meaning, I would rule out as categorical any column that has floating-point numbers, or values like -2, -1, -4, etc.
A priori you could treat any integer column as categorical, even if there are a lot of levels, but most of the time you know the meaning of the column, so I don't see how the problem could arise.
Would the amount of data inform you as to whether to use RandomForest or a NN?
It seems like an RF would be a better option if you don't have a big dataset, whereas the NN approach works great for things like Rossmann, where there are some 900k rows.
Because it's a massive assumption that changing the data doesn't fundamentally change the result; many of the results in structured problems are highly non-linear.
This matches my experience; I've been cursed with small datasets, where RFs and GBMs (and smart ensembles of both) tend to be tough to beat.
The add_datepart function is domain-specific feature engineering, right? It just expands one feature into many features.
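Right. A minimal sketch in the spirit of fastai's `add_datepart` (this is a simplified stand-in, not the library's actual implementation, which extracts many more parts):

```python
import pandas as pd

def expand_date(df, col):
    """Expand one datetime column into several engineered features,
    roughly in the spirit of fastai's add_datepart (simplified)."""
    d = df[col].dt
    df[f"{col}_year"] = d.year
    df[f"{col}_month"] = d.month
    df[f"{col}_day"] = d.day
    df[f"{col}_dayofweek"] = d.dayofweek          # Monday=0 .. Sunday=6
    df[f"{col}_is_weekend"] = d.dayofweek >= 5
    return df.drop(columns=[col])                 # original column replaced

df = pd.DataFrame({"date": pd.to_datetime(["2015-07-31", "2015-08-01"])})
out = expand_date(df, "date")
print(list(out.columns))
```

The point is that one raw column becomes several columns the model can actually use, some of which (like day of week) then make good candidates for embeddings.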
There are ensembles of NNs. I wonder if it would be useful to create effectively a NN random forest.
Especially given the order of items learned affects performance and augmentations are random… it seems like you could effectively apply the same idea.
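A hypothetical sketch of that "NN random forest" idea (names and hyperparameters are illustrative only): train several small nets, each on a bootstrap resample of the data, then average their predictions, just as a random forest averages its trees.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Made-up regression data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2 + np.sin(X[:, 1])

nets = []
for seed in range(5):
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample (with replacement)
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=seed)
    net.fit(X[idx], y[idx])
    nets.append(net)

# Ensemble prediction = mean over the bagged networks.
pred = np.mean([n.predict(X) for n in nets], axis=0)
print(pred.shape)
```

Random initialisation and data order already give each net some diversity; bootstrapping adds the bagging component on top.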
ResNet's architecture effectively mimics gradient boosting, because it passes the results from one block onto the next.
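The analogy can be seen in the residual connection itself (a sketch, not taken from any code in this thread): each block computes `x + f(x)`, so `f` learns a correction to the running representation, much as each boosting stage learns a correction to the ensemble's running prediction.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = identity path + learned correction."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.f(x)   # analogous to a boosting stage adding a correction

x = torch.randn(8, 16)
out = ResidualBlock(16)(x)
print(out.shape)
```

The analogy is loose, though: boosting stages are fit sequentially to residual errors, while ResNet blocks are trained jointly end-to-end.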
I think he meant he hasn't seen augmentation/bootstrapping used in exactly the same fashion as embeddings. But still, I think he'll be better able to answer your question himself.
Is it possible the reason this works is that it has overfit to a single paper? How do you know that hasn't happened?
It also works because abstracts are highly formulaic :).
Language modeling: The next best paper on neural networks will be written by a neural network