Wiki: Lesson 4

I think basically everything between

df = train[columns]
df = test[columns]


joined = join_df(joined, df, ['Store', 'Date'])
joined_test = join_df(joined_test, df, ['Store', 'Date'])

are supposed to run twice: once for the training dataset and once for the test dataset. Hence what I did was:

  1. df = train[columns], …, joined = join_df(joined, df, ['Store', 'Date'])
  2. df = test[columns], …, joined_test = join_df(joined_test, df, ['Store', 'Date'])

Remember to make sure df has 844338 rows for train and 41088 rows for test before you join it with joined and joined_test respectively.
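A toy sketch of that two-pass pattern, with a `join_df` helper shaped like the notebook's (the toy frames here have 2 and 1 rows; the real ones have 844338 and 41088):

```python
import pandas as pd

def join_df(left, right, on, suffix='_y'):
    # same shape as the notebook's helper: a left join that keeps every
    # row of `left` and tags overlapping column names coming from `right`
    return left.merge(right, how='left', on=on, suffixes=('', suffix))

# toy stand-ins for the real frames
train = pd.DataFrame({'Store': [1, 2], 'Date': ['2015-01-01'] * 2, 'Sales': [100, 200]})
test = pd.DataFrame({'Store': [1], 'Date': ['2015-01-02'], 'Id': [7]})
joined, joined_test = train.copy(), test.copy()
columns = ['Store', 'Date']

# pass 1: training set
df = train[columns]
assert len(df) == len(train)                      # 844338 on the real data
joined = join_df(joined, df, ['Store', 'Date'])

# pass 2: test set
df = test[columns]
assert len(df) == len(test)                       # 41088 on the real data
joined_test = join_df(joined_test, df, ['Store', 'Date'])
```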


Hey everyone! Apologies in advance if this question was asked earlier or generally discussed, feel free to link me to the discussion if it was!

These might be a little more in the domain of the machine learning course…but I was hoping someone could shed a little light on the following re: setting up the features for Rossmann:

  1. Why are variables like ‘AfterStateHoliday’, ‘BeforeStateHoliday’, ‘Promo’, ‘SchoolHoliday’ in the continuous variable list? Wouldn’t they be more suited for the categorical list? I guess the after and before state holidays are a little more continuous in nature…but maybe they could be similarly maxed out like the months since competition open (max = 24), which is a categorical variable.

  2. We devised some transformations on existing features, such as before and after holidays, and before and after promos. Does retaining the original features (holiday, promo) enhance the resulting model and if so, why? I would have thought these newer engineered features contain even more information than the originals, and consequently we could drop the original holiday and promo columns?



Hi everyone,

I was training my sentiment model on top of a pre-trained model whose accuracy wasn't that high (4.2508664). After the block

m3.freeze_to(-1)
m3.fit(lrs, 1, metrics=[accuracy])  # train the final layer
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

I see

epoch      trn_loss   val_loss   accuracy                   
    0      1.092117   1.025311   0.485915
epoch      trn_loss   val_loss   accuracy                    
    0      0.494757   0.393013   0.913172

It seemed to be going well, but after two restart cycles the accuracy dropped again. I guess it jumped out of a narrow sweet spot.

epoch      trn_loss   val_loss   accuracy                    
    0      0.465001   0.3577     0.918454  ok 
    1      0.427471   0.326164   0.921135  ok                 
    2      0.435863   0.341614   0.918734  ok               
    3      0.421462   0.329268   0.921855  ok              
    4      0.648535   0.504928   0.881362                    
    5      0.65179    0.53642    0.887404                    
    6      0.846718   0.830428   0.666973                    
    7      0.901057   0.944498   0.557698                    
    8      1.019339   0.985083   0.568662                    
    9      1.033611   1.002059   0.517165                   
    10     1.006995   1.367595   0.108635                    
    11     1.008204   1.259699   0.178577                    
    12     1.003845   1.113971   0.497519                    
    13     0.999826   0.856634   0.660131    

How should I obtain a good model in this case then? Should I stop restarting after two cycles? Thanks in advance.
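In case it helps others, one thing I'm considering (not from the lesson, just a common trick) is to checkpoint after every cycle and fall back to the best one, so a later divergent restart can't cost me the good model. A toy sketch, where `train_one_cycle` stands in for `m3.fit(...)` plus evaluation, and the pretend accuracies mimic my log above:

```python
import copy

def train_one_cycle(model, cycle):
    # stand-in for one restart cycle: m3.fit(...) followed by computing
    # validation accuracy; the numbers mimic the log (good early, diverging later)
    model['cycles_run'] = cycle + 1
    val_accs = [0.918, 0.921, 0.919, 0.922, 0.881, 0.557]
    return val_accs[cycle]

model = {'cycles_run': 0}
best_acc, best_model = float('-inf'), None

for cycle in range(6):
    acc = train_one_cycle(model, cycle)
    if acc > best_acc:
        # snapshot the weights whenever validation accuracy improves
        best_acc, best_model = acc, copy.deepcopy(model)

model = best_model   # keep the best checkpoint, not the last one
```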

Hello everyone,

I have a question concerning some of the created features of the dataset, more specifically AfterSchoolHoliday, BeforeSchoolHoliday, AfterStateHoliday, and BeforeStateHoliday. I know that this is more on the ML side than the DL side, but I still feel that this thread is the right place to ask it.
I have the impression that some of the values computed in these columns do not make sense and are just a kind of numerical upper bound. Since we are dealing with durations in days, the values should not exceed a couple of thousand, yet this value appears everywhere (see the output of cell 68 on the GitHub page of the course):

I may have missed a part where we truncate these variables (maybe it is done automatically somewhere), but if we only standardize them, then all the meaningful values will be sent to 0 and the variables will lose their meaning.
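For what it's worth, if they are not truncated anywhere, the fix I would try is clipping the elapsed-days columns before standardizing, so one sentinel/overflow value can't flatten everything else (the 180-day cap and the toy frame here are my own choices, not from the notebook):

```python
import pandas as pd

# toy stand-in for `joined`; 999999 plays the role of the suspicious huge value
df = pd.DataFrame({'AfterStateHoliday': [5, 60, 999999],
                   'BeforeStateHoliday': [-3, -180, -999999]})

# cap the durations so standardization isn't dominated by one outlier
df['AfterStateHoliday'] = df['AfterStateHoliday'].clip(upper=180)
df['BeforeStateHoliday'] = df['BeforeStateHoliday'].clip(lower=-180)
```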

I apologize if I missed something, and if it is the case I would be glad to know where this issue is dealt with.


This is an error. NaN, as a float, is not representable as an integer. Here is the issue on GitHub, which should be fixed. If you update the notebook and follow along, it should work. If not, post about it here:


I haven’t watched the Pinterest videos yet, but I found these on O’Reilly’s (paid) website:

  1. How Pinterest uses machine learning to achieve ~200M monthly active users - Yunsong Guo (Pinterest) (28:40 mins)
    Pinterest has always prioritized user experiences. Yunsong Guo explores how Pinterest uses machine learning—particularly linear, GBDT, and deep NN models—in its most important product, the home feed, to improve user engagement. Along the way, Yunsong shares how Pinterest drastically increased its international user engagement along with lessons on finding the most impactful features.

  2. Escaping the forest, falling into the net: The winding path of Pinterest’s migration from GBDT to neural nets - Xiaofang Chen (Pinterest), Derek Cheng (Pinterest) (40:16 mins)
    Pinterest’s power is grounded in its personalization systems. Over the years, these recommender systems have evolved through different types of models. Xiaofang Chen and Derek Cheng explore Pinterest’s recent transition from a GBDT system to one based in neural networks powered by TensorFlow, covering the challenges and solutions to providing recommendations to over 160M monthly active users.

There are many more videos on the Safari website, but I was only allowed to post two links.


Overfitting vs. Underfitting, an example

training, validation, accuracy
0.3        0.2         0.92 = underfitting
0.2        0.3         0.92 = overfitting

I think underfitting is more like this:

training, validation, accuracy
0.6        0.3         0.84
0.5        0.3         0.84
0.4        0.2         0.84
0.3        0.1         0.84
0.2        0.01        0.84

and overfitting is more like this:

training, validation, accuracy
0.6        0.5         0.92
0.5        0.44        0.92
0.4        0.4         0.92
0.3        0.45        0.89
0.2        0.5         0.85

What do you think?
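In code, the rule of thumb behind my tables would be something like this (the 0.05 tolerance is arbitrary):

```python
def diagnose(train_loss, val_loss, tol=0.05):
    # validation loss well above training loss -> the model memorized the
    # training set (overfitting); well below -> it hasn't fit it yet
    if val_loss > train_loss + tol:
        return 'overfitting'
    if val_loss < train_loss - tol:
        return 'underfitting'
    return 'ok'

diagnose(0.2, 0.5)   # overfitting, like the last row of the second table
diagnose(0.6, 0.3)   # underfitting, like the first row of the first table
```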


The NLP notebooks links are broken.

Hi. I’m still a little confused as to why embeddings give neural networks a chance to learn richer representations compared to the regular one-hot encoding method. How does using an array of numbers to represent a categorical variable actually help here?

Also, in the lecture, I think @jeremy mentions that the number of columns to have in our lookup table is roughly min(50, c//2). What’s the intuition behind this?

Let’s say that you have a categorical variable with cardinality N.

If you 1-hot-encode this variable, you’re essentially transforming it into a set of points in N-dimensional space. This is a suitable representation for a neural net. However, points in that space are subject to a constraint: they appear only at the corners of an N-dimensional cube, so “almost all” of that space is not used at all.

Now, if we let each category be an arbitrary point in N-dimensional space, we can potentially use all the available space. Moreover, with the previous constraint removed, and since an embedding is a learnable linear layer, the net can move these points to whatever places yield the smallest loss at the end.

I don’t know for sure, but it looks like the reason the min(50, c//2) rule works well in practice is that an embedding space with c//2 dimensions is already very large compared to the 1-hot-encoded space with c dimensions, so it’s more than enough to learn meaningful relationships. The cap of 50 is probably because each dimension “exponentially increases” the “representation power”, so going beyond 50 is overkill, except when the embedding has to learn really rich representations, as in a language model. By saying “exponentially increases” I’m drawing an analogy to discrete spaces; I don’t know how it properly translates to continuous spaces.
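A tiny numpy sketch of both points: the sizing rule as stated above, plus the fact that an embedding lookup is exactly a linear layer applied to a 1-hot vector (the cardinality 8 is just an example):

```python
import numpy as np

def emb_sz(c):
    # the rule discussed above: half the cardinality, capped at 50
    return min(50, c // 2)

c = 8                                        # e.g. day-of-week plus a 'missing' slot
rng = np.random.default_rng(0)
emb = rng.standard_normal((c, emb_sz(c)))    # learnable lookup table, shape (8, 4)

one_hot = np.eye(c)[2]   # category 2 as a corner of the 8-d hypercube
dense = emb[2]           # the same category as a free point in 4-d space

# an embedding lookup is just (1-hot vector) @ (weight matrix),
# i.e. a linear layer with the corners-only constraint removed
assert np.allclose(one_hot @ emb, dense)
```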


Can you produce multiple dependent variables from a single learner, or do you need a separate learner for each variable?

Does anyone know which papers he was referring to when he started talking about using RNNs for IMDB?
He said some recent papers had just come out doing something similar to what he is doing.
I’ve found some older papers on using RNNs for text classification, but not much recent.

Hi,
how do I approach a time-series multi-class classification problem?

I’ve got a lingering question about the categorical embeddings and missing data.

I think I follow how slices of the embeddings augment each training example and can be updated via backprop much the same as any other weights, but does the embedding row that corresponds to “missing” ever get updated if there are no “missing” examples in the training data? If it doesn’t get trained, does using random weights wreak havoc on performance if a test observation is missing that category?

In this lesson day of week is used as an example with the Rossman data; I think every training observation has a day-of-week, so does the 8th row of that embedding matrix ever get updated?

Has anyone been having trouble with ColumnarModelData in lesson3-rossman.ipynb? This was working fine last week, but now when you try to use a ColumnarModelData, the fitting procedure is fine and it predicts on the training data but fails to predict on the test data (below is a snippet adapted from lesson3-rossman.ipynb)

and the ColumnarModelData is initially built as follows:

I am currently using the most recent version of fastai on github (the last commit id was 58eb7b18f97c19d4f9661e8110b2f8b96d517549)

It appears there were some changes in the GitHub repository recently. I tried rolling back to just before commit 51218e11f6f6c8603af8b9a84a02098bf9d64a82 (a change to ColumnarModelData), and this fixed part of the problem but introduced others, so there might be an issue here, though it’s not clear to me.

I had this same question, and was wondering if it made sense to randomly overwrite some day of week values (in your example) with “missing” so the training would have to come up with something to do in those cases. In my head this was analogous to using dropout.
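Something like this is what I had in mind (the 5% rate and the convention that code 0 means “missing” are just my assumptions, not from the lesson):

```python
import random

MISSING = 0        # assumed index reserved for the 'missing' category
P_MISSING = 0.05   # hypothetical rate, analogous to a small dropout

def drop_to_missing(codes, p=P_MISSING, seed=42):
    # randomly overwrite some category codes with MISSING so that row of
    # the embedding matrix receives gradient updates during training
    rng = random.Random(seed)
    return [MISSING if rng.random() < p else c for c in codes]

day_of_week = [1, 2, 3, 4, 5, 6, 7] * 100   # toy codes, never 'missing'
augmented = drop_to_missing(day_of_week)
```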

@chris2, I’m having different (related?) issues with ColumnarModelData when I try to create the object:

The bottom of that same trace:

on a fresh pull (58eb7b18).

So I have fixed this: if you revert this file back to an old commit, it should fix your problem.
$ cd <your fastai directory>  (where the file is)
$ git checkout 51218e11f6f6c8603af8b9a84a02098bf9d64a82~1 --
That will fix the problem.

@chris2 Wow, it sure did. Thank you!

Hi @mcintyre1994, can you please tell me if the arXiv dataset is available anywhere?