Wiki: Lesson 4

I think basically everything between

df = train[columns]
df = test[columns]


joined = join_df(joined, df, ['Store', 'Date'])
joined_test = join_df(joined_test, df, ['Store', 'Date'])

are supposed to run twice: once for the training dataset and once for the test dataset. Hence what I did was:

  1. df = train[columns], …, joined = join_df(joined, df, ['Store', 'Date'])
  2. df = test[columns], …, joined_test = join_df(joined_test, df, ['Store', 'Date'])

Remember to make sure df has 844338 rows for train and 41088 rows for test before you join it with joined and joined_test respectively.
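A toy sketch of that two-pass pattern, with a `join_df` helper shaped like the notebook's (the toy frames here have 2 and 1 rows; the real ones have 844338 and 41088):

```python
import pandas as pd

def join_df(left, right, on, suffix='_y'):
    # same shape as the notebook's helper: a left join that keeps every
    # row of `left` and tags overlapping column names coming from `right`
    return left.merge(right, how='left', on=on, suffixes=('', suffix))

# toy stand-ins for the real frames
train = pd.DataFrame({'Store': [1, 2], 'Date': ['2015-01-01'] * 2, 'Sales': [100, 200]})
test = pd.DataFrame({'Store': [1], 'Date': ['2015-01-02'], 'Id': [7]})
joined, joined_test = train.copy(), test.copy()
columns = ['Store', 'Date']

# pass 1: training set
df = train[columns]
assert len(df) == len(train)                      # 844338 on the real data
joined = join_df(joined, df, ['Store', 'Date'])

# pass 2: test set
df = test[columns]
assert len(df) == len(test)                       # 41088 on the real data
joined_test = join_df(joined_test, df, ['Store', 'Date'])
```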


Hey everyone! Apologies in advance if this question was asked earlier or generally discussed, feel free to link me to the discussion if it was!

These might be a little more in the domain of the machine learning course…but I was hoping someone could shed a little light on the following re: setting up the features for Rossmann:

  1. Why are variables like ‘AfterStateHoliday’, ‘BeforeStateHoliday’, ‘Promo’, ‘SchoolHoliday’ in the continuous variable list? Wouldn’t they be more suited for the categorical list? I guess the after and before state holidays are a little more continuous in nature…but maybe they could be similarly maxed out like the months since competition open (max = 24), which is a categorical variable.

  2. We devised some transformations on existing features, such as before and after holidays, and before and after promos. Does retaining the original features (holiday, promo) enhance the resulting model and if so, why? I would have thought these newer engineered features contain even more information than the originals, and consequently we could drop the original holiday and promo columns?



Hi everyone,

I was training my sentiment model on top of a pre-trained model whose accuracy wasn't that high (4.2508664). After the block

m3.freeze_to(-1)
m3.fit(lrs, 1, metrics=[accuracy])  # train the final layer
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

I see

epoch      trn_loss   val_loss   accuracy                   
    0      1.092117   1.025311   0.485915
epoch      trn_loss   val_loss   accuracy                    
    0      0.494757   0.393013   0.913172

It seemed to be going well, but after two restart cycles the accuracy dropped again. I guess it jumped out of a narrow sweet spot.

epoch      trn_loss   val_loss   accuracy                    
    0      0.465001   0.3577     0.918454  ok 
    1      0.427471   0.326164   0.921135  ok                 
    2      0.435863   0.341614   0.918734  ok               
    3      0.421462   0.329268   0.921855  ok              
    4      0.648535   0.504928   0.881362                    
    5      0.65179    0.53642    0.887404                    
    6      0.846718   0.830428   0.666973                    
    7      0.901057   0.944498   0.557698                    
    8      1.019339   0.985083   0.568662                    
    9      1.033611   1.002059   0.517165                   
    10     1.006995   1.367595   0.108635                    
    11     1.008204   1.259699   0.178577                    
    12     1.003845   1.113971   0.497519                    
    13     0.999826   0.856634   0.660131    

How should I obtain a good model in this case then? Should I stop restarting after two cycles? Thanks in advance.
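In case it helps others, one thing I'm considering (not from the lesson, just a common trick) is to checkpoint after every cycle and fall back to the best one, so a later divergent restart can't cost me the good model. A toy sketch, where `train_one_cycle` stands in for `m3.fit(...)` plus evaluation, and the pretend accuracies mimic my log above:

```python
import copy

def train_one_cycle(model, cycle):
    # stand-in for one restart cycle: m3.fit(...) followed by computing
    # validation accuracy; the numbers mimic the log (good early, diverging later)
    model['cycles_run'] = cycle + 1
    val_accs = [0.918, 0.921, 0.919, 0.922, 0.881, 0.557]
    return val_accs[cycle]

model = {'cycles_run': 0}
best_acc, best_model = float('-inf'), None

for cycle in range(6):
    acc = train_one_cycle(model, cycle)
    if acc > best_acc:
        # snapshot the weights whenever validation accuracy improves
        best_acc, best_model = acc, copy.deepcopy(model)

model = best_model   # keep the best checkpoint, not the last one
```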

Hello everyone,

I have a question concerning some of the created features of the dataset, more specifically AfterSchoolHoliday, BeforeSchoolHoliday, AfterStateHoliday, and BeforeStateHoliday. I know that this is more on the ML side than the DL side, but I still feel that this thread is the right place to ask it.
I have the impression that some of the values computed in these columns do not make sense and are just a kind of numerical upper bound. Since we are dealing with durations in days, the values should not exceed a couple of thousand, yet this value appears everywhere (see the output of cell 68 on the GitHub page of the course):

I may have missed a part where we truncate these variables (maybe it is done automatically somewhere), but if we only standardize them, then all the meaningful values will be sent to 0 and the variables will lose their meaning.
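For what it's worth, if they are not truncated anywhere, the fix I would try is clipping the elapsed-days columns before standardizing, so one sentinel/overflow value can't flatten everything else (the 180-day cap and the toy frame here are my own choices, not from the notebook):

```python
import pandas as pd

# toy stand-in for `joined`; 999999 plays the role of the suspicious huge value
df = pd.DataFrame({'AfterStateHoliday': [5, 60, 999999],
                   'BeforeStateHoliday': [-3, -180, -999999]})

# cap the durations so standardization isn't dominated by one outlier
df['AfterStateHoliday'] = df['AfterStateHoliday'].clip(upper=180)
df['BeforeStateHoliday'] = df['BeforeStateHoliday'].clip(lower=-180)
```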

I apologize if I missed something, and if it is the case I would be glad to know where this issue is dealt with.


This is an error. NaN, as a float, is not representable as an integer. Here is the issue on GitHub, which should be fixed. If you update the notebook and follow along, it should work. If not, post about it here:


I haven’t watched the Pinterest videos yet, but I found these on O’Reilly’s (paid) website:

  1. How Pinterest uses machine learning to achieve ~200M monthly active users - Yunsong Guo (Pinterest) (28:40 mins)
    Pinterest has always prioritized user experiences. Yunsong Guo explores how Pinterest uses machine learning—particularly linear, GBDT, and deep NN models—in its most important product, the home feed, to improve user engagement. Along the way, Yunsong shares how Pinterest drastically increased its international user engagement along with lessons on finding the most impactful features.

  2. Escaping the forest, falling into the net: The winding path of Pinterest’s migration from GBDT to neural nets - Xiaofang Chen (Pinterest), Derek Cheng (Pinterest) (40:16 mins)
    Pinterest’s power is grounded in its personalization systems. Over the years, these recommender systems have evolved through different types of models. Xiaofang Chen and Derek Cheng explore Pinterest’s recent transition from a GBDT system to one based in neural networks powered by TensorFlow, covering the challenges and solutions to providing recommendations to over 160M monthly active users.

There are many more videos on the Safari website, but I was only allowed to post two links.


Overfitting vs. Underfitting, an example

training, validation, accuracy
0.3        0.2         0.92 = underfitting
0.2        0.3         0.92 = overfitting

I think underfitting is more like this:

training, validation, accuracy
0.6        0.3         0.84
0.5        0.3         0.84
0.4        0.2         0.84
0.3        0.1         0.84
0.2        0.01        0.84

and overfitting is more like this:

training, validation, accuracy
0.6        0.5         0.92
0.5        0.44        0.92
0.4        0.4         0.92
0.3        0.45        0.89
0.2        0.5         0.85

What do you think?
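In code, the rule of thumb behind my tables would be something like this (the 0.05 tolerance is arbitrary):

```python
def diagnose(train_loss, val_loss, tol=0.05):
    # validation loss well above training loss -> the model memorized the
    # training set (overfitting); well below -> it hasn't fit it yet
    if val_loss > train_loss + tol:
        return 'overfitting'
    if val_loss < train_loss - tol:
        return 'underfitting'
    return 'ok'

diagnose(0.2, 0.5)   # overfitting, like the last row of the second table
diagnose(0.6, 0.3)   # underfitting, like the first row of the first table
```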


The NLP notebooks links are broken.

Hi. I’m still a little confused as to why embeddings give neural networks a chance to learn richer representations compared to the regular one-hot encoding method. How does using an array of numbers to represent a categorical variable actually help here?

Also, in the lecture, I think @jeremy mentions that the number of columns to have in our lookup table is roughly min(50, c//2). What’s the intuition behind this?

Let’s say that you have a categorical variable with cardinality N.

If you 1-hot-encode this variable, you’re essentially transforming it into a set of points in N-dimensional space. This is a suitable representation for a neural net. However, points in that space are subject to a constraint: they appear only at the corners of an N-dimensional cube, so “almost all” of that space is not used at all.

Now, if we let each category be an arbitrary point in N-dimensional space, we can potentially use all the available space. Moreover, with the previous constraint removed, and since an embedding is a learnable linear layer, the net can move these points to whatever places yield the smallest loss at the end.

I don’t know for sure, but it looks like the reason the min(50, c//2) rule works well in practice is that an embedding space with c//2 dimensions is already very large compared to the 1-hot-encoded space with c dimensions, so it’s more than enough to learn meaningful relationships. The cap of 50 is probably because each dimension “exponentially increases” the “representation power”, so going beyond 50 is overkill, except when the embedding has to learn really rich representations, as in a language model. By saying “exponentially increases” I’m drawing an analogy to discrete spaces; I don’t know how it properly translates to continuous spaces.
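A tiny numpy sketch of both points: the sizing rule as stated above, plus the fact that an embedding lookup is exactly a linear layer applied to a 1-hot vector (the cardinality 8 is just an example):

```python
import numpy as np

def emb_sz(c):
    # the rule discussed above: half the cardinality, capped at 50
    return min(50, c // 2)

c = 8                                        # e.g. day-of-week plus a 'missing' slot
rng = np.random.default_rng(0)
emb = rng.standard_normal((c, emb_sz(c)))    # learnable lookup table, shape (8, 4)

one_hot = np.eye(c)[2]   # category 2 as a corner of the 8-d hypercube
dense = emb[2]           # the same category as a free point in 4-d space

# an embedding lookup is just (1-hot vector) @ (weight matrix),
# i.e. a linear layer with the corners-only constraint removed
assert np.allclose(one_hot @ emb, dense)
```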


Can you produce multiple dependent variables from a single learner, or do you need a separate learner for each variable?

Does anyone know which papers he was referring to when he started talking about using RNNs for IMDB?
He said some recent papers had just come out doing something similar to what he is doing.
I’ve found some older papers on using RNNs for text classification, but not much recent.

Hi,
how do I approach a time-series multi-class classification problem?

I’ve got a lingering question about the categorical embeddings and missing data.

I think I follow how slices of the embeddings augment each training example and can be updated via backprop much the same as any other weights, but does the embedding row that corresponds to “missing” ever get updated if there are no “missing” examples in the training data? If it doesn’t get trained, does using random weights wreak havoc on performance if a test observation is missing that category?

In this lesson day of week is used as an example with the Rossman data; I think every training observation has a day-of-week, so does the 8th row of that embedding matrix ever get updated?

Has anyone been having trouble with ColumnarModelData in lesson3-rossman.ipynb? This was working fine last week, but now when you try to use a ColumnarModelData, the fitting procedure is fine and it predicts on the training data but fails to predict on the test data (below is a snippet adapted from lesson3-rossman.ipynb)

and the ColumnarModelData is initially built as follows:

I am currently using the most recent version of fastai on github (the last commit id was 58eb7b18f97c19d4f9661e8110b2f8b96d517549)

It appears there were some changes in the GitHub repository recently. I tried rolling back to just before commit 51218e11f6f6c8603af8b9a84a02098bf9d64a82 (a change to ColumnarModelData), and this fixed part of the problem but introduced others, so there might be an issue here, though it’s not clear to me.

I had this same question, and was wondering if it made sense to randomly overwrite some day of week values (in your example) with “missing” so the training would have to come up with something to do in those cases. In my head this was analogous to using dropout.
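Something like this is what I had in mind (the 5% rate and the convention that code 0 means “missing” are just my assumptions, not from the lesson):

```python
import random

MISSING = 0        # assumed index reserved for the 'missing' category
P_MISSING = 0.05   # hypothetical rate, analogous to a small dropout

def drop_to_missing(codes, p=P_MISSING, seed=42):
    # randomly overwrite some category codes with MISSING so that row of
    # the embedding matrix receives gradient updates during training
    rng = random.Random(seed)
    return [MISSING if rng.random() < p else c for c in codes]

day_of_week = [1, 2, 3, 4, 5, 6, 7] * 100   # toy codes, never 'missing'
augmented = drop_to_missing(day_of_week)
```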

@chris2, I’m having different (related?) issues with ColumnarModelData when I try to create the object:

The bottom of that same trace:

on a fresh pull (58eb7b18).

So I have fixed this: if you revert this file back to an old commit, it should fix your problem.
$ cd <your fastai directory>  (where the file is)
$ git checkout 51218e11f6f6c8603af8b9a84a02098bf9d64a82~1 --
That will fix the problem.

@chris2 Wow, it sure did. Thank you!

Hi @mcintyre1994, can you please tell me if the arXiv dataset is available anywhere?