In the neural net version of the Blue Book For Bulldozers problem, why do we remove 'fiModelDescriptor' instead of 'fiModelDesc' or 'ModelID'?

toomanyrichies · June 6, 2021, 1:10am

Towards the end of the chapter, we see the following paragraph:

The fact that there are two variables pertaining to the “model” of the equipment, both with similar very high cardinalities, suggests that they may contain similar, redundant information… Having a column with 5,000 levels means needing 5,000 columns in our embedding matrix, which would be nice to avoid if possible. Let’s see what the impact of removing one of these model columns has on the random forest:

Then we see the following code and its output:

xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)

=> (0.176853, 0.229973)

cat_nn.remove('fiModelDescriptor')

My question is: if the columns that drive the need for larger, more memory-intensive embedding matrices are the ones with cardinalities around 5,000+ (i.e. ModelID and fiModelDesc), why do we instead remove a column with a cardinality of only 140 (fiModelDescriptor)? Why not remove one of the first two?
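
For anyone who wants to check the cardinalities themselves, here is a minimal sketch using pandas `nunique()`. The DataFrame is a made-up toy stand-in for the bulldozers data (its values are invented for illustration, not taken from the dataset); in the notebook you would run the same call on the real training DataFrame.

```python
import pandas as pd

# Toy stand-in for the bulldozers data; these values are invented for illustration.
df = pd.DataFrame({
    "ModelID": [101, 102, 103, 104, 105, 106],
    "fiModelDesc": ["310G", "310G", "D6", "D6H", "320L", "416C"],
    "fiModelDescriptor": ["G", "G", None, "H", "L", "C"],
})

model_cols = ["ModelID", "fiModelDesc", "fiModelDescriptor"]

# nunique() counts distinct non-missing values per column, i.e. the cardinality.
cardinality = df[model_cols].nunique().sort_values(ascending=False)
print(cardinality)
```

On the real data, the column at the top of this listing is the one that costs the most rows in an embedding matrix.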

thatgeeman · June 7, 2021, 1:07pm

Looks like an odd mistake to me, probably a mix-up due to the similar column names…
I tried running the notebook with fiModelDesc removed instead of fiModelDescriptor, and the RMSE only changed a little (≈ +0.004). Maybe this deserves a PR, and hopefully it gets fixed in the next (printed) edition.
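
To put rough numbers on the memory argument, here is a small sketch comparing embedding-table sizes for a ~5,000-level column and a ~140-level column. The `emb_dim` helper is hypothetical: it uses a rule of thumb that, as far as I know, matches fastai's default embedding-size heuristic, but treat that as an assumption and check what your own model actually uses. The cardinalities are the approximate figures from the question above.

```python
def emb_dim(n_levels: int) -> int:
    # Assumed heuristic: width grows sub-linearly with cardinality, capped at 600.
    return min(600, round(1.6 * n_levels ** 0.56))


def emb_params(n_levels: int) -> int:
    # An embedding matrix has one row per level and emb_dim columns.
    return n_levels * emb_dim(n_levels)


# Approximate cardinalities quoted in this thread.
columns = {"fiModelDesc": 5000, "fiModelDescriptor": 140}

for name, n_levels in columns.items():
    print(f"{name}: {n_levels} levels x {emb_dim(n_levels)} dims "
          f"= {emb_params(n_levels)} parameters")
```

Under this assumption the high-cardinality column needs an embedding table a couple of orders of magnitude larger, which is why dropping fiModelDescriptor does little for the memory concern raised in the chapter.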