Here are the questions:
- What is a continuous variable?
This refers to numerical variables that have a wide range of “continuous” values (ex: age)
- What is a categorical variable?
This refers to variables that can take on discrete levels that correspond to different categories.
- Provide 2 of the words that are used for the possible values of a categorical variable.
Levels or categories
- What is a “dense layer”?
Equivalent to what we call linear layers.
- How do entity embeddings reduce memory usage and speed up neural networks?
Especially for large datasets, representing the data as one-hot encoded vectors can be very inefficient (and also sparse). Using entity embeddings instead gives the data a much more memory-efficient (dense) representation, which also leads to speed-ups for the model.
- What kind of datasets are entity embeddings especially useful for?
They are especially useful for datasets with high-cardinality features (features with lots of possible categories). Other methods often overfit on data like this.
- What are the two main families of machine learning algorithms?
- Ensemble of decision trees are best for structured (tabular data)
- Multilayered neural networks are best for unstructured data (audio, vision, text, etc.)
- Why do some categorical columns need a special ordering in their classes? How do you do this in pandas?
Ordinal categories may inherently have some order. By using `set_categories` with the argument `ordered=True` and passing in the ordered list of levels, this information is represented in the pandas DataFrame.
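For example (the column name and level ordering are illustrative, loosely following the book's ProductSize example):

```python
import pandas as pd

df = pd.DataFrame({'ProductSize': ['Small', 'Large', 'Medium', 'Small']})
sizes = ['Large', 'Medium', 'Small']  # desired ordering, largest first

# Mark the column as an ordered categorical so the ordering of the levels
# is preserved in the DataFrame (and in anything derived from it).
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
```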
- Summarize what a decision tree algorithm does.
The basic idea of what a decision tree algorithm does is to determine how to group the data based on “questions” that we ask about the data. That is, we keep splitting the data based on the levels or values of the features and generate predictions based on the average target value of the data points in that group. Here is the algorithm:
- Loop through each column of the dataset in turn
- For each column, loop through each possible level of that column in turn
- Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable)
- Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. That is, treat this as a very simple “model” where our predictions are simply the average sale price of the item’s group
- After looping through all of the columns and possible levels for each, pick the split point which gave the best predictions using our very simple model
- We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each, by going back to step one for each group
- Continue this process recursively until you have reached some stopping criterion for each group (for instance, stop splitting a group further when it has only 20 items in it).
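A minimal sketch of the split search described above, for numeric columns with RMSE as the score (this is just an illustration of the idea, not the scikit-learn implementation; the recursion would then repeat `best_split` on each of the two resulting groups):

```python
import numpy as np

def split_rmse(col, y, split_val):
    # Split the rows into two groups and use each group's mean target
    # as a very simple "model"; return the RMSE of that model.
    lhs = col <= split_val
    if lhs.sum() == 0 or lhs.sum() == len(y):
        return np.inf
    preds = np.where(lhs, y[lhs].mean(), y[~lhs].mean())
    return np.sqrt(((preds - y) ** 2).mean())

def best_split(df, y, cols):
    # Loop through every column and every candidate value, keeping the
    # split point that gives the lowest RMSE.
    best_col, best_val, best_score = None, None, np.inf
    for c in cols:
        for v in np.unique(df[c]):
            score = split_rmse(df[c].values, y, v)
            if score < best_score:
                best_col, best_val, best_score = c, v, score
    return best_col, best_val, best_score
```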
- Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?
Some dates are qualitatively different from others (ex: holidays, weekends, etc.) in ways that cannot be captured by a single ordinal variable. Instead, we can generate many different categorical features describing the properties of the given date (ex: is it a weekday? is it the end of the month? etc.).
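For example, with plain pandas (fastai's `add_datepart` generates a similar, larger set of columns automatically):

```python
import pandas as pd

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-03', '2011-12-31'])})

# Expand the raw date into several categorical/boolean features
df['saleYear'] = df['saledate'].dt.year
df['saleMonth'] = df['saledate'].dt.month
df['saleDayofweek'] = df['saledate'].dt.dayofweek        # 0 = Monday
df['saleIs_month_end'] = df['saledate'].dt.is_month_end
df['saleIs_quarter_end'] = df['saledate'].dt.is_quarter_end
```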
- Should you pick a random validation set in the bulldozer competition? If no, what kind of validation set should you pick?
No, the validation set should be as similar to the test set as possible. In this case, the test set contains data from later dates, so we should split the data by date and put the latest dates in the validation set.
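A sketch of such a split, assuming a `df` with a `saledate` column (the cutoff date is illustrative):

```python
import pandas as pd

# Earlier rows go to training, later rows to validation
cond = df['saledate'] < pd.Timestamp('2011-10-01')
train_df, valid_df = df[cond], df[~cond]
```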
- What is pickle and what is it useful for?
The pickle module allows you to save nearly any Python object to a file and load it back later.
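For example, saving and reloading a trained model `m` (the filename is illustrative):

```python
import pickle

# Save nearly any Python object to disk...
with open('model.pkl', 'wb') as f:
    pickle.dump(m, f)

# ...and load it back later
with open('model.pkl', 'rb') as f:
    m = pickle.load(f)
```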
- How are mse, samples, and values calculated in the decision tree drawn in this chapter?
By traversing the tree based on answering questions about the data, we reach the nodes corresponding to groups of training rows: samples is the number of rows in that group, value is the average of the target (sale price) for those rows, and mse is the mean squared error of using that average as the prediction for every row in the group.
- How do we deal with outliers, before building a decision tree?
By finding out-of-domain data (outliers). Sometimes it is hard to even know whether your test set is distributed in the same way as your training data or, if it is different, which columns reflect that difference. There's a nice easy way to figure this out: use a random forest! But in this case we don't use the random forest to predict our actual dependent variable. Instead, we try to predict whether a row is in the validation set or the training set.
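A sketch of that idea, assuming `df_train` and `df_valid` hold the already-processed training and validation features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Label each row by which set it came from, then try to predict that label
df_dom = pd.concat([df_train, df_valid])
is_valid = np.array([0] * len(df_train) + [1] * len(df_valid))

m = RandomForestClassifier(n_estimators=40, min_samples_leaf=15, n_jobs=-1)
m.fit(df_dom, is_valid)

# Columns with high importance are the ones whose distribution differs
# between the two sets, i.e. the likely out-of-domain columns
fi = pd.DataFrame({'col': df_dom.columns, 'imp': m.feature_importances_})
print(fi.sort_values('imp', ascending=False).head())
```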
- How do we handle categorical variables in a decision tree?
We convert the categorical variables to integers, where the integers correspond to the discrete levels of the categorical variable. Apart from that, there is nothing special that needs to be done to get it to work with decision trees (unlike neural networks, where we use embedding layers).
- What is bagging?
Train multiple models on different random subsets of the data, and average their predictions to form the ensemble prediction.
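A bare-bones sketch of bagging with decision trees, assuming `xs`/`y` are the training features (a DataFrame) and target, and `valid_xs` the rows to predict:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
trees = []
for _ in range(10):
    # Each tree is trained on a different bootstrap sample of the rows
    idx = rng.integers(0, len(xs), len(xs))
    trees.append(DecisionTreeRegressor().fit(xs.iloc[idx], y.iloc[idx]))

# The ensemble prediction is the average of the individual trees' predictions
preds = np.mean([t.predict(valid_xs) for t in trees], axis=0)
```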
- What is the difference between max_samples and max_features when creating a random forest?
When training random forests, we train multiple decision trees on random subsets of the data. `max_samples` defines how many samples, or rows of the tabular dataset, are used to train each decision tree. `max_features` defines how many features, or columns of the tabular dataset, are considered at each split point within a tree.
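For example, with scikit-learn (the hyperparameter values are illustrative, similar in spirit to the chapter's; `xs`/`y` are assumed to be the training features and target):

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(
    n_estimators=40,      # number of trees in the forest
    max_samples=200_000,  # rows sampled to train each tree
    max_features=0.5,     # fraction of columns considered at each split
    min_samples_leaf=5,
    n_jobs=-1,
)
m.fit(xs, y)
```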
- If you increase n_estimators to a very high value, can that lead to overfitting? Why or why not?
A higher `n_estimators` means more decision trees are being used. However, since the trees are trained independently of each other, using a higher `n_estimators` does not lead to overfitting.
- What is out-of-bag (OOB) error?
For each row, we compute the error using only the trees that did not have that row in their training subset. This gives an error estimate on the training data itself, so no separate validation set is needed.
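With scikit-learn this is available via `oob_score=True` (continuing with the `xs`/`y` names from above):

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1)
m.fit(xs, y)

# R^2 computed only from trees that did NOT see each row during training
print(m.oob_score_)
```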
- Make a list of reasons why a model’s validation set error might be worse than the OOB error. How could you test your hypotheses?
The major reason could be that the model does not generalize well. Related to this is the possibility that the validation data has a slightly different distribution than the data the model was trained on. One way to test these hypotheses is to check whether the validation error gets worse for the later dates in the validation set, or to check for out-of-domain data with a random forest as described above.
- How can you answer each of these things with a random forest? How do they work?:
- How confident are we in our projections using a particular row of data?
Look at the standard deviation of the predictions across the individual trees (estimators); see the sketch after this list.
- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
Use the `treeinterpreter` package to check how the prediction changes as the row goes through the tree, adding up the contributions from each split/feature. Use a waterfall plot to visualize the result.
- Which columns are the strongest predictors?
Look at feature importance
- How do predictions vary, as we vary these columns?
Look at partial dependence plots
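For the first point, a sketch of measuring per-row confidence with a fitted random forest `m` and validation features `valid_xs` (names assumed):

```python
import numpy as np

# One prediction per tree per validation row: shape (n_trees, n_rows)
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])

# A high standard deviation across trees means the trees disagree,
# i.e. we are less confident about that row's prediction
preds_std = preds.std(0)
```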
- What’s the purpose of removing unimportant variables?
Sometimes, it is better to have a more interpretable model with fewer features, so removing unimportant variables helps in that regard.
- What’s a good type of plot for showing tree interpreter results?
Waterfall plot
- What is the extrapolation problem?
It is hard for a model to extrapolate to data that lies outside the domain of the training data. This is particularly important for random forests: their predictions are just averages of training-set target values, so they can never predict values outside the range seen during training. Neural networks, on the other hand, have underlying linear layers, so they could potentially extrapolate and generalize better.
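A small illustration of the problem with synthetic data: the forest's predictions flatten out beyond the range it was trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

x = np.linspace(0, 1, 100).reshape(-1, 1)
y = 2 * x.ravel()                                  # simple linear trend

m = RandomForestRegressor().fit(x[:80], y[:80])    # train only on x < 0.8
print(m.predict(x[80:])[:5])                       # stays near y(0.8) ~= 1.6
```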
- How can you tell if your test or validation set is distributed in a different way to your training set?
We can do so by training a model to classify whether a row comes from the training data or the validation data. If the model can reliably tell the two apart, then the datasets have different distributions (out-of-domain data).
- Why do we make saleElapsed a continuous variable, even though it has fewer than 9,000 distinct values?
This is a variable that changes over time, and since we want our model to extrapolate for future results, we make this a continuous variable.
- What is boosting?
We train a model that underfits the dataset, then train subsequent models, each of which predicts the residual error of the models so far. We then add the predictions of all the models to get the final prediction.
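A minimal boosting sketch with deliberately weak (shallow) trees, assuming `xs`/`y` and `valid_xs` as before; real boosting libraries add a learning rate and many other refinements:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

residual = np.asarray(y, dtype=float)          # start by fitting the raw target
trees = []
for _ in range(10):
    t = DecisionTreeRegressor(max_depth=3)     # weak learner that underfits
    t.fit(xs, residual)
    residual = residual - t.predict(xs)        # next tree fits what is left
    trees.append(t)

# The final prediction is the sum of all the trees' predictions
preds = sum(t.predict(valid_xs) for t in trees)
```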
- How could we use embeddings with a random forest? Would we expect this to help?
Entity embeddings contain richer representations of the categorical features and can considerably improve the performance of other models like random forests. Instead of passing in the raw categorical columns, the entity embeddings can be passed into the random forest model.
- Why might we not always use a neural net for tabular modeling?
We might not use them because they are the hardest and slowest to train, and less well understood. Instead, random forests should be the first choice/baseline, and neural networks can then be tried to see whether they improve on those results or can be added to an ensemble.