Here are the questions:
- What is a continuous variable?
This refers to numerical variables that have a wide range of “continuous” values (ex: age)
- What is a categorical variable?
This refers to variables that can take on discrete levels that correspond to different categories.
- Provide 2 of the words that are used for the possible values of a categorical variable.
Levels or categories
- What is a “dense layer”?
Equivalent to what we call linear layers.
- How do entity embeddings reduce memory usage and speed up neural networks?
Especially for large datasets, representing the data as one-hot encoded vectors can be very inefficient (and also sparse). Using entity embeddings instead gives the data a much more memory-efficient (dense) representation, which also leads to speed-ups for the model.
- What kind of datasets are entity embeddings especially useful for?
They are especially useful for datasets with high-cardinality features (features with lots of possible categories). Other methods often overfit on data like this.
- What are the two main families of machine learning algorithms?
- Ensemble of decision trees are best for structured (tabular data)
- Multilayered neural networks are best for unstructured data (audio, vision, text, etc.)
- Why do some categorical columns need a special ordering in their classes? How do you do this in pandas?
Ordinal categories may inherently have some order. By using `set_categories` with the argument `ordered=True` and passing in the ordered list of levels, this information is represented in the pandas DataFrame.
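For example (the column name and level ordering are illustrative, loosely following the book's ProductSize example):

```python
import pandas as pd

df = pd.DataFrame({'ProductSize': ['Small', 'Large', 'Medium', 'Small']})
sizes = ['Large', 'Medium', 'Small']  # desired ordering, largest first

# Mark the column as an ordered categorical so the ordering of the levels
# is preserved in the DataFrame (and in anything derived from it).
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
```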
- Summarize what a decision tree algorithm does.
The basic idea of what a decision tree algorithm does is to determine how to group the data based on “questions” that we ask about the data. That is, we keep splitting the data based on the levels or values of the features and generate predictions based on the average target value of the data points in that group. Here is the algorithm:
- Loop through each column of the dataset in turn
- For each column, loop through each possible level of that column in turn
- Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable)
- Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. That is, treat this as a very simple “model” where our predictions are simply the average sale price of the item’s group
- After looping through all of the columns and possible levels for each, pick the split point which gave the best predictions using our very simple model
- We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each, by going back to step one for each group
- Continue this process recursively until you have reached some stopping criterion for each group (for instance, stop splitting a group further when it has only 20 items in it).
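A minimal sketch of the split search described above, for numeric columns with RMSE as the score (this is just an illustration of the idea, not the scikit-learn implementation; the recursion would then repeat `best_split` on each of the two resulting groups):

```python
import numpy as np

def split_rmse(col, y, split_val):
    # Split the rows into two groups and use each group's mean target
    # as a very simple "model"; return the RMSE of that model.
    lhs = col <= split_val
    if lhs.sum() == 0 or lhs.sum() == len(y):
        return np.inf
    preds = np.where(lhs, y[lhs].mean(), y[~lhs].mean())
    return np.sqrt(((preds - y) ** 2).mean())

def best_split(df, y, cols):
    # Loop through every column and every candidate value, keeping the
    # split point that gives the lowest RMSE.
    best_col, best_val, best_score = None, None, np.inf
    for c in cols:
        for v in np.unique(df[c]):
            score = split_rmse(df[c].values, y, v)
            if score < best_score:
                best_col, best_val, best_score = c, v, score
    return best_col, best_val, best_score
```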
- Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?
Some dates are qualitatively different from others (ex: holidays, weekends, etc.) in ways that cannot be captured by a single ordinal variable. Instead, we can generate many different categorical features describing the properties of the given date (ex: is it a weekday? is it the end of the month? etc.).
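For example, with plain pandas (fastai's `add_datepart` generates a similar, larger set of columns automatically):

```python
import pandas as pd

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-03', '2011-12-31'])})

# Expand the raw date into several categorical/boolean features
df['saleYear'] = df['saledate'].dt.year
df['saleMonth'] = df['saledate'].dt.month
df['saleDayofweek'] = df['saledate'].dt.dayofweek        # 0 = Monday
df['saleIs_month_end'] = df['saledate'].dt.is_month_end
df['saleIs_quarter_end'] = df['saledate'].dt.is_quarter_end
```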
- Should you pick a random validation set in the bulldozer competition? If no, what kind of validation set should you pick?
No, the validation set should be as similar to the test set as possible. In this case, the test set contains data from later dates, so we should split the data by date and put the latest dates in the validation set.
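A sketch of such a split, assuming a `df` with a `saledate` column (the cutoff date is illustrative):

```python
import pandas as pd

# Earlier rows go to training, later rows to validation
cond = df['saledate'] < pd.Timestamp('2011-10-01')
train_df, valid_df = df[cond], df[~cond]
```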
- What is pickle and what is it useful for?
The pickle module allows you to save nearly any Python object to a file and load it back later.
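For example, saving and reloading a trained model `m` (the filename is illustrative):

```python
import pickle

# Save nearly any Python object to disk...
with open('model.pkl', 'wb') as f:
    pickle.dump(m, f)

# ...and load it back later
with open('model.pkl', 'rb') as f:
    m = pickle.load(f)
```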
- How are mse, samples, and values calculated in the decision tree drawn in this chapter?
By traversing the tree based on answering questions about the data, we reach the nodes corresponding to groups of training rows: samples is the number of rows in that group, value is the average of the target (sale price) for those rows, and mse is the mean squared error of using that average as the prediction for every row in the group.
- How do we deal with outliers, before building a decision tree?
By finding out-of-domain data (outliers). Sometimes it is hard to even know whether your test set is distributed in the same way as your training data or, if it is different, which columns reflect that difference. There's a nice easy way to figure this out: use a random forest! But in this case we don't use the random forest to predict our actual dependent variable. Instead, we try to predict whether a row is in the validation set or the training set.
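A sketch of that idea, assuming `df_train` and `df_valid` hold the already-processed training and validation features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Label each row by which set it came from, then try to predict that label
df_dom = pd.concat([df_train, df_valid])
is_valid = np.array([0] * len(df_train) + [1] * len(df_valid))

m = RandomForestClassifier(n_estimators=40, min_samples_leaf=15, n_jobs=-1)
m.fit(df_dom, is_valid)

# Columns with high importance are the ones whose distribution differs
# between the two sets, i.e. the likely out-of-domain columns
fi = pd.DataFrame({'col': df_dom.columns, 'imp': m.feature_importances_})
print(fi.sort_values('imp', ascending=False).head())
```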
- How do we handle categorical variables in a decision tree?
We convert the categorical variables to integers, where the integers correspond to the discrete levels of the categorical variable. Apart from that, there is nothing special that needs to be done to get it to work with decision trees (unlike neural networks, where we use embedding layers).
- What is bagging?
Train multiple models on different random subsets of the data, and average their predictions to form the ensemble prediction.
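A bare-bones sketch of bagging with decision trees, assuming `xs`/`y` are the training features (a DataFrame) and target, and `valid_xs` the rows to predict:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
trees = []
for _ in range(10):
    # Each tree is trained on a different bootstrap sample of the rows
    idx = rng.integers(0, len(xs), len(xs))
    trees.append(DecisionTreeRegressor().fit(xs.iloc[idx], y.iloc[idx]))

# The ensemble prediction is the average of the individual trees' predictions
preds = np.mean([t.predict(valid_xs) for t in trees], axis=0)
```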
- What is the difference between max_samples and max_features when creating a random forest?
When training random forests, we train multiple decision trees on random subsets of the data. `max_samples` defines how many samples, or rows of the tabular dataset, are used to train each decision tree. `max_features` defines how many features, or columns of the tabular dataset, are considered at each split point within a tree.
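For example, with scikit-learn (the hyperparameter values are illustrative, similar in spirit to the chapter's; `xs`/`y` are assumed to be the training features and target):

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(
    n_estimators=40,      # number of trees in the forest
    max_samples=200_000,  # rows sampled to train each tree
    max_features=0.5,     # fraction of columns considered at each split
    min_samples_leaf=5,
    n_jobs=-1,
)
m.fit(xs, y)
```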
- If you increase n_estimators to a very high value, can that lead to overfitting? Why or why not?
A higher `n_estimators` means more decision trees are being used. However, since the trees are trained independently of each other, using a higher `n_estimators` does not lead to overfitting.
- What is out-of-bag (OOB) error?
For each row, we compute the error using only the trees that did not have that row in their training subset. This gives an error estimate on the training data itself, so no separate validation set is needed.
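With scikit-learn this is available via `oob_score=True` (continuing with the `xs`/`y` names from above):

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1)
m.fit(xs, y)

# R^2 computed only from trees that did NOT see each row during training
print(m.oob_score_)
```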
- Make a list of reasons why a model’s validation set error might be worse than the OOB error. How could you test your hypotheses?
The major reason could be that the model does not generalize well. Related to this is the possibility that the validation data has a slightly different distribution than the data the model was trained on. One way to test these hypotheses is to check whether the validation error gets worse for the later dates in the validation set, or to check for out-of-domain data with a random forest as described above.
- How can you answer each of these things with a random forest? How do they work?:
- How confident are we in our projections using a particular row of data?
Look at the standard deviation of the predictions across the individual trees (estimators); see the sketch after this list.
- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
Use the `treeinterpreter` package to check how the prediction changes as the row goes through the tree, adding up the contributions from each split/feature. Use a waterfall plot to visualize the result.
- Which columns are the strongest predictors?
Look at feature importance
- How do predictions vary, as we vary these columns?
Look at partial dependence plots
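For the first point, a sketch of measuring per-row confidence with a fitted random forest `m` and validation features `valid_xs` (names assumed):

```python
import numpy as np

# One prediction per tree per validation row: shape (n_trees, n_rows)
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])

# A high standard deviation across trees means the trees disagree,
# i.e. we are less confident about that row's prediction
preds_std = preds.std(0)
```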
- What’s the purpose of removing unimportant variables?
Sometimes, it is better to have a more interpretable model with fewer features, so removing unimportant variables helps in that regard.
- What’s a good type of plot for showing tree interpreter results?
Waterfall plot
- What is the extrapolation problem?
It is hard for a model to extrapolate to data that lies outside the domain of the training data. This is particularly important for random forests: their predictions are just averages of training-set target values, so they can never predict values outside the range seen during training. Neural networks, on the other hand, have underlying linear layers, so they could potentially extrapolate and generalize better.
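A small illustration of the problem with synthetic data: the forest's predictions flatten out beyond the range it was trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

x = np.linspace(0, 1, 100).reshape(-1, 1)
y = 2 * x.ravel()                                  # simple linear trend

m = RandomForestRegressor().fit(x[:80], y[:80])    # train only on x < 0.8
print(m.predict(x[80:])[:5])                       # stays near y(0.8) ~= 1.6
```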
- How can you tell if your test or validation set is distributed in a different way to your training set?
We can do so by training a model to classify whether a row comes from the training data or the validation data. If the model can reliably tell the two apart, then the datasets have different distributions (out-of-domain data).
- Why do we make saleElapsed a continuous variable, even though it has fewer than 9,000 distinct values?
This is a variable that changes over time, and since we want our model to extrapolate for future results, we make this a continuous variable.
- What is boosting?
We train a model that underfits the dataset, then train subsequent models, each of which predicts the residual error of the models so far. We then add the predictions of all the models to get the final prediction.
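A minimal boosting sketch with deliberately weak (shallow) trees, assuming `xs`/`y` and `valid_xs` as before; real boosting libraries add a learning rate and many other refinements:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

residual = np.asarray(y, dtype=float)          # start by fitting the raw target
trees = []
for _ in range(10):
    t = DecisionTreeRegressor(max_depth=3)     # weak learner that underfits
    t.fit(xs, residual)
    residual = residual - t.predict(xs)        # next tree fits what is left
    trees.append(t)

# The final prediction is the sum of all the trees' predictions
preds = sum(t.predict(valid_xs) for t in trees)
```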
- How could we use embeddings with a random forest? Would we expect this to help?
Entity embeddings contain richer representations of the categorical features and can considerably improve the performance of other models like random forests. Instead of passing in the raw categorical columns, the entity embeddings can be passed into the random forest model.
- Why might we not always use a neural net for tabular modeling?
We might not use them because they are the hardest and slowest to train, and less well understood. Instead, random forests should be the first choice/baseline, and neural networks can then be tried to see whether they improve on those results or can be added to an ensemble.