A collection of questions I have from lesson 6

A few questions popped up for me while going through lesson 6 of the course.

I’d appreciate any answers to them, and feel free to answer only one question or a few! :slightly_smiling_face:

Decision Trees:

  1. When using decision trees, should one use Gini impurity or the metric provided by the competition to judge the tree’s performance?
  2. Should one typically limit the number of leaf nodes or the number of samples per leaf?
  3. What’s typically a good number of samples to have per leaf node in a decision tree?
  4. When should one use out-of-bag error (OOBE)? Or rather, when should one use OOBE versus validation error? (A minimal sketch of these knobs follows this list.)
  5. Can you use decision trees for NLP tasks? If so, I suppose one would have to create one’s own features? Or are the tokens themselves enough (e.g., does a document contain the term “delicious”)?
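
For concreteness, here’s a minimal scikit-learn sketch of the knobs questions 2-4 refer to (my own illustration, not from the lesson; the dataset and hyperparameter values are made up):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

# min_samples_leaf and max_leaf_nodes are two alternative ways to limit tree size;
# oob_score=True scores each tree on the rows left out of its bootstrap sample.
rf = RandomForestRegressor(
    n_estimators=100,
    min_samples_leaf=25,   # require at least 25 samples per leaf
    max_leaf_nodes=None,   # or cap the number of leaves instead
    oob_score=True,
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)  # R^2 on the out-of-bag samples, a "free" validation estimate
```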

The questions below have already been answered, but feel free to add to the answers!

Data:

  6. When viewing feature importance values, what is a good cut-off point (e.g., any feature with an importance below 0.05 will not be used)?

  7. Even if a dataset is not a time series, shouldn’t one still take a random subset for the validation set?

Ensembling:

  8. When ensembling a random forest and a neural network together, is a single neural network enough, or should you add more?
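
For reference, a minimal sketch of the simplest version of what I mean, i.e. averaging the two models’ predictions (the models and data below are placeholders, not a recommendation):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
nn = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=0).fit(X_train, y_train)

# A single NN contributes one member to the ensemble; "adding more" would just
# mean averaging over additional networks trained with different seeds.
ensemble_preds = (rf.predict(X_valid) + nn.predict(X_valid)) / 2
```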

To 6): I don’t think there is a general answer to your question here. I usually try limiting the model to the top x features (maybe the top 20 or so) and see if that helps. Often it does not, and then I use every feature. I think you have to try it out.
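
As a concrete sketch of what I mean (synthetic data; x = 20 is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=50, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Keep only the x most important features according to the fitted forest.
x = 20
top_idx = np.argsort(rf.feature_importances_)[::-1][:x]

# Refit on the reduced feature set and compare validation scores;
# if the reduced model isn't better, keep every feature.
rf_top = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train[:, top_idx], y_train)
print(rf.score(X_valid, y_valid), rf_top.score(X_valid[:, top_idx], y_valid))
```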
To 7): There is no generally correct answer here either; it really depends on your dataset. With your validation set you try to “mimic” the data your model will see in the wild. If you can make informed guesses about the distribution of classes your model will encounter in the wild, you should try to mimic that in your validation set. It depends on your dataset and on what the model should do later; there are instances where it is perfectly fine to construct your validation set randomly.
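
For example, if you expect the same class balance in the wild as in your data, a stratified split is one way to preserve it in the validation set (sketch with synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic two-class data with a 90/10 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the 90/10 balance in both the training and validation sets.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```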
To 8): Again, you have to try it out on your specific dataset. There is no single correct answer.


Hmm, I see: it all boils down to experimentation and understanding your data.

Thank you for your response!