Hello guys!
I am a beginner in machine learning and I want to check a few things after watching the first 3 lectures.
1- In the first lecture, the first model that was created was a Random Forest (RF) because it does not need specific preparation of the data, right?
2- Can I start each project with an RF as a general first step?
3- In the 4th lecture (RF interpretation and feature importances), he got the feature importances from the RF. My question is: do I need to create a random forest with its default parameters to get the importances, or should I try to find the best parameters for the RF before getting them?
4- After I get the important features, should I recreate my train and valid sets from the original data by selecting only those features or continue using my first split (which I used to get the important features)?
Hi, I recently went through the fastai ML course myself.
In the course, Jeremy basically shows you two types of models: tree-based and SGD-based (like logistic regression and NNs). RFs are a great starting point because they need minimal data pre-processing and hyper-parameter tuning (unlike Gradient Boosting Machines or NNs).
RFs are a great starting point for tabular data, i.e. data in the form of spreadsheets or database tables. For unstructured data, e.g. images or text, it’s better to use NNs.
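To make the "minimal pre-processing" point concrete, here is a rough sketch of an RF baseline using scikit-learn and pandas. The data and column names (size, age, city) are made up for illustration, not taken from the course. The only preparation is turning the categorical column into integer codes; there is no scaling or one-hot encoding.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic tabular data (hypothetical columns, for illustration only).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "size": rng.uniform(20, 200, n),
    "age": rng.integers(0, 50, n),
    "city": rng.choice(["A", "B", "C"], n),
})
city_effect = df["city"].map({"A": 0.0, "B": 20.0, "C": 50.0})
y = 3.0 * df["size"] - 1.5 * df["age"] + city_effect + rng.normal(0, 5, n)

# Minimal preparation: replace the categorical column with integer codes.
X = df.copy()
X["city"] = X["city"].astype("category").cat.codes

# Simple hold-out split: first 400 rows for training, last 100 for validation.
X_train, X_valid = X.iloc[:400], X.iloc[400:]
y_train, y_valid = y.iloc[:400], y.iloc[400:]

# Default hyper-parameters, apart from a fixed seed for reproducibility.
baseline = RandomForestRegressor(random_state=0)
baseline.fit(X_train, y_train)
baseline_r2 = baseline.score(X_valid, y_valid)
print(f"validation R^2: {baseline_r2:.3f}")
```

Whatever score this gives becomes the number to beat when you later try tuning or other model types.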
Good question. I’m pretty sure this wasn’t covered in the course. My guess is that it doesn’t matter too much. You could fit one forest with default parameters and one with tuned parameters, then check whether the feature importances differ a lot.
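If you want to try the comparison, something along these lines should work. It is only a sketch on synthetic data, and the "tuned" values are hand-picked for illustration rather than the result of an actual search. It uses scikit-learn's `feature_importances_` attribute on both forests.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 10 features, only the first 4 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (10 * X[:, 0] + 6 * X[:, 1] + 4 * X[:, 2] + 2 * X[:, 3]
     + rng.normal(size=600))

# Forest with default hyper-parameters.
rf_default = RandomForestRegressor(random_state=0).fit(X, y)

# Forest with some hand-picked settings (illustrative, not a recommendation).
rf_tuned = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=3, max_features=0.5, random_state=0
).fit(X, y)

imp_default = rf_default.feature_importances_
imp_tuned = rf_tuned.feature_importances_

# Rank features from most to least important under each forest.
rank_default = np.argsort(imp_default)[::-1]
rank_tuned = np.argsort(imp_tuned)[::-1]

# Correlation between the two importance vectors: close to 1 means
# both forests tell roughly the same story.
corr = np.corrcoef(imp_default, imp_tuned)[0, 1]

print("top 4 (default):", rank_default[:4])
print("top 4 (tuned):  ", rank_tuned[:4])
print("correlation of importances:", round(corr, 3))
```

If the top features and their rough ordering agree, the choice of parameters probably isn't critical for this step on your data.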
If you want comparability with your model’s performance before dropping unimportant features, you should use the same train/valid split.
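In code, that means splitting once and only changing the columns afterwards. A minimal sketch on synthetic data; the 0.01 importance threshold is an arbitrary value I picked for the example, so choose your own cut-off.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: 10 features, only the first 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = 10 * X[:, 0] + 6 * X[:, 1] + 4 * X[:, 2] + rng.normal(size=600)

# Split ONCE and reuse this split for every model being compared.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model on all features.
rf_full = RandomForestRegressor(random_state=0).fit(X_train, y_train)
score_full = rf_full.score(X_valid, y_valid)

# Indices of features whose importance exceeds the illustrative threshold.
keep = np.where(rf_full.feature_importances_ > 0.01)[0]

# Same rows as before; only the columns change.
rf_small = RandomForestRegressor(random_state=0).fit(X_train[:, keep], y_train)
score_small = rf_small.score(X_valid[:, keep], y_valid)

print("kept features:", keep)
print(f"R^2 all features:  {score_full:.3f}")
print(f"R^2 kept features: {score_small:.3f}")
```

Because both models are scored on the same validation rows, any change in the score comes from dropping features and not from a different split.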