Is Random Forest the right algorithm to use if my dataset has far fewer rows than columns?
I have a tabular dataset (containing information on different molecules) with 320 rows and 1030 columns/features. 1024 of them are binary, containing only 0s and 1s (each of these 1024 columns signifies the presence of a substructure in the molecule, 1 indicating presence and 0 indicating absence). The remaining 6 columns are different quantitative properties of the molecule, present as floats.
My ideal end goal is to identify the most important features/columns that drive high accuracy, as that would allow me to backtrack to the important substructures (or other properties) of these molecules that are contributing factors.
Given my dataset size constraints, I would like to know if this is a feasible approach, or whether I should tackle this in a different manner or just try to get more data. Ideally I'm looking for a solution that works with the present dataset size.
I think you’ll just have to try a bunch of them.
I’d strongly recommend trying Naive Bayes (as a baseline) and then Logistic Regression, as per Jeremy’s approach: NB-SVM strong linear baseline | Kaggle
When you try NB-LogReg you’re going to want to bin those 6 continuous columns. You can use sklearn’s KBinsDiscretizer for that.
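A minimal sketch of that binning step, assuming scikit-learn, with random data standing in for the 6 float columns:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X_cont = rng.normal(size=(320, 6))  # stand-in for the 6 continuous columns

# 'ordinal' keeps one output column per input column;
# 'quantile' makes each bin hold roughly the same number of rows
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X_cont)

print(X_binned.shape)       # (320, 6)
print(np.unique(X_binned))  # bin indices 0..4
```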
@gautam_e I forgot to mention, my end goal is to predict a numerical value (my label column is not categorical). I was trying a LightGBM regressor but am unable to get a high R^2 score, so I was thinking of using a Random Forest regressor / some ensemble method to prevent overfitting on my dataset with few rows. Should I still look into the approach you mentioned? How do you think I should proceed?
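For context, here is a minimal sketch of the Random Forest feature-importance idea I have in mind, with synthetic data standing in for my 320 × 1030 table (the target and column roles are made up purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_bits = rng.integers(0, 2, size=(320, 1024)).astype(float)  # substructure flags
X_floats = rng.normal(size=(320, 6))                         # quantitative properties
X = np.hstack([X_bits, X_floats])

# Toy target: depends on substructure 0 and the float property at column 1025
y = 2.0 * X[:, 0] + X[:, 1025] + rng.normal(scale=0.1, size=320)

# Shallow trees help limit overfitting when rows << columns
rf = RandomForestRegressor(n_estimators=300, max_depth=5, random_state=0)
rf.fit(X, y)

# Rank columns by impurity-based importance to backtrack to substructures
top10 = np.argsort(rf.feature_importances_)[::-1][:10]
print(top10)
```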
Most ML algorithms expect more samples than features; otherwise they will very easily overfit. So the best option, when possible, is to seek more data.
When there is no choice, for a regression problem you could use a linear SVM regressor.
Also, before that, you can analyse the features and use feature selection. For instance, compute a correlation matrix, identify groups of highly correlated variables, and keep only one per group. Alternatively, you can run a PCA on your (numeric) features and keep only the first n principal components.
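Both ideas can be sketched like this with numpy and scikit-learn (the 0.95 threshold and the component count are illustrative, not recommendations):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(320, 50))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=320)  # two nearly identical columns

# Correlation filtering: drop the later column of each pair with |r| > 0.95
corr = np.abs(np.corrcoef(X, rowvar=False))
upper = np.triu(corr, k=1)  # only look above the diagonal
to_drop = sorted({int(j) for _, j in zip(*np.where(upper > 0.95))})
X_filtered = np.delete(X, to_drop, axis=1)

# PCA alternative: keep the first n principal components
X_pca = PCA(n_components=10).fit_transform(X)
print(X_filtered.shape, X_pca.shape)  # (320, 49) (320, 10)
```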
Finally, make sure to use regularisation.
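One concrete option, assuming scikit-learn: an L1-regularised (lasso) regression, which drives most coefficients to exactly zero and so doubles as feature selection (synthetic data for illustration):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(320, 1030))
y = 3.0 * X[:, 0] - 2.0 * X[:, 7] + rng.normal(scale=0.1, size=320)

# LassoCV picks the regularisation strength alpha by cross-validation;
# the L1 penalty zeroes out uninformative coefficients
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(len(selected))  # far fewer than 1030
```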
There’s a good post explaining this here.