Random Forest with a P >>> N problem


I am building a model to classify sick patients (0) versus non-sick patients (1) using their gene expression levels. My data is typical health care data, in which the number of variables is far greater than the number of observations (patients). I managed to get the predicted probability for each observation in my test sets. Where do I go from here?
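One possible next step with per-observation probabilities is to threshold them into class labels and look at how well each class is recovered, rather than at overall accuracy alone. The sketch below is only an illustration: the labels and probabilities are made up, `per_class_recall` is a hypothetical helper, and it assumes each probability is the predicted probability of class 1 (non-sick).

```python
def per_class_recall(y_true, y_prob, threshold=0.5):
    """Recall for class 0 and class 1 after thresholding P(class 1)."""
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        totals[label] += 1
        if predicted == label:
            hits[label] += 1
    # Skip a class that never appears, to avoid dividing by zero.
    return {c: hits[c] / totals[c] for c in (0, 1) if totals[c] > 0}


# Made-up labels and probabilities, for illustration only (assumption).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.2, 0.6, 0.7, 0.4]

recalls = per_class_recall(y_true, y_prob)
accuracy = sum((p >= 0.5) == bool(t) for t, p in zip(y_true, y_prob)) / len(y_true)

print("accuracy:", accuracy)
print("recall per class:", recalls)
```

In this toy data the overall accuracy looks reasonable while the smaller class is recovered only half the time, which is the kind of gap that per-class numbers make visible.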

Another question: judging by the probabilities, the model seems to do a great job predicting on the test sets. However, I notice that in general there are many more sick patients than non-sick patients. Since the classes are unbalanced, will this affect my model’s predictions on new data sets?

You can refer to this thread on handling unbalanced data for one of the Kaggle problems. We are still exploring approaches, so we can share and learn together.
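As one illustration of a common way to handle unbalanced classes with a random forest (not necessarily what the linked thread recommends), the sketch below uses scikit-learn's `class_weight="balanced"` option together with stratified cross-validation and a balanced-accuracy score. The data is synthetic and every size and parameter value here is an assumption chosen only to mimic a P >> N setting with unequal classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for gene expression data (assumption): 60 patients,
# 2000 features, with 45 samples in class 0 and 15 in class 1.
rng = np.random.default_rng(0)
n_samples, n_features = 60, 2000
X = rng.normal(size=(n_samples, n_features))
y = np.array([0] * 45 + [1] * 15)
X[y == 1, :10] += 1.0  # give the first 10 features a weak class signal

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Balanced accuracy averages recall over both classes, so always predicting
# the majority class scores 0.5 rather than 0.75.
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
print("balanced accuracy per fold:", scores)
print("mean:", scores.mean())
```

Comparing this score against plain accuracy on the same folds is one way to check whether good-looking results are mostly a reflection of the majority class.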

That shouldn’t be a problem - we’ll be learning techniques for handling overfitting in the next two classes, in fact! :slight_smile:

Sounds good. Looking forward to it! :grinning: