Random Forest with a P >>> N problem


I am building a model to classify sick patients (0) versus non-sick patients (1) using their gene expression levels. My data is typical health care data, in which the number of variables is far greater than the number of observations (one per patient). I managed to get the predicted probability for each observation in my test sets. Where do I go from here?

Another question: judging by the probabilities, the model seems to do a great job predicting on the test sets. However, I notice that overall there are many more sick patients than non-sick patients. Since the classes are unbalanced, will this affect my model’s predictions on new data sets?
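
One possible next step with the predicted probabilities, as a rough sketch only: turn them into class labels at a chosen threshold and look at per-class error counts and a ranking metric such as ROC AUC, since plain accuracy can look good on unbalanced classes. The labels and probabilities below are made up for illustration, and the function names are hypothetical; class 1 (non-sick) is treated as the "positive" label only to match the coding in the post.

```python
# Hypothetical test-set labels (0 = sick, 1 = non-sick, as coded in the post)
# and hypothetical predicted probabilities of class 1. Not real data.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
p_nonsick = [0.10, 0.20, 0.35, 0.05, 0.60, 0.15, 0.80, 0.55, 0.30, 0.90]


def confusion_counts(labels, probs, threshold=0.5):
    """Return (tp, fp, tn, fn), treating class 1 as the positive label."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_auc(labels, probs):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))


tp, fp, tn, fn = confusion_counts(y_true, p_nonsick, threshold=0.5)
recall_nonsick = tp / (tp + fn)  # share of non-sick patients found
recall_sick = tn / (tn + fp)     # share of sick patients found
balanced_accuracy = (recall_nonsick + recall_sick) / 2
auc = roc_auc(y_true, p_nonsick)
```

The 0.5 threshold is just a default; with unbalanced classes it may be worth trying other thresholds and comparing the per-class recalls rather than relying on overall accuracy.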

You can refer to this thread on handling unbalanced data in one of the Kaggle problems. We are still exploring approaches, so we can share and learn.
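
One common way to handle unbalanced classes, shown here only as a sketch and not necessarily what the linked thread recommends, is to weight each class inversely to its frequency so that errors on the rarer class cost more. The helper name and the 80/20 split below are made up for illustration. The formula is the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is also what scikit-learn's `class_weight="balanced"` option computes.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


# Hypothetical training labels: 80 sick (0) and 20 non-sick (1) patients.
labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(labels)
# The rarer class (1) gets the larger weight: {0: 0.625, 1: 2.5}
```

If the model is a scikit-learn random forest, passing `class_weight="balanced"` to `RandomForestClassifier` should have the same effect without computing the weights by hand.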

That shouldn’t be a problem - we’ll be learning techniques for handling overfitting in the next two classes, in fact! :slight_smile:

Sounds good. Looking forward to it! :grinning: