Random Forest with a P >>> N problem

Hi,

I am building a model to classify sick patients (0) versus non-sick patients (1) using their gene expression levels. My data is very typical healthcare data, in which the number of variables is far greater than the number of observations (patients). I managed to get a predicted probability for each observation in my test set. Where do I go from here?
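
For reference, my workflow is roughly along these lines. This is just a minimal sketch using scikit-learn with a synthetic stand-in matrix (I can't post the real expression data), and the parameter values are only illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the expression matrix: many more genes (p) than patients (n)
rng = np.random.default_rng(0)
n_patients, n_genes = 120, 5000
X = rng.normal(size=(n_patients, n_genes))
y = rng.integers(0, 2, size=n_patients)   # 0 = sick, 1 = non-sick

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# Predicted probability of class 1 (non-sick) for each test patient
probs = rf.predict_proba(X_test)[:, 1]
```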

Another question: it seems like the model is doing a great job predicting on the test set, judging by the probabilities. However, I notice that in general there are many more sick patients than non-sick patients. Since the classes are somewhat unbalanced, will this affect my model's predictions on new data sets?


You can refer to this thread on handling unbalanced data for one of the Kaggle problems. We are still exploring approaches there, so we can share and learn:
http://forums.fast.ai/t/porto-seguro-s-safe-driver-prediction-dealing-with-unbalanced-data/?source_topic_id=6894
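
One common starting point in the meantime is to reweight the classes inside the forest itself. This is only a sketch assuming scikit-learn, reusing the kind of train/test split you describe:

```python
from sklearn.ensemble import RandomForestClassifier

# 'balanced' weights each class inversely to its frequency in the training data,
# so the rarer non-sick class counts proportionally more when the trees are grown.
rf = RandomForestClassifier(
    n_estimators=500,
    class_weight="balanced",
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)                 # X_train, y_train: your existing training split
probs = rf.predict_proba(X_test)[:, 1]   # X_test: your existing test split
```

There is also `class_weight="balanced_subsample"`, which recomputes the weights on each tree's bootstrap sample rather than once on the whole training set.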


That shouldn’t be a problem - we’ll be learning techniques for handling overfitting in the next two classes, in fact! :slight_smile:


Sounds good. Looking forward to it! :grinning: