I’m building a random forest and I’m getting really terrible results. I tried to just follow the lesson1 steps, but my data is giving me an r^2 of -0.14. My question is: what are the next steps at that point? Is it back to the drawing board, or is there still something I can do? One thing I’m considering is some sort of oversampling, since only 3.5% of my target values are 1 and the rest are 0. Does that seem like a good next step with a starting r^2 this bad? Or should I look at my data quality first and make sure I get a decent r^2 out of the gate before trying anything extra?
Any advice on this would be much appreciated. Here is what my numbers look like:
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_trn, y_trn)
print_score(m)
[0.08436980063282429, 0.192801180918272, 0.7956208204063915, -0.14288877498635988]
After looking into this further, I think the problem is with how r^2 is calculated compared to what I actually care about. Only 3.5% of my values are 1 so when I look at the predictions, I see a lot of Actual = 1 Predicted = 0.1.
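To make the r^2 point concrete, here is a minimal sketch on made-up data (1000 rows, 35 of them equal to 1, chosen only to mirror the 3.5% rate). It computes r^2 as 1 - SS_res / SS_tot and compares three sets of predictions: always predicting the mean, predicting 0.1 for every actual 1, and the same thing plus some false alarms of 0.3 on a hundred actual 0s. The array names and the specific prediction values are assumptions for illustration, not taken from the real model.

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic target: 1000 rows, 35 positives (3.5%). Illustrative only.
y_true = np.zeros(1000)
y_true[:35] = 1.0

def r2_manual(y, p):
    """r^2 = 1 - SS_res / SS_tot, the same definition r2_score uses."""
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Baseline: predict the target mean for every row. This gives r^2 = 0.
pred_mean = np.full_like(y_true, y_true.mean())

# "Timid" model: 0.1 for every actual 1, 0.0 for every actual 0.
pred_timid = np.where(y_true == 1, 0.1, 0.0)

# Same as above, but with 0.3 predicted for 100 of the actual 0s.
pred_noisy = pred_timid.copy()
pred_noisy[35:135] = 0.3

r2_mean = r2_manual(y_true, pred_mean)
r2_timid = r2_manual(y_true, pred_timid)
r2_noisy = r2_manual(y_true, pred_noisy)

print("mean baseline:", r2_mean)
print("timid model:  ", r2_timid)
print("noisy model:  ", r2_noisy)  # negative: worse than predicting the mean
```

In this toy setup, a negative r^2 just means the squared error is larger than that of always predicting the mean. With so few 1s, a handful of modest false alarms is enough to push it below zero.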
This leads me to believe I should try oversampling to bring the share of 1s in the training data up to 50%, and then adjust the threshold later when I’m deciding what actually counts as a “1”.
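A rough sketch of that plan, assuming scikit-learn and a synthetic dataset from make_classification standing in for the real data. The names X_bal, y_bal, and threshold, and the 0.3 cutoff, are placeholders for illustration. Only the training split is oversampled here, so the validation set keeps the original class balance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data: roughly 3.5% positives.
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.965], random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Oversample the positives (with replacement) in the training set only,
# until there are as many 1s as 0s.
neg = X_trn[y_trn == 0]
pos = X_trn[y_trn == 1]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)

X_bal = np.vstack([neg, pos_up])
y_bal = np.concatenate([np.zeros(len(neg), dtype=int),
                        np.ones(len(pos_up), dtype=int)])

m = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
m.fit(X_bal, y_bal)

# Column 1 of predict_proba is the estimated probability of class 1.
proba = m.predict_proba(X_val)[:, 1]

# Arbitrary cutoff for illustration; it would need tuning on validation data.
threshold = 0.3
pred = (proba >= threshold).astype(int)

print("share of 1s after oversampling:", y_bal.mean())
print("predicted 1s on validation:", pred.sum(), "of", len(pred))
```

A related option that avoids duplicating rows is passing class_weight="balanced" to RandomForestClassifier; whether it works better than oversampling here would have to be checked on the actual data.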
On the bright side, I am getting a great score from RandomForestClassifier (its score() reports accuracy rather than r^2), but unfortunately the only reason it’s good is that it predicts almost everything as 0.
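The effect is easy to reproduce on made-up labels. The sketch below assumes the number being reported is the classifier's default score, which for scikit-learn classifiers is mean accuracy. It scores a "model" that predicts 0 for every row against a target with 3.5% ones, then looks at recall, precision, and the confusion matrix, which expose the problem that accuracy hides.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Synthetic labels: 1000 rows, 35 positives (3.5%). Illustrative only.
y_true = np.zeros(1000, dtype=int)
y_true[:35] = 1

# A degenerate "model" that predicts 0 for every row.
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted.
prec = precision_score(y_true, y_pred, zero_division=0)
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)

print("accuracy: ", acc)   # high, purely from the class imbalance
print("recall:   ", rec)   # none of the actual 1s are found
print("precision:", prec)
print(cm)
```

For a target this imbalanced, recall and precision on the 1 class (or the confusion matrix itself) are more informative than accuracy when comparing models.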