I have a dataset which comprises approximately 2600 rows (data points) and 50 columns (features). I am trying to solve a regression problem as my response variable is numeric.
I tried Random Forests, but the R² on the test set is extremely low compared to the R² on the training set.
I thus decided to give Neural Networks a try and trained a simple network with just one hidden layer. I now have two problems.
- If I do not normalize the input data, the loss goes to infinity and the output of the network is simply NaN
- If I do normalize the input data (in this case, with MinMaxScaler from scikit-learn), the MSE is about 0.5, but I found out that all the network does is predict the mean of the response variable. So, even though the response variable (test set) ranges from 0 to 5, the predictions only range from about 1.1 to 1.3.
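To make the second symptom concrete, here is a minimal sketch of the kind of check that shows the collapse towards the mean. It uses only the Python standard library; the helper names `r2_score_plain` and `spread_ratio` are my own, and the numbers are made-up values in the same ranges as above, not my real data.

```python
from statistics import mean, pstdev


def r2_score_plain(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def spread_ratio(y_true, y_pred):
    """Std of the predictions divided by std of the targets.

    A value close to 0 means the predictions barely vary,
    i.e. the model outputs (almost) a constant.
    """
    return pstdev(y_pred) / pstdev(y_true)


# Illustrative values only (assumption): targets span 0 to 5,
# predictions stay between 1.1 and 1.3.
y_test = [0.0, 0.5, 1.0, 1.5, 2.5, 5.0]
y_pred = [1.1, 1.2, 1.2, 1.3, 1.2, 1.3]

print("R2:", r2_score_plain(y_test, y_pred))
print("spread ratio:", spread_ratio(y_test, y_pred))
```

A constant prediction equal to the mean of the targets gives an R² of exactly 0, so an R² near 0 together with a spread ratio near 0 is what I mean by "predicting just the mean".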
The network is rather simple and I'm training it for just 100 epochs. Moreover, the loss is more or less the same at every epoch (0.8 for the first one and then 0.7 for the remaining ones).
Why is it predicting just the mean? Is the model too simple? I tried adding another hidden layer but the results do not change at all. I even removed correlated features, but again I obtain the same results.