Overfitting problem - Linear Regression using Neural Net written in Keras

I am looking for suggestions for a problem that I am working on:
Problem: predict the rank of a product based on its features. The ranks are given as percentiles.
Data: 4500 rows with around 900 features for each observation.
Model: I am using an MLP in Keras with 3 dense layers.
First layer Dense(16), second Dense(4), and third Dense(1). I am also using BatchNormalization and Dropout(0.5) after each dense layer, MSE as the loss function, and SGD as the optimizer. I have tried 100 epochs so far and am getting an RMSE on test data of ~24.50.
Issue: The model is overfitting, as seen in the learning curve (train vs. test loss): train RMSE is ~5-7 but test RMSE is ~24. I tried increasing the dropout probability to 0.8, but that made things worse, and more data is absolutely not an option. I don't know how to reduce model complexity further since it is only 3 layers; I could try two layers, but I suspect that would underfit (no harm in trying?)

Are there any other suggestions? Please let me know.
Note: the data is scaled before training using StandardScaler.


import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import KFold, cross_val_score

def baseline_model_899():
    # create model
    sgd = optimizers.SGD(lr=0.001, decay=1e-6, momentum=0.9, nesterov=True)
    model = Sequential()
    model.add(Dense(16, activation='relu', kernel_initializer='normal', input_shape=(899,)))
    model.add(Dense(1, kernel_initializer='normal', activation='linear'))
    model.compile(loss='mean_squared_error', optimizer=sgd)
    return model

estimator = KerasRegressor(build_fn=baseline_model_899, epochs=200, batch_size=5, verbose=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
# Score with negative MSE so the mean of the fold scores converts to RMSE
results = cross_val_score(estimator, X_train, y_train, cv=kfold, scoring='neg_mean_squared_error')
print("RMSE:", np.sqrt(-results.mean()))

I would suggest first fitting something simple like Lasso regression in scikit-learn to get a baseline performance for your problem.
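For example, something like this (a rough sketch; the synthetic data below just stands in for your real 4500 x 899 dataset):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (smaller, so it runs quickly)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 50))
y = X[:, :5].sum(axis=1) * 10 + rng.normal(scale=2.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler().fit(X_train)

# LassoCV picks the regularization strength alpha by internal cross-validation
lasso = LassoCV(cv=5, random_state=42).fit(scaler.transform(X_train), y_train)

rmse = np.sqrt(mean_squared_error(y_test, lasso.predict(scaler.transform(X_test))))
print("Lasso baseline RMSE:", rmse)
```

With 900 features and only 4500 rows, the L1 penalty also tells you how many features actually carry signal (count the nonzero entries of `lasso.coef_`).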

Then fit a one-layer (no hidden layer) neural net to see if you can come close to that result (basically doing straight linear regression using Keras).

Then add one hidden layer.
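The no-hidden-layer model from the second step might look like this (a sketch; the input width of 899 matches your data):

```python
import numpy as np
from tensorflow import keras

# Straight linear regression in Keras: a single Dense(1) layer, no hidden layers
model = keras.Sequential([
    keras.layers.Input(shape=(899,)),
    keras.layers.Dense(1, activation="linear"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
              loss="mean_squared_error")

# 899 weights + 1 bias: exactly the parameters of ordinary linear regression
print(model.count_params())  # 900
```

If this model cannot match the Lasso baseline, the problem is in the training setup or the data prep, not in the architecture.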

I can’t see all your code here. One thing you want to make sure of is to run an evaluation on your training data; it should give the same performance as your training process reported. Then make sure there are no bugs in your test data prep (like the StandardScaler step) and that it mirrors the training data prep, using the scaler fitted on the training data.
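That scaling check can be sketched like this (toy data; the point is to reuse the training scaler's statistics on the test set):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

# Correct: fit on train only, then transform both splits with the same statistics
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Common bug: calling fit_transform on the test set re-estimates mean/std from
# the test data, so the test features no longer match what the model saw
X_test_wrong = StandardScaler().fit_transform(X_test)
print(np.abs(X_test_s - X_test_wrong).max())  # nonzero: the two versions differ
```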

Hopefully this helps. If this is an open dataset, I would suggest putting your notebook on gist.github.com so we can give specific suggestions.

I second @ramesh’s suggestion of doing the evaluation on your training data. It often happens that the test data gives strange results but then it turns out that the test data is actually in a different format than the training data (because they forgot to scale it or whatever).

Overfitting is not a bad problem to have, by the way. If your model overfits, then at least you know it has the capacity to learn.

But not every problem you run into with training models is due to overfitting. I think people draw this conclusion too quickly. There can be many other things wrong. For instance, if you run the model on test data but the BatchNormalization and Dropout layers are still in training mode, then it will appear to give very bad test results.
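You can see this effect with a Dropout-only toy model (a sketch): in inference mode Dropout is the identity, but with `training=True` it keeps zeroing and rescaling units, which would wreck test-time predictions.

```python
import numpy as np
from tensorflow import keras

# A model whose only layer is Dropout: in inference mode it is the identity
model = keras.Sequential([keras.layers.Input(shape=(4,)),
                          keras.layers.Dropout(0.5)])

x = np.ones((1, 4), dtype="float32")

inference = model(x, training=False).numpy()  # identity: returns x unchanged
training = model(x, training=True).numpy()    # units randomly zeroed, rest scaled by 2

print(inference)
print(training)
```

`model.predict` and `model.evaluate` run in inference mode by default; the problem usually comes from custom loops that call the model without setting `training=False`.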

I already tried Gradient Boosted Trees, Random Forest, ExtraTreesRegressor, and AdaBoosted decision trees from scikit-learn, with parameter tuning using GridSearchCV and both RepeatedKFold and stratified k-fold cross-validation. The best RMSE I got there is 24.3. I want to see if, with less effort and simpler code, a neural network can beat the ensemble methods, hence the attempt.
@ramesh, there is no other code; this is all the code. I define a base model, pass it to the KerasRegressor wrapper, and then run 10-fold cross-validation on the training data. As I said, there are only 4500 rows in the training data and ~900 features. There is absolutely no business knowledge available for feature engineering, and basic feature engineering proved counterproductive. But I am already imputing missing data, standard-scaling the data, etc.
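The imputation and scaling are wired up roughly like this (simplified sketch; Ridge stands in for the Keras wrapper, and the toy data stands in for the real set), with both steps inside a Pipeline so each CV fold fits them on its own training portion only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X[rng.random(X.shape) < 0.05] = np.nan          # ~5% missing values
y = rng.normal(size=200)

# Imputer and scaler inside the pipeline: each CV fold fits them on its own
# training portion, so no statistics leak from the held-out fold
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),                # stand-in for KerasRegressor
])

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=kfold, scoring="neg_mean_squared_error")
print("CV RMSE:", np.sqrt(-scores.mean()))
```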
@both of you: the code shows clearly that I am already doing cross-validation on the training data. My best performance using this model is 27, which is much worse than the ensemble methods so far.
I will post the code on gist.github.com.

@Ramesh here is the code in gist: