Is there any way to improve performance on the test set?

Background:

I am supporting a client on a project building a deep neural network model for a binary classification problem with a relatively small sample size. There are two sample groups: one with around 300+ samples and the other with around 100. Each sample has about 20,000 features.

First, a train/test split is made, with 20% of the data held out as a test set and never used in training.
Second, k-fold cross-validation on the training data is used to train the model.
Third, the test set is used to evaluate candidate models (a sketch of this protocol is shown below).
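
A minimal sketch of this protocol with scikit-learn and Keras (the variable names X, y and the build_model factory are placeholders, not the client's actual code):

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# X: (n_samples, 20000) feature matrix, y: binary labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X_train, y_train):
    model = build_model()  # hypothetical model factory
    model.fit(X_train[train_idx], y_train[train_idx],
              validation_data=(X_train[val_idx], y_train[val_idx]),
              epochs=50, batch_size=32, verbose=0)

# the held-out test set is only touched once, at the very end:
# model.evaluate(X_test, y_test)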

Symptom:
The metrics (accuracy, precision, recall) on the test set do not improve, even though they are good on the train and validation sets.

The things I have tried:

Reduced feature dimensions via PCA to 5, 25, and 100 components separately; each time a much larger model (up to 5 layers, 4096 -> 2048 -> 512 -> 128 -> 64) was needed to perform well on the train and validation sets, but it still did not perform well on the test set.
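
For reference, the PCA step looked roughly like this (a sketch only; the key point is that the projection is fit on the training data and then reused on the validation and test sets):

from sklearn.decomposition import PCA

pca = PCA(n_components=25)                   # 5, 25, or 100, as tried above
X_train_pca = pca.fit_transform(X_train)     # fit on the training data only
X_val_pca = pca.transform(X_val)             # reuse the same projection
X_test_pca = pca.transform(X_test)
print(pca.explained_variance_ratio_.sum())   # variance retained by the components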

Applied dropout in various combinations, and batch norm with and without dropout.
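
One of those combinations might look like this in Keras (a sketch; the layer sizes and dropout rates are placeholders, not the exact configuration used):

import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_dim):
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(512, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),   # binary classification head
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
    return model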

Tried “Versatile Nonlinear Feature Selection Algorithm for High-dimensional Data” (https://github.com/riken-aip/pyHSICLasso) to analyze why this is the case.

Is it insufficient data to get the big picture?

Using the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso), a black-box (nonlinear) feature selection method that considers the nonlinear input-output relationship, I found that the important features differ across the three sets: the full data set, the train set, and the test set. The full data set and the train set are almost the same, with only small differences, but one of the big problems is the difference between the train set and the test set. Is this why the test set does not perform well on what was learned from the train set?
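
The HSIC Lasso comparison was done roughly like this (a sketch based on the pyHSICLasso README; the number of selected features is an arbitrary choice here):

from pyHSICLasso import HSICLasso

def top_features(X, y, num_feat=50):
    # return the indices of the top HSIC Lasso features for one data split
    hsic = HSICLasso()
    hsic.input(X, y)              # X: (n_samples, n_features), y: class labels
    hsic.classification(num_feat)
    return hsic.get_index()

idx_all = top_features(X, y)
idx_train = top_features(X_train, y_train)
idx_test = top_features(X_test, y_test)

# how much the selected feature sets overlap between splits
print(len(set(idx_train) & set(idx_test)), "features shared between train and test")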

Considering the very limited options for acquiring more data (I cannot find any more), is there any way to improve performance on the test set?

How many epochs are you training for? It sounds like you’re overfitting your training data (which may not be a bad thing if done well, but here it seems to be limiting you). Have you experimented with training for less time? Also, perhaps keep your test set at 10% (so you have a 70/20/10 split) so you can train a bit more on the data?


Hi @muellerzr, thanks for the reply. Yes, it is a fight against overfitting. Currently I am training for 50 epochs. And yes, I tried splitting off 10% for the test set and applied early stopping too, which did not seem to improve things much.
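
For reference, the early stopping was wired up in Keras roughly like this (a sketch; the patience value and monitored metric are assumptions):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=50, batch_size=32,
          callbacks=[early_stop])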

That’s quite a bit of training! Have you tested what 10, 20, 30, and 40 epochs look like? (doing just one model to save time before an ensemble over the k folds). I’ve found early stopping helps, but my models can still overfit occasionally (and I prefer to watch it myself incrementally too).


Is this tabular data?
This is just playing around with different things:
If yes, then you could try using a random forest to filter out the important features and then training on just those features.
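
Something along these lines (a sketch; the number of trees and the number of features to keep are arbitrary choices here):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# keep, say, the 100 most important features and train the network on those
top_idx = np.argsort(rf.feature_importances_)[::-1][:100]
X_train_small = X_train[:, top_idx]
X_test_small = X_test[:, top_idx]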

And just to clarify (as you haven’t mentioned it):
are you using the fit-one-cycle policy (and, in general, the fastai way of doing things)?


I studied the data further and plotted it in 2-D and 3-D via PCA. Is there a way to classify it?

Regarding the relatively small data set (sample size = 300), I have the following training strategy:

For a number of iterations:

  1. run train_test_split (from sklearn, random_state=None), keeping 10% of the data for a test set not used in training,

  2. run 5-fold training on the train set (further splitting the train set into train and validation sets).

Each iteration runs steps 1 and 2; after all iterations, I select the best model (sketched below). Is this a valid training strategy?
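
In code, that strategy would look roughly like this (a sketch; n_iterations and build_model are placeholders):

from sklearn.model_selection import train_test_split, StratifiedKFold

results = []
for it in range(n_iterations):
    # 1. a fresh 90/10 split each iteration (random_state=None means a different split every time)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, stratify=y)

    # 2. 5-fold cross-validation on the 90% training portion
    skf = StratifiedKFold(n_splits=5, shuffle=True)
    for train_idx, val_idx in skf.split(X_tr, y_tr):
        model = build_model()                  # hypothetical model factory
        model.fit(X_tr[train_idx], y_tr[train_idx],
                  validation_data=(X_tr[val_idx], y_tr[val_idx]),
                  epochs=50, batch_size=32, verbose=0)
        results.append((it, model.evaluate(X_te, y_te, verbose=0), model))

# note: because the 10% test split is re-drawn every iteration,
# the scores stored in results come from different test sets.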

My input data is some kind of structured data (like a table, 20,000 * 16) obtained after embedding.

After some iterations:

It does not look like you’re using fastai.
Have you tried finding the best learning rate and using learning-rate and momentum annealing?

No, I am not using fastai. I am using Keras with the built-in Adam optimizer. Any suggestions on this? Thanks a lot.

So, one of the most effective fastai methodologies is using an effective learning rate.
Jeremy states that a majority of the learning happens in the first few epochs, so choosing a good learning rate is very important.

Start with a very low start_lr and increase it at each mini-batch until it reaches a very high end_lr. The Recorder will record the loss at each iteration. Plot those losses against the learning rate to find the optimal value just before the loss diverges.

This is from this paper; fastai has it built in.

So to replicate you can refer to this

This helps you find an optimal learning rate.
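
If you want to replicate the learning-rate finder outside fastai (your model is in Keras), here is a minimal sketch of the LR range test as a callback, assuming TF 2.x Keras and an already compiled model:

import numpy as np
import tensorflow as tf

class LRRangeTest(tf.keras.callbacks.Callback):
    # increase the learning rate exponentially each mini-batch and record the loss
    def __init__(self, start_lr=1e-7, end_lr=10.0, num_batches=100):
        super().__init__()
        self.lrs = np.geomspace(start_lr, end_lr, num_batches)
        self.seen_lrs, self.losses = [], []

    def on_train_batch_begin(self, batch, logs=None):
        lr = self.lrs[min(batch, len(self.lrs) - 1)]
        tf.keras.backend.set_value(self.model.optimizer.lr, lr)
        self.seen_lrs.append(lr)

    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs["loss"])

# usage: one epoch is enough, then plot losses against learning rates on a log scale
# finder = LRRangeTest()
# model.fit(X_train, y_train, batch_size=32, epochs=1, callbacks=[finder])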
As per the fastai methodology, you should then use a slanted learning-rate schedule, following the fit-one-cycle policy described here.


Thank you for the suggestions!
I tried my case with fastai, and the results were extremely impressive.

I just created a DataBunch with my train, validation, and test sets like the following:

import torch.utils.data as tdatautils
from fastai.basic_data import DataBunch

train_ds = tdatautils.TensorDataset(X_train, y_train)
valid_ds = tdatautils.TensorDataset(X_val, y_val)
test_ds = tdatautils.TensorDataset(X_test, y_test)

batch_size = 32
g_data_bunch = DataBunch.create(train_ds, valid_ds, bs=batch_size)
test_bunch = DataBunch.create(test_ds, test_ds, bs=batch_size)

I built a simple NN model, created a Learner with my train and validation sets, and trained it:

learner.fit(10)  # (my fastai version is '1.0.54'; I did not find a fit_one_cycle method)

I created a separate DataBunch with my test set (above) and ran the evaluation like this:

learner.validate(test_bunch.valid_dl)

It gives me very high results. Is this the correct way to run an evaluation on my test set?

Yes, I believe that is the correct way to predict on your test set.

Use the validate and predict methods.
You should try to verify the results via the predict method as well (see the sketch below).
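
For example, one way to cross-check the validate() numbers is to pull out raw predictions with get_preds (the batch counterpart of predict) and compute accuracy by hand. A minimal sketch, assuming fastai v1 as above and that the test set was passed as the valid_ds of test_bunch:

from fastai.basic_data import DatasetType

learner.data = test_bunch                      # point the learner at the test DataBunch
preds, targets = learner.get_preds(ds_type=DatasetType.Valid)
pred_classes = preds.argmax(dim=1)
print("manual accuracy:", (pred_classes == targets).float().mean().item())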


Thanks! But when I tried the predict method, the following is what I got. Is something wrong with my code?

I was wondering why the losses are negative.