How do I run a statistical test to show a significant improvement?

e.g. I want to show that model A (using the Mish activation) significantly improves on model B (using ReLU).

  • What test should I choose?
  • What confidence level should I choose?
  • How many runs should I do?
  • Should I do something to control randomness?

Note: I am doing NLP.

If it is for a publication, consult a statistician to do things properly (hell, consult me if needed :P) or, at least, check the literature in your domain to see what methodology is commonly used.

The following are good defaults that are unlikely to be challenged:

  • a t-test is the default test to check whether two means are different (it has some assumptions, but you should be fine)
  • 95% (a p-value of 0.05) is traditional and fairly well suited if you do not have thousands of datapoints
  • no fewer than 30 runs; 100 or 200 would be good
  • randomness will increase your variance, but if you have enough samples it will make your result more representative of what other users can expect given their own seeds, so I would recommend keeping it
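As a sketch of that default recipe, here is how the two-sample comparison could look with scipy (the scores below are made-up numbers purely for illustration; in practice you would use the metric from your own repeated runs):

```python
from scipy import stats

# Hypothetical accuracy scores from repeated runs with different seeds.
mish_scores = [0.841, 0.848, 0.839, 0.852, 0.845, 0.850, 0.843, 0.847]
relu_scores = [0.836, 0.840, 0.833, 0.842, 0.838, 0.841, 0.835, 0.839]

# Welch's variant (equal_var=False) drops the equal-variance assumption,
# which is a safe default when comparing two training setups.
t_stat, p_value = stats.ttest_ind(mish_scores, relu_scores, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is significant at the 95% confidence level.")
```

Note that with only 8 runs per model, as here, the test is underpowered; the 30+ runs recommended above make the result far more trustworthy.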

I’d be mindful of this. It will be discussed in depth in Lesson 2 of the upcoming course; in short, null hypothesis significance testing can be flawed.

A few links (as I can’t post the videos):

The ASA’s Statement on p-Values: Context, Process, and Purpose

Null Hypothesis Significance Testing Never Worked

And then published on Springer:

Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations


Testing for statistical significance requires either

  • assumptions on the distribution of error terms (for example), from which you can derive the distribution of fit statistics like log loss that also incorporate stochastic gradient descent effects. I am not aware of common assumptions or papers. OR
  • some other means of estimating the distribution: that would usually mean repeating the experiment many times, which is not feasible for complex models.

Re. nestorDemeure’s suggestion of a t-test: I think that’s a reasonable start, but I actually suspect the test loss can be non-normal enough to make it invalid, e.g. in the presence of overfitting (where it can fluctuate massively on a whim).

I’d be curious to see sampling results from a simpler model (it might show the t-test is adequate), if you have time :slight_smile:


Haha yes, statistical significance tests are usually flawed due to:

  • misinterpretation of what they are saying
  • distributional assumptions not being valid

Yes it can, but I am giving defaults here. If you do not have a huge sample and are not an expert in the domain, it is good enough and unlikely to change your results.

If you want to go further, you need to consult someone competent who will actually look at your data and make an informed decision :slight_smile:

Hi, thanks for all the comments.
Only today did I learn that testing is also a big thing.

I am doing NLP pretraining.
First, I pretrain:

  • training a model takes 4 days on 1 V100
  • the data is pure text (Wikipedia and the Book Corpus)
  • the data is hundreds of billions of tokens

Then I finetune on the GLUE benchmark (which includes 9 corpora) to get the score of the pretrained model:

  • 9 corpora (tasks) of different sizes
  • previous studies report the best of 10 runs
  • 10 runs take 1 day on 1 V100

Since my setting is so computationally intensive, how can I practically apply a statistical test?
Or is there another suggested way to show that one model is better than another?

You can still do tests with smaller samples, but they will be less powerful, meaning that you will need to maximize the power available.

In those conditions, if possible, I would recommend a paired t-test (once more a t-test, it’s not a coincidence: this test was designed with small sample sizes in mind). The idea is to use the same seed (and other sources of randomness) for your Mish and non-Mish samples so that your samples come in pairs, like so (note that you still vary the seed from one pair to the next):

  • sample 1 mish (seed=42) | sample 1 non-mish (seed=42)
  • sample 2 mish (seed=53) | sample 2 non-mish (seed=53)

This should increase the power significantly while not doing all the analysis on a single seed (you could get trapped in a lucky/unlucky case).
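A minimal sketch of this paired design with scipy (the GLUE-style scores are invented for illustration; each row of the two lists corresponds to one shared seed):

```python
from scipy import stats

# One entry per seed; each seed is run with both activations.
#              seed=42  seed=53  seed=64  seed=75  seed=86
mish_scores = [80.2,    79.8,    80.5,    80.1,    79.9]
relu_scores = [79.6,    79.5,    79.9,    79.8,    79.4]

# ttest_rel tests the mean of the per-seed differences against zero,
# so the seed-to-seed variance cancels out and power increases.
t_stat, p_value = stats.ttest_rel(mish_scores, relu_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The pairing is what buys you power: even if absolute scores swing a lot between seeds, the per-seed difference can be stable.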

I would recommend at least 5 samples per category; you cannot get away with fewer.
A p-value of 0.05 might be too stringent in those conditions (you can use this approach to find a better p-value; I have a personal implementation of the concept in R if needed).

Another possibility: if you have a lot of samples in one category (because you already ran many tests with or without Mish), you could do a test with different sample sizes and gain power from the larger sample.

Obviously you will have to point out to your readers that your power is limited due to the small sample size (you could say that this is a preliminary study to check whether adding Mish to this problem is worth further exploration).
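One way to quantify that limitation is a rough power calculation for the paired design. This sketch uses statsmodels; the effect size (Cohen's d on the paired differences) is an assumed value, not something measured in this thread:

```python
from statsmodels.stats.power import TTestPower

# TTestPower covers the one-sample / paired t-test case.
analysis = TTestPower()

# With 5 seed pairs and a large assumed effect (d = 1.5), what is the
# probability of detecting the difference at alpha = 0.05?
power = analysis.solve_power(effect_size=1.5, nobs=5, alpha=0.05)
print(f"power = {power:.2f}")
```

If the resulting power is well below the conventional 0.8, that is exactly the caveat to state: the study can only detect fairly large effects.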


Thanks, this is super useful.
5 samples may be the most my computational environment allows me. :joy:
I’ve seen the paper, but I don’t have a statistics background, so it’s a little bit hard for me.
I’ll ask my professor to see if he knows something.
Could you send me your implementation of that concept?
It would be super helpful.
Thanks in advance!

I sent you a PM with my code and further information on the subject :slight_smile:
