Lesson1 Reproducible Results - Setting seed not working

dserles · February 11, 2019, 10:11pm

I’m using the Cats v. Dogs code from lesson 1 with my own data. I tried setting the random seed for pytorch, numpy, and python, but I’m getting different results every time I run the learner.

Does anyone have any ideas on what might be going on? For reference, I have about 2000 training images and 400 validation images. Both sets are split evenly between my two classes. Here are a couple pictures

KarlH · February 11, 2019, 11:10pm

It looks to be caused by overfitting. Your training loss drops way down while your validation loss dips a bit, then increases. The gap between training and validation loss is huge, and improvements on the training set are not reflected in the validation set. Given that the model has overfit, the numbers you’re seeing for validation loss and accuracy are likely just noise.

mike.moloch · February 11, 2019, 11:15pm

I have the same problem with the lesson v3 (PETS) data set. I’m scratching my head as to why the results differ when I just reload my saved model and run it with the exact same data set, seed, learning rate, and number of epochs. The difference is only about 1%, but it still starts out way higher than it does for Jeremy in the lecture and doesn’t go down that much.

Pomo · February 12, 2019, 6:30pm

I have also struggled with the reproducibility problem, and continue to. You might add to your settings:

    torch.backends.cudnn.deterministic = True  #tested - needed for reproducibility
    torch.backends.cudnn.benchmark = False
    torch.cuda.manual_seed(seed_value)

and when creating the DataBunch, num_workers=0.
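As a rough illustration of why a seeded, single-process loader gives a repeatable batch order, here is a minimal sketch in plain PyTorch rather than fastai (a fastai DataBunch wraps PyTorch DataLoaders). The helper name `batch_order` and the toy dataset are made up for this example, and the `generator` argument assumes a reasonably recent PyTorch (1.6 or later), newer than what was current when this thread started.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def batch_order(seed):
    """Return the shuffled batches produced by a seeded, single-process DataLoader."""
    gen = torch.Generator()
    gen.manual_seed(seed)                      # dedicated RNG used only for shuffling
    dataset = TensorDataset(torch.arange(10))  # toy data: the integers 0..9
    loader = DataLoader(dataset, batch_size=5, shuffle=True,
                        num_workers=0, generator=gen)
    return [batch[0].tolist() for batch in loader]

# Same seed, same batch order on every call.
print(batch_order(0) == batch_order(0))
```

This only pins down the order in which items are served. Random transforms and weight initialisation draw from other generators and still need their own seeding.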

That said, these settings (along with the ones you already have) used to give me consistent results run after run. However now I get weird bi-modal outcomes for training and validation losses - 80% of runs yield two consistent numbers and 20% two different consistent numbers.

Very frustrating when trying to assess settings and models!

Pomo · February 13, 2019, 9:32pm

Another clue and partial solution.

The issue is that after setting all the random seeds before creating the DataBunch, training and validation losses are inconsistent (non-reproducible) as of the first fit(). The losses have a bimodal or trimodal pattern. These observations are all after a kernel restart between runs, no transforms, and num_workers = 0.

What I see is that setting these random seeds again before fit() yields consistent, reproducible loss results. However, setting them before create_cnn still yields the bimodal loss pattern.

Conclusions:

1. Setting random seeds before creating the DataBunch is needed to have a consistent Train/Validate split.
2. create_cnn, in this case Resnet50, leaves the random seeds set inconsistently. (It’s possible that both DataBunch creation and create_cnn leave inconsistencies.)
3. Something in fit_one_cycle() then uses a random seed, perhaps to shuffle the images. Because the random seeds are inconsistent across runs, the losses are inconsistent.
4. You can get reproducible results by setting random seeds before creating the DataBunch (with num_workers=0) AND before the first fit_one_cycle().
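The mechanism suggested in these conclusions can be mimicked with nothing but Python’s `random` module. This is only a toy sketch, not what fastai actually does: `build_model` and `shuffle_batches` are hypothetical stand-ins for learner creation and for the shuffle at the start of training, and the assumption is that something between seeding and training consumes a varying number of random draws.

```python
import random

def build_model(n_draws):
    """Hypothetical stand-in for learner creation: consumes some random numbers."""
    return [random.random() for _ in range(n_draws)]

def shuffle_batches():
    """Hypothetical stand-in for the shuffle performed when training starts."""
    order = list(range(10))
    random.shuffle(order)
    return order

def run(n_draws, reseed_before_fit):
    random.seed(42)            # seed once, as before creating the DataBunch
    build_model(n_draws)       # advances the generator by a varying amount
    if reseed_before_fit:
        random.seed(42)        # seed again right before "fit"
    return shuffle_batches()

# Without re-seeding, the shuffle depends on how many draws came before it.
print(run(3, False) == run(4, False))
# With re-seeding, the shuffle no longer depends on what happened earlier.
print(run(3, True) == run(4, True))
```

If the number of draws consumed before training differs between runs, only the re-seeded variant is guaranteed to shuffle identically, which matches the observation that seeding again before fit() restores reproducibility.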

I should say these conclusions are tentative because 1) there are other explanations, such as a flaky GPU; and 2) I have not identified the source of the inconsistency. But I have spent a large number of hours getting to this point, and hope that a more competent developer will eventually investigate.

Here’s the function (originally by someone else) I use to reset every random seed I’ve ever seen mentioned:

    import random

    import numpy as np
    import torch

    def random_seed(seed_value, use_cuda):
        """Reset every random number generator used during training."""
        random.seed(seed_value)        # Python
        np.random.seed(seed_value)     # NumPy (CPU)
        torch.manual_seed(seed_value)  # PyTorch (CPU)
        if use_cuda:
            torch.cuda.manual_seed(seed_value)
            torch.cuda.manual_seed_all(seed_value)     # all GPUs
            torch.backends.cudnn.deterministic = True  # needed
            torch.backends.cudnn.benchmark = False

    # Remember to use num_workers=0 when creating the DataBunch.

I hope this information can help anyone else who needs reproducible training. For myself, when trying to squeeze off fractions of a percent for a Kaggle competition, it does not work to have variations of 5% across training runs. With that much variation, small effects of changes to the model and parameters get lost in the noise.

dserles · February 15, 2019, 4:03pm

Amazing, thanks so much for the comment.
I was able to get reproducible results using what you recommended. In my case, I found that it was the creation of the new learner that reset the random seed. So I needed to call my random_seed() function before I initialised the ConvLearner object.
I’ve added a couple pictures to help illustrate for anyone looking at this in the future.

zzgong · August 18, 2019, 2:23am

Setting num_workers=0 does reproduce the result. But another question arises: training then takes much more time. Did you figure that out?
Thanks in advance

zzgong · August 18, 2019, 2:33am

I found that when I set num_workers=n and run learn.lr_find, the results seem to be the same, so I suppose it will also work for the training phase. So maybe there is no need to set num_workers=0? Otherwise it will cost you much more time to train.

Pomo · August 18, 2019, 8:07pm

What I saw - way back then - is that num_workers=0 is needed to get exactly and precisely the same loss numbers across runs. It has to do with each worker getting its own random seed. num_workers affects only the speed of the data pipeline, not the model itself. If the data pipeline is not limiting, then setting num_workers to 0 will not slow down training.

That said, I have found exact reproducibility to be less useful than I once thought.

The starting point seems to have little effect on ultimate accuracy, which is what usually matters.
I can’t be sure that the effect of varying a hyperparameter/model/optimizer from one consistent initialization will have the same effect starting from a different initialization. Maybe, maybe not. (Does anyone know the answer?) I now assess a change by averaging the outcomes of several (randomized) runs.
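Averaging over several randomized runs could look roughly like the sketch below. Everything in it is illustrative: `train_and_score` is a hypothetical stand-in for a full training run, and the scores are simulated (a small built-in effect of `dropout` plus seed-dependent noise), not real results.

```python
import random
import statistics

def train_and_score(dropout, seed):
    """Hypothetical stand-in for one training run; returns a simulated accuracy."""
    rng = random.Random(seed * 1000 + int(dropout * 100))
    base = 0.91 if dropout == 0.5 else 0.90  # simulated effect of the setting
    return base + rng.gauss(0, 0.02)         # simulated run-to-run noise

def summarize(dropout, seeds):
    """Mean and standard deviation of the score over several seeds."""
    scores = [train_and_score(dropout, s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

seeds = range(10)
for dropout in (0.25, 0.5):
    mean, std = summarize(dropout, seeds)
    print(f"dropout={dropout}: mean={mean:.4f} std={std:.4f} over {len(seeds)} runs")
```

Comparing the gap between the means with the spread across seeds gives a feel for whether a change is larger than the run-to-run noise.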

YMMV, as they say online.

much_learner · January 15, 2020, 2:39pm

For myself, when trying to squeeze off fractions of a percent for a Kaggle competition, it does not work to have variations of 5% across training runs. With that much variation, small effects of changes to the model and parameters get lost in the noise.

That I understand: seeding everything helps you see whether your changes are good or bad for the model. But for production (submission) it doesn’t make sense to fix the seed, right?

I.e. for Kaggle, we don’t see the private test set and cannot be sure the split we use is the best.

Pomo · January 18, 2020, 12:02am

Hmm, I am not sure I understand your question. With a Kaggle test set, public or private, you evaluate all the test input and do not ever see the correct targets. So the train/validation split does not take place.

I agree that you cannot be sure any tuning of the model will have the same effect on a different input set. The assumption is that any improvement of the model for a particular validation set will generalize to the test set.

I suppose you could try the same model tuning on a different train/validation split and better assess its effect in general. But I am just making this idea up - never tried it.
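One way to try the same tuning on several train/validation splits is a k-fold scheme. The sketch below uses only the standard library; `k_fold_splits` is a name invented for this example, and the 2400 items simply mirror the 2000 training plus 400 validation images mentioned at the top of the thread.

```python
import random

def k_fold_splits(n_items, k, seed):
    """Yield (train_idx, valid_idx) pairs; each item lands in exactly one validation fold."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)       # reproducible shuffle of the indices
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

for fold, (train, valid) in enumerate(k_fold_splits(n_items=2400, k=5, seed=0)):
    # A real experiment would train on `train` and score on `valid` here.
    print(fold, len(train), len(valid))
```

Scoring a change on every fold and averaging the results shows whether it helps in general or only on one particular split.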

much_learner · January 19, 2020, 3:05pm

I meant that in Kaggle, when you make a final submission, it doesn’t make sense to seed, because you don’t know how your model, trained with a fixed or unfixed split, will perform against the private test set.
So it’s like a gamble. You can hope that a random split will give a slightly better model (provided we can actually do a random split).

But I am not sure. What’s the best practice?

Pomo · January 20, 2020, 12:41am

I suspect we are not communicating clearly.

It does not make sense to seed for the test set because the model is already trained and the seed is irrelevant at this point.

you don’t know how your model trained with the fixed or unfixed split perform against private test.
So it’s like a gamble. You can hope that random split will give a slightly better model.

This is always true whether or not you manually fix the seed. A different split will yield a different trained model, which may or may not perform better on the test set.

You can try different splits and weight initialization to see their effect on the test score. The danger is that you will tune the model to the public test set but it will not generalize to the private test set. On the other hand, you may stumble upon initial parameters that yield a trained model that performs better everywhere.

I’m not qualified to cite best practices and I doubt there is one. Actually wrestling with a Kaggle competition and studying the winners’ practices will teach you a lot. Some topics that are relevant to these questions are the “lottery ticket” hypothesis and k-fold training.

HTH, Malcolm

AMusic · March 27, 2020, 6:02pm

With the DataBunch you can set num_workers = 0, but not when using the new DataBlock. Do you have a clue whether there is a workaround, or should we use the “old” method with DataBunches?

rgold · April 1, 2020, 11:01am

Hi @AMusic, I have just tried implementing the random_seed() method demonstrated by @dserles, using the function from @Pomo as below, on a tabular databunch and a tabular learner. I did not set num_workers = 0 and still got reproducible results over 1 epoch (a test epoch). I am working in a Jupyter Notebook:

    random_seed(2, True)
    data = …  # initialize the databunch here; leave num_workers at its default (don’t specify it)

    random_seed(2, True)
    learn = tabular_learner(data…)  # initialise the learner here

    learn.fit_one_cycle(1)

I hope this helps.

Thanks @Pomo for your helpful function!

rgold · April 1, 2020, 11:10am

For more information from the docs:
https://docs.fast.ai/dev/test.html#getting-reproducible-results

(Apologies if this has already been reposted)

MiriamA · September 8, 2022, 3:50pm

Hi everybody,
I implemented the random_seed method shown by @Pomo and applied it to my deep-learning algorithm which relies on a transfer-learning approach. I set the random seed at the very beginning of my script, even before data augmentation. Below you find some screenshots:

By setting a number of epochs = 1, I obtain reproducible results in two different runs:

epoch,train_loss,valid_loss,accuracy,time
0,0.832483,0.970716,0.531331,07:38
epoch,train_loss,valid_loss,accuracy,time
0,0.832483,0.970716,0.531331,07:46

When I change the number of epochs, e.g. to 5, I still get reproducible results across two different runs, but they differ from those obtained with 1 epoch.

epoch,train_loss,valid_loss,accuracy,time
0,0.971364,1.057626,0.491989,07:36
epoch,train_loss,valid_loss,accuracy,time
0,0.971364,1.057626,0.491989,07:35

What I noticed is that when increasing the number of epochs, the initial values of training and validation losses are higher.

Did somebody ever experience something like that? I would have expected to see the very same values but apparently they change according to the number of epochs.

Thank you,

Miriam
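A possible explanation, not confirmed anywhere in this thread and assuming the training call is fit_one_cycle: the one-cycle policy stretches its learning-rate schedule over the whole run, so the learning rates applied during epoch 0 depend on the total number of epochs. The sketch below is a simplified schedule, not fastai’s actual implementation; `one_cycle_lr` is a made-up helper, `steps_per_epoch` is a hypothetical value, and `pct_start=0.3` and `div_factor=25` are assumed defaults.

```python
import math

def one_cycle_lr(step, total_steps, lr_max=1e-3, pct_start=0.3, div_factor=25.0):
    """Simplified one-cycle schedule: cosine warm-up, then cosine anneal towards zero."""
    warm = int(total_steps * pct_start)  # number of warm-up steps
    lr_min = lr_max / div_factor         # starting learning rate
    if step < warm:
        t = step / warm
        return lr_min + (lr_max - lr_min) * (1 - math.cos(math.pi * t)) / 2
    t = (step - warm) / (total_steps - warm)
    return lr_max * (1 + math.cos(math.pi * t)) / 2

steps_per_epoch = 100  # hypothetical number of batches per epoch
for epochs in (1, 5):
    total = epochs * steps_per_epoch
    first_epoch = [one_cycle_lr(s, total) for s in range(steps_per_epoch)]
    print(f"{epochs} epoch(s): lr at end of epoch 0 = {first_epoch[-1]:.6f}, "
          f"mean lr in epoch 0 = {sum(first_epoch) / steps_per_epoch:.6f}")
```

With 1 epoch the whole warm-up and anneal happens inside epoch 0, whereas with 5 epochs epoch 0 sits entirely in the warm-up phase. The weights are therefore updated with different learning rates, and the losses reported after epoch 0 need not match even with identical seeds.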