Identical Loaded Model - Not Reproducible

[Solved] - There’s uncontrolled randomness coming from the shuffling of your training items. This is not a problem with fastai or the save / load functionality. (See the bottom of the thread for the full write-up.)

I’ve reduced my problem down to a toy MNIST example.

Step0: we’ll use reproducible settings, such as putting torch into deterministic mode, and call set_seed(42) before each step.
Step1: we’ll initialize the data and model, quickly fit it, then .save() the model.
Step2a: we’ll initialize the same model and data as learn2 and dl2. Meanwhile, learn and dl are still available.
Step2b: now we’ll do more fitting on learn and learn2 with the same params, setting the seed before each call. As you can see, the results differ.
Why?

Step0 + Step1:
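Roughly, Step0 + Step1 look like this (a sketch, since the screenshots aren’t reproduced here; MNIST_SAMPLE, resnet18 and the 'stage1' name are stand-ins, not necessarily what the actual notebook used):

from fastai.vision.all import *
import torch

# Step0: reproducibility settings
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
set_seed(42)

# Step1: data, model, a quick fit, then save the weights
path  = untar_data(URLs.MNIST_SAMPLE)
dls   = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, metrics=accuracy)  # vision_learner in newer fastai
learn.fit_one_cycle(1)
learn.save('stage1')                                  # writes <path>/models/stage1.pth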

Step2a + Step2b:
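And Step2a + Step2b, continuing the sketch above:

# Step2a: a second, supposedly identical learner, loading the saved weights
set_seed(42)
dls2   = ImageDataLoaders.from_folder(path)
learn2 = cnn_learner(dls2, resnet18, metrics=accuracy)
learn2 = learn2.load('stage1')        # learn and dls are still around too

# Step2b: fit both with the same params, re-seeding before each call
set_seed(42)
learn.fit_one_cycle(1)

set_seed(42)
learn2.fit_one_cycle(1)
# The two runs report different losses/metrics, even though everything
# "identical" was seeded the same way.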

Additional Notes:

  • I am able to get the fit_one_cycle() step to reproduce, just not the subsequent fit.
  • I was originally doing export() / load_learner() / .pkl before I switched to save() / load() / .pth, but that has the same problem (both paths are sketched below).
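For reference, the two save/restore paths from the notes above look roughly like this (the file names here are placeholders):

# Path A: export the whole Learner to a .pkl and restore it
learn.export('learner.pkl')
learn2 = load_learner(learn.path/'learner.pkl')

# Path B: save just the model/optimizer state to a .pth and load it into an
# already-constructed Learner
learn.save('stage1')
learn2 = learn2.load('stage1')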

Full notebook Link


I don’t know what’s going on here exactly, but I would warn against putting too much time into trying to guarantee fully identical runs. Some CUDA/cudnn operations are themselves nondeterministic (although I’ve read that they’ve recently been adding more deterministic implementations), so you’re never going to reproduce a run exactly as long as you depend on those.
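For what it’s worth, the usual PyTorch knobs for this look something like the sketch below; even with these set, a few CUDA ops simply have no deterministic implementation.

import random
import numpy as np
import torch

def seed_everything(seed=42):
    "Seed Python, NumPy and PyTorch, and ask cudnn for deterministic kernels."
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False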


I’ve reproduced this locally, with no GPU involved.

You might be right. In my example it was 96% vs 89% on a benchmark I cared about, so the “fresh” learner seemed better than the “loaded” one. I want to make sure the loaded representation isn’t missing anything critical. But maybe I’m doing something subtly wrong?

This occurs because the two “identical” learners see the training items in a different order coming out of their respective DataLoaders. This can be seen with:

x1, y1 = learn.dls[0].one_batch()
x2, y2 = learn2.dls[0].one_batch()
(y1 == y2).all().item()
>>> False

The non-reproducibility can be removed by setting shuffle_train=False when building the DataLoaders for each learner. When this is done (as in the linked notebook), we get identical results on subsequent fitting.
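A minimal sketch of that fix, assuming the same MNIST setup as above (in newer fastai versions this argument may be called shuffle instead of shuffle_train):

# Build both sets of DataLoaders with training-set shuffling turned off,
# so both learners see the batches in exactly the same order.
dls  = ImageDataLoaders.from_folder(path, shuffle_train=False)
dls2 = ImageDataLoaders.from_folder(path, shuffle_train=False)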


Yes, shuffle_train=False is where I ended up as well after looking over the docs for ImageDataLoaders. Thanks for posting the initial issue, it was fun to go through it and find out what was happening under the hood. :grin:
https://dev.fast.ai/vision.data#ImageDataLoaders



Good find! But does that mean there’s some additional seed that could be set to make it deterministic? The shuffle part of shuffle_train has got to depend on something.

Each DataLoader has a shuffle_fn. By default it’s a .sample() call on the DataLoader’s own random.Random generator, which gets re-created each epoch by the randomize method.

See here on lines 123 and 124:

As for when this is specifically called: get_idxs runs at the beginning of each epoch, and it is what applies shuffle_fn to the indices.
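Roughly, the behaviour boils down to something like this stand-in (paraphrased; the real code lives in fastai/data/load.py and differs in detail):

import random

class ToyDL:
    "Minimal stand-in for how fastai's DataLoader shuffles each epoch."
    def __init__(self, n, shuffle=True):
        self.n, self.shuffle = n, shuffle
        self.rng = random.Random()          # each DataLoader gets its own generator

    def randomize(self):
        # run once per epoch: replace the generator with a freshly-seeded one
        self.rng = random.Random(self.rng.randint(0, 2**32 - 1))

    def shuffle_fn(self, idxs):
        # shuffle with this DataLoader's rng, not the global `random` state
        return self.rng.sample(idxs, len(idxs))

    def get_idxs(self):
        self.randomize()
        idxs = list(range(self.n))
        return self.shuffle_fn(idxs) if self.shuffle else idxs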


Setting random.seed(0) or random.Random(0) doesn’t help. It looks like it’s because the .randomize method of the DataLoader uses random.Random(). From its doc:

Used to instantiate instances of Random to get generators that don’t
share state.

So it seems we can’t set the seed of this generator globally (?)
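You can see the same thing without fastai at all (a small sketch):

import random

random.seed(0)             # only seeds the module-level, shared generator
a = random.Random()        # independent generator, seeded from OS entropy
b = random.Random()        # another one -- unaffected by random.seed(0)

a.sample(range(10), 10) == b.sample(range(10), 10)   # almost certainly False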

Setting learn.dls.rng = random.Random(0) seems like it should work, but it doesn’t either.

This works:

learn.dls.rng = random.Random(0)     # rng on the DataLoaders object
learn.dls[0].rng = random.Random(0)  # rng on the training DataLoader itself

:raised_hands:
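Applied to the example above, the end-to-end recipe might look like this (seed_dls is just a hypothetical helper, not part of fastai):

import random

def seed_dls(learner, s=0):
    # hypothetical helper: give the DataLoaders object and the training
    # DataLoader identically seeded generators
    learner.dls.rng = random.Random(s)
    learner.dls[0].rng = random.Random(s)

seed_dls(learn);  set_seed(42); learn.fit_one_cycle(1)
seed_dls(learn2); set_seed(42); learn2.fit_one_cycle(1)
# Both learners should now draw batches in the same order, so the two fits match.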
