Hey all,
for a few days now I have been trying to seed my unet_learner.
I create a unet_learner several times in a CI pipeline to perform anomaly detection with it. I train it with different parameters in order to determine the best model, and I am primarily interested in finding out which parameters have a good or bad effect on the result.
I use the following dataset for this.
Because the dataset contains masks of the defects, my result metric is IoU.
Long story short: when I try to seed the training with set_seed(42, True) (from fastai.torch_core), I still get different values. Not only is the IoU different, but also the metrics during training. The data used are always the same and always in the same order.
I have set the number of workers (num_workers) on the dataloader to zero, and shuffle_train is also False.
I also tried wrapping the training in
with no_random(): ...
Unfortunately this did not help either.
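To make it concrete, this is roughly what my two seeding attempts look like (stripped down; learn stands for the learner I create further below):

from fastai.vision.all import *
from fastai.torch_core import set_seed, no_random

# attempt 1: global seed before building the dataloaders / learner
set_seed(42, reproducible=True)   # seeds random, numpy, torch and sets the cudnn flags

# ... build dls / learn here ...

# attempt 2: run the training inside no_random()
with no_random(seed=42):
    learn.fit_one_cycle(1)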
I use a method to create the unet_learner, including the DataBlock / DataLoaders.
The training is handled in a separate method.
This is my code. I use different configuration objects (dataloaders_config: dict, learner_config: dict) to configure the learner and the dataloaders.
Create Learner
def create_learner(dataloaders_config: dict, learner_config: dict, img_size: int,
                   path_obfuscated: Path, path_domain: Path, seed: int) -> Learner:
    """Creates a `unet_learner` with dataloaders included."""
    set_seed(seed, True)
    datablock = DataBlock(blocks=(ImageBlock, ImageBlock),
                          get_items=get_image_files,
                          get_y=get_y,
                          splitter=RandomSplitter(
                              valid_pct=dataloaders_config["valid_pct"], seed=seed),
                          item_tfms=Resize(size=img_size),
                          batch_tfms=[*aug_transforms(max_zoom=2.),
                                      Normalize.from_stats(*imagenet_stats)])
    dls = datablock.dataloaders(path_obfuscated, bs=dataloaders_config["bs"], path=path_domain,
                                item_tfms=Resize(img_size), num_workers=0, shuffle_train=False)
    dls.c = dataloaders_config["channels"]
    if learner_config["loss_func"] == "FeatureLoss":
        loss_function = create_feature_loss(learner_config["loss_config"])
    else:
        loss_function = None
    cbs = [MixedPrecision,
           EarlyStoppingCallback(monitor=learner_config["early_stopping"]["monitor"],
                                 min_delta=learner_config["early_stopping"]["min_delta"],
                                 patience=learner_config["early_stopping"]["patience"])]
    return unet_learner(dls=dls,
                        arch=learner_config["arch"],
                        loss_func=loss_function,
                        metrics=LossMetrics(loss_function.metric_names)
                            if learner_config["loss_func"] == "FeatureLoss" else None,
                        blur=learner_config["blur"],
                        norm_type=learner_config["norm_type"],
                        cbs=cbs)
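For reference, this is roughly how I call it; the values below are only placeholders, my real configs come from the CI pipeline:

# Placeholder configs, just to show the shape of the dicts (not my real values)
dataloaders_config = {"valid_pct": 0.2, "bs": 8, "channels": 3}
learner_config = {"arch": resnet34,
                  "loss_func": "FeatureLoss",
                  "loss_config": {},   # whatever create_feature_loss() expects, omitted here
                  "blur": True,
                  "norm_type": NormType.Weight,
                  "early_stopping": {"monitor": "valid_loss", "min_delta": 0.001, "patience": 3}}

learn = create_learner(dataloaders_config, learner_config, img_size=256,
                       path_obfuscated=path_obfuscated, path_domain=path_domain, seed=42)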
As a loss function I use the feature loss function presented in the fast.ai course.
class FeatureLoss(Module):
    def __init__(self, m_feat, layer_ids, layer_wgts):
        self.m_feat = m_feat
        self.loss_features = [self.m_feat[i] for i in layer_ids]
        self.hooks = hook_outputs(self.loss_features, detach=False)
        self.wgts = layer_wgts
        self.metric_names = ['pixel',] + [f'feat_{i}' for i in range(len(layer_ids))
                                          ] + [f'gram_{i}' for i in range(len(layer_ids))]

    def make_features(self, x, clone=False):
        self.m_feat(x)
        return [(o.clone() if clone else o) for o in self.hooks.stored]

    def forward(self, input, target, reduction='mean'):
        out_feat = self.make_features(target, clone=True)
        in_feat = self.make_features(input)
        self.feat_losses = [base_loss(input, target, reduction=reduction)]
        self.feat_losses += [base_loss(f_in, f_out, reduction=reduction)*w
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.feat_losses += [base_loss(gram_matrix(f_in), gram_matrix(f_out), reduction=reduction)*w**2 * 5e3
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        if reduction == 'none':
            self.feat_losses = [f.mean(dim=[1,2,3]) for f in self.feat_losses[:4]] + [f.mean(dim=[1,2]) for f in self.feat_losses[4:]]
        for n, l in zip(self.metric_names, self.feat_losses): setattr(self, n, l)
        return sum(self.feat_losses)

    def __del__(self): self.hooks.remove()
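base_loss and gram_matrix are taken from the course notebook as well, and create_feature_loss() is only a thin helper around that setup. Roughly it does the following (the layer indices and weights here are placeholders; in my code they come from loss_config):

import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16_bn

base_loss = F.l1_loss

def gram_matrix(x):
    n, c, h, w = x.size()
    x = x.view(n, c, -1)
    return (x @ x.transpose(1, 2)) / (c * h * w)

def create_feature_loss(loss_config: dict) -> FeatureLoss:
    # Pretrained VGG16 as a frozen feature extractor, as in the course
    vgg_m = vgg16_bn(pretrained=True).features.cuda().eval()
    vgg_m.requires_grad_(False)
    # use the layers just before each MaxPool as feature layers
    blocks = [i - 1 for i, o in enumerate(vgg_m.children()) if isinstance(o, nn.MaxPool2d)]
    return FeatureLoss(vgg_m, blocks[2:5], loss_config.get("layer_wgts", [5, 15, 2]))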
Fit model
def fit(learner: Learner, result_path: Path, epochs_freeze: int = 10, epochs_unfreeze: int = 15, base_lr: float = 1e-3,
        lowest_lr: float = 1e-5, pct_start: float = 0.9, wd: float = 1e-3, enable_logging: bool = False):
    """Performs the actual fitting. First trains frozen (`epochs_freeze`) and then unfrozen (`epochs_unfreeze`).
    Sends the model and the metrics (loss, etc.) of the training to the MLflow tracking server if `enable_logging` is set to True."""
    def fit_cycles_and_export_model():
        learner.fit_one_cycle(n_epoch=epochs_freeze, lr_max=base_lr, pct_start=pct_start, wd=wd)
        learner.unfreeze()
        learner.fit_one_cycle(n_epoch=epochs_unfreeze, lr_max=base_lr, pct_start=pct_start, wd=wd,
                              cbs=[MLFlowLogCallback()] if enable_logging else None)
        learner.path = result_path
        learner.export(fname="model.pkl")

    if enable_logging:
        with learner.no_bar(), learner.no_logging():
            fit_cycles_and_export_model()
    else:
        fit_cycles_and_export_model()
In a different module, I first create the learner and then pass it into fit().
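Roughly like this, using the learn created with create_learner() above (result_path and the epoch counts are placeholders here):

fit(learn, result_path=Path("results/run_001"),
    epochs_freeze=10, epochs_unfreeze=15,
    base_lr=1e-3, pct_start=0.9, wd=1e-3, enable_logging=False)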
I use nvidia-docker to run the pipeline. Maybe this is another cause for my problem.
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
I noticed that set_seed() uses numpy.random.seed(). According to numpy's documentation, this method is a legacy function. I'm not sure if this could be a problem.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html
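As far as I can tell from the fastai source, set_seed(seed, True) boils down to roughly this, which is why the legacy numpy call caught my eye:

import random
import numpy as np
import torch

def my_set_seed(s: int, reproducible: bool = True):
    # roughly what fastai's set_seed does, as I read the source
    torch.manual_seed(s)
    torch.cuda.manual_seed_all(s)
    np.random.seed(s % (2**32 - 1))   # legacy global numpy seeding
    random.seed(s)
    if reproducible:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False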
Unfortunately I can’t find the error and was hoping to get help here.
The following answers could not help me:
If you need more info, leave a comment below