Hey all,
for a few days now I have been trying to seed my unet_learner.
I create a unet_learner several times in a CI pipeline to perform anomaly detection with it. I train it with different parameters in order to determine the best model, and I am primarily interested in finding out which parameters have a good or bad effect on the result.
I use the following dataset for this.
Because the dataset contains masks of the defects, my result metric is IoU.
Long story short: when I try to seed the training with set_seed(42, True) (from fastai.torch_core), I still get different values. Not only is the IoU different, but also the metrics during training. The data used are always the same and always in the same order.
I have set the number of workers (num_workers) on the dataloader to zero, and shuffle_train is also False.
I also tried wrapping the training in
with no_random(): ...
Unfortunately this did not help either.
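To make it concrete, this is roughly what my two seeding attempts look like (stripped down; learn stands for the learner I create further below):

from fastai.vision.all import *
from fastai.torch_core import set_seed, no_random

# attempt 1: global seed before building the dataloaders / learner
set_seed(42, reproducible=True)   # seeds random, numpy, torch and sets the cudnn flags

# ... build dls / learn here ...

# attempt 2: run the training inside no_random()
with no_random(seed=42):
    learn.fit_one_cycle(1)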
I use a method to create the unet_learner, including the DataBlock / DataLoaders.
The training is handled in a separate method.
This is my code. I use different configuration objects (dataloaders_config: dict, learner_config: dict) to configure the learner and the dataloaders.
Create Learner
def create_learner(dataloaders_config: dict, learner_config: dict, img_size: int,
                   path_obfuscated: Path, path_domain: Path, seed: int) -> Learner:
    """Creates a `unet_learner` with dataloaders included."""
    set_seed(seed, True)
    datablock = DataBlock(blocks=(ImageBlock, ImageBlock),
                          get_items=get_image_files,
                          get_y=get_y,
                          splitter=RandomSplitter(
                              valid_pct=dataloaders_config["valid_pct"], seed=seed),
                          item_tfms=Resize(size=img_size),
                          batch_tfms=[*aug_transforms(max_zoom=2.),
                                      Normalize.from_stats(*imagenet_stats)])
    dls = datablock.dataloaders(path_obfuscated, bs=dataloaders_config["bs"], path=path_domain,
                                item_tfms=Resize(img_size), num_workers=0, shuffle_train=False)
    dls.c = dataloaders_config["channels"]
    if learner_config["loss_func"] == "FeatureLoss":
        loss_function = create_feature_loss(learner_config["loss_config"])
    else:
        loss_function = None
    cbs = [MixedPrecision,
           EarlyStoppingCallback(monitor=learner_config["early_stopping"]["monitor"],
                                 min_delta=learner_config["early_stopping"]["min_delta"],
                                 patience=learner_config["early_stopping"]["patience"])]
    return unet_learner(dls=dls,
                        arch=learner_config["arch"],
                        loss_func=loss_function,
                        metrics=LossMetrics(loss_function.metric_names)
                            if learner_config["loss_func"] == "FeatureLoss" else None,
                        blur=learner_config["blur"],
                        norm_type=learner_config["norm_type"],
                        cbs=cbs)
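For reference, this is roughly how I call it; the values below are only placeholders, my real configs come from the CI pipeline:

# Placeholder configs, just to show the shape of the dicts (not my real values)
dataloaders_config = {"valid_pct": 0.2, "bs": 8, "channels": 3}
learner_config = {"arch": resnet34,
                  "loss_func": "FeatureLoss",
                  "loss_config": {},   # whatever create_feature_loss() expects, omitted here
                  "blur": True,
                  "norm_type": NormType.Weight,
                  "early_stopping": {"monitor": "valid_loss", "min_delta": 0.001, "patience": 3}}

learn = create_learner(dataloaders_config, learner_config, img_size=256,
                       path_obfuscated=path_obfuscated, path_domain=path_domain, seed=42)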
As a loss function I use the feature loss function presented in the fast.ai course.
class FeatureLoss(Module):
    def __init__(self, m_feat, layer_ids, layer_wgts):
        self.m_feat = m_feat
        self.loss_features = [self.m_feat[i] for i in layer_ids]
        self.hooks = hook_outputs(self.loss_features, detach=False)
        self.wgts = layer_wgts
        self.metric_names = ['pixel',] + [f'feat_{i}' for i in range(len(layer_ids))
                                          ] + [f'gram_{i}' for i in range(len(layer_ids))]

    def make_features(self, x, clone=False):
        self.m_feat(x)
        return [(o.clone() if clone else o) for o in self.hooks.stored]

    def forward(self, input, target, reduction='mean'):
        out_feat = self.make_features(target, clone=True)
        in_feat = self.make_features(input)
        self.feat_losses = [base_loss(input, target, reduction=reduction)]
        self.feat_losses += [base_loss(f_in, f_out, reduction=reduction)*w
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.feat_losses += [base_loss(gram_matrix(f_in), gram_matrix(f_out), reduction=reduction)*w**2 * 5e3
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        if reduction == 'none':
            self.feat_losses = [f.mean(dim=[1,2,3]) for f in self.feat_losses[:4]] + [f.mean(dim=[1,2]) for f in self.feat_losses[4:]]
        for n, l in zip(self.metric_names, self.feat_losses): setattr(self, n, l)
        return sum(self.feat_losses)

    def __del__(self): self.hooks.remove()
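base_loss and gram_matrix are taken from the course notebook as well, and create_feature_loss() is only a thin helper around that setup. Roughly it does the following (the layer indices and weights here are placeholders; in my code they come from loss_config):

import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16_bn

base_loss = F.l1_loss

def gram_matrix(x):
    n, c, h, w = x.size()
    x = x.view(n, c, -1)
    return (x @ x.transpose(1, 2)) / (c * h * w)

def create_feature_loss(loss_config: dict) -> FeatureLoss:
    # Pretrained VGG16 as a frozen feature extractor, as in the course
    vgg_m = vgg16_bn(pretrained=True).features.cuda().eval()
    vgg_m.requires_grad_(False)
    # use the layers just before each MaxPool as feature layers
    blocks = [i - 1 for i, o in enumerate(vgg_m.children()) if isinstance(o, nn.MaxPool2d)]
    return FeatureLoss(vgg_m, blocks[2:5], loss_config.get("layer_wgts", [5, 15, 2]))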
Fit model
def fit(learner: Learner, result_path: Path, epochs_freeze: int = 10, epochs_unfreeze: int = 15, base_lr: float = 1e-3,
        lowest_lr: float = 1e-5, pct_start: float = 0.9, wd: float = 1e-3, enable_logging: bool = False):
    """Performs the actual fitting. First trains frozen (`epochs_freeze`) and then unfrozen (`epochs_unfreeze`).
    Sends the model and the metrics (loss, etc.) of the training to the MLflow tracking server if `enable_logging` is set to True."""
    def fit_cycles_and_export_model():
        learner.fit_one_cycle(n_epoch=epochs_freeze, lr_max=base_lr, pct_start=pct_start, wd=wd)
        learner.unfreeze()
        learner.fit_one_cycle(n_epoch=epochs_unfreeze, lr_max=base_lr, pct_start=pct_start, wd=wd,
                              cbs=[MLFlowLogCallback()] if enable_logging else None)
        learner.path = result_path
        learner.export(fname="model.pkl")

    if enable_logging:
        with learner.no_bar(), learner.no_logging():
            fit_cycles_and_export_model()
    else:
        fit_cycles_and_export_model()
In a different module, I first create the learner and then pass it into fit().
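Roughly like this, using the learn created with create_learner() above (result_path and the epoch counts are placeholders here):

fit(learn, result_path=Path("results/run_001"),
    epochs_freeze=10, epochs_unfreeze=15,
    base_lr=1e-3, pct_start=0.9, wd=1e-3, enable_logging=False)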
I use nvidia-docker to run the pipeline. Maybe this is another cause for my problem.
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
I noticed that set_seed() uses numpy.random.seed(). According to numpy's documentation, this method is a legacy function. I'm not sure if this could be a problem.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html
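As far as I can tell from the fastai source, set_seed(seed, True) boils down to roughly this, which is why the legacy numpy call caught my eye:

import random
import numpy as np
import torch

def my_set_seed(s: int, reproducible: bool = True):
    # roughly what fastai's set_seed does, as I read the source
    torch.manual_seed(s)
    torch.cuda.manual_seed_all(s)
    np.random.seed(s % (2**32 - 1))   # legacy global numpy seeding
    random.seed(s)
    if reproducible:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False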
Unfortunately I can’t find the error and was hoping to get help here.
The following answers could not help me:
If you need more info, leave a comment below