How to resume training a model after dataset cleanup?

I’m following the grizzly bear classifier example in Lesson 2, where the code is roughly this:

  • Part 1: Load data and train a classifier
    bears = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        splitter=RandomSplitter(valid_pct=0.2, seed=42),
        get_y=parent_label,
        item_tfms=Resize(128))
    bears = bears.new(
        item_tfms=RandomResizedCrop(224, min_scale=0.4),
        batch_tfms=aug_transforms(mult=.4))
    path = Path('images/bears')
    dls = bears.dataloaders(path)
    learn = cnn_learner(dls, resnet18, metrics=error_rate)
    learn.fine_tune(4)
  • Part 2: Clean up the data
    cleaner = ImageClassifierCleaner(learn)
    cleaner

(pick a bunch of photos to remove)

    idxs = cleaner.delete()
    print(f'Removing {len(idxs)} files.')
    for idx in idxs: cleaner.fns[idx].unlink()
  • Part 3: Train the model (again)

Now this is where my question comes in. We could make a whole new object and train from scratch like this:

    dls = bears.dataloaders(path)
    learn2 = cnn_learner(dls, resnet18, metrics=error_rate)
    learn2.fine_tune(4)

But… it feels inefficient not to reuse the weights we already learned when creating learn. However, some of the images in the original dataset were removed during cleanup, so we can’t just call learn.fine_tune(1) either without getting errors. So: is there a way to start training again with the weights we already calculated in the first round of training, but with the modified dataset?

You can do

learn.save('my_model')
...
learn2.load('my_model')

This saves the model parameters to disk.
Alternatively, if you still have the previous learner in memory, you can do something like:

learn2 = Learner(dls, learn.model,...)
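
Putting the two steps together, here is a minimal sketch of the save/load route (the model name 'my_model' follows the snippet above; this assumes the bears DataBlock and path from the question are still in scope):

learn.save('my_model')                   # writes models/my_model.pth under learn.path

dls_clean = bears.dataloaders(path)      # rebuild the DataLoaders after cleanup
learn2 = cnn_learner(dls_clean, resnet18, metrics=error_rate)
learn2 = learn2.load('my_model')         # restore the weights from the first round
learn2.fit_one_cycle(4)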

You can simply assign new dataloaders to your current learner:

dls_clean = bears.dataloaders(path)
learn.dls = dls_clean

and keep training your model using fit_one_cycle instead of fine_tune, since your “head” layers are already trained and the whole model is unfrozen:

learn.fit_one_cycle(4)

Also, be careful: in this example you used a random split to get the training and validation sets, so after cleaning the data a new split will be made, and images that were used for training before may end up in the new validation set. This will cause you to overestimate the model’s performance. To avoid it, split your data explicitly before training. For example, you can create train and valid folders and use GrandparentSplitter, as in the sketch below.
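
For reference, a one-line sketch of the splitter (these are its default folder names, written out explicitly here so the assumed layout is visible):

splitter = GrandparentSplitter(train_name='train', valid_name='valid')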

Can you elaborate on how GrandparentSplitter is used? Would this approach be like:

bears1 = DataBlock(
  ...,
  splitter=RandomSplitter(),
)

dls = bears1.dataloaders(path)
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

bears2 = DataBlock(
  ...,
  splitter=GrandparentSplitter(),
)

learn.dls = bears2.dataloaders(path)
learn.fit_one_cycle(4)

Hello. To use GrandparentSplitter you need to organize your dataset like this:

train
 |-grizzly
 |-black
 |-teddy
valid
 |-grizzly
 |-black
 |-teddy

This should be done before the first training cycle. So the code would be:

bears1 = DataBlock(
  ...,
  splitter=GrandparentSplitter(),
)

dls = bears1.dataloaders(path)
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

bears2 = DataBlock(
  ...,
  splitter=GrandparentSplitter(),
)

learn.dls = bears2.dataloaders(path)
learn.fit_one_cycle(4)

You can check out an example of GrandparentSplitter usage with the MNIST dataset in the Data block tutorial.
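
For a fuller version of the snippet above, the non-splitter arguments can be carried over from the original bears DataBlock (a sketch; the blocks and transforms are copied from the question, not required by GrandparentSplitter):

bears1 = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=GrandparentSplitter(train_name='train', valid_name='valid'),
    get_y=parent_label,
    item_tfms=Resize(128))

dls = bears1.dataloaders(path)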

However, if you use fine_tune and rerun the whole file, it should not overfit?

The point is that you need to ensure your model never saw the labels of your validation data.
So yes, if after data cleanup you restart the process from:

learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

there would be no problem.

It doesn’t work from

learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

The error is “no such file or directory”. I think I have to go back to the DataLoaders.
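
A likely fix (just a sketch, assuming the bears DataBlock and path from earlier are still defined) is to rebuild the DataLoaders first, since the old dls still reference files that were deleted during cleanup:

dls = bears.dataloaders(path)   # re-list the files so deleted images are gone
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)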

Hello, how do we get the images to download into separate ‘train’ and ‘valid’ folders?
The code in the lessons only shows how to get all the images into one folder per class, as below:

bear_types = 'grizzly','black','teddy'
path = Path('bears')

if not path.exists():
    path.mkdir()
    for o in bear_types:
        dest = (path/o)
        dest.mkdir(exist_ok=True)
        results = search_images_bing(key, f'{o} bear')
        download_images(dest, urls=results.attrgot('contentUrl'))
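
One possible approach (a sketch, not from the lesson; the 80/20 split ratio and the train/valid folder names are assumptions) is to shuffle each class’s URLs and download them into train and valid subfolders directly:

import random

bear_types = 'grizzly','black','teddy'
path = Path('bears')
random.seed(42)

for o in bear_types:
    results = search_images_bing(key, f'{o} bear')
    urls = list(results.attrgot('contentUrl'))
    random.shuffle(urls)
    cut = int(len(urls) * 0.8)          # 80% train / 20% valid
    for split, split_urls in [('train', urls[:cut]), ('valid', urls[cut:])]:
        dest = path/split/o             # e.g. bears/train/grizzly
        dest.mkdir(parents=True, exist_ok=True)
        download_images(dest, urls=split_urls)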