Epochs of arbitrary length

I was thinking we would change the training loop…


The only easy way to do this would be to interrupt the train_dl, which means we might go over the same data.

Would it be a good idea to track this potential improvement as a github issue? I can create one if you want.

Issues are for known bugs only. Potential new features are discussed and tracked on the forum :slight_smile:

I’ve been modifying the callbacks to work at an “every n batches” level instead of per epoch; maybe that’s one possible avenue? For example, I’ve modified SaveModelCallback to save every n batches instead of every epoch. A default of n_batches = total number of batches in an epoch should (I think) give the same behaviour as the current callbacks, while allowing for more fine-grained control.

That being said I haven’t dug into the code base too much as of yet, so I may be missing a lot of the complexity involved.
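
In case it helps to make that concrete, here is a rough, untested sketch of the idea (SaveEveryNBatches and its arguments are just illustrative names, not the actual modified SaveModelCallback):

class SaveEveryNBatches(LearnerCallback):
    "Sketch: save the model every `n_batches` training batches."
    def __init__(self, learn, n_batches:int=None, name:str='model'):
        super().__init__(learn)
        # default to one epoch's worth of batches, matching the current per-epoch behaviour
        self.n_batches = n_batches or len(learn.data.train_dl)
        self.name, self.count = name, 0

    def on_batch_end(self, train, **kwargs):
        if not train: return
        self.count += 1
        if self.count % self.n_batches == 0:
            self.learn.save(f'{self.name}_{self.count}')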


I’ve been thinking about custom samplers, and with a few changes to DataBunch.create I added the ability to pass a list of samplers when calling .databunch().

My DataBunch looks like this:

# Shadow fastai's ImageDataBunch so that .databunch() accepts a `sampler` list (one sampler per dataloader)
class ImageDataBunch(ImageDataBunch):
    @classmethod
    def create(cls, train_ds:Dataset, valid_ds:Dataset, test_ds:Optional[Dataset]=None, path:PathOrStr='.', bs:int=64,
               val_bs:int=None, num_workers:int=defaults.cpus, dl_tfms:Optional[Collection[Callable]]=None,
               device:torch.device=None, collate_fn:Callable=data_collate, no_check:bool=False, sampler=None, **dl_kwargs)->'DataBunch':
        "Create a `DataBunch` from `train_ds`, `valid_ds` and maybe `test_ds` with a batch size of `bs`. Passes `**dl_kwargs` to `DataLoader()`"
        datasets = cls._init_ds(train_ds, valid_ds, test_ds)
        val_bs = ifnone(val_bs, bs)
        # default: random sampling for train, sequential for valid/fix/test
        if sampler is None: sampler = [RandomSampler] + 3*[SequentialSampler]
        dls = [DataLoader(d, b, sampler=s(d, bs=b), num_workers=num_workers, **dl_kwargs) for d,b,s in
               zip(datasets, (bs,val_bs,val_bs,val_bs), sampler) if d is not None]
        return cls(*dls, path=path, device=device, dl_tfms=dl_tfms, collate_fn=collate_fn, no_check=no_check)

# Point ImageList at the patched DataBunch so the data block API picks it up
class ImageList(ImageList):
    _bunch = ImageDataBunch

Then the custom samplers (for the random and sequential samplers I just add **kwargs to the __init__ so they ignore the extra bs argument):

from torch.utils.data import SequentialSampler, RandomSampler

# Thin wrappers that swallow the extra kwargs (e.g. `bs`) passed in by DataBunch.create above
class SequentialSampler(SequentialSampler):
    def __init__(self, data_source, **kwargs):
        self.data_source = data_source

class RandomSampler(RandomSampler):
    def __init__(self, data_source, replacement=False, num_samples=None, **kwargs):
        self.data_source = data_source
        self.replacement = replacement
        self.num_samples = num_samples

class FixedLenRandomSampler(RandomSampler):
    "Random sampler whose epoch is `epoch_size` batches long, regardless of the dataset size."
    def __init__(self, data_source, bs, epoch_size, *args, **kwargs):
        super().__init__(data_source)
        self.epoch_size = epoch_size*bs   # epoch_size is given in batches, stored in samples

    def __iter__(self):
        return iter(torch.randperm(len(self.data_source))[:len(self)].tolist())

    def __len__(self):
        return self.epoch_size

Then I create a list of samplers for train_ds, valid_ds, fix_ds and test_ds:

train_sampler = partial(FixedLenRandomSampler, epoch_size=100)
samplers = [train_sampler, SequentialSampler, SequentialSampler, SequentialSampler]

Finally the datablock and learner as usual:

data = (ImageList.from_folder(path) 
        .split_by_folder(train='training', valid='testing')            
        .label_from_folder()           
        .transform(get_transforms(), size=64) 
        .databunch(sampler=samplers, bs=64))

learn = cnn_learner(data, models.densenet121, metrics=[accuracy])

Then, when calling fit, it runs with the specified epoch_size :slight_smile:

Working example on colab: https://colab.research.google.com/drive/1k2Ut_ZINNSzYkJt2bjUPD9gxC_FGwzQd

To avoid repeating samples, I guess we can modify the sampler to remember the already-sampled indices and sample only from the remaining ones until all have been used.

This has many other applications like episode sampling for few-shot learning. I will share a sampler for that soon!


Quick update… This should do the trick for fixed epoch length sampling without replacement.

class SequentialSampler(SequentialSampler):
    def __init__(self, data_source, **kwargs):
        self.data_source = data_source

class RandomSampler(RandomSampler):
    def __init__(self, data_source, replacement=False, num_samples=None, **kwargs):
        self.data_source = data_source
        self.replacement = replacement
        self.num_samples = num_samples

class FixedLenRandomSampler(RandomSampler):
    "Fixed-length epochs, sampling without replacement across epochs."
    def __init__(self, data_source, bs, epoch_size, *args, **kwargs):
        super().__init__(data_source)
        self.epoch_size = epoch_size*bs
        self.not_sampled = np.array([True]*len(data_source))   # True = not yet seen in this cycle

    def _reset_state(self):
        self.not_sampled[:] = True

    def __iter__(self):
        ns = sum(self.not_sampled)
        idx_last = []
        if ns >= len(self):
            # enough unseen samples left for a full epoch
            idx = np.random.choice(np.where(self.not_sampled)[0], size=len(self), replace=False).tolist()
            if ns == len(self): self._reset_state()
        else:
            # not enough unseen samples left: use them all, reset, then top up from the rest
            idx_last = np.where(self.not_sampled)[0].tolist()
            self._reset_state()
            self.not_sampled[idx_last] = False   # don't draw these again within the same epoch
            idx = np.random.choice(np.where(self.not_sampled)[0], size=len(self)-len(idx_last), replace=False).tolist()
        self.not_sampled[idx] = False
        idx = [*idx_last, *idx]
        # print(ns, len(idx), len(idx_last)) # debug
        return iter(idx)

    def __len__(self):
        return self.epoch_size

The idx_last part handles the case where the remaining unused samples are not enough to make a full epoch: it uses the available ones, then resets the state and samples as many as needed to complete the epoch.

However, when using lr_find for example, we are “wasting” samples. To correct that, the following callback calls _reset_state in on_train_begin.

class ResetSamplers(LearnerCallback):
    "Reset any sampler that tracks state (e.g. FixedLenRandomSampler) at the start of training."
    def __init__(self, learn):
        super().__init__(learn)
        self.dls = learn.data.dls

    def on_train_begin(self, **kwargs):
        for o in self.dls:
            if hasattr(o.dl.sampler, '_reset_state'):
                o.dl.sampler._reset_state()

A possible approach I thought of would be to modify on_batch_end and return {"stop_epoch": True} depending on the quantity of input data already processed.

Let me know if it seems like a reasonable solution and I can try to implement it.
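
To illustrate what I mean (untested sketch; the class name is made up, but if I remember correctly TerminateOnNaNCallback already stops epochs through the same mechanism):

class StopEpochAfterNBatches(LearnerCallback):
    "Sketch: end the current epoch once `n_batches` training batches have been processed."
    def __init__(self, learn, n_batches:int=100):
        super().__init__(learn)
        self.n_batches = n_batches

    def on_epoch_begin(self, **kwargs):
        self.count = 0

    def on_batch_end(self, train, **kwargs):
        if not train: return
        self.count += 1
        if self.count >= self.n_batches:
            return {'stop_epoch': True}  # the fit loop then breaks out of the epoch

# e.g. learn.fit_one_cycle(5, callbacks=[StopEpochAfterNBatches(learn, n_batches=200)])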

Callbacks at “every n” batches are a great start! I would love to be able to specify time-based callbacks: “stop after 4 hours, saving the best model every ten minutes”. I’m forever doing 1-epoch runs and dividing the time available to set the number of epochs.

Edit: I see there is a StopAfterNBatches callback. Cool! On medium/large datasets I can use this to get an approximate time estimate at the start of training and then auto-set the number of epochs (and soon, it sounds, the number of batches) based on the time I have to train.

While I am on this train of thought, I’ve always wished for a way to “slice” a 1cycle training schedule so I can run pieces of it when convenient: ¼ now, ½ tonight, and ¼ on Thursday. I suppose it’s possible if I get my head around the training scheduler, but a helper would be cool.
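
The time-based part might not need much code; here is a rough, untested sketch (the class name and defaults are made up, and it saves the latest weights rather than the best; tracking the best would need SaveModelCallback-style logic):

import time

class StopAfterSeconds(LearnerCallback):
    "Sketch: stop training after `max_seconds`, saving the model every `save_every` seconds."
    def __init__(self, learn, max_seconds:float=4*3600, save_every:float=600, name:str='timed'):
        super().__init__(learn)
        self.max_seconds, self.save_every, self.name = max_seconds, save_every, name

    def on_train_begin(self, **kwargs):
        self.start = self.last_save = time.time()

    def on_batch_end(self, train, **kwargs):
        if not train: return
        now = time.time()
        if now - self.last_save >= self.save_every:
            self.learn.save(self.name)
            self.last_save = now
        if now - self.start >= self.max_seconds:
            return {'stop_epoch': True, 'stop_training': True}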

I like the ‘do every n batches’ better than the complex samplers — it seems simpler and cleaner to me. Of course, we can do that already, just by keeping a counter.

But it would make the common case simple to have a value of ‘n’ that was a common denominator and received special treatment in the fit loop, i.e. an on_n_batch_end callback.

Along with that, it would make sense for validation to be part of this ‘every n’ cycle (in fact it could be a callback itself).

I am actually implementing this approach, as I need to see validation results more often than at the end of every epoch.
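
For the validation part, a bare-bones version could look like this (untested sketch; ValidateEveryNBatches is a made-up name):

class ValidateEveryNBatches(LearnerCallback):
    "Sketch: run validation every `n` training batches instead of only at epoch end."
    def __init__(self, learn, n:int=100):
        super().__init__(learn)
        self.n, self.count = n, 0

    def on_batch_end(self, train, **kwargs):
        if not train: return
        self.count += 1
        if self.count % self.n == 0:
            res = self.learn.validate()      # [valid_loss, *metrics]
            print(f'batch {self.count}: {res}')
            self.learn.model.train()         # validate() leaves the model in eval mode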

I’m just adding a note that we can use BatchSampler, which lets us create mini-batches of indices easily.

For example, BatchSampler(RandomSampler(dataset), batch_size=32, drop_last=False) yields chunks of 32 randomly chosen indices, without replacement, until the whole dataset has been covered.
The idea is to iterate through each chunk (for example with a batch size of 4) and treat the end of the chunk as the end of an epoch.
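
A standalone toy illustration of the idea (plain PyTorch, not fastai-specific; the numbers are arbitrary):

from torch.utils.data import BatchSampler, RandomSampler

data = list(range(100))                      # toy dataset of 100 items
chunks = BatchSampler(RandomSampler(data), batch_size=32, drop_last=False)

for epoch, chunk in enumerate(chunks):       # each 32-index chunk plays the role of one "epoch"
    for i in range(0, len(chunk), 4):        # iterate the chunk with a mini-batch size of 4
        mini_batch = [data[j] for j in chunk[i:i+4]]
        # ... train on mini_batch here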

Hi, is there a plan to implement “epoch_size” in fastai? Maybe in fastai v2? Thanks.

I actually have an implementation here: Fastai v2 chat

In case it might be useful for someone, I patched it in fastai v1:

class MyDl:
    "Wrap a fastai DataLoader so that one 'epoch' is only `epoch_size` batches long."
    def __init__(self, dl, epoch_size):
        self.dl = dl
        self.iter = iter(dl)
        self.c = dl.c                  # forwarded so cnn_learner can find the number of classes
        self.dataset = dl.dataset
        self.epoch_size = epoch_size

    def __iter__(self):
        # keeps going from where the previous epoch left off
        for i in range(self.epoch_size):
            try:
                yield next(self.iter)
            except StopIteration:
                # start from the beginning if the end of self.iter is reached
                self.iter = iter(self.dl)
                yield next(self.iter)

    def __len__(self):
        return self.epoch_size

path = untar_data(URLs.MNIST)
data = (ImageList.from_folder(path) 
        .split_by_folder(train='training', valid='testing')            
        .label_from_folder()
        .transform(get_transforms(), size=16) 
        .databunch(bs=64))

data.train_dl = MyDl(data.train_dl, 4)

learn = cnn_learner(data, models.resnet18, metrics=[accuracy])
learn.fit_one_cycle(2)

You can find the Colab notebook here https://colab.research.google.com/drive/1bkATL1uNyHOlB4DW4ImYvAseSAZKCeoj

I hope I didn’t break anything :slight_smile:


Thanks for your working example in Colab. I’ll adapt it to my case and test it.
Have a good weekend. :slight_smile:

Just for info, you can now use the partial_dataloaders method.

Could you provide an example? I am trying to use partial_dataloaders with images but I am getting
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'fastai2.vision.core.PILImage'>

My code:

data_block = DataBlock(
    blocks=(
        ImageBlock,
        CategoryBlock
    ),
    n_inp=1,
    get_items=get_image_files,
    get_y=parent_label,
    splitter=RandomSplitter(),
    item_tfms=[Resize(self.input_size), ToTensor],
    batch_tfms=aug_transforms(max_warp=0)
)
tdls = data_block.datasets(self.data_dir).partial_dataloaders(bs=10, partial_n=30)

Does it work when you use regular DataLoaders?

I tried it with

splits = RandomSplitter(valid_pct=0.01)(range_of(df_train))

to = TabularPandas(
    df_train,
    y_names="target",
    cat_names = cat_names,
    cont_names = cont_names,
    procs = [Categorify, FillMissing, Normalize],
    splits=splits
)

# convert it to dataloaders
dls = to.partial_dataloaders(bs=10, partial_n=30)

did not work. During learner fitting I got: KeyError: 131235

I like the idea, and it could be a kwarg for the dataloaders() method.

The new (v2) API for dataloaders has an n argument to limit the size to n samples; see Data core | fastai

If I understood the implementation correctly, it selects a different subset each time.
It works on DataBlock and indeed it uses only a subset of the dataset during training. Code example:

dls = dataset.dataloaders(path, bs=16, n=800)