Epochs of arbitrary length

jeremy · October 25, 2018, 2:45pm

I was thinking we would change the training loop…

sgugger · October 25, 2018, 8:05pm

The only easy way to do this would be to interrupt the train_dl, which means we might go over the same data.

boris · April 3, 2019, 5:45am

Would it be a good idea to track this potential improvement as a GitHub issue? I can create one if you want.

sgugger · April 3, 2019, 1:13pm

Issues are for known bugs only. Potential new features are discussed and tracked on the forum.

binalpatel · April 3, 2019, 6:29pm

I’ve been modifying the callbacks to work at an “every n batches” level instead of per epoch; maybe that’s one possible avenue? For example, I’ve modified SaveModelCallback to save every n batches instead of every epoch. A default setting could be n_batches = total number of batches in an epoch, which (I think) should then give the same behavior the callbacks have today, while allowing for more fine-grained control.

That being said I haven’t dug into the code base too much as of yet, so I may be missing a lot of the complexity involved.
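To make the “every n batches” idea concrete, here is a minimal framework-free sketch. EveryNBatches and its on_batch_end method are hypothetical names chosen for illustration, not fastai classes; a real fastai callback would receive more arguments and hook into the Learner.

```python
class EveryNBatches:
    """Call `action` once every `n` batches (hypothetical helper, not part of fastai)."""
    def __init__(self, n, action):
        if n < 1:
            raise ValueError("n must be >= 1")
        self.n, self.action, self.seen = n, action, 0

    def on_batch_end(self):
        # Count the batch and fire the action on every n-th one.
        self.seen += 1
        if self.seen % self.n == 0:
            self.action(self.seen)

saved_at = []  # records the batch counts at which a "save" would happen
cb = EveryNBatches(n=3, action=saved_at.append)
for _ in range(10):  # stand-in for ten training batches
    cb.on_batch_end()
print(saved_at)  # [3, 6, 9]
```

Setting n to the number of batches in an epoch would make the action fire once per epoch, which matches the default behavior suggested above.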

mnpinto · April 6, 2019, 12:34pm

I’ve been thinking about custom samplers and with a few changes in DataBunch.create I added the functionality to pass a list of samplers when calling .databunch().

My DataBunch looks like this:

class ImageDataBunch(ImageDataBunch):
    @classmethod
    def create(cls, train_ds:Dataset, valid_ds:Dataset, test_ds:Optional[Dataset]=None, path:PathOrStr='.', bs:int=64,
               val_bs:int=None, num_workers:int=defaults.cpus, dl_tfms:Optional[Collection[Callable]]=None,
               device:torch.device=None, collate_fn:Callable=data_collate, no_check:bool=False, sampler=None, **dl_kwargs)->'DataBunch':
        "Create a `DataBunch` from `train_ds`, `valid_ds` and maybe `test_ds` with a batch size of `bs`. Passes `**dl_kwargs` to `DataLoader()`"
        datasets = cls._init_ds(train_ds, valid_ds, test_ds)
        val_bs = ifnone(val_bs, bs)
        if sampler is None: sampler = [RandomSampler] + 3*[SequentialSampler]
        dls = [DataLoader(d, b, sampler=s(d, bs=b), num_workers=num_workers, **dl_kwargs) for d,b,s in
               zip(datasets, (bs,val_bs,val_bs,val_bs), sampler) if d is not None]
        return cls(*dls, path=path, device=device, dl_tfms=dl_tfms, collate_fn=collate_fn, no_check=no_check)
    
class ImageList(ImageList):
    _bunch = ImageDataBunch

Then the custom samplers (in random and sequential samplers I just add the **kwargs in the init):

class SequentialSampler(SequentialSampler):
    def __init__(self, data_source, **kwargs):
        self.data_source = data_source
        
class RandomSampler(RandomSampler):
    def __init__(self, data_source, replacement=False, num_samples=None, **kwargs):
        self.data_source = data_source
        self.replacement = replacement
        self.num_samples = num_samples
        
class FixedLenRandomSampler(RandomSampler):
    def __init__(self, data_source, bs, epoch_size, *args, **kwargs):
        super().__init__(data_source)
        self.epoch_size = epoch_size*bs
    
    def __iter__(self):
        return iter(torch.randperm(len(self.data_source))[:len(self)].tolist())
    
    def __len__(self):
        return self.epoch_size

Then I create a list of samplers for train_ds, valid_ds, fix_ds and test_ds:

train_sampler = partial(FixedLenRandomSampler, epoch_size=100)
samplers = [train_sampler, SequentialSampler, SequentialSampler, SequentialSampler]

Finally the datablock and learner as usual:

data = (ImageList.from_folder(path) 
        .split_by_folder(train='training', valid='testing')            
        .label_from_folder()           
        .transform(get_transforms(), size=64) 
        .databunch(sampler=samplers, bs=64))

learn = cnn_learner(data, models.densenet121, metrics=[accuracy])

Calling fit then runs with the specified epoch_size.

Working example on colab: https://colab.research.google.com/drive/1k2Ut_ZINNSzYkJt2bjUPD9gxC_FGwzQd

To avoid repeating samples I guess we can modify the Sampler to remember the already sampled indices and sample only from the remaining until all have been sampled.

This has many other applications like episode sampling for few-shot learning. I will share a sampler for that soon!

mnpinto · April 6, 2019, 4:37pm

Quick update… This should do the trick for fixed epoch length sampling without replacement.

class SequentialSampler(SequentialSampler):
    def __init__(self, data_source, **kwargs):
        self.data_source = data_source
        
class RandomSampler(RandomSampler):
    def __init__(self, data_source, replacement=False, num_samples=None, **kwargs):
        self.data_source = data_source
        self.replacement = replacement
        self.num_samples = num_samples
        
class FixedLenRandomSampler(RandomSampler):
    def __init__(self, data_source, bs, epoch_size, *args, **kwargs):
        super().__init__(data_source)
        self.epoch_size = epoch_size*bs
        self.not_sampled = np.array([True]*len(data_source))
    
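    # Declared as a property, so plain attribute access (self._reset_state, no parentheses) triggers the reset
    @property
    def _reset_state(self): self.not_sampled[:] = True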
        
    def __iter__(self):
        ns = sum(self.not_sampled)
        idx_last = []
        if ns >= len(self):
            idx = np.random.choice(np.where(self.not_sampled)[0], size=len(self), replace=False).tolist()
            if ns == len(self): self._reset_state
        else:
            idx_last = np.where(self.not_sampled)[0].tolist()
            self._reset_state
            idx = np.random.choice(np.where(self.not_sampled)[0], size=len(self)-len(idx_last), replace=False).tolist()
        self.not_sampled[idx] = False
        idx = [*idx_last, *idx]
        # print(ns, len(idx), len(idx_last)) # debug
        return iter(idx)
    
    def __len__(self):
        return self.epoch_size

The idx_last list handles the case where the remaining unused samples are not enough to make an epoch: the sampler uses the ones that are left, then resets the state and samples as many more as are needed to complete the epoch.
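The carry-over behavior described above can also be sketched without torch or numpy. This is a simplified variant under my own assumptions, not the code from the post: it keeps a shuffled pool of remaining indices instead of a boolean mask, and CarryOverSampler is a hypothetical name. As in the original, an epoch that straddles two passes may contain an index twice.

```python
import random

class CarryOverSampler:
    """Fixed-length epochs that exhaust the dataset before reusing any index (illustrative sketch)."""
    def __init__(self, n_items, epoch_size, seed=None):
        self.n_items, self.epoch_size = n_items, epoch_size
        self.rng = random.Random(seed)
        self.pool = []  # indices not yet drawn in the current pass over the data

    def _refill(self):
        # Start a new pass: every index becomes available again, in random order.
        self.pool = list(range(self.n_items))
        self.rng.shuffle(self.pool)

    def __iter__(self):
        out = []
        while len(out) < self.epoch_size:
            if not self.pool:
                self._refill()
            take = self.epoch_size - len(out)
            out.extend(self.pool[:take])  # may be fewer than `take` at the end of a pass
            del self.pool[:take]
        return iter(out)

    def __len__(self):
        return self.epoch_size

sampler = CarryOverSampler(n_items=10, epoch_size=4, seed=0)
epochs = [list(sampler) for _ in range(5)]  # 5 epochs of 4 = two full passes over 10 items
flat = [i for epoch in epochs for i in epoch]
```

With 10 items and epochs of 4, the third epoch takes the last 2 indices of the first pass and the first 2 of the second pass, which is the situation idx_last covers in the post.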

However, when using lr_find for example, we are “wasting” samples, because the ones it consumes stay marked as used. To correct that, the following callback is needed to call _reset_state in on_train_begin.

class ResetSamplers(LearnerCallback):
    def __init__(self, learn):
        super().__init__(learn)
        self.dls = learn.data.dls
        
    def on_train_begin(self, **kwargs):
        for o in self.dls:
            if hasattr(o.dl.sampler, '_reset_state'):
                o.dl.sampler._reset_state

boris · April 8, 2019, 11:25pm

A possible approach I thought of would be to modify on_batch_end and introduce return {"stop_epoch":True} depending on quantity of input data already processed.

Let me know if it seems like a reasonable solution and I can try to implement it.
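A rough sketch of what this could look like, outside of fastai: run_epoch and stop_after are hypothetical helpers, and the "stop_epoch" key simply mirrors the name proposed above. The real fit loop and callback handler pass much more state around.

```python
def run_epoch(batches, on_batch_end):
    """Process batches until they run out or the callback asks to stop the epoch."""
    done = 0
    for _batch in batches:
        done += 1  # the actual training step would happen here
        state = on_batch_end(n_seen=done) or {}
        if state.get("stop_epoch"):
            break
    return done

def stop_after(limit):
    """Build a callback that ends the epoch once `limit` batches have been processed."""
    def on_batch_end(n_seen):
        return {"stop_epoch": n_seen >= limit}
    return on_batch_end

processed = run_epoch(range(100), stop_after(25))
print(processed)  # 25
```

If the data runs out before the limit is reached, the epoch just ends normally, so the callback only ever shortens an epoch.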

digitalspecialists · April 9, 2019, 5:52am

Callbacks at “every n” batches is a great start! I would love to specify time based callbacks: “Stop after 4 hours, saving best model every ten minutes”. I’m forever doing 1 epoch runs and dividing time available to set number of epochs.

Edit: I see there is a StopAfterNBatches callback. Cool! On medium/large datasets I can use this to get an approximate estimate at the start of training and then auto set number of epochs (and soon, it sounds, number of batches) based on time I have to train.

While I am on the train of thought, I’ve always wished for a way to “slice” a 1cycle training cycle. So I can run “pieces” when convenient. Like ¼ now, ½ tonight, and ¼ Thursday. I suppose it’s possible to do if I get my head around the training scheduler enough, but a helper would be cool.
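For the time-based part (“stop after 4 hours, saving every ten minutes”), something along these lines might work. TimeBudget is a hypothetical helper, not a fastai callback, and the returned flags are illustrative names; the clock is injectable only so the behavior can be checked without waiting.

```python
import time

class TimeBudget:
    """Decide after each batch whether to save and whether to stop, based on elapsed time."""
    def __init__(self, stop_after_s, save_every_s, clock=time.monotonic):
        self.stop_after_s, self.save_every_s, self.clock = stop_after_s, save_every_s, clock
        self.start = self.last_save = clock()

    def on_batch_end(self):
        now = self.clock()
        save = now - self.last_save >= self.save_every_s
        if save:
            self.last_save = now
        stop = now - self.start >= self.stop_after_s
        return {"save": save, "stop_training": stop}

# Demo with a fake clock that advances 5 "seconds" per call.
ticks = iter([0, 5, 10, 15, 20])
budget = TimeBudget(stop_after_s=20, save_every_s=10, clock=lambda: next(ticks))
results = [budget.on_batch_end() for _ in range(4)]
```

In real use the defaults would be closer to TimeBudget(stop_after_s=4*3600, save_every_s=600), with the training loop acting on the two flags.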

denised · August 26, 2019, 5:57pm

I like the ‘do every n batches’ better than the complex samplers — it seems simpler and cleaner to me. Of course, we can do that already, just by keeping a counter.

But the common case would be simpler if a value of ‘n’ that acts as a common denominator received special treatment in the fit loop, i.e. an on_n_batch_end callback.

Along with that, it would make sense for validation to be part of this ‘every n’ cycle (in fact it could be a callback itself).

I am actually implementing this approach, as I need to see validation results more often than the end of every epoch.

boris · December 17, 2019, 4:59pm

I’m just adding a note that we can use BatchSampler which can let us create mini-batches easily.

For example, BatchSampler(RandomSampler(dataset), batch_size=32, drop_last=False) (BatchSampler expects a sampler instance rather than the class; dataset here stands for your own dataset) will create a sampler that goes through the entire dataset and randomly picks only 32 samples, without replacement, at each epoch.
The idea is to iterate through them (for example with a batch size of 4) until the end of each mini-batch, which would be the end of an epoch.

gasimovh · February 13, 2020, 1:18pm

Hi, is there a plan to implement “epoch_size” in fastai? Maybe in fastai v2? Thanks.

boris · February 13, 2020, 2:53pm

I actually have an implementation here: Fastai v2 chat

gasimovh · February 13, 2020, 3:57pm

In case it might be useful for someone, I patched it in fastai v1:

class MyDl:
    def __init__(self, dl, epoch_size):
        self.dl = dl
        self.iter = iter(dl)
        self.c = dl.c
        self.dataset = dl.dataset
        self.epoch_size = epoch_size
        
    def __iter__(self):
        for i in range(self.epoch_size):
            try:
                yield next(self.iter)
            except StopIteration:
                # start from beginning if end of self.iter reached
                self.iter = iter(self.dl)
                yield next(self.iter)

    def __len__(self):
        return self.epoch_size

path = untar_data(URLs.MNIST)
data = (ImageList.from_folder(path) 
        .split_by_folder(train='training', valid='testing')            
        .label_from_folder()
        .transform(get_transforms(), size=16) 
        .databunch(bs=64))

data.train_dl = MyDl(data.train_dl, 4)

learn = cnn_learner(data, models.resnet18, metrics=[accuracy])
learn.fit_one_cycle(2)

You can find the Colab notebook here https://colab.research.google.com/drive/1bkATL1uNyHOlB4DW4ImYvAseSAZKCeoj

I hope I didn’t break anything
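For anyone who wants to see the wrap-around behavior of this wrapper without fastai, here is a stripped-down sketch of the same idea. FixedLengthLoader is a hypothetical name, and a plain list of strings stands in for the DataLoader; the fastai-specific attributes (c, dataset) are left out.

```python
class FixedLengthLoader:
    """Yield exactly `epoch_size` batches per epoch, resuming where the last epoch stopped."""
    def __init__(self, loader, epoch_size):
        self.loader, self.epoch_size = loader, epoch_size
        self._it = iter(loader)  # kept across epochs so no batch is skipped

    def __iter__(self):
        for _ in range(self.epoch_size):
            try:
                yield next(self._it)
            except StopIteration:
                # Underlying loader exhausted: restart it (assumes it is not empty).
                self._it = iter(self.loader)
                yield next(self._it)

    def __len__(self):
        return self.epoch_size

batches = ["b0", "b1", "b2", "b3", "b4"]  # stand-in for a DataLoader with 5 batches
dl = FixedLengthLoader(batches, epoch_size=3)
epochs = [list(dl) for _ in range(3)]
print(epochs)  # [['b0', 'b1', 'b2'], ['b3', 'b4', 'b0'], ['b1', 'b2', 'b3']]
```

The second epoch finishes the first pass and then wraps around, so every batch is used before any is repeated, as long as the underlying loader yields the same batches on each pass.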

Alexandre_DIEUL · March 1, 2020, 12:49am

Thanks for your working example in Colab. I’ll adapt it to my case and test it.
Have a good weekend.

boris · March 16, 2020, 3:42pm

Just for info you can now use the method “partial_dataloaders”.

lemme_test_that · July 3, 2020, 7:19am

Could you provide an example? I am trying to use partial_dataloaders with images, but I am getting:
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'fastai2.vision.core.PILImage'>

My code:

data_block = DataBlock(
    blocks=(
        ImageBlock,
        CategoryBlock
    ),
    n_inp=1,
    get_items=get_image_files,
    get_y=parent_label,
    splitter=RandomSplitter(),
    item_tfms=[Resize(self.input_size), ToTensor],
    batch_tfms=aug_transforms(max_warp=0)
)
tdls = data_block.datasets(self.data_dir).partial_dataloaders(bs=10, partial_n=30)

boris · July 3, 2020, 4:39pm

Does it work when you use regular DataLoaders?

soerendip · June 2, 2021, 5:27pm

I tried it with

splits = RandomSplitter(valid_pct=0.01)(range_of(df_train))

to = TabularPandas(
    df_train,
    y_names="target",
    cat_names = cat_names,
    cont_names = cont_names,
    procs = [Categorify, FillMissing, Normalize],
    splits=splits
)

# and convert it to a dataloader with batch size of 64
batch_size = 64
dls = to.partial_dataloaders(bs=10, partial_n=30)

It did not work. During learner fitting I got: KeyError: 131235

I like the idea and it could be a kwarg for the dataloader() method.

Banus · November 13, 2021, 12:05am

The new (v2) API for dataloaders has an n argument to limit the size to n samples; see Data core | fastai

If I understood the implementation correctly, it selects a different set each time.
It works on DataBlock and indeed it uses only a subset of the dataset during training. Code example:

dls = dataset.dataloaders(path, bs=16, n=800)