Are features/preprocessed data cached across epochs?

Hi,

I understand that the preprocessing steps or transformations are stored in the DataBlock and referenced in the learner, so that they can easily be applied to raw data during online serving.

For example, if we take this DataBlock:

bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    ....
    item_tfms=Resize(128)) 

For training, taking item_tfms=Resize(128) as the example, are the raw images resized again for each epoch? Or is the resize done only once before training, with the resized images reused across epochs, saving processing time (at the cost of memory/disk/a persistent cache)?

If the transformations/feature engineering are run again for every epoch, is there a way of introducing a cache (or caching function) that lets us reuse derived features across epochs?

Thanks,
Rajiv

Hey, looking at the source code, there seems to be no caching.

DataBlock only saves the transforms. These are then applied in Datasets:

class Datasets(FilteredBase):
    ...
    def __getitem__(self, it):
        res = tuple([tl[it] for tl in self.tls])
        return res if is_indexer(it) else list(zip(*res))
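
A quick way to see that nothing is memoised is to count how often a transform actually runs over two passes through a Datasets (a minimal sketch; CountingTfm is just a throwaway name of mine):

from fastai.data.core import Datasets
from fastcore.transform import Transform

class CountingTfm(Transform):
    n_calls = 0
    def encodes(self, x):
        CountingTfm.n_calls += 1   # count every time the transform is applied
        return x

dsets = Datasets([0, 1, 2, 3], tfms=[[CountingTfm()]])
for epoch in range(2):                       # two "epochs" over the same items
    for i in range(len(dsets)): dsets[i]
print(CountingTfm.n_calls)                   # 8, not 4 -> the transform re-runs every pass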

To get caching, you would have to create a custom transform that does the caching itself, for example:

from fastai.vision.all import *
from PIL import Image

@delegates()
class CachedResize(Resize):
    cache = {}
    def encodes(self, x:Image.Image|TensorBBox|TensorPoint):
        # hash the input to get a lookup key
        lookup_key = hash(x.tobytes() if isinstance(x, Image.Image) else x.numpy().tobytes())
        if lookup_key in self.cache:
            # return the cached value if it exists
            return self.cache[lookup_key]
        else:
            # otherwise apply the transform, cache the result and return it
            result = super().encodes(x)
            self.cache[lookup_key] = result
            return result
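
As a rough usage sketch, you would swap the cached transform in for the plain Resize (the dataset, get_items and get_y below are placeholders, not from the original question):

from fastai.vision.all import *

path = untar_data(URLs.MNIST_TINY)        # any image folder works; this is just a placeholder dataset
bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    item_tfms=CachedResize(128))
dls = bears.dataloaders(path)
# The first epoch fills CachedResize.cache; later epochs read resized images from it.

One caveat: with num_workers > 0 each worker process holds its own copy of the dict, so the cache may not actually persist across epochs unless you use num_workers=0 or back the cache with shared/disk storage.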

Note that it’s not always better to cache results: especially when using GPUs, you typically have plenty of compute and limited memory.
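
If memory is the tight resource, one option is to bound the cache. Here is a rough LRU variant of the same idea (LRUCachedResize and max_items are my own names/values, not part of fastai):

from collections import OrderedDict
from fastai.vision.all import *
from PIL import Image

@delegates()
class LRUCachedResize(Resize):
    cache, max_items = OrderedDict(), 5000   # max_items is an arbitrary budget; tune to your RAM
    def encodes(self, x:Image.Image|TensorBBox|TensorPoint):
        key = hash(x.tobytes() if isinstance(x, Image.Image) else x.numpy().tobytes())
        if key in self.cache:
            self.cache.move_to_end(key)          # mark the entry as recently used
            return self.cache[key]
        result = super().encodes(x)              # fall back to the normal Resize
        self.cache[key] = result
        if len(self.cache) > self.max_items:
            self.cache.popitem(last=False)       # evict the least recently used entry
        return result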


Thanks Umer!