Are features/preprocessed data cached across epochs?

Hi,

I understand that the preprocessing steps or transformations are stored in the DataBlock and referenced in the learner, so that they can easily be applied to raw data during online serving.

For example, if we take this DataBlock:

bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    ....
    item_tfms=Resize(128)) 

For training, taking the example of item_tfms=Resize(128): are the raw images resized again for each epoch, or is the resizing done only once before training, with the resized images reused across epochs, thus saving processing time (at the cost of memory/space/a persistent cache)?

If the transformations/feature engineering are run again for every epoch, is there a way of introducing a cache (or caching function) that lets us reuse derived features across epochs?

Thanks,
Rajiv

Hey, looking at the source code, there seems to be no caching.

DataBlock only saves the transforms. These are then applied lazily, on each access, in Datasets:

class Datasets(FilteredBase):
    ...
    def __getitem__(self, it):
        res = tuple([tl[it] for tl in self.tls])
        return res if is_indexer(it) else list(zip(*res))
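
To see this for yourself, here is a self-contained toy sketch (mine, not from the fastai source or docs): indexing into a Datasets re-runs the transform pipeline on every access, so nothing is memoised between items or epochs.

from fastai.data.all import *

calls = 0
def slow_tfm(x):
    "stand-in for an expensive item transform"
    global calls; calls += 1
    return x * 2

dsets = Datasets([1, 2, 3], tfms=[[slow_tfm]])
calls = 0                     # reset: setup may run the transform once to infer types
_ = dsets[0]; _ = dsets[0]    # fetch the same item twice
print(calls)                  # -> 2: the transform ran on both accesses, nothing was memoised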

To get caching, you would have to create a custom transform that does the caching itself, for example:

from fastcore.meta import delegates
from fastai.vision.all import *   # Resize, TensorBBox, TensorPoint
from PIL import Image

@delegates()
class CachedResize(Resize):
    cache = {}
    def encodes(self, x:Image.Image|TensorBBox|TensorPoint):
        # hash the input to get a lookup key
        lookup_key = hash(x.tobytes() if isinstance(x, Image.Image) else x.numpy().tobytes())

        if lookup_key in self.cache:
            # return the cached value if it exists
            return self.cache[lookup_key]
        else:
            # otherwise apply the transform, cache the result and return it
            result = super().encodes(x)
            self.cache[lookup_key] = result
            return result
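
For usage, this is a drop-in replacement for Resize, e.g. in the bears DataBlock from the question; the sketch below is mine, with assumed get_items/get_y. One caveat: the class-level cache dict lives per process, and DataLoader workers are recreated on each pass over the data by default, so an in-memory cache like this mainly pays off with num_workers=0 (or a disk-backed cache).

# Hypothetical usage sketch: get_image_files/parent_label stand in for whatever
# the question's "...." contained, and `path` is assumed to point at the image folder.
bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=CachedResize(128))                  # drop-in replacement for Resize(128)
dls = bears.dataloaders(path, num_workers=0)      # one process, so the in-memory cache survives across epochs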

Note that it’s not always better to cache results: especially when training on GPUs, you typically have plenty of compute but limited memory.


Thanks Umer!

I came up with a different solution that saves the cache to disk. To use it, pass something like this to the data loader: img_cls=CachedPILImage("./pil-resize-cache", Resize(224)). You won’t need the normal Resize item_tfms. Caching the resize sped up training a lot for me, and the solution is fairly straightforward, as long as all of your resized images fit in the cache directory.

import os
import hashlib
from fastai.vision.all import *   # PILImage, Resize

class CachedPILImage(PILImage):
    "Caches the passed transformation of a PILImage"

    def __init__(self, path, item_tfms, cache_args={'format': 'jpeg', 'quality': 100}):
        super().__init__()
        self.path = path
        self.cache_args = cache_args
        self.item_tfms = item_tfms
        # hash of the transform's repr, so changing the transform invalidates the cache
        self.item_tfms_hash = hashlib.sha1(bytes(repr(item_tfms), "ascii")).hexdigest()
        os.makedirs(self.path, exist_ok=True)

    def create(self, fn, **kwargs):
        # cache key derived from the file name plus the transform hash (str() so Path objects work too)
        cached_file = os.path.join(self.path, hashlib.sha1(bytes(str(fn) + self.item_tfms_hash, "ascii")).hexdigest())
        if os.path.exists(cached_file):
            # cache hit: load the already-transformed image from disk
            ret = super().create(cached_file, **kwargs)
        else:
            # cache miss: load the original, apply the transform, and save the result to disk
            ret = super().create(fn, **kwargs)
            ret = self.item_tfms(ret)
            ret.save(cached_file, **self.cache_args)
        return ret
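
For completeness, a usage sketch of my own: it assumes a fastai version whose ImageDataLoaders factory methods accept img_cls (which "pass … to the data loader" above relies on) and an imagenet-style folder at a made-up path.

# Hypothetical setup; "path/to/images" and the from_folder layout (train/valid
# subfolders, one folder per class) are assumptions, not from the post.
dls = ImageDataLoaders.from_folder(
    "path/to/images",
    img_cls=CachedPILImage("./pil-resize-cache", Resize(224)))
# No Resize in item_tfms: CachedPILImage applies it in create() and stores the result,
# so the first epoch fills ./pil-resize-cache and later epochs (and runs) read from it.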