I understand that the preprocessing steps or transformations are stored in the DataBlock and referenced by the Learner, so that they can easily be applied to raw data during online serving.
For training, taking the example of `item_tfms=Resize(128)`: are the raw images resized again for each epoch, or is the resize done only once before training, with the resized images reused across epochs, thus saving processing time (at the cost of memory/disk space for a persistent cache)?
If the transformations/feature engineering are run again for every epoch, is there a way to introduce a cache (or caching function) that allows us to reuse derived features across epochs?
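For context, this is the kind of pipeline I have in mind (a minimal sketch; `path` is assumed to point at a folder of images organised by class):

```python
from fastai.vision.all import *

# standard image-classification DataBlock; `path` is a hypothetical image folder
dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    item_tfms=Resize(128),   # the transform my question is about
)
dls = dblock.dataloaders(path, bs=64)
```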
Hey, looking at the source code, there seems to be no caching.
DataBlock only saves the transforms. These are then applied in Datasets:
```python
class Datasets(FilteredBase):
    ...
    def __getitem__(self, it):
        res = tuple([tl[it] for tl in self.tls])
        return res if is_indexer(it) else list(zip(*res))
```
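In other words, every `__getitem__` call re-runs the transform pipelines on the raw item, and the `item_tfms` are likewise re-applied lazily when the DataLoader assembles each batch, so every epoch pays the transform cost again. A rough check, assuming the `dls` from the question:

```python
# fetching the same item twice rebuilds it from the raw file both times
a, _ = dls.train_ds[0]
b, _ = dls.train_ds[0]
assert a is not b  # two freshly created objects, nothing was returned from a cache
```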
To get caching, you would have to create a custom transform that handles the caching itself, for example:
```python
from fastai.vision.all import *   # Resize, TensorBBox, TensorPoint, delegates, ...
from PIL import Image

@delegates()
class CachedResize(Resize):
    cache = {}   # shared across instances; maps item hash -> resized result
    def encodes(self, x:Image.Image|TensorBBox|TensorPoint):
        # hash the raw bytes of the input to get a lookup key
        lookup_key = hash(x.tobytes() if isinstance(x, Image.Image) else x.numpy().tobytes())
        if lookup_key in self.cache:
            # return the cached value if it exists
            return self.cache[lookup_key]
        else:
            # otherwise apply the transform, cache the result and return it
            result = super().encodes(x)
            self.cache[lookup_key] = result
            return result
```
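You could then pass it wherever you would normally pass `Resize` (an untested sketch, reusing the DataBlock from the question):

```python
dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    item_tfms=CachedResize(128),   # drop-in replacement for Resize(128)
)
dls = dblock.dataloaders(path, bs=64)
```

Keep in mind that with `num_workers > 0` each DataLoader worker process holds its own copy of `cache`, so an in-memory dict may not help as much as you'd expect; a cache persisted to disk might be the better fit there.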
Note that it’s not always better to cache results: especially when using GPUs, you have plenty of compute but limited memory.