I understand that the preprocessing steps or transformations are stored in the DataBlock and referenced in the learner so that they can be easily applied to raw data during online serving.
For training, taking the example of item_tfms=Resize(128), are the raw images resized again for each epoch? Or is the resize done only once before training, with the resized images reused across epochs, saving processing time (at the cost of memory, disk space, or a persistent cache)?
If the transformations or feature engineering are run again for every epoch, is there a way to introduce a cache (or caching function) that lets us reuse derived features across epochs?
Hey, looking at the source code, there seems to be no caching.
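DataBlock only saves the transforms. These are then applied lazily, each time an item is fetched, so the work is repeated every epoch. The type transforms run in Datasets (as far as I can tell, item_tfms such as Resize run a step later, in the DataLoader's after_item pipeline, also on every fetch):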
class Datasets(FilteredBase):
    ...
    def __getitem__(self, it):
        res = tuple([tl[it] for tl in self.tls])
        return res if is_indexer(it) else list(zip(*res))
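To make the "no caching" point concrete, here is a toy sketch in plain Python. It is not fastai code: `CountingTransform` and `LazyDataset` are hypothetical stand-ins for an item transform and a dataset that applies it on every `__getitem__`.

```python
class CountingTransform:
    """Stand-in for an item transform such as Resize; counts how often it runs."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x * 2  # placeholder for the real (expensive) work


class LazyDataset:
    """Applies the transform on every __getitem__, with no memoization."""
    def __init__(self, items, tfm):
        self.items, self.tfm = items, tfm

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        return self.tfm(self.items[i])


tfm = CountingTransform()
ds = LazyDataset([1, 2, 3], tfm)
for epoch in range(2):
    batch = [ds[i] for i in range(len(ds))]

print(tfm.calls)  # 6: three items, each transformed once per epoch
```

With two epochs over three items the transform runs six times, which is the behaviour the question was asking about.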
To get caching, you would have to create a custom transform that does that caching itself.
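from fastai.vision.all import *

@delegates()
class CachedResize(Resize):
    # class-level dict: shared by every CachedResize instance in this process
    cache = {}
    # Caveat: Resize's default crop is random on the training set,
    # so caching freezes whichever crop was computed on the first call.
    def encodes(self, x:Image.Image|TensorBBox|TensorPoint):
        # hash input to get lookup key
        lookup_key = hash(x.tobytes() if isinstance(x, Image.Image) else x.numpy().tobytes())
        if lookup_key in self.cache:
            # return cached value if exists
            return self.cache[lookup_key]
        else:
            # otherwise apply transform, cache result and return it
            result = super().encodes(x)
            self.cache[lookup_key] = result
            return result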
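If an unbounded dict is a concern, a bounded cache is one alternative. The sketch below is plain Python and only illustrates the idea with the standard library's `functools.lru_cache`; `load_and_resize` is a hypothetical function keyed on the file path and target size (which also avoids hashing the pixel data), and wiring it into a fastai pipeline is left open.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive work actually runs


@lru_cache(maxsize=2)  # keep at most two results; least recently used is evicted
def load_and_resize(path, size):
    global calls
    calls += 1
    return f"{path}@{size}"  # placeholder for loading and resizing an image


for p in ["a.jpg", "b.jpg", "a.jpg", "c.jpg", "b.jpg"]:
    load_and_resize(p, 128)

info = load_and_resize.cache_info()
print(info.hits, info.misses)  # 1 4
```

Only the second request for a.jpg is a hit: by the time b.jpg comes around again it has already been evicted. The same thing tends to happen when a full pass over the dataset is larger than the cache, so a bounded cache mostly pays off when it can hold the whole dataset.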
Note, it’s not always better to cache results. Especially when using GPUs, you have a lot of compute and limited memory.
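To get a feel for the memory side of that trade-off, a back-of-the-envelope estimate helps. The sketch below assumes uncompressed 8-bit RGB images and a hypothetical dataset of 100,000 images; both numbers are illustrative and not taken from this thread.

```python
def cache_size_bytes(n_images, height, width, channels=3, bytes_per_value=1):
    """Raw size of n_images decoded images held in memory, ignoring object overhead."""
    return n_images * height * width * channels * bytes_per_value


size = cache_size_bytes(100_000, 128, 128)
print(f"{size / 2**30:.2f} GiB")  # 4.58 GiB
```

The same dataset resized to 224x224 would need roughly three times as much, so whether an in-memory cache is viable depends heavily on the target size and dataset size.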
I came up with a different solution that saves the cache to disk. To use it, pass something like this to the data loader: img_cls=CachedPILImage("./pil-resize-cache", Resize(224)). You won’t need the normal Resize in item_tfms. Caching the resize sped up training a lot for me, and this solution is fairly straightforward as long as all your images fit in the cache directory.
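import hashlib
import os

class CachedPILImage(PILImage):
    "Caches the passed transformation of a PILImage"
    def __init__(self, path, item_tfms, cache_args=None):
        super().__init__()
        self.path = path
        # avoid a mutable default argument; note that JPEG is lossy even at quality 100
        self.cache_args = cache_args or {'format': 'jpeg', 'quality': 100}
        self.item_tfms = item_tfms
        # hash the transform's repr so a different transform gets different cache entries
        self.item_tfms_hash = hashlib.sha1(repr(item_tfms).encode("utf-8")).hexdigest()
        os.makedirs(self.path, exist_ok=True)

    def create(self, fn, **kwargs):
        # str(fn) so that Path objects work too; the key combines file name and transform hash
        key = (str(fn) + self.item_tfms_hash).encode("utf-8")
        cached_file = os.path.join(self.path, hashlib.sha1(key).hexdigest())
        if os.path.exists(cached_file):
            # cache hit: load the already transformed image
            ret = super().create(cached_file, **kwargs)
        else:
            # cache miss: load the original, transform it and save the result
            ret = super().create(fn, **kwargs)
            ret = self.item_tfms(ret)
            ret.save(cached_file, **self.cache_args)
        return ret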
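One thing a disk cache like this has to decide is what goes into the cache key. The sketch below is a standalone, standard-library illustration and not part of the class above: `cache_key` and `cached_apply` are hypothetical helpers that work on raw bytes, add the source file's modification time and size to the key (so an edited source file gets a fresh entry), and write through a temporary file so that parallel workers should not see a half-written entry.

```python
import hashlib
import os
import tempfile


def cache_key(src_path, tfm_repr):
    """Key on absolute path, modification time, size and a transform description."""
    st = os.stat(src_path)
    raw = f"{os.path.abspath(src_path)}|{st.st_mtime_ns}|{st.st_size}|{tfm_repr}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def cached_apply(src_path, transform, tfm_repr, cache_dir):
    """Return transform(contents of src_path), reusing a cached copy on disk if present."""
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, cache_key(src_path, tfm_repr))
    if os.path.exists(cached):
        with open(cached, "rb") as f:
            return f.read()
    with open(src_path, "rb") as f:
        result = transform(f.read())
    # write to a temporary file in the same directory, then rename it into place
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(result)
    os.replace(tmp, cached)
    return result


# Usage: the transform runs once, the second call is served from disk.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "img.bin")
    with open(src, "wb") as f:
        f.write(b"abc")
    calls = []

    def upper(data):
        calls.append(1)
        return data.upper()

    cache_dir = os.path.join(d, "cache")
    first = cached_apply(src, upper, "upper", cache_dir)
    second = cached_apply(src, upper, "upper", cache_dir)
    print(first, second, len(calls))  # b'ABC' b'ABC' 1
```

Old entries are never cleaned up in this sketch, so the cache directory only grows, just as in the class above.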