Alternatively, it may be possible to create an alternating loading mechanism using Nvidia DALI, although we haven’t really done much with that yet.
This got me curious and I played with it a bit. I took a simple image processing pipeline like this:
dsrc = DataSource(items, tfms)
tdl = TfmdDL(dsrc, bs=10, shuffle=True,
             after_item=[Resize(224, method=ResizeMethod.Crop), ToTensor],
             after_batch=[Cuda, ByteToFloatTensor, Normalize(tmeans, tstds)],
             num_workers=8)
And created a DALI pipeline that does the same thing:
class DALIPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(DALIPipeline, self).__init__(batch_size, num_threads, device_id)
        self.input = ops.FileReader(file_root=image_dir, random_shuffle=True)
        self.tfms = compose(
            ops.ImageDecoder(device="mixed"),
            ops.Resize(device="gpu", resize_shorter=224),
            ops.CropMirrorNormalize(device="gpu", crop=(224, 224), mean=means, std=stds)
        )

    def define_graph(self):
        images, labels = self.input(name="Reader")
        return self.tfms(images), labels
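Note that `compose` is not a DALI built-in; it's a small helper (fastai ships one much like it) that chains callables left to right. A minimal sketch of what it does:

```python
def compose(*fns):
    """Chain callables left to right: compose(f, g)(x) == g(f(x))."""
    def _composed(x):
        for f in fns:
            x = f(x)
        return x
    return _composed
```

Since each DALI operator is itself callable on the previous operator's output, composing them this way builds the same graph as calling them one after another.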
The results are encouraging performance-wise: I got a 5x speed improvement (4 sec vs 20 sec for 20k images). Several things contribute to that:
- Transforms are implemented in C++, including a custom JPEG decoder.
- More work is done on the GPU - here, the resizing and cropping of the images.
- The pipeline prefetches batches, which allows the CPU and GPU work to overlap.
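That last point is worth illustrating. This toy sketch (not DALI code, just a simulation with stand-in sleep times) shows the idea: a producer thread does the "CPU" loading into a bounded queue while the main thread does the "GPU" compute, so the two overlap instead of alternating:

```python
import threading
import queue
import time

def prefetch_run(batches, load_time=0.001, compute_time=0.001, depth=2):
    """Simulate prefetching: loading overlaps with compute via a bounded queue."""
    q = queue.Queue(maxsize=depth)

    def producer():
        for b in batches:
            time.sleep(load_time)    # stand-in for CPU work (read/decode/augment)
            q.put(b)
        q.put(None)                  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()

    out = []
    while (b := q.get()) is not None:
        time.sleep(compute_time)     # stand-in for GPU work (forward/backward)
        out.append(b)
    return out
```

With `depth=2` the producer can stay up to two batches ahead, which is roughly what DALI's prefetch queue does for real pipelines.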
The downside is flexibility: if we want to do something not covered by the built-in operators, we have to either implement it in C++, build it, and link it in as a plugin, or use the [Torch]PythonFunction operator. The latter is limited at the moment, though: it can't be used in exec_pipelined mode and, as far as I can tell, can't operate on the GPU at all.
This is again the same pipeline, but using Python to load the files and assign the labels, and DALI operators to decode, resize and normalize the images.
class MixedPipeline(Pipeline):
    def __init__(self, np_items, batch_size, num_threads, device_id):
        super(MixedPipeline, self).__init__(batch_size, num_threads, device_id,
                                            exec_async=False, exec_pipelined=False)
        self.input_iter = iter(DataLoader(np_items, bs=batch_size, create_batch=noop,
                                          shuffle=True, num_workers=num_threads))
        self.path_input = ops.ExternalSource()
        self.y_tfms = compose(
            ops.PythonFunction(extract_label),
            ops.PythonFunction(categorize)
        )
        self.x_tfms = compose(
            ops.PythonFunction(read_file),
            ops.ImageDecoder(device="mixed", output_type=types.RGB),
            ops.Resize(device="gpu", resize_shorter=224),
            ops.CropMirrorNormalize(device="gpu", crop=(224, 224), mean=means, std=stds)
        )

    def define_graph(self):
        self.paths = self.path_input()
        return (self.x_tfms(self.paths), self.y_tfms(self.paths))

    def iter_setup(self):
        self.feed_input(self.paths, next(self.input_iter))
This gave me a smaller, but still respectable 2x speed increase.
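For what it's worth, the comparisons above are simple wall-clock measurements over one pass through the data; a generic sketch (not the exact notebook code) that works for any iterable loader:

```python
import time

def time_epoch(loader, consume=lambda batch: None):
    """Iterate one epoch; return (elapsed seconds, number of batches)."""
    start = time.perf_counter()
    n = 0
    for batch in loader:
        consume(batch)  # optional per-batch work, e.g. a forward pass
        n += 1
    return time.perf_counter() - start, n
```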
The whole notebook is here: