I’ve just started playing with fast.ai v1 and found some performance issues that may be worth fixing in v2:
When using the Pets dataset, loaded via:

```python
win_workers = defaults.cpus
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(),
                                   size=299, bs=bs//2, num_workers=win_workers).normalize(imagenet_stats)
```
I get this result:
```
%time data.show_batch(rows=2, figsize=(5,5))
Wall time: 27.8 s
```
If I then put in the following monkey patch (which reads and restores num_workers on the underlying DataLoader, rather than on the DataBunch), I get a 30x speedup:
```python
def fixed_one_batch(self, ds_type:DatasetType=DatasetType.Train, detach:bool=True,
                    denorm:bool=True, cpu:bool=True)->Collection[Tensor]:
    "Get one batch from the data loader of `ds_type`. Optionally `detach` and `denorm`."
    dl = self.dl(ds_type)
    w = dl.num_workers       # CHANGE: read from and assign to dl explicitly
    dl.num_workers = 0
    try:     x,y = next(iter(dl))
    finally: dl.num_workers = w   # CHANGE: assign to dl explicitly
    if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)
    norm = getattr(self,'norm',False)
    if denorm and norm:
        x = self.denorm(x)
        if norm.keywords.get('do_y',False): y = self.denorm(y, do_x=True)
    return x,y

ImageDataBunch.one_batch = fixed_one_batch
```
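For context, my reading of the root cause (hedged, based on the v1 source): fastai's DeviceDataLoader delegates attribute *reads* to the wrapped DataLoader via `__getattr__`, so the original `self.num_workers = 0` silently created a new attribute on the wrapper instead of disabling the real loader's workers. A minimal sketch with toy classes (not the actual fastai ones):

```python
class Loader:
    num_workers = 8            # stands in for a torch DataLoader

class Wrapper:
    """Stands in for a delegating wrapper like fastai's DeviceDataLoader."""
    def __init__(self, dl):
        self.dl = dl
    def __getattr__(self, k):  # reads fall through to the wrapped loader...
        return getattr(self.dl, k)

w = Wrapper(Loader())
print(w.num_workers)     # 8 -- the read is delegated
w.num_workers = 0        # ...but a plain assignment lands on the wrapper itself
print(w.dl.num_workers)  # 8 -- the real loader was never changed
```

That is why the patch reads and writes `dl.num_workers` directly.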
```
%time data.show_batch(rows=2, figsize=(5,5))
Wall time: 673 ms
```
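The fixed cost being paid on each call is worker-pool startup: because the assignment never reached the real loader, every `iter(dl)` still spun up a full set of worker processes for a single batch. On Windows, multiprocessing can only spawn workers (each child re-imports the parent module), which is far more expensive than Linux's fork. A stdlib-only sketch of that per-call cost (timings are machine-dependent, so none are shown):

```python
import time
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def load_item(i):
    # stand-in for decoding/transforming one image
    return i * i

def fetch_one_batch(n_workers, batch_size=8):
    """Fetch a single 'batch', paying the pool-startup cost on every call,
    just as one_batch does when it builds a fresh iterator."""
    if n_workers == 0:
        return [load_item(i) for i in range(batch_size)]
    ctx = mp.get_context("spawn")  # "spawn" is the only start method on Windows
    with ProcessPoolExecutor(n_workers, mp_context=ctx) as pool:
        return list(pool.map(load_item, range(batch_size)))

if __name__ == "__main__":
    for w in (0, 4):
        t0 = time.perf_counter()
        fetch_one_batch(w)
        print(f"num_workers={w}: {time.perf_counter() - t0:.2f}s")
```

On a spawn-based platform the n_workers=4 call is dominated by process startup, not by the trivial work itself.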
This was on a Windows machine with an i7-8700k CPU (6 cores, 12 threads). The improvement on Linux will probably not be as great (since the overhead of starting processes is lower on that OS), but I would guess it will still be 10x or so.
Also, I would recommend dropping the default num_workers value in the data bunch constructors from defaults.cpus to (maybe) half that. Starting as many worker processes as there are hardware threads is unlikely to be optimal even with large training sets and long epoch times; a lower value will likely give higher throughput. (I found about 6 workers to be optimal when testing across i7-7700k, i7-8700k, and i9-9900k CPUs, using full-resolution ImageNet data and Titan-class GPUs. The optimal choice depends on dataset size, disk read speed, number of processor cores, GPU, etc., but 4-6 seems a more reasonable generic default than maxing this value out.)
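Since the best value is workload-dependent, a rough empirical way to pick it is to time a few batches at each candidate worker count. A hypothetical helper, a sketch rather than anything in fastai (`best_num_workers` and `make_loader` are my names, not library API):

```python
import time

def best_num_workers(make_loader, candidates=(0, 2, 4, 6, 8), n_batches=10):
    """Time fetching `n_batches` batches at each candidate worker count and
    return (fastest_count, {count: seconds}). `make_loader(w)` must build a
    fresh loader (any iterable of batches) configured with num_workers=w."""
    timings = {}
    for w in candidates:
        it = iter(make_loader(w))
        t0 = time.perf_counter()
        for _ in range(n_batches):
            next(it)               # pull batches; includes any startup cost
        timings[w] = time.perf_counter() - t0
    return min(timings, key=timings.get), timings
```

Usage would look like `best, times = best_num_workers(lambda w: DataLoader(ds, num_workers=w))`, run once per machine/dataset combination before committing to a value.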