GPU not utilised, problem with DataLoader

Hi everyone!

I have been trying to implement a Conditional Variational Autoencoder with FastAI v2. I came across a great example in another thread in the forums here. So far, so good!

I am trying to run the original example notebook in Colab, and although I am using a GPU runtime, the GPU is being underutilised, while the CPU is working at 100%; an epoch takes 1:30 minutes, which is a lot compared to what is shown in the example notebook from the thread above (around 20 seconds per epoch). I am training on MNIST and have changed nothing about the notebook, yet I seem unable to make my setup utilise the GPU.

I used W&B to monitor the system stats - the GPU utilisation is consistently under 15%, while the CPU utilisation is at 100%. My intuition is that the model is waiting on the DataLoaders, which run on the CPU, but I’m not an expert in the new API. The relevant code is the following:

mnist = DataBlock(blocks=(ImageBlock(cls=PILImageBWNoised), CategoryBlock, ImageBlock(cls=PILImageTarget)),
                  get_items=get_image_files,
                  get_x=[noop, parent_label],
                  splitter=GrandparentSplitter(train_name='training' if url == URLs.MNIST else 'train',
                                               valid_name='testing' if url == URLs.MNIST else 'valid'),
                  batch_tfms=[AddNoiseTransform(.3), Normalize()])

dls = mnist.dataloaders(path, num_workers=4, bs=4096, device=device)

I encountered some other topics in the forum where people suggested changing the num_workers argument to speed things up. This led me to print dls.num_workers, which returns 1, even though I set it to 4 in the code snippet above.
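One thing worth checking (a generic PyTorch sketch, not fastai-specific): num_workers is stored per loader, so inspecting it on each underlying DataLoader rather than on the wrapper object shows what each one was actually given. In fastai v2, dls.train and dls.valid each wrap their own loader, so dls.train.num_workers (an assumption based on the v2 API) is the attribute to print.

```python
from torch.utils.data import DataLoader

# num_workers is a per-loader attribute; each loader keeps its own value
train_dl = DataLoader(list(range(100)), batch_size=10, num_workers=4)
valid_dl = DataLoader(list(range(100)), batch_size=10, num_workers=0)

print(train_dl.num_workers)  # 4
print(valid_dl.num_workers)  # 0
```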

I am finding this very very confusing - why isn’t the GPU being utilised? What’s up with num_workers and why isn’t it being set correctly?

Any help is appreciated!

Are you able to increase your batch size?


Hi @toad_energy,

I think you might try a smaller batch size. As I understand it, the image decoding and any other default item_tfms happen on the CPU. For the MNIST data this should be light duty, but waiting to queue up 4096 items could be enough of a bottleneck to delay the batches being sent to the GPU.
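One way to confirm that kind of bottleneck is to time batch fetching separately from the compute step. This is a framework-agnostic sketch: fake_batches and model_step are stand-ins for your DataLoader and training step, with sleeps simulating their costs.

```python
import time

def fake_batches(n):                 # stand-in for a DataLoader
    for _ in range(n):
        time.sleep(0.01)             # simulated CPU decode/transform cost
        yield list(range(64))

def model_step(batch):               # stand-in for the forward/backward pass
    time.sleep(0.001)

fetch_time = compute_time = 0.0
it = fake_batches(20)
while True:
    t0 = time.perf_counter()
    batch = next(it, None)
    fetch_time += time.perf_counter() - t0
    if batch is None:
        break
    t0 = time.perf_counter()
    model_step(batch)
    compute_time += time.perf_counter() - t0

print(f"fetch {fetch_time:.2f}s vs compute {compute_time:.2f}s")
```

If the fetch total dominates, the GPU is being starved by data loading, which matches the low-utilisation symptom.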

It looks like @etremblay got about 1:30 per epoch with num_workers=0 and bs=128 for the conditional VAE in the “conditional-full-mnist” notebook, but the other “vae” notebook was where he got the 24 seconds with num_workers=8 and bs=4096. I’m not sure how else the code really differs between the two notebooks, as I am more intimate with the MMD versions. Maybe @etremblay can confirm his experience?

Thanks for bumping this thread. It reminds me that I need to put a notebook together to share back my experience with MMD-VAE on a dataset of bigger/color sneaker images!

good luck,


Hey @toad_energy,

I think num_workers is the main difference, like @ergonyc mentioned. I have been playing with running those notebooks on Windows and in WSL2 Linux. On Windows I usually had problems if I set num_workers greater than 0, but on Linux I could go as high as 8. Using 8 workers makes a huge difference in speed.
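One Windows-specific gotcha worth noting (this is general PyTorch behaviour, not fastai-specific): worker processes are started with spawn rather than fork, so in a script the code that iterates a multi-worker loader must sit under a main guard, or num_workers > 0 fails. A minimal sketch, with make_loader as a hypothetical helper:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(workers):
    # toy MNIST-shaped dataset standing in for the real one
    ds = TensorDataset(torch.randn(128, 1, 28, 28))
    return DataLoader(ds, batch_size=32, num_workers=workers)

if __name__ == "__main__":
    # On Windows (spawn), multi-worker iteration must live under this guard;
    # on Linux (fork) the guard is not strictly required.
    dl = make_loader(workers=2)
    xb, = next(iter(dl))
    print(xb.shape)  # torch.Size([32, 1, 28, 28])
```

Notebooks behave differently again on Windows, which may explain the num_workers > 0 problems there.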

Hi @ilovescience, @ergonyc and @etremblay and thanks for your replies!

I did try increasing the batch size, but that didn’t really make training faster - I think this is because the bottleneck here is related to CPU performance rather than the GPU. Maybe increasing the batch size in conjunction with num_workers would help, but that is the main problem I am facing: for some reason I cannot set num_workers; it’s always 1 when I print it, regardless of what I pass as an argument to the .dataloaders() call. Running the vae notebook with num_workers=8 and bs=4096 in Colab still results in a minute and a half per epoch, because by the looks of it only 1 worker is actually being used.

I forgot to mention I am using fastai v2.3.0. It might be worth looking through the source code for that version to see if I can figure out why num_workers is not being set correctly.

The GPU usage figure you mentioned above may not mean what you expect.
If you mean the utilisation reported by nvidia-smi, I think this thread can help you.

In my limited experience, many factors can affect GPU utilisation when you load data with a DataLoader, such as batch_size, pin_memory, and num_workers. Generally, the larger the batch_size, the higher the utilisation, and setting pin_memory=True can also give an improvement; as for num_workers, you should experiment to find what fits your own dataset and hardware.

I can’t guarantee these will help you, but they worked for me.
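For reference, all three knobs mentioned above are plain torch.utils.data.DataLoader arguments. A minimal sketch (pin_memory is guarded on CUDA availability so it is also meaningful on CPU-only machines):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy MNIST-shaped dataset standing in for the real one
ds = TensorDataset(torch.randn(1024, 1, 28, 28), torch.randint(0, 10, (1024,)))

dl = DataLoader(
    ds,
    batch_size=256,                        # larger batches -> fewer fetches per epoch
    num_workers=2,                         # parallel CPU workers for decoding/transforms
    pin_memory=torch.cuda.is_available(),  # pinned host memory speeds up GPU copies
)

print(dl.batch_size, dl.num_workers)  # 256 2
```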

I had the same problem, apparently loading the data from disk and converting it into a TensorImage takes a lot of time. I solved it by preloading the whole dataset into RAM (I also applied the after_item transform):

def cached_dataloaders(device=None, **kwargs):
  if device is None: device=default_device()
  ds = mnist.datasets(path)
  ds_train, ds_valid = L(map(ToTensor(), ds.train[:])), L(map(ToTensor(), ds.valid[:]))
  return DataLoaders.from_dsets(ds_train, ds_valid, after_batch=mnist.batch_tfms, device=device, **kwargs)

dls = cached_dataloaders(num_workers=2, bs=2048)

There might be a nicer (and more generalized) way to achieve this - please let me know how.

At the cost of more memory usage and an initial runtime penalty, the training itself gains a huge speedup. On Google Colab, training one epoch took less than 10 seconds.
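A framework-agnostic version of the same idea, for anyone not on fastai: pay the expensive per-item work once up front, then train from an in-memory TensorDataset. This is a sketch in plain PyTorch; decode is a stand-in for the real disk read + PIL decode + tensor conversion.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def decode(i):
    # stand-in for the expensive disk read + PIL decode + ToTensor step
    return torch.full((1, 28, 28), float(i % 10) / 10)

# pay the per-item decoding cost once, up front
images = torch.stack([decode(i) for i in range(1000)])
labels = torch.arange(1000) % 10

cached = TensorDataset(images, labels)
dl = DataLoader(cached, batch_size=256, shuffle=True)

xb, yb = next(iter(dl))
print(xb.shape)  # torch.Size([256, 1, 28, 28])
```

After the one-off caching pass, each epoch only moves already-decoded tensors to the GPU, which is why the per-epoch time drops so sharply.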