GPU not utilised, problem with DataLoader

Hi everyone!

I have been trying to implement a Conditional Variational Autoencoder with FastAI v2. I came across a great example in another thread in the forums here. So far, so good!

I am trying to run the original example notebook in Colab, and although I am using a GPU runtime, the GPU is being underutilised, while the CPU is working at 100%; an epoch takes 1:30 minutes, which is a lot compared to what is shown in the example notebook from the thread above (around 20 seconds per epoch). I am training on MNIST and have changed nothing about the notebook, yet I seem unable to make my setup utilise the GPU.

I used W&B to monitor the system stats - the GPU utilisation is consistently under 15%, while the CPU utilisation is at 100%. My intuition is that the model is waiting on the DataLoaders, which run on the CPU, but I’m not an expert in the new API. The relevant code is the following:

mnist = DataBlock(blocks=(ImageBlock(cls=PILImageBWNoised), CategoryBlock, ImageBlock(cls=PILImageTarget)),
                  get_items=get_image_files,
                  get_x=[noop, parent_label],
                  splitter=GrandparentSplitter(train_name='training' if url == URLs.MNIST else 'train', valid_name='testing' if url == URLs.MNIST else 'valid'),
                  batch_tfms=[AddNoiseTransform(.3), Normalize()],
                  n_inp=2)

dls = mnist.dataloaders(path, num_workers=4, bs=4096, device=device)

I encountered some other topics in the forum where people suggested changing the num_workers argument to speed things up. This led me to print dls.num_workers, which returns 1, even though I set it to 4 in the code snippet above.
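
In case it’s useful, this is roughly what I’m printing to check the setup - nothing specific to the notebook, just standard PyTorch/fastai attributes:

import torch

print(torch.cuda.is_available())   # True, so the Colab GPU is visible
print(dls.device)                  # should report the cuda device the batches are sent to
print(dls.train.num_workers)       # prints 1 for me, despite passing num_workers=4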

I am finding this very very confusing - why isn’t the GPU being utilised? What’s up with num_workers and why isn’t it being set correctly?

Any help is appreciated!

Are you able to increase your batch size?


Hi @toad_energy,

I think you might try a smaller batch size. As I understand it, the image decoding and any other default item_tfms happen on the CPU. For the MNIST data this should be light work, but waiting to queue up 4096 items could be enough of a bottleneck to delay the batches being sent to the GPU.

It looks like @etremblay got about 1:30 per epoch with num_workers=0 and bs=128 for the conditional VAE in the “conditional-full-mnist” notebook. But the other “vae” notebook was where he got the 24 seconds with num_workers=8 and bs=4096. I’m not sure how else the code really differs between the two notebooks, as I am more familiar with the MMD versions. Maybe @etremblay can confirm his experience?
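
Something along these lines might be worth a quick test - it’s just the same .dataloaders() call from your post with a smaller batch and the workers set explicitly (untested on my side):

dls = mnist.dataloaders(path, bs=256, num_workers=4, device=device)
print(dls.train.bs, dls.train.num_workers)   # sanity-check what actually got applied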

Thanks for bumping this thread. It reminds me that I need to put a notebook together to share my experience with MMD-VAE on a dataset of bigger/color sneaker images!

good luck,
A


Hey @toad_energy,

I think num_workers is the main difference, like ergonyc mentioned. I have been playing with running those notebooks on Windows and in WSL2 Linux. On Windows I usually had problems if I set num_workers greater than 0, but on Linux I could go as high as 8. Having 8 workers makes a huge difference in speed.
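
If you hit those Windows problems while running a script (rather than a notebook), the usual workaround is the standard PyTorch multiprocessing guard - something roughly like this, where main() is just a placeholder for building the dataloaders and training:

from fastai.vision.all import *

def main():
    dls = mnist.dataloaders(path, bs=128, num_workers=4)
    # ... build the learner and call fit as usual

if __name__ == '__main__':   # required on Windows, where worker processes are spawned
    main()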

Hi @ilovescience, @ergonyc and @etremblay and thanks for your replies!

I did try increasing the batch size, but that didn’t really make training faster - I think this is because the bottleneck is related to CPU performance rather than the GPU. Maybe increasing the batch size in conjunction with increasing num_workers would help, but that is the main problem I am facing - for some reason I cannot set num_workers; it’s always 1 when I print it, regardless of what I pass to the .dataloaders() call. Running the vae notebook with num_workers=8 and bs=4096 in Colab still takes a minute and a half per epoch, because by the looks of it it is in fact using only 1 worker.

I forgot to mention I am using fastai v2.3.0. It might be worth looking through the source code for that version to see if I can figure out why num_workers is not being set correctly.
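
For reference, checking (or pinning) the version in Colab is just:

import fastai
print(fastai.__version__)      # 2.3.0 in my case
# !pip install fastai==2.3.0   # or pip install -U fastai to try a newer release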

The GPU usage you mentioned above does not make sense to me. If you mean the utilization reported by nvidia-smi, I think this thread can help you.

In my limited experience, there are many factors that can affect GPU utilization when you load data with a DataLoader, such as batch_size, pin_memory and num_workers. Generally, the larger the batch_size, the higher the utilization, and setting pin_memory=True can also give an improvement; as for num_workers, you should experiment to find what fits your own dataset and hardware.
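
For a plain PyTorch DataLoader the knobs above look roughly like this (dataset is a placeholder, and the values are only starting points to tune for your own hardware):

from torch.utils.data import DataLoader

loader = DataLoader(dataset,
                    batch_size=256,    # larger batches usually raise GPU utilization, up to memory limits
                    num_workers=4,     # worker processes preparing batches in parallel on the CPU
                    pin_memory=True)   # page-locked host memory speeds up CPU-to-GPU transfers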

I can’t be sure these will help you, but they work for me.

I had the same problem, apparently loading the data from disk and converting it into a TensorImage takes a lot of time. I solved it by preloading the whole dataset into RAM (I also applied the after_item transform):

def cached_dataloaders(device=None, **kwargs):
    if device is None: device = default_device()
    # build the Datasets from the DataBlock, then eagerly run the item transforms
    ds = mnist.datasets(path)
    # materialise every item as tensors in RAM so the workers no longer hit the disk
    ds_train, ds_valid = L(map(ToTensor(), ds.train[:])), L(map(ToTensor(), ds.valid[:]))
    # keep the batch transforms (noise + normalisation) so batches match the original pipeline
    return DataLoaders.from_dsets(ds_train, ds_valid, after_batch=mnist.batch_tfms, device=device, **kwargs)

dls = cached_dataloaders(num_workers=2, bs=2048)

There might be a nicer (and more generalized) way to achieve this - if so, please let me know how.

At the cost of more memory usage and the initial runtime penalty, the training itself gains a huge speedup. On Google Colab, training one epoch took less than 10s.

Thanks for the answer! I have a follow-up question regarding this preloading: is it possible to parallelize it? I tried the “parallel” function in “fastcore”, but unfortunately neither ThreadPoolExecutor nor ProcessPoolExecutor worked.

That’s what the num_workers parameter is for. No need for manual invocation of parallel. See fastai - DataLoaders or the corresponding PyTorch docs.

Thanks a lot for your reply!

As far as I can observe, this preloading is executed once, before the dataloader starts loading data (i.e. num_workers doesn’t help). For example, when I tried to load 23k images, the preloading takes 1.3 min, and with it the dataloader (num_workers=8) takes 0.01 min to load a batch of 256 images. If I don’t use the preloading, loading a batch takes 1.3 min in the dataloader; and when I increased the number of workers to 16, the loading time doubled (waiting for the load to RAM, I guess). This preloading is very helpful for training, where you can reuse the loaded data, but it doesn’t assist much for inference (unfortunately that is my case - I have to predict roughly 5 million images). That’s why I think it would be great to parallelize the preloading.

Indeed, my fastai1 code using ItemList and DataBunch was quite efficient (it took roughly 2.4 min to load and predict 1.18k images), whereas fastai2 (TfmdList and TfmdDL) showed a 3x efficiency drop. So I was curious whether fastai1’s data transfer from SSD to RAM has any hidden parallelization or preloading techniques.

Sorry, I misunderstood your question. You are totally right, num_workers doesn’t affect the preloading.

I’m not sure why parallel doesn’t just work in this case - maybe because you are applying distributed training, and thus the preloading should be done only once (rank1) and not for every GPU. Also, I’m not sure preloading is a good idea if you only want to do inference; you probably want fastai - Learner, Metrics, Callbacks (see the sketch below). Please open a new thread in the forum if there is need for further discussion.
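
If it really is pure inference, a minimal sketch with the standard test_dl/get_preds calls would be something like this (learn and items are placeholder names for your trained Learner and the list of image paths):

test_dl = learn.dls.test_dl(items, bs=256, num_workers=8)   # builds a dataloader that reuses the validation transforms
preds, _ = learn.get_preds(dl=test_dl)                      # runs the model over the whole dataloader on the GPU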

Are you using Paperspace or a large amount of your own data? I am having a similar issue.