Chapter 3: Full MNIST run very slow (not using GPU?)

I used the code below to train on the full MNIST dataset. It runs quite slowly, and when I monitor my GPU with “nvidia-smi -l 1”, “GPU Memory Usage” shows ~5xxMiB but “GPU-Util” fluctuates between 0% and 1%.

Is my code using the GPU at all? Or is it simply bottlenecked on data I/O? If the latter, is there a way to make ImageDataLoaders read ahead more?

Thanks so much for your guidance!

def TransformImageToTensor(item_input):
    # batch_tfms callback: flatten image batches, add a target dimension
    if isinstance(item_input, TensorImageBW):
        # flatten each 28x28 image into a 784-long vector
        return item_input.view(-1,28*28)
    else:
        assert isinstance(item_input, TensorCategory)
        # give targets shape (bs, 1) instead of (bs,)
        return item_input.unsqueeze(1)

full_path = untar_data(URLs.MNIST)

full_dls = ImageDataLoaders.from_folder(full_path,
                                        train='training',
                                        valid='testing',
                                        img_cls=PILImageBW,
                                        batch_tfms=TransformImageToTensor,
                                        bs=1024,
                                        device="cuda")

model = nn.Sequential(
    nn.Linear(28*28,30),
    nn.ReLU(),
    nn.Linear(30,10)).cuda()

def full_mnist_loss(predictions, targets):
    # softmax over the 10 logits
    predictions = predictions.exp()
    predictions = predictions / predictions.sum(dim=1, keepdim=True)
    # negative log-probability of the correct class, averaged over the batch
    one_hot_targets = F.one_hot(targets, num_classes=10).squeeze()
    cross_ent = -torch.mul(one_hot_targets, predictions.log()).sum(dim=1)
    return cross_ent.mean()

def full_batch_accuracy(xb, yb):
    classification = torch.argmax(xb, dim=1).unsqueeze(1)
    correct = classification == yb
    return correct.float().mean()

learn = Learner(full_dls, model, opt_func=SGD,
                loss_func=full_mnist_loss, metrics=full_batch_accuracy)

learn.fit(40, 0.1)

Your code does use the GPU (device="cuda"), so the slowness may indeed come from ImageDataLoaders.from_folder. You could try passing num_workers= with a number greater than 0; increasing num_workers enables parallel data loading, which can help mitigate I/O bottlenecks.

For example:

full_dls = ImageDataLoaders.from_folder(full_path,
                                        train='training',
                                        valid='testing',
                                        img_cls=PILImageBW,
                                        batch_tfms=TransformImageToTensor,
                                        bs=1024,
                                        device="cuda",
                                        num_workers=4)
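
If you want to double-check that the model and the batches actually land on the GPU, something like this should work (just a rough sketch using standard PyTorch attributes and fastai's one_batch helper):

import torch

print(torch.cuda.is_available())          # should be True
print(next(model.parameters()).device)    # should be cuda:0
xb, yb = full_dls.train.one_batch()       # grab a single training batch
print(xb.device, yb.device)               # should both be cuda:0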

Thanks for your response! I tried:

  1. Not setting num_workers
  2. num_workers = 8
  3. num_workers = 32

(I’m on an 8-core CPU with a GTX 1080.)

The first two made no difference in training speed (in fact, it was surprisingly consistent: every epoch took 36 seconds). The last one was slower, with every epoch taking 40 seconds (probably expected, since I only have 8 cores).

Ah well, I guess I’ll learn more ways to optimize later in the course. Thanks again for your help!


Another possibility is that the batch size (bs=1024) is too large for your GPU, which could limit GPU utilization. You could try reducing it to a smaller value, such as 128 or 256, and see whether that gives better GPU utilization and faster training.

You can also enable the pin_memory option in ImageDataLoaders.from_folder. Setting pin_memory=True puts batches in page-locked (pinned) host memory, which can speed up the CPU-to-GPU transfer.

full_dls = ImageDataLoaders.from_folder(full_path,
                                        train='training',
                                        valid='testing',
                                        img_cls=PILImageBW,
                                        batch_tfms=TransformImageToTensor,
                                        bs=256, # reducing batch size
                                        num_workers=4,  # Increase the number of workers
                                        pin_memory=True,  # Enable pin_memory
                                        device="cuda")

Thanks Kamui! I tried lowering the batch size and enabling pin_memory, but neither raised GPU-Util.


Sorry, I’m out of ideas :confused:

I don’t have much experience running things locally, but once, when I tried training a model on a friend’s PC, it used the GPU built into the processor instead of the NVIDIA GPU, and I never solved that problem. Perhaps you should check whether you have the same issue and, if so, find a way to use the dedicated GPU.
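
One quick way to check which device PyTorch actually sees (a rough sketch using standard PyTorch calls; an integrated GPU would not show up as a CUDA device):

import torch

print(torch.cuda.is_available())    # False means no usable NVIDIA/CUDA GPU
print(torch.cuda.device_count())    # number of CUDA devices visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # e.g. the GTX 1080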


To close the loop: data I/O was indeed the root cause of the slowness and the low GPU usage. Loading all the data into memory up front solved the problem.

# Build the DataLoaders once, only to read and decode the images on disk
full_dls = ImageDataLoaders.from_folder(full_path,
                                        train='training',
                                        valid='testing',
                                        img_cls=PILImageBW,
                                        device="cuda")

# Materialize every batch into memory
train_data = [data for data in full_dls[0]]
test_data = [data for data in full_dls[1]]

# Concatenate the batches into single tensors and rebuild in-memory datasets
train_x = torch.cat([x for x,_ in train_data], dim=0)
train_y = torch.cat([y for _,y in train_data], dim=0)
train_dset = list(zip(train_x.squeeze(),train_y.squeeze()))

test_x = torch.cat([x for x,_ in test_data], dim=0)
test_y = torch.cat([y for _,y in test_data], dim=0)
test_dset = list(zip(test_x.squeeze(),test_y.squeeze()))

# Plain DataLoaders over the in-memory datasets - no disk reads during training
train_dl = DataLoader(train_dset, batch_size=256)
test_dl = DataLoader(test_dset, batch_size=256)
full_dls = DataLoaders(train_dl, test_dl, device="cuda")

class ToBWLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # flatten each 28x28 image into a 784-long vector
        assert x.shape[-2:] == (28, 28)
        return x.view(-1,28*28)

model = nn.Sequential(
    ToBWLayer(),
    nn.Linear(28*28,30),
    nn.ReLU(),
    nn.Linear(30,10)).cuda()

def full_mnist_loss(predictions, targets):
    predictions = predictions.exp()
    predictions = predictions / predictions.sum(dim=1, keepdim=True)
    one_hot_targets = F.one_hot(targets, num_classes=10)
    cross_ent = -torch.mul(one_hot_targets, predictions.log()).sum(dim=1)
    return cross_ent.mean()

def full_batch_accuracy(xb, yb):
    classification = torch.argmax(xb, dim=1)
    correct = classification == yb
    return correct.float().mean()

learn = Learner(full_dls, model, opt_func=SGD,
                loss_func=full_mnist_loss, metrics=full_batch_accuracy)

learn.fit(40, 0.1)

All epochs now recorded 0 seconds, and GPU-Util stayed around 20% for the whole run.
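
For anyone hitting the same issue: before restructuring everything, a quick way to confirm that data loading is the bottleneck is to time a bare pass over the original dataloader with no model involved (rough sketch, using the ImageDataLoaders built directly from the folder):

import time

start = time.time()
for xb, yb in full_dls.train:    # just iterate, no training
    pass
print(f"one pass over the training data: {time.time() - start:.1f}s")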
