Big data, databunch and training loop

I am thinking of moving an NLP project to fastai code. I was wondering if I should expect memory problems when creating a DataBunch. My training data consists of almost 4 million documents, around 100 GB in total.

In my current repo (standard PyTorch, using BERT) this reliably results in a memory error, so I am obliged to train on blocks of data. At the moment we have modified the training loop so that each epoch runs over every block before the next epoch starts (which means we need to play around with the learning-rate updater and the optimizer).
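To make that concrete, here is a rough sketch of the block-wise loop we use today (plain PyTorch; block_paths, load_block, model, criterion, optimizer and scheduler are placeholders for our actual setup, the point is only the structure):

from torch.utils.data import DataLoader

for epoch in range(n_epochs):
    for path in block_paths:                      # one block at a time so the data fits in RAM
        block_ds = load_block(path)               # build a Dataset from this block's documents
        dl = DataLoader(block_ds, batch_size=bs, shuffle=True)
        for xb, yb in dl:
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
            scheduler.step()                      # the LR schedule has to span all blocks of an epoch
            optimizer.zero_grad()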

Does that approach make sense? What would be the most Fastai-like way to work with such large datasets?

If your texts are stored in different files, they are lazily loaded (even for a language model) in fastai v2 so there is no OOM problem. There is no easy solution in current v1, we solved that problem while developing v2.
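For illustration, building a language-model databunch from a folder of text files in v2 looks roughly like this (a sketch against the current dev API, so names may still move; path points at the folder of files):

texts = get_text_files(path)
splits = RandomSplitter(valid_pct=0.1)(texts)
tfms = [Tokenizer.from_folder(path), Numericalize()]

dsrc = DataSource(texts, [tfms], splits=splits, dl_type=LMDataLoader)
dbunch_lm = dsrc.databunch(bs=64, seq_len=72)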


Haha, dang, that's kind of a bittersweet answer.
Is version 2 coming out anytime soon? :sweat_smile:

It's feature-complete and in the pre-alpha stage. You can find a full example of ULMFiT here.


This does not seem to work for me with v2. The LMDataLoader reads in every file from disk when it is instantiated, which happens when the databunch is created in the ULMFiT example. It seems to me that the LMDataLoader builds a list with all items of the dataset and executes all the transforms. This happens at least when it creates the ReindexCollection in __init__. Is this the intended behavior?

I tried the same approach with a Wikipedia dump of 800k files, and it takes forever to create the databunch and consumes gigabytes of memory.

You can't create an LMDataLoader without the length of every item, so yes, there is a first pass over the whole dataset unless you provide those lengths at creation. On a large dataset, you should compute them once and for all using the parallel command, then save them.
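A rough sketch of that (assuming fastcore's parallel and the lens argument of LMDataLoader; dsrc here stands for your DataSource and index 0 for the numericalized text):

from fastcore.utils import parallel
import torch

def text_len(i):
    return len(dsrc.train[i][0])                  # length of the numericalized text of item i

lens = list(parallel(text_len, range(len(dsrc.train)), n_workers=8))
torch.save(lens, 'lens.pkl')                      # compute once, reload on later runs
dl = LMDataLoader(dsrc.train, lens=lens, bs=64, seq_len=72)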

Thanks for the explanation about the lengths. The problem I see is that LMDataLoader already does one pass over the dataset before calculating the lengths, when it picks out the first element of the tuples: self.items = ReindexCollection([(o[0] if isinstance(o, tuple) else o) for o in dataset], cache=cache). This loads the whole dataset as tensors into memory, which can be huge. It does this so that Chunks works.

I tried a modified version of LMDataLoader without the creation of this new list, passing it a TfmdList instead of a DataSource, which does not have the tuples. But TfmdLists do not seem to be intended to work with data loaders, as e.g. show_batch() breaks. This is because the DataLoader creates tuples which have to be dealt with separately during decoding, and TfmdLists are not capable of doing this.

Any thoughts on this?

No, at this stage the elements of the dataset are filenames, not tensors (or they should be, I'll check). They are read lazily by the LMDataLoader (and the cache is there so that when you stop partway through one text, the result is cached for the next sample, which will start in the same text).

Edit: it did not do this lazily indeed, should be fixed now.


Hi @sgugger,

The example does not use the lr_find methodology from v1. What is the reason for that?
Apologies if the relevant documentation exists somewhere. (If it does, would you mind pointing to it?)

Thanks for the quick fix and your great work!

Hello @sgugger!

I am trying to use fastai2 with a very large dataset for ULMFiT, counting on lazy loading during iteration.

The problem is that there seems to be a memory leak in the DataLoader when setting num_workers > 0 while creating the databunch.

If I keep the default (num_workers=0), training is much slower and GPU usage is inefficient. However, if I set num_workers > 0, training speeds up a lot, but memory keeps bloating until I get an OOM.

It seems to be a known issue with PyTorch, but there are some workarounds (https://github.com/pytorch/pytorch/issues/13246).

Any suggestions?

Best regards,
Fabio.

The loader uses numpy arrays, so I have no idea where this leak comes from. You should try the version in v2, which is more memory-efficient (and hopefully not leaking).

Thanks.

I will try some workarounds and see what works best.

With num_workers = 0, GPU utilization fluctuates between 0 and 100% and each epoch takes ~2x as long as training with num_workers = 2. With num_workers = 2, it stays between 90% and 100% most of the time, roughly halving the epoch time. With num_workers > 2 there is no further speed gain and the problem (memory usage gradually increasing after each batch) is reduced. At the end of each epoch the memory is released, and then it starts accumulating again.

On the PyTorch GitHub there is an old open issue about this, with a simple example that reproduces the problem (https://github.com/pytorch/pytorch/issues/13246):

from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch


class DataIter(Dataset):
    def __init__(self):
        self.data_np = np.array([x for x in range(240000000)])  # same data as a NumPy array (not used below)
        self.data = [x for x in range(240000000)]                # plain Python list of ~240M ints

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        data = np.array([data], dtype=np.int64)
        return torch.tensor(data)  # returning a tensor: memory grows with num_workers > 0


train_data = DataIter()

train_loader = DataLoader(train_data, batch_size=300,
                          shuffle=False,
                          drop_last=True,
                          pin_memory=False,
                          num_workers=12)

for i, item in enumerate(train_loader):
    if i % 1000 == 0:
        print(i)

In this example, with num_workers > 0, memory consumption keeps rising fast until the end of the iterations.

Changing __getitem__ to return the NumPy array instead, the issue is gone and memory consumption stays relatively stable during the whole iteration:

    def __getitem__(self, idx):
        data = self.data[idx]
        data = np.array([data], dtype=np.int64)
        return data  # returning a NumPy array keeps memory stable

The catch is that with multiple workers training is ~2x faster. I am trying to fine-tune a language model on a dataset of 20 million texts. With num_workers = 0 and a batch size of 192, each epoch takes around 11 hours. With num_workers >= 2, I get a pace that would bring it under 5 hours per epoch, but I always run into the OOM issue.
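For reference, one workaround commonly suggested in that issue is to keep the backing store as one contiguous NumPy array rather than a Python list, so the worker processes never touch per-object refcounts; a minimal sketch along the lines of the example above, still returning tensors:

import numpy as np
import torch
from torch.utils.data import Dataset

class ArrayBackedIter(Dataset):
    def __init__(self, n=240000000):
        self.data = np.arange(n, dtype=np.int64)  # one contiguous array, no per-item Python objects

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # slice + copy so the returned tensor does not alias the shared array
        return torch.from_numpy(self.data[idx:idx + 1].copy())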

Ok, then it should be good in v2 I believe, as there is no persistent array.

I am using the latest fastcore and fastai2.

My code is:

tfms = [attrgetter("text"), Tokenizer.from_df(1), Numericalize()]  # pipeline: grab the text column, tokenize, numericalize
splits = RandomSplitter(valid_pct=0.1, seed=42)(df)                # 90/10 train/valid split

dsrc = DataSource(df, [tfms], splits=splits, dl_type=LMDataLoader) # use the language-model DataLoader

dbunch_lm = dsrc.databunch(bs=bs, seq_len=sl, val_bs=bs, after_batch=Cuda, num_workers=2)

learn = language_model_learner(dbunch_lm, 
                               AWD_LSTM, 
                               config, 
                               pretrained=False, 
                               opt_func=opt_func,
                               pretrained_fnames = [weights, vocab],
                               metrics=[accuracy, Perplexity()],
                               path=lm_path
                              )

Then I don't know what to say: there is no persistent array containing the data, it's exactly like your second example with __getitem__.

Edit: ah, I get it, the tensor is the problem. The issue is that we need a tensor, otherwise we can't use a language model. So this is not a fix…

Thanks. I will dig a bit more to try to understand what is going on.

The problem is severely reduced if shuffle_train is set to False in the databunch. How bad is it for the quality of training to turn shuffling off while fine-tuning a language model? Without this option the workers end up eating all 32 GB of physical memory and 32 GB of swap, resulting in an OOM. With shuffle_train turned off, memory usage starts at 12 GB of physical memory at the beginning of an epoch and ends the epoch at ~19 GB. When the epoch ends it goes back to 12 GB.

dbunch_lm = dsrc.databunch(bs=bs, seq_len=sl, val_bs=bs, after_batch=Cuda, num_workers=2, pin_memory=True, shuffle_train=False)

Interesting. If your dataset is so huge, I doubt shuffling the training set will help, so you can definitely try without.
We’ll investigate this memory leak when we have some time, in any case.