NLP speed-up if using SortedDL

morgan · July 9, 2020, 12:06pm

Might be obvious to many, but just wanted to emphasise the functionality in SortedDL to speed up your development iteration time!

`SortedDL`

Summary
When initialising a dataloader that uses SortedDL with 160,000 rows of text, passing a list of text lengths to res cut the dataloader initialisation time from 96s to ~30s. (And we can get this down to less than 1 second with a modification to SortedDL, keep reading!)

Details
This might be obvious to many experienced NLP users, but I just want to highlight the speed-up you can gain by passing a list of sort keys to res when using SortedDL. This avoids the init iterating over every element in your dataset to determine how to sort them.
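To make the saving concrete, here is a toy sketch in plain Python. It is not fastai code: CountingDataset and get_sort_keys are hypothetical names made up for illustration, and whitespace splitting stands in for real tokenisation. It only shows that when the sort keys are supplied up front, no item of the dataset needs to be read.

```python
# Toy stand-in for a dataset that counts how often an item is read.
# Not fastai code: it only illustrates why precomputed keys are cheaper.
class CountingDataset:
    def __init__(self, items):
        self.items = items
        self.n_reads = 0

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        self.n_reads += 1
        return self.items[i]


def get_sort_keys(dataset, res=None):
    """Return one sort key per item, reading the dataset only if res is missing."""
    if res is not None:
        return list(res)
    # Fallback: touch every item to measure its length (the slow path).
    return [len(dataset[i].split()) for i in range(len(dataset))]


texts = ["The cat is running", "The mouse hides",
         "The sparrowhawk watches both of them closely"]

slow = CountingDataset(texts)
keys_slow = get_sort_keys(slow)                  # reads every item once

fast = CountingDataset(texts)
keys_fast = get_sort_keys(fast, res=[4, 3, 7])   # reads no items at all

print(slow.n_reads, fast.n_reads)
```

In a real dataset each read also runs your transforms, which is where most of the initialisation time goes.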

I found this speed-up invaluable in the early stages of working with a new dataset, transforms, dataloader or model, when you are continuously iterating, fixing, restarting your notebook etc.

Example

More concretely, the default SortedDL behaviour is to sort by text length. Say you have 3 text samples in your dataset:

"The cat is running"
"The mouse hides"
"The sparrowhawk watches both of them closely"

And their lengths are:

text_lens = [4, 3, 7]
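A minimal sketch of how such a list could be computed once, up front. It assumes whitespace splitting is a good enough proxy for length; in practice you would probably want the lengths your own tokeniser produces, so treat the split() call as a placeholder.

```python
texts = [
    "The cat is running",
    "The mouse hides",
    "The sparrowhawk watches both of them closely",
]

# Assumption: length is measured in whitespace-separated words.
# Swap in your own tokeniser if your pipeline measures length differently.
text_lens = [len(t.split()) for t in texts]
print(text_lens)
```

The list has to be in the same order as the items in your training dataset, since each entry is used as the sort key for the item at the same position.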

Now instead of initialising your dataloaders like so:

dls = dsets.dataloaders(..., dl_type = SortedDL)

You pass your list of text_lens to res as part of specifying your SortedDL:

dls = dsets.dataloaders(..., dl_type=partial(SortedDL, res=text_lens))

Note: text_lens should only be the lengths of your training dataset, keep reading for validation set speedup.

Faster again, Validation set speedup

Now the above offers a speed-up by avoiding iterating over the training set, but res doesn’t help us when it comes to the validation set: the init will still iterate over all elements of the validation set.

Using dl_kwargs
We pass in a val_res list via dl_kwargs. val_res is just a list of the text lengths in our validation dataset. To make this work we also have to modify the new method in SortedDL like so:

def new_srtd_dl(self, dataset=None, **kwargs):
    # Use val_res when it was passed via dl_kwargs; .get avoids a KeyError when it is absent
    if kwargs.get('val_res') is not None: res = kwargs['val_res']
    else: res = self.res if dataset is None else None
    return TfmdDL.new(self, dataset=dataset, res=res, **kwargs)

# replace new method in SortedDL
SortedDL.new = new_srtd_dl 

# Pass res to SortedDL for our training dataset text lengths 
srtd_dl=partial(SortedDL, res = text_lens)

# Pass val_res to dl_kwargs for our validation dataset text lengths
dl_kwargs = [{},{'val_res': val_text_lens}]

# init our Datasets 
dsets = Datasets(...)   

# init our Dataloaders
dls = dsets.dataloaders(...,dl_type = srtd_dl, dl_kwargs = dl_kwargs)

Now, with a text dataset of 160,000 samples, the above brings the init time down from 96s to less than 1s!
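One way the two length lists could be built, sketched in plain Python. It assumes you already have one length per row of the full dataset and that your split is a pair of index lists; all_lens, train_idx and val_idx are hypothetical names and values for illustration. It also assumes the training and validation subsets keep the order of those index lists, which is worth checking for your own setup.

```python
# Hypothetical lengths for every row of the full dataset, computed once.
all_lens = [4, 3, 7, 5, 2]

# Hypothetical split: index lists for the training and validation rows.
train_idx, val_idx = [0, 2, 4], [1, 3]

# Keep each list in the same order as the corresponding subset.
text_lens = [all_lens[i] for i in train_idx]
val_text_lens = [all_lens[i] for i in val_idx]

print(text_lens, val_text_lens)
```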

Pull Requests

I added a couple of PRs here, happy for any comments.

SortedDL documentation update only, PR #419

SortedDL modification of SortedDL.new for val_res and doc update, PR #420

riven314 · July 9, 2020, 1:33pm

excellent hack and explanation!