Might be obvious to many, but I just wanted to emphasise the functionality in `SortedDL` that can speed up your development iteration time!
Summary
When initialising a dataloader that uses `SortedDL` with 160,000 rows of text, passing a list of text lengths to `res` took the dataloader initialisation time from 96s down to ~30s. (And we can get this down to less than 1 second with a small modification to `SortedDL`, keep reading!)
Details
This might be obvious to many experienced NLP users, but I just want to highlight the speed-up you can gain by passing a list of sort keys to `res` when using `SortedDL`. This avoids the init iterating over every element in your dataset to determine how to sort them.
I found this speed-up invaluable in the early stages of working with a new dataset/transforms/dataloader or model, when you are continuously iterating, fixing, restarting your notebook etc.
Example
More concretely, the default `SortedDL` behaviour is to sort by text length. Say you have 3 text samples in your dataset:

- "The cat is running"
- "The mouse hides"
- "The sparrowhawk watches both of them closely"

And their lengths (in words) are:

```python
text_lens = [4, 3, 7]
```
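As an aside on what these keys are used for: `SortedDL` uses them to order samples so that similarly sized texts land in the same batch (roughly longest first when not shuffling). A tiny illustrative sketch of that ordering, not the library's actual implementation:

```python
# Illustrative only, not SortedDL's actual implementation:
# order sample indices from longest to shortest text
order = sorted(range(len(text_lens)), key=lambda i: text_lens[i], reverse=True)
print(order)  # [2, 0, 1] -> sparrowhawk, cat, mouse
```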
Now instead of initialising your dataloaders like so:

```python
dls = dsets.dataloaders(..., dl_type=SortedDL)
```

you pass your list of `text_lens` to `res` as part of specifying your `SortedDL`:

```python
from functools import partial

dls = dsets.dataloaders(..., dl_type=partial(SortedDL, res=text_lens))
```
Note: `text_lens` should only contain the lengths of your training dataset; keep reading for the validation set speed-up.
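How you build `text_lens` is up to you; the only requirement is one length per training sample, in the same order as the dataset. Here is a minimal sketch assuming your training texts are already tokenised into lists of tokens (`tokenised_texts` is a hypothetical name, not a fastai API):

```python
# Hypothetical example: one length per training sample, in dataset order
tokenised_texts = [
    ["The", "cat", "is", "running"],
    ["The", "mouse", "hides"],
    ["The", "sparrowhawk", "watches", "both", "of", "them", "closely"],
]
text_lens = [len(toks) for toks in tokenised_texts]  # -> [4, 3, 7]
```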
Faster again: validation set speed-up
Now the above offers a speed-up by avoiding iterating over the training set, but `res` doesn't help us when it comes to the validation set: the validation dataloader is created via `SortedDL.new`, which doesn't receive our `res` list, so its init will still iterate over all elements of the validation set.
Using dl_kwargs
We pass in a `val_res` list via `dl_kwargs`; `val_res` is just a list of the text lengths in our validation dataset. To make this work we also have to modify the `new` method of `SortedDL`, like so:
```python
from functools import partial
from fastai.text.all import *

def new_srtd_dl(self, dataset=None, **kwargs):
    # Use val_res as the sort keys if it was passed, so the init
    # doesn't have to iterate over the (validation) dataset
    if kwargs.get('val_res') is not None: res = kwargs['val_res']
    else: res = self.res if dataset is None else None
    return TfmdDL.new(self, dataset=dataset, res=res, **kwargs)

# Replace the new method in SortedDL
SortedDL.new = new_srtd_dl

# Pass res to SortedDL for our training dataset text lengths
srtd_dl = partial(SortedDL, res=text_lens)

# Pass val_res via dl_kwargs for our validation dataset text lengths
dl_kwargs = [{}, {'val_res': val_text_lens}]

# Init our Datasets
dsets = Datasets(...)

# Init our DataLoaders
dls = dsets.dataloaders(..., dl_type=srtd_dl, dl_kwargs=dl_kwargs)
```
Now, with a text dataset of 160,000 samples, the above brings the init time down from 96s to less than 1s!
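If you want to measure the effect on your own dataset, here's a rough timing sketch (the `...` placeholders stand in for your own items, transforms and splits, as above):

```python
import time

t0 = time.perf_counter()
dls = dsets.dataloaders(..., dl_type=srtd_dl, dl_kwargs=dl_kwargs)
print(f"DataLoaders init took {time.perf_counter() - t0:.2f}s")
```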
Pull Requests
I added a couple of PRs here, happy for any comments:

- `SortedDL` documentation update only: PR #419
- Modification of `SortedDL.new` for `val_res` plus doc update: PR #420