Feature suggestion: Timing a Learner object

When creating a learner in fastai, it could be useful to know how long it will take to train/predict, and how that time is split across the pipeline. It would help answer questions like ‘is it worth putting my data augmentation on the GPU’ or ‘is this new part of the model too slow’.

Here’s how I imagine it: a new method on the Learner class called time_batch (the docstring would eventually be moved to the docs):

def time_batch(self, ds_type:DatasetType=DatasetType.Train, num_batch=10):
    '''
    Process several batches of data and give an estimate of the average time
    spent in the different steps of the pipeline.
    '''
    ...

learn.time_batch()
would then return a summary of the average time taken to process a batch, something along the lines of:

  • Dataset access: 10 ms per batch
  • Dataset transform: 1 s per batch
  • Batch collate: 10 ms per batch
  • Dataloader transform: 1 ms per batch
  • Forward pass:
    • encoder: 10 ms per batch
    • decoder: 50 ms per batch
    • loss: 1 ms per batch
  • Backward pass:
    • loss: 10 ms per batch
    • decoder: 50 ms per batch
    • encoder: 10 ms per batch
  • Optimizer step: 50 ms per batch
  • Total time: 1.202 s
    Longest step: Dataset transform

In such an example, one could see at a glance that the data transformation is what takes the most time, and that the pipeline could probably be sped up if that transformation were done in the dataloader, perhaps on the GPU.

What do you think of such a feature? Do you think it would be worth implementing in fastai?


Some sort of help to profile things would definitely be welcome. However, we won’t implement it ourselves, as we are busy with the development of v2. Always happy to look at a PR, though :wink:

I have given this some thought, and I think it is possible to do with a reasonable amount of work in fastai.

Here’s the idea of how I would like to implement this:

  • You create a function that modifies an object or a class in place to time one of its methods (for example, forward for a PyTorch nn.Module, or __getitem__ for a dataset). How is this done? You override the original class/instance method with a wrapper that records how long each call takes (a minimal sketch follows this list).
  • You call this on every part of the Learner you want to monitor.
  • You run a few batches through the Learner.
  • You save the times for later.
  • You restore every original method on the instances and classes.
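
A minimal sketch of the wrapping idea (the helper name _timed is mine, for illustration only):

import time
from functools import wraps

def _timed(fn, records):
    "Wrap fn so that each call appends its duration (in seconds) to records."
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        records.append(time.perf_counter() - start)
        return result
    return wrapper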

Doing it this way requires absolutely no changes to the existing fastai library to time the Learner, and it works the same way regardless of what you’re trying to monitor.

After some tests, I’ve managed to get this idea to work, and so far I’ve created 4 functions to help in this process.

  • time_method(x, method), which modifies x in place to replace its method method with a new one that records the elapsed time in a new attribute of x, f'time_of_{method}', while saving the old method in another new attribute of x (a sketch of this pair follows the list)
  • untime_method(x, method), which deletes the new attributes and puts the old method back in place
  • time_method_class(x, method), which modifies the class of x in place to replace its method with one that also records the elapsed time in a new attribute of x (it also creates that attribute on x)
  • untime_method_class(x, method), which puts the original method back in place and deletes the timing attribute of x
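
Here is a rough sketch of what the instance-level pair could look like (the exact bookkeeping is my guess, not the final implementation):

import time
from functools import wraps

def time_method(x, method):
    "Patch x in place so that every call to x.method records its duration."
    orig = getattr(x, method)                 # bound method
    records = []
    @wraps(orig)
    def timed(*args, **kwargs):
        start = time.perf_counter()
        result = orig(*args, **kwargs)
        records.append(time.perf_counter() - start)
        return result
    setattr(x, f'orig_{method}', orig)        # save the old method, as described above
    setattr(x, f'time_of_{method}', records)  # per-call durations
    setattr(x, method, timed)                 # shadows the class attribute

def untime_method(x, method):
    "Undo time_method: restore the original method and drop the records."
    delattr(x, method)                        # remove the shadowing wrapper
    delattr(x, f'orig_{method}')
    delattr(x, f'time_of_{method}')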

The reason some modifications happen on instances and others on classes is that some magic methods like __getitem__ are looked up at the class level, and as a result you can’t override them at the instance level.
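
A quick illustration of that Python behaviour:

class A:
    def __getitem__(self, i): return i

a = A()
a.__getitem__ = lambda i: 2 * i   # instance-level override is ignored...
print(a[3])                       # ...prints 3: a[3] looks up __getitem__ on type(a)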

So, for example, if you want to time the data loading process, you only have to do:

time_method_class(learn.data.train_dl.dl.dataset.x, '__getitem__')
time_method_class(learn.data.train_dl.dl.dataset.y, '__getitem__')
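
After running a few batches, the recorded durations could then be read back from the new attributes, e.g. (assuming the f'time_of_{method}' attribute naming sketched above):

import statistics
ds_x = learn.data.train_dl.dl.dataset.x
print(f'{statistics.mean(ds_x.time_of___getitem__) * 1000:.2f} ms per item')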

I have encountered an issue related to multiprocessing, which I’m not very familiar with: if multiprocessing is active (num_workers != 0), then the time_of_method___getitem__ attributes of the x and y datasets are not updated during training (this works nicely with num_workers=0).

What are your thoughts on this way of implementing time tracking? And do you have an idea of what I would need to change so this still works with multiprocessing on?

Update on the multiprocessing issue: one solution might be to use a manager,

import multiprocessing
manager = multiprocessing.Manager()

and replace the list that tracks the time of each execution with a manager.list(). This works in the sense that every time I call next on iter(learn.data.train_dl), my list grows by 64 (my bs in my test situation), but when calling iter(learn.data.train_dl.dl), my list grows to size 1024 (as if __getitem__ were called 1024 times by PyTorch during the _DataLoaderIter call). If anyone has an idea why this happens, I would be glad to hear about it :grin: If not, I should figure it out shortly. (A sketch of the manager-based variant follows.)
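
For reference, here is how the class-level helper could be adapted to use a manager-backed list (a sketch with a hypothetical name, assuming workers are started with fork so they inherit the patched class):

import time
import multiprocessing
from functools import wraps

manager = multiprocessing.Manager()

def time_method_class_mp(x, method):
    "Like time_method_class, but records into a manager.list() shared with workers."
    cls = type(x)
    orig = getattr(cls, method)
    records = manager.list()              # proxy: appends go to the manager process
    @wraps(orig)
    def timed(self, *args, **kwargs):
        start = time.perf_counter()
        result = orig(self, *args, **kwargs)
        records.append(time.perf_counter() - start)
        return result
    setattr(cls, f'orig_{method}', orig)
    setattr(x, f'time_of_{method}', records)
    setattr(cls, method, timed)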

Edit: new update. After some digging in the torch code and some experimentation, I confirmed that iter(learn.data.train_dl.dl) calls __getitem__ on the dataset multiple times (1024 times, to be exact, in my case). This seems to come from the call to _put_indices at initialization: the iterator prefetches up to 2 * num_workers batches up front, which would explain 1024 calls if, for instance, num_workers=8 with bs=64 (2 * 8 * 64 = 1024).

def _put_indices(self):
    assert self.batches_outstanding < 2 * self.num_workers
    indices = next(self.sample_iter, None)
    if indices is None:
        return
    self.index_queues[self.worker_queue_idx].put((self.send_idx, indices))
    self.worker_queue_idx = (self.worker_queue_idx + 1) % self.num_workers
    self.batches_outstanding += 1
    self.send_idx += 1

Thoughts are still welcome!