40+% faster training with a scikit-learn-like API for numpy arrays

TLDR: fastai2 is great, but if you use numpy arrays as input it may be slower than fastai v1. However, if you use NumpyDatasets/ NumpyDataLoader your training will be 33% faster than v1 and 40+% faster than v2.

During the last couple of weeks, I’ve been porting timeseriesAI to fastai v2.

timeseriesAI is a Practical Deep Learning for Time Series / Sequential Data package I built on top of fastai v1. It takes X (and y) numpy arrays as input.

The new package can be easily installed from pip (`pip install tsai`).

I’ve learned a few things in this process I’d like to share with you.

I think fastai v2 is great. It has many advantages over v1 (like batch tfms, fine-tuning, callbacks, etc). I also think the layered approach allows us to adapt the library to our needs. I used the vanilla datablock API and it worked fine. It’s easy to use and very flexible. :heart:

The fastai ecosystem is superb! nbdev is super useful: a great tool to develop, test and distribute code. Thanks to nbdev I’ve built my first ever pip package. I couldn’t believe how easy it was.

And fastcore has some very useful code too.

Thanks a lot Jeremy and Sylvain for putting all this together! :clap::clap::clap:

When I learned how to use fastai v2, I ran some performance tests with time series data (numpy arrays) to compare v1 and v2. I was disappointed by the poor results:

  • v2 training is 40-60% slower.
  • v2 dataloader takes 2-3x as long to return batches.
  • v2 dataset takes 2x as long to return items.

fastai v2 is excellent when using images, text, tabular data, etc, but it’s not streamlined to use numpy arrays. :cry:

Since I use fastai on a daily basis, I started to investigate if there was any way to speed up the code for numpy arrays.

I found a couple of modifications that could potentially accelerate batch creation (1) as long as:

  • No item tfms are applied to X (and y), or
  • Item tfms are deterministic (non-random) and their output fits in memory

These conditions are present in many time series problems.

These 2 modifications are:

  1. Apply item tfms inplace (dataset): in time series problems, most item transforms are deterministic (non-random), which means they can be applied during dataset initialization, instead of at batch creation time. In this way they are only applied once instead of once per epoch.
  2. Get all batch items at once instead of one by one (dataloader): if you carefully build the output during datasets initialization, you only need to slice X (and y) and cast them to the desired output types (tensor, TensorCategory, etc.) to create a batch. Slicing and casting are very fast. This removes the need to have a collate function. The only thing the dataloader needs to do is to pass the indices that will be applied in each iteration to get a batch.
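The two ideas above can be sketched in plain numpy. This is only an illustration, not the tsai code: `InMemoryArrayDataset`, `IndexBatchLoader` and `standardize_per_sample` are hypothetical names, and the cast to tensor types that a real dataloader would do after slicing is left out.

```python
import numpy as np


class InMemoryArrayDataset:
    """Hypothetical dataset: the deterministic item tfm runs once, at init."""

    def __init__(self, X, y=None, item_tfm=None):
        # Applied a single time here instead of once per item per epoch.
        self.X = item_tfm(X) if item_tfm is not None else X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idxs):
        # idxs can be an int, a slice or an array of indices, so a whole
        # batch comes back from one indexing operation (no collate step).
        if self.y is None:
            return (self.X[idxs],)
        return self.X[idxs], self.y[idxs]


class IndexBatchLoader:
    """Hypothetical loader: it only decides which indices form each batch."""

    def __init__(self, dataset, bs=64, shuffle=False, seed=None):
        self.dataset, self.bs, self.shuffle = dataset, bs, shuffle
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        return (len(self.dataset) + self.bs - 1) // self.bs

    def __iter__(self):
        n = len(self.dataset)
        order = self.rng.permutation(n) if self.shuffle else np.arange(n)
        for start in range(0, n, self.bs):
            yield self.dataset[order[start:start + self.bs]]


def standardize_per_sample(X):
    # Deterministic tfm: zero mean / unit std along the time axis.
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + 1e-8)


# Toy data: 100 samples, 3 variables, 50 time steps.
X = np.random.default_rng(0).normal(size=(100, 3, 50)).astype(np.float32)
y = np.arange(100) % 2

ds = InMemoryArrayDataset(X, y, item_tfm=standardize_per_sample)
dl = IndexBatchLoader(ds, bs=32, shuffle=True, seed=0)
batches = list(dl)
```

Each element of `batches` is an `(xb, yb)` tuple obtained by slicing the preprocessed arrays, so the per-batch work is independent of how expensive the item tfm was.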

With these modifications, the batch creation process with numpy arrays is super fast: 100 times faster than vanilla v2, and 30 times faster than v1. :upside_down_face:
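If you want to get a feel for where the difference comes from, here is a rough timing sketch in plain numpy. It only compares "fetch items one by one and stack them" with "slice the array once", so it won’t reproduce the numbers above (those come from the full fastai pipelines). The function names and array sizes are made up for the example.

```python
import timeit

import numpy as np

X = np.random.default_rng(0).normal(size=(10_000, 3, 100)).astype(np.float32)
idxs = np.random.default_rng(1).permutation(len(X))[:64]  # one batch of 64


def batch_item_by_item(X, idxs):
    # Fetch each item separately, then collate by stacking.
    return np.stack([X[i] for i in idxs])


def batch_by_slicing(X, idxs):
    # Fetch the whole batch with a single indexing operation.
    return X[idxs]


t_items = timeit.timeit(lambda: batch_item_by_item(X, idxs), number=200)
t_slice = timeit.timeit(lambda: batch_by_slicing(X, idxs), number=200)
print(f"item by item: {t_items:.4f}s, slicing: {t_slice:.4f}s")
```

Both functions return the same batch; only the time to build it differs, and the gap will depend on your machine and array shapes.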

The user interface I built is very similar to scikit-learn’s API:

NumpyDatasets(X, y=None, tfms=tfms, splits=splits, inplace=True)

TSDatasets(X, y=None, tfms=tfms, sel_vars=None, sel_steps=None, splits=splits, inplace=True)

I’ve added a couple of notebooks to the timeseriesAI repo to show you how all this works.

In summary, this means we can have all the benefits of v2 when using numpy arrays, with a simple scikit-learn-like API that is 43-55% faster than the factory methods and datablock API in v2, and 30% faster than v1 (see 00b_How_to_use_numpy_arrays_in_fastai2.ipynb).

This is the comparison in a chart:

If you are interested I’d suggest you start with this introductory nb.

If you decide to try it, it’d be good if you could provide some feedback on how it works.

whoa :exploding_head: thanks for sharing!

Thanks for sharing your observation and optimization!

I haven’t looked at your implementations but am curious about this point.

If the transformation is deterministic and can be applied during dataset initialization, how is that different from preprocessing the data before training? Is the transformation outcome different from epoch to epoch for the same sample?

And if the concept is “transformed data remains the same throughout subsequent epochs and fits in memory, but can only be done on-the-fly, not offline/preprocessed”, perhaps it can be programmed into a “caching”/“just transform once” switch for the existing Datasets class, so it’s not “timeseries” data specific?

Cheers.

That is amazing!!

What does “Numpy DataLoaders, preproccessed, in-memory” mean? Is it np.memmap?

Hi @philchu,
I’ll try to answer your questions.

Applying item tfms inplace is equivalent to preprocessing the data before training. It’s maybe just more convenient, because you use the same type of transforms you would normally use, applied during datasets initialization.
The item tfms are applied just once, so the data is the same for all epochs.
What can still change are the batch tfms, whose output may be different in each epoch.
So you might think of it this way:

  • item tfm + inplace = preprocessor
  • batch tfm = may deliver different output per epoch if randomness is involved
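A tiny numpy sketch of that split, just to make it concrete. `add_noise` is a made-up random batch tfm for the example, not something from fastai or tsai.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(8, 1, 20)).astype(np.float32)

# item tfm + inplace = preprocessor: deterministic, computed once up front.
X = X_raw - X_raw.mean(axis=-1, keepdims=True)


def add_noise(batch, rng, scale=0.1):
    # Hypothetical random batch tfm: a different output on every call.
    noise = rng.normal(scale=scale, size=batch.shape).astype(batch.dtype)
    return batch + noise


idxs = np.arange(4)               # the same samples in both "epochs"
epoch1 = add_noise(X[idxs], rng)  # what the model sees in epoch 1
epoch2 = add_noise(X[idxs], rng)  # same samples, different noise in epoch 2
```

The stored `X` never changes between epochs; only the output of the batch tfm does.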

The concept of preprocessing all data during initialization could be applied to any type of data as long as the output fits in memory and no randomness is required. This has little or no application in vision and text datasets, for example, but it may in other domains. That’s why I built TSDatasets and TSDataLoaders (time series specific) and the more general NumpyDatasets and NumpyDataLoaders.

Hi @takotab,

Sorry the chart is not clear enough.
That means that if you use NumpyDatasets and NumpyDataLoaders with inplace=True (so that item tfms are applied during datasets initialization and not every time a batch is created) and a normal np.array (data already in memory), you can create batches much faster than with the vanilla factory method or datablock API. This in turn significantly accelerates training.
The result when using data on disk (np.memmap) is very similar.
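For reference, np.memmap is a numpy array backed by a file on disk, and it accepts the same index-array slicing as a normal array, so the batch creation code doesn’t need to change. A minimal sketch (the temporary file and the shapes are just for the example):

```python
import os
import tempfile

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3, 50)).astype(np.float32)

# Write the array to a file on disk through a writable memmap.
path = os.path.join(tempfile.mkdtemp(), "X.dat")
X_disk = np.memmap(path, dtype=X.dtype, mode="w+", shape=X.shape)
X_disk[:] = X
X_disk.flush()

# Reopen it read-only: the data is paged in from disk as it is accessed.
X_mm = np.memmap(path, dtype=np.float32, mode="r", shape=X.shape)

idxs = np.array([5, 17, 42])
batch_mem = X[idxs]               # batch from the in-memory array
batch_mm = np.asarray(X_mm[idxs])  # same batch read from the memmap
```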

This is great, thank you for sharing.

@oguiza did you attempt to use this method? It’ll speed it up some in native fastai (specifically the fake_dl). You should find it’s very similar to your PyTorch numbers :slight_smile: