Tabular Memory in `fastai2` - Reducing Memory Size

As I work through some of the memory drawbacks of fastai2, I’ve found a few tricks that make memory usage much lower. I’m going to use this thread to collect the tricks I’ve found.

  • Note: some of these issues will hopefully be addressed in the library itself; we’re in the process of working on them.

Numerical Data

You can reduce the memory used by numerical data in your DataFrame (sometimes by 50%!) if you convert the float64 columns to float32 wherever the reduced precision is acceptable. This can be as simple as:

import numpy as np

# Downcast every float64 column to float32 (half the bytes per value)
for col in train.columns:
  if train[col].dtype == 'float64':
    train[col] = train[col].astype(np.float32)

What you’ll then see is that TabularPandas takes up less space as well. For example, my particular DataFrame has a footprint of ~1.2 GB. When it is processed as float64, TabularPandas adds another 0.6 GB on top of that. With the float32 preprocessing, the total is now 1.3 GB (so only 0.1 GB added!)
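If you want to check the saving on your own data, here is a minimal sketch using pandas' `memory_usage`. The DataFrame is synthetic (random numbers, made-up column names), so the sizes are only illustrative and are not the figures quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real training frame: 100,000 rows of float64 data.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(100_000, 4)), columns=list("abcd"))

# Bytes used by the column data (index excluded) before the conversion.
before = train.memory_usage(index=False).sum()

# Same idea as the loop above: cast every float64 column to float32.
float_cols = train.select_dtypes(include="float64").columns
train = train.astype({col: np.float32 for col in float_cols})

after = train.memory_usage(index=False).sum()
print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```

Since float32 uses 4 bytes per value instead of 8, a frame made up purely of float64 columns should come out at about half the size.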

Categorical Data

Preprocessing your categorical data by converting the columns to the pandas category dtype can also reduce memory. By how much?
Before preprocessing, the added weight is ~1 GB of memory.
If I do the following:

for name in cat_vars:
  train[name] = train[name].astype('category')

(this is on Rossmann) before calling TabularPandas, the footprint afterwards is the same as it was right after performing the astype calls.
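To see the effect of the category conversion on its own, here is a small sketch with a synthetic column. The column name and values are made up for illustration, and the exact sizes will depend on your pandas version and your data.

```python
import pandas as pd

# Synthetic stand-in for a categorical column: few distinct strings, many rows.
train = pd.DataFrame({"StoreType": ["a", "b", "c", "d"] * 25_000})
cat_vars = ["StoreType"]

# deep=True also counts the string data itself, not just the references to it.
before = train.memory_usage(deep=True, index=False).sum()

for name in cat_vars:
    train[name] = train[name].astype("category")

after = train.memory_usage(deep=True, index=False).sum()
print(f"strings: {before / 1e6:.2f} MB, category: {after / 1e6:.2f} MB")
```

A category column stores one small integer code per row plus a single copy of each distinct value, so the saving is largest when a column has few unique values relative to its length.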

Using Experimental inplace

If you are on the dev version, TabularPandas has an inplace option. This can be helpful for large DataFrames, as TabularPandas will then work on the original DataFrame instead of a copy of it. To use this, first set the following:

pd.options.mode.chained_assignment=None

Then, when building your TabularPandas, set inplace=True.

For now, I hope this helps a few people as we address some of these memory issues in fastai tabular :slight_smile: These tips (especially the numerical ones) are also great for pandas in general!


Good news, we’ve got these adjustments in the library now :slight_smile:

When building your TabularPandas object, reduce_memory is set to True by default. What this does is change the types in your DataFrame following the guide above (continuous variables are converted to float32 and categorical variables to the pandas category dtype). This is best used in conjunction with inplace=True so you don’t make a copy of the DataFrame. Here is an example usage:

to = TabularPandas(df, procs, cat_names=cat_vars, cont_names=cont_vars, y_names=y_name, splits=splits, inplace=True, reduce_memory=True)

When using it on Rossmann, I could reduce the memory overhead from 2.6 GB to 2.15 GB (we started at 1.5 GB).

This is compared to the 3.5 GB overhead from the old version :slight_smile:
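For anyone curious what this kind of type reduction looks like in plain pandas, here is a rough sketch. To be clear, `shrink_dtypes` is a hypothetical helper written for this post, not the code fastai2 actually runs, and the tiny DataFrame and its column names are made up.

```python
import numpy as np
import pandas as pd

def shrink_dtypes(df, cat_names, cont_names, inplace=False):
    """Hypothetical helper (not fastai's implementation): cast continuous
    columns to float32 and categorical columns to the category dtype."""
    # Work on the caller's frame directly, or on a copy if inplace is False.
    out = df if inplace else df.copy()
    for name in cont_names:
        out[name] = out[name].astype(np.float32)
    for name in cat_names:
        out[name] = out[name].astype("category")
    return out

# Tiny made-up frame, just to show the dtype changes.
df = pd.DataFrame({
    "Store": ["s1", "s2", "s1", "s3"],
    "Distance": [1.5, 20.0, 3.25, 7.0],
})
small = shrink_dtypes(df, cat_names=["Store"], cont_names=["Distance"])
print(small.dtypes)
```

With inplace=False the original frame keeps its dtypes and you briefly hold two frames in memory, which is the copy that inplace=True is meant to avoid.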