As I’ve been digging into some of the memory drawbacks of fastai2, I’ve also figured out a few tricks that make memory usage much lower. I’m going to use this thread as a post collecting the tricks I’ve found.
- Note: some of these issues will hopefully be addressed; we’re in the process of working on them.
Numerical Data:
You can reduce the memory used by numerical data in your DataFrame (sometimes by 50%!) by converting `float64` columns to `float32` where possible. This can be as simple as:
```python
import numpy as np

for col in train.columns:
    if train[col].dtype == 'float64':
        train[col] = train[col].astype(np.float32)
```
What you’ll then see is that `TabularPandas` shrinks correspondingly as well. For example, my particular dataframe has a footprint of ~1.2 GB. Processing it with `float64` columns adds another ~0.6 GB. With the `float32` preprocessing, it’s now at a whopping 1.3 GB total (so only ~0.1 GB added!)
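You can measure the effect yourself with pandas’ `memory_usage`. A minimal sketch (the DataFrame here is a made-up stand-in for your training data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training DataFrame
train = pd.DataFrame({'a': np.random.rand(100_000),
                      'b': np.random.rand(100_000)})

before = train.memory_usage(deep=True).sum()

# Downcast every float64 column to float32
for col in train.columns:
    if train[col].dtype == 'float64':
        train[col] = train[col].astype(np.float32)

after = train.memory_usage(deep=True).sum()
print(before, after)  # the float32 columns take half the space
```

Since `float32` uses 4 bytes per value instead of 8, the column data roughly halves.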
Categorical Data
Preprocessing your categorical data by converting the columns to the `category` dtype can also reduce memory. Just how much? Before preprocessing, the added weight is ~1 GB of memory. If I do the following:
```python
for name in cat_vars:
    train[name] = train[name].astype('category')
```
(this is on the Rossmann dataset) before calling `TabularPandas`, the footprint stays the same after performing the `astype` conversions.
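The saving comes from pandas storing each distinct value once and replacing the column with small integer codes. A toy demonstration (the values here are invented for illustration):

```python
import pandas as pd

# A repetitive string column, like a store type or day-of-week field
s = pd.Series(['StoreA', 'StoreB', 'StoreC'] * 100_000)

as_object = s.memory_usage(deep=True)
as_category = s.astype('category').memory_usage(deep=True)

print(as_object, as_category)  # the category version is far smaller
```

With only three distinct values, the category codes fit in a single byte each, so the saving is dramatic.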
Using Experimental inplace
If you are on the dev version, `TabularPandas` has an option to use `inplace`. This can be helpful for large dataframes, as `TabularPandas` will work off the original dataframe instead of a copy. To use this, first set the following:
```python
pd.options.mode.chained_assignment = None
```
Then, when building your `TabularPandas`, set `inplace=True`.
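To see why this helps, note that working off a copy costs roughly the full DataFrame’s size again in peak memory. A pandas-only illustration (`TabularPandas` itself isn’t shown here):

```python
import numpy as np
import pandas as pd

# A ~8 MB DataFrame of zeros, standing in for a large training table
df = pd.DataFrame({'x': np.zeros(1_000_000)})

original = df.memory_usage(deep=True).sum()
copied = df.copy().memory_usage(deep=True).sum()

# The copy occupies about as much memory as the original,
# so processing in place avoids doubling peak usage.
print(original, copied)
```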
For now, I hope this helps a few people as we address some of these memory issues in fastai tabular. These tips (especially the numerical ones) are also great for pandas in general!