TabularPandas object creation slow on wide tables

With a table over 700 columns wide, creating a TabularPandas object takes about half an hour, while fastai v1's TabularList.from_df does the same job fast.

to = TabularPandas(df, procs, cat_names, cont_names,

Try pre-processing your columns and see if that speeds things up (so now you pass in procs as just [])

Hi Zachary Mueller, the df's column dtypes are already the right types (bools, cats, ints, floats, and no objects).

What do you actually mean by column pre-processing, if not the above?

I currently process with the following code:
procs = [Categorify, FillMissing, Normalize]

Thank you!

As in, before sending it into TabularPandas, do the Categorify, FillMissing, and Normalize steps separately. E.g. (this is a very basic version; look at the old fastai source for how it's normally done):

# FillMissing: impute continuous NaNs with the column medians
df[cont_names] = df[cont_names].fillna(df[cont_names].median())
# Categorify: encode categorical columns
for cat in cat_names:
  df[cat] = df[cat].astype('category')
# Normalize: standardize continuous columns (z-score, matching what Normalize does)
for cont in cont_names:
  df[cont] = (df[cont] - df[cont].mean()) / df[cont].std()
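On a toy frame, the same preprocessing idea (fill missing, categorify, standardize) looks like this; the column names here are made up for illustration, and I use z-score normalization since that is what Normalize does:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (hypothetical columns)
df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "age":   [10.0, np.nan, 30.0]})
cat_names, cont_names = ["color"], ["age"]

# FillMissing: impute continuous NaNs with the median
df[cont_names] = df[cont_names].fillna(df[cont_names].median())
# Categorify: encode categorical columns
for cat in cat_names:
    df[cat] = df[cat].astype("category")
# Normalize: standardize continuous columns
for cont in cont_names:
    df[cont] = (df[cont] - df[cont].mean()) / df[cont].std()
```

After this, "age" has no NaNs, zero mean, and unit standard deviation, and "color" is a pandas category dtype.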

Now if I do TabularPandas and compare the two:
%timeit to = TabularPandas(df, procs, cat_names, cont_names, y_names="salary", splits=splits)
1 loop, best of 3: 227 ms per loop

%timeit to = TabularPandas(df, [], cat_names, cont_names, y_names="salary", splits=splits)
10 loops, best of 3: 25.4 ms per loop

We can see a roughly 9x speedup here (227 ms vs 25.4 ms). Now, fastai does this pre-processing already. What we're narrowing down is which proc is giving you the biggest slowdown, as it's most likely an issue with the data processing, not the library itself :slight_smile:

I had the same problem. Reading the source code to see what could possibly slow it down so much: by default TabularPandas calls df_shrink(), which casts larger datatypes into smaller int, uint, or float subtypes, and it does so for each column.
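In plain pandas, the per-column shrinking that df_shrink() performs can be sketched like this (a minimal sketch using pd.to_numeric's downcast option, not fastai's actual implementation):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately oversized dtypes
df = pd.DataFrame({"a": np.arange(5, dtype="int64"),
                   "b": np.linspace(0.0, 1.0, 5)})  # float64

shrunk = df.copy()
# Downcast each numeric column to the smallest dtype that holds its values
shrunk["a"] = pd.to_numeric(shrunk["a"], downcast="integer")  # int64 -> int8
shrunk["b"] = pd.to_numeric(shrunk["b"], downcast="float")    # float64 -> float32

print(shrunk.dtypes)
```

Doing this for every one of 700+ columns is where the time goes on a wide table.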

If this is not necessary, pass reduce_memory=False when creating the TabularPandas, and the object is created almost instantly instead of taking a very long time.

Yes, df_shrink() ain’t free :sweat_smile:

It’s designed to reduce dataframe memory footprint, and save time moving data from host memory --> GPU memory during training at the expense of extra pre-processing time.

If one is sure the column dtypes are already optimized for a small memory footprint, by all means turn off reduce_memory.

tabular.core.df_shrink_dtypes() can suggest a column-dtypes mapping with minimal memory footprint without actually “shrinking” the df. Perhaps consider using it in Jupyter while exploring the data?
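A suggest-only mapping of that sort can be sketched in plain pandas (suggest_dtypes here is a hypothetical helper for illustration, not fastai's or pandas' API):

```python
import numpy as np
import pandas as pd

def suggest_dtypes(df):
    """Return a {column: smallest-safe-dtype} mapping without modifying df."""
    mapping = {}
    for col in df.select_dtypes(include="integer").columns:
        mapping[col] = pd.to_numeric(df[col], downcast="integer").dtype
    for col in df.select_dtypes(include="float").columns:
        mapping[col] = pd.to_numeric(df[col], downcast="float").dtype
    return mapping

df = pd.DataFrame({"a": np.arange(3, dtype="int64"),
                   "b": [0.5, 1.5, 2.5]})
mapping = suggest_dtypes(df)
# df itself is untouched; apply the mapping later with df.astype(mapping) if desired
```

This keeps the expensive cast out of the TabularPandas constructor: inspect the suggested mapping once while exploring, apply it once, then build the TabularPandas with reduce_memory off.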