Hi,
having a table over 700 columns wide, the TabularPandas object takes about half an hour to create, while fastai v1's TabularList.from_df does the same job fast.

to = TabularPandas(df, procs, cat_names, cont_names,
                   y_names=['categorical_column'])
Try pre-processing your columns and see if that speeds things up (so now you pass in procs as just [])
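For instance (a minimal sketch, reusing the df, cat_names, and cont_names from your post):

to = TabularPandas(df, [], cat_names, cont_names,
                   y_names=['categorical_column'])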
Hi Zachary Mueller, the df column dtypes are already of the right types (bools, cats, ints, floats and no objects).
What do you actually mean by column pre-processing, if not the above?
I currently process with the following code:
procs = [Categorify, FillMissing, Normalize]
Thank you!
As in, before sending it into a TabularPandas, do the Categorify, FillMissing, and Normalize separately. IE (this is a very basic one, look at the old fastai for how to kinda do it normally):
# fill missing values, then manually categorify / normalize before TabularPandas
df.fillna(df.median(), inplace=True)
for cat in cat_names:
    df[cat] = df[cat].astype('category')
for cont in cont_names:
    # very rough min-max scaling (fastai's Normalize uses mean/std instead)
    df[cont] = (df[cont] - df[cont].mean()) / (df[cont].max() - df[cont].min())
Now if I do TabularPandas and compare the two:

%timeit to = TabularPandas(df, procs, cat_names, cont_names, y_names="salary", splits=splits)
loop, best of 3: 227 ms per loop

%timeit to = TabularPandas(df, [], cat_names, cont_names, y_names="salary", splits=splits)
10 loops, best of 3: 25.4 ms per loop
We can see a ~10x speedup here. Now, fastai does this already; what we're narrowing down is which proc is giving you the biggest issue, as it's most likely an issue with the data processing, not the library itself.
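If you want to narrow it down yourself, a quick sketch (same variables as above) is to time each proc in isolation:

%timeit to = TabularPandas(df, [Categorify], cat_names, cont_names, y_names="salary", splits=splits)
%timeit to = TabularPandas(df, [FillMissing], cat_names, cont_names, y_names="salary", splits=splits)
%timeit to = TabularPandas(df, [Normalize], cat_names, cont_names, y_names="salary", splits=splits)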
I had the same problem. Reading the source code to see what could possibly slow it down so much: by default, TabularPandas calls df_shrink() to cast larger datatypes into smaller int, uint, or float dtypes, and does so for each column.
If this is not necessary, add the parameter reduce_memory=False and the object is created almost instantly instead of taking a very long time.
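For example, with the call from the first post that would be:

to = TabularPandas(df, procs, cat_names, cont_names,
                   y_names=['categorical_column'], reduce_memory=False)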
Yes, df_shrink() ain't free. It's designed to reduce the dataframe memory footprint, and save time moving data from host memory --> GPU memory during training, at the expense of extra pre-processing time.
If one is sure the column dtypes are already optimized for a smaller memory footprint, by all means turn off reduce_memory.
tabular.core.df_shrink_dtypes() can suggest a column dtype mapping with minimal memory footprint, without actually "shrinking" the df. Perhaps consider using it in Jupyter while exploring the data?
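Something like this in a notebook cell, for instance (a sketch; df_shrink_dtypes returns a dict of suggested dtype changes without modifying df):

from fastai.tabular.core import df_shrink_dtypes

# inspect the suggested smaller dtypes per column
suggested = df_shrink_dtypes(df)
print(suggested)
# and apply them yourself later, if they look right:
# df = df.astype(suggested)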