Dtype of Elapsed from add_datepart is not numerical

paulclou · September 3, 2020, 4:04am

Given a dataframe with

tmp['date'].values
array([Timestamp('2015-09-01 17:00:00-0700', tz='US/Pacific'),
       Timestamp('2015-09-02 17:00:00-0700', tz='US/Pacific'),
       Timestamp('2015-09-03 17:00:00-0700', tz='US/Pacific'),
       Timestamp('2015-09-07 17:00:00-0700', tz='US/Pacific'),
       Timestamp('2015-09-08 17:00:00-0700', tz='US/Pacific')],
      dtype=object)

If we run

add_datepart(tmp, 'date')

we get

tmp.dtypes
Year                 int64
Month                int64
Week                UInt32
Day                  int64
Dayofweek            int64
Dayofyear            int64
Is_month_end          bool
Is_month_start        bool
Is_quarter_end        bool
Is_quarter_start      bool
Is_year_end           bool
Is_year_start         bool
Elapsed             object
dtype: object

In v2.0.7, the Elapsed column generated by add_datepart has dtype object, not a numeric dtype. Since Elapsed represents a Unix epoch timestamp, shouldn't its dtype be int64, so that cont_cat_split identifies Elapsed as continuous? The behavior is the same if I change the column values from Timestamp to str.
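For illustration, here is a minimal pandas-only sketch of how an int64 Elapsed could be derived from a column like the one above. It is not fastai's implementation, just one possible workaround. It assumes the column holds tz-aware Timestamps stored as dtype=object, and it uses "America/Los_Angeles" as a stand-in for "US/Pacific".

```python
import pandas as pd

# Rebuild a column like the one in the post: tz-aware Timestamps stored as dtype=object.
# "America/Los_Angeles" is used here as a stand-in for "US/Pacific" (assumption).
dates = pd.Series(
    [
        pd.Timestamp("2015-09-01 17:00:00", tz="America/Los_Angeles"),
        pd.Timestamp("2015-09-02 17:00:00", tz="America/Los_Angeles"),
    ],
    dtype=object,
)
tmp = pd.DataFrame({"date": dates})

# Normalize the object column to a real datetime64 column in UTC.
as_utc = pd.to_datetime(tmp["date"], utc=True)

# Whole seconds since the Unix epoch, stored as a plain int64 column.
epoch = pd.Timestamp("1970-01-01", tz="UTC")
tmp["Elapsed"] = ((as_utc - epoch) // pd.Timedelta(seconds=1)).astype("int64")

print(tmp.dtypes)
```

With Elapsed stored as int64, a dtype-based split such as cont_cat_split should have a numeric column to work with, though I have not verified that against v2.0.7.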

paulclou · September 4, 2020, 3:48am

Also, add_datepart may create columns with dtype UInt32 (Week, as in the output above) or bool (e.g. Is_month_end), which cause TabularLearner.predict to fail. The nullable unsigned integer dtype and apparently bool are not supported for predictions in a production environment. I had to convert all columns to either float or int to be able to predict.

The error is:

~/.local/lib/python3.8/site-packages/fastai/torch_core.py in tensor(x, *rest, **kwargs)
    124            else torch.tensor(x, **kwargs) if isinstance(x, (tuple,list))
    125            else _array2tensor(x) if isinstance(x, ndarray)
--> 126            else as_tensor(x.values, **kwargs) if isinstance(x, (pd.Series, pd.DataFrame))
    127            else as_tensor(x, **kwargs) if hasattr(x, '__array__') or is_iter(x)
    128            else _array2tensor(array(x), **kwargs))

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
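The conversion I mean can be sketched roughly as follows. This is pandas-only and does not call fastai; to_plain_numeric is a hypothetical helper name, and the sample row only mimics the dtypes shown above. It assumes there are no missing values, since casting a nullable integer column containing NA to int64 raises.

```python
import pandas as pd

def to_plain_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose columns are all plain NumPy int64 or float64.

    Hypothetical helper, not part of fastai. Assumes no missing values.
    """
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if pd.api.types.is_bool_dtype(dtype):
            # bool -> 0/1 integers
            out[col] = out[col].astype("int64")
        elif pd.api.types.is_integer_dtype(dtype):
            # Covers NumPy integers and nullable ones such as UInt32.
            out[col] = out[col].astype("int64")
        else:
            # object or float columns: parse as numbers, store as float64.
            out[col] = pd.to_numeric(out[col]).astype("float64")
    return out

# A single row mimicking the dtypes reported in this thread.
row = pd.DataFrame(
    {
        "Week": pd.array([36], dtype="UInt32"),
        "Is_month_end": [False],
        "Elapsed": pd.Series([1441152000], dtype=object),
    }
)

clean = to_plain_numeric(row)
print(clean.dtypes)
```

After this step, clean.values is a single numeric array instead of numpy.object_, which is the type the TypeError above complains about.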