TabularPandas converting continuous variables to integers

Hi,
While working through the Rossmann example, I was getting very strange results from lr_find(), and during training PyTorch was throwing the exception:

RuntimeError: exp_vml_cpu not implemented for 'Int'

When I looked at the DataLoaders, I noticed that the variables I identified as continuous in the ‘cont_names’ parameter were converted to integers.

    df = pd.read_csv(r'/home/tim/.kaggle/rossmann/small.csv')
    dep_var = 'Sales'
    procs = [Categorify, FillMissing, Normalize]
    cont_vars =  ['Customers', 'Store']
    cat_vars =  ['DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']
    dls = TabularDataLoaders.from_df(df, cat_names=cat_vars, cont_names=cont_vars, y_names='Sales', procs=procs)

    print(dls.show_batch(show=False))
(    DayOfWeek  Open  Promo  StateHoliday  SchoolHoliday  Customers     Store   Sales
0         7.0   1.0    1.0           1.0            1.0  -1.544242  0.510249     0.0
1         3.0   2.0    1.0           1.0            2.0  -0.833547 -0.212174  2670.0
2         1.0   2.0    1.0           1.0            1.0  -0.080632 -1.249099  6700.0
3         1.0   2.0    1.0           1.0            2.0   0.294653 -1.149454  6015.0

I tried converting these columns to ‘category’ in the dataframe, but got the same results:

    maps = {'DayOfWeek': {1: 'Monday', 2: 'Tues', 3: 'Wed', 4: 'Thurs', 5: 'Fri', 6: 'Sat', 7: 'Sun'}, 'Open': {0: 'False', 1: 'True'},
            'Promo': {0: 'False', 1: 'True'}, 'StateHoliday': {0: 'False', 1: 'True'}, 'SchoolHoliday': {0: 'False', 1: 'True'}}
    for key, val in maps.items():
        df[key] = df[key].map(val)
        df[key] = df[key].astype('category')

    print(df.head())
    print(df.info())
   Store DayOfWeek        Date  Sales  Customers  Open Promo StateHoliday SchoolHoliday
0      1       Fri  2015-07-31   5263        555  True  True        False          True
1      2       Fri  2015-07-31   6064        625  True  True        False          True
2      3       Fri  2015-07-31   8314        821  True  True        False          True

Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Store          19999 non-null  int64   
 1   DayOfWeek      19999 non-null  category
 2   Date           19999 non-null  object  
 3   Sales          19999 non-null  int64   
 4   Customers      19999 non-null  int64   
 5   Open           19999 non-null  category
 6   Promo          19999 non-null  category
 7   StateHoliday   19999 non-null  category
 8   SchoolHoliday  19999 non-null  category
dtypes: category(5), int64(3), object(1)

    dls2 = TabularDataLoaders.from_df(df, cat_names=cat_vars, cont_names=cont_vars, y_names='Sales', procs=procs)
    print(dls2.show_batch(show=False))
(    DayOfWeek  Open  Promo  StateHoliday  SchoolHoliday  Customers     Store    Sales
 0         4.0   1.0    1.0           1.0            1.0  -1.540567  0.531858      0.0
 1         1.0   2.0    2.0           1.0            2.0  -0.652530  0.899076   5151.0
 2         3.0   2.0    1.0           1.0            1.0   0.007624 -0.445317   5564.0
 3         6.0   2.0    2.0           1.0            2.0   0.282493 -1.316683   7500.0

And the variables are integers in ‘dls2.train.xs’:

print(dls2.train.xs)
       DayOfWeek  Open  Promo  StateHoliday  SchoolHoliday  Customers     Store
5034           2     2      2             1              1   0.167377  0.058831
7481           3     2      1             1              1  -0.434045  0.734139
6811           3     2      1             1              1   1.020174 -1.350915
3448           6     2      2             1              2   1.391364 -1.406931
13939          4     1      1             1              1  -1.540567  0.012150

This problem has me stumped.
Am I doing something wrong?
Should I submit an issue on github?

Thanks,
Tim

Hi,

Confused…
The two continuous columns (Customers and Store) are floats, not integers.

Are you asking about the cat_vars (categorical columns)?
Since categorical variables are discrete, a dictionary is built mapping each category to an integer, which then serves as an index into the embedding layer.
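For example, pandas’ own category dtype shows the same kind of mapping that Categorify builds (a toy column, not the real Rossmann data; fastai additionally reserves index 0 for #na#, which is why the codes in the batches above start at 1):

```python
import pandas as pd

# Toy stand-in for one categorical column (values made up)
s = pd.Series(['Fri', 'Mon', 'Fri', 'Wed']).astype('category')

# The vocabulary: each distinct category gets an integer code
print(dict(enumerate(s.cat.categories)))  # {0: 'Fri', 1: 'Mon', 2: 'Wed'}

# The column as the model sees it: integer indexes into the embedding layer
print(s.cat.codes.tolist())               # [0, 1, 0, 2]
```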

Yes, I was wondering about the categorical columns.
Thanks for helping out a noob. I forgot that the categorical variables would be embeddings.

I got myself turned around trying to troubleshoot an error I’m getting during training.
Calling ‘lr_find’ succeeds, but during training, I get the error:

RuntimeError: exp_vml_cpu not implemented for 'Short'

While ‘show_batch’ indicates the ‘Sales’ variable is a float:

(    DayOfWeek  Open  Promo  StateHoliday  SchoolHoliday  Customers     Store    Sales
0         3.0   2.0    1.0           1.0            2.0  -0.315179 -0.516475   3766.0
1         6.0   2.0    1.0           1.0            1.0  -1.064596  1.718217   2084.0

the ‘y’ series of the dataloaders’ ‘train’ TabDataLoader is a short (show_batch displays it as a float, but the underlying tensor is an integer type).

I’m confused about a couple of parameters when creating the DataLoaders and the TabularLearner.
This is a regression problem, so

  1. When creating the TabularDataLoaders
    a. Should I include the continuous dependent variable column in the cont_name parameter?
    b. Should I set the y_names parameter? Does this imply labels of a classification problem?
    c. Should I set y_block=RegressionBlock()? It looks like the Tabular baseclass of TabularPandas will correctly infer this as long as ‘y_names’ is set:
    if y_block is None and self.y_names:
        # Make ys categorical if they're not numeric
        ys = df[self.y_names]
        if len(ys.select_dtypes(include='number').columns)!=len(ys.columns): y_block = CategoryBlock()
        else: y_block = RegressionBlock()
  2. When creating the tabular_learner
    a. Should I set the ‘n_out’ parameter? I’m a little confused by this one. Is it the number of outputs, therefore always ‘1’ except for multi-class classification problems?
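The inference logic quoted in 1c can be mimicked with plain pandas to see which block you would get (infer_block is a hypothetical helper written for illustration, not fastai API):

```python
import pandas as pd

def infer_block(df, y_names):
    """Mimic the fastai snippet quoted above: all-numeric targets -> regression."""
    ys = df[y_names]
    if len(ys.select_dtypes(include='number').columns) != len(ys.columns):
        return 'CategoryBlock'
    return 'RegressionBlock'

# Sales is numeric, so y_block would be inferred as RegressionBlock
df = pd.DataFrame({'Sales': [5263, 6064], 'StateHoliday': ['False', 'True']})
print(infer_block(df, ['Sales']))         # RegressionBlock
print(infer_block(df, ['StateHoliday']))  # CategoryBlock
```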

I think I’ve solved the problem. Adding the dependent variable to the ‘cont_names’ seems to have fixed the problem.

Previously I had been setting the ‘y_block’ parameter to RegressionBlock, but leaving the continuous dependent variable out of the ‘cont_names’ list. This seems to have resulted in an “invalid” state where the dependent variable was a float when calling dls.show_batch():

Out[5]: 
(    DayOfWeek  Open  Promo  StateHoliday  SchoolHoliday  Customers     Store    Sales
 0         6.0   2.0    1.0           2.0            1.0  -0.202263 -0.084810   4005.0
 1         7.0   1.0    1.0           2.0            1.0  -1.361938  0.962064      0.0
 2         1.0   2.0    1.0           2.0            1.0   0.856289 -0.106555  10633.0

but an int in the data loader’s ‘y’ field:

dls.train.y 
42849      3991
300129    11954
104782     6986
3412       8195

Thanks again for the help,
Tim

Here’s the code:

    df = pd.read_csv(r'/home/tim/.kaggle/rossmann/train.csv')
    dep_var = 'Sales'
    procs = [Categorify, FillMissing, Normalize]
    y_block = RegressionBlock()
    cont_vars =  ['Customers', 'Store']
    # To make this work, add the dependent variable 'Sales' to cont_vars and there is no need to set y_block
    cont_vars = ['Customers', 'Store', 'Sales']
    y_block = None
    cat_vars =  ['DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']
    dls = TabularDataLoaders.from_df(df, cat_names=cat_vars, cont_names=cont_vars, y_names='Sales',
                                     y_block=y_block, procs=procs)

    print(dls.show_batch(show=False))

    metrics = exp_rmspe
    layers = [1000, 500]
    learn = tabular_learner(dls, layers=layers, metrics=metrics)
    learn.fit_one_cycle(4, 1e-4)

I know you said it solved the problem above, but the dependent variable should not be included in the cont_names list.

All the columns specified in cont_names and cat_names are used as input to the model. Two issues: first, the model can simply pass the Sales input straight through to the output and achieve near-perfect accuracy during training. Second, when you go to use it to predict future sales, you can’t feed the model, since you will not have a Sales value to input.

There may be a bug here, though.

dls.show_batch() displays Sales as a float, but if we look at the raw input to the model (dls.train.y, as you have done, or print(next(iter(dls.train)))), Sales is an integer.

Not sure of the correct fix for that…

For now, can you remove the metric to see if you can complete training? I’m guessing that the metric is expecting two floats.

I was getting phenomenal results during training;-)
And, as you know, couldn’t predict with the model.

Yes, the metric is the problem. When I remove the dependent variable from the cont_names list and use rmse or no metric in the learner, training succeeds. But when I use exp_rmspe as the metric, I get the PyTorch error (“exp_vml_cpu not implemented for ‘Short’”).

I don’t know how to interpret this. I don’t really understand RMSPE. As a software engineer trying to learn tabular deep learning, I’m quickly realizing that I’m in over my head. Are there any resources you would recommend to get my head around these error metrics? Maybe Jeremy’s ML course from a few years ago?
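For what it’s worth, RMSPE is just RMSE computed on percentage errors, so it penalizes being off by 10% of a small sale the same as 10% of a large one. fastai’s exp_rmspe additionally applies exp() first, since the model is typically trained on log(Sales). A standalone NumPy sketch with made-up values:

```python
import numpy as np

def rmspe(preds, targs):
    """Root mean squared percentage error: each error is measured
    relative to the size of the target it missed."""
    pct_err = (targs - preds) / targs
    return np.sqrt(np.mean(pct_err ** 2))

preds = np.array([5000.0, 6000.0])
targs = np.array([5263.0, 6064.0])
print(rmspe(preds, targs))  # ~0.036, i.e. roughly a 3.6% average error
```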

Thanks again,
Tim

Personally, while learning, I wouldn’t worry about such specific details (unless you know for sure that RMSPE is specifically needed for your task)

A lot of the difficulty is in the data mangling/library knowledge and not necessarily the machine learning - in this case the data is an integer when a float is needed.
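One workaround in that spirit (an assumption on my part, not verified against every fastai version) is to cast the dependent variable to float in the dataframe before building the loaders, so the loss and metrics see float tensors:

```python
import pandas as pd

# Made-up frame standing in for the Rossmann data
df = pd.DataFrame({'Sales': [5263, 6064, 8314]})
print(df['Sales'].dtype)  # int64

# Cast the target up front, before passing df to TabularDataLoaders
df['Sales'] = df['Sales'].astype('float32')
print(df['Sales'].dtype)  # float32
```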

Older courses bring in a different issue: the V1 library is a bit different, and it can get confusing when trying to adapt it to the V2 code base.