Regression with negative values and tabular_learner

Hello, I’m writing here because the questions in the lesson subforums all seem to be about specific lessons or about errors in the library/notebooks.

I’m trying to understand the V3 Rossmann notebook in depth and to generalize it for regression and classification. I’ll write down everything I’ve understood and what is still a bit obscure, so please be patient :slight_smile:

procs=[FillMissing, Categorify, Normalize]
#preprocessing to be applied to train and test set

cat_vars = ['feature_1', 'feature_2', 'feature_3', 'dayofweek', 'weekofyear', 'dayofyear', 'quarter', 'month']
cont_vars = ['cont_column1', 'cont_column1_2', 'cont_column_3']
#my categorical and numerical columns

dep_var = 'target'
#my target column

def rmse(pred, targ):
    "RMSE between pred and targ."
    return torch.sqrt(((targ - pred)**2).mean())

#I have defined my own metric, in this case RMSE

As I understand it, in the notebook the validation set is derived from a date slice rather than from arbitrary indices:

dep_var = 'Sales'
df = train_df[cat_vars + cont_vars + [dep_var, 'Date']].copy()
test_df['Date'].min(), test_df['Date'].max()
cut = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()
valid_idx = range(cut)

In my case I have a dataset of ~100k rows and I want to keep the last 10% for the validation set (am I doing it right?):

valid_idx = range(len(train_df)-10000, len(train_df))
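As a sanity check on the slice above, here is a minimal sketch that derives the validation indices from the row count instead of hard-coding 10000. The helper name `last_fraction_idx` is hypothetical, and the sketch assumes the rows are already sorted chronologically, so that the last rows really are the most recent ones.

```python
def last_fraction_idx(n_rows, frac=0.1):
    """Return the indices of the last `frac` of the rows as a range."""
    n_valid = int(n_rows * frac)
    return range(n_rows - n_valid, n_rows)

# With 100,000 rows this is the same as range(len(train_df)-10000, len(train_df)).
valid_idx = last_fraction_idx(100_000, frac=0.1)
print(valid_idx)
```

If the frame is not sorted by date, sorting it first (or slicing on the date column as the notebook does) would be needed for this to be a true "last 10%" split.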

then i want to pass the test_df to the test DataBunch

test = TabularList.from_df(test_df, cat_names=cat_vars, cont_names=cont_vars, procs=procs)

#take the same cat_vars, cont_vars and do the same preprocessing as the train set

the databunch:

data = (TabularList.from_df(train_df, path='~//python_test/', cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                       .split_by_idx(valid_idx)
                       .label_from_df(cols=dep_var, label_cls=FloatList, log=False)
                       .add_test(test, label=0)
                       .databunch())

#label_cls=FloatList because i want to do regression
#log=False because i have negative values in my target range

Now comes the part that i don’t understand well:

max_log_y = np.log(np.max(train_df['target'])*1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)

from the docs:

The last size is out_sz, and we add a last activation that is a sigmoid rescaled to cover y_range (if it’s not None). Lastly, if use_bn is set to False, all BatchNorm layers are skipped except the one applied to the continuous variables.

It works in the notebook example (all positive values), but what if my target variable is in [-10, 7]? I don’t think it can work. In fact, if I remove y_range from the model I get NaN for the train loss.

learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04,
y_range=y_range, metrics=rmse)

What about the layers param? Does it apply only to the continuous columns?
Also (I know it’s always the same question), what is a good size to pass? For example, I get:
(layers): Sequential(
(0): Linear(in_features=44, out_features=1000, bias=True)
so dropout seems useless here, because I have 1000 neuron “slots” and only 44 inputs. But what if I had:
(layers): Sequential(
(0): Linear(in_features=1001, out_features=1000, bias=True)

Is the architecture affected only by the number of columns, or do I also have to take the number of rows into consideration?

also in the notebook we have:

(0): Linear(in_features=233, out_features=1000, bias=True)

Answers to all the questions above would help me understand the model a LOT better. The next question is specific to my model.
This is what I got for the first 3 epochs:

epoch train_loss valid_loss rmse
1 14.634343 15.495975 3.431877
2 16.888639 15.495975 3.431877
3 14.539230 15.495975 3.431877

The rmse and the valid loss are identical in every epoch, which looks like an error to me, but what error?
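A side observation on the table: sqrt(15.495975) is about 3.94, not the 3.43 shown in the rmse column. One possible reason, assuming the loss is a plain MSE and the metric is computed per validation batch and then averaged (both are assumptions, the output above does not confirm them), is that the average of per-batch RMSEs is not the same thing as the square root of the overall MSE. The sketch below uses made-up errors purely to illustrate the gap.

```python
import math

def mse(errors):
    """Mean squared error of a list of (pred - targ) differences."""
    return sum(e * e for e in errors) / len(errors)

# Hypothetical prediction errors split into two validation batches.
batches = [[0.5, -0.5, 1.0, -1.0], [6.0, -7.0, 5.0, -6.0]]

all_errors = [e for batch in batches for e in batch]
overall_mse = mse(all_errors)

# RMSE computed per batch, then averaged with batch-size weights.
n_total = len(all_errors)
batch_avg_rmse = sum(math.sqrt(mse(b)) * len(b) for b in batches) / n_total

print(math.sqrt(overall_mse), batch_avg_rmse)
```

The batch-averaged value comes out smaller than the square root of the overall MSE, so a mismatch between the two columns is not by itself a sign of a bug. It does not explain why the numbers never change between epochs.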

I’m not sure I understand what you did. If your data ranges from -10 to 7, just pass y_range=(-10,7).

The lines below were part of a trick Jeremy mentioned: if your expected targets fall in a known range, you can put a sigmoid on the end of the model to help it train. The *1.2 part is there so that the model can reach the maximum value more easily. Maybe go back to the video for the explanation.

max_log_y = np.log(np.max(train_df['target'])*1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)

As sgugger says, your max and min values are (-10,7) so you can just use those (or maybe a little more).
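To make this concrete, here is a minimal sketch of what a "sigmoid rescaled to cover y_range" does, written in plain Python rather than with the library. The function `scaled_sigmoid` is a hypothetical stand-in, assumed to match the formula lo + (hi - lo) * sigmoid(x); the 1.2 padding on both ends is an arbitrary illustration.

```python
import math

def scaled_sigmoid(x, lo, hi):
    """Squash a raw model output x into the open interval (lo, hi)."""
    return lo + (hi - lo) / (1.0 + math.exp(-x))

# The notebook's recipe applied to a target whose maximum is 7:
max_log_y = math.log(7 * 1.2)  # roughly 2.13
# Every prediction is then confined to (0, max_log_y),
# so negative targets can never be reached.
print(scaled_sigmoid(-20.0, 0.0, max_log_y), scaled_sigmoid(20.0, 0.0, max_log_y))

# A range taken from the actual targets, padded a little on both sides.
lo, hi = -10 * 1.2, 7 * 1.2
for raw in (-20.0, 0.0, 20.0):
    print(raw, scaled_sigmoid(raw, lo, hi))
```

The log in the notebook goes together with log-transformed targets, which is not an option when the target can be negative, so here the range is taken directly from the raw min and max.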

The layers param specifies the size of the BatchNorm, Dropout, Linear and ReLU blocks that follow both the continuous and categorical variables.
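On the in_features question, a rough sketch of how the width of the first Linear layer could be worked out: the sum of the embedding widths of the categorical columns plus the number of continuous columns. The embedding-size heuristic used here, min(600, round(1.6 * n_cat**0.56)), is what I believe the library uses by default, but treat it as an assumption. The cardinalities are made up for illustration.

```python
def emb_sz_rule(n_cat):
    """Assumed rule of thumb for the embedding width of one categorical column."""
    return min(600, round(1.6 * n_cat ** 0.56))

def first_linear_in_features(cardinalities, n_cont):
    """Width of the first Linear layer: all embedding widths plus the continuous columns."""
    return sum(emb_sz_rule(c) for c in cardinalities) + n_cont

# Hypothetical cardinalities for the eight categorical columns
# (feature_1..3, dayofweek, weekofyear, dayofyear, quarter, month).
cardinalities = [12, 5, 30, 7, 53, 366, 4, 12]
n_cont = 3
print(first_linear_in_features(cardinalities, n_cont))
```

The number of rows does not appear anywhere in this calculation: in_features changes only with the columns and the number of distinct values in each categorical column. The row count matters for how large a network you can train without overfitting, not for the input width.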

As a side note, you can display code with nice formatting using backticks like this:

code in here

Hey, thanks for your answer, but something still eludes me.
This afternoon, trying to figure out what’s wrong, I followed this Kaggle kernel:

data = (TabularList.from_df(train_df, path='~//python_test//kaggle//titanic//', cat_names=cat_vars, cont_names=cont_vars, procs=procs)
.add_test(test, label=0)
If I change the label_from_df call to:
.label_from_df(cols=dep_var, label_cls=FloatList, log=False)

when i call
predictions, *_ = learn.get_preds(DatasetType.Test)
labels = np.argmax(predictions, 1)

the labels are all 0, while I expected to get a probability p.

In my own dataframe, where I just want to do regression to a real value, I also get all zeros.
I have read the docs and still don’t understand the right way to do it, sorry!
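A likely explanation for the zeros, sketched in plain Python: with label_cls=FloatList the model presumably produces a single output per row, so the predictions have shape (n, 1), and an argmax along axis 1 can only ever return index 0. The helper `argmax_per_row` and the numbers are hypothetical. They only mimic what np.argmax(predictions, 1) does on such a shape.

```python
def argmax_per_row(rows):
    """Index of the largest entry in each row, like an argmax along axis 1."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

# Hypothetical regression output: one predicted value per row, shape (n, 1).
predictions = [[-3.2], [0.7], [5.9]]

labels = argmax_per_row(predictions)      # each row has only index 0 to choose from
values = [row[0] for row in predictions]  # the actual regression predictions

print(labels)
print(values)
```

So for regression the thing to read is the single output column itself, not its argmax. argmax only makes sense for classification, where each row holds one score per class.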