Hello, I'm writing here because the questions in the lesson subforums all seem to be about lesson-specific issues and errors in the library or notebooks.
I'm trying to deeply understand and generalize the V3 Rossmann notebook for regression and classification. I'll write down everything I've understood and what is still a bit obscure, so please be patient.
procs=[FillMissing, Categorify, Normalize]
#preprocessing to be applied to train and test set
cat_vars = ['feature_1', 'feature_2', 'feature_3', 'dayofweek', 'weekofyear', 'dayofyear', 'quarter', 'month']
cont_vars = ['cont_column1', 'cont_column1_2', 'cont_column_3']
#my categorical and numerical columns
dep_var = 'target'
#my target column
def rmse(pred, targ):
    "RMSE between `pred` and `targ`."
    return torch.sqrt(((targ - pred)**2).mean())
#I have defined my own metric, in this case RMSE
As I understand it, the validation set in the notebook is derived from a Date slice, not from a fixed list of indices:
dep_var = 'Sales'
df = train_df[cat_vars + cont_vars + [dep_var, 'Date']].copy()
test_df['Date'].min(), test_df['Date'].max()
cut = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()
valid_idx = range(cut)
In my case I have a dataset of ~100k rows and I want to keep the last 10% as the validation set (am I doing it right?)
valid_idx = range(len(train_df)-10000, len(train_df))
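A minimal sketch of the same split computed from the row count instead of a hard-coded 10000, assuming the rows are already sorted chronologically. last_fraction_idx is a hypothetical helper, not part of fastai:

```python
def last_fraction_idx(n_rows, frac=0.1):
    """Return the indices of the last `frac` of `n_rows` rows (hypothetical helper)."""
    n_valid = round(n_rows * frac)
    return range(n_rows - n_valid, n_rows)

# With 100k rows this gives the same range as
# range(len(train_df)-10000, len(train_df)).
valid_idx = last_fraction_idx(100_000, frac=0.1)
print(valid_idx[0], valid_idx[-1], len(valid_idx))
```

If the rows are not in time order, slicing by position like this would not give a "latest 10%" validation set, so sorting by date first would be needed.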
Then I want to pass test_df to the DataBunch as the test set:
test = TabularList.from_df(test_df, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
#take the same cat_vars, cont_vars and do the same preprocessing as the train set
the databunch:
data = (TabularList.from_df(train_df, path='~//python_test/', cat_names=cat_vars, cont_names=cont_vars, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=FloatList, log=False)
        .add_test(test, label=0)
        .databunch())
#label_cls=FloatList because I want to do regression
#log=False because I have negative values in my target range
Now comes the part that I don't understand well:
max_log_y = np.log(np.max(train_df['target'])*1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)
from the docs: https://docs.fast.ai/tabular.models.html#TabularModel
The last size is out_sz, and we add a last activation that is a sigmoid rescaled to cover y_range (if it’s not None). Lastly, if use_bn is set to False, all BatchNorm layers are skipped except the one applied to the continuous variables.
It works in the notebook example (all positive values), but what if my target variable lies in [-10, 7]? I think it can't work; in fact, if I remove y_range from the model I get NaN for the train loss.
learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04,
y_range=y_range, metrics=rmse)
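To check my understanding of the "sigmoid rescaled to cover y_range" from the docs, here is a minimal sketch in plain Python. It assumes the rescaling is (y_max - y_min) * sigmoid(x) + y_min, which is my reading of the docs and not something I have confirmed; scaled_sigmoid is a hypothetical name and [-10, 7] is my own target range:

```python
import math

def scaled_sigmoid(x, y_min, y_max):
    """Squash a raw output x into (y_min, y_max); hypothetical helper."""
    sig = 1.0 / (1.0 + math.exp(-x))
    return (y_max - y_min) * sig + y_min

# Assumed target range, taken from the question: [-10, 7].
y_min, y_max = -10.0, 7.0
outputs = [scaled_sigmoid(x, y_min, y_max) for x in (-20.0, 0.0, 20.0)]
print(outputs)

# With the notebook-style range [0, log(7 * 1.2)] no output can go below 0,
# so negative targets would be unreachable.
notebook_max = math.log(7 * 1.2)
low = scaled_sigmoid(-20.0, 0.0, notebook_max)
print(low)
```

If this reading is right, my target would need a range with a negative lower bound, for example y_range = torch.tensor([-10, 7], device=defaults.device), and taking the log of the maximum only makes sense when the target itself is log-transformed.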
What about the layers param? It applies only to the continuous columns, right?
Also (I know it's always the same question), what is a good size to pass? For example I get:
(layers): Sequential(
(0): Linear(in_features=44, out_features=1000, bias=True)
so dropout is useless here, because I have 1000 neuron "slots" and only 44 inputs. But what if:
(layers): Sequential(
(0): Linear(in_features=1001, out_features=1000, bias=True)
??
Will the architecture be affected only by the number of columns, or do I also have to take the number of rows into consideration?
Also, in the notebook we have:
(0): Linear(in_features=233, out_features=1000, bias=True)
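My current guess at where in_features comes from, as a sketch: every categorical column gets an embedding, the embedding widths are summed, and the number of continuous columns is added. The width rule below, min(600, round(1.6 * n_cat**0.56)), is what I believe fastai v1 uses by default, so treat it as an assumption; the cardinalities are made up for illustration:

```python
def emb_sz_rule(n_cat):
    """Embedding width for a column with n_cat categories (assumed default rule)."""
    return min(600, round(1.6 * n_cat ** 0.56))

def first_layer_in_features(cardinalities, n_cont):
    """Sum of the embedding widths plus the number of continuous columns."""
    return sum(emb_sz_rule(c) for c in cardinalities) + n_cont

# Hypothetical cardinalities for 8 categorical columns, plus 3 continuous ones.
cardinalities = [5, 12, 40, 8, 53, 367, 5, 13]
n_cont = 3
print(first_layer_in_features(cardinalities, n_cont))
```

Under that assumption the input width depends on the columns and their cardinalities, not on the number of rows, and the first Linear layer sees the embeddings and the continuous columns together.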
Answers to all the questions above will help me understand the model a LOT better. This last question is specific to my model.
From the first 3 epochs I got:
epoch train_loss valid_loss rmse
1 14.634343 15.495975 3.431877
2 16.888639 15.495975 3.431877
3 14.539230 15.495975 3.431877
The rmse and the valid loss are identical in every epoch, which looks like an error to me, but what error?