cat_flds is the list of categorical variables we set up above
bs = batch size
Does this depend on whether we want to treat the data as a time series? Or does this change when we are trying to do different categorizations (Porto Seguro Insurance)?
Only train and validation. I don’t think the API has an argument for test data (I could be wrong).
Target variable. In this case it’s the log of sales.
I don’t think so. This API is useful for all structured data. With a time series, you also extract features from the dates, and the time ordering has implications for how you select validation indexes. Jeremy talked about not using a random split of train and validation, but holding out the last two weeks of the training data as validation data.
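To make the "last two weeks as validation" idea concrete, here is a rough sketch of how val_idxs and y could be built with pandas. The frame, the Date and Sales column names, and the 14-day window are assumptions for illustration only, not something taken from the lesson notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame: one row per day, already sorted by date.
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=60, freq="D"),
    "Sales": np.arange(60, dtype=float) + 1.0,
})

# Hold out the final 14 days as validation instead of sampling rows at random.
cutoff = df["Date"].max() - pd.Timedelta(days=14)
val_idxs = np.flatnonzero((df["Date"] > cutoff).to_numpy())

# Target variable: the log of sales, as described earlier in the thread.
y = np.log(df["Sales"].to_numpy())
```

val_idxs and y here play the roles of the val_idxs and y arguments of from_data_frame. With several rows per day (for example one per store) the same date comparison still works, since it selects every row after the cutoff.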
I have additional questions about ColumnarModelData and I thought it would be better to ask them here instead of creating a new topic.
Q1: How should I create a columnar model data if I don’t have any categorical variables? If I pass an empty list, I later (when starting the fit process) get an empty array error thrown by PyTorch.
Q2: Is it possible to do classification instead of regression using ColumnarModelData’s learner?
Oops! Good point. For now, just add a column of zeros, I think.
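In case it helps, the workaround could look something like this. The frame and the dummy_cat column name are made up for the example; the only part that comes from the suggestion above is the constant column of zeros.

```python
import pandas as pd

# Hypothetical frame that has only continuous columns.
df = pd.DataFrame({
    "temperature": [20.5, 22.1, 19.8],
    "humidity": [0.31, 0.45, 0.52],
})

# Placeholder categorical column with a single level (all zeros),
# so that the list of categorical fields is not empty.
df["dummy_cat"] = 0
cat_flds = ["dummy_cat"]
```

Since the column has a single level it carries no information, so it should only be there to keep the categorical input from being empty.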
You can do a binary model now, by passing y_range=[0,1]. It doesn’t support multiclass classification at the moment, although it would be easy to add to class MixedInputModel if someone’s interested.
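For anyone wondering why a y_range of [0,1] gives you a binary model: a common way to constrain an output to a range is a scaled sigmoid, which for [0,1] is just a probability. The sketch below assumes that is roughly what happens inside the model (I have not checked it against the library source), and squash_to_range and to_label are hypothetical helper names, not fastai functions.

```python
import math

def squash_to_range(raw, y_range):
    """Map an unbounded model output into (lo, hi) with a scaled sigmoid."""
    lo, hi = y_range
    return lo + (hi - lo) / (1.0 + math.exp(-raw))

def to_label(prob, threshold=0.5):
    """Turn a value in [0, 1] into a hard 0/1 class label."""
    return int(prob >= threshold)

# With y_range=[0, 1] the squashed output can be read as P(class == 1).
prob = squash_to_range(3.0, [0, 1])
label = to_label(prob)
```

Multiclass would need one output per class and a softmax instead, which is presumably the change to MixedInputModel mentioned above.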
@jeremy I have a question too. From the API, ColumnarModelData doesn’t seem to support test sets. How can we make predictions on a test set?
def from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs):
@jeremy Thanks for answering my previous question. I have another one along the line of @jakcycsl’s.
My workflow is pretty standard:
1. train NN on columnar data
a) load train/val set
b) do ‘preprocessing’ (i.e. standardize each column)
c) initialize ColumnarModelData and get the learner
d) use lr_find; SGDR;
e) save model at the end
2. go to local machine
a) load sample
b) do ‘preprocessing’
c) initialize ColumnarModelData using loaded sample and get the learner
d) load model trained on AWS
e) do classification on loaded sample
To do step 2b) I need the mapper from step 1b), passed as an argument to the proc_df function, right? So I need to serialize the mapper as well.
Yep, for step 2c you would need to pass the sample data returned by proc_df into ColumnarModelData. After that, get the learner, load the model, and run the prediction.
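On carrying the mapper from the training machine to the local one, a minimal sketch with pickle could look like this. The mapper here is a plain dict of per-column means and standard deviations that I made up as a stand-in for whatever proc_df returns, and standardize is a hypothetical helper, so treat this only as an illustration of the save/load round trip.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted preprocessing state from step 1b).
mapper = {
    "temperature": {"mean": 20.8, "std": 1.2},
    "humidity": {"mean": 0.4, "std": 0.1},
}

def standardize(row, mapper):
    """Apply the saved training statistics to one row of raw values."""
    return {col: (val - mapper[col]["mean"]) / mapper[col]["std"]
            for col, val in row.items()}

# Training machine: save the mapper alongside the model weights.
path = os.path.join(tempfile.mkdtemp(), "mapper.pkl")
with open(path, "wb") as f:
    pickle.dump(mapper, f)

# Local machine: load it back and reuse the same statistics on the sample.
with open(path, "rb") as f:
    loaded = pickle.load(f)

scaled = standardize({"temperature": 22.0, "humidity": 0.35}, loaded)
```

The point is that the sample is scaled with the statistics fitted on the training data, not with statistics recomputed from the sample itself.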
I have a project that would use ColumnarModelData for categorical predictions. Before I begin the project, I’m curious: Did you have success using it for that purpose? Also, in my project, I would begin with binary classification, but would likely want to move to multiclass. Did you try multi-class?