Understanding ColumnarModelData.from_data_frame from Rossmann

mindtrinket · November 21, 2017, 2:08pm

I keep running into problems with ColumnarModelData from the Rossmann notebook.

md = ColumnarModelData.from_data_frame(path, val_idx, joined, yl, cat_flds=cat_vars, bs=128)

Let me check that I understand this correctly:

path points back to our folders
val_idx is a list of row indexes only, e.g. [3000,3001,3002…], marking which rows of data will be used for validation
joined is the full columnar dataset, containing both the training and validation rows. Test data is also taken from it.
yl is confusing me. Is it the log of Sales?

df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)
cat_flds is the list of categorical variables we set up above
bs = batch size

Does this depend on whether we want to treat the data as a time series? Or does it change for different kinds of categorization (Porto Seguro Insurance)?

ramesh · November 21, 2017, 3:53pm

path: used to save the trained model weights.

joined: only train and validation. I don’t think the API has an argument for test data (I could be wrong).

yl: the target variable. In this case it’s the log of Sales.

Time series: I don’t think so. This API is useful for all structured data. In a time series you are also extracting features from dates, which has implications for how you select the validation indexes. Jeremy talked about not using a random split of train and validation, but holding out the last two weeks of training data as validation data.
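
A rough sketch of that last point, in case it helps: picking val_idx from the most recent dates rather than at random. The toy frame and the 14-day window are assumptions for illustration only; the real joined table has many more columns and several rows per date.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the joined table: one row per day (assumption).
dates = pd.date_range("2015-01-01", periods=60, freq="D")
df = pd.DataFrame({"Date": dates, "Sales": np.arange(1, 61, dtype=float)})

# Hold out the final 14 days instead of sampling rows at random.
cutoff = df["Date"].max() - pd.Timedelta(days=14)
is_val = (df["Date"] > cutoff).values

# Positional row indexes, the kind of list that val_idx expects.
val_idx = np.flatnonzero(is_val)
```

The resulting val_idx could then be passed to from_data_frame in place of a random sample of row numbers.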

zpnc · November 22, 2017, 2:38pm

I have additional questions about ColumnarModelData and I thought it would be better to ask them here instead of creating a new topic.

Q1: How should I create a ColumnarModelData if I don’t have any categorical variables? If I pass an empty list, PyTorch later throws an empty-array error when I start the fit process.

Q2: Is it possible to do classification instead of regression using ColumnarModelData’s learner?

Thanks in advance.

jeremy · November 22, 2017, 6:05pm

Oops! Good point. For now, just add a column of zeros, I think.
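
A minimal sketch of that workaround, using pandas only. The column name dummy_cat and the toy frame are made up for illustration; nothing here is part of the library.

```python
import pandas as pd

# Toy frame with continuous columns only (column names are made up).
df = pd.DataFrame({"temp": [0.1, -0.3, 1.2], "dist": [1.0, 0.5, -0.7]})

# Hypothetical placeholder column: a single category encoded as 0.
df["dummy_cat"] = 0
cat_flds = ["dummy_cat"]

# The remaining columns would be the continuous inputs.
cont_flds = [c for c in df.columns if c not in cat_flds]
```

With this, cat_flds is no longer empty, so it can be passed as cat_flds=cat_flds while the constant column carries no information.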

You can do a binary model now, by passing y_range=[0,1]. It doesn’t support multiclass classification at the moment, although it would be easy to add to class MixedInputModel if someone’s interested.
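
To illustrate the idea, here is a sketch of what a y_range of [0,1] amounts to, assuming the range is applied by squashing the final output with a sigmoid. That is an assumption about the internals, and scale_to_range is a hypothetical helper, not a library function.

```python
import numpy as np

def scale_to_range(raw, y_range):
    """Squash raw outputs into (lo, hi) with a sigmoid (illustrative helper)."""
    lo, hi = y_range
    return 1.0 / (1.0 + np.exp(-raw)) * (hi - lo) + lo

# Raw model outputs for three rows (made-up numbers).
raw = np.array([-4.0, 0.0, 4.0])

# With y_range=[0, 1] the outputs can be read as probabilities.
probs = scale_to_range(raw, [0, 1])

# Threshold at 0.5 to get binary class labels.
labels = (probs > 0.5).astype(int)
```

Under that assumption, a 0/1 target together with thresholded predictions gives a binary classifier, even though the model is still fitted as a regression.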

jakcycsl · November 23, 2017, 1:04am

@jeremy I have one question too. From the API, ColumnarModelData doesn’t seem to support test sets. How can we make predictions on a test set?

def from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs):

zpnc · November 23, 2017, 12:34pm

@jeremy Thanks for answering my previous question. I have another one along the line of @jakcycsl’s.

My workflow is fairly typical:

1. train NN on columnar data
   a) load train/val set
   b) do 'preprocessing' (i.e. standardize each column)
   c) initialize ColumnarModelData and get the learner
   d) use lr_find; SGDR;
   e) save model at the end
2. go to local machine
   a) load sample
   b) do 'preprocessing'
   c) initialize ColumnarModelData using loaded sample and get the learner
   d) load model trained on AWS
   e) do classification on loaded sample

To do step 2b) I need the mapper from step 1b), passed as an argument to the proc_df function, right? So I need to serialize the mapper as well.

Is step 2c correct?
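
For reference, the mapper hand-off between steps 1b) and 2b) could be sketched like this. The dict of per-column mean and std is only a stand-in for the real mapper object returned by proc_df, and the file name mapper.pkl is hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

# Stand-in for the mapper from proc_df: per-column mean and std learned
# on the training data (hypothetical structure, for illustration only).
train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
mapper = {c: (train[c].mean(), train[c].std()) for c in train.columns}

# Step 1: save the fitted statistics alongside the model weights.
mapper_path = os.path.join(tempfile.mkdtemp(), "mapper.pkl")
with open(mapper_path, "wb") as f:
    pickle.dump(mapper, f)

# Step 2: on the other machine, load them and reuse them; do not refit.
with open(mapper_path, "rb") as f:
    loaded = pickle.load(f)

# Scale the new sample with the training statistics.
sample = pd.DataFrame({"a": [2.0], "b": [40.0]})
scaled = sample.copy()
for col, (mean, std) in loaded.items():
    scaled[col] = (sample[col] - mean) / std
```

The point of the sketch is that the sample is standardized with the statistics learned in training, not with its own, so the loaded model sees inputs on the same scale it was trained on.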

jakcycsl · November 23, 2017, 2:31pm

Yep, for step 2c you would need to pass the sample data from proc_df into ColumnarModelData. After that, get the learner, load the model, and run the prediction.

To predict on the test set, I have used the workaround steps below; they have worked fine for me so far.
http://forums.fast.ai/t/structured-learner/8224/3

travis · April 21, 2018, 12:15pm

I have a project that would use ColumnarModelData for categorical predictions. Before I begin the project, I’m curious: Did you have success using it for that purpose? Also, in my project, I would begin with binary classification, but would likely want to move to multiclass. Did you try multi-class?