Structured Learner

Are you working on the Favorita Grocery Sales Forecasting competition too?
I am thinking of a hypothetical situation where we only have some data for the training set but not the test set: what should we do with that data?
For the Grocery Sales Forecasting, though, we are well covered, as the oil price and holiday events data include values for both the train and test timeframes. The transactions CSV only covers the training timeframe, but I think the transactions are too similar to unit sales, the target variable, to be useful as a feature.

There’s another solution (a combined sketch follows the three steps):

  • create a ColumnarDataset from the DataFrame (df)

cds = ColumnarDataset.from_data_frame(df, cat_flds=[…put your cat vars here…], y=dummy_y)

  • create a DataLoader from the ColumnarDataset

dl = DataLoader(cds)

  • make predictions with the DataLoader

predictions = m.predict_dl(dl)
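
Putting the three steps together, here’s a minimal sketch for scoring a test set. It assumes df_test has already been through the same proc_df preprocessing as the training frame, and that cat_vars and m are your categorical column list and trained learner:

import numpy as np
from fastai.column_data import *  # fastai 0.7-style import: ColumnarDataset, DataLoader, etc.

dummy_y = np.zeros(len(df_test))  # placeholder target; ignored at prediction time
cds = ColumnarDataset.from_data_frame(df_test, cat_flds=cat_vars, y=dummy_y)
dl = DataLoader(cds)
predictions = m.predict_dl(dl)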

You should be able to grab it here https://github.com/entron/entity-embedding-rossmann/tree/kaggle

Hi Anze,

Just a little lost here: how do you pass the test data while using these three steps?
Do I have to create a new df containing only the test set using the proc_df command,
and then make sure that df has the same columns as the training one?

@arjunrajkumar Exactly.
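
For reference, a minimal sketch of that step, reusing the scaling mapper and na_dict fitted on the training data so the test frame ends up with identical columns (train_raw, test_raw, and dep are placeholder names):

df, y, nas, mapper = proc_df(train_raw, dep, do_scale=True)
test_raw[dep] = 0  # proc_df expects the target column to exist, so add a dummy if it's missing
df_test, _, nas, mapper = proc_df(test_raw, dep, do_scale=True, mapper=mapper, na_dict=nas)

Passing na_dict=nas is what ensures the test frame gets the same _na indicator columns as training, even for fields with no missing values in the test set.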

I updated the Rossmann notebook a couple of days ago to show how to create a submission file for the test set. It’s quite an involved process unfortunately - you have to replicate all the preprocessing steps for the test set.

@jeremy, in this example we have a regression model, right? How can I change this Structured Learner to use it for a classification task? Just by changing the out_sz of get_learner to the number of classes?

That sounds right - and change learn.crit to use cross_entropy. Let us know if you get it working.
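
Roughly, that would look like the sketch below (n_classes is a hypothetical variable holding the number of classes; note that F.cross_entropy expects integer class codes, not floats, as targets):

import torch.nn.functional as F

m = md.get_learner(emb_szs, n_cont=len(df.columns)-len(cat_vars),
                   emb_drop=0.04, out_sz=n_classes,    # one output unit per class
                   szs=[250,100], drops=[0.001,0.01])  # no y_range for classification
m.crit = F.cross_entropy  # replace the default regression loss on the learner
m.fit(1e-3, 1)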

Thanks! I’ll give it a shot.

I have set the output categorical variable to 0 & 1 using the following code:

dp_input[dep] = dp_input[dep].cat.codes # Mapped to 0 or 1
df, y, nas, mapper = proc_df(dp_input, dep, do_scale=True)
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y, cat_flds=cat_vars, bs=128)
cat_sz = [(c, len(dp_input[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(10, (c+1)//2)) for _,c in cat_sz]
m = md.get_learner(emb_szs, n_cont = len(df.columns)-len(cat_vars),
                   emb_drop = 0.04, out_sz = 2, szs = [250,100], drops = [0.001,0.01], y_range=[0,1], use_bn = True)
lr = 1e-3
m.fit(lr, 1)

The code throws the following error:
RuntimeError: input and target have different number of elements: input[128 x 2] has 256 elements, while target[128 x 1] has 128 elements at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THNN/generic/MSECriterion.c:12

Not sure how to set learn.crit to use cross_entropy.

But the same code runs if I set out_sz = 1 for regression.

You need to one hot encode the target variable. Also use a sigmoid layer as the final layer in the model if your original model was a regression model.
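
For the one-hot step, a minimal sketch (np.eye is just one way to do it; whether the downstream ColumnarDataset accepts a two-column target may depend on your fastai version):

import numpy as np

y_onehot = np.eye(2)[y.astype(int).ravel()]  # shape (n, 2): [1, 0] for class 0, [0, 1] for class 1
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y_onehot, cat_flds=cat_vars, bs=128)

That matches the shapes in the error above: a one-hot target is [128 x 2], the same as the model output.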

@jeremy What would be the minimum dataset row count for structured deep learning to work? I have a dataset with ~5000 entries and ~15 features, and I want to use it for regression. Would a random forest help? I know it’s a subjective question, but do you have any rule of thumb?

Sorry, I don’t have enough experience with this kind of approach to have a rule of thumb yet. I’d be very interested to hear your results if you give it a try. Definitely try an RF first, however.

Sure, will give it a try and get back on this.

I also have a data engineering question 🙂:

Are there packages that might help, or sample Python code, to chop a log file into multiple parts by some pattern and then apply a single extraction procedure to all of the parts in parallel?

You don’t need a package for that - just use ProcessPoolExecutor.map
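
A minimal sketch of that approach (the log path and the boundary pattern are made up; parse_chunk stands in for your extraction procedure):

import re
from concurrent.futures import ProcessPoolExecutor

def parse_chunk(chunk):
    # stand-in for the real extraction logic applied to one part of the log
    return len(chunk.splitlines())

def main():
    text = open('app.log').read()                  # hypothetical log file
    chunks = re.split(r'(?m)^--- request ', text)  # hypothetical split pattern
    with ProcessPoolExecutor() as ex:              # defaults to one worker per CPU core
        results = list(ex.map(parse_chunk, chunks))
    print(results)

if __name__ == '__main__':
    main()  # the guard matters: worker processes re-import this module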

@jeremy I tried the following to get a classification task working, but the code gave an error:

m = md.get_learner(emb_szs, n_cont = len(df.columns)-len(cat_vars),
                   emb_drop = 0.04, out_sz = 2, szs = [250,100], drops = [0.001,0.01], use_bn = True)
lr = 1e-3
m.fit(lr, 1, crit = F.cross_entropy) 

It threw the following error. Not sure what the settings should be to get the model working for classification:

TypeError: fit() got multiple values for argument 'crit'

Try changing the loss function to F.cross_entropy in the StructuredLearner class in column_data.py.

Or I think you can just move the crit= bit into the get_learner call.
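
For the first option, the change is a one-line edit inside the library (fastai/column_data.py in the 0.7 codebase); from memory it would look roughly like this, so treat it as a sketch rather than a verified diff:

class StructuredLearner(Learner):
    def __init__(self, data, models, **kwargs):
        super().__init__(data, models, **kwargs)
        self.crit = F.cross_entropy  # was: self.crit = F.mse_loss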

Thanks. I tried changing the following:

1. Changed F.mse_loss to F.cross_entropy, and it threw some error.
2. Tried changing the get_learner API to pass crit=F.cross_entropy, and it threw an error that crit was an invalid argument, even though I had made the change in the relevant functions.

Not sure what mistakes I am making.

@jeremy: Is there a sample Python notebook for a classification task on structured data? I tried to go through the functions, but I didn’t get far in getting the code working for classification on my end.