Structured Learner

kcturgutlu · November 23, 2017, 3:57am

Hi,

I am working on a structured model, but I am having some trouble making predictions.

So, I trained my model and ran this code (test_x is a dataframe):

test_ds = ColumnarDataset.from_data_frame(test_x[:10], cats, test.price)
test_dl = DataLoader(test_ds)
preds = m.predict_dl(test_dl)

It just keeps running. What might be the issue?

Thanks in advance

jeremy · November 23, 2017, 4:32am

Try interrupting it, and check the stack trace. And next time you run it, run it under the debugger. Let me know if you need any help with those steps - and let us know if you find the solution!

kcturgutlu · November 23, 2017, 4:49am

What I am doing for now as a temporary solution (it at least works):

Tune model with train and val sets.
Store these hyperparameters (lr, dimension of embeddings)
Combine train and test (add target column to test)
Run model as you are running with val
Make predictions with m.predict()
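
A minimal sketch of the combine step in plain pandas. The column names and toy values are made up for illustration, and using the test rows as the validation indices is my reading of the steps above, not something the library requires:

```python
import pandas as pd

# Toy stand-ins for the real train and test frames (hypothetical columns).
train = pd.DataFrame({"store": ["a", "b", "a", "c"],
                      "promo": [0, 1, 1, 0],
                      "price": [10.0, 12.5, 11.0, 9.0]})
test = pd.DataFrame({"store": ["b", "c"],
                     "promo": [1, 0]})

# The test set has no target, so add a placeholder column of zeros.
test = test.assign(price=0.0)

# Stack train on top of test; the test rows end up at the tail.
combined = pd.concat([train, test], ignore_index=True)

# Assumption: the test rows play the role of the validation set, so that
# predicting on the "validation" rows yields predictions for the test rows.
val_idx = list(range(len(train), len(combined)))
```

Any validation metric computed this way is meaningless, since the targets for those rows are placeholders; only the predictions are of interest.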

Note: I am fascinated by representing categorical variables in Euclidean spaces, especially after seeing some t-SNE visualizations of them. Very cool.

jeremy · November 23, 2017, 5:28am

Yup it’s kinda freaky…

jakcycsl · November 23, 2017, 6:31am

Combining test set and train set sounds like a nice trick.

How do you fill in the data that is not available yet? For example, for the test data in the Rossmann competition, we do not have future data for weather, googletrend, etc. Can we just fill it with null values? I wonder if that would distort the predictions.

kcturgutlu · November 23, 2017, 6:37am

I actually did this for a new competition on Kaggle. As for the weather and googletrend data, I think we should have it, since that is the only way the 3rd place winners could have made a submission. So it must be somewhere in there.

The only thing that is not available is the target, so just fill it with 0s.

jakcycsl · November 23, 2017, 6:54am

Are you working on the Favorita Grocery Sales Forecasting competition too?
I am thinking of a hypothetical situation where some data is available for the training set but not for the test set. What should we do with that data?
For the Grocery Sales Forecasting competition we are well covered, as the oil price and holiday events data include values for both the train and test timeframes. The transactions csv only covers the training timeframe, but I think transactions are too similar to unit sales and should be treated as a target variable.

zpnc · November 23, 2017, 8:11pm

There’s another solution:

Create a ColumnarDataset from a DataFrame (df):

cds = ColumnarDataset.from_data_frame(df,cat_flds=[…put your cat vars here…],y=dummy_y)

Create a DataLoader from the ColumnarDataset:

dl = DataLoader(cds)

Make predictions for the DataLoader:

predictions = m.predict_dl(dl)

jeremy · November 24, 2017, 1:10am

You should be able to grab it here: GitHub - entron/entity-embedding-rossmann at kaggle

arjunrajkumar · November 25, 2017, 7:34am

Hi Anze,

I'm a little lost here. How do you pass the test data when using these three steps?
Do I have to create a new df containing only the test set using the proc_df command,
and then make sure that df has the same columns as the one used for training?

zpnc · November 25, 2017, 7:34pm

@arjunrajkumar Exactly.
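
For the "same columns" part, a rough sketch in plain pandas, assuming both frames have already been preprocessed. The frames and column names here are toy examples, not taken from any notebook:

```python
import pandas as pd

# Hypothetical processed training frame.
train_df = pd.DataFrame({"store": [1, 2], "promo": [0, 1], "dist": [1.5, 3.0]})
# Hypothetical test frame with a different column order and an extra column.
test_df = pd.DataFrame({"promo": [1], "id": [7], "store": [2], "dist": [2.0]})

# Fail early if a training column is absent from the test frame,
# because reindex would silently fill it with NaN.
missing = [c for c in train_df.columns if c not in test_df.columns]
assert not missing, f"test set is missing columns: {missing}"

# Keep exactly the training columns, in the training order.
test_aligned = test_df.reindex(columns=train_df.columns)
```

The order matters as well as the names, since the categorical and continuous columns are picked out by position or by name depending on how the dataset is built.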

jeremy · November 25, 2017, 11:44pm

I updated the Rossmann notebook a couple of days ago to show how to create a submission file for the test set. It’s quite an involved process unfortunately - you have to replicate all the preprocessing steps for the test set.
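
To illustrate what replicating the preprocessing means for missing values and scaling, here is a small sketch in plain pandas rather than the library's helpers. The column name, the values, and the `preprocess` helper are hypothetical; the point is only that the fill value and the scaling statistics come from the training set and are reused unchanged on the test set:

```python
import pandas as pd

# Toy data: one continuous column with missing values (hypothetical).
train = pd.DataFrame({"dist": [1.0, 3.0, None, 5.0]})
test = pd.DataFrame({"dist": [None, 7.0]})

# Learn the fill value and scaling statistics on the training set only.
fill_value = train["dist"].median()
train_filled = train["dist"].fillna(fill_value)
mean, std = train_filled.mean(), train_filled.std(ddof=0)

def preprocess(col):
    """Apply the training-set fill value and scaling to any column."""
    return (col.fillna(fill_value) - mean) / std

train_scaled = preprocess(train["dist"])
test_scaled = preprocess(test["dist"])  # no statistics are recomputed here
```

Recomputing the median or the mean on the test set instead would give the model inputs on a slightly different scale from the one it was trained on.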

thiago · November 27, 2017, 1:06am

@jeremy, in this example we have a regression model, right? How can I change this Structured Learner to use it for a classification task? Just by changing the out_sz of get_learner to the number of classes?

jeremy · November 27, 2017, 4:39am

That sounds right - and change learn.crit to use cross_entropy. Let us know if you get it working.
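
As a sketch of what a cross-entropy criterion expects, here is a pure-Python version. It is an illustration written for this thread, not the library's implementation: the model emits one raw score per class for each row (hence out_sz equal to the number of classes), and the target is a single integer class index per row.

```python
import math

def cross_entropy(logits, targets):
    """Mean cross-entropy for rows of raw class scores and integer labels."""
    total = 0.0
    for row, target in zip(logits, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        # Negative log-probability of the correct class.
        total += log_norm - row[target]
    return total / len(targets)

# Two classes, so each output row has two scores (out_sz = 2);
# targets hold one class index per row, not one value per output.
logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]
targets = [0, 1, 1]
loss = cross_entropy(logits, targets)
```

Note that the output has shape [N x classes] while the target has N entries, which is exactly the combination a mean-squared-error criterion would reject.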

thiago · November 27, 2017, 10:26am

Thanks! I’ll give it a shot.

cvgoudar · November 29, 2017, 10:21am

I have set the output categorical variable to 0 & 1 using the following code:

dp_input[dep] = dp_input[dep].cat.codes # Mapped to 0 or 1
df, y, nas, mapper = proc_df(dp_input, dep, do_scale=True)
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y, cat_flds=cat_vars, bs=128)
cat_sz = [(c, len(dp_input[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(10, (c+1)//2)) for _,c in cat_sz]
m = md.get_learner(emb_szs, n_cont = len(df.columns)-len(cat_vars),
                   emb_drop = 0.04, out_sz = 2, szs = [250,100], drops = [0.001,0.01], y_range=[0,1], use_bn = True)
lr = 1e-3
m.fit(lr, 1)

The code throws the following error:
RuntimeError: input and target have different number of elements: input[128 x 2] has 256 elements, while target[128 x 1] has 128 elements at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THNN/generic/MSECriterion.c:12

I'm not sure how to set learn.crit to use cross_entropy.

However, the same code runs if I set out_sz = 1, as for regression.

ar_ai · November 29, 2017, 10:29am

You need to one-hot encode the target variable. Also use a sigmoid layer as the final layer in the model if your original model was a regression model.
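
A small pure-Python sketch of that idea, with made-up scores and labels. Once the target is one-hot encoded it has one value per output, so an element-wise loss such as mean squared error is defined on the two equally shaped arrays. This is an alternative to switching the criterion to cross_entropy as suggested earlier in the thread, and the helper names below are hypothetical:

```python
import math

def one_hot(labels, n_classes):
    """Turn integer class labels into rows with a single 1.0."""
    return [[1.0 if j == label else 0.0 for j in range(n_classes)]
            for label in labels]

def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def mse(outputs, targets):
    """Mean squared error over two equally shaped lists of rows."""
    diffs = [(o - t) ** 2
             for out_row, tgt_row in zip(outputs, targets)
             for o, t in zip(out_row, tgt_row)]
    return sum(diffs) / len(diffs)

labels = [0, 1, 1]                      # one label per row
targets = one_hot(labels, n_classes=2)  # now one value per output
raw = [[2.0, -2.0], [0.0, 0.0], [-3.0, 3.0]]
outputs = [[sigmoid(v) for v in row] for row in raw]
loss = mse(outputs, targets)
```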

ravivijay · November 29, 2017, 10:03pm

@jeremy What would be the minimum number of rows for structured deep learning to work? I have a dataset with ~5000 entries and ~15 features and want to use it for regression. Would random forest help? I know it’s a subjective question, but do you have any rule of thumb?

jeremy · November 29, 2017, 11:12pm

Sorry I don’t have enough experience with this kind of approach to have a rule of thumb yet. I’d be very interested to hear your results if you give it a try. Definitely try RF first however.

ravivijay · November 29, 2017, 11:25pm

Sure. I will give it a try and report back.