Structured Learner


(Kerem Turgutlu) #1

Hi,

I am working on a structured model, but I'm having some trouble making predictions.

I trained my model and ran this code (test_x is a dataframe):

test_ds = ColumnarDataset.from_data_frame(test_x[:10], cats, test.price)
test_dl = DataLoader(test_ds)
preds = m.predict_dl(test_dl)

It just keeps running. What might be the issue?

Thanks in advance


(Jeremy Howard) #2

Try interrupting it, and check the stack trace. And next time you run it, run it under the debugger. Let me know if you need any help with those steps - and let us know if you find the solution!
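
A minimal sketch of the "interrupt and read the stack trace" step, assuming plain Python and no fastai helper. `stuck_loop` is a hypothetical stand-in for the call that hangs (it raises the same exception that Ctrl-C or the Jupyter interrupt button would raise), and `summarize_interrupt` is an illustrative helper, not a library function. The last frame listed is where the program was when it was interrupted.

```python
import traceback


def summarize_interrupt(exc):
    """Return 'file:line in function' strings, outermost call first."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [f"{f.filename}:{f.lineno} in {f.name}" for f in frames]


def stuck_loop():
    # Hypothetical stand-in for a call that never returns.
    # Raising KeyboardInterrupt simulates pressing Ctrl-C while it runs.
    raise KeyboardInterrupt


try:
    stuck_loop()
except KeyboardInterrupt as exc:
    frames = summarize_interrupt(exc)
    for line in frames:
        print(line)
```

In a notebook, running `%debug` in the next cell after the interrupt opens the debugger at that last frame.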


(Kerem Turgutlu) #3

What I am doing for now as a temporary solution (at least it works):

  • Tune model with train and val sets.
  • Store these hyperparameters (lr, dimension of embeddings)
  • Combine train and test (add target column to test)
  • Train the model the same way as before, with the test rows taking the place of val
  • Make predictions with m.predict()
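
The combine step in the list above can be sketched with pandas alone. This is only an illustration under assumptions: the toy frames and the column names (`store`, `promo`, `price`) are made up, and the placeholder target of 0 means any validation metric computed on those rows is meaningless.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
train = pd.DataFrame({
    "store": ["a", "b", "a", "c"],
    "promo": [0, 1, 1, 0],
    "price": [10.0, 12.5, 11.0, 9.0],
})
test = pd.DataFrame({"store": ["b", "c"], "promo": [1, 0]})

dep = "price"

# Add the missing target column to the test rows as a placeholder.
test_with_target = test.assign(**{dep: 0.0})

# Stack train on top of test so both go through the same preprocessing.
combined = pd.concat([train, test_with_target], ignore_index=True)

# Row positions of the test rows, usable wherever a list of
# validation indexes is expected.
val_idx = list(range(len(train), len(combined)))
print(val_idx)
```

With the test rows standing in for the validation set, the predictions for the validation rows are the test predictions.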

Note: I am fascinated by the idea of expressing categorical variables in Euclidean space, especially after seeing some t-SNE visualizations of them. Very cool :smiley:


(Jeremy Howard) #4

Yup it’s kinda freaky…


(Chan Sooi Loong) #5

Combining test set and train set sounds like a nice trick. :grinning:

How do you fill in the data that are not available yet? For example, for the test data in the Rossmann competition, we do not have the future data for weather, googletrend, etc. Can we just fill those in with null values? I wonder whether that would distort the predictions.


(Kerem Turgutlu) #6

I actually did this for a new competition on Kaggle. As for the weather and googletrend data, I think we should have it, since that's the only way the 3rd place winners could have made a submission. So it must be somewhere in there :smiley:

The only thing that is not available is the target, so just fill it with 0s.


(Chan Sooi Loong) #7

Are you working on the Favorita Grocery Sales Forecasting too?
I am thinking of a hypothetical situation where some data is available only for the training period and not for the test period. What should we do with that data?
For the Grocery Sales Forecasting competition we are well covered, as the oil price and holiday event data include values for both the train and test timeframes. The transactions CSV covers only the training timeframe, but I think transactions are so similar to unit sales that they should be treated as a target variable.


(Anze Zupanc) #8

There’s another solution:

  • create a ColumnarDataset from the DataFrame (df)

cds = ColumnarDataset.from_data_frame(df,cat_flds=[…put your cat vars here…],y=dummy_y)

  • create a DataLoader from the ColumnarDataset

dl = DataLoader(cds)

  • make predictions for the DataLoader

predictions = m.predict_dl(dl)


(Jeremy Howard) #9

You should be able to grab it here https://github.com/entron/entity-embedding-rossmann/tree/kaggle


(Arjun Rajkumar) #10

Hi Anze,

I'm a little lost here. How do you pass the test data when using these three steps?
Do I have to create a new df containing only the test set using the proc_df command,
and then make sure that df has the same columns as the one used for training?


(Anze Zupanc) #11

@arjunrajkumar Exactly.


(Jeremy Howard) #12

I updated the Rossmann notebook a couple of days ago to show how to create a submission file for the test set. It’s quite an involved process unfortunately - you have to replicate all the preprocessing steps for the test set.
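
A minimal sketch of what "replicating the preprocessing" means, assuming pandas only and toy data. The column names and the `preprocess` helper are hypothetical and this is not the notebook's code. The point is that category codes and fill values are learned from the training set and then reused unchanged on the test set, so both frames end up with the same columns and the same encoding. Shifting the codes by 1 so that 0 means "not seen in training" is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical toy data; "d" never appears in training.
train = pd.DataFrame({
    "store_type": ["a", "b", "c", "a"],
    "distance": [100.0, None, 300.0, 500.0],
})
test = pd.DataFrame({
    "distance": [None, 250.0],
    "store_type": ["b", "d"],
})

# Learn the preprocessing state from the training set only.
train_cats = sorted(train["store_type"].dropna().unique())
train_median = train["distance"].median()


def preprocess(df):
    """Apply the training-set categories and median to any frame."""
    out = df.copy()
    # Unseen categories get code -1 from pandas, which becomes 0 after +1.
    codes = pd.Categorical(out["store_type"], categories=train_cats).codes
    out["store_type"] = codes + 1
    out["distance"] = out["distance"].fillna(train_median)
    # Fix the column order so train and test line up.
    return out[["store_type", "distance"]]


train_p = preprocess(train)
test_p = preprocess(test)
print(test_p)
```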


#14

@jeremy, In this example we have a regression model, right? How can I change this Structured Learner to use in a classification task? Just by changing the out_sz of get_learner to the number of classes?


(Jeremy Howard) #15

That sounds right - and change learn.crit to use cross_entropy. Let us know if you get it working.
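
For reference, a plain-Python sketch of what a cross-entropy criterion computes, written without PyTorch so that the shapes are explicit. The function below is illustrative, not the library implementation. The model emits one score per class (so out_sz equals the number of classes) and the target is a single integer class index per row, which is also what PyTorch's `torch.nn.functional.cross_entropy` expects. Whether assigning that function to learn.crit is all that is needed in this library version is untested here.

```python
import math


def cross_entropy(logits, targets):
    """Mean cross-entropy over a batch.

    logits:  list of rows, each with one raw score per class
    targets: list of integer class indexes, one per row
    """
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the row.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        # Negative log-probability assigned to the true class.
        total += log_sum_exp - row[target]
    return total / len(targets)


# Three rows, two classes: shape [3 x 2] scores against [3] class indexes.
logits = [[2.0, -1.0], [0.0, 0.0], [-3.0, 3.0]]
targets = [0, 1, 1]
loss = cross_entropy(logits, targets)
print(loss)
```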


#16

Thanks! I’ll give it a shot.


(CVG) #17

I have set the output categorical variable to 0 & 1 using the following code:

dp_input[dep] = dp_input[dep].cat.codes # Mapped to 0 or 1
df, y, nas, mapper = proc_df(dp_input, dep, do_scale=True)
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y, cat_flds=cat_vars, bs=128)
cat_sz = [(c, len(dp_input[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(10, (c+1)//2)) for _,c in cat_sz]
m = md.get_learner(emb_szs, n_cont = len(df.columns)-len(cat_vars),
                   emb_drop = 0.04, out_sz = 2, szs = [250,100], drops = [0.001,0.01], y_range=[0,1], use_bn = True)
lr = 1e-3
m.fit(lr, 1)

The code throws the following error:
RuntimeError: input and target have different number of elements: input[128 x 2] has 256 elements, while target[128 x 1] has 128 elements at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THNN/generic/MSECriterion.c:12

I'm not sure how to set learn.crit to use cross_entropy.

The same code runs if I set out_sz = 1, as for regression.
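
As a side note, the embedding-size rule from the code above can be checked on its own. This sketch uses hypothetical cardinalities and a helper name that is not part of the library. Each size is half the cardinality, rounded up, and capped at 10.

```python
def embedding_sizes(cardinalities, max_size=10):
    """Pair each cardinality with min(max_size, (cardinality + 1) // 2)."""
    return [(c, min(max_size, (c + 1) // 2)) for c in cardinalities]


# Hypothetical (name, cardinality) pairs, in the same shape as cat_sz.
cat_sz = [("store", 1116), ("day_of_week", 8), ("state_holiday", 3)]
emb_szs = embedding_sizes([c for _, c in cat_sz])
print(emb_szs)
```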


#18

You need to one-hot encode the target variable. Also use a sigmoid layer as the final layer in the model if your original model was a regression model.
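
A minimal sketch of one-hot encoding a 0/1 target, in plain Python with a hypothetical helper name. It turns a [batch] vector of labels into a [batch x n_classes] matrix, which matches the shape of a two-output model for an element-wise loss such as the MSE criterion in the error above. Note that PyTorch's cross-entropy criterion is normally given integer class indexes instead.

```python
def one_hot(labels, n_classes):
    """Return one row per label with 1.0 in the label's column."""
    return [
        [1.0 if c == label else 0.0 for c in range(n_classes)]
        for label in labels
    ]


y = [0, 1, 1, 0]
y_one_hot = one_hot(y, 2)
print(y_one_hot)
```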


(Ravi Sekar Vijayakumar) #19

@jeremy What is the minimum number of rows a dataset needs for structured deep learning to work? I have a dataset with ~5000 entries and ~15 features and want to use it for regression. Would a random forest help? I know it's a subjective question, but do you have any rule of thumb?


(Jeremy Howard) #20

Sorry I don’t have enough experience with this kind of approach to have a rule of thumb yet. I’d be very interested to hear your results if you give it a try. Definitely try RF first however.


(Ravi Sekar Vijayakumar) #21

Sure. I will give it a try and report back.