Structured Learner

You should be able to grab it here: https://github.com/entron/entity-embedding-rossmann/tree/kaggle

Hi Anze,

I'm a little lost here. How do you pass the test data when using these three steps?
Do I have to create a new df containing only the test set using proc_df,
and then make sure that df has the same columns as the training one?

@arjunrajkumar Exactly.

I updated the Rossmann notebook a couple of days ago to show how to create a submission file for the test set. It’s quite an involved process unfortunately - you have to replicate all the preprocessing steps for the test set.
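
As a rough illustration of what "replicate all the preprocessing steps" means, here is a minimal sketch in plain Python. It is not the fastai API: the function names, column names, and toy rows are all hypothetical. The point it tries to show is that category codes, fill values, and scaling statistics are learned from the training rows only and then reused unchanged on the test rows, so both end up with the same columns and the same encoding.

```python
from statistics import mean, median, pstdev

def fit_preprocessing(train_rows, cat_cols, cont_cols):
    """Learn category codes, fill values, and scaling stats from training rows only."""
    state = {"codes": {}, "fill": {}, "scale": {}}
    for col in cat_cols:
        cats = sorted({r[col] for r in train_rows if r[col] is not None})
        # Code 0 is reserved for missing or previously unseen categories.
        state["codes"][col] = {value: i + 1 for i, value in enumerate(cats)}
    for col in cont_cols:
        known = [r[col] for r in train_rows if r[col] is not None]
        fill = median(known)
        filled = [fill if r[col] is None else r[col] for r in train_rows]
        state["fill"][col] = fill
        state["scale"][col] = (mean(filled), pstdev(filled) or 1.0)
    return state

def apply_preprocessing(rows, state):
    """Apply the stored training-time transformations to any set of rows."""
    out = []
    for r in rows:
        new = {}
        for col, codes in state["codes"].items():
            new[col] = codes.get(r.get(col), 0)
        for col, fill in state["fill"].items():
            value = r.get(col)
            value = fill if value is None else value
            mu, sd = state["scale"][col]
            new[col] = (value - mu) / sd
        out.append(new)
    return out

# Hypothetical toy data: one categorical and one continuous column.
train = [
    {"StoreType": "a", "Distance": 100.0},
    {"StoreType": "b", "Distance": 300.0},
    {"StoreType": "a", "Distance": None},
]
test = [{"StoreType": "c", "Distance": None}]

state = fit_preprocessing(train, ["StoreType"], ["Distance"])
train_ready = apply_preprocessing(train, state)
test_ready = apply_preprocessing(test, state)
print(test_ready)
```

The test rows never contribute to the statistics, which is why the stored state has to be carried over from training.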

@jeremy, in this example we have a regression model, right? How can I adapt this Structured Learner for a classification task? Just by changing the out_sz of get_learner to the number of classes?

That sounds right - and change learn.crit to use cross_entropy. Let us know if you get it working.

Thanks! I’ll give it a shot.

I have set the output categorical variable to 0 & 1 using the following code:

dp_input[dep] = dp_input[dep].cat.codes # Mapped to 0 or 1
df, y, nas, mapper = proc_df(dp_input, dep, do_scale=True)
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y, cat_flds=cat_vars, bs=128)
cat_sz = [(c, len(dp_input[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(10, (c+1)//2)) for _,c in cat_sz]
m = md.get_learner(emb_szs, n_cont = len(df.columns)-len(cat_vars),
                   emb_drop = 0.04, out_sz = 2, szs = [250,100], drops = [0.001,0.01], y_range=[0,1], use_bn = True)
lr = 1e-3
m.fit(lr, 1)

The code throws the following error:
RuntimeError: input and target have different number of elements: input[128 x 2] has 256 elements, while target[128 x 1] has 128 elements at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THNN/generic/MSECriterion.c:12

I'm not sure how to set learn.crit to use cross_entropy.

The same code runs if I set out_sz = 1, as for regression.
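
The mismatch in that error comes from MSECriterion, which compares output and target element by element, so a [128 x 2] output cannot be matched against a [128 x 1] target. Cross-entropy uses a different convention: one row of class scores per example, and a single integer class index per example as the target. PyTorch's F.cross_entropy follows that convention and applies the log-softmax itself. Below is a minimal pure-Python sketch of the convention; the function and the toy numbers are illustrative only, not fastai or PyTorch code.

```python
import math

def cross_entropy(logits, targets):
    """Mean cross-entropy for logits shaped [N][C] and integer class targets shaped [N]."""
    total = 0.0
    for row, target in zip(logits, targets):
        top = max(row)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_sum_exp - row[target]  # negative log-softmax of the true class
    return total / len(targets)

# Toy batch: 3 examples, 2 classes (as with out_sz = 2); targets are plain 0/1 codes.
logits = [[2.0, -1.0], [0.0, 0.0], [-1.5, 1.0]]
targets = [0, 1, 1]
print(cross_entropy(logits, targets))
```

With this loss the 0/1 codes produced by .cat.codes can be used directly as targets, provided they reach the loss as integers rather than floats.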

You need to one-hot encode the target variable so that its shape matches the model's [128 x 2] output. Also use a sigmoid layer as the final layer in the model if your original model was a regression model.

@jeremy What would be the minimum number of rows for structured deep learning to work? I have a dataset with ~5000 entries and ~15 features and want to use it for regression. Would a random forest help? I know it’s a subjective question, but do you have a rule of thumb?

Sorry I don’t have enough experience with this kind of approach to have a rule of thumb yet. I’d be very interested to hear your results if you give it a try. Definitely try RF first however.

Sure. I'll give it a try and get back to you on this.

I also have a data engineering question:

Are there packages, or sample code in Python, that might help chop a log file into multiple parts by some pattern and then apply a single procedure to extract data from all of these parts in parallel?

You don’t need a package for that - just use ProcessPoolExecutor.map
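
A minimal sketch of that suggestion, using ProcessPoolExecutor from the standard-library concurrent.futures module. The header pattern, the per-part procedure, and all function names below are assumptions made up for illustration; swap in whatever pattern and extraction logic your log actually needs.

```python
import re
from concurrent.futures import ProcessPoolExecutor

# Hypothetical delimiter: each part of the log starts with a line like "=== job 17 ===".
PART_HEADER = re.compile(r"^=== .* ===$", re.MULTILINE)

def split_log(text):
    """Chop the log into parts, each beginning at a header line.

    Any text before the first header is dropped in this sketch.
    """
    starts = [m.start() for m in PART_HEADER.finditer(text)]
    if not starts:
        return [text] if text else []
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

def extract(part):
    """Hypothetical per-part procedure: return the header and the number of ERROR lines."""
    lines = part.splitlines()
    return {"header": lines[0], "errors": sum("ERROR" in ln for ln in lines[1:])}

def extract_all(text, max_workers=None):
    """Run extract on every part in parallel worker processes, keeping the input order."""
    parts = split_log(text)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, parts))
```

The worker function has to be defined at module level so it can be pickled, and on platforms that spawn worker processes the call to extract_all should sit under an `if __name__ == "__main__":` guard.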

@jeremy I tried the following to turn this into a classification task, but the code gave an error:

m = md.get_learner(emb_szs, n_cont = len(df.columns)-len(cat_vars),
                   emb_drop = 0.04, out_sz = 2, szs = [250,100], drops = [0.001,0.01], use_bn = True)
lr = 1e-3
m.fit(lr, 1, crit = F.cross_entropy) 

It threw the following error. I'm not sure what settings are needed to get the model working for classification:

TypeError: fit() got multiple values for argument 'crit'

Try changing the loss function to F.cross_entropy in ‘StructuredLearner’ class in column_data.py.

Or I think you can just move the crit= bit into the get_learner call.
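
One plausible reading of that TypeError, assuming the learner already forwards its own stored loss to an inner training function (this is an assumption about the library internals, not something confirmed in this thread): passing crit= to fit then supplies the same argument twice. The toy classes below are hypothetical stand-ins, not fastai code, and only show the mechanism along with the earlier suggestion of setting the attribute instead.

```python
def fit(model, data, epochs, opt, crit, **kwargs):
    """Stand-in for a low-level training function; returns the loss it was given."""
    return crit

class ToyLearner:
    """Hypothetical learner that stores its loss function as an attribute."""

    def __init__(self):
        self.crit = "mse_loss"

    def fit(self, lr, n_cycle, **kwargs):
        # self.crit is already passed positionally, so crit= in kwargs collides with it.
        return fit("model", "data", n_cycle, "opt", self.crit, **kwargs)

learner = ToyLearner()
try:
    learner.fit(1e-3, 1, crit="cross_entropy")
except TypeError as err:
    print(err)

# Setting the attribute avoids the collision.
learner.crit = "cross_entropy"
print(learner.fit(1e-3, 1))
```

If the real learner works this way, assigning the loss to the learner before calling fit is the simpler route than editing the library source.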

Thanks. I tried the following changes:

1. Changed F.mse_loss to F.cross_entropy, which threw an error.
2. Changed the API of get_learner to accept crit = F.cross_entropy, which also threw an error saying crit was an invalid argument, even though I had made the change in the relevant functions.

I'm not sure what mistakes I'm making.

@jeremy: Is there a sample Python notebook for a classification task on structured data? I tried to go through the functions, but I didn’t get very far in getting the code working for classification on structured data.

No, I’ve not tried it. You’re in uncharted waters!

Hi, what is the intuition behind this embedding weight initialization?

def emb_init(x):
    x = x.weight.data
    sc = 2/(x.size(1)+1)
    x.uniform_(-sc,sc)

Thanks
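
As a worked illustration of what emb_init computes (not an answer on the intuition): x.size(1) is the embedding width, so the weights are drawn uniformly from [-sc, sc] with sc = 2/(width + 1), and that range shrinks as the embedding gets wider. The helper names below are made up for this sketch, and the standard deviation uses the standard result that a uniform distribution on [-a, a] has standard deviation a/sqrt(3).

```python
import math

def emb_init_scale(emb_dim):
    """Half-width of the uniform range used by emb_init for a given embedding width."""
    return 2 / (emb_dim + 1)

def uniform_std(scale):
    """Standard deviation of a uniform distribution on [-scale, scale]."""
    return scale / math.sqrt(3)

for emb_dim in (1, 3, 10, 50):
    sc = emb_init_scale(emb_dim)
    print(f"width={emb_dim:>2}  range=[-{sc:.4f}, {sc:.4f}]  std={uniform_std(sc):.4f}")
```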