You should be able to grab it here https://github.com/entron/entity-embedding-rossmann/tree/kaggle
Just a little lost here. How do you pass the test data while using these three steps?
Do I have to create a new df containing only the test set using the proc_df command,
and then make sure that df has the same columns as the training one?
I updated the Rossmann notebook a couple of days ago to show how to create a submission file for the test set. It’s quite an involved process unfortunately - you have to replicate all the preprocessing steps for the test set.
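As a rough illustration of what "replicating the preprocessing" means, here is a minimal plain-Python sketch. It is not the notebook's code and does not use the fastai API. The column names, the helper names fit_preprocessing and apply_preprocessing, and the choice of code 0 for unseen categories are all assumptions made for the example. The point is only that category codes and scaling statistics are learned from the training rows and then reused unchanged on the test rows.

```python
from statistics import mean, pstdev

def fit_preprocessing(train_rows, cat_col, cont_col):
    """Learn category codes and scaling statistics from the training rows only."""
    categories = sorted({row[cat_col] for row in train_rows})
    # Code 0 is reserved (by assumption) for values never seen in training.
    codes = {cat: i + 1 for i, cat in enumerate(categories)}
    values = [row[cont_col] for row in train_rows]
    return {"codes": codes, "mean": mean(values), "std": pstdev(values) or 1.0}

def apply_preprocessing(rows, state, cat_col, cont_col):
    """Encode and scale rows with previously learned state, without refitting."""
    return [
        (state["codes"].get(row[cat_col], 0),
         (row[cont_col] - state["mean"]) / state["std"])
        for row in rows
    ]

# Hypothetical rows standing in for the training and test tables.
train = [{"StoreType": "a", "Distance": 100.0}, {"StoreType": "b", "Distance": 300.0}]
test = [{"StoreType": "b", "Distance": 200.0}, {"StoreType": "z", "Distance": 100.0}]

state = fit_preprocessing(train, "StoreType", "Distance")
train_encoded = apply_preprocessing(train, state, "StoreType", "Distance")
test_encoded = apply_preprocessing(test, state, "StoreType", "Distance")
```

Because the test rows go through the same state, they end up with the same columns, codes, and scale as the training rows, which is what the model expects at prediction time.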
@jeremy, In this example we have a regression model, right? How can I change this Structured Learner to use in a classification task? Just by changing the out_sz of get_learner to the number of classes?
That sounds right - and change learn.crit to use cross_entropy. Let us know if you get it working.
Thanks! I’ll give it a shot.
I have set the output categorical variable to 0 & 1 using the following code:
dp_input[dep] = dp_input[dep].cat.codes  # Mapped to 0 or 1
df, y, nas, mapper = proc_df(dp_input, dep, do_scale=True)
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y, cat_flds=cat_vars, bs=128)
cat_sz = [(c, len(dp_input[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(10, (c+1)//2)) for _,c in cat_sz]
m = md.get_learner(emb_szs, n_cont = len(df.columns)-len(cat_vars), emb_drop = 0.04, out_sz = 2, szs = [250,100], drops = [0.001,0.01], y_range=[0,1], use_bn = True)
lr = 1e-3
m.fit(lr, 1)
The code throws the following error:
RuntimeError: input and target have different number of elements: input[128 x 2] has 256 elements, while target[128 x 1] has 128 elements at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THNN/generic/MSECriterion.c:12
Not sure how to set learn.crit to use cross_entropy.
But the same code runs if I set out_sz = 1, as in regression.
You need to one hot encode the target variable. Also use a sigmoid layer as the final layer in the model if your original model was a regression model.
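The shape mismatch in the error above comes from the MSE criterion, which compares the 128 x 2 output with the 128 x 1 target element by element. Cross-entropy works differently: in PyTorch, F.cross_entropy accepts raw scores of shape (N, C) together with one integer class index per row, so it is worth checking which target format the chosen loss expects. The following is a plain-Python sketch of that computation, written only to show the shapes involved. It is not PyTorch's implementation, and the function name and sample numbers are made up.

```python
import math

def cross_entropy(scores, targets):
    """Mean cross-entropy for rows of raw class scores and one class index per row."""
    assert len(scores) == len(targets)
    total = 0.0
    for row, target in zip(scores, targets):
        peak = max(row)  # subtract the max before exponentiating, for stability
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += log_sum - row[target]  # negative log-probability of the true class
    return total / len(scores)

# Three rows with two class scores each, as a model with out_sz = 2 would emit.
scores = [[2.0, -1.0], [0.0, 0.0], [-3.0, 1.5]]
# One class index per row: 3 values, not a 3 x 2 array.
targets = [0, 1, 1]

loss = cross_entropy(scores, targets)
```

Note that the scores are unnormalised; the softmax is folded into the loss, so this formulation needs no extra final activation on the model output.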
@jeremy What would be the minimum dataset row count for structured deep learning to work? I have a dataset with ~5000 entries and ~15 features and want to use it for regression. Would random forest help? I know it’s a subjective question, but do you have any rule of thumb?
Sorry I don’t have enough experience with this kind of approach to have a rule of thumb yet. I’d be very interested to hear your results if you give it a try. Definitely try RF first however.
Sure. Will give it a try and get back on this.
I also have a data engineering question:
Are there packages that might help or sample code in python to chop a log file into multiple parts by some pattern and apply a single procedure to extract data from all these parts in parallel?
You don’t need a package for that - just use
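One standard-library way to do this, which may or may not be what the reply had in mind, is to split the text with re and map a function over the parts with concurrent.futures. In the sketch below the "=== JOB n ===" marker, the extract procedure, and the sample log are all hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical marker: each part of the log starts with a line like "=== JOB 42 ===".
PART_START = re.compile(r"^=== JOB \d+ ===$", re.MULTILINE)

def split_log(text):
    """Split text into parts, each beginning at a line that matches PART_START."""
    starts = [m.start() for m in PART_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

def extract(part):
    """Hypothetical procedure: return the part's header and its number of ERROR lines."""
    header, *body = part.splitlines()
    return header, sum(1 for line in body if line.startswith("ERROR"))

def process_log(text, max_workers=4):
    """Apply extract() to every part concurrently; results keep the order of the parts."""
    parts = split_log(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, parts))

sample = """=== JOB 1 ===
INFO start
ERROR disk full
=== JOB 2 ===
INFO start
INFO done
"""
results = process_log(sample)
```

Threads keep the sketch simple. For CPU-heavy parsing, ProcessPoolExecutor from the same module has the same map interface, provided the functions live at module level and the entry point is guarded by if __name__ == "__main__".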
@jeremy I tried the following for a classification task, but the code gave an error:
m = md.get_learner(emb_szs, n_cont = len(df.columns)-len(cat_vars), emb_drop = 0.04, out_sz = 2, szs = [250,100], drops = [0.001,0.01], use_bn = True)
lr = 1e-3
m.fit(lr, 1, crit = F.cross_entropy)
It threw the following error. Not sure what the settings should be to get the model working for classification:
TypeError: fit() got multiple values for argument 'crit'
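This kind of TypeError usually means the same argument reached a function twice: once from a wrapper that supplies it itself, and once from the caller's keyword being forwarded. The sketch below reproduces the message with a made-up Learner class and fit function. It is an assumption about the mechanism, not a copy of the fastai source.

```python
def fit(model, epochs, crit, **kwargs):
    """Stand-in for a low-level training function that takes the loss as `crit`."""
    return f"training {model} for {epochs} epoch(s) with {crit}"

class Learner:
    """Hypothetical learner that stores its loss and supplies it to fit() itself."""

    def __init__(self, crit):
        self.crit = crit

    def fit(self, epochs, **kwargs):
        # self.crit is already passed here, so a crit in kwargs is a second value.
        return fit("model", epochs, self.crit, **kwargs)

learn = Learner(crit="mse_loss")

try:
    learn.fit(1, crit="cross_entropy")
except TypeError as err:
    message = str(err)  # the wrapper and the keyword both supplied crit

# Setting the attribute avoids the clash, because only the wrapper passes crit.
learn.crit = "cross_entropy"
result = learn.fit(1)
```

If the real learner behaves like this, assigning to the crit attribute before calling fit, as suggested earlier in the thread, sidesteps the duplicate argument.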
Try changing the loss function to F.cross_entropy in ‘StructuredLearner’ class in column_data.py.
Or I think you can just move the
crit= bit into the
Thanks. I tried the following changes:
1. Changed F.mse_loss to F.cross_entropy, and it threw an error.
2. Changed the API of get_learner to pass crit = F.cross_entropy, and it also threw an error saying that crit was an invalid argument, even though I had made the change in the relevant functions.
Not sure what mistakes I am making.
@jeremy: Is there a sample Python notebook for a classification task on structured data? I tried to go through the functions, but I didn’t get very far in getting the code to work for classification on structured data.
No, I’ve not tried it. You’re in uncharted waters!
Hi, what is the intuition behind this embedding weight initialization?
def emb_init(x):
    x = x.weight.data
    sc = 2/(x.size(1)+1)
    x.uniform_(-sc,sc)
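One way to get a feel for the formula is to tabulate the range it produces. Since x.size(1) is the number of columns of the weight matrix, that is, the embedding width, the half-width sc shrinks as the embedding gets wider. The sketch below only shows what the formula computes, not why it was chosen. The helper name emb_init_scale is made up, and the standard deviation uses the standard result that Uniform(-a, a) has standard deviation a/sqrt(3).

```python
import math

def emb_init_scale(emb_width):
    """Half-width of the uniform range emb_init uses for an embedding of this width."""
    return 2 / (emb_width + 1)

for width in (1, 3, 10, 50):
    sc = emb_init_scale(width)
    std = sc / math.sqrt(3)  # standard deviation of Uniform(-sc, sc)
    print(f"width={width:>2}  range=[-{sc:.3f}, {sc:.3f}]  std={std:.3f}")
```

So a single-column embedding starts in [-1, 1], while wider embeddings start with proportionally smaller weights.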