Problem with multi-class structured data learner and loss function

Hi all.
I’ve been struggling with this issue for quite a few hours now and would be glad to get some hints from the experts here :slight_smile:

I’m trying to fit a model for a dataset containing ~45K rows with a single multi-class label.
I have 10 possible classes for the label, with values ranging from 1 to 100 (not contiguous, obviously). ^(see comment below)
I guess I’m doing something silly, but I’m getting exceptions from ClassNLLCriterion.cu:

ClassNLLCriterion.cu:101: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [24,0,0]
Assertion t >= 0 && t < n_classes failed.

It happens when line 45 of model.py is being executed: loss.backward().

It probably has something to do with the shape of the target (y) that the loss function expects?
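Indeed, a quick CPU experiment shows what the assert means: every target value must lie in [0, n_classes). This is just a sketch in plain PyTorch; on CPU you get a Python exception instead of the device-side assert:

```python
import torch
import torch.nn.functional as F

log_probs = torch.randn(4, 10).log_softmax(dim=1)  # 4 rows, 10 classes

F.nll_loss(log_probs, torch.tensor([0, 3, 9, 5]))  # fine: all targets in [0, 10)

try:
    # raw labels like 20 or 100 are out of range for 10 classes
    F.nll_loss(log_probs, torch.tensor([1, 20, 100, 7]))
except (IndexError, RuntimeError) as e:
    print("out-of-range target:", e)
```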

My code looks like this:

y = df.label.apply(lambda l: int(float(l)))  # labels are originally decimal; shape of y: (49513,)
df.drop('label', axis=1, inplace=True)       # shape of df: (49513, 2298)
val_idx = get_cv_idxs(len(df), val_pct=0.1)
# cat_flds=[] because I have no categorical variables
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y.values,
                                       cat_flds=[], bs=128, is_reg=False)
m = md.get_learner(emb_szs=[], n_cont=len(df.columns), emb_drop=0.04,
                   out_sz=1, szs=[1000, 500], drops=[0.001, 0.01])

Then I’m getting the above exception (ClassNLLCriterion), followed by: THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCTensorMath.cu line=15 error=59 : device-side assert triggered.

^ I also tried encoding my labels to numbers ranging from 0 to 9 but it didn’t help.
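For reference, a sketch of one way to do that 0–9 encoding (with a toy df standing in for the real one; keeping the uniques around lets you translate predictions back to the raw labels later):

```python
import pandas as pd

df = pd.DataFrame({'label': [1.0, 5.0, 100.0, 5.0, 1.0]})  # toy stand-in
codes, uniques = pd.factorize(df.label, sort=True)
y = codes          # values in 0..n_classes-1, which is what nll_loss expects
classes = uniques  # e.g. the sorted raw label values, for decoding later
```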

Do you have any idea what I’m doing wrong?
Thanks a lot!

I got it to work by changing out_sz to 10 (my number of classes).
I had confused this variable with my desired final output (a single label), but that is not what the last activation function (log_softmax) outputs.
If I understand correctly, the reason is that nll_loss must get an input of shape (bs x num_of_classes), holding the log probability of each class.
The target (y) should just be of shape (bs,), holding the index of the correct class for each row.
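Concretely, the shapes look like this (a standalone sketch in plain PyTorch, outside fastai):

```python
import torch
import torch.nn.functional as F

bs, n_classes = 128, 10
logits = torch.randn(bs, n_classes)           # final linear layer with out_sz=10
log_probs = F.log_softmax(logits, dim=1)      # (bs, n_classes): log probability per class
targets = torch.randint(0, n_classes, (bs,))  # (bs,): one class index per row
loss = F.nll_loss(log_probs, targets)         # scalar loss
```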

Hope that helps someone :slight_smile:


Yes! Thanks so much. This helped immensely. If I set out_sz to the number of classes for non-regression problems, it runs fine.

But then I get 4 values when I run model.predict(); I need to figure out how to handle that…

Just use np.argmax to get the index of the highest probability. Then, if you encoded your labels to indices, translate the index back to your label.
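A minimal sketch of that decoding (the log_preds array and the classes list here are made-up stand-ins for what model.predict() and your own label encoding would give):

```python
import numpy as np

# fake log-probabilities for 4 rows over 10 classes, one dominant class per row
log_preds = np.log(np.full((4, 10), 0.1))
log_preds[range(4), [2, 0, 7, 2]] = np.log(0.9)

idx = np.argmax(log_preds, axis=1)  # predicted class index per row

# hypothetical mapping from index back to the raw label values
classes = np.array([1, 2, 5, 7, 10, 20, 30, 50, 70, 100])
labels = classes[idx]               # predicted raw label per row
```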

Hey Rony, thanks for your example.
1. Did you change the loss function to self.crit = F.nll_loss?
2. Did you add a softmax layer at the end?

Do you mind sharing how to implement these two steps?

Thank you!