ColumnarModelData - Some help pls

Hi! I'm having trouble understanding the ColumnarModelData class.

I created an X with 800 rows x 10 features and a Y with 800 x 1 scalars.
I called ColumnarModelData.from_arrays(".",[-1],X_train,Y_train,bs=32, shuffle=False), but when I tried to generate some batches, they looked wrong.
I dug deeper and found this line:

return cls(path, PassthruDataset(*(trn_xs.T), trn_y, is_reg=is_reg, is_multi=is_multi)

Why trn_xs.T? Why transpose X? What is the intent behind this class? On the forums I saw people creating it from pandas DataFrames.

The shape of X_train is (800,10) and Y_train is (800,1). It takes the transpose because that makes it simpler to split into batches. Each batch then arrives as a list of 10 vectors of length 32, where 32 is the batch size. You can stack them to get a (10,32) tensor.
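To make the shapes concrete, here is a minimal NumPy sketch of the idea. It does not call the library itself: the random data and the slicing step that stands in for the batching are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 10))   # 800 rows, 10 features
Y_train = rng.normal(size=(800, 1))    # 800 scalar targets

# Unpacking the transpose yields one array per feature column,
# which is what the star in *(trn_xs.T) does in the quoted line.
columns = [*X_train.T]                 # 10 arrays, each of shape (800,)

# Hypothetical batching step: take the same 32 rows from every column.
bs = 32
batch = [col[:bs] for col in columns]  # list of 10 vectors of length 32

stacked = np.stack(batch)              # shape (10, 32): features first
batch_first = stacked.T                # shape (32, 10): batch first

# Stacking and transposing recovers the first 32 rows of X_train.
assert np.array_equal(batch_first, X_train[:bs])
```

So the transpose is not about the row data changing, only about each feature being handed over as its own array.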

Thanks @isarth. In the end, yes, I figured it out. In forward() I do the stacking and then permute so that the batch dimension comes first (e.g. to get a 32x10 input shape).
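For anyone landing here later, a rough sketch of that stack-and-permute step in PyTorch. The model class and its name are hypothetical, and it assumes the batch reaches forward() as 10 separate tensors of length 32, one per feature.

```python
import torch
import torch.nn as nn

class LogisticRegression(nn.Module):
    """Hypothetical model that receives one tensor per feature column."""

    def __init__(self, n_features):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, *cols):
        x = torch.stack(cols)   # (n_features, batch), e.g. (10, 32)
        x = x.permute(1, 0)     # (batch, n_features), e.g. (32, 10)
        return torch.sigmoid(self.linear(x))

model = LogisticRegression(n_features=10)
cols = [torch.randn(32) for _ in range(10)]  # stand-in for one batch
out = model(*cols)                           # one probability per row
```

Here out has one row per sample in the batch, so it lines up with a (32, 1) target.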

My question is about the intent behind this particular design choice. It can’t be that batching the transpose gives something like a 100x speed improvement!

Maybe a hint in this direction is that in DL part 1, Lesson 6 (RNNs), ColumnarModelData works seamlessly with the time series models. [Actually, this was my original intent: to work with time series.]

However, because I ran into a lot of errors, I tried something simpler first, a logistic regression. The errors then turned into shape mismatches, and you know the rest…

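Compare the RNN layer documentation, where by default the input shape is (seq_len, batch, input_size), so the batch dimension is not first there either: https://pytorch.org/docs/stable/nn.html#rnn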
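A small sketch of how the stacked layout lines up with that convention. Treating the 10 columns as 10 time steps of a single feature is purely an assumption for illustration, and the hidden size is arbitrary.

```python
import torch
import torch.nn as nn

seq_len, bs, hidden = 10, 32, 4

# Stand-in for one batch: 10 column tensors, each of length 32.
cols = [torch.randn(bs) for _ in range(seq_len)]

# Stack to (seq_len, batch), then add a trailing feature dimension.
x = torch.stack(cols).unsqueeze(-1)   # (10, 32, 1)

# nn.RNN defaults to batch_first=False, i.e. (seq_len, batch, input_size).
rnn = nn.RNN(input_size=1, hidden_size=hidden)
output, h_n = rnn(x)                  # output: (10, 32, 4), h_n: (1, 32, 4)
```

With this layout no permute is needed before the RNN, which may be part of why the columns-first format feels natural for sequence models.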

I am not sure, but I think it was designed mainly for handling time series data, because for image datasets we use a different class.