ColumnarModelData - Some help pls

visoft · June 29, 2018, 8:17am

Hi! I'm having trouble understanding the ColumnarModelData class.

I created an X with 800 rows x 10 features and Y with 800 x 1 scalars.
I used ColumnarModelData.from_arrays(".",[-1],X_train,Y_train,bs=32, shuffle=False) but when I tried to generate some batches, they look wrong.
I dug deeper and found this line:

https://github.com/fastai/fastai/blob/ffb2caaf22ea0ebd30f6dbb260021aeebfacc90c/fastai/column_data.py#L60


class ColumnarModelData(ModelData):
    def __init__(self, path, trn_ds, val_ds, bs, test_ds=None, shuffle=True):
        test_dl = DataLoader(test_ds, bs, shuffle=False, num_workers=1) if test_ds is not None else None
        super().__init__(path, DataLoader(trn_ds, bs, shuffle=shuffle, num_workers=1),
            DataLoader(val_ds, bs*2, shuffle=False, num_workers=1), test_dl)

    @classmethod
    def from_arrays(cls, path, val_idxs, xs, y, is_reg=True, is_multi=False, bs=64, test_xs=None, shuffle=True):
        ((val_xs, trn_xs), (val_y, trn_y)) = split_by_idx(val_idxs, xs, y)
        test_ds = PassthruDataset(*(test_xs.T), [0] * len(test_xs), is_reg=is_reg, is_multi=is_multi) if test_xs is not None else None
        return cls(path, PassthruDataset(*(trn_xs.T), trn_y, is_reg=is_reg, is_multi=is_multi),
                   PassthruDataset(*(val_xs.T), val_y, is_reg=is_reg, is_multi=is_multi),
                   bs=bs, shuffle=shuffle, test_ds=test_ds)

    @classmethod
    def from_data_frames(cls, path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, is_multi, test_df=None, shuffle=True):
        trn_ds  = ColumnarDataset.from_data_frame(trn_df,  cat_flds, trn_y, is_reg, is_multi)
        val_ds  = ColumnarDataset.from_data_frame(val_df,  cat_flds, val_y, is_reg, is_multi)
        test_ds = ColumnarDataset.from_data_frame(test_df, cat_flds, None,  is_reg, is_multi) if test_df is not None else None
        return cls(path, trn_ds, val_ds, bs, test_ds=test_ds, shuffle=shuffle)

return cls(path, PassthruDataset(*(trn_xs.T), trn_y, is_reg=is_reg, is_multi=is_multi)

Why trn_xs.T? Why transpose the X? What is the intention for this class? I saw on forums people using it from pandas dataframes.

isarth · June 29, 2018, 8:52am

X_train has shape (800, 10) and Y_train has shape (800, 1). The transpose is taken because it makes splitting into batches simpler: unpacking trn_xs.T passes each of the 10 feature columns as its own array. A batch then arrives as a list of 10 vectors of length 32, where 32 is the batch size. You can stack them to get a (10, 32) array.
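A minimal NumPy sketch of the batch layout described above. It assumes the dataset returns one value per column for each row and that the loader groups those values field by field. `get_item` and `collate` are hypothetical stand-ins written for illustration, not fastai or PyTorch functions.

```python
import numpy as np

# Toy data with the shapes from the question: 800 rows, 10 features.
X = np.arange(800 * 10, dtype=np.float32).reshape(800, 10)
y = np.arange(800, dtype=np.float32).reshape(800, 1)

# What *(X.T) unpacks into: 10 separate arrays, each of shape (800,).
columns = list(X.T)

def get_item(idx):
    # Hypothetical per-row lookup: one scalar per column, then the target.
    return [col[idx] for col in columns] + [y[idx]]

def collate(items):
    # Hypothetical collation: gather the i-th field of every item into one array.
    return [np.stack(field) for field in zip(*items)]

# Build the first batch of 32 rows (shuffle=False, bs=32).
batch = collate([get_item(i) for i in range(32)])
*xs, yb = batch

print(len(xs), xs[0].shape)   # 10 arrays, each holding 32 values
stacked = np.stack(xs)        # features first, batch second
print(stacked.shape)
```

Under these assumptions the batch is a list of per-column vectors rather than one (32, 10) matrix, which would explain why the batches look wrong at first glance.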

visoft · June 29, 2018, 1:01pm

Thanks @isarth. In the end, yes, I figured it out. In forward() I stack the inputs and then permute so that the batch dimension comes first (e.g. to get a 32x10 input shape).
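A rough NumPy sketch of that reassembly step, using a made-up batch of 10 per-column vectors (the values are arbitrary). It also shows that stacking along axis 1 reaches the same batch-first layout in one step. In PyTorch the analogous call would be torch.stack with its dim argument, which is not shown here.

```python
import numpy as np

bs, n_features = 32, 10

# Hypothetical batch: one vector of length bs per feature column.
xs = [np.arange(bs, dtype=np.float32) * n_features + j for j in range(n_features)]

# Stack, then swap axes so the batch dimension comes first.
stacked = np.stack(xs)      # shape (10, 32)
batch_first = stacked.T     # shape (32, 10)

# Same result in one step by stacking along axis 1.
direct = np.stack(xs, axis=1)

print(batch_first.shape, np.array_equal(batch_first, direct))
```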

My question is about the intent behind this particular design choice. It can't be that batching the transposed array gives something like a 100x speed improvement!

Maybe a hint in this direction is that in DL Part 1, Lesson 6 (RNNs), ColumnarModelData works seamlessly with the time series models. [Actually, this was my original intent: to work with time series.]

However, because of a lot of errors, I tried to do something simpler, a logistic regression. The errors moved to shape mismatch and you know the rest…

visoft · June 29, 2018, 1:05pm

See, for example, the RNN layer documentation: https://pytorch.org/docs/stable/nn.html#rnn

isarth · June 29, 2018, 5:45pm

I am not sure, but I think it was designed mainly for handling time series data, because for image datasets we use a different class.