Use multiple columns as classes for datablock

marcossantana · September 27, 2020, 12:47am

Hi guys,

I’m a bit confused about how to get the labels using the datablock API. More specifically, how can we pass multiple columns of a dataframe to the datablock to be used as classes? In fastai v1, it was straightforward to use:

label_from_df(multi_columns)

and the output of this would be a target with size (batch_size, number_of_columns).

This is a bit different from the multicat example, where the labels were in just one column and separated by commas. In the problem I’m trying to solve, I have 92 classes, and each sample can be assigned to two or more of them and receive a label of ‘Active’ or ‘Inactive’ in each. Therefore, I expect my target to be of shape:

batch_size x 92

and each column would tell whether the sample is Active or Inactive in one of the classes.

marcossantana · September 27, 2020, 5:25am

Ok, I thought I had solved it, but I hadn’t. Let me show exactly what my data looks like and what I want to do.

Each sample of my data can get a label “1” or “0” in one or more of 92 classes. Basically it is a dataframe with text in one column and 92 columns with values “1” and “0”. However, not all samples have a label in all classes. In fact, most of my data consists of samples where only 1 or 2 labels were assigned. For these missing labels, I simply assigned a -1 value.

In this scenario, my target at each batch would be a matrix of shape batch_size x 92.
Like this:

array([[ 0., 0., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1.],
[ 0., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
-1.]])
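For readers who want to reproduce a target matrix of this kind, here is a minimal sketch with pandas. The column names, the toy values, and the use of NaN to mark "no label assigned" are assumptions for illustration only (three classes instead of 92); only the idea of filling missing labels with -1 comes from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 class columns instead of 92.
# NaN is assumed here to mean "no label assigned for this class".
df = pd.DataFrame({
    "text": ["sample a", "sample b"],
    "class_0": [0.0, 0.0],
    "class_1": [0.0, np.nan],
    "class_2": [np.nan, np.nan],
})
class_cols = ["class_0", "class_1", "class_2"]

# Replace missing labels with -1, as described in the post.
df[class_cols] = df[class_cols].fillna(-1)

# One row per sample, one column per class: shape (n_samples, n_classes).
targets = df[class_cols].to_numpy(dtype=np.float32)
print(targets.shape)
```

With 92 class columns and a batch of samples, the same selection would give the batch_size x 92 matrix described above.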

I tried this:

d_block = DataBlock(blocks=(TextBlock.from_df('text', res_col_name='text', is_lm=False,
                                              tok=MyTokenizer(), rules=[], vocab=vocab), Category),
                    get_x=ColReader('text'), get_y=ColReader(cols_classes),
                    splitter=RandomSplitter(0.2))

But now my targets are in a list of size 92, which obviously isn’t the right format:

x,y = first(dls.train)
print(y)
(#92) [tensor([1., 1., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 1., 1., 1.,
0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0.],

If I use MultiCategoryBlock, my targets are of shape batch_size x 3, which I believe is because of the three labels, -1, 1, 0.
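The batch_size x 3 shape can be illustrated with a plain-Python sketch. This is not fastai's code, only an approximation of what happens when the values in each row are treated as label names rather than as already-encoded targets: the vocabulary becomes the set of distinct values, and each sample is one-hot encoded against it. The toy rows and the helper name one_hot are hypothetical.

```python
# Hypothetical rows: each inner list holds the per-class values of one sample.
rows = [
    [0, 0, -1, -1],
    [1, -1, -1, -1],
]

# If the values are treated as label names, the vocabulary is just the
# set of distinct values seen in the data: -1, 0 and 1.
vocab = sorted({value for row in rows for value in row})

def one_hot(row):
    # Mark which vocabulary entries occur anywhere in this row.
    return [1.0 if label in row else 0.0 for label in vocab]

encoded = [one_hot(row) for row in rows]
print(vocab)    # 3 entries, regardless of how many class columns there are
print(encoded)  # each target has length 3, not one entry per class column
```

Under this reading, the number of class columns never enters the target shape, which matches the observation above.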

I have a working prototype in fastai v1, and it was quite easy to implement. As I said, I just had to pass a list of columns (or classes) to label_from_df. But I’m struggling to do the same in v2. Is there any transform to do that? Something like a ColReader that accepts multiple columns. Maybe there is an easier way to do it, but I can’t see it.

marcossantana · September 27, 2020, 5:21pm

Ok, now I solved it.

Here’s my working datablock and a function to get the targets:

# Return the class columns of a single dataframe row as an integer array
def get_y(r): return r[classes].values.astype(np.int32)

d_block = DataBlock(blocks=(TextBlock.from_df('text', res_col_name='text', is_lm=False,
                                              tok=Tokenizer_V2(), rules=[], vocab=vocab),
                            MultiCategoryBlock(encoded=True, vocab=classes)),
                    get_x=ColReader('text'), get_y=get_y,
                    splitter=ColSplitter('is_valid'))

and the targets of the first mini-batch:

TensorMultiCategory([[-1., -1., -1.,  ..., -1., -1., -1.],
        [-1., -1., -1.,  ..., -1., -1., -1.],
        [-1., -1., -1.,  ..., -1., -1., -1.],
        ...,
        [-1., -1., -1.,  ..., -1., -1., -1.],
        [-1., -1., -1.,  ..., -1., -1., -1.],
        [-1., -1., -1.,  ..., -1., -1., -1.]], device='cuda:0')

After reviewing the multicat notebook, I realised that the get_y function works on single rows of the dataframe. By passing encoded=True and the list of classes as the vocab in MultiCategoryBlock, we get the correct targets for all 92 classes.
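The row-wise behaviour can be checked outside fastai with a small sketch that uses only pandas and NumPy. The toy dataframe and the three class names are assumptions for illustration; get_y is the function from the post. Stacking the per-row results with np.stack stands in for batch collation here and is not how fastai does it internally.

```python
import numpy as np
import pandas as pd

# Hypothetical class columns (3 instead of 92) and toy data.
classes = ["class_0", "class_1", "class_2"]
df = pd.DataFrame({
    "text": ["sample a", "sample b"],
    "class_0": [0, 0],
    "class_1": [0, -1],
    "class_2": [-1, -1],
})

# Same function as in the post: it receives ONE row and returns one vector.
def get_y(r): return r[classes].values.astype(np.int32)

# Apply it row by row, as a per-item getter would be applied.
row_targets = [get_y(row) for _, row in df.iterrows()]

# Stacking the per-row vectors gives a (n_samples, n_classes) matrix.
batch = np.stack(row_targets)
print(batch.shape)
```

Because each row already yields a full vector with one entry per class, the block only needs to be told that the targets are already encoded and which class names they correspond to.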