Pass explicit labels through a dataframe on ImageDataBunch

Hi,

I am trying to provide custom labels as the targets (output ‘y’) of my model.

To do that, I have a dataframe with one column, image_id, which contains the locations of the image files, and another named c which contains vectors (all of the same size).

I run the following command (TRAIN is the folder where the pictures are):

data = ImageDataBunch.from_df(TRAIN, df[['image_id', 'c']], ds_tfms=tfms, size=sz).normalize()

and what I get is that the vectors are truncated to two elements.
How can I keep my targets in the raw format I provide, while still keeping the useful properties of ImageDataBunch?

Thanks.

Now I see that the data are passed as MultiCategory (with the correct size), and so when I run my custom loss function the size of a target is (BATCH_SIZE, 2) instead of (BATCH_SIZE, VECTOR_SIZE). How can I avoid that?

I use the Learner from https://docs.fast.ai/basic_train.html#Learner.
Also, the MultiCategory labelling has converted my ints to floats.

I’d recommend switching from using ImageDataBunch to the data_block api, as you will have a lot more flexibility defining your dataset. Lesson 3 goes over the data_block api and the first example is for multi-category data.

Using the data_block api instead of having ImageDataBunch infer what you want will probably solve the labeling issue.
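
Something along these lines (a rough sketch only; I'm reusing the column names, TRAIN, tfms and sz from your post, and the validation split is a placeholder) pulls the labels straight from the dataframe:

```python
from fastai.vision import *

# Sketch of a data_block pipeline: labels are read from the 'c' column of the
# dataframe instead of being inferred by ImageDataBunch.
data = (ImageList.from_df(df, TRAIN, cols='image_id')
        .split_by_rand_pct(0.2)
        .label_from_df(cols='c')
        .transform(tfms, size=sz)
        .databunch()
        .normalize(imagenet_stats))
```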

Labels being converted from ints to floats is normal. Many loss functions require the labels to be float or long.

I tried to use that, but I cannot find an easy way to do this.

All the methods for labels are:

and I don't seem to find one that allows you to pass a list of labels.
Also, I can't find a way to disable the categorical treatment of the data.
I want the targets to stay as integer tensors of constant length.
I have also designed a custom loss function (could anyone please point me to the specification for that?).
I want my targets to be given to the loss function as-is.
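
Roughly, this is what I am after (just a sketch of my intent; label_cls=FloatList is only my guess at how to avoid the categorical treatment, and I don't know whether it keeps the values as ints):

```python
from fastai.vision import *

# Sketch of the kind of pipeline I'd like: keep the 'c' column as a plain
# fixed-length vector target (label_cls=FloatList is a guess on my part).
data = (ImageList.from_df(df, TRAIN, cols='image_id')
        .split_by_rand_pct()
        .label_from_df(cols='c', label_cls=FloatList)
        .transform(tfms, size=sz)
        .databunch()
        .normalize())
```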

From my point of view, allowing a user to pass their own targets and their own loss function are both elementary features.
Thanks

I also tried the approach of transforming my labels to multiple categories and then using:

data = (ImageList.from_df(df, PARENT_FOLDER, folder=DATA_FOLDER, suffix='.png')
        .split_by_rand_pct()
        .label_from_df(cols='categories', label_delim=';')
        .transform(tfms, size=sz).databunch().normalize())

to create my dataset of ‘multi-categorical data’.
But after creating it and plotting a batch with show_batch, I see that the order of the categories is not respected and the labels get mixed up.

You can. I do it all the time.

From your next post it appears you figured out how to pass in multi-category data using the data_block api. Custom loss is as simple as setting loss_func in the learner.
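
For example (a minimal sketch; my_loss, model and data are placeholders):

```python
import torch.nn.functional as F
from fastai.basic_train import Learner

def my_loss(output, target):
    # any callable taking (model output, target) and returning a scalar tensor works
    return F.cross_entropy(output, target.long())

learn = Learner(data, model, loss_func=my_loss)
```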

I’m not sure what you mean by this. show_batch is going to return a random labeled batch from the training set.

The question is how? Can you maybe help me learn how to do it?

I know that (I’ve read the documentation), but as I said I didn’t find a specification for this function.
You can have a look at my whole model:

from functools import partial

import torch
import torch.nn as nn
from fastai.basic_train import Learner
from fastai.metrics import accuracy

class MyModel(nn.Module):
    def __init__(self, na, nb, nc):
        super().__init__()
        # pretrained DenseNet backbone; its 1000-dim output feeds three linear heads
        self.ma = torch.hub.load('pytorch/vision:v0.5.0', 'densenet121', pretrained=True)
        self.ma.eval()
        self.gr = nn.Sequential(nn.Linear(1000, na))
        self.vd = nn.Sequential(nn.Linear(1000, nb))
        self.cd = nn.Sequential(nn.Linear(1000, nc))

    def forward(self, x):
        x = self.ma(x)
        # concatenate the three heads' logits along the feature dimension
        a = (self.gr(x), self.vd(x), self.cd(x))
        return torch.cat(a, dim=1)

def apply_to_three(a, b, s1, s2, s3, f, agg):
    # debug: inspect the shapes seen by the loss
    print(a.shape)
    print(b.shape)
    # split predictions and targets into the three blocks and aggregate the per-block losses
    return agg([f(a[:, :s1], b[:, :s1]),
                f(a[:, s1:s2], b[:, s1:s2]),
                f(a[:, s2:s3], b[:, s2:s3])])

s1 = na
s2 = s1 + nb
s3 = s2 + nc

# note: the loss must be an instance, nn.CrossEntropyLoss(), not the class itself
loss_function = partial(apply_to_three, f=nn.CrossEntropyLoss(), s1=s1, s2=s2, s3=s3, agg=sum)

# Create learner
learner = Learner(data, model=MyModel(na, nb, nc), loss_func=loss_function, metrics=accuracy)

I know that you could maybe do the same with a cnn_learner, but I would like to both learn by building it myself and extend it.

Say you have three categories with sizes 167, 8, and 5.
Then for one picture I get 10;23;1, for the next 123;1;2, and for another one just 5;8.
Maybe it's a problem with show_batch itself(?)

PS: I am new to fastai and PyTorch, but not at all new to the concepts of deep learning and programming in Python.

You are working on the Kaggle Bengali.ai competition, correct? And thus are working with a multi-task problem, not a multi-category problem?

Fast.ai doesn’t natively support multi-task learning, but it can be done. Check out this kernel for one method https://www.kaggle.com/iafoss/grapheme-fast-ai-starter-lb-0-964

Fast.ai v2 appears to have better out-of-the-box support for multi-task learning. Here's an example of how to set it up: https://www.kaggle.com/mnpinto/bengali-ai-fastai2-starter-lb0-9598

Same specification as PyTorch loss functions: https://pytorch.org/docs/stable/nn.html#loss-functions
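
In other words, any callable that takes the model output and the target and returns a scalar tensor. For your three-head model it could look roughly like this (a sketch only; I'm assuming the targets arrive as a (batch, 3) tensor of class indices, which I can't verify from your post):

```python
from functools import partial
import torch.nn.functional as F

def multi_head_loss(output, target, na, nb, nc):
    # output: (batch, na+nb+nc) concatenated logits, target: (batch, 3) class indices
    t = target.long()
    return (F.cross_entropy(output[:, :na],          t[:, 0])
            + F.cross_entropy(output[:, na:na + nb], t[:, 1])
            + F.cross_entropy(output[:, na + nb:],   t[:, 2]))

loss_function = partial(multi_head_loss, na=na, nb=nb, nc=nc)
```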

If you pass a single label column with label_delim, fastai will one-hot-encode the labels. I don't have enough information to know exactly what's happening, but I'd guess the issue is with how the data is set up.
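
To illustrate why the order looks mixed: a one-hot encoding is built against one fixed vocabulary shared by all labels, so the per-image ordering of something like 10;23;1 cannot survive (plain Python below, not fastai internals):

```python
# the union of all label tokens becomes a single vocabulary
vocab = sorted({'1', '2', '5', '8', '10', '23', '123'})
labels = '10;23;1'.split(';')
one_hot = [1.0 if v in labels else 0.0 for v in vocab]

# reading back the "on" classes returns them in vocabulary order, and the '1'
# that belonged to the third slot is indistinguishable from a '1' in any other slot
decoded = [v for v, on in zip(vocab, one_hot) if on]
print(decoded)  # ['1', '10', '23'], not the original 10;23;1 order
```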
