Creating a Multi-Label torchtext Dataset

Hi! I’m having a little trouble understanding how to create a torchtext dataset that supports multiple labels. Per this thread (Creating a ModelData object without torchtext splits?), I tried creating a custom dataset, but I’m not sure I’m doing the right thing with the labels.

In the code below, I’ve created a separate field and entry for each label. Is this correct, or should I just create one label field and enter a list of labels into each dataset Example along with the text? Right now, the code below isn’t working because TextData.from_splits expects a label field, but I’m wondering if I’m close. Thanks for any pointers! (Note: I’m also pulling the data from a dictionary of dataframes (dfs) rather than from a directory.)

import torchtext
from torchtext import data

class ToxicCommentDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label1_field, label2_field,
                 label3_field, dfs, **kwargs):
        # One (attribute name, Field) pair per column of each Example
        fields = [('text', text_field), ('Label1', label1_field),
                  ('Label2', label2_field), ('Label3', label3_field)]
        examples = []
        for i in range(len(dfs[path])):
            text = dfs[path].comment_text.iloc[i]
            Label1 = Label2 = Label3 = None
            # Guard for dataframes (e.g. the test set) that have no label columns
            if 'Label1' in dfs[path]:
                Label1 = dfs[path].Label1.iloc[i]
                Label2 = dfs[path].Label2.iloc[i]
                Label3 = dfs[path].Label3.iloc[i]
            examples.append(data.Example.fromlist(
                [text, Label1, Label2, Label3], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex):
        return len(ex.text)

    @classmethod
    def splits(cls, path, text_field, label1_field, label2_field,
               label3_field, train, val, test, dfs, **kwargs):
        return super().splits(
            path, text_field=text_field, label1_field=label1_field,
            label2_field=label2_field, label3_field=label3_field,
            train=train, validation=val, test=test, dfs=dfs, **kwargs)

If all the labels are processed in the same way (e.g. if they are all binary 0/1 labels), then you only need to create a single field and reuse it for each label column.
I think it would be easier overall to just write the dataframes to disk as csv files and read them with TabularDataset; see the sketch below.
Do you happen to be working on the toxic comment classification competition on Kaggle? I’ve written a tutorial on using torchtext for text classification here that uses the exact same dataset. I hope it helps!
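Here’s a rough sketch of what I mean, reusing the dfs dictionary from your code above (the column names comment_text/Label1/Label2/Label3 and the csv file names are just placeholders to match your code - adapt them to your data):

from torchtext import data

# Write the dataframes to disk so TabularDataset can read them
dfs['train'].to_csv('train.csv', index=False)
dfs['val'].to_csv('val.csv', index=False)

TEXT = data.Field(sequential=True, lower=True)
# A single field reused for every 0/1 label column;
# use_vocab=False keeps the labels as raw numbers
LABEL = data.Field(sequential=False, use_vocab=False)

# The (name, field) pairs must follow the csv column order
trn, vld = data.TabularDataset.splits(
    path='.', train='train.csv', validation='val.csv',
    format='csv', skip_header=True,
    fields=[('comment_text', TEXT),
            ('Label1', LABEL), ('Label2', LABEL), ('Label3', LABEL)])

TEXT.build_vocab(trn)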


I still haven’t been able to get your notebook to run :(

I get crazy-looking training and validation loss scores. My gut feeling is that there is some other preprocessing step you did that isn’t included in your post, but my scores look like:

Epoch: 0, Training Loss: -44.4812, Validation Loss: -53.3443

And that’s after I changed the loss-calculation code to remove the multiplication by the size of the batch’s first dimension (if I leave that in, the values are even more bizarre … like -9100).

Any ideas what I may be missing?

Hi wgpubs. Sorry to hear that; I’ll try to address the problem. How are you installing torchtext? The current release has some bugs that might cause the errors you’re mentioning, so I recommend installing from the current master branch on GitHub:
$ pip install --upgrade git+https://github.com/pytorch/text
Please tell me if this still doesn’t solve the problem!

Hi @keitakurita, I’ve also tried using your notebook, and while I can get it to run, I hit a strange issue where many of the Examples in my trn dataset end up without a comment_text attribute. Any pointers on whether this is something you had to deal with? It is greatly reducing the number of examples, and therefore the predictive power of the algorithm.

Hi @gcmcalister. Like I mentioned above, the current pip release of torchtext has bugs, so I recommend installing from the current master branch on GitHub:
$ pip install --upgrade git+https://github.com/pytorch/text
Please tell me if this still doesn’t solve the problem!

Hi @keitakurita!

Didn’t realize you were a fellow fastai student when I read your blog post! Nice.

Anyways, figured it out. It looks like you’re calling your loss function with the parameters switched up.

loss = loss_func(y, preds) should be loss = loss_func(preds, y)
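You can see the symptom with a throwaway example (made-up tensors, nothing from the notebook): BCEWithLogitsLoss treats its first argument as raw logits and its second as 0/1 targets, so passing the logits in as “targets” lets the loss go negative.

import torch
import torch.nn as nn

loss_func = nn.BCEWithLogitsLoss()
preds = torch.randn(8, 6) * 5             # raw logits (any real numbers)
y = torch.randint(0, 2, (8, 6)).float()   # 0/1 targets

print(loss_func(preds, y))  # correct order: always non-negative
print(loss_func(y, preds))  # swapped: logits used as targets, can be negative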

Also, curious why you are measuring your running_loss like this: running_loss += loss.data[0] * x.size(0)? Why the multiplier of * x.size(0)? Doesn’t simply setting running_loss += loss.data[0] give you the loss for the mini-batch?

Thanks


Hi @wgpubs! I originally became interested in torchtext through fastai!
Thanks for the pointer, I’ll make sure to fix it ASAP.
I measure running loss like running_loss += loss.data[0] * x.size(0) because the mini-batch size isn’t constant (specifically, the final mini-batch will be smaller unless the dataset size is an exact multiple of the batch size). This means that if I want the average training loss across all training examples, I can’t just average the losses of the mini-batches - I need to weight each one by its mini-batch size.
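In loop form, the bookkeeping looks roughly like this (a minimal sketch - model, opt, loss_func, and train_iter are assumed to already exist, and loss.item() is the newer spelling of loss.data[0]):

running_loss = 0.0
n_seen = 0
for x, y in train_iter:
    preds = model(x)
    loss = loss_func(preds, y)   # predictions first, targets second
    opt.zero_grad()
    loss.backward()
    opt.step()
    # loss is the *mean* over this mini-batch, so weight it by the
    # batch size before summing; the last batch is usually smaller
    running_loss += loss.item() * x.size(0)
    n_seen += x.size(0)

epoch_loss = running_loss / n_seen   # true average over all examples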

Gotcha.

Thanks again for the great writeup. It really helped clarify how torchtext works.


Hi @keitakurita, thanks for the write-up.
I needed to uninstall first:
pip uninstall torchtext
and then install again:
pip install git+https://github.com/pytorch/text
in order to fix the trn dataset examples missing the comment_text attribute.

Hi @jakcycsl! Thanks for pointing that out!
I’ve edited my comments to point to the command
$ pip install --upgrade git+https://github.com/pytorch/text
which should uninstall the current version of torchtext and install the GitHub version in one step.


Hello. I tried to follow your tutorial, but I am getting bad results. I am classifying Yelp comments, and in this case the labels are not 0 and 1; they are 1, 2, 3, 4, 5. Is there something like np_utils.to_categorical for this? Also, which loss function should I use? BCEWithLogitsLoss is giving me: Epoch: 0, Training Loss: -44.4812, Validation Loss: -53.3443.
Thank you for your answer.