Lessons Learned: Setting up custom Dataset torchtext

guthl · December 25, 2017, 6:58pm

@rob I generalized your PredictHapinessDataset for any DataFrame.
I could not find a better way to pass the path for the validation and test sets. If anyone has a suggestion for improvement, I would really appreciate it.

import torchtext
from torchtext import data

class DataFrameDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, col, label, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        # path is used as the key that selects a DataFrame from dfs
        for i in range(dfs[path].values[:,1].shape[0]):
            text = dfs[path][col].iloc[i]
            label = dfs[path][label].iloc[i]
            examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, path, col, label, train, validation=None, test=None, **kwargs):
        dfs = {'train': train}

        if validation is not None:
            dfs['validation'] = validation
            has_validation = 'validation'
        else:
            has_validation = None
        if test is not None:
            dfs['test'] = test
            has_test = 'test'
        else:
            has_test = None

        return super().splits(path,
            text_field=text_field, label_field=label_field, col=col, label=label,
            train='train', validation=has_validation, test=has_test, dfs=dfs, **kwargs)

creviera · December 26, 2017, 6:28am

Why are we using path here? If this is for dataframes instead of folders, shouldn’t this just be a key like ‘train’, ‘test’, or ‘valid’? Path makes me think of folders. Could something like this work:

        for key, df in dfs.items():
            for i, row in df.iterrows():
                text = row[col]
                label = row[label]

Would this break when accessing a test dataframe which would not have a label column?
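
A minimal sketch of this concern, using plain pandas and no torchtext: indexing a row by a column name that is absent raises `KeyError`, so the loop above would fail on an unlabeled test dataframe. One possible workaround is to fall back to a placeholder label. The helper name `iter_text_label` and the `None` placeholder are assumptions for illustration only, and whether a torchtext label field tolerates such a placeholder is not checked here.

```python
import pandas as pd

def iter_text_label(df, text_col, label_col, missing_label=None):
    """Yield (text, label) pairs; use missing_label when the label column is absent."""
    has_label = label_col in df.columns
    for _, row in df.iterrows():
        label = row[label_col] if has_label else missing_label
        yield row[text_col], label

# Toy frames: the test frame deliberately has no "toxic" column.
train_df = pd.DataFrame({"comment_text": ["good", "bad"], "toxic": [0, 1]})
test_df = pd.DataFrame({"comment_text": ["unseen"]})

train_pairs = list(iter_text_label(train_df, "comment_text", "toxic"))
test_pairs = list(iter_text_label(test_df, "comment_text", "toxic"))
```

Checking for the column once per dataframe, rather than catching the error per row, keeps the loop body unchanged for labeled data.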

creviera · December 26, 2017, 6:48am

This seems to be working for me:

class DataFrameDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, text_col, label_col, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []
                
        for key, df in dfs.items():
            for i, row in df.iterrows():
                text = row[text_col]
                label = row[label_col]
                examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, path, col, label, train, valid=None, test=None, **kwargs):
        dfs = {'train': train}

        if valid is not None:
            dfs['valid'] = valid
            has_validation = 'valid'
        else:
            has_validation = None
        if test is not None:
            dfs['test'] = test
            has_test = 'test'
        else:
            has_test = None

        return super().splits(path, text_field=text_field, label_field=label_field, text_col=col, label_col=label, train='train', validation=has_validation, test=has_test,  dfs=dfs, **kwargs)

In [72]:

toxicity_label = data.Field(sequential=False)

splits = DataFrameDataset.splits(
    text_field=TEXT, 
    label_field=toxicity_label,
    path=PATH,
    col="comment_text",
    label="toxic",
    train=training_dataframe, 
    valid=validation_dataframe)

In [73]:

splits[0].examples[16]
t = splits[0].examples[16]
t.label, ' '.join(t.text[:16])

Out[73]:

('0',
 'lets all kaikolas not to worry what mudaliar says he himself knows devadasis are from vellala')

But I verified that it breaks when passing a test dataframe which doesn’t have labels. Also, I’m using this on a dataset that has multiple labels (toxic:1, threat:0, insult:1, etc.), so I’m wondering how I will need to modify this to work with multiple labels.

I’m thinking that maybe I need to pass in an array of labels and then maybe for now I’ll just add the values (0 or 1) up to get the ultimate toxicity score. Idk… just a thought. Anyone else working on the toxicity kaggle?
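
The "add the values up" idea can be sketched in plain pandas before any torchtext code is involved. The column list, the helper name `add_toxicity_score`, and the `toxicity_score` column are assumptions for illustration; only `toxic`, `threat`, and `insult` are named in this thread.

```python
import pandas as pd

LABEL_COLS = ["toxic", "threat", "insult"]  # assumed subset of the label columns

def add_toxicity_score(df, label_cols, score_col="toxicity_score"):
    """Return a copy of df with the row-wise sum of the binary label columns."""
    out = df.copy()
    out[score_col] = out[label_cols].sum(axis=1)
    return out

df = pd.DataFrame({
    "comment_text": ["first comment", "second comment"],
    "toxic": [1, 0],
    "threat": [0, 0],
    "insult": [1, 0],
})
scored = add_toxicity_score(df, LABEL_COLS)
```

Note that a summed score discards which labels fired, so it turns the multi-label problem into a single ordinal target rather than solving it.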

guthl · December 26, 2017, 7:51am

Path is used because that is what TorchText expects. Since TorchText works with text files, what @rob did was use the path as a key in a dictionary and look up the matching value.
You can actually get rid of the path argument in DataFrameDataset.splits:

class DataFrameDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, col, gt, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        for i in range(dfs[path].values[:,1].shape[0]):
            text = dfs[path][col].iloc[i]
            label = dfs[path][gt].iloc[i]
            examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, col, label, train, validation=None, test=None, **kwargs):
        dfs = {'train': train}
        if validation is not None:
            dfs['validation'] = validation
            has_validation = 'validation'
        else:
            has_validation = None
        if test is not None:
            dfs['test'] = test
            has_test = 'test'
        else:
            has_test = None
                
        return super().splits('',
            text_field=text_field, label_field=label_field, col=col, gt=label, 
                              train='train', validation=has_validation, test=has_test,  dfs=dfs, **kwargs)
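
Why the empty string works, as a sketch: as far as I understand the base `Dataset.splits` in the torchtext version used here, it joins the root path with each split name and passes the result to `__init__` as `path`. That behaviour is an assumption about the library, and `resolve_split_key` below is a hypothetical stand-in, not a torchtext function.

```python
import os

def resolve_split_key(path, split_name):
    """Mimic the assumed join of the root path and the split name."""
    return os.path.join(path, split_name)

# Stand-in values; in the real code these would be DataFrames.
dfs = {"train": "train dataframe", "validation": "validation dataframe"}

key_with_empty_root = resolve_split_key("", "train")
key_with_real_root = resolve_split_key("data", "train")

found_with_empty_root = key_with_empty_root in dfs
found_with_real_root = key_with_real_root in dfs
```

Under that assumption, an empty root leaves the split name unchanged so `dfs[path]` finds its entry, while any non-empty root produces a key that is not in `dfs`.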

For the second point, I did not dive deep enough into what TorchText expects to check that. Obviously, this piece of code would break on a test dataframe without a label column.

I’ll check the behavior of your code a bit later. I’m just under the impression that there might be a problem. I’ll keep you posted.

guthl · December 26, 2017, 9:12am

I’m under the impression that each of your splits contains both the train and the validation data. Since I don’t fully understand how TorchText operates, I have not yet counted the items in each split to check this hypothesis.
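
One way to test this hypothesis without torchtext is to reproduce the two loop shapes from this thread on toy dataframes and count the rows each one collects. The function names and toy data are assumptions for illustration only.

```python
import pandas as pd

def rows_from_all_frames(dfs, key):
    """Loop over every dataframe and ignore key, as in the loop over dfs.items()."""
    return [row["comment_text"] for _, df in dfs.items() for _, row in df.iterrows()]

def rows_from_one_frame(dfs, key):
    """Read only the dataframe stored under key."""
    return [row["comment_text"] for _, row in dfs[key].iterrows()]

dfs = {
    "train": pd.DataFrame({"comment_text": ["a", "b", "c"]}),
    "valid": pd.DataFrame({"comment_text": ["d"]}),
}

# Number of rows each approach would collect for each split.
all_counts = {k: len(rows_from_all_frames(dfs, k)) for k in dfs}
one_counts = {k: len(rows_from_one_frame(dfs, k)) for k in dfs}
```

On the real data, comparing `len(splits[0].examples)` and `len(splits[1].examples)` with the lengths of the two dataframes should show which case applies.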

Even · February 23, 2018, 6:59am

Thanks so much for this. I was struggling to get pytorch splits from csv/tsv files using a number of examples, but the end result kept failing when I tried to fit the model because of strange EOF issues.

varoon · March 31, 2018, 4:57am

Hi @creviera, could you post how you solved the multi-label issue? How did you handle multiple labels? I’m also working on the Toxic Comment Challenge on kaggle and would like to use fast.ai’s approach. I’m curious how you went about it… I’m stuck at getting the splits.

creviera · April 3, 2018, 2:10am

Hi @varoon . So… I spent oh-so-much time trying to get this to work with fastAi and torchtext. I have a fastAi branch which I will push to my toxic classifier repo soon, but you’ll notice if you click on the link and view my toxic classifier notebook that I switched to scikit-learn. This is because I had so much trouble getting multilabel classification working using torchtext. It was also tremendously slow compared to using other libraries.

You can see that this question was moved to the pytorch forums here: https://discuss.pytorch.org/t/how-to-do-multi-label-classification-with-torchtext/11571.

Also, today during class Jeremy mentioned that we should now use fastAi.text instead of fastAi.nlp. You can see a good example of the new way using fastai.text here: https://github.com/fastai/fastai/blob/1f35c0259c3c5e1a9f1742c691a7d53ddd099650/courses/dl2/imdb.ipynb. This is the old way, using fastAi.nlp, which is what I was looking at originally: https://github.com/fastai/fastai/blob/6cba68b8e21853c7a5992afc62b822f1b33464cc/courses/dl1/lang_model-arxiv.ipynb.

If you want to see a kaggle-winning way of making a toxicity classifier, take a look at @sermakarevich’s notebook: https://github.com/sermakarevich/jigsaw-toxic-comment-classification-challenge. Also Jeremy’s notebook: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline. Neither uses fastAi, I think because fastai.text didn’t exist or wasn’t updated yet (I’m looking forward to taking a look to see how it solved the multilabel problem).

Also if you want to understand torchtext more, read this: http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext.

creviera · April 3, 2018, 2:18am

Or maybe not “kaggle-winning” but in the top 1%