Lessons Learned: Setting up a custom Dataset in torchtext

Yes, please modify your PR as you see fit to make an API that you think is reasonably clean. Don’t feel the need to keep my crappy API in place!

1 Like

haha … yah ok, I didn’t want to step on anyone’s toes.

Will do.

I’m going to really be pushing the use of DataFrames as data sources where possible … they are just so malleable and make it trivial to do NLP with all kinds of underlying formats (text, HTML, CSV, etc.).

1 Like

Yeah I’ve been starting to feel the same way.

Done.

Regarding DataFrames, treating them as an interface allows for cleaner, more concise code vs. writing a bunch of somewhat redundant code (that we’d have to maintain).

Essentially what you’re saying is, “I don’t care what kind of data you have, put it into a DataFrame and we’ll do the rest”.

It’s beautiful.
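In practice that interface is tiny. A minimal sketch of the idea, assuming pandas (the file names are just examples):

import pandas as pd

# Whatever the underlying format, everything ends up as a DataFrame with
# a text column and a label column; the NLP code only sees the DataFrame.
df_csv = pd.read_csv('reviews.csv')        # CSV source
df_json = pd.read_json('reviews.json')     # JSON source
df_html = pd.read_html('reviews.html')[0]  # first table on an HTML page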

2 Likes

Love this idea. Btw, there’s actually a Wiki here for fastai library “feature requests”, which is basically a thread to keep track of all the ongoing stuff being added to fastai. Feel free to post this in there… I already have from_df and texts_from_df on the list, both of which I was planning on doing PRs for, but it sounds like you would definitely be able to implement this much more quickly/efficiently than I could :slight_smile:

1 Like

I definitely think that DataFrames as a data source should be part of the library.

By the way, as I’m trying to use LanguageModelData for my sequence problem, two problems came up.
The first one is the assumption that the text will be English (spacy_tok uses spacy_en and has no way to specify another language or, in my case, a completely new language).
The second is a smaller one: ConcatTextDataset assumes that the files will have a dot in their filename. As I used csplit from bash to cut my files, it took me a while to understand my error. I think a docstring should be added and the error message should be more specific.
Lastly, a progress bar for the loading could be very useful. As my data is really big, it looks frozen and I don’t know where I stand.
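For the first point, here is a minimal sketch of what a language-configurable tokenizer could look like (this is not the fastai API; the language names are just examples, and the corresponding spacy model has to be installed):

import spacy

def make_tokenizer(lang='en'):
    # spacy.load takes a language/model name, so the English-only
    # assumption disappears once callers can pass their own.
    nlp = spacy.load(lang)
    return lambda text: [tok.text for tok in nlp.tokenizer(text)]

fr_tok = make_tokenizer('fr')  # e.g. French instead of English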

As I have never contributed to an open source project, I don’t really know how to proceed from these observations.

I’ve merged the changes - thanks @wgpubs! If there are notebooks that need changes, maybe you could push those changes too, if you happen to have time?

I was getting an error from the last set of changes, so I submitted a PR.

1 Like

Hi @KevinB, I have a few comments/questions about the happiness dataset and using NLP in fastai.

(1) How did you specify your test set and get its predictions for the happiness dataset?

(2) I had to modify TextData.from_splits() in nlp.py to the following, because previously it wasn’t expecting a test set:

trn_iter, val_iter, test_iter = torchtext.data.BucketIterator.splits(splits, batch_size=bs)
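For reference, BucketIterator.splits returns one iterator per dataset it is given, so a version that tolerates a missing test set could unpack defensively (a sketch, not the actual nlp.py code):

# BucketIterator.splits returns one iterator per dataset in `splits`,
# so handle the optional third (test) iterator explicitly.
iters = torchtext.data.BucketIterator.splits(splits, batch_size=bs)
trn_iter, val_iter = iters[0], iters[1]
test_iter = iters[2] if len(iters) > 2 else None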

(3) FWIW, this is how I set things up using DataFrames, after a decent amount of confusion. I’m still not sure this is correct.

import pickle
import torchtext
from torchtext import data

class PredictHappinessDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, dfs, **kwargs):
        # `path` is (ab)used as a key into the `dfs` dict, since
        # torchtext's Dataset.splits machinery expects file paths.
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        for i in range(len(dfs[path])):
            text = dfs[path].Description.iloc[i]
            label = dfs[path].Is_Response.iloc[i]
            examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, path,
               train, val, test, dfs, **kwargs):
        return super().splits(path,
            text_field=text_field, label_field=label_field,
            train=train, validation=val, test=test, dfs=dfs, **kwargs)

dfs = {'train': train_df, 'val': val_df, 'test': test_df}
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl', 'rb'))
LABEL = data.Field(sequential=False)
splits = PredictHappinessDataset.splits(TEXT, LABEL, '',
                                        train='train',
                                        val='val', test='test', dfs=dfs)
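From there, the splits can be handed to fastai the same way the course notebooks do (a sketch; bs is whatever batch size you’re using):

# Build a fastai (0.7-era) model-data object straight from the torchtext splits.
md = TextData.from_splits(PATH, splits, bs)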
1 Like

Did it work?

Yes it did!

1 Like

@rob I generalized your PredictHappinessDataset for any DataFrame.
I could not find a better way to pass the path for the validation and test sets. If someone has a suggestion for improvement, I would really appreciate it.

class DataFrameDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, col, label, dfs, **kwargs):
        # `path` is the dict key ('train'/'validation'/'test') selecting
        # which DataFrame to read; `col` and `label` name its columns.
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        for i in range(len(dfs[path])):
            text = dfs[path][col].iloc[i]
            lbl = dfs[path][label].iloc[i]  # don't shadow the `label` argument
            examples.append(data.Example.fromlist([text, lbl], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, path, col, label, train, validation=None, test=None, **kwargs):
        # Wrap the raw DataFrames in a dict so torchtext's path-based
        # splits machinery can look each one up by name.
        dfs = {'train': train}
        if validation is not None:
            dfs['validation'] = validation
            has_validation = 'validation'
        else:
            has_validation = None
        if test is not None:
            dfs['test'] = test
            has_test = 'test'
        else:
            has_test = None

        return super().splits(path,
            text_field=text_field, label_field=label_field, col=col, label=label,
            train='train', validation=has_validation, test=has_test, dfs=dfs, **kwargs)
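Usage then looks something like this (a sketch; the DataFrames and column names are the ones from the happiness example above):

# Pass the raw DataFrames plus the text/label column names; the empty
# path string keeps torchtext's os.path.join from mangling the dict keys.
splits = DataFrameDataset.splits(
    TEXT, LABEL, '', col='Description', label='Is_Response',
    train=train_df, validation=val_df, test=test_df)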
1 Like

Why are we using path here? If this is for DataFrames instead of folders, shouldn’t this just be a key like ‘train’, ‘test’, and ‘valid’? Path makes me think of folders. Could something like this work:

        for key, df in dfs.items():
            for i, row in df.iterrows():
                text = row[col]
                lbl = row[label]  # renamed so the `label` argument isn't shadowed

Would this break when accessing a test DataFrame which would not have a label column?

This seems to be working for me:

class DataFrameDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, text_col, label_col, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []

        # Note: this pools the rows of *every* DataFrame in dfs into the
        # Dataset being built, regardless of which split it is.
        for key, df in dfs.items():
            for i, row in df.iterrows():
                text = row[text_col]
                label = row[label_col]
                examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, path, col, label, train, valid=None, test=None, **kwargs):
        dfs = {'train': train}

        if valid is not None:
            dfs['valid'] = valid
            has_validation = 'valid'
        else:
            has_validation = None
        if test is not None:
            dfs['test'] = test
            has_test = 'test'
        else:
            has_test = None

        return super().splits(path, text_field=text_field, label_field=label_field,
                              text_col=col, label_col=label, train='train',
                              validation=has_validation, test=has_test, dfs=dfs, **kwargs)

toxicity_label = data.Field(sequential=False)

splits = DataFrameDataset.splits(
    text_field=TEXT,
    label_field=toxicity_label,
    path=PATH,
    col="comment_text",
    label="toxic",
    train=training_dataframe,
    valid=validation_dataframe)

t = splits[0].examples[16]
t.label, ' '.join(t.text[:16])

which outputs:

('0',
 'lets all kaikolas not to worry what mudaliar says he himself knows devadasis are from vellala')

But I verified that it breaks when passing a test DataFrame which doesn’t have labels. Also, since I’m using this on a dataset that has multiple labels (toxic: 1, threat: 0, insult: 1, etc.), I’m wondering how I will need to modify this to work with multiple labels.

I’m thinking that maybe I need to pass in an array of labels, and then maybe for now I’ll just add the values (0 or 1) up to get an overall toxicity score. Idk… just a thought. Anyone else working on the toxic comment Kaggle competition?
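That summing idea is a one-liner in pandas, for what it’s worth (a sketch; the column names are the ones from the Kaggle toxic-comment data, and collapsing to a single score does throw information away):

# Collapse the six 0/1 label columns into one overall toxicity score,
# then train on it as a single label with the DataFrameDataset above.
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat',
              'insult', 'identity_hate']
training_dataframe['toxicity_score'] = training_dataframe[label_cols].sum(axis=1)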

1 Like

Path is used because of what TorchText is expecting. Since TorchText normally works with text files, what @rob did was use the path as a key into a dictionary of DataFrames and iterate over the corresponding values.
You can actually get rid of the path in the DataFrameDataset:

class DataFrameDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, col, gt, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        for i in range(dfs[path].values[:,1].shape[0]):
            text = dfs[path][col].iloc[i]
            label = dfs[path][gt].iloc[i]
            examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, col, label, train, validation=None, test=None, **kwargs):
        dfs = {'train': train}
        if validation is not None:
            dfs['validation'] = validation
            has_validation = 'validation'
        else:
            has_validation = None
        if test is not None:
            dfs['test'] = test
            has_test = 'test'
        else:
            has_test = None
                
        # Empty path: the split names ('train', 'validation', 'test') are
        # passed through unchanged and used as keys into dfs.
        return super().splits('',
            text_field=text_field, label_field=label_field, col=col, gt=label,
            train='train', validation=has_validation, test=has_test, dfs=dfs,
            **kwargs)

For the second point, I have not dug into TorchText’s expectations enough to check, but yes, this piece of code would break on a DataFrame with no label column.
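One possible way around it (a sketch, untested; names like test_df are placeholders) is to build the test examples from the text column alone, so no label column is ever touched:

# Build label-free Examples for an unlabelled test DataFrame;
# Example.fromlist pairs the data positionally with the fields given.
test_fields = [("text", text_field)]
test_examples = [data.Example.fromlist([row[col]], test_fields)
                 for _, row in test_df.iterrows()]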

I’ll check the behavior of your code a bit later. I’m just under the impression that there might be a problem. I’ll keep you posted.

I’m under the impression that your splits contain both the train and validation data, for both the train split and the validation split. Not fully understanding how TorchText operates, I haven’t worked out how many items there are in each split to check this hypothesis.
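A quick way to check (torchtext Datasets support len(), so the split sizes can be compared against the original DataFrames):

# If each split were built only from its own DataFrame, these numbers
# should match pairwise rather than each equalling the combined total.
print(len(splits[0]), len(splits[1]))
print(len(training_dataframe), len(validation_dataframe))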

Thanks so much for this. I was struggling to get PyTorch splits from csv/tsv files using a number of examples, but the end result kept failing with strange EOF issues when I tried to fit the model.

Hi @creviera, could you post how you solved the multi-label issue? How did you handle multiple labels? I’m also working on the Toxic Comment Challenge on Kaggle and would like to use the fast.ai approach. Curious how you went about doing it… I’m stuck at getting the splits.

Hi @varoon. So… I spent oh-so-much time trying to get this to work with fastai and torchtext. I have a fastai branch which I will push to my toxic classifier repo soon, but you’ll notice, if you click the link and view my toxic classifier notebook, that I switched to scikit-learn. This is because I had so much trouble getting multi-label classification working using torchtext. It was also tremendously slow compared to other libraries.

You can see that this question was moved to the pytorch forums here: https://discuss.pytorch.org/t/how-to-do-multi-label-classification-with-torchtext/11571.

Also, today during class Jeremy mentioned that we should now use fastai.text instead of fastai.nlp. You can see a good example of the new way, using fastai.text, here: https://github.com/fastai/fastai/blob/1f35c0259c3c5e1a9f1742c691a7d53ddd099650/courses/dl2/imdb.ipynb. This is the old way, using fastai.nlp, which is what I was looking at originally: https://github.com/fastai/fastai/blob/6cba68b8e21853c7a5992afc62b822f1b33464cc/courses/dl1/lang_model-arxiv.ipynb.

If you want to see a Kaggle-winning way of building a toxicity classifier, take a look at @sermakarevich’s notebook: https://github.com/sermakarevich/jigsaw-toxic-comment-classification-challenge. Also Jeremy’s notebook: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline. Neither uses fastai, I think because fastai.text didn’t exist or wasn’t updated yet (I’m looking forward to seeing how it solves the multi-label problem).

Also, if you want to understand torchtext better, read this: http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext.

2 Likes

Or maybe not “Kaggle-winning”, but in the top 1%.