Lessons Learned: Setting up custom Dataset torchtext


(WG) #9

You might want to check out a pull request I just made to the fastai repo.

I’m not sure it does exactly what you are trying to do, but I added two new classes to nlp.py that allow you to build a LanguageModelData object from dataframes instead of multiple text files. Take a look. Not sure if it will get accepted, but it’s working for me on the spooky author dataset.

The two classes I’m proposing are:

ConcatTextDatasetFromDataFrames(torchtext.data.Dataset)

… and

LanguageModelDataFromDataFrames()

Works just like the lesson-4-imdb notebook but with dataframes.

@jeremy I coded things so as not to break anything, but may I suggest modifying the LanguageModelData class to simply expose class methods, from_dfs and from_text_files, to build the ModelData object?
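To make the suggestion concrete, here is a minimal sketch of the alternate-constructor pattern. It does not use fastai or torchtext: `ToyLanguageModelData` and its `texts` attribute are hypothetical stand-ins for LanguageModelData, and the DataFrames are assumed to have been converted to row dictionaries first (for example with pandas' `DataFrame.to_dict("records")`).

```python
from pathlib import Path
from typing import Dict, Iterable, List


class ToyLanguageModelData:
    """Hypothetical stand-in for LanguageModelData; it only stores raw texts."""

    def __init__(self, texts: List[str]):
        self.texts = texts

    @classmethod
    def from_text_files(cls, paths: Iterable[Path]) -> "ToyLanguageModelData":
        # One document per file, read as UTF-8.
        return cls([Path(p).read_text(encoding="utf-8") for p in paths])

    @classmethod
    def from_dfs(cls, rows: Iterable[Dict[str, str]], text_col: str) -> "ToyLanguageModelData":
        # `rows` is assumed to be a list of row dicts, one per DataFrame row.
        return cls([row[text_col] for row in rows])


rows = [{"text": "first document"}, {"text": "second document"}]
md = ToyLanguageModelData.from_dfs(rows, text_col="text")
print(md.texts)
```

Both entry points end in the same `__init__`, so the rest of the class does not need to know where the text came from.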


(Jeremy Howard (Admin)) #10

Yes, please modify your PR as you see fit to make an API that you think is reasonably clean. Don’t feel the need to keep my crappy API in place!


(WG) #11

haha … yah ok, I didn’t want to step on anyone’s toes.

Will do.

I’m going to really be pushing the use of DataFrames as data sources where possible … they are just so malleable and make it trivial to do NLP with all kinds of underlying formats (text, html, csv, etc.).


(Jeremy Howard (Admin)) #12

Yeah I’ve been starting to feel the same way.


(WG) #13

Done.

Regarding DataFrames, treating them as an interface allows for cleaner and more concise code vs. writing a bunch of somewhat redundant code (that we’d then have to maintain).

Essentially what you’re saying is, “I don’t care what kind of data you have, put it into a DataFrame and we’ll do the rest”.

It’s beautiful.


(James Requa) #14

Love this idea. Btw there’s actually a Wiki here for fastai library “feature requests”, which is basically a thread to keep track of all the ongoing stuff being added to fastai. Feel free to post this in there… I already have from_df and texts_from_df on the list, both of which I was planning on doing PRs for, but it sounds like you would definitely be able to implement this much more quickly and efficiently than I could :)


(Louis Guthmann) #15

I definitely think that DataFrames as a data source should be part of the library.

By the way, as I’m trying to use LanguageModelData for my sequence problem, two problems occurred.
The first one is the assumption that the text will be English (spacy_tok uses spacy_en, and there is no way to specify another language or, in my case, a completely new language).
The second is a smaller one: ConcatTextDataset assumes that the files will have a dot in their filename. As I used csplit from bash to cut my files, it took me a while to understand my error. I think a docstring should be added and the error message should be more specific.
Lastly, a progress bar for the loading could be very useful. As my data is really big, it keeps freezing and I don’t know where I stand.
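A rough sketch of what the last two points could look like. This is not the fastai code: `list_text_files` and `read_with_progress` are hypothetical helpers. The idea is only to accept files with or without an extension, fail with a message that names the directory, and report progress while reading.

```python
import sys
import tempfile
from pathlib import Path
from typing import List


def list_text_files(path: str) -> List[Path]:
    """Return the files to read: `path` itself, or every regular file in it."""
    p = Path(path)
    if p.is_file():
        return [p]
    if not p.is_dir():
        raise FileNotFoundError(f"{path!r} is neither a file nor a directory")
    # No filtering on a pattern such as '*.*', so names without a dot
    # (e.g. csplit's default xx00, xx01, ...) are picked up too.
    files = sorted(f for f in p.iterdir() if f.is_file())
    if not files:
        raise ValueError(f"no files found in directory {path!r}")
    return files


def read_with_progress(files: List[Path], every: int = 1) -> List[str]:
    """Read each file, printing a simple counter to stderr as it goes."""
    texts = []
    for n, f in enumerate(files, start=1):
        texts.append(f.read_text(encoding="utf-8"))
        if n % every == 0 or n == len(files):
            print(f"read {n}/{len(files)} files", file=sys.stderr)
    return texts


# Tiny demo with csplit-style names that have no extension.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("xx00", "xx01"):
        (Path(tmp) / name).write_text("some text\n", encoding="utf-8")
    found = list_text_files(tmp)
    texts = read_with_progress(found)
    names = [f.name for f in found]
```

For very large datasets a dedicated progress-bar library would probably be nicer than the plain counter used here.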

As I have never contributed to an open source project, I don’t really know how to proceed from these observations.


(Jeremy Howard (Admin)) #16

I’ve merged the changes - thanks @wgpubs! If there are notebooks that need changes, maybe you could push those changes too, if you happen to have time?


(Rob H) #17

I was getting an error from the last set of changes, so I submitted a PR.


(Rob H) #18

Hi @KevinB, I have a few comments/questions about the happiness dataset and using NLP in fastai.

(1) How did you specify your test set and get its predictions for the happiness dataset?

(2) I had to modify TextData.from_splits() in nlp.py to the following, because previously it wasn’t expecting a test set:

trn_iter,val_iter,test_iter = torchtext.data.BucketIterator.splits(splits, batch_size=bs)

(3) FWIW, this is how I set things up using dataframes, after a decent amount of confusion. I’m still not sure this is correct.

import pickle
import torchtext
from torchtext import data

class PredictHappinessDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        df = dfs[path]  # `path` is really the split name, used as a key into dfs
        for i in range(len(df)):
            # .iloc is positional, so this also works when the index is not 0..n-1
            text = df.Description.iloc[i]
            label = df.Is_Response.iloc[i]
            examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, path,
               train, val, test, dfs, **kwargs):
        return super().splits(path,
            text_field=text_field, label_field=label_field,
            train=train, validation=val, test=test, dfs=dfs, **kwargs)

df = {'train': train_df, 'val': val_df, 'test': test_df}
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))
LABEL = data.Field(sequential=False)
splits = PredictHappinessDataset.splits(TEXT, LABEL, '',
                             train='train',
                             val='val', test='test', dfs=df)

(ecdrid) #19

Did it work?


(Rob H) #20

Yes it did!


(Louis Guthmann) #21

@rob I generalized your PredictHappinessDataset for any DataFrame.
I could not find a better way to pass the path for the validation and test sets. If someone has a suggestion for improvement, I would really appreciate it.

class DataFrameDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, col, label, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        df = dfs[path]  # `path` is the split name, used as a key into dfs
        for i in range(len(df)):
            text = df[col].iloc[i]
            # separate name, so the `label` column argument is not overwritten
            label_value = df[label].iloc[i]
            examples.append(data.Example.fromlist([text, label_value], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, path, col, label, train, validation=None, test=None, **kwargs):
        dfs = {'train': train}

        if validation is not None:
            dfs['validation'] = validation
            has_validation = 'validation'
        else:
            has_validation = None
        if test is not None:
            dfs['test'] = test
            has_test = 'test'
        else:
            has_test = None

        return super().splits(path,
            text_field=text_field, label_field=label_field, col=col, label=label,
            train='train', validation=has_validation, test=has_test, dfs=dfs, **kwargs)

(Allie Crevier) #22

Why are we using path here? If this is for dataframes instead of folders shouldn’t this just be a key like ‘train’, ‘test’, and ‘valid’? Path makes me think of folders. Could something like this work:

        for key, df in dfs.items():
            for i, row in df.iterrows():
                text = row[col]
                label = row[label]

Would this break when accessing a test dataframe which would not have a label column?
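One way to make such a loop tolerate a test set without labels is to fall back to a placeholder when the label column is absent. A minimal sketch, assuming each DataFrame has been turned into a list of row dicts (e.g. with `df.to_dict("records")`); `iter_examples` and its `missing_label` default are hypothetical names, and whether torchtext's label Field accepts the placeholder would still need checking.

```python
from typing import Dict, Iterator, List, Optional, Tuple


def iter_examples(
    rows: List[Dict[str, object]],
    text_col: str,
    label_col: str,
    missing_label: Optional[str] = None,
) -> Iterator[Tuple[object, object]]:
    """Yield (text, label) pairs, using `missing_label` when a row has no label."""
    for row in rows:
        # dict.get avoids a KeyError when the label column does not exist.
        yield row[text_col], row.get(label_col, missing_label)


train_rows = [{"Description": "great stay", "Is_Response": "happy"}]
test_rows = [{"Description": "unlabeled review"}]  # no label column

train_pairs = list(iter_examples(train_rows, "Description", "Is_Response"))
test_pairs = list(iter_examples(test_rows, "Description", "Is_Response"))
print(train_pairs, test_pairs)
```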


(Allie Crevier) #23

This seems to be working for me:

class DataFrameDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, text_col, label_col, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []

        for key, df in dfs.items():
            for i, row in df.iterrows():
                text = row[text_col]
                label = row[label_col]
                examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, path, col, label, train, valid=None, test=None, **kwargs):
        dfs = {'train': train}

        if valid is not None:
            dfs['valid'] = valid
            has_validation = 'valid'
        else:
            has_validation = None
        if test is not None:
            dfs['test'] = test
            has_test = 'test'
        else:
            has_test = None

        return super().splits(
            path, text_field=text_field, label_field=label_field,
            text_col=col, label_col=label,
            train='train', validation=has_validation, test=has_test,
            dfs=dfs, **kwargs)

In [72]:

toxicity_label = data.Field(sequential=False)

splits = DataFrameDataset.splits(
    text_field=TEXT, 
    label_field=toxicity_label,
    path=PATH,
    col="comment_text",
    label="toxic",
    train=training_dataframe, 
    valid=validation_dataframe)

In [73]:

splits[0].examples[16]
t = splits[0].examples[16]
t.label, ' '.join(t.text[:16])

Out[73]:

('0',
 'lets all kaikolas not to worry what mudaliar says he himself knows devadasis are from vellala')

But I verified that it breaks when passing a test dataframe which doesn’t have labels. Also, I’m using this on a dataset that has multiple labels (toxic:1, threat:0, insult:1, etc.), so I’m wondering how I will need to modify this to work with multiple labels.

I’m thinking that maybe I need to pass in an array of labels and then maybe for now I’ll just add the values (0 or 1) up to get the ultimate toxicity score. Idk… just a thought. Anyone else working on the toxicity kaggle?
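For what it’s worth, here is a small sketch of both options: keeping the labels as a vector, and collapsing them into a single summed score. It assumes rows as plain dicts (e.g. from `df.to_dict("records")`); only toxic, threat and insult come from the example above, and `label_vector` is a hypothetical helper.

```python
from typing import Dict, List

LABEL_COLS = ["toxic", "threat", "insult"]


def label_vector(row: Dict[str, object], label_cols: List[str]) -> List[int]:
    """Return the 0/1 labels of one row in a fixed column order."""
    return [int(row[c]) for c in label_cols]


row = {"comment_text": "some comment", "toxic": 1, "threat": 0, "insult": 1}

vec = label_vector(row, LABEL_COLS)  # multi-label target, one slot per label
score = sum(vec)                     # single score obtained by adding the labels up
print(vec, score)
```

Summing throws away which labels fired, so the vector form is probably the one to keep if each label has to be predicted separately.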


(Louis Guthmann) #24

Path is used because of what TorchText is expecting. Since TorchText works with text files, what @rob did is to use the path as a key in a dictionary of DataFrames and iterate over the rows of the matching one.
You can actually get rid of the path argument in DataFrameDataset.splits:

class DataFrameDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, col, gt, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        for i in range(dfs[path].values[:,1].shape[0]):
            text = dfs[path][col].iloc[i]
            label = dfs[path][gt].iloc[i]
            examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, col, label, train, validation=None, test=None, **kwargs):
        dfs = {'train': train}
        if validation is not None:
            dfs['validation'] = validation
            has_validation = 'validation'
        else:
            has_validation = None
        if test is not None:
            dfs['test'] = test
            has_test = 'test'
        else:
            has_test = None
                
        return super().splits('',
            text_field=text_field, label_field=label_field, col=col, gt=label, 
                              train='train', validation=has_validation, test=has_test,  dfs=dfs, **kwargs)

For the second point, I did not dive deep enough into what TorchText expects to check that. Obviously, this piece of code would break.

I’ll check the behavior of your code a bit later. I’m just under the impression that there might be a problem. I’ll keep you posted.


(Louis Guthmann) #25

I’m under the impression that each of your splits contains both the train and the validation data, for the train split as well as for the validation split. Not fully understanding how TorchText operates, I have not yet worked out how many items there are in each split in order to check this hypothesis.
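A toy reproduction of the hypothesis, without torchtext. It assumes that `Dataset.splits` calls the constructor once per split, with the split name as `path` and the same `dfs` for every call; `fake_splits`, `AllFramesDataset` and `KeyedDataset` are made-up names for illustration only.

```python
from typing import Dict, List, Optional, Tuple


class AllFramesDataset:
    """Mimics a constructor that loops over every entry in dfs."""

    def __init__(self, path: str, dfs: Dict[str, List[str]]):
        self.examples = [x for rows in dfs.values() for x in rows]


class KeyedDataset:
    """Mimics a constructor that only reads the frame named by path."""

    def __init__(self, path: str, dfs: Dict[str, List[str]]):
        self.examples = list(dfs[path])


def fake_splits(cls, train: str, validation: Optional[str], dfs) -> Tuple:
    # Assumed behaviour: one constructor call per split name, same kwargs.
    names = [n for n in (train, validation) if n is not None]
    return tuple(cls(n, dfs=dfs) for n in names)


dfs = {"train": ["a", "b", "c"], "valid": ["d"]}

all_sizes = [len(s.examples) for s in fake_splits(AllFramesDataset, "train", "valid", dfs)]
keyed_sizes = [len(s.examples) for s in fake_splits(KeyedDataset, "train", "valid", dfs)]
print(all_sizes, keyed_sizes)
```

On the real objects, comparing `len(split.examples)` for each split against the lengths of the original DataFrames would confirm or refute this.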


(Even Oldridge) #26

Thanks so much for this. I was struggling to get pytorch splits from csv/tsv files using a number of examples, but the end result kept failing when I tried to fit the model because of strange EOF issues.


(Varoon Ravi) #28

Hi @creviera, could you post how you solved the multi-label issue? I’m also working on the Toxic Comment Challenge on kaggle and would like to use the fast.ai approach. Curious how you went about doing it… I’m stuck at getting the splits.


(Allie Crevier) #29

Hi @varoon. So… I spent oh-so-much time trying to get this to work with fastai and torchtext. I have a fastai branch which I will push to my toxic classifier repo soon, but you’ll notice if you click on the link and view my toxic classifier notebook that I switched to scikit-learn. This is because I had so much trouble getting multilabel classification working using torchtext. It was also tremendously slow compared to using other libraries.

You can see that this question was moved to the pytorch forums here: https://discuss.pytorch.org/t/how-to-do-multi-label-classification-with-torchtext/11571.

Also today during class Jeremy mentioned that we should now use fastai.text instead of fastai.nlp. You can see a good example of the new way using fastai.text here: https://github.com/fastai/fastai/blob/1f35c0259c3c5e1a9f1742c691a7d53ddd099650/courses/dl2/imdb.ipynb. This is the old way, using fastai.nlp, which is what I was looking at originally: https://github.com/fastai/fastai/blob/6cba68b8e21853c7a5992afc62b822f1b33464cc/courses/dl1/lang_model-arxiv.ipynb.

If you want to see a kaggle-winning way of making a toxicity classifier, take a look at @sermakarevich’s notebook: https://github.com/sermakarevich/jigsaw-toxic-comment-classification-challenge. Also Jeremy’s notebook: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline. Neither is using fastai, I think because fastai.text didn’t exist or wasn’t updated yet (I’m looking forward to taking a look to see how it solved the multilabel problem).

Also if you want to understand torchtext more, read this: http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext.