Lessons Learned: Setting up custom Dataset torchtext


(Kevin Bird) #1

I’m posting my code below, but first I wanted to share a few lessons I’ve learned. The first, and it is extremely important, is to put only the text field and the label field in the fields variable. Other functions use fields, and extra columns will make them error out. The second is that you don’t necessarily need to have everything in separate files to make this work. In this instance I did because I didn’t understand how things worked, but if I were to do this again, I would just pull the data directly from a DataFrame to create my examples. Another reason I wanted to create this thread is to hear from other people who are running into similar problems or have found similar solutions.

import os
from glob import iglob

import torchtext
from torchtext import data

class PredictHappinessDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, **kwargs):
        # Only the text and label fields go here; extra columns cause errors downstream.
        fields = [("Description", text_field), ("Is_Response", label_field)]
        examples = []
        # Expects one sub-directory per label, each holding one .txt file per example.
        for label in ['happy', 'not_happy']:
            for fname in iglob(os.path.join(path, label, '*.txt')):
                with open(fname, 'r') as f: text = f.readline()
                examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.Description)

    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train', test='test', **kwargs):
        return super().splits(
            root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)

(Jeremy Howard (Admin)) #2

Can you describe this in more detail? I’m not quite following… (sorry I’m a little slow!)


(Kevin Bird) #3

Yeah, so if I was doing things over, I wouldn’t save everything into a separate file and then pull it in for examples like this:

        for label in ['happy', 'not_happy']:
            for fname in iglob(os.path.join(path, label, '*.txt')):
                with open(fname, 'r') as f: text = f.readline()
                examples.append(data.Example.fromlist([text, label], fields))

Instead I would do something closer to this:

        # trn is a pandas DataFrame; zip walks both columns by position,
        # so it also works when the index is not 0..n-1
        for text, label in zip(trn.Description, trn.Is_Response):
            examples.append(data.Example.fromlist([text, label], fields))

I haven’t tested this yet, but I’m going through now to see if it actually works.


(Arvind Nagaraj) #4

Also, when you open files, it is better to pass the character encoding of your dataset explicitly. For instance, if your input text is UTF-8 encoded, you could add the parameter encoding='utf-8'.
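
As a minimal sketch of that suggestion, the snippet below writes a small UTF-8 file and reads its first line back with the encoding passed explicitly. The helper name read_first_line and the sample text are made up for illustration; the only real API relied on is the encoding parameter of the built-in open.

```python
import os
import tempfile


def read_first_line(fname, encoding="utf-8"):
    """Return the first line of a text file, decoded with an explicit encoding."""
    with open(fname, "r", encoding=encoding) as f:
        return f.readline().rstrip("\n")


# Demo: write a file containing a non-ASCII character, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "review.txt")
    with open(fname, "w", encoding="utf-8") as f:
        f.write("Café was lovely\nsecond line\n")
    first_line = read_first_line(fname)
```

Without the explicit argument, open falls back to a platform-dependent default, so the same file can decode differently on another machine.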


(Kevin Bird) #5

How would I go about finding what my char encoding is?


(Jeremy Howard (Admin)) #6

That makes sense. Looks like the start of a DataframeTextDataset class?.. :slight_smile:


(Kevin Bird) #7

Yeah, I think so. I am still working out which parameters you would want, but I’m thinking you would just pass in the DataFrame, text column, and label column. I guess you would still need the Text and Label Field arguments as well. Maybe I will work on implementing this over the weekend. I’m sure other people would find this useful as well.
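
A rough sketch of those parameters, without torchtext: a hypothetical helper, rows_from_frame, takes the frame plus the text and label column names and ignores everything else. The User_ID column and the sample rows are invented for illustration, and a plain dict of lists stands in for the DataFrame so the snippet runs without pandas.

```python
def rows_from_frame(df, text_col, label_col):
    """Return [text, label] rows using only the two named columns.

    `df` may be a pandas DataFrame or any mapping of column name -> sequence;
    other columns (IDs, metadata) are ignored on purpose.
    """
    return [[text, label] for text, label in zip(df[text_col], df[label_col])]


# Stand-in for a DataFrame (assumption: real data would come from pandas).
frame = {
    "User_ID": ["id1", "id2"],  # hypothetical extra column, never touched
    "Description": ["great stay", "noisy room"],
    "Is_Response": ["happy", "not_happy"],
}
rows = rows_from_frame(frame, "Description", "Is_Response")
```

Each row could then be handed to data.Example.fromlist(row, fields) as in the code earlier in the thread, with fields holding only the text and label entries.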


(Jeremy Howard (Admin)) #8

That’s what I’m thinking too. :slight_smile:


(WG) #9

You might want to check out a pull request I just made to the fastai repo.

I’m not sure it does exactly what you are trying to do, but I added two new classes to nlp.py that allow you to build a LanguageModelData object from dataframes instead of multiple text files. Take a look. Not sure if it will get accepted, but it’s working for me on the spooky author dataset.

The two classes I’m proposing are:

ConcatTextDatasetFromDataFrames(torchtext.data.Dataset)

… and

LanguageModelDataFromDataFrames()

Works just like the lesson-4-imdb notebook but with dataframes.

@jeremy I coded things to not break anything, but may I suggest modifying LanguageModelData class to simply expose class methods, from_dfs and from_text_files, to build the ModelData object.
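
To illustrate the shape of that suggestion, here is a minimal sketch of the alternative-constructor pattern. TextCorpus is a hypothetical stand-in, not the real LanguageModelData, and the signatures are assumptions: from_dfs takes a single frame and a column name here, and a dict of lists stands in for a DataFrame so the sketch runs without pandas.

```python
import os
import tempfile


class TextCorpus:
    """Hypothetical stand-in for a model-data class; not the fastai API."""

    def __init__(self, texts):
        self.texts = list(texts)

    @classmethod
    def from_dfs(cls, df, text_col):
        # Works with a pandas DataFrame or any mapping of column name -> sequence.
        return cls(df[text_col])

    @classmethod
    def from_text_files(cls, paths, encoding="utf-8"):
        texts = []
        for path in paths:
            with open(path, "r", encoding=encoding) as f:
                texts.append(f.read())
        return cls(texts)


# Demo: both constructors end up with the same documents.
frame = {"text": ["first document", "second document"]}
from_frame = TextCorpus.from_dfs(frame, "text")

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(frame["text"]):
        path = os.path.join(tmp, f"doc{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    from_files = TextCorpus.from_text_files(paths)
```

The point of the pattern is that the data source only matters inside the class method; everything after construction sees the same object.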


(Jeremy Howard (Admin)) #10

Yes, please modify your PR as you see fit to make an API that you think is reasonably clean. Don’t feel the need to keep my crappy API in place!


(WG) #11

haha … yah ok, I didn’t want to step on anyone’s toes.

Will do.

I’m going to really be pushing the use of DataFrames as data sources where possible … they are just so malleable and make it trivial to do NLP with all kinds of underlying formats (text, html, csv, etc.).


(Jeremy Howard (Admin)) #12

Yeah I’ve been starting to feel the same way.


(WG) #13

Done.

Regarding DataFrames, treating them as an interface allows for cleaner and more concise code vs. writing a bunch of somewhat redundant code (that will have to be maintained).

Essentially what you’re saying is, “I don’t care what kind of data you have, put it into a DataFrame and we’ll do the rest”.

It’s beautiful.


(James Requa) #14

Love this idea. Btw there’s actually a Wiki here for Fastai library “feature requests” which is basically a thread to keep track of all the ongoing stuff being added to fastai. Feel free to post this in there… I already have from_df and texts_from_df on the list, both of which I was planning on doing PRs for, but it sounds like you would definitely be able to implement this much more quickly and efficiently than I could :slight_smile:


(Louis Guthmann) #15

I definitely think that DataFrames as a data source should be part of the library.

By the way, as I’m trying to use LanguageModelData for my sequence problem, two problems occurred.
The first one is the assumption that the text will be English (spacy_tok uses spacy_en and offers no way to specify another language or, in my case, a completely new language).
The second is a smaller one: ConcatTextDataset assumes that the files will have a dot in their filename. As I used csplit from bash to cut my files, it took me a while to understand my error. I think a docstring should be added and the error message should be more specific.
Lastly, a progress bar for the loading could be very useful. As my data is really big, it keeps freezing and I don’t know where I stand.
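
The dot-in-filename issue can be reproduced with a small sketch, assuming the loader collects files with a glob pattern such as '*.*' (an assumption; the actual code is not shown in this thread). The helper list_corpus_files is hypothetical and shows what a more specific error message might look like.

```python
import glob
import os
import tempfile


def list_corpus_files(path, pattern="*.*"):
    """List files under `path`, raising a specific error when nothing matches."""
    paths = sorted(glob.glob(os.path.join(path, pattern)))
    if not paths:
        raise FileNotFoundError(
            f"No files matching {pattern!r} in {path!r}; "
            "files without a dot in their name are skipped by this pattern."
        )
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # csplit names its output xx00, xx01, ... by default (no extension).
    for name in ["xx00", "xx01"]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("some text\n")

    # The dotted pattern finds nothing, so the helper raises with a clear message.
    try:
        list_corpus_files(tmp)
        error_message = None
    except FileNotFoundError as err:
        error_message = str(err)

    # A pattern without the dot picks the same files up.
    found = [os.path.basename(p) for p in list_corpus_files(tmp, pattern="*")]
```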

As I have never contributed a proposal to an open source project, I don’t really know how to proceed from these observations.


(Jeremy Howard (Admin)) #16

I’ve merged the changes - thanks @wgpubs! If there are notebooks that need changes, maybe you could push those changes too, if you happen to have time?..


(Rob H) #17

I was getting an error from the last set of changes, so I submitted a PR.


(Rob H) #18

Hi @KevinB, I have a few comments/questions about the happiness dataset and using NLP in fastai.

(1) How did you specify your test set and get its predictions for the happiness dataset?

(2) I had to modify TextData.from_splits() in nlp.py to this because previously it wasn’t expecting a test set,

trn_iter,val_iter,test_iter = torchtext.data.BucketIterator.splits(splits, batch_size=bs)

(3) FWIW, this is how I set things up using dataframes, after a decent amount of confusion. I’m still not sure this is correct.

import pickle

import torchtext
from torchtext import data

class PredictHappinessDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, dfs, **kwargs):
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        # With path='' in splits, `path` arrives here as the split name
        # ('train', 'val' or 'test') and is used as the key into dfs.
        df = dfs[path]
        # zip iterates by position, so it also works when the index is not 0..n-1.
        for text, label in zip(df.Description, df.Is_Response):
            examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, path,
               train, val, test, dfs, **kwargs):
        return super().splits(path,
            text_field=text_field, label_field=label_field,
            train=train, validation=val, test=test, dfs=dfs, **kwargs)

# train_df, val_df, test_df and PATH are defined elsewhere in the notebook.
df = {'train': train_df, 'val': val_df, 'test': test_df}
with open(f'{PATH}models/TEXT.pkl', 'rb') as f:
    TEXT = pickle.load(f)
LABEL = data.Field(sequential=False)
splits = PredictHappinessDataset.splits(TEXT, LABEL, '',
                             train='train',
                             val='val', test='test', dfs=df)

(ecdrid) #19

Did it work?


(Rob H) #20

Yes it did!