Creating a ModelData object without torchtext splits?

KevinB · November 22, 2017, 7:22am

I am working on a sentiment prediction problem, following the lesson4 notebook. I have built the text prediction piece, and it seems to be working as expected. Now I am trying to do the second part: using transfer learning so that, instead of predicting the next word, the model predicts whether the reviewer is happy or not_happy. I have gotten to the part below and am having trouble converting it into a non-split version.

IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

md2 = TextData.from_splits(PATH, splits, bs)

What I currently have is below (it does not work at all):

training_data = data.TabularDataset(PATH+"train.csv", "csv", [("User_ID", data.Field()), ("Description", data.Field()), ("Browser_Used", data.Field()), ("Device_Used", data.Field()), ("Is_Response", data.Field(sequential=False))], skip_header=True)

md2 = TextData.from_splits(PATH, [training_data], bs, text_name="Description", label_name="Is_Response")

If anybody has any advice here I would really appreciate it!

The reason I got to this point is that I know I won’t be able to use datasets, since that is only for torchtext’s prebuilt datasets. I believe the solution will be something using data.something, but I haven’t quite put the pieces together yet.

jeremy · November 22, 2017, 5:57pm

I suggest creating your own torchtext dataset. See the arxiv example notebook where I’ve created one.
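
As a starting point for such a dataset, the rows first have to be reduced to (text, label) pairs. A minimal standard-library sketch, assuming the CSV has the `Description` and `Is_Response` columns from the first post (the sample rows and the `read_pairs` helper are made up for illustration):

```python
import csv
import io


def read_pairs(handle, text_col="Description", label_col="Is_Response"):
    """Yield (text, label) tuples from a CSV file object that has a header row."""
    for row in csv.DictReader(handle):
        yield row[text_col], row[label_col]


# Made-up sample in place of train.csv; for a real file use open(path, newline="").
sample = io.StringIO(
    "User_ID,Description,Browser_Used,Device_Used,Is_Response\n"
    'id1,"Great stay, friendly staff",Firefox,Mobile,happy\n'
    "id2,Room was dirty,Chrome,Desktop,not_happy\n"
)

pairs = list(read_pairs(sample))
print(pairs)
```

Each pair can then be handed to `data.Example.fromlist([text, label], fields)`, as the ArxivDataset code later in this thread does for text files.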

KevinB · November 22, 2017, 6:22pm

Thanks, I was hoping there was an example of this somewhere. I’ll try this tonight. I’m hoping this is my last roadblock before submitting something to the predict-the-happiness challenge.

mmr · December 5, 2017, 8:20pm

Hi Jeremy, will this API call suffice if I have to load my train and test data from dataframes?

Signature: TextData.from_dls(path, trn_dl, val_dl, test_dl=None)
Source:   
    @classmethod
    def from_dls(cls, path,trn_dl,val_dl,test_dl=None):
        trn_dl,val_dl = ModelDataLoader(trn_dl),ModelDataLoader(val_dl)
        if test_dl: test_dl = ModelDataLoader(test_dl)
        return cls(path, trn_dl, val_dl, test_dl)
File:      ~/fastai/courses/dl1/fastai/dataset.py
Type:      method

jeremy · December 6, 2017, 4:28am

I’m not sure just from looking - give it a try and tell us how it goes!

mmr · December 6, 2017, 6:15am

I am using the API like this, but I realize there is no way to give labels to the data.

md2 = TextData.from_dls(PATH, trainDF, validDF)
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.65, wdrop=0.5, dropoute=0.1, dropouth=0.3)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-42e22ef1b92d> in <module>()
      1 md2 = TextData.from_dls(PATH, trainDF, validDF)
      2 m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
----> 3            dropout=0.1, dropouti=0.65, wdrop=0.5, dropoute=0.1, dropouth=0.3)

~/fastai/courses/dl1/fastai/nlp.py in get_model(self, opt_fn, max_sl, bptt, emb_sz, n_hid, n_layers, **kwargs)
    350 
    351     def get_model(self, opt_fn, max_sl, bptt, emb_sz, n_hid, n_layers, **kwargs):
--> 352         m = get_rnn_classifer(max_sl, bptt, self.bs, self.c, self.nt, emb_sz=emb_sz, n_hid=n_hid, n_layers=n_layers,
    353                               pad_token=self.pad_idx, **kwargs)
    354         model = TextModel(to_gpu(m))

AttributeError: 'TextData' object has no attribute 'bs'
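
One reading of the traceback above: `get_model` reads `self.bs`, `self.c`, `self.nt` and `self.pad_idx`, while the `from_dls` source quoted earlier only wraps the loaders and passes them to the constructor, so none of those attributes are set on that path. The sketch below reproduces the pattern with a toy stand-in class (`ToyTextData` and its methods are hypothetical, not fastai code). It also assumes, without the thread confirming it, that the splits-based constructor is the one that fills these attributes in.

```python
class ToyTextData:
    """Hypothetical stand-in for a model-data class; not the fastai implementation."""

    def __init__(self, path, trn_dl, val_dl, test_dl=None):
        self.path, self.trn_dl, self.val_dl, self.test_dl = path, trn_dl, val_dl, test_dl

    @classmethod
    def from_dls(cls, path, trn_dl, val_dl, test_dl=None):
        # Mirrors the quoted source: only stores the loaders, no bs / c / nt / pad_idx.
        return cls(path, trn_dl, val_dl, test_dl)

    @classmethod
    def from_splits(cls, path, splits, bs):
        # Assumption: the splits-based constructor records the extra metadata.
        trn, val = splits[0], splits[1]
        obj = cls.from_dls(path, trn, val)
        obj.bs = bs
        obj.c = len({label for _, label in trn})  # number of distinct labels
        return obj

    def get_model(self):
        # Needs attributes that from_dls alone never sets.
        return {"bs": self.bs, "c": self.c}


train = [("great stay", "happy"), ("dirty room", "not_happy")]
valid = [("fine", "happy")]

ok = ToyTextData.from_splits("data/", [train, valid], bs=64)
print(ok.get_model())

try:
    ToyTextData.from_dls("data/", train, valid).get_model()
except AttributeError as err:
    print("AttributeError:", err)
```

If that reading is right, the labels would have to enter through a dataset with a label field, which matches the advice in this thread to build a custom torchtext dataset and go through `from_splits`.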

mmr · December 6, 2017, 7:25am

Just checking whether anybody has managed to modify this part of the code from the nlp-arxiv notebook to take a dataframe of sentences with labels.

# class ArxivDataset(torchtext.data.Dataset):
#     def __init__(self, path, text_field, label_field, **kwargs):
#         fields = [('text', text_field), ('label', label_field)]
#         examples = []
#         for label in ['yes', 'no']:
#             for fname in iglob(os.path.join(path, label, '*.txt')):
#                 with open(fname, 'r') as f: text = f.readline()
#                 examples.append(data.Example.fromlist([text, label], fields))
#         super().__init__(examples, fields, **kwargs)

#     @staticmethod
#     def sort_key(ex): return len(ex.text)
    
#     @classmethod
#     def splits(cls, text_field, label_field, root='.data',
#                train='train', test='test', **kwargs):
#         return super().splits(
#             root, text_field=text_field, label_field=label_field,
#             train=train, validation=None, test=test, **kwargs)

guthl · December 11, 2017, 1:19pm

Also working on this. Any updates?

Thanks

jeremy · December 11, 2017, 2:43pm

The IMDB dataset code shows how to do this - that would be the best place to start.

rob · December 11, 2017, 3:57pm

I did, you can check here @guthl

Basically, I added an argument dfs, a dict that maps each split name to its training/validation/test dataframe:

df = {'train': train_df, 'val': val_df, 'test': None}

splits = PredictHappinessDataset.splits(TEXT, LABEL, '',
                             train='train',
                             val='val', test=None, dfs=df)

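import torchtext
from torchtext import data

class PredictHappinessDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, dfs, **kwargs):
        # `path` is not a file path here: it is the key ('train', 'val', ...) into dfs.
        fields = [("text", text_field), ("label", label_field)]
        df = dfs[path]
        # A test dataframe may have no label column; use None as the label then.
        has_label = 'Is_Response' in df
        examples = []
        for i in range(len(df)):
            # .iloc is positional, so rows are read in order even if the
            # dataframe index is not 0..n-1 (e.g. after a train/validation split).
            text = df.Description.iloc[i]
            label = df.Is_Response.iloc[i] if has_label else None
            examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, path,
               train, val, test, dfs, **kwargs):
        # dfs is forwarded as a keyword argument to __init__ for every split.
        return super().splits(path,
            text_field=text_field, label_field=label_field, 
            train=train, validation=val, test=test, dfs=dfs, **kwargs)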

It’s not pretty but works.
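
A likely explanation of why passing `''` as the path works: `torchtext.data.Dataset.splits` appears to build each split by calling the class with `os.path.join(path, name)` and forwarding the remaining keyword arguments, so with an empty path the constructor simply receives `'train'` or `'val'`, which the class above uses as a key into `dfs`. The following is a plain-Python mimic of that dispatch under that assumption (`ToyDataset` is hypothetical and the rows are made up), not torchtext itself.

```python
import os


class ToyDataset:
    """Hypothetical mimic of the splits() dispatch; not torchtext code."""

    def __init__(self, path, dfs):
        # `path` arrives as os.path.join(root, split_name); with root '' it is just the name.
        self.name = path
        self.rows = dfs[path]

    @classmethod
    def splits(cls, path, train=None, validation=None, test=None, **kwargs):
        # Assumed behaviour: one dataset per non-None split name, kwargs forwarded.
        names = (train, validation, test)
        return tuple(cls(os.path.join(path, n), **kwargs) for n in names if n is not None)


# Made-up rows standing in for train_df / val_df.
dfs = {
    "train": [("lovely hotel", "happy"), ("never again", "not_happy")],
    "val": [("ok for one night", "happy")],
    "test": None,
}

trn, val = ToyDataset.splits("", train="train", validation="val", test=None, dfs=dfs)
print(trn.name, len(trn.rows))
print(val.name, len(val.rows))
```

A non-empty path would change the key (for example `'data/train'`), so the keys in `dfs` have to match whatever the join produces.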