Creating a ModelData object without torchtext splits?

I am working on a sentiment prediction problem and I am following the lesson4 notebook. I have built the text prediction piece and that all seems to be working as expected. Now I am trying to do the second part, which is to use transfer learning so that instead of predicting the next word, the model predicts whether the reviewer is happy or not_happy. I have gotten to this part here and am having trouble converting it into a non-split version.

IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

md2 = TextData.from_splits(PATH, splits, bs)

So what I currently have is this (not working at all at the moment):

training_data = data.TabularDataset(
    PATH + "train.csv", "csv",
    [("User_ID", data.Field()), ("Description", data.Field()),
     ("Browser_Used", data.Field()), ("Device_Used", data.Field()),
     ("Is_Response", data.Field(sequential=False))],
    skip_header=True)

md2 = TextData.from_splits(PATH, [training_data], bs, text_name="Description", label_name="Is_Response")

If anybody has any advice here I would really appreciate it!

The reason I got to this point is that I know I won't be able to use datasets, since that is only for torchtext's prebuilt datasets, so I believe the solution will be something using data.something, but I haven't quite put the pieces together yet.
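
For reference, the torchtext piece this is reaching for looks roughly like the following. This is only a minimal sketch, assuming the column names from the CSV above and the default whitespace tokenizer (the notebook uses a spacy tokenizer instead); the vocab still needs to be built and the result still has to be wired into fastai somehow:

from torchtext import data

TEXT = data.Field(lower=True)           # text column: sequential, tokenized
LABEL = data.Field(sequential=False)    # label column: one category per row

train_ds = data.TabularDataset(
    PATH + "train.csv", "csv",
    fields=[("User_ID", None),          # None skips a column entirely
            ("Description", TEXT),
            ("Browser_Used", None),
            ("Device_Used", None),
            ("Is_Response", LABEL)],
    skip_header=True)

TEXT.build_vocab(train_ds)
LABEL.build_vocab(train_ds)

The open question after that is still how to hand this to TextData.from_splits without a separate validation split.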

I suggest creating your own torchtext dataset. See the arxiv example notebook where I’ve created one.

Thanks, I was hoping there was an example of this somewhere. I'll try this tonight. I'm hoping this is my last roadblock before submitting something to the Predict the Happiness challenge.

Hi Jeremy, will this API call suffice if I have to load my train and test data from dataframes?

Signature: TextData.from_dls(path, trn_dl, val_dl, test_dl=None)
Source:   
    @classmethod
    def from_dls(cls, path,trn_dl,val_dl,test_dl=None):
        trn_dl,val_dl = ModelDataLoader(trn_dl),ModelDataLoader(val_dl)
        if test_dl: test_dl = ModelDataLoader(test_dl)
        return cls(path, trn_dl, val_dl, test_dl)
File:      ~/fastai/courses/dl1/fastai/dataset.py
Type:      method

I’m not sure just from looking - give it a try and tell us how it goes! 🙂

I am using the API like this, but I also realize that there is no way to give labels to the data.

md2 = TextData.from_dls(PATH, trainDF, validDF)
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.65, wdrop=0.5, dropoute=0.1, dropouth=0.3)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-42e22ef1b92d> in <module>()
      1 md2 = TextData.from_dls(PATH, trainDF, validDF)
      2 m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
----> 3            dropout=0.1, dropouti=0.65, wdrop=0.5, dropoute=0.1, dropouth=0.3)

~/fastai/courses/dl1/fastai/nlp.py in get_model(self, opt_fn, max_sl, bptt, emb_sz, n_hid, n_layers, **kwargs)
    350 
    351     def get_model(self, opt_fn, max_sl, bptt, emb_sz, n_hid, n_layers, **kwargs):
--> 352         m = get_rnn_classifer(max_sl, bptt, self.bs, self.c, self.nt, emb_sz=emb_sz, n_hid=n_hid, n_layers=n_layers,
    353                               pad_token=self.pad_idx, **kwargs)
    354         model = TextModel(to_gpu(m))

AttributeError: 'TextData' object has no attribute 'bs'
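
For what it's worth, the traceback itself points at the issue: from_dls just wraps whatever it is given in ModelDataLoader, so the attributes that get_model reads (bs, c, nt, pad_idx) are never set, and DataFrames aren't DataLoaders to begin with. A hypothetical workaround sketch, assuming you actually have PyTorch DataLoaders trn_dl/val_dl and a TEXT field whose vocab has been built:

md2 = TextData.from_dls(PATH, trn_dl, val_dl)    # DataLoaders, not DataFrames
md2.bs = bs                                      # batch size the loaders were built with
md2.c = 2                                        # number of classes (happy / not_happy)
md2.nt = len(TEXT.vocab)                         # vocabulary size
md2.pad_idx = TEXT.vocab.stoi[TEXT.pad_token]    # index of the padding token
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropout=0.1, dropouti=0.65, wdrop=0.5, dropoute=0.1, dropouth=0.3)

Even with that, the custom torchtext dataset route discussed below is probably the cleaner way to get labelled batches.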

Just checking if anybody has managed to modify this part of the code from the nlp-arxiv notebook to take a dataframe of sentences with labels.

class ArxivDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        examples = []
        for label in ['yes', 'no']:
            for fname in iglob(os.path.join(path, label, '*.txt')):
                with open(fname, 'r') as f: text = f.readline()
                examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)

    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train', test='test', **kwargs):
        return super().splits(
            root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)

Also working on this. Any updates?

Thanks

The IMDB dataset code shows how to do this - that would be the best place to start.

I did, you can check here @guthl

Basically, I added a dfs argument that is a dict pointing to the training/validation/test dataframes:

df = {'train': train_df, 'val': val_df, 'test': None}

splits = PredictHappinessDataset.splits(TEXT, LABEL, '',
                             train='train',
                             val='val', test=None, dfs=df)

class PredictHappinessDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, dfs, **kwargs):
        # `path` here is the split name ('train'/'val'/'test'), used as the key into dfs
        fields = [("text", text_field), ("label", label_field)]
        examples = []
        for i in range(dfs[path].values[:,1].shape[0]):   # iterate over the dataframe rows
            text = dfs[path].Description[i]
            label = None
            if 'Is_Response' in dfs[path]:                 # the test set has no labels
                label = dfs[path].Is_Response[i]
            examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    @staticmethod
    def sort_key(ex): return len(ex.text)
    
    @classmethod
    def splits(cls, text_field, label_field, path,
               train, val, test, dfs, **kwargs):
        # torchtext builds one dataset per split; the train/val/test names become
        # the `path` argument of __init__ above, so they index into dfs
        return super().splits(path,
            text_field=text_field, label_field=label_field, 
            train=train, validation=val, test=test, dfs=dfs, **kwargs)

It’s not pretty, but it works.
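
For anyone picking this up, here is a rough sketch of how those splits would then plug back into the flow from the lesson notebook (assuming the TEXT/LABEL fields from above, and that bs, bptt, em_sz, nh, nl and opt_fn are defined as earlier in the thread):

TEXT.build_vocab(splits[0])      # or reuse the vocab from the language-model step
LABEL.build_vocab(splits[0])

md2 = TextData.from_splits(PATH, splits, bs)
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropout=0.1, dropouti=0.65, wdrop=0.5, dropoute=0.1, dropouth=0.3)

From there the pretrained language-model encoder can be loaded and the classifier fine-tuned, as in the lesson4 notebook.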
