TypeError: '<' not supported between instances of 'Example' and 'Example' when using custom NLP dataset



I am currently trying to apply some sentiment analysis knowledge from the 1st part of the Deep Learning course. The data came as one big .csv file, which I reduced to 4 columns (id, Clothing ID, Review Text and Recommended IND, the dependent variable) and split into train and validation sets. I trained a model on the train set to predict words (as Jeremy did) and saved the encoder and vocabulary. Then I proceeded to sentiment analysis.

Data loading
I found out it’s possible to use TabularDataset in order to get splits from .csv files, so I used that.

splits = data.TabularDataset.splits(path=PATH, format='csv', skip_header=True,
                                train='trn.csv', validation='val.csv',
                                fields=[('id', None),
                                        ('Clothing ID', None),
                                        ('Review Text', TEXT),
                                        ('Recommended IND', LABEL)])

With the splits created, I went on to create the TextData and RNN_Learner objects.

m2 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m2.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)

Then I started training.

m2.fit(lrs/2, 1, metrics=[accuracy])

But after the first epoch I get this error:

TypeError: '<' not supported between instances of 'Example' and 'Example'

Full traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-35-893af9e1e024> in <module>()
      1 m2.freeze_to(-1)
----> 2 m2.fit(lrs/2, 1, metrics=[accuracy])

D:\Dropbox\Dropbox\Computer science\Applied machine learning\FastAI NLP\Clothing reviews\fastai\learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    285         self.sched = None
    286         layer_opt = self.get_layer_opt(lrs, wds)
--> 287         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    289     def warm_up(self, lr, wds=None):

D:\Dropbox\Dropbox\Computer science\Applied machine learning\FastAI NLP\Clothing reviews\fastai\learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
    232             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
    233             swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 234             swa_eval_freq=swa_eval_freq, **kwargs)
    236     def get_layer_groups(self): return self.models.get_layer_groups()

D:\Dropbox\Dropbox\Computer science\Applied machine learning\FastAI NLP\Clothing reviews\fastai\model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, **kwargs)
    160         if not all_val:
--> 161             vals = validate(model_stepper, cur_data.val_dl, metrics, seq_first=seq_first)
    162             stop=False
    163             for cb in callbacks: stop = stop or cb.on_epoch_end(vals)

D:\Dropbox\Dropbox\Computer science\Applied machine learning\FastAI NLP\Clothing reviews\fastai\model.py in validate(stepper, dl, metrics, seq_first)
    220     stepper.reset(False)
    221     with no_grad_context():
--> 222         for (*x,y) in iter(dl):
    223             preds, l = stepper.evaluate(VV(x), VV(y))
    224             batch_cnts.append(batch_sz(x, seq_first=seq_first))

D:\Dropbox\Dropbox\Computer science\Applied machine learning\FastAI NLP\Clothing reviews\fastai\nlp.py in __iter__(self)
    322         it = iter(self.src)
    323         for i in range(len(self)):
--> 324             b = next(it)
    325             yield getattr(b, self.x_fld).data, getattr(b, self.y_fld).data

D:\Programs\Anaconda3\envs\fastai\lib\site-packages\torchtext\data\iterator.py in __iter__(self)
    133     def __iter__(self):
    134         while True:
--> 135             self.init_epoch()
    136             for idx, minibatch in enumerate(self.batches):
    137                 # fast-forward if loaded from state

D:\Programs\Anaconda3\envs\fastai\lib\site-packages\torchtext\data\iterator.py in init_epoch(self)
    109             self._random_state_this_epoch = self.random_shuffler.random_state
--> 111         self.create_batches()
    113         if self._restored_from_state:

D:\Programs\Anaconda3\envs\fastai\lib\site-packages\torchtext\data\iterator.py in create_batches(self)
    233     def create_batches(self):
    234         if self.sort:
--> 235             self.batches = batch(self.data(), self.batch_size,
    236                                  self.batch_size_fn)
    237         else:

D:\Programs\Anaconda3\envs\fastai\lib\site-packages\torchtext\data\iterator.py in data(self)
     94         """Return the examples in the dataset in order, sorted, or shuffled."""
     95         if self.sort:
---> 96             xs = sorted(self.dataset, key=self.sort_key)
     97         elif self.shuffle:
     98             xs = [self.dataset[i] for i in self.random_shuffler(range(len(self.dataset)))]

TypeError: '<' not supported between instances of 'Example' and 'Example'


I have exactly the same problem when using TabularDataset. Did you get this worked out?





Honestly, I gave up. I thought (and still think) that the problem is the sort key of TabularDataset: it seems to end up comparing the Example objects themselves with '<' instead of sorting them by 'len'. I tried to find a way to change it but couldn't, so I decided to go with the approach Jeremy suggests and created my own dataset object (as shown in his notebook, fastai/courses/dl1/lang_model-arxiv.ipynb).
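
For anyone landing here with the same error, here is a minimal sketch of what the traceback points to. It uses a stand-in Example class rather than torchtext itself. It assumes the iterator ends up calling sorted(self.dataset, key=self.sort_key) with sort_key set to None, which is what the last traceback frame suggests but is not confirmed anywhere in this thread:

```python
# Stand-in for torchtext's Example: a plain object with no ordering defined.
# (Hypothetical minimal class for illustration; not the real torchtext class.)
class Example:
    def __init__(self, text):
        self.text = text

examples = [Example(["nice", "dress", "overall"]), Example(["great"])]

# With key=None, sorted() compares the objects themselves using '<',
# which fails because Example defines no __lt__.
try:
    sorted(examples, key=None)
    error_message = None
except TypeError as err:
    error_message = str(err)

# With an explicit key, sorted() only compares the integer lengths.
by_length = sorted(examples, key=lambda ex: len(ex.text))
lengths = [len(ex.text) for ex in by_length]

print(error_message)
print(lengths)
```

If that assumption holds, it would explain why a dataset class that defines its own sort_key returning len(ex.text) avoids the error.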

I would share my approach, but back when I was implementing this I struggled with PyTorch and was a bit desperate to get it done, so I just copy-pasted and slightly changed Jeremy's solution and created a bunch of .txt files, which git apparently doesn't like.

Anyway, everything you need to know should be in the mentioned notebook.

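import os
from glob import glob

import torchtext
from torchtext import data

class ArxivDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, label_field, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        examples = []
        # Each label has its own sub-folder of .txt files under `path`.
        for label in ['yes', 'no']:
            for fname in glob(os.path.join(path, label, '*.txt')):
                # Only the first line of each file is used as the text.
                with open(fname, 'r') as f: text = f.readline()
                examples.append(data.Example.fromlist([text, label], fields))
        super().__init__(examples, fields, **kwargs)

    # Must be a staticmethod: the iterator calls sort_key(example) when sorting.
    @staticmethod
    def sort_key(ex): return len(ex.text)

    # Must be a classmethod so that super().splits() builds ArxivDataset objects.
    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train', test='test', **kwargs):
        return super().splits(
            root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)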

ARX_LABEL = data.Field(sequential=False)
splits = ArxivDataset.splits(TEXT, ARX_LABEL, PATH, train='trn', test='val')

md2 = TextData.from_splits(PATH, splits, bs)

This is essentially all you need (apart from splitting your data to .txt files)
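
The splitting step mentioned above can be sketched with the standard library alone. This is a hypothetical helper, not code from the notebook. The function name, the assumption that the CSV header uses the column names Review Text and Recommended IND, and the mapping of 1 to 'yes' and anything else to 'no' are all assumptions. Whitespace is collapsed so that each review fits on one line, since the dataset class above only reads the first line of each file with readline().

```python
import csv
import os

def csv_to_label_dirs(csv_path, out_dir,
                      text_col="Review Text", label_col="Recommended IND"):
    """Write one .txt file per CSV row into out_dir/yes or out_dir/no.

    Hypothetical helper: column names and the 1 -> 'yes' mapping are assumptions.
    Returns the number of files written per label.
    """
    counts = {"yes": 0, "no": 0}
    for label in counts:
        os.makedirs(os.path.join(out_dir, label), exist_ok=True)

    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            # Collapse newlines and repeated spaces so the text is a single line.
            text = " ".join((row[text_col] or "").split())
            if not text:
                continue  # skip rows with no review text
            label = "yes" if (row[label_col] or "").strip() == "1" else "no"
            out_path = os.path.join(out_dir, label, f"{i}.txt")
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(text)
            counts[label] += 1
    return counts
```

Called once with trn.csv and an output folder PATH/trn, and once with val.csv and PATH/val, this should produce the yes/no folder layout that the glob pattern in the dataset class looks for.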


Thanks so much for sharing your findings and solution. I was going to try the same approach, i.e. convert the csv into separate files and use a version of Jeremy's code. At least I know this is a viable way forward.

It seems to me that the splits.examples generated by torchtext.data.TabularDataset.splits are somehow treated incorrectly; maybe, as you say, the sort key is wrong. But it is beyond me at the moment to suggest how to fix this.

Thanks again for your kind help



I actually found some code from @hiromi, which really helped me quickly build a dataset without having to convert my nice csv file into 1000s of files.

So thanks also to Hiromi


Yeah, that's a great solution. Thanks for sharing it! (and @hiromi too!)
