Value Error in Lesson 4 - NLP

castilla · May 27, 2018, 12:04am

After successfully reproducing the lesson 4 sentiment training on IMDB, I tried to reproduce it on my own dataset.

This is the code I used to build the datasets and train the model:
```
home = os.getenv("HOME")
PATH = home + "/fastai/courses/dl1/"
arquivo = "model_txt_sedi.pkl"
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
TEXT = pickle.load(open(arquivo, 'rb'))
spacy_tok_pt = spacy.load('pt_core_news_sm')

def tokenizer(text):  # create a tokenizer function
    return [tok.text for tok in spacy_tok_pt.tokenizer(text)]

LABEL = data.Field(sequential=False, use_vocab=False)
splits = data.TabularDataset.splits(
    path=PATH, train='train_.csv', validation='val_.csv', test='test_.csv',
    format='csv', fields=[('text', TEXT), ('label', LABEL)])

md2 = TextData.from_splits(PATH, splits, bs)
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder(f'adam1_enc_sedi')
m3.clip = 25.
lrs = np.array([1e-4, 1e-4, 1e-4, 1e-3, 1e-2])
m3.freeze_to(-1)
m3.fit(lrs/2, 3)
m3.unfreeze()
m3.fit(lrs, 3, metrics=[accuracy], cycle_len=1)
```
However, after the first epoch I get the following error:

100%|██████████| 313/313 [03:17<00:00, 1.58it/s, loss=0.0891]
Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydev_run_in_console.py", line 53, in run_file
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/castilla/fastai/courses/dl1/splits.py", line 58, in <module>
    m3.fit(lrs/2, 3)
  File "/Users/castilla/fastai/courses/dl1/fastai/learner.py", line 287, in fit
    return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
  File "/Users/castilla/fastai/courses/dl1/fastai/learner.py", line 234, in fit_gen
    swa_eval_freq=swa_eval_freq, **kwargs)
  File "/Users/castilla/fastai/courses/dl1/fastai/model.py", line 159, in fit
    vals = validate(model_stepper, cur_data.val_dl, metrics)
  File "/Users/castilla/fastai/courses/dl1/fastai/model.py", line 216, in validate
    for (*x,y) in iter(dl):
  File "/Users/castilla/fastai/courses/dl1/fastai/nlp.py", line 324, in __iter__
    b = next(it)
  File "/Users/castilla/anaconda3/envs/fastai-cpu/lib/python3.6/site-packages/torchtext/data/iterator.py", line 134, in __iter__
    self.init_epoch()
  File "/Users/castilla/anaconda3/envs/fastai-cpu/lib/python3.6/site-packages/torchtext/data/iterator.py", line 111, in init_epoch
    self.create_batches()
  File "/Users/castilla/anaconda3/envs/fastai-cpu/lib/python3.6/site-packages/torchtext/data/iterator.py", line 234, in create_batches
    self.batches = batch(self.data(), self.batch_size, self.batch_size_fn)
  File "/Users/castilla/anaconda3/envs/fastai-cpu/lib/python3.6/site-packages/torchtext/data/iterator.py", line 96, in data
    xs = sorted(self.dataset, key=self.sort_key)
TypeError: '<' not supported between instances of 'Example' and 'Example'

Any clues where my code is failing?
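
Editor's note: the last frame of the traceback, `sorted(self.dataset, key=self.sort_key)`, raises exactly this message when the key is `None`, because Python then compares the items directly and they define no ordering. Whether `sort_key` really is `None` here is an assumption; nothing in the thread confirms it. The sketch below uses a stand-in `Example` class (not torchtext's) and a hypothetical `try_sort` helper to show the mechanism in plain Python.

```python
class Example:
    """Stand-in for a dataset example: a plain object with no ordering defined."""
    def __init__(self, text):
        self.text = text


def try_sort(items, key):
    """Return the sorted list, or the TypeError message if sorting fails."""
    try:
        return sorted(items, key=key)
    except TypeError as err:
        return str(err)


examples = [Example(["a", "b", "c"]), Example(["a"]), Example(["a", "b"])]

# With key=None, sorted() falls back to comparing Example objects with '<'.
no_key = try_sort(examples, key=None)

# With a key that maps each example to its token count, sorting succeeds.
with_key = try_sort(examples, key=lambda ex: len(ex.text))

print(no_key)
print([len(ex.text) for ex in with_key])
```

If that is what is happening, the sort key reaching the iterator for the `TabularDataset` splits is the thing to inspect.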

andreasl · June 10, 2018, 9:16am

I have the exact same problem.

For some reason it is working when using the IMDB dataset from torchtext:

splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, PATH)
md2 = TextData.from_splits(PATH, splits, bs)

This is not working:

splits = data.TabularDataset.splits(
        path=PATH, train='train.tsv',
        validation='valid.tsv', test='testData.tsv', format='tsv',
        fields=[('Text', TEXT), ('Label', IMDB_LABEL)],
        skip_header=True)
# For some reason the field names in the Examples are capitalized. But this shouldn't cause a problem (?)
md2 = TextData.from_splits(PATH, splits, bs, text_name='Text', label_name='Label')

Results in:

TypeError: '<' not supported between instances of 'Example' and 'Example'
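
Editor's note: one possible workaround, assuming the cause is that the `TabularDataset` splits carry no `sort_key` while the built-in IMDB dataset defines one (an assumption, not verified against the fastai and torchtext versions used in this thread). The helper `ensure_sort_key` is hypothetical and is exercised here on stand-in objects rather than real torchtext datasets.

```python
from types import SimpleNamespace


def ensure_sort_key(datasets, field_name="text"):
    """Give each dataset lacking a sort_key one that sorts by token count.

    `field_name` is the attribute on each example that holds the token list.
    """
    for ds in datasets:
        if getattr(ds, "sort_key", None) is None:
            # Bind field_name as a default so each lambda keeps its own value.
            ds.sort_key = lambda ex, name=field_name: len(getattr(ex, name))
    return datasets


# Stand-ins for a dataset split and its examples (assumed shapes, for illustration).
train = SimpleNamespace(
    sort_key=None,
    examples=[
        SimpleNamespace(Text=["a", "very", "good", "film"]),
        SimpleNamespace(Text=["bad"]),
    ],
)

ensure_sort_key([train], field_name="Text")
ordered = sorted(train.examples, key=train.sort_key)
print([ex.Text for ex in ordered])
```

With real data this would be called on `splits` with `field_name='Text'` before `TextData.from_splits(...)`; whether the iterator then picks up the key depends on the installed torchtext version.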

One clue to what could be wrong is that it takes 01:51 to run an epoch on the IMDB dataset, while it takes 00:02 to run one epoch on the custom dataset.

This is surprising, considering that both datasets contain IMDB movie reviews and the IMDB set has:
Training set: 25000 examples
Validation set: 25000 examples

While the custom dataset has:

train, val, test = splits
len(train.examples), len(val.examples), len(test.examples)

Training set: 20000 examples
Validation set: 5000 examples
Test set: 25000 examples

Anyone have any clue how to fix this?

tombishop · August 1, 2018, 2:02pm

Hi,

I have exactly the same problem when using TabularDataset. Did you get this worked out?

Thanks

Tom

andreasl · August 7, 2018, 10:22am

Hi. Sorry for the late reply.

No, unfortunately I never got this to work.

tombishop · August 7, 2018, 12:58pm

No problem - I did get round it: