IMDB Test Data is different from what is shown in the Class

@anurag, I tried it, but same error. thanks for your inputs…

I’d Google those errors at the top of local output and see if there’s a solution.

Are you using an Amazon image that was made specifically for this course? If so, surely you’re not the only one who will encounter this issue

I did the git pull and updated the conda environment. I ran the lesson 4 -imdb note book. The first cell itself threw an error which got fixed by performing

python -m spacy download en

Then I ran all the cells and still encounter error as described below.

Satheesh has the same issue. It has something to do with your environment not being set up for using UTF-8 by default. I don’t know the solution, other than trying what’s been discussed and linked in this thread, the other thread, and various results from Googling.

Several people have encountered this, though I haven’t seen a magic bullet yet.

Hi, I had this issue at the place.

I modified the library to add the encoding where the file are opened. You can use the following snippet of code to check if it solves your problem.
Please notice the name LanguageModelData_ is different than the library one when using it.

class LanguageModelData_():
    def __init__(self, path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs):
        self.bs = bs
        self.path = path
        self.trn_ds = trn_ds; self.val_ds = val_ds; self.test_ds = test_ds

        field.build_vocab(self.trn_ds, **kwargs)

        self.pad_idx = field.vocab.stoi[field.pad_token]
        self.nt = len(field.vocab)

        self.trn_dl, self.val_dl, self.test_dl = [ LanguageModelLoader(ds, bs, bptt) 
                                                    for ds in (self.trn_ds, self.val_ds, self.test_ds) ]

    def get_model(self, opt_fn, emb_sz, n_hid, n_layers, **kwargs):
        m = get_language_model(self.bs, self.nt, emb_sz, n_hid, n_layers, self.pad_idx, **kwargs)
        model = SingleModel(to_gpu(m))
        return RNN_Learner(self, model, opt_fn=opt_fn)

    @classmethod
    def from_dataframes(cls, path, field, col, train_df, val_df, test_df=None, bs=64, bptt=70, **kwargs):
        # split train, val, and test datasets
        trn_ds, val_ds, test_ds = ConcatTextDatasetFromDataFrames.splits(text_field=field, col=col,
                                    train_df=train_df, val_df=val_df, test_df=test_df)

        return cls(path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs)

    @classmethod
    def from_text_files(cls, path, field, train, validation, test=None, bs=64, bptt=70, **kwargs):
        # split train, val, and test datasets
        trn_ds, val_ds, test_ds = ConcatTextDataset_.splits(
                                    path, text_field=field, train=train, validation=validation, test=test)

        return cls(path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs)

    
class ConcatTextDataset_(torchtext.data.Dataset):
    def __init__(self, path, text_field, newline_eos=True, **kwargs):
        fields = [('text', text_field)]
        text = []
        if os.path.isdir(path): paths=glob(f'{path}/*.*')
        else: paths=[path]
        for p in paths:
            for line in open(p,  encoding="utf-8"): text += text_field.preprocess(line)
            if newline_eos: text.append('<eos>')

        examples = [torchtext.data.Example.fromlist([text], fields)]
        super().__init__(examples, fields, **kwargs)

Maybe if you have a moment you could add an optional encoding parameter to the constructor and send in a PR?

It would only fix half the problem. When I went on with the notebook, I had another Unicode issue with torchtext.datasets.IMDB.splits and the problem is very likely to just repeat with other libraries.
According to the Python developers (as read in the Python Enhancement Proposal that I posted on another thread), the current right solution is to correct the environnement.

I’ll try to provide a tutorial to correct this error as well as my Dockerfile with all the instructions.

2 Likes