IMDB Test Data is different from what is shown in the Class

I think a recent change breaks this notebook. Maybe @wgpubs can suggest a fix.

I think you want to update the code to this:

md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

But I’m not at my machine to test it.
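For context, here is a rough sketch of how the surrounding setup would look, assuming the variable names from the lesson 4 notebook (untested, so treat it as a guess):

import torchtext
from torchtext import data
from fastai.nlp import *  # assumed source of LanguageModelData and spacy_tok in fastai 0.7

PATH = 'data/aclImdb/'                 # root of the IMDB data
TRN_PATH, VAL_PATH = 'train/all/', 'test/all/'
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
bs, bptt = 64, 70
TEXT = data.Field(lower=True, tokenize=spacy_tok)

md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)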


That’s it.

The codebase was modified so that LanguageModelData objects can be built from either text files or dataframes; from_text_files and from_dataframes are the class methods for each, respectively.
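Roughly, the calls look like this (a minimal sketch based on the signatures shown further down this thread; the paths and the 'review' column name are just illustrative):

# from text files: train/validation/test name files or directories under PATH
md = LanguageModelData.from_text_files(PATH, TEXT, train='train/all/',
                                       validation='test/all/', bs=64, bptt=70)

# from dataframes: col names the text column in each dataframe
md = LanguageModelData.from_dataframes(PATH, TEXT, col='review',
                                       train_df=train_df, val_df=val_df, bs=64, bptt=70)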

I’ve updated the notebook based on @wgpubs’s changes now, so if you git pull, it should work fine.

Thanks for doing this. I was just getting on today and was about to look at the notebooks when I saw that all was good.

@jeremy @wgpubs, I have done the latest git pull and am now getting a Unicode decode error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 796: ordinal not in range(128)"
@rob, I have the same line of code as you suggested, but it still has the issue.

Also, I keep getting the "Back to the Future" IMDB review, and the number of words seems to be wrong. I did exactly as suggested (moving files, etc.). See below. It would be great if you all could have a look and help.

@satheesh, regarding UTF-8, check the threads titled "Crestle"; one talks about encoding.

I think you need to git pull and/or check out the imdb notebook again to get the fix.

@rob, I have the latest code pull. I have checked those threads, but nothing worked. I am using Amazon's fastai AMI, which uses UTF-8 by default (see https://aws.amazon.com/amazon-linux-ami/faqs/). Not sure what's going on… two issues: the word counts are mismatched, and this ascii error…

Sorry I’m on my phone so can’t link the other thread. Did you find it?

I think you need to recreate the notebook, even after getting the latest pull. The Crestle author said the UTF-8 fix will work for new notebooks. If you have further trouble, I suggest @-mentioning him directly.

No worries, @rob. What do you mean by recreating the notebook? Copy-pasting each line, or just duplicating it? I am assuming this is the thread: Crestle - Spacy installation failed, which @anurag answered, but it does not mention anything about recreating.

@satheesh the other thread applies only to notebooks run on Crestle.

For the ascii issue with Amazon’s AMI, what is the output of the locale command?

@anurag, it seems to be UTF-8. Below is what I get.

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Got it. The only other thing I’d try is what’s recommended in that thread, since LC_ALL seems to be unset for you:

export LC_ALL="en_US.UTF-8"

If this works you can add it to your .bashrc.
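If you want to verify from inside Python, this is what I'd look at (just a sanity check, not a fix):

import locale
import sys

print(sys.getdefaultencoding())       # always 'utf-8' on Python 3
print(locale.getpreferredencoding())  # what open() uses when no encoding= is passed;
                                      # should report 'UTF-8' once the locale is fixed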

@anurag, I tried it, but I get the same error. Thanks for your input…

I’d Google those errors at the top of the locale output and see if there’s a solution.

Are you using an Amazon image that was made specifically for this course? If so, surely you’re not the only one who will encounter this issue.

I did the git pull and updated the conda environment. I ran the lesson 4 IMDB notebook. The first cell itself threw an error, which was fixed by running

python -m spacy download en

Then I ran all the cells and still encountered the same Unicode decode error described above.

Satheesh has the same issue. It has something to do with your environment not being set up to use UTF-8 by default. I don’t know the solution, other than trying what’s been discussed and linked in this thread, the other thread, and various results from Googling.

Several people have encountered this, though I haven’t seen a magic bullet yet.

Hi, I had this issue at the same place.

I modified the library to add the encoding where the files are opened. You can use the following snippet of code to check whether it solves your problem.
Please note that the name LanguageModelData_ is different from the library's, so use that name when calling it.

# Imports assumed for running this patch standalone (fastai 0.7-era codebase);
# fastai.nlp is assumed to provide LanguageModelLoader, RNN_Learner,
# get_language_model, SingleModel, to_gpu, and ConcatTextDatasetFromDataFrames.
import os
from glob import glob
import torchtext
from fastai.nlp import *

class LanguageModelData_():
    def __init__(self, path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs):
        self.bs = bs
        self.path = path
        self.trn_ds = trn_ds; self.val_ds = val_ds; self.test_ds = test_ds

        # build the vocab from the training set only
        field.build_vocab(self.trn_ds, **kwargs)

        self.pad_idx = field.vocab.stoi[field.pad_token]
        self.nt = len(field.vocab)

        self.trn_dl, self.val_dl, self.test_dl = [ LanguageModelLoader(ds, bs, bptt)
                                                    for ds in (self.trn_ds, self.val_ds, self.test_ds) ]

    def get_model(self, opt_fn, emb_sz, n_hid, n_layers, **kwargs):
        m = get_language_model(self.bs, self.nt, emb_sz, n_hid, n_layers, self.pad_idx, **kwargs)
        model = SingleModel(to_gpu(m))
        return RNN_Learner(self, model, opt_fn=opt_fn)

    @classmethod
    def from_dataframes(cls, path, field, col, train_df, val_df, test_df=None, bs=64, bptt=70, **kwargs):
        # split into train, val, and test datasets
        trn_ds, val_ds, test_ds = ConcatTextDatasetFromDataFrames.splits(text_field=field, col=col,
                                    train_df=train_df, val_df=val_df, test_df=test_df)

        return cls(path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs)

    @classmethod
    def from_text_files(cls, path, field, train, validation, test=None, bs=64, bptt=70, **kwargs):
        # split into train, val, and test datasets
        trn_ds, val_ds, test_ds = ConcatTextDataset_.splits(
                                    path, text_field=field, train=train, validation=validation, test=test)

        return cls(path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs)


class ConcatTextDataset_(torchtext.data.Dataset):
    def __init__(self, path, text_field, newline_eos=True, **kwargs):
        fields = [('text', text_field)]
        text = []
        if os.path.isdir(path): paths = glob(f'{path}/*.*')
        else: paths = [path]
        for p in paths:
            # the fix: open the file as UTF-8 explicitly instead of relying on the locale
            for line in open(p, encoding="utf-8"): text += text_field.preprocess(line)
            if newline_eos: text.append('<eos>')

        examples = [torchtext.data.Example.fromlist([text], fields)]
        super().__init__(examples, fields, **kwargs)

Maybe if you have a moment you could add an optional encoding parameter to the constructor and send in a PR?
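Something like this, perhaps (an untested sketch that just threads the parameter through to open()):

class ConcatTextDataset_(torchtext.data.Dataset):
    def __init__(self, path, text_field, newline_eos=True, encoding='utf-8', **kwargs):
        fields = [('text', text_field)]
        text = []
        if os.path.isdir(path): paths = glob(f'{path}/*.*')
        else: paths = [path]
        for p in paths:
            # encoding is now a constructor argument, defaulting to utf-8
            for line in open(p, encoding=encoding): text += text_field.preprocess(line)
            if newline_eos: text.append('<eos>')

        examples = [torchtext.data.Example.fromlist([text], fields)]
        super().__init__(examples, fields, **kwargs)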

It would only fix half the problem. When I went on with the notebook, I had another Unicode issue with torchtext.datasets.IMDB.splits, and the problem is very likely to just repeat with other libraries.
According to the Python developers (as I read in the Python Enhancement Proposal that I posted on another thread), the currently correct solution is to fix the environment.

I’ll try to provide a tutorial for correcting this error, as well as my Dockerfile with all the instructions.
