IMDB Test Data is different from what is shown in the Class

I think a recent change breaks this notebook. Maybe @wgpubs can suggest a fix.

I think you want to update the code to this:

md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

But I’m not at my machine to test it.
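For context, here is a rough sketch of how the surrounding setup would look, assuming the variable names from the lesson 4 notebook (untested, so treat it as a guess):

import torchtext
from torchtext import data
from fastai.nlp import *  # assumed source of LanguageModelData and spacy_tok in fastai 0.7

PATH = 'data/aclImdb/'                 # root of the IMDB data
TRN_PATH, VAL_PATH = 'train/all/', 'test/all/'
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
bs, bptt = 64, 70
TEXT = data.Field(lower=True, tokenize=spacy_tok)

md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)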


That’s it.

The codebase was modified so that LanguageModelData objects can be built from either text files or dataframes; from_text_files and from_dataframes are the class methods for each, respectively.
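Roughly, the calls look like this (a minimal sketch based on the signatures shown further down this thread; the paths and the 'review' column name are just illustrative):

# from text files: train/validation/test name files or directories under PATH
md = LanguageModelData.from_text_files(PATH, TEXT, train='train/all/',
                                       validation='test/all/', bs=64, bptt=70)

# from dataframes: col names the text column in each dataframe
md = LanguageModelData.from_dataframes(PATH, TEXT, col='review',
                                       train_df=train_df, val_df=val_df, bs=64, bptt=70)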

I’ve updated the notebook based on @wgpubs’s changes now, so if you git pull, it should work fine.

Thanks for doing this. I was just getting on today and was about to look at the notebooks when I saw that all was good.

@jeremy @wgpubs, I have done the latest git pull and am now getting a Unicode decode error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 796: ordinal not in range(128)"
@rob, I have the same line of code as you suggested, but it still has the issue.

Also, I keep getting the "Back to the Future" IMDB review, and the number of words seems to be wrong. I did exactly as suggested (moving files, etc.). See below. It would be great if you all could have a look and help.

@satheesh, regarding UTF-8, check the threads titled "Crestle"; one talks about encoding.

I think you need to git pull and/or check out the imdb notebook again to get the fix.

@rob, I have the latest code pull. I have checked those threads, but nothing worked. I am using Amazon's fastai AMI, which uses UTF-8 by default (see https://aws.amazon.com/amazon-linux-ami/faqs/). Not sure what's going on… two issues: the word counts are mismatched, and this ascii error…

Sorry I’m on my phone so can’t link the other thread. Did you find it?

I think you need to recreate the notebook, even after getting the latest pull. The Crestle author said the UTF-8 fix will work for new notebooks. If you have further trouble, I suggest @-mentioning him directly.

No worries, @rob. What do you mean by recreating the notebook? Copy-pasting each line, or just duplicating it? I am assuming this is the thread: Crestle - Spacy installation failed, which @anurag answered, but it does not mention anything about recreating.

@satheesh the other thread applies only to notebooks run on Crestle.

For the ascii issue with Amazon’s AMI, what is the output of the locale command?

@anurag, it seems to be UTF-8. Below is what I get.

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Got it. The only other thing I’d try is what’s recommended in that thread, since LC_ALL seems to be unset for you:

export LC_ALL="en_US.UTF-8"

If this works you can add it to your .bashrc.
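If you want to verify from inside Python, this is what I'd look at (just a sanity check, not a fix):

import locale
import sys

print(sys.getdefaultencoding())       # always 'utf-8' on Python 3
print(locale.getpreferredencoding())  # what open() uses when no encoding= is passed;
                                      # should report 'UTF-8' once the locale is fixed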

@anurag, I tried it, but I get the same error. Thanks for your input…

I’d Google those errors at the top of the locale output and see if there’s a solution.

Are you using an Amazon image that was made specifically for this course? If so, surely you’re not the only one who will encounter this issue.

I did the git pull and updated the conda environment. I ran the lesson 4 IMDB notebook. The first cell itself threw an error, which was fixed by running

python -m spacy download en

Then I ran all the cells and still encountered the same Unicode decode error described above.

Satheesh has the same issue. It has something to do with your environment not being set up to use UTF-8 by default. I don’t know the solution, other than trying what’s been discussed and linked in this thread, the other thread, and various results from Googling.

Several people have encountered this, though I haven’t seen a magic bullet yet.

Hi, I had this issue at the same place.

I modified the library to add the encoding where the files are opened. You can use the following snippet of code to check whether it solves your problem.
Please note that the name LanguageModelData_ is different from the library's, so use that name when calling it.

# Imports assumed for running this patch standalone (fastai 0.7-era codebase);
# fastai.nlp is assumed to provide LanguageModelLoader, RNN_Learner,
# get_language_model, SingleModel, to_gpu, and ConcatTextDatasetFromDataFrames.
import os
from glob import glob
import torchtext
from fastai.nlp import *

class LanguageModelData_():
    def __init__(self, path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs):
        self.bs = bs
        self.path = path
        self.trn_ds = trn_ds; self.val_ds = val_ds; self.test_ds = test_ds

        # build the vocab from the training set only
        field.build_vocab(self.trn_ds, **kwargs)

        self.pad_idx = field.vocab.stoi[field.pad_token]
        self.nt = len(field.vocab)

        self.trn_dl, self.val_dl, self.test_dl = [ LanguageModelLoader(ds, bs, bptt)
                                                    for ds in (self.trn_ds, self.val_ds, self.test_ds) ]

    def get_model(self, opt_fn, emb_sz, n_hid, n_layers, **kwargs):
        m = get_language_model(self.bs, self.nt, emb_sz, n_hid, n_layers, self.pad_idx, **kwargs)
        model = SingleModel(to_gpu(m))
        return RNN_Learner(self, model, opt_fn=opt_fn)

    @classmethod
    def from_dataframes(cls, path, field, col, train_df, val_df, test_df=None, bs=64, bptt=70, **kwargs):
        # split into train, val, and test datasets
        trn_ds, val_ds, test_ds = ConcatTextDatasetFromDataFrames.splits(text_field=field, col=col,
                                    train_df=train_df, val_df=val_df, test_df=test_df)

        return cls(path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs)

    @classmethod
    def from_text_files(cls, path, field, train, validation, test=None, bs=64, bptt=70, **kwargs):
        # split into train, val, and test datasets
        trn_ds, val_ds, test_ds = ConcatTextDataset_.splits(
                                    path, text_field=field, train=train, validation=validation, test=test)

        return cls(path, field, trn_ds, val_ds, test_ds, bs, bptt, **kwargs)


class ConcatTextDataset_(torchtext.data.Dataset):
    def __init__(self, path, text_field, newline_eos=True, **kwargs):
        fields = [('text', text_field)]
        text = []
        if os.path.isdir(path): paths = glob(f'{path}/*.*')
        else: paths = [path]
        for p in paths:
            # the fix: open the file as UTF-8 explicitly instead of relying on the locale
            for line in open(p, encoding="utf-8"): text += text_field.preprocess(line)
            if newline_eos: text.append('<eos>')

        examples = [torchtext.data.Example.fromlist([text], fields)]
        super().__init__(examples, fields, **kwargs)

Maybe if you have a moment you could add an optional encoding parameter to the constructor and send in a PR?
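Something like this, perhaps (an untested sketch that just threads the parameter through to open()):

class ConcatTextDataset_(torchtext.data.Dataset):
    def __init__(self, path, text_field, newline_eos=True, encoding='utf-8', **kwargs):
        fields = [('text', text_field)]
        text = []
        if os.path.isdir(path): paths = glob(f'{path}/*.*')
        else: paths = [path]
        for p in paths:
            # encoding is now a constructor argument, defaulting to utf-8
            for line in open(p, encoding=encoding): text += text_field.preprocess(line)
            if newline_eos: text.append('<eos>')

        examples = [torchtext.data.Example.fromlist([text], fields)]
        super().__init__(examples, fields, **kwargs)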

It would only fix half the problem. When I went on with the notebook, I had another Unicode issue with torchtext.datasets.IMDB.splits, and the problem is very likely to just repeat with other libraries.
According to the Python developers (as I read in the Python Enhancement Proposal that I posted on another thread), the currently correct solution is to fix the environment.

I’ll try to provide a tutorial for correcting this error, as well as my Dockerfile with all the instructions.
