Understanding LanguageModelData for NLP

Hello everyone,

I’m currently having difficulties with LanguageModelData. I’m trying to replicate Jeremy’s results but haven’t managed to.

I managed to find where my notebook fails. Inside LanguageModelData, there is a class called ConcatTextDataset. Inside ConcatTextDataset, text_field.preprocess (aka TEXT.preprocess) is called and receives the contents of imdbEr.txt and imdb.vocab. I don’t understand why it gets those files, and I don’t understand what the preprocess method is doing.

Has anyone managed to replicate the results?



I was able to run it. Do you have the dataset? I don’t think imdb.vocab or imdbEr.txt are used by fastai code, they’re just part of the data.

Probably worth sharing your code and a screenshot or gist of the error you get so we can see what’s going on.

My data is structured as in the notebook: README imdb.vocab imdbEr.txt test/ train/

Here is the path I provide:

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'

I get a Unicode error. So I checked inside the class and saw that text_field.preprocess gets the content of imdb.vocab and imdbEr.txt.

Do you know what text_field.preprocess is supposed to get?


UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2253: ordinal not in range(128)

Please include a screenshot of where you see the error, otherwise it’s really hard to know what’s going on.


TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'

README imdb.vocab imdbEr.txt test/ train/

TEXT = data.Field(lower=True, tokenize=spacy_tok)
bs=64; bptt=70
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 916: ordinal not in range(128)
UnicodeDecodeError Traceback (most recent call last)
in ()
----> 1 md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

~/app/DeepLogs/fastai/courses/dl1/fastai/nlp.py in __init__(self, path, field, train, validation, test, bs, bptt, **kwargs)
    191         self.path,self.bs = path,bs
    192         self.trn_ds,self.val_ds,self.test_ds = ConcatTextDataset.splits(
--> 193             path, text_field=field, train=train, validation=validation, test=test)
    194         field.build_vocab(self.trn_ds, **kwargs)
    195         self.pad_idx = field.vocab.stoi[field.pad_token]

/usr/local/lib/python3.6/dist-packages/torchtext/data/dataset.py in splits(cls, path, root, train, validation, test, **kwargs)
     67             path = cls.download(root)
     68         train_data = None if train is None else cls(
---> 69             os.path.join(path, train), **kwargs)
     70         val_data = None if validation is None else cls(
     71             os.path.join(path, validation), **kwargs)

~/app/DeepLogs/fastai/courses/dl1/fastai/nlp.py in __init__(self, path, text_field, newline_eos, **kwargs)
    180         else: paths=[path]
    181         for p in paths:
--> 182             for line in open(p): text += text_field.preprocess(line)
    183             if newline_eos: text.append('<eos>')

/usr/lib/python3.6/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)[0]
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 916: ordinal not in range(128)

If you get this Unicode error, you can actually specify your encoding format when you open the file.

In the open function, pass encoding='utf-8'.

open() is called in nlp.py, and also (optionally) within torchtext itself if you choose to build your own IMDB TEXT field instead of downloading a second time; see the Arxiv notebook for hints.
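To make the suggestion concrete, here’s a minimal sketch of reading a file with an explicit encoding. The file name and contents are made up for illustration; the point is only that passing encoding='utf-8' makes the read independent of the locale, so the ascii codec is never used:

```python
import tempfile
from pathlib import Path

# Write a review containing non-ASCII characters. The byte 0xc3 from the
# traceback is the first byte of a UTF-8 encoded accented character.
path = Path(tempfile.mkdtemp()) / 'review.txt'  # hypothetical file
path.write_text('Amélie était très bien', encoding='utf-8')

# Passing encoding explicitly decodes with UTF-8 regardless of the
# environment's locale, avoiding the ascii-codec UnicodeDecodeError.
text = open(path, encoding='utf-8').read()
print(text)
```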


Thanks very much, but I still think the issue is that text_field.preprocess is not getting what it should.

Maybe I did not get the data from the same place as you. Where did you load the data from?

I used the link in the NB and untarred the tarball

I did have to copy all the files into an ‘all’ directory - there’s an earlier thread here on #part1v2-beg discussing this.

I also made the ‘all’ directory but only put a few hundred pos and neg text files inside it. The quality of the embeddings/weights was still good enough that I could get close to 90% accuracy after transferring the encoder to the imdb sentiment analysis task.

I’ll try to find out the minimum number of text files required to learn embeddings such that we can get 94.5 on the sentiment task.


Hi @jeremy did you also copy the files from the unsup directory into all?

I’ve posted a new version of the notebook which now links to an archive that includes the ‘all’ folder.


So, eventually, @narvind2003, you were absolutely right.
I found the bug but did not narrow down the underlying reason; very likely it’s because I’m running all my tools inside an Nvidia-Docker container.

I needed to change the open call to for line in open(p, encoding="utf-8").
I also added a tqdm for debugging purposes.

import os
from glob import glob
import torchtext
from tqdm import tqdm

class ConcatTextDataset(torchtext.data.Dataset):
    def __init__(self, path, text_field, newline_eos=True, **kwargs):
        fields = [('text', text_field)]
        text = []
        if os.path.isdir(path): paths = glob(f'{path}/*.*')
        else: paths = [path]
        for p in tqdm(paths):
            # Force utf-8 so the locale's default codec (ascii here) isn't used
            for line in open(p, encoding="utf-8"): text += text_field.preprocess(line)
            if newline_eos: text.append('<eos>')

        examples = [torchtext.data.Example.fromlist([text], fields)]
        super().__init__(examples, fields, **kwargs)

I found the underlying reason for this Unicode bug.

As answered on this error and this PEP scheduled for 3.7, open() still uses whatever encoding it infers from the environment. In my case, my Docker image is based on Python 2.7, but Jupyter runs a 3.6 kernel and I’m not using a virtualenv. Therefore my library was built under 3.6 assumptions but gets a 2.7 environment variable.

Most people’s problems with this encoding probably come from similar environment issues.
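You can check this for yourself: open() without an encoding argument uses the locale’s preferred encoding, and the traceback’s failure is easy to reproduce directly. A small sketch, nothing fastai-specific:

```python
import locale

# open() without encoding= defaults to the locale's preferred encoding;
# in a misconfigured container this can come back as ASCII
# (often reported as 'ANSI_X3.4-1968').
print(locale.getpreferredencoding(False))

# Reproducing the traceback's failure: 0xc3 starts a UTF-8 multi-byte
# sequence (here the two bytes encoding 'é'), which ascii cannot decode.
try:
    b'\xc3\xa9'.decode('ascii')
except UnicodeDecodeError as e:
    print(type(e).__name__)  # UnicodeDecodeError
```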

I managed to make it work by rebuilding the Docker image without Python 2.7. I’ll release the Docker image very soon, as I need to do some refactoring on it first.


Solution: original post by @edwardjross

import locale
locale.setlocale(locale.LC_ALL, 'C.UTF-8')
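A hedged note on this fix: setting the locale changes the preferred encoding that open() falls back to when no encoding is given. The 'C.UTF-8' locale is common on Debian/Ubuntu-based images but is not guaranteed to exist everywhere, so a sketch with a fallback looks like this (the original post just calls setlocale directly):

```python
import locale

# Force a UTF-8 locale so that open() without an explicit encoding=
# argument defaults to UTF-8 instead of ASCII. 'C.UTF-8' may be missing
# on some systems, hence the fallback to the environment's own locale.
try:
    locale.setlocale(locale.LC_ALL, 'C.UTF-8')
except locale.Error:
    locale.setlocale(locale.LC_ALL, '')

print(locale.getpreferredencoding(False))
```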