Crestle - spaCy installation failed

Hi @anurag
spaCy is not able to download the “en” model. Can you please help me here? It needs admin permission.

Based on https://spacy.io/usage/#symlink-privilege, I’d recommend installing spacy in a virtualenv directory.
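
If that route works on Crestle, here's a minimal sketch of the steps (python3, the venv module, and the ~/spacy-env path are assumptions about the instance; adjust as needed):

    python3 -m venv ~/spacy-env       # create an isolated environment (no root needed)
    source ~/spacy-env/bin/activate   # activate it
    pip install spacy                 # installs into the venv
    python -m spacy download en       # the symlink lands inside the venv, so no admin rights required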

Thanks @anurag. But unfortunately I’m not able to do the same in Crestle. I am getting an error saying the pip installer is not available.

Maybe I am doing something wrong. Let me give it a try and get back to you.

Try: pip install spacy && python -m spacy download en
spacy.io shows instructions for installing it.

I did that. Unfortunately it needs root permission to create the symlink, which I don’t have on Crestle.

Are you able to do a conda install spacy?

I haven’t used Crestle, so I’m not sure how the environment is set up.

I am getting an error message saying the conda command is not available. So I am stuck there as well.

Me too. I’ve read that we should be able to fix it by calling spacy link (https://github.com/explosion/spaCy/issues/924), but it’s not clear to me what we’re supposed to link or where it’s supposed to go :frowning:
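
For reference, spacy link takes the installed package name and the shortcut name you want to create; a sketch, assuming the en_core_web_md package that comes up further down this thread:

    python -m spacy link en_core_web_md en   # makes spacy.load('en') resolve to the installed package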

BTW don’t forget on Crestle to use pip3, not pip.

Also, I don’t think you need the full en spaCy model - I think the tokenizer might work with no additional installation steps…
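
A sketch of that idea, assuming spaCy 2.x (the plain language classes ship with the library itself, so there is nothing extra to download):

    from spacy.lang.en import English   # tokenizer-only pipeline, bundled with spaCy

    nlp = English()
    doc = nlp("No 'en' model download was needed for this.")
    print([t.text for t in doc])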

For now, I’d recommend trying manual download: https://spacy.io/usage/models#download-manual
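
A sketch of the manual route, assuming wget is available and using the en_core_web_md-2.0.0 archive mentioned later in this thread:

    # fetch the model archive straight from the spacy-models GitHub releases
    wget https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz
    tar -xzf en_core_web_md-2.0.0.tar.gz   # unpack, then point spacy.load() at the extracted directory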

I was able to get spaCy to load by changing the function to point to a manually downloaded copy:

spacy_en = spacy.load('~/courses/fastai2/courses/dl1/data/aclImdb/en_core_web_md-2.0.0')

but now I’m getting an error because it’s trying to read the data as ASCII, and that’s not what I’ve downloaded. I’m looking for a way to convert it over to UTF-8; any thoughts appreciated :slight_smile:


UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input> in <module>()
      1 FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
----> 2 md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

~/courses/fastai2/courses/dl1/fastai/nlp.py in __init__(self, path, field, train, validation, test, bs, bptt, **kwargs)
    193         self.trn_ds,self.val_ds,self.test_ds = ConcatTextDataset.splits(
    194             path, text_field=field, train=train, validation=validation, test=test)
--> 195         field.build_vocab(self.trn_ds, **kwargs)
    196         self.pad_idx = field.vocab.stoi[field.pad_token]
    197         self.nt = len(field.vocab)

/usr/local/lib/python3.6/dist-packages/torchtext/data/dataset.py in splits(cls, path, root, train, validation, test, **kwargs)
     67             path = cls.download(root)
     68         train_data = None if train is None else cls(
---> 69             os.path.join(path, train), **kwargs)
     70         val_data = None if validation is None else cls(
     71             os.path.join(path, validation), **kwargs)

~/courses/fastai2/courses/dl1/fastai/nlp.py in __init__(self, path, text_field, newline_eos, **kwargs)
    182         for p in paths:
    183             for line in open(p): text += text_field.preprocess(line)
--> 184             if newline_eos: text.append('<eos>')
    185
    186         examples = [torchtext.data.Example.fromlist([text], fields)]

/usr/lib/python3.6/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)[0]
     27
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3680: ordinal not in range(128)

Someone else got that error too in this thread.

It worked when I ran the code on my home machine.

This is a common problem. The IMDB dataset text is already Unicode (UTF-8).
It’s just that your machine is trying to use the ASCII decoder, which won’t work.
When you open the file, you can explicitly specify the encoding to use.
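
For example (a sketch; the path is just a placeholder):

    # force UTF-8 decoding regardless of the system locale
    with open('data/aclImdb/train/some_review.txt', encoding='utf-8') as f:
        text = f.read()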

I wonder why any system would use ASCII by default. As far as I can tell from Googling, UTF-8 is the default for Python 3.6.
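
For what it’s worth, Python 3’s open() defaults to the locale’s preferred encoding rather than a fixed UTF-8, so a C/POSIX locale falls back to ASCII. You can check what a given environment uses:

    import locale
    print(locale.getpreferredencoding())  # what open() uses when no encoding= is passed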

It’s torchtext, I guess.

Based on that, it looks like the root cause is something in the environment, since multiple (Crestle?) people are getting the error.

@anurag, is the environment set up to use UTF-8 by default, as in the link Arvind mentions?

Turns out it isn’t. I’ll deploy an updated environment later today and post here.

Thanks @anurag for quickly attending to all issues! We are thankful for the wonderful service you provide.

All new notebooks will now use the en_US.UTF-8 locale as the default.
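
For anyone on an older notebook that still shows the ASCII default, the usual workaround (an assumption about the shell setup, not official Crestle guidance) is to export the locale variables before starting Python:

    export LANG=en_US.UTF-8
    export LC_ALL=en_US.UTF-8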

Up and running, thanks @anurag.
