spaCy is not able to download the “en” model. Can you please help me here? It needs admin permission.
Based on https://spacy.io/usage/#symlink-privilege, I’d recommend installing spacy in a virtualenv directory.
Thanks @anurag. But unfortunately I’m not able to do the same on Crestle. I’m getting an error saying the pip installer is not available.
Maybe I am doing something wrong. Let me give it a try and get back to you.
pip install spacy && python -m spacy download en
Those are the installation instructions shown on spacy.io.
I did that. Unfortunately it needs root permission to create the symlink, which I don’t have on Crestle.
Are you able to do a
conda install spacy?
I haven’t used Crestle, so I’m not sure how the environment is set up.
I am getting an error message saying the conda command is not available, so I am stuck there as well.
Me too. I’ve read that we should be able to fix it by calling spacy link (https://github.com/explosion/spaCy/issues/924), but it’s not clear to me what we’re supposed to link or where it’s supposed to go.
BTW don’t forget on Crestle to use
Also, I don’t think you need the full
en spaCy model - I think the tokenizer might work with no additional installation steps…
For now, I’d recommend trying manual download: https://spacy.io/usage/models#download-manual
I was able to get spaCy to load by changing the function to point to a manually downloaded copy:
spacy_en = spacy.load('~/courses/fastai2/courses/dl1/data/aclImdb/en_core_web_md-2.0.0')
but now I’m getting an error because it’s trying to read the data as ASCII, which is not what I’ve downloaded. I’m looking for a way to convert it over to UTF-8; any thoughts appreciated:
UnicodeDecodeError                        Traceback (most recent call last)
      1 FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
----> 2 md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

~/courses/fastai2/courses/dl1/fastai/nlp.py in __init__(self, path, field, train, validation, test, bs, bptt, **kwargs)
    193         self.trn_ds,self.val_ds,self.test_ds = ConcatTextDataset.splits(
    194             path, text_field=field, train=train, validation=validation, test=test)
--> 195         field.build_vocab(self.trn_ds, **kwargs)
    196         self.pad_idx = field.vocab.stoi[field.pad_token]
    197         self.nt = len(field.vocab)

/usr/local/lib/python3.6/dist-packages/torchtext/data/dataset.py in splits(cls, path, root, train, validation, test, **kwargs)
     67             path = cls.download(root)
     68         train_data = None if train is None else cls(
---> 69             os.path.join(path, train), **kwargs)
     70         val_data = None if validation is None else cls(
     71             os.path.join(path, validation), **kwargs)

~/courses/fastai2/courses/dl1/fastai/nlp.py in __init__(self, path, text_field, newline_eos, **kwargs)
    182         for p in paths:
    183             for line in open(p): text += text_field.preprocess(line)
--> 184             if newline_eos: text.append('<eos>')
    186         examples = [torchtext.data.Example.fromlist([text], fields)]

/usr/lib/python3.6/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3680: ordinal not in range(128)
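For reference, converting a file to UTF-8 is straightforward if you know the source encoding. This is only a minimal sketch - the filename, the helper name convert_to_utf8, and the latin-1 source encoding are all illustrative assumptions, not part of the thread:

```python
from pathlib import Path

def convert_to_utf8(path, source_encoding="latin-1"):
    """Rewrite a text file as UTF-8, decoding it from source_encoding first.

    NOTE: illustrative helper; pick source_encoding to match your actual data.
    """
    p = Path(path)
    text = p.read_text(encoding=source_encoding)
    p.write_text(text, encoding="utf-8")

# Hypothetical example: a latin-1 file containing 'café' ('é' is the single byte 0xe9)
sample = Path("sample_latin1.txt")
sample.write_bytes("café".encode("latin-1"))
convert_to_utf8(sample)
print(sample.read_bytes())  # b'caf\xc3\xa9' - 'é' is now the UTF-8 pair 0xc3 0xa9
```

(As it turns out later in the thread, the IMDB files are already UTF-8, so no conversion was actually needed - only the decoder being used had to change.)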
Someone else got that error too in this thread.
It worked when I ran the code on my home machine.
This is a common problem. The IMDB dataset text is already Unicode (UTF-8).
It’s just that your machine is trying to use the ASCII decoder, which won’t work.
When you open the file, you can explicitly specify the encoding to use.
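A minimal illustration of both the failure and the fix - the byte 0xc3 from the traceback above is the first byte of a UTF-8 multi-byte character such as 'é' (the filename here is a made-up example):

```python
from pathlib import Path

data = "Amélie is a great film".encode("utf-8")  # contains the byte 0xc3
p = Path("review.txt")
p.write_bytes(data)

# Decoding UTF-8 bytes with the ASCII codec reproduces the traceback's error:
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc3 ...

# Passing the encoding explicitly makes open() independent of the system locale:
with open(p, encoding="utf-8") as f:
    print(f.read())  # Amélie is a great film
```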
I wonder why any system would use ASCII by default. As far as I can tell from Googling, UTF-8 is the default for Python 3.6.
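There are actually two different “defaults” at play, which is where the confusion comes from: on Python 3, sys.getdefaultencoding() is always utf-8, but open() in text mode uses the locale’s preferred encoding instead - and that is what an ASCII/POSIX locale breaks. You can check both:

```python
import locale
import sys

# str.encode()/bytes.decode() with no argument always use this on Python 3:
print(sys.getdefaultencoding())  # utf-8

# open() in text mode uses this one instead; on a machine whose LANG/LC_*
# variables are unset or set to C/POSIX, it can come back as
# ANSI_X3.4-1968 (i.e. ASCII), producing exactly the error above:
print(locale.getpreferredencoding(False))
```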
It’s torchtext, I guess.
Based on that, it looks like the root cause is something in the environment since multiple (Crestle?) people are getting the error.
@anurag is the environment set up to use UTF-8 by default, as in the link Arvind mentions?
Turns out it isn’t. I’ll deploy an updated environment later today and post here.
Thanks @anurag for quickly attending to all issues! We are thankful for the wonderful service you provide.
All new notebooks will now use the en_US.UTF-8 locale as the default.
Up and running, thanks @anurag.