ValueError: [E088] Text of length 70563384 exceeds maximum of 1000000

This error doesn't seem to be mentioned in any of the forum posts, but I ran into it when running the tokenizer on a custom dataset.

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/lib/python3.7/multiprocessing/", line 297, in _bootstrap
  File "/home/ec2-user/anaconda3/lib/python3.7/multiprocessing/", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/fastcore/", line 118, in _f_pg
    for i,b in enumerate(obj(batch)): queue.put((start_idx+i,b))
  File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/fastai/text/", line 136, in <genexpr>
    return (L(o).map(self.post_f) for o in self.tok(maps(*self.rules, batch)))
  File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/fastai/text/", line 122, in <genexpr>
    return (L(doc).attrgot('text') for doc in self.pipe(map(str,items), batch_size=self.buf_sz))
  File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/spacy/", line 829, in pipe
    for doc in docs:
  File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/spacy/", line 814, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/spacy/", line 465, in make_doc
    Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 70563384 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
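For what it's worth, spaCy raises E088 when a single document exceeds `nlp.max_length` (1,000,000 characters by default). The error message suggests raising that limit when the parser/NER aren't in the pipeline; another option is to split an oversized document into sub-limit chunks before tokenizing. Here's a rough sketch of the chunking approach — `chunk_text` is a hypothetical helper, not part of fastai or spaCy:

```python
def chunk_text(text, max_len=1_000_000):
    """Split text into pieces no longer than spaCy's default nlp.max_length,
    so each piece can be tokenized separately without triggering E088."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

long_text = "x" * 2_500_000            # stands in for one oversized document
chunks = chunk_text(long_text)
print(len(chunks))                     # → 3
print(max(len(c) for c in chunks))     # → 1000000
```

Note that naive character-based splitting can cut a chunk mid-word; splitting on the nearest whitespace or newline boundary would be safer for tokenization.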


I don't know much about the spaCy tokenizer, but someone ran into the same issue and posted it on Stack Overflow. Maybe that will give you some ideas?
