How can I increase the SpacyTokenizer max_length?

Hello everyone,

I am new to NLP and fastai. I was trying to follow along with Chapter 10 in the fastai book, but instead of using the IMDB reviews, I wanted to use reddit data. The corpus itself is very large. When I try to create a dataloader, it gives me the following error:

    Process Process-1:
    Traceback (most recent call last):
      File "/home/jayanth/anaconda3/envs/fastai_nlp/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/home/jayanth/anaconda3/envs/fastai_nlp/lib/python3.8/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/home/jayanth/anaconda3/envs/fastai_nlp/lib/python3.8/site-packages/fastcore/parallel.py", line 118, in _f_pg
        for i,b in enumerate(obj(batch)): queue.put((start_idx+i,b))
      File "/home/jayanth/anaconda3/envs/fastai_nlp/lib/python3.8/site-packages/fastai/text/core.py", line 136, in <genexpr>
        return (L(o).map(self.post_f) for o in self.tok(maps(*self.rules, batch)))
      File "/home/jayanth/anaconda3/envs/fastai_nlp/lib/python3.8/site-packages/fastai/text/core.py", line 122, in <genexpr>
        return (L(doc).attrgot('text') for doc in self.pipe(map(str,items), batch_size=self.buf_sz))
      File "/home/jayanth/anaconda3/envs/fastai_nlp/lib/python3.8/site-packages/spacy/language.py", line 829, in pipe
        for doc in docs:
      File "/home/jayanth/anaconda3/envs/fastai_nlp/lib/python3.8/site-packages/spacy/language.py", line 814, in <genexpr>
        docs = (self.make_doc(text) for text in texts)
      File "/home/jayanth/anaconda3/envs/fastai_nlp/lib/python3.8/site-packages/spacy/language.py", line 464, in make_doc
        raise ValueError(
    ValueError: [E088] Text of length 25282131 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

It seems the number of characters in my corpus is higher than what the SpacyTokenizer can handle. I searched online, and the most commonly suggested fix is:

    nlp = spacy.load('en_core_web_sm')
    nlp.max_length = 10**10

I tried running this code before I created my DataBlock, and it still didn’t fix it.

I looked through fastai2/text/core.py and it seems the SpacyTokenizer class initializes spaCy with:

    nlp = spacy.blank('en', disable=["parser", "tagger", "ner"])

I tried doing

    nlp = spacy.blank('en', disable=["parser", "tagger", "ner"])
    nlp.max_length = 10**10

and it still didn’t work. How can I fix this? I appreciate any pointers. Thanks!

EDIT: To clarify, the `max_length` still remains 10**6.
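As far as I can tell, `max_length` is a per-instance setting, and since fastai’s SpacyTokenizer builds its own `spacy.blank(...)` internally, raising the limit on a separately created `nlp` object has no effect on it. A minimal check (assuming a default spaCy install):

    import spacy

    # max_length is an attribute of each Language instance, not a global setting
    nlp = spacy.blank('en')
    nlp.max_length = 10**10
    print(nlp.max_length)                 # 10000000000 -- only on this instance

    # a fresh instance (like the one SpacyTokenizer creates internally)
    # still starts at the default limit of 1,000,000 characters
    print(spacy.blank('en').max_length)   # 1000000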

Hi Jay,

I’m following the lesson 10 steps with a big corpus too and ran into the same problem as you. What I did to solve it was to copy/paste fastai’s SpacyTokenizer into my notebook, tune it to use nlp.max_length = 25_000_000, wrap it in a Tokenizer, and then manually create a TextBlock, passing it the Tokenizer object.

Here’s my code:

    from fastai.text.all import *
    import spacy
    from spacy.symbols import ORTH

    # copied from https://github.com/fastai/fastai/blob/ab0c2fe0d54895ddca27b91eb128b8599ba140d3/fastai/text/core.py#L113
    class SpacyTokenizer25Mil():
        "Spacy tokenizer for `lang`"
        def __init__(self, lang='en', special_toks=None, buf_sz=5000):
            self.special_toks = ifnone(special_toks, defaults.text_spec_tok)
            nlp = spacy.blank(lang, disable=["parser", "tagger", "ner"])
            nlp.max_length = 25_000_000
            for w in self.special_toks: nlp.tokenizer.add_special_case(w, [{ORTH: w}])
            self.pipe,self.buf_sz = nlp.pipe,buf_sz

        def __call__(self, items):
            return (L(doc).attrgot('text') for doc in self.pipe(map(str,items), batch_size=self.buf_sz))

    tkn = Tokenizer(SpacyTokenizer25Mil())

    get_files_func = partial(get_files, extensions=['.log'])

    dls_lm = DataBlock(
        blocks=TextBlock(tok_tfm=tkn, is_lm=True, min_freq=3, max_vocab=60000),
        get_items=get_files_func, splitter=RandomSplitter(0.1)
    ).dataloaders(path, path=path, bs=128, seq_len=80)

Hope it helps 🙂


Thank you @cristian.c, that helped.
For people using the current version, there is a slight change:

    # due to spacy max length https://forums.fast.ai/t/how-can-i-increase-the-spacytokenizer-max-length/85991/2
    from fastai.text.all import *

    class SpacyTokenizer25Mil():
        "Spacy tokenizer for `lang`"
        def __init__(self, lang='en', special_toks=None, buf_sz=5000):
            import spacy
            from spacy.symbols import ORTH
            self.special_toks = ifnone(special_toks, defaults.text_spec_tok)
            nlp = spacy.blank(lang)
            nlp.max_length = 25_000_000
            for w in self.special_toks: nlp.tokenizer.add_special_case(w, [{ORTH: w}])
            self.pipe,self.buf_sz = nlp.pipe,buf_sz

        def __call__(self, items):
            return (L(doc).attrgot('text') for doc in self.pipe(map(str,items), batch_size=self.buf_sz))
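For completeness, this updated class plugs into the same DataBlock setup as the earlier post (a sketch only; the file extensions, paths, and DataBlock settings below are the earlier post’s example values and will depend on your data):

    tkn = Tokenizer(SpacyTokenizer25Mil())

    dls_lm = DataBlock(
        blocks=TextBlock(tok_tfm=tkn, is_lm=True, min_freq=3, max_vocab=60000),
        get_items=partial(get_files, extensions=['.log']),
        splitter=RandomSplitter(0.1)
    ).dataloaders(path, path=path, bs=128, seq_len=80)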

Hi,
Thank you, this also solved my problem. One observation with Colab: if you copy your Google Drive training and test data to /content/sample_data, it seems to go much faster. Obviously it will be lost at the end of the session.
Regards, Conwyn
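A minimal sketch of that copy step, assuming the data sits in a hypothetical Drive folder named reddit_data (adjust the paths to your own layout):

    from google.colab import drive
    import shutil

    # mount Google Drive, then copy the dataset onto Colab's local disk,
    # which is much faster to read from than the Drive mount
    drive.mount('/content/drive')
    shutil.copytree('/content/drive/MyDrive/reddit_data',  # hypothetical source folder on Drive
                    '/content/sample_data/reddit_data')    # local storage, lost when the session ends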