Hello everyone,

I am new to NLP and fastai. I was trying to follow along with Chapter 10 of the fastai book, but instead of the IMDB reviews I wanted to use Reddit data. The corpus is very large, and when I try to create the dataloaders I get the following error:

ValueError: [E088] Text of length 25282131 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

It seems my corpus has more characters than the SpacyTokenizer accepts by default. I searched online, and the most commonly suggested fix is to raise the limit:

    nlp = spacy.load('en_core_web_sm')
    nlp.max_length = 10**10

I ran these two lines before creating my DataBlock, but I still got the same error.

I looked through fastai2/text/ and it seems the SpacyTokenizer class creates its spaCy object with:

    nlp = spacy.blank('en', disable=["parser", "tagger", "ner"])

I tried doing

    nlp = spacy.blank('en', disable=["parser", "tagger", "ner"])
    nlp.max_length = 10**10

That didn’t fix it either. How can I get around this? I appreciate any pointers. Thanks!

EDIT: To clarify, the max_length still remains 10**6.

Hi Jay,

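I’m following the lesson 10 steps with a big corpus too and ran into the same problem. What solved it for me was to copy fastai’s SpacyTokenizer into my notebook, change it to set nlp.max_length = 25_000_000, wrap it in a Tokenizer, and then create the TextBlock manually, passing it that Tokenizer object. Note that the limit has to be larger than your longest text: your error reports 25,282,131 characters, so you would need a value above that rather than my 25_000_000.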
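As for why setting nlp.max_length in the notebook had no effect: judging from the SpacyTokenizer source, the class builds its own spaCy object inside __init__, so the object you changed is not the one fastai tokenizes with. Here is a toy sketch of that situation. FakeNlp and FakeSpacyTokenizer are hypothetical stand-ins I made up for illustration, not the real spaCy or fastai classes.

```python
# Toy stand-ins, NOT the real spaCy or fastai classes.
class FakeNlp:
    """Mimics only the `max_length` attribute of a spaCy pipeline object."""
    def __init__(self):
        self.max_length = 10**6  # the default limit quoted in the error message


class FakeSpacyTokenizer:
    """Mimics a tokenizer that creates its own pipeline object in __init__."""
    def __init__(self):
        self.nlp = FakeNlp()


# Raising the limit on an object created in the notebook ...
my_nlp = FakeNlp()
my_nlp.max_length = 10**10

# ... does not touch the separate object the tokenizer creates for itself.
tok = FakeSpacyTokenizer()
print(my_nlp.max_length)   # 10000000000
print(tok.nlp.max_length)  # 1000000
```

So the limit has to be changed on the object created inside the tokenizer, which is what the copied class does.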

Here’s my code:

    from fastai.text.all import *
    import spacy
    from spacy.symbols import ORTH

    # copied from
    class SpacyTokenizer25Mil():
        "Spacy tokenizer for `lang`"
        def __init__(self, lang='en', special_toks=None, buf_sz=5000):
            self.special_toks = ifnone(special_toks, defaults.text_spec_tok)
            nlp = spacy.blank(lang, disable=["parser", "tagger", "ner"])
            # raise the character limit; it must exceed the length of your longest text
            nlp.max_length = 25_000_000
            for w in self.special_toks: nlp.tokenizer.add_special_case(w, [{ORTH: w}])
            self.pipe,self.buf_sz = nlp.pipe,buf_sz

        def __call__(self, items):
            return (L(doc).attrgot('text') for doc in self.pipe(map(str,items), batch_size=self.buf_sz))

    # wrap the custom tokenizer in fastai's Tokenizer transform
    tkn = Tokenizer(SpacyTokenizer25Mil())

    # only pick up the .log files under `path`
    get_files_func = partial(get_files, extensions=['.log'])

    dls_lm = DataBlock(
        blocks=TextBlock(tok_tfm=tkn, is_lm=True, min_freq=3, max_vocab=60000),
        get_items=get_files_func, splitter=RandomSplitter(0.1)
    ).dataloaders(path, path=path, bs=128, seq_len=80)
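To pick a sensible value for max_length, you can measure your texts first, as the error message suggests with `len(text)`. Below is a rough sketch using only the standard library. The helper name max_text_length is my own invention, and I'm assuming your texts are UTF-8 files with a known extension somewhere under one folder (like my '.log' files).

```python
from pathlib import Path


def max_text_length(folder, extensions=(".log",), encoding="utf-8"):
    """Return (length, path) of the longest matching file, measured in characters.

    Hypothetical helper: walks `folder` recursively and reads each matching
    file fully into memory, so it can be slow on a very large corpus.
    """
    longest = (0, None)
    for p in Path(folder).rglob("*"):
        if p.is_file() and p.suffix in extensions:
            # count characters, not bytes, because the spaCy limit is in characters
            n = len(p.read_text(encoding=encoding, errors="replace"))
            if n > longest[0]:
                longest = (n, p)
    return longest
```

Calling `n, p = max_text_length(path)` gives you the longest text and where it lives. Set nlp.max_length to something above `n`, and keep in mind the memory warning in the error message only matters if the parser or NER are enabled.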

Hope it helps