I get the following error from the spaCy tokenizer:
return [t.text for t in self.tok.tokenizer(t)]
File "tokenizer.pyx", line 78, in spacy.tokenizer.Tokenizer.__call__
ValueError: [E025] String is too long: 1741046291 characters. Max is 2**30.
Any ideas on how to handle content that contains gigabytes of text? Splitting the content into smaller .txt files seems to cause a memory error, apparently because of the large number of files. Thank you.
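One direction that might avoid both problems, as a rough sketch rather than a tested fix: keep the text in a single file, read it in pieces well below the 2**30 limit from the error, and tokenize each piece as it arrives so that neither the whole text nor the whole token list has to sit in memory. The names `iter_chunks`, `tokenize_stream`, and `MAX_CHARS` are hypothetical, and the chunk size is an arbitrary assumption. The sketch also assumes it is acceptable to cut at whitespace, i.e. that no token spans a space or newline.

```python
import io
from typing import Callable, Iterator, List, TextIO

# Hypothetical chunk size: an arbitrary value far below the 2**30 limit in the error.
MAX_CHARS = 1_000_000


def iter_chunks(stream: TextIO, max_chars: int = MAX_CHARS) -> Iterator[str]:
    """Yield pieces of at most max_chars characters, cut after whitespace if possible."""
    carry = ""  # text left over from the previous cut
    while True:
        block = stream.read(max_chars - len(carry))
        if not block:  # end of input: flush whatever is left
            if carry:
                yield carry
            return
        buf = carry + block
        if len(buf) < max_chars:  # short read, keep accumulating
            carry = buf
            continue
        # Cut after the last space or newline so a word is not split in half.
        cut = max(buf.rfind(" "), buf.rfind("\n"))
        if cut <= 0:  # no usable whitespace: emit the full buffer as-is
            yield buf
            carry = ""
        else:
            yield buf[: cut + 1]
            carry = buf[cut + 1 :]


def tokenize_stream(
    stream: TextIO,
    tokenize: Callable[[str], List[str]],
    max_chars: int = MAX_CHARS,
) -> Iterator[str]:
    """Lazily yield tokens chunk by chunk instead of building one giant list."""
    for chunk in iter_chunks(stream, max_chars):
        yield from tokenize(chunk)


# Small demonstration with str.split standing in for the real tokenizer.
demo = io.StringIO("aaa bbb ccc ddd")
print(list(tokenize_stream(demo, str.split, max_chars=8)))
```

In the setup from the traceback, the `tokenize` argument would presumably be a small wrapper such as `lambda s: [t.text for t in self.tok.tokenizer(s)]`, and `stream` an open file handle. Consuming the generator in a loop, rather than converting it to a list, is what keeps memory use bounded.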