Have you thought about writing a custom tokenizer class that lazily reads each file from the disk as it’s needed?
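A rough sketch of what that could look like (names and the whitespace "tokenization" here are just placeholders, not a real tokenizer):

```python
from pathlib import Path

class LazyTokenizer:
    """Iterates over a corpus one file at a time.

    Nothing touches the disk until iteration starts, so only the
    current file's contents are ever held in memory.
    """

    def __init__(self, paths):
        self.paths = [Path(p) for p in paths]

    def __iter__(self):
        for path in self.paths:
            # Read only when this file's turn comes up.
            text = path.read_text()
            # Placeholder: split on whitespace; swap in your real tokenizer.
            yield text.split()
```

Because `__iter__` is a generator, you can feed the instance straight into a training loop and each file is read and tokenized just in time.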
If you want to use pandas specifically, the pd.read_* functions accept a chunksize argument, which makes them return an iterator that yields one chunk at a time instead of loading the whole file into memory.
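For example, with pd.read_csv (using an in-memory StringIO here to stand in for a large file on disk):

```python
import io
import pandas as pd

# Stand-in for a big CSV on disk; you'd pass a filename in practice.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

# chunksize=2 returns an iterator of DataFrames, two rows each,
# so only one chunk is in memory at a time.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["a"].sum()

print(total)  # -> 16 (1 + 3 + 5 + 7)
```

Each chunk is an ordinary DataFrame, so any per-chunk aggregation or filtering works as usual; you just combine the partial results at the end.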