How to pickle nlp Dataset

mintwurm · November 21, 2018, 11:00pm

I’m working with your LanguageModelData dataset.

But every time such a dataset is created, it builds the vocabulary. That means scanning the entire dataset and counting the occurrences of each token.
Just running the code is fine (painfully slow but ok).
But debugging is simply impossible.
If I want to debug-step through the language model, the dataset has to be created first.
And Python in debug mode is just too slow.
Instead of maybe a minute to build the vocabulary, it takes roughly 20 minutes.
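
For reference, the vocabulary build described above is roughly this kind of pass. This is only a minimal sketch, not the library's actual code: `build_vocab`, `min_freq`, `itos` and `stoi` are hypothetical names chosen for illustration.

```python
from collections import Counter


def build_vocab(tokens, min_freq=1):
    """Count every token once and derive index mappings from the counts.

    Hypothetical helper for illustration; it is not part of any library.
    """
    # One full pass over the data: this is the step that gets slow under a debugger.
    counts = Counter(tokens)
    # Most frequent tokens first; drop tokens rarer than min_freq.
    itos = [tok for tok, n in counts.most_common() if n >= min_freq]
    # Reverse mapping from token to integer index.
    stoi = {tok: i for i, tok in enumerate(itos)}
    return counts, itos, stoi


if __name__ == "__main__":
    counts, itos, stoi = build_vocab("the cat sat on the mat the end".split())
    print(counts["the"], itos[0], stoi["the"])
```

The whole corpus has to be visited before a single batch can be produced, which is why the cost shows up on every run.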

To debug my model, I need to have data to feed it.
To have data for the model, I need the dataset.
But building it every time is not possible.
So I would like to save it.
Pickle doesn’t work, since it can’t save generators.
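
To make the problem concrete, here is a small sketch of the pickle failure, plus one possible way around it: pickle only the token counts (plain data) instead of the whole dataset object. This assumes the expensive part really is the counting pass. All names here (`token_stream`, `generator_is_picklable`, `load_or_build_counts`, `cache_path`) are made up for the example and are not library functions.

```python
import os
import pickle
from collections import Counter


def token_stream(lines):
    """Yield whitespace-separated tokens lazily (a generator)."""
    for line in lines:
        yield from line.split()


def generator_is_picklable(gen):
    """Return False if pickle rejects the object, which it does for generators."""
    try:
        pickle.dumps(gen)
        return True
    except TypeError:
        return False


def load_or_build_counts(lines, cache_path):
    """Load cached token counts if present, otherwise count once and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Consuming the generator into a Counter leaves only plain, picklable data.
    counts = Counter(token_stream(lines))
    with open(cache_path, "wb") as f:
        pickle.dump(counts, f)
    return counts
```

With something like this, the slow pass would run once outside the debugger, and later debug sessions would only pay for loading the cache file. The cache is not invalidated when the data changes, so the file has to be deleted by hand in that case.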

How are you dealing with this?
Actually, LanguageModelData seems deprecated (it is only accessible in the /old directory).
Is there an updated version that doesn’t have this problem?