I am trying to create a databunch from wikitext103, which is ~530MB in size. But I am not able to create one; each time a MemoryError occurs. I have 16GB of RAM and a 4GB GPU.
```python
from fastai import *
from fastai.text import *
from pathlib import Path

path = Path('../../Data/Language_Model/wikitext103/')
data = TextLMDataBunch.from_folder(path)
```
```
Traceback (most recent call last):
  File "temp.py", line 6, in <module>
    data = TextLMDataBunch.from_folder(path)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/text/data.py", line 230, in from_folder
    src = src.label_for_lm() if cls==TextLMDataBunch else src.label_from_folder(classes=classes)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 441, in _inner
    self.process()
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 495, in process
    for ds,n in zip(self.lists, ['train','valid','test']): ds.process(xp, yp, name=n)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 667, in process
    self.x.process(xp)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 74, in process
    for p in self.processor: p.process(self)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 41, in process
    def process(self, ds:Collection): ds.items = array([self.process_one(item) for item
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/core.py", line 262, in array
    return np.array(a, dtype=dtype, **kwargs)
MemoryError
```
The same thing happens with:

```python
# data = TextLMDataBunch.from_folder(path)
data = (TextList.from_folder(path)
```
Watch your RAM during the processing. The tokenization process can be very memory intensive. You might need to process and train on chunks of the data at a time.
Yeah, the RAM is reaching its peak.
Can you tell me how to do it? My data is in .txt files.
My workaround for these issues was to re-install Ubuntu with a larger SWAP allocation (I went with 128GB of SWAP).
It’s especially useful when you’re trying Kaggle comps with tabular data; those can be very memory-hungry sometimes if you’re not careful.
It is a nice workaround (on my next re-install I will definitely do it). Just to clarify: I have not done much Kaggle, but when you say it is useful for Kaggle comps, do you mean we have no option but to load the whole dataset into memory?
Or is there some workaround? For example, if I have a 10GB csv file, do I have no option but to load the complete csv file into memory using pandas or dask and then work on it? Also, if the size of the csv exceeds the RAM size, would it throw an error?
You could definitely do that.
But for me it’s not too useful to add more functions for doing that.
I’m not suggesting my approach; it’s definitely a stupid one, and 100GB of SWAP is pretty ridiculous. But I prefer doing that over figuring out how to size the batches for the memory load going up to the model.
Our general batching approaches apply to the GPU VRAM and not necessarily to the RAM, so I try to avoid taking that on since it’s quite a challenge for me.
Here’s how I deal with it:
I store my text data in a csv file that I can load as a pandas dataframe.
When you run `pd.read_csv` you can pass a `chunksize` parameter that loads the csv as an iterator rather than putting the whole thing into memory. You can load, say, 100,000 rows at once, or whatever is a reasonable chunk. Then you can create a loop where you get a chunk of rows, create a dataloader, train, and repeat.
If your csv can fit into memory but you’re struggling just with the tokenization process, you can probably load the whole thing into memory and just create dataloaders with a subsection at a time.
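That chunked loop can be sketched roughly like this (the file name, chunk size, and the tiny demo csv are all placeholders; the training step is just a comment standing in for your databunch creation and `fit` calls):

```python
import pandas as pd

# Build a tiny demo csv so the sketch is self-contained;
# in practice this would be your real (large) csv.
pd.DataFrame({"text": [f"sample document {i}" for i in range(10)]}).to_csv(
    "texts.csv", index=False)

chunk_sizes = []
# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file into memory at once.
for chunk in pd.read_csv("texts.csv", chunksize=4):
    # In the real workflow: build a databunch/dataloader from `chunk`,
    # train on it, then move on to the next chunk.
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [4, 4, 2]
```

Only one chunk of rows is materialized at a time, so peak RAM is bounded by the chunk size rather than the file size.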
Ok, I think I got it. Thanks @KarlH, that is a really clever solution.
No, you don’t need to bring everything into memory at once; use batches for it. And if you just want to play with the df, convert everything to category — it will drastically reduce your memory footprint.
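To illustrate the category trick (the column name and values here are made up; the point is that a low-cardinality string column shrinks dramatically):

```python
import pandas as pd

# A low-cardinality string column: many rows, few distinct values.
df = pd.DataFrame({"city": ["delhi", "mumbai", "pune", "chennai"] * 25_000})

before = df["city"].memory_usage(deep=True)
# A categorical column stores one small integer code per row plus a tiny
# lookup table of the distinct values, instead of a Python string per row.
df["city"] = df["city"].astype("category")
after = df["city"].memory_usage(deep=True)

print(f"{before:,} bytes -> {after:,} bytes")
```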
@ecdrid I know you’re familiar with the dtypes as well.
Do you mind sharing some tips on how switching the types (from float to int, for example) is helpful?
I agree with Karl. Another approach is to parse the file with dask; it is lazy and very efficient at utilizing all those cores, and I usually prefer it over pandas for larger files. As for the tokenization, it’s a RAM-heavy, one-time process, but I think you could write a callback that tokenizes the in-memory df simultaneously while parsing. I haven’t tried that myself, but it doesn’t sound far-fetched either. Also, as ecdrid said, explore type casting to reduce the memory footprint and use that in the callback.
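On the type-casting point, here is a small sketch of what downcasting buys (the column is made up; `pd.to_numeric(..., downcast=...)` picks the smallest dtype that can hold the actual values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"clicks": np.arange(100_000, dtype=np.int64)})

before = df["clicks"].memory_usage(deep=True)
# Values up to 99,999 fit in int32, so downcasting halves the storage here.
df["clicks"] = pd.to_numeric(df["clicks"], downcast="integer")
after = df["clicks"].memory_usage(deep=True)

print(df["clicks"].dtype, before, after)
```

The same idea applies to floats (`downcast="float"`), with the usual caveat that you lose precision if the values don’t actually fit the smaller type.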
Does anyone have a working fastai snippet for creating a databunch from the full wikitext103 dataset? Since that uncompressed dataset is only 500MB, I’m surprised it creates so many memory problems, so I suspect that the error probably relates to the usage of the library and something simpler than the actual memory required to process this dataset.
I don’t know too much about memory in the tokenization process, but I consistently see large memory usage while processing that later gets released once tokenization is complete. I’m working with a dataset right now where processing ~250,000 rows of data sucks up 20 GB of RAM during tokenization, but only takes up about 2-3 GB of RAM once tokenization is complete. The memory load appears to be related to processing, not actually holding the tokenized product in memory.
The full dataset I’m working with has about 4,500,000 rows. I was doing a bit of work on an 8x K80 cluster that had almost 500 GB of RAM, so I decided to process the entire dataset in one go. RAM usage peaked at 330 GB for processing, then dropped to about 80 GB post-processing.
Point being the memory issues seem to be related to some processing step rather than the final tokenized product. I haven’t looked too much into this but it would be interesting to profile memory usage during tokenization.
Yes there is a high peak of memory usage during the preprocessing, because python sucks at multi-process, so the array with the results is copied across all the processes instead of being shared.
To reduce RAM usage, reduce the number of workers (but then it’s going to be slower).
@KarlH - how long did the tokenization end up taking for you? I was working on something similar and ended up giving up after waiting ~20 hours.
The memory usage issue was one thing but it also seemed to be getting slower and slower the farther it got through the dataset (mine is ~64GB).
I haven’t gotten to the root of the issue yet but there seems to be something >=O(n) going on in there.
Tokenizing the full dataset took about 2 hours.
When you were working through yours, did you max out your memory? I’ve found that if you try to tokenize too much at once and you don’t have enough memory to finish the process, it tends to stall and never complete. You might need to try working with a small chunk at a time.
Originally I was thinking that was the issue too but I spun up a larger cloud instance with a ton of memory (n1-highmem-96 with 624GB) and logged out how much was being used over time; it still had a few hundred GB of memory available throughout.
Do you know whether, if I do it in chunks, there’s a way to combine those afterwards?
I did end up training a language model with a ~4GB subsample and it works great. But I’m dreaming of how good it would be if it was trained on the other 60GB of data as well.
Not that I know of. I’ve been wondering the same thing. Is there a way to concatenate things together on the ItemList level or something? Right now I get around it by iterating through chunks of the data and training on a little bit at a time.
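One generic way to combine per-chunk results, assuming each chunk reduces to plain numpy arrays of token ids, is simple concatenation. This is not a fastai API; `fake_tokenize` is a stand-in for whatever tokenizer you actually use:

```python
import numpy as np

def fake_tokenize(texts):
    # Stand-in for real tokenization: one array of token ids per text.
    return [np.array([hash(w) % 1000 for w in t.split()]) for t in texts]

corpus = [f"some document number {i}" for i in range(10)]

# Tokenize in chunks, keeping only the (much smaller) token arrays;
# in practice each chunk's arrays could also be saved to .npy files.
pieces = []
for start in range(0, len(corpus), 4):
    pieces.extend(fake_tokenize(corpus[start:start + 4]))

# Concatenate into the single flat token stream a language model trains on.
all_tokens = np.concatenate(pieces)
print(all_tokens.shape)  # (40,)
```

The catch with fastai's pipeline is that the vocab must be shared across chunks, so you'd want to fix the vocabulary up front rather than let each chunk build its own.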
Anything is possible, but you will have to code it.
As of now, `TextList` loads everything into memory, but you could write a subclass that dynamically loads the texts you need.
I’ve been trying to follow the code down through the layers to see if I can find where `label_for_lm` was getting stuck. The flow bops around `data_block.py` several times, and I think I’m losing the thread somewhere.
Can someone help me understand what’s going on?
`label_for_lm` sets `kwargs['label_cls'] = LMLabelList` and calls `self.label_const(0)`, which calls `self.label_from_func(func=lambda o: const, label_cls=label_cls, **kwargs)`, which calls `self._label_from_list([func(o) for o in self.items], label_cls=label_cls, **kwargs)`, which does the following:
```python
def _label_from_list(self, labels:Iterator, label_cls:Callable=None, from_item_lists:bool=False, **kwargs)->'LabelList':
    "Label `self.items` with `labels`."
    if not from_item_lists:
        raise Exception("Your data isn't split, if you don't want a validation set, please use `split_none`.")
    labels = array(labels, dtype=object)
    label_cls = self.get_label_cls(labels, label_cls=label_cls, **kwargs)
    y = label_cls(labels, path=self.path, **kwargs)
    res = self._label_list(x=self, y=y)
```
This ends up passing an array of the labels to `LMLabelList`'s constructor, which just calls `EmptyLabelList`'s constructor, which flows through to `ItemList`'s constructor, which just seems to initialize several variables and doesn’t call any other functions.
So I’m confused about where the “magic” is happening in `label_for_lm`. Where’s the actual meat of the code? It looks like it might be in `reconstruct`, but I don’t see that being called from anywhere.