I am trying to create a databunch from wikitext103, which is ~530MB in size. But I am not able to create the databunch; each time a MemoryError occurs. I have 16GB of RAM and a 4GB GPU.
from fastai import *
from fastai.text import *
from pathlib import Path
path = Path('../../Data/Language_Model/wikitext103/')
data = TextLMDataBunch.from_folder(path)
Traceback (most recent call last):
  File "temp.py", line 6, in <module>
    data = TextLMDataBunch.from_folder(path)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/text/data.py", line 230, in from_folder
    src = src.label_for_lm() if cls==TextLMDataBunch else src.label_from_folder(classes=classes)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 441, in _inner
    self.process()
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 495, in process
    for ds,n in zip(self.lists, ['train','valid','test']): ds.process(xp, yp, name=n)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 667, in process
    self.x.process(xp)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 74, in process
    for p in self.processor: p.process(self)
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/data_block.py", line 41, in process
    def process(self, ds:Collection): ds.items = array([self.process_one(item) for item in ds.items])
  File "/home/kushaj/anaconda3/envs/PyTorch/lib/python3.7/site-packages/fastai/core.py", line 262, in array
    return np.array(a, dtype=dtype, **kwargs)
MemoryError
The same thing happens with
# data = TextLMDataBunch.from_folder(path)
data = (TextList.from_folder(path)
.split_by_folder('train', 'valid')
.label_const(0)
.databunch())
Watch your RAM during the processing. The tokenization process can be very memory intensive. You might need to process and train on chunks of the data at a time.
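One cheap way to watch your own process's memory from inside Python is the Unix-only `resource` module (a sketch; `psutil` gives finer-grained, cross-platform numbers if you have it installed):

```python
import resource  # Unix-only stdlib module

def peak_rss():
    # peak resident set size so far; Linux reports KB, macOS reports bytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
buf = [bytes(1_000_000) for _ in range(50)]  # allocate roughly 50 MB
after = peak_rss()
print(f"peak RSS grew from {before} to {after}")
```

Logging this number before and after each processing step makes it obvious which step is eating the RAM.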
It is a nice workaround (on my next re-install I will definitely do it). Just to clarify, I have not done much Kaggle, but when you say it is useful for Kaggle comps, do you mean we have no option but to load the whole dataset into memory?
Or is there some workaround? For example, if I have a 10GB csv file, do I have no option but to load the complete csv file into memory using pandas or dask and then work on it? Also, if the size of the csv exceeds the RAM size, would it throw an error?
You could definitely do that.
But for me it’s not too useful to add more functions for doing that.
I’m not suggesting my approach; it’s definitely a crude one, and 100GB of swap is pretty ridiculous. But I prefer doing that to avoid figuring out how to feed memory-sized batches to the model.
Our general batching approaches apply to the GPU VRAM and not necessarily to system RAM, so I try to avoid taking that on since it’s quite a challenge for me.
I store my text data in a csv file that I can load as a pandas dataframe.
When you run pd.read_csv you can pass a chunksize parameter that loads the csv as an iterator rather than putting the whole thing into memory. You can load say 100000 rows at once or whatever is a reasonable chunk. Then you can create a loop where you get a chunk of rows, create a dataloader, train, and repeat.
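The chunked loop above can be sketched like this (the column names and the per-chunk training step are placeholders; a tiny in-memory csv stands in for a large file on disk):

```python
import io
import pandas as pd

def iter_csv_chunks(csv_source, chunksize):
    # read_csv with chunksize returns an iterator of DataFrames instead of
    # one big frame, so only one chunk is held in RAM at a time
    yield from pd.read_csv(csv_source, chunksize=chunksize)

# tiny stand-in csv: 5 data rows
csv_text = "text,label\n" + "\n".join(f"row {i},0" for i in range(5))

sizes = []
for chunk in iter_csv_chunks(io.StringIO(csv_text), chunksize=2):
    # here you would build a databunch from `chunk` and train on it
    sizes.append(len(chunk))
print(sizes)  # [2, 2, 1]
```

With a real 10GB file you would pass the file path instead of the `StringIO` object and pick a chunksize that fits your RAM.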
If your csv can fit into memory but you’re struggling just with the tokenization process, you can probably load the whole thing into memory and just create dataloaders with a subsection at a time.
No, you don’t need to bring everything into memory at once; use batches for it. And if you just want to play with the dataframe, converting everything to the category dtype will drastically reduce your memory footprint.
@ecdrid I know you’re familiar with the dtypes as well.
Do you mind sharing some tips on how switching the types (from float to int, for example) is helpful?
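To illustrate the kind of savings in question (the exact numbers vary with your data, but the direction is reliable): downcasting numerics and switching low-cardinality string columns to `category` both shrink the frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": np.arange(10_000, dtype=np.float64),  # 8 bytes per value
    "city": ["london", "paris"] * 5_000,           # full Python string objects
})
before = df.memory_usage(deep=True).sum()

df["score"] = df["score"].astype(np.int16)   # values fit comfortably in int16
df["city"] = df["city"].astype("category")   # 2 unique labels -> tiny integer codes
after = df.memory_usage(deep=True).sum()

print(f"{before} bytes -> {after} bytes")
```

The caveat is that you must know the value range before downcasting; an int16 column silently overflows past 32767.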
I agree with Karl; another approach is to parse the file with dask. It is lazy and very efficient at utilizing all those cores, and I usually prefer it over pandas for larger files. As for the tokenization, it’s a RAM-heavy one-time process, but I think you could write a callback that tokenizes while parsing the in-memory df. I haven’t tried that myself, but it doesn’t sound far-fetched either. Also, as ecdrid said, explore type casting to reduce the memory footprint and use that in the callback.
Does anyone have a working fastai snippet for creating a databunch from the full wikitext103 dataset? Since that uncompressed dataset is only ~500MB, I’m surprised it creates so many memory problems, so I suspect the error relates to how the library is being used rather than the actual memory required to process this dataset.
I don’t know too much about memory in the tokenization process, but I consistently see large memory usage while processing that later gets released once tokenization is complete. I’m working with a dataset right now where processing ~250,000 rows of data sucks up 20 GB of RAM during tokenization, but only takes up about 2-3 GB of RAM once tokenization is complete. The memory load appears to be related to processing, not actually holding the tokenized product in memory.
The full dataset I’m working with has about 4,500,000 rows. I was doing a bit of work on a 8x K80 cluster that had almost 500 GB of RAM so I decided to process the entire dataset in one go. RAM usage peaked at 330 GB for processing, then dropped to about 80 GB post-processing.
Point being the memory issues seem to be related to some processing step rather than the final tokenized product. I haven’t looked too much into this but it would be interesting to profile memory usage during tokenization.
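For a rough profile from pure Python, `tracemalloc` can separate the peak during processing from what is retained afterwards (a toy stand-in for real tokenization, but the same pattern applies):

```python
import tracemalloc

def fake_tokenize(texts):
    # stand-in for real tokenization: builds large intermediate lists
    return [t.split() for t in texts]

texts = ["some sample text " * 20] * 5_000

tracemalloc.start()
tokens = fake_tokenize(texts)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"retained: {current} bytes, peak during processing: {peak} bytes")
```

Note `tracemalloc` only sees Python-level allocations; memory allocated inside C extensions (numpy, spacy) won't show up, so treat the numbers as a lower bound.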
Yes, there is a high peak of memory usage during the preprocessing, because Python is bad at multi-processing: the array with the results is copied across all the processes instead of being shared.
To reduce RAM usage, reduce the number of workers (but then it’s going to be slower).
When you were working through yours, did you max out your memory? I’ve found that if you try to tokenize too much at once and you don’t have enough memory to finish the process, it tends to stall and never complete. You might need to try working with a small chunk at a time.
Originally I was thinking that was the issue too but I spun up a larger cloud instance with a ton of memory (n1-highmem-96 with 624GB) and logged out how much was being used over time; it still had a few hundred GB of memory available throughout.
Do you know whether, if I do it in chunks, there’s a way to combine them afterwards?
I did end up training a language model with a ~4GB subsample and it works great. But I’m dreaming of how good it would be if it was trained on the other 60GB of data as well.
Not that I know of. I’ve been wondering the same thing. Is there a way to concatenate things together on the ItemList level or something? Right now I get around it by iterating through chunks of the data and training on a little bit at a time.
Anything is possible, but you will have to code it
As of now, the TextList loads everything in memory, but you could write a subclass that loads dynamically the texts you need.
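A minimal sketch of the idea (a hypothetical class, not the real fastai API): keep only the file paths in `items` and read each text from disk on demand, so RAM scales with the number of paths rather than the total text size.

```python
from pathlib import Path
import tempfile

class LazyTextList:
    "Holds paths in memory; reads the actual text only when an item is asked for."
    def __init__(self, paths):
        self.items = list(paths)   # lightweight: paths, not file contents

    def __len__(self):
        return len(self.items)

    def get(self, i):
        # the disk read happens here, one item at a time
        return Path(self.items[i]).read_text(encoding="utf-8")

# demo with a throwaway file
with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "doc0.txt"
    p.write_text("hello wikitext", encoding="utf-8")
    tl = LazyTextList([p])
    n, first = len(tl), tl.get(0)
print(n, first)
```

To plug this into fastai you would also need to override the methods the processors call, which is the non-trivial part.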
I’ve been trying to follow the code down through the layers to see if I can find where label_for_lm was getting stuck. The flow bops around several times between text/data.py and data_block.py and I think I’m losing the thread somewhere.
Can someone help me understand what’s going on?
In text/data.py, TextList.label_for_lm calls self.label_const(0) (which is defined in ItemList)
In data_block.py, ItemList.label_const sets kwargs['label_cls'] = LMLabelList and calls self.label_from_func(func=lambda o: const, label_cls=label_cls, **kwargs). That in turn calls self._label_from_list([func(o) for o in self.items], label_cls=label_cls, **kwargs), which does the following:
def _label_from_list(self, labels:Iterator, label_cls:Callable=None, from_item_lists:bool=False, **kwargs)->'LabelList':
    "Label `self.items` with `labels`."
    if not from_item_lists:
        raise Exception("Your data isn't split, if you don't want a validation set, please use `split_none`.")
    labels = array(labels, dtype=object)
    label_cls = self.get_label_cls(labels, label_cls=label_cls, **kwargs)
    y = label_cls(labels, path=self.path, **kwargs)
    res = self._label_list(x=self, y=y)
    return res
This ends up passing an array of the items to LMLabelList. But LMLabelList's constructor just calls EmptyLabelList's constructor, which flows through to ItemList's constructor, which just seems to initialize several variables without calling any other functions.
So I’m confused about where the “magic” is happening in label_for_lm. Where’s the actual meat of the code? It looks like it might be in reconstruct but I don’t see that being called from anywhere.