MemoryError (and abend) from 1.5MM text file LM



Hi,

I am building a language model on a corpus of 1.5 million text files. The individual files are small, averaging perhaps 200 bytes. I believe I’m on fastai v1.0.30.

With a stock data.py, the following line of notebook code (see below) fails to find any of the text files. It seems to be because the files in my corpus use the .out extension.
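One way to check the extension theory before touching library code is to count the files in the corpus by suffix. This is a minimal sketch using only the standard library; `count_extensions` is a hypothetical helper, not part of fastai, and `PATH` is assumed to be the corpus root used in the notebook.

```python
from collections import Counter
from pathlib import Path

def count_extensions(root):
    """Count regular files under `root` (recursively) by lowercase suffix."""
    return Counter(p.suffix.lower() for p in Path(root).rglob('*') if p.is_file())

# Usage (PATH is the corpus root from the notebook):
# print(count_extensions(PATH).most_common(5))
```

If nearly everything shows up under `.out`, a loader that only accepts a fixed set of extensions would be expected to find nothing.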

With fastai.text.data.py modified in 2 places, the notebook code (lmdata=…) begins to run and apparently finds my files. It keeps a CPU busy for a while (verified with htop), then abends with MemoryError. The two modifications:

text_extensions = {'.txt', '.out', '.doc'}  # added 2 more

def _join_texts(texts:Collection[str], mark_fields:bool=False):
    if not isinstance(texts, np.ndarray): texts = np.array(texts)
    if is1d(texts): texts = texts[:,None]
    df = pd.DataFrame({i:texts[:,i] for i in range(texts.shape[1])})
    #text_col = f'{BOS} {FLD} {1} ' + df[0] if mark_fields else f'{BOS} ' + df[0]
    text_col = f'{BOS} {FLD} {1} ' + df[0].astype(str) if mark_fields else f'{BOS} ' + df[0].astype(str)
    for i in range(1,len(df.columns)):
        #text_col += (f' {FLD} {i+1} ' if mark_fields else ' ') + df[i]
        text_col += (f' {FLD} {i+1} ' if mark_fields else ' ') + df[i].astype(str)
    return text_col.values

The notebook’s line of code:

lmdata = TextLMDataBunch.from_folder(path=PATH) # MemoryError

I suggest the fastai team create a similarly sized text-file corpus (mine is not public domain) and assess the library code’s capacity for R&D purposes. My host never exceeded about 5GB of allocated RAM while I was watching htop, and it has over 100GB of physical RAM.
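One possible explanation, and it is only a guess, for a MemoryError on a host that never shows much RAM in use is a single very large allocation that is refused up front. `_join_texts` calls `np.array(texts)`, and NumPy stores an array of Python strings at fixed width: every element takes 4 bytes per character of the longest string. The sketch below does that arithmetic in plain Python. `fixed_width_bytes` is a hypothetical helper and the sample sizes are illustrative, not measured from this corpus.

```python
def fixed_width_bytes(texts, bytes_per_char=4):
    """Bytes a NumPy unicode array of `texts` would need: count * longest * 4."""
    longest = max((len(t) for t in texts), default=0)
    return len(texts) * longest * bytes_per_char

# Illustrative only: 1000 short texts plus one long outlier.
texts = ['x' * 200] * 1000 + ['y' * 100_000]
print(fixed_width_bytes(texts))  # 1001 * 100000 * 4 = 400400000
```

By the same arithmetic, 1.5 million texts whose longest is 200 characters need about 1.2 GB, while a single 100,000-character outlier would push the same array to roughly 600 GB. Whether the whole corpus actually reaches `np.array` in one call here is unverified.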

HTH.

Update – I use the GPU, not the CPU, for deep learning, so maybe I can lower the batch size or bptt and avoid the MemoryError. Not sure. I’ll go find out now.

Update 2 – This particular MemoryError does not seem to be caused by a GPU memory limit.

TextLMDataBunch.batch_size = 16 # to reduce GPU memory need. Update: Does not fix it. Also, nvidia-smi shows no process.
lmdata = TextLMDataBunch.from_folder(path=PATH)
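That nvidia-smi shows no process fits a host-side failure: Python’s built-in MemoryError is raised for host (CPU) memory, whereas PyTorch reports GPU exhaustion as a RuntimeError mentioning CUDA out of memory. To record how much host memory the process had actually used when it failed, something like the following could wrap the call. It is a sketch assuming Linux, where `ru_maxrss` is reported in KiB; `run_and_report` is a hypothetical helper.

```python
import resource

def run_and_report(fn):
    """Call fn(); on MemoryError, print the process's peak resident memory and re-raise."""
    try:
        return fn()
    except MemoryError:
        # ru_maxrss is in KiB on Linux (bytes on macOS), so this conversion assumes Linux.
        peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f'host MemoryError; peak RSS so far: {peak_kib / 1024**2:.2f} GiB')
        raise

# Usage, with the call from the notebook:
# lmdata = run_and_report(lambda: TextLMDataBunch.from_folder(path=PATH))
```

A small peak alongside a MemoryError would point toward one oversized allocation request rather than gradual exhaustion.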