Are you making sure embedding size is restricted to 50 or less? I was having similar problems, but after doing this I was able to create a learner object and train on the full dataset. You can also decrease batch_size (default 64) and increase num_workers (default 4) in your databunch to see if that helps.
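To get a rough feel for why capping the embedding width matters, here is a small sketch of the arithmetic. The column names, cardinalities, and the half-the-cardinality heuristic are all made up for illustration and are not fastai's own rule; the estimate covers the float32 weights only.

```python
def embedding_table_mb(n_categories, emb_size, bytes_per_value=4):
    """Approximate size in MB of one float32 embedding table (weights only)."""
    return n_categories * emb_size * bytes_per_value / 1024**2


def cap_emb_sizes(cardinalities, max_size=50):
    """Pick an embedding width per column, never wider than max_size.

    The (n + 1) // 2 heuristic is only illustrative.
    """
    return {name: min(max_size, (n + 1) // 2) for name, n in cardinalities.items()}


# Hypothetical cardinalities, for illustration only
cardinalities = {'high_card_col': 1_000_000, 'low_card_col': 6}
sizes = cap_emb_sizes(cardinalities)
for name, n in cardinalities.items():
    print(name, sizes[name], round(embedding_table_mb(n, sizes[name]), 2), 'MB')
```

Gradients and optimizer state come on top of this, so the real footprint of a wide embedding on a high-cardinality column is larger than the printed figure.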
Apologies in advance for this not being a “real” answer (I don’t know the generalized answer but I’m interested as well; subscribing to this topic).
To get over this hurdle in the near term, why not upgrade to 32GB of memory if your system supports it? It’s relatively inexpensive in the long run, and if you calculate how much time it would save you, for any reasonable valuation of your time it’s probably worth it.
(I know it’s a hard value prop to accept; I’m wrestling with the same thing personally for upgrading to a 2080 Ti… I know that objectively the time savings over its lifetime would make it well worth the investment but I haven’t been able to pull the trigger yet.)
One of the few approaches that I found in the Microsoft Malware Prediction comp is to load the df in chunks: process a chunk, write it out, and then read in the next chunk. This may not be possible depending on your use case.
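A minimal sketch of that chunked read/process/write loop, using the chunksize argument of pandas read_csv. The in-memory CSV and the process function are stand-ins I made up; with a real file you would pass a path and append to an output file instead.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a file too large to load at once
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))


def process(chunk):
    """Placeholder per-chunk transformation (hypothetical)."""
    chunk = chunk.copy()
    chunk["value_sq"] = chunk["value"] ** 2
    return chunk


out = io.StringIO()
for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=4)):
    # Write the header only for the first chunk, then keep appending
    process(chunk).to_csv(out, index=False, header=(i == 0))

# Read the processed output back to check it
result = pd.read_csv(io.StringIO(out.getvalue()))
print(result.shape)
```

This only works when each chunk can be processed independently; anything that needs global statistics (means, category lists) needs an extra pass first.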
I definitely agree with @yeldarb that this would save you the pain of writing these loops every time and of keeping your imagination limited to 16GB batch sizes. Another idea would be to increase the swap on your Ubuntu machine. I have a 16GB laptop with 120 GB of swap; it has an M.2 drive, and keeping the swap on there is quite helpful. (It is not helpful when you just have an HDD.)
Dask (http://docs.dask.org/en/latest/) is designed for handling dataframes that won’t fit into memory. I’ve not tested it with the fastai library, but it may be worth looking into.
My understanding is that Dask isn’t suited to this task because the fastai library requires Datasets to implement a __getitem__ function that retrieves an element at an arbitrary position, and Dask doesn’t support this. However, someone here suggested that the zarr package might work.
Hmm I tried this but I ran into problems because the embeddings that were getting created were of different sizes depending on the chunk that I loaded. (The embeddings are created by looking at all of the categories in the current chunk) It’s very likely I was doing something wrong though.
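One way around the changing category sets, sketched under the assumption that two passes over the file are acceptable: collect the union of categories first, then encode every chunk against that fixed list with pd.CategoricalDtype. The tiny CSV and the column name are made up for illustration.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a large file (illustrative data)
csv_text = "color\nred\nblue\nred\ngreen\nblue\ngreen\n"

# Pass 1: collect the union of categories over all chunks
categories = set()
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    categories.update(chunk["color"].dropna().unique())
color_dtype = pd.CategoricalDtype(sorted(categories))

# Pass 2: encode every chunk against the same fixed category list
codes_per_chunk = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    codes = chunk["color"].astype(color_dtype).cat.codes
    codes_per_chunk.append(codes.tolist())

# The embedding size can now be derived once, from the full category list
n_categories = len(color_dtype.categories)
print(n_categories, codes_per_chunk)
```

With the dtype fixed up front, a given category gets the same integer code in every chunk, so the embedding table can be sized once.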
I’ve been playing around with this myself, but I can’t figure out how to properly subclass the TabularList. What I have done is subclass the torch Dataset, randomly serving up an entire partition of a dask dataframe as a batch. I preprocess the dataframe, which takes time, but should work with out of memory dataframes.
class DaskPartDataset(Dataset):
    def __init__(self, df, target_col, cat_names):
        self.df = df
        self.target = target_col
        self.cat_names = cat_names
        self.cats = {}

        # Process categories
        for n in self.cat_names + [self.target]:
            self.df[n] = self.df[n].astype('category').cat.as_known()
            self.cats[n] = self.df[n].cat.categories
            self.df[n] = self.df[n].cat.codes.astype(np.int64)

        # Process continuous
        self.cont_names = list(set(self.df.columns) - set(self.cat_names) - set([self.target]))
        medians, means, stds = {}, {}, {}
        for n in self.cont_names:
            medians[n] = self.df[n].quantile(0.5)
            means[n] = self.df[n].mean()
            stds[n] = self.df[n].std()
            self.df[n] = self.df[n].astype(np.float32)
        self.medians, self.means, self.stds = dask.compute(medians, means, stds)

    def __len__(self):
        return self.df.npartitions

    def __getitem__(self, i):
        print(i)
        df = self.df.get_partition(i).compute()
        # Process continuous on each partition
        for n in self.cont_names:
            df[n] = df[n].fillna(self.medians[n])
            df[n] = (df.loc[:,n] - self.means[n]) / (1e-7 + self.stds[n])
        x_cont = df[self.cont_names].values
        x_cat = df[self.cat_names].values
        y = df[self.target].values
        print('Done', df.shape)
        return [tensor(x_cat), tensor(x_cont)], tensor(y)
So e.g. for the ADULT_SAMPLE dataset, I’m loading my dask dataframe, df (I know this is a tiny example that doesn’t need dask, but I’m using it to prototype).
df = dd.read_csv(path/'adult.csv', blocksize=1e6)
training_set = DaskPartDataset(df, 'salary', cat_names)
training_set.c = 2
training_set.classes = ['>=50k','<50k']
I’m then creating my data bunch as:
data = TabularDataBunch.create(training_set, Valid_Set, bs=1, num_workers=0)
and manually setting
data.get_emb_szs = lambda a: [(len(cats[x]), 15) for x in cat_names]
This seems to create the model and learner correctly, but what isn’t working are calls to fit… I’m getting an error around the forward pass of the embeddings.
RuntimeError: index out of range at /opt/conda/conda-bld/pytorch-cpu_1549632688322/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191
Does anyone have any insights?
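One thing that might be worth ruling out (it may well not be the cause here): pandas encodes missing values in a categorical column as code -1, and an embedding lookup needs indices between 0 and the number of rows minus one. A small pandas-only sketch, with made-up data, of checking for that and shifting the codes:

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing value
s = pd.Series(["a", "b", None, "a"]).astype("category")
codes = s.cat.codes.to_numpy()          # the missing value is encoded as -1
n_categories = len(s.cat.categories)    # 2: 'a' and 'b'

# Shift by one so that 0 means "missing" and real categories start at 1;
# the embedding table then needs n_categories + 1 rows.
safe_codes = codes.astype(np.int64) + 1
n_embedding_rows = n_categories + 1

# Every index must fall inside the table
assert safe_codes.min() >= 0 and safe_codes.max() < n_embedding_rows
print(codes, safe_codes, n_embedding_rows)
```

Running the same min/max check on each categorical column of a batch, against the embedding sizes passed to the model, should at least show which column is producing the out-of-range index.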
@JoshVarty and @init_27 I am also working on a dataset that is around 4GB on disk. When I try to create a DataBunch from a dataframe, it runs out of memory (16GB of RAM). Please suggest how to solve this issue.
Thanks,
Ritika
On a related note, is there a good way to handle processing large text datasets? I find the tokenization/preprocessing takes a huge amount of memory that is released when the preprocessing is finished. Right now I’m working with a dataset where processing just a third of it takes up my full 32 GB of memory, but after processing the actual data object takes up like 1 GB. I’m not really sure how to go about processing the full dataset.
Sorry if this is obvious and you have done so already, but you should first of all make sure that you specify datatypes for the columns when reading in the data with pandas. In one Kaggle competition this reduced the memory needed by more than 50%. I have created a kernel about that:
https://www.kaggle.com/marcmuc/large-csv-datasets-with-pandas-use-less-memory
The key is that pandas automatically assigns the 64bit versions of int and float to the columns, whereas your data can probably live with 8bit ints sometimes or 32bit floats most of the time. This significantly reduces your memory footprint.
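A small sketch of the effect, using a made-up two-column CSV held in memory. The column names and the chosen dtypes are only illustrative; on real data you need to check that the values actually fit the smaller types.

```python
import io
import pandas as pd

# 1000 rows: a 0/1 flag and a small float (illustrative data)
csv_text = "flag,score\n" + "\n".join(f"{i % 2},{i * 0.5}" for i in range(1000))

# Default parsing: pandas picks int64 and float64
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: 1 byte per flag, 4 bytes per score
typed_df = pd.read_csv(io.StringIO(csv_text),
                       dtype={"flag": "int8", "score": "float32"})

default_bytes = default_df.memory_usage(index=False).sum()
typed_bytes = typed_df.memory_usage(index=False).sum()
print(default_df.dtypes.to_dict(), default_bytes)
print(typed_df.dtypes.to_dict(), typed_bytes)
```

Passing dtype at read time also avoids ever materialising the 64bit version, which matters when the default-typed dataframe would not fit in RAM in the first place.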
Also, if running the model fails, try setting the workers to 0. There are still often problems when using workers in pytorch/fastai due to memory consumption, see this thread
You should definitely check out the suggestion by @marcmuc
The other thing I can suggest, apart from an actual upgrade, is to bump your swap up to 60GB.
Hey Josh,
Same problem. I would make sure you are specifying dtypes. I got mine down from 19 GB to 2.1+ GB of memory usage. Same advice as @marcmuc at the end of the day.
Someone posted all the dtypes and the smallest type you can use for each column. Unfortunately, the Kaggle forum seems to be down at this time, but you should be making something like this.
import pandas as pd

# define dtypes from Kaggle thread
dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    # etc...
}
df = pd.read_csv(path/'train.csv', dtype=dtypes)
Since yesterday. Still haven’t submitted anything yet.
Lol, if it piques your interest and you’d want to team up, please let me know
If anyone is still running into memory problems (I am), there is a great piece of code out there, reduce_mem_usage, which has been getting me another 10% on top of my optimized dtypes.
You can find it at https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65 but the code is below
import pandas as pd
import numpy as np

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object and col_type.name != 'category':
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
Edit: I am now running into problems with using this in Rossmann. Some alterations with it will ignore datatype, but it will not properly train. So maybe this isn’t a great function.
I had problems with half precision floats (float16) that gave me NaNs when I used them. I suspect this issue depends on your GPU.
I also had problems with dtypes more generally, in that they caused some strange issues when I used .get_preds() and .predict(). Using dtypes caused these values to be different for me. I’m not sure why.
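Whatever role the GPU plays, float16 itself has a narrow range and coarse precision, which is one way NaNs and slightly different predictions can creep in. This is only a NumPy sketch of those limits, not a diagnosis of the issue above:

```python
import numpy as np

# float16 has an 11-bit significand, so integers are exact only up to 2048
print(np.float16(2049))            # rounds to 2048

# The largest finite float16 is 65504; larger values overflow to inf
max_f16 = np.finfo(np.float16).max
with np.errstate(over="ignore", invalid="ignore"):
    overflowed = np.float32(70000).astype(np.float16)
    not_a_number = overflowed - overflowed   # inf - inf is NaN

print(max_f16, overflowed, not_a_number)
```

A column whose raw values fit in float16 can still overflow later, for example when values are squared or summed, so keeping continuous columns at float32 is the safer choice if memory allows.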
@marcmuc I am currently working on the PUBG Kaggle dataset. The dataset is large, up to 667 MB. I will definitely look into this method and post my findings.