How to handle dataframes too large to fit in memory?

I have a tabular dataset that’s about 5GB on disk. I can (barely) load the dataframe into memory using pandas.read_csv(), but when I try to create a DataBunch from this dataframe I run out of memory (I have 16GB of RAM).

I tried reading the dataframe in chunks using pandas’ chunksize parameter. I would then train my model on each chunk and save the weights; on the next chunk I would reload the weights and continue training. However, this approach doesn’t work because TabularList.from_df() expects you to provide the entire dataframe so it can create embeddings (specifically, it needs to know all of the classes for each categorical column).
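Roughly, the loop I tried looked like this (just a sketch; the chunk size, column lists and learner setup are placeholders, and the data-block calls may differ by fastai version):

import pandas as pd
from fastai.tabular import *

# Rough sketch of the chunked approach (chunk size, cat_names/cont_names/procs/dep_var are placeholders).
for i, chunk in enumerate(pd.read_csv('train.csv', chunksize=500_000)):
    data = (TabularList.from_df(chunk, cat_names=cat_names, cont_names=cont_names, procs=procs)
            .split_by_rand_pct(0.1)
            .label_from_df(cols=dep_var)
            .databunch())
    learn = tabular_learner(data, layers=[200, 100])
    if i > 0:
        learn.load('chunk_weights')   # continue from the previous chunk's weights
    learn.fit_one_cycle(1)
    learn.save('chunk_weights')
# The catch: categories (and therefore embedding sizes) are inferred per chunk,
# so they don't line up between chunks.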

Does anyone have a good approach for working with dataframes that don’t fit into memory?

6 Likes

Are you making sure the embedding size is restricted to 50 or less? I was having similar problems, but after doing this I was able to create a learner object and train on the full dataset. You can also decrease batch_size (default 64) and increase num_workers (default 4) in your DataBunch to see if that helps.
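For example, something like this (a sketch; the column names and sizes are only illustrative):

# Sketch: cap a large embedding at 50 and shrink the batch size (names are illustrative).
data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_rand_pct(0.1)
        .label_from_df(cols=dep_var)
        .databunch(bs=32, num_workers=8))                      # bs down from 64, workers up from 4
learn = tabular_learner(data, layers=[200, 100],
                        emb_szs={'some_high_cardinality_col': 50})  # cap the embedding width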

Apologies in advance for this not being a “real” answer (I don’t know the generalized answer but I’m interested as well; subscribing to this topic).

To get over this hurdle in the near-term, why not upgrade to 32GB of memory if your system supports it? It’s relatively inexpensive in the long run, and if you do the calculation of how much time it would save you for any reasonable valuation of your time, it’s probably worth it.

(I know it’s a hard value prop to accept; I’m wrestling with the same thing personally for upgrading to a 2080 Ti… I know that objectively the time savings over its lifetime would make it well worth the investment but I haven’t been able to pull the trigger yet.)

One approach I found in the Microsoft Malware Prediction comp is to load the df in chunks, process each chunk, write it back out, and then read in the next chunk. This may not be possible depending on your use case; a sketch is below.
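Something along these lines (a rough sketch; the file names, chunk size and downcasting are placeholders):

import pandas as pd

# Sketch: process the CSV chunk by chunk, downcast, and write out a smaller file.
first = True
for chunk in pd.read_csv('train.csv', chunksize=1_000_000):
    for col in chunk.select_dtypes('float64').columns:
        chunk[col] = chunk[col].astype('float32')
    for col in chunk.select_dtypes('int64').columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast='integer')
    chunk.to_csv('train_small.csv', mode='w' if first else 'a', header=first, index=False)
    first = False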

I definitely agree with @yeldarb that an upgrade would save you the pain of writing these loops every time and keeping your imagination limited to 16GB batch sizes. Another idea is to increase the SWAP on your Ubuntu machine (I have a 16GB laptop with 120GB of SWAP; it has an M.2 drive and keeping the SWAP there is quite helpful. Not so helpful if you only have an HDD).

1 Like

Dask (http://docs.dask.org/en/latest/) is designed for handling dataframes that won’t fit into memory. I’ve not tested it with the fastai library, but it may be worth looking into.

1 Like

My understanding is that Dask isn’t suited to this task because the fastai library requires Datasets to implement a __getitem__ method that retrieves an element at an arbitrary position, and Dask doesn’t support this. However, someone here suggested that the zarr package might work.
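For what it’s worth, the appeal of zarr is that it allows cheap random access by index, which is exactly what __getitem__ needs. An untested sketch (the path and array layout are made up):

import zarr
from torch.utils.data import Dataset

class ZarrDataset(Dataset):
    """Untested sketch: random access into an on-disk, chunked zarr array (path/layout made up)."""
    def __init__(self, path):
        self.x = zarr.open(path, mode='r')   # chunked array on disk, not loaded into RAM

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, i):
        return self.x[i]                     # reads only the chunk(s) containing row i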

1 Like

Hmm, I tried this but ran into problems because the embeddings that were created were of different sizes depending on the chunk that I loaded (the embeddings are sized by looking at all of the categories in the current chunk). It’s very likely I was doing something wrong, though.

I’ve been playing around with this myself, but I can’t figure out how to properly subclass TabularList. What I have done instead is subclass the torch Dataset, randomly serving up an entire partition of a dask dataframe as a batch. Preprocessing the dataframe takes time, but it should work with out-of-memory dataframes.

import numpy as np
import dask
from torch import tensor
from torch.utils.data import Dataset


class DaskPartDataset(Dataset):
    """Serves one whole dask partition per item, so each 'batch' is a partition."""

    def __init__(self, df, target_col, cat_names):
        self.df = df
        self.target = target_col
        self.cat_names = cat_names
        self.cats = {}

        # Process categories: make them known, store them, then encode as integer codes
        for n in self.cat_names + [self.target]:
            self.df[n] = self.df[n].astype('category').cat.as_known()
            self.cats[n] = self.df[n].cat.categories
            self.df[n] = self.df[n].cat.codes.astype(np.int64)

        # Process continuous columns: collect the stats needed for fill-missing and normalization
        self.cont_names = list(set(self.df.columns) - set(self.cat_names) - set([self.target]))
        medians, means, stds = {}, {}, {}
        for n in self.cont_names:
            medians[n] = self.df[n].quantile(0.5)
            means[n] = self.df[n].mean()
            stds[n] = self.df[n].std()
            self.df[n] = self.df[n].astype(np.float32)

        # Compute all the lazy statistics in one pass
        self.medians, self.means, self.stds = dask.compute(medians, means, stds)

    def __len__(self):
        return self.df.npartitions

    def __getitem__(self, i):
        print(i)
        df = self.df.get_partition(i).compute()

        # Fill missing values and normalize the continuous columns per partition
        for n in self.cont_names:
            df[n] = df[n].fillna(self.medians[n])
            df[n] = (df.loc[:, n] - self.means[n]) / (1e-7 + self.stds[n])

        x_cont = df[self.cont_names].values
        x_cat = df[self.cat_names].values
        y = df[self.target].values
        print('Done', df.shape)
        return [tensor(x_cat), tensor(x_cont)], tensor(y)

So e.g. for the ADULT_SAMPLE dataset, I’m loading my dask dataframe, df (I know this is a tiny example that doesn’t need dask, but I’m using it to prototype).

import dask.dataframe as dd

df = dd.read_csv(path/'adult.csv', blocksize=1e6)
training_set = DaskPartDataset(df, 'salary', cat_names)
training_set.c = 2
training_set.classes = ['>=50k', '<50k']

I’m then creating my data bunch as:

data = TabularDataBunch.create(training_set, Valid_Set, bs=1, num_workers=0)

and manually setting

data.get_emb_szs = lambda a: [(len(cats[x]), 15) for x in cat_names]

This seems to create the model and learner correctly, but what isn’t working are calls to fit… I’m getting an error around the forward pass of the embeddings.

RuntimeError: index out of range at /opt/conda/conda-bld/pytorch-cpu_1549632688322/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191

Does anyone have any insights?

@JoshVarty and @init_27 I am also working on a dataset that is around 4GB on disk. When I try to create a DataBunch from the dataframe it runs out of memory (16GB of RAM). Please suggest how to solve this issue.

Thanks,
Ritika

On a related note, is there a good way to handle processing large text datasets? I find the tokenization/preprocessing takes a huge amount of memory that is released when the preprocessing is finished. Right now I’m working with a dataset where processing just a third of it takes up my full 32 GB of memory, but after processing the actual data object takes up like 1 GB. I’m not really sure how to go about processing the full dataset.

3 Likes

Sorry if this is obvious and you have done so already, but you should first of all make sure that you specify datatypes for the columns when reading in the data with pandas. In one Kaggle competition this reduced the memory needed by more than 50%. I have created a kernel about that:

https://www.kaggle.com/marcmuc/large-csv-datasets-with-pandas-use-less-memory

The key is that pandas automatically assigns 64-bit versions of int and float to the columns, whereas your data can often live with 8-bit ints or 32-bit floats. This significantly reduces your memory footprint.
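For example (the column names here are placeholders):

import pandas as pd

# Sketch: pass explicit dtypes to read_csv and check the footprint.
dtypes = {'id_col': 'category', 'count_col': 'int8', 'value_col': 'float32'}
df = pd.read_csv('train.csv', dtype=dtypes)
print('{:.1f} MB'.format(df.memory_usage(deep=True).sum() / 1024**2))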

Also, if running the model fails, try setting the workers to 0; there are still often problems with memory consumption when using workers in pytorch/fastai, see this thread.

14 Likes

You should definitely check out the suggestion by @marcmuc

The other thing I can suggest, apart from an actual upgrade, is to bump your SWAP up to 60GB. :slight_smile:

Thanks @init_27

Hey Josh,

Same problem. I would make sure you are using dtypes. I got mine down to 2.1 GB of memory usage from 19 GB. Same advice as @marcmuc at the end of the day.


Someone posted all the dtypes and the smallest type each column will fit in. Unfortunately, the Kaggle forum seems to be down at this time, but you should be making something like this.

# define dtypes from Kaggle thread
dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    # etc...
}
df = pd.read_csv(path/'train.csv', dtype=dtypes)
2 Likes

@mindtrinket I didn’t know you were competing in the Microsoft comp! :smiley:

Since yesterday :stuck_out_tongue: Still haven’t submitted anything yet.

1 Like

Lol, if it piques your interest and you’d want to team up, please let me know :slight_smile:

If anyone is still running into memory problems (I am), there is a great piece of code out there, reduce_mem_usage, which has been getting me another 10% on top of my optimized dtypes.

You can find it at https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65, but the code is below:

import pandas as pd
import numpy as np

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object and col_type.name != 'category':
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

Edit: I am now running into problems using this with Rossmann. Some alterations to it will ignore the datatypes, but it will not train properly, so maybe this isn’t a great function.

I had problems with half-precision floats (float16) that gave me NaNs when I used them. I suspect this issue depends on your GPU.

I also had problems with dtypes more generally, in that they caused some strange issues when I used .get_preds() and .predict(). Using dtypes caused these values to differ for me. I’m not sure why.

@marcmuc I am currently working on the PUBG Kaggle dataset. The dataset is fairly large, up to 667 MB. I will definitely try this method and post my findings.