Fastai v2 tabular

FYI @muellerzr I define this little bash function in my .bashrc:

git_pull_all ()
{
    pushd ~/git;
    parallel -a repos 'echo {} && cd ~/git/{} && git pull';
    popd
}

All the repos I want to keep up to date are in ~/git, and in there is a file, ~/git/repos, listing the repo names to update, one per line. (This assumes you have GNU parallel installed.)
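
The repos file itself is just one repo name per line, for example (these particular names are only illustrative):

$ cat ~/git/repos
fastai2
fastcore
nbdev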

Awesome! That's definitely super helpful! Thanks :slight_smile: (had to go learn what .bashrc actually is real quick). In my mind, the easiest way to do this (from a Colab perspective) is most likely to just keep everything in your Google Drive… I'll write a notebook when I can get to it…

I'm trying to get into fastai2 by doing the following Kaggle competition:

I'm running into memory limitations when trying to load the data into memory.
TabularPandas in particular seems to be quite RAM-hungry.
As a plain pandas DataFrame, the data clocks in at about 4.6GB of memory.
Is there a way to lazy-load the data?

Here is the notebook, pretty much a copy of @muellerzr's starter code from a couple of months ago :slight_smile:

The main issue would be preprocessing the data (I think); you still need it active in memory somewhere. I'm not 100% sure why it takes up so much space (I know this is a thing). One option would be to split the dataframe into multiple pieces, keep track of the proc statistics, and run with those during the TabularPandas creation. I'll think about making a tutorial on that, unless @sgugger has any other ideas for better memory efficiency on large dataframes?

Edit: perhaps preprocess in chunks if it's over n rows? (Like 1,000,000)
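
Roughly what I have in mind, as a minimal sketch: assume the raw data sits in a CSV (train.csv and cont_vars are just placeholders here) and that only the Normalize statistics need to be global; Categorify and FillMissing would each need a similar pass of their own:

import pandas as pd

cont_vars = ['sales', 'customers']  # placeholder continuous columns

# first pass: accumulate counts and sums chunk by chunk to get global mean/std
n, s, ss = 0, 0, 0
for chunk in pd.read_csv('train.csv', chunksize=1_000_000):
    cont = chunk[cont_vars]
    n  += len(cont)
    s  += cont.sum()
    ss += (cont ** 2).sum()
means = s / n
stds = (ss / n - means ** 2) ** 0.5

# second pass: normalize each chunk with the global stats and write it back out
for i, chunk in enumerate(pd.read_csv('train.csv', chunksize=1_000_000)):
    chunk[cont_vars] = (chunk[cont_vars] - means) / stds
    chunk.to_csv(f'train_norm_{i}.csv', index=False)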

Preprocessing requires computing statistics over the whole dataframe, so preprocessing chunk by chunk would require quite a lot of custom code. The same goes for reading it lazily.
This might be something we look into after the release of v2, but we have more pressing matters before then.

As for the memory usage of TabularPandas, be sure to set the inplace argument to True to avoid unnecessary copies of the dataframe. It should normally have the same memory footprint as the dataframe, since we only keep one reference to it.

Is this an argument we can pass? I didn't see it in Tabular's parameters (which TabularPandas inherits from). Or is there a different spot we should pass it into when preprocessing manually?

Ah, you're right, it was removed, so it's always inplace now.

I'm trying to think of where the excess memory is coming from. So TabularPandas uses the original DataFrame reference you brought in? (It doesn't make another copy?)

Edit: actually, I think it's still loading a copy into memory, though temporarily:

df = df.iloc[sum(splits, [])].copy() (if I'm reading that right?)
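
For anyone following along: sum(splits, []) just flattens the train/valid index lists, so that line really does build a full, reordered copy of the rows, e.g.:

splits = ([0, 1, 2], [3, 4])  # hypothetical train/valid indices
sum(splits, [])               # -> [0, 1, 2, 3, 4]
# df.iloc[that list].copy() then materialises a second DataFrame in that order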

If it's always inplace, that would mean that, with a pandas dataframe named train, after running the following:

TabularPandas(train, procs, cat_vars, cont_vars, y_names=dep_var, splits=splits, block_y=RegressionBlock())
type(train)

the type would be fastai2.tabular.core.TabularPandas.
But I still have to assign
train = TabularPandas(...)

What he means is that the reference to the dataframe is what's used, not a copy of it, so it's all essentially coming from one memory location.

Got it, thank you! I suspected I was on the wrong track there.

@faib I think the best solution is to give your TabularPandas object the same name as your dataframe (so it overrides it), because I can do the following even after deleting the original dataframe:

to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, block_y=block_y, splits=splits)
del df

to.iloc[:5]

And it doesn't break. Otherwise, just delete your old DataFrame from memory.
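
So the pattern for your case would be roughly this (a sketch; procs, cat_names, etc. are whatever you already defined, and gc.collect() just nudges Python into releasing the old frame right away):

import gc

# overwrite the old name so nothing references the original DataFrame anymore
df = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, block_y=block_y, splits=splits)
gc.collect()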

Thanks @muellerzr! This did work for me, but it fails when I do some additional feature engineering beforehand.
I simply halved the number of training samples to make this easier while still learning :slight_smile:

@sgugger I'm actually noticing what looks like exponential growth in memory usage. Take this code for instance, which runs permutation importance:

  def calc_error(self, col:str):
    "Shuffles a column and calculates error on a column"
    temp_df = self.df.copy()
    temp_df[col] = temp_df[col].sample(n=len(temp_df), replace=True).reset_index(drop=True)
    test_dl = self.learn.dls.test_dl(temp_df)
    del temp_df
    return self.learn.validate(dl=test_dl)[1]

self.df is stored away in memory for me to run with. I have 38 columns I work with (it's Rossmann), and I'm essentially trying to shuffle a column in a particular dataframe, make a test_dl with it, and then run it through learn.validate. I make sure to clear the memory of my temp_df each time I use it, but something else is being stored instead, because I cannot get past shuffling 12 variables, and I think this is due to some exponential amount of RAM being used (should I be deleting my test_dl too, maybe?)

To test this, run the following on a trained Rossmann problem and pass in the training dataframe:

class PermutationImportance():
  "Calculate and plot the permutation importance"
  def __init__(self, df, learn:Learner=None, metric:callable=None):
    "Initialize with a test dataframe, a learner, and a metric"
    self.learn = learn
    self.df = df if df is not None else learn.dls.valid.dataset.all_cols
    if metric is None:
      self.learn.metrics = accuracy if learn.dls.c > 1 else MSELossFlat()
    else:
      self.learn.metrics = L(AvgMetric(metric))
    
    self.x_names = learn.dls.x_names.filter(lambda x: '_na' not in x)
    self.y = learn.dls.y_names
    self.results = self.calc_feat_importance()
    self.plot_importance(self.ord_dic_to_df(self.results))

  def calc_feat_importance(self):
    "Calculates permutation importance by shuffling a column on a percentage scale"
    test_dl = self.learn.dls.test_dl(self.df)
    print('Getting base error')
    base_error = self.learn.validate(dl=test_dl)[1]
    self.importance = {}
    pbar = progress_bar(self.x_names)
    print('Calculating Permutation Importance')
    for col in pbar:
      self.importance[col] = self.calc_error(col)
    for key, value in self.importance.items():
      self.importance[key] = (base_error-value)/base_error #this can be adjusted
    return OrderedDict(sorted(self.importance.items(), key=lambda kv: kv[1], reverse=True))

  def calc_error(self, col:str):
    "Shuffles a column and calculates error on a column"
    temp_df = self.df.copy()
    temp_df[col] = temp_df[col].sample(n=len(temp_df), replace=True).reset_index(drop=True)
    test_dl = self.learn.dls.test_dl(temp_df)
    del temp_df
    return self.learn.validate(dl=test_dl)[1]

  def ord_dic_to_df(self, ord_dict:OrderedDict):
    "Converts the importance dictionary into a DataFrame"
    return pd.DataFrame([[k, v] for k, v in ord_dict.items()], columns=['feature', 'importance'])

  def plot_importance(self, df:pd.DataFrame, limit=20, asc=False, **kwargs):
    "Plot importance with an optional limit to how many variables shown"
    df_copy = df.copy()
    df_copy['feature'] = df_copy['feature'].str.slice(0,25)
    df_copy = df_copy.sort_values(by='importance', ascending=asc)[:limit].sort_values(by='importance', ascending=not(asc))
    ax = df_copy.plot.barh(x='feature', y='importance', sort_columns=True, **kwargs)
    for p in ax.patches:
      ax.annotate(f'{p.get_width():.4f}', ((p.get_width() * 1.005), p.get_y()  * 1.005))

Should I instead operate with one TabularPandas test object or something? :confused: (Or is there a way to look into the memory usage as I go?)

Another thing to note: when I generate the Rossmann data on my machine, my RAM usage goes from 1.5GB to 3.03GB.

Just trying to figure out how to find a solution for these memory issues :slight_smile:
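
In the meantime, the simplest way I have found to watch the usage between steps is to poll the process RSS (a minimal sketch using psutil, which is not part of fastai):

import gc
import psutil

def rss_mb():
    "Resident memory of the current process, in MB"
    return psutil.Process().memory_info().rss / 1024 ** 2

print(f'before shuffle: {rss_mb():.0f} MB')
# ... build the shuffled temp_df and test_dl here, then run learn.validate ...
gc.collect()
print(f'after cleanup:  {rss_mb():.0f} MB')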

I have no idea. There is a limitation in the dataloaders with num_workers > 0 where, in some cases, the data is copied several times, leading to memory leaks. Maybe this is due to that? Otherwise I'm afraid you'll have to profile on your own.
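
If you want to rule that out first, one quick check is to rebuild the dataloaders single-process and see whether the growth goes away (a sketch; to and the batch size come from the earlier snippets):

# num_workers=0 keeps all loading in the main process, so no worker copies
dls = to.dataloaders(bs=64, num_workers=0)
test_dl = dls.test_dl(temp_df)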

I'll take a look and investigate (as I think that may be what's happening; at least the general memory usage hints at it).

It's actually happening before that, even. What I'm noticing is that after the TabularPandas creation, an extra 2GB is being used, when my original dataframe was only ~860MB. Here's a history of my results from !free -m:

Baseline (just loading library in):

              total        used        free      shared  buff/cache   available
Mem:          13022         654        9708           0        2660       12134
Swap:             0           0           0

Loading in Rossmann train_df:

              total        used        free      shared  buff/cache   available
Mem:          13022        1521        8839           0        2661       11690
Swap:             0           0           0

(867MB used by the dataframe now)

TabularPandas:

to = TabularPandas(train_df, procs=procs, cat_names=cat_vars, cont_names=cont_vars,
                   y_names=dep_var, block_y=TransformBlock(), splits=splits, device='cuda')
              total        used        free      shared  buff/cache   available
Mem:          13022        3537        6822           0        2662       11081
Swap:             0           0           0

Which now shows our added 2GB.

Also, attempting to delete the TabularPandas and run garbage collection doesn't do anything. I'll need to dig deeper into the TabularPandas creation to see what's going on, but this is what I have so far. I'm assuming this much memory usage is not what we want, as we've essentially more than doubled the dataframe's footprint.
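
One more thing I want to check is the deep memory usage of the original frame versus what TabularPandas holds internally (a sketch; if I'm reading the source right, the processed frame lives in to.items):

orig_mb = train_df.memory_usage(deep=True).sum() / 1024 ** 2
proc_mb = to.items.memory_usage(deep=True).sum() / 1024 ** 2
print(f'train_df: {orig_mb:.0f} MB   to.items: {proc_mb:.0f} MB')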

I had the memory-leak problem with num_workers > 1, but it was only really an issue when processing a very large text dataset (20MM records). In that case, I set shuffle_train=False and the issue was severely reduced; the memory hog was not a problem anymore:

dbunch_lm = dsrc.databunch(bs=bs, seq_len=sl, val_bs=bs, num_workers=2, pin_memory=True, shuffle_train=False)

Thanks a lot for looking into this. It's not something we've had a chance to optimize yet, so your help is much appreciated. I wouldn't be at all surprised to find that there are places where we're not using memory efficiently in fastai2.tabular…

My pleasure :slight_smile: I'll report back with what I can find.