@sgugger I'm actually noticing exponential growth in how the memory is being used. Take this code, for instance, which runs permutation importance:
```python
def calc_error(self, col:str):
    "Shuffles a column and calculates error on a column"
    temp_df = self.df.copy()
    temp_df[col] = temp_df[col].sample(n=len(temp_df), replace=True).reset_index(drop=True)
    test_dl = self.learn.dls.test_dl(temp_df)
    del temp_df
    return self.learn.validate(dl=test_dl)[1]
```
`self.df` is stored away in memory for me to run with. I have 38 columns to work with (it's Rossmann), and I am essentially trying to shuffle a column in a particular dataframe, make a `test_dl` with it, and then run it through `learn.validate`. I make sure to clear the memory of my `temp_df` each time I use it, but something else is being stored instead, because I cannot get past shuffling 12 variables, and I think this is due to some amount of exponential RAM usage (should I be deleting my `test_dl` too, maybe?)
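One thing worth knowing here: `del test_dl` alone may not free anything if the loader sits in a reference cycle, because CPython only reclaims cycles when the garbage collector runs. An explicit `gc.collect()` after each column might help. A minimal sketch of the effect, using a hypothetical `Holder` class as a stand-in for a real DataLoader:

```python
import gc
import weakref

class Holder:
    "Stand-in for an object (like a DataLoader) caught in a reference cycle."
    pass

def make_cycle():
    a, b = Holder(), Holder()
    a.other, b.other = b, a   # a <-> b reference cycle
    return weakref.ref(a)     # weak ref lets us observe when `a` is freed

gc.disable()                  # ensure only our explicit collect runs
ref = make_cycle()
assert ref() is not None      # cycle keeps both objects alive after `del`/scope exit
gc.collect()                  # explicit pass frees the cycle immediately
assert ref() is None
gc.enable()
```

So in your loop, `del test_dl` followed by `gc.collect()` is a cheap experiment to rule cycles out.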
To test this, run the following on a trained Rossmann problem and pass in the training dataframe:
```python
class PermutationImportance():
    "Calculate and plot the permutation importance"
    def __init__(self, df, learn:Learner, metric:callable=None):
        "Initialize with a test dataframe, a learner, and a metric"
        self.learn = learn
        self.df = df if df is not None else learn.dls.valid.dataset.all_cols
        if metric is None:
            self.learn.metrics = accuracy if learn.dls.c > 1 else MSELossFlat()
        else:
            self.learn.metrics = L(AvgMetric(metric))
        self.x_names = learn.dls.x_names.filter(lambda x: '_na' not in x)
        self.y = learn.dls.y_names
        self.results = self.calc_feat_importance()
        self.plot_importance(self.ord_dic_to_df(self.results))

    def calc_feat_importance(self):
        "Calculates permutation importance by shuffling a column on a percentage scale"
        test_dl = self.learn.dls.test_dl(self.df)
        print('Getting base error')
        base_error = self.learn.validate(dl=test_dl)[1]
        self.importance = {}
        pbar = progress_bar(self.x_names)
        print('Calculating Permutation Importance')
        for col in pbar:
            self.importance[col] = self.calc_error(col)
        for key, value in self.importance.items():
            self.importance[key] = (base_error - value) / base_error  # this can be adjusted
        return OrderedDict(sorted(self.importance.items(), key=lambda kv: kv[1], reverse=True))

    def calc_error(self, col:str):
        "Shuffles a column and calculates error on a column"
        temp_df = self.df.copy()
        temp_df[col] = temp_df[col].sample(n=len(temp_df), replace=True).reset_index(drop=True)
        test_dl = self.learn.dls.test_dl(temp_df)
        del temp_df
        return self.learn.validate(dl=test_dl)[1]

    def ord_dic_to_df(self, ord_dict:OrderedDict):
        return pd.DataFrame([[k, v] for k, v in ord_dict.items()], columns=['feature', 'importance'])

    def plot_importance(self, df:pd.DataFrame, limit=20, asc=False, **kwargs):
        "Plot importance with an optional limit to how many variables are shown"
        df_copy = df.copy()
        df_copy['feature'] = df_copy['feature'].str.slice(0, 25)
        df_copy = df_copy.sort_values(by='importance', ascending=asc)[:limit].sort_values(by='importance', ascending=not asc)
        ax = df_copy.plot.barh(x='feature', y='importance', sort_columns=True, **kwargs)
        for p in ax.patches:
            ax.annotate(f'{p.get_width():.4f}', (p.get_width() * 1.005, p.get_y() * 1.005))
```
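One way to cut the per-column cost: instead of `self.df.copy()` (which duplicates all 38 columns on every iteration), save just the one column, shuffle it in place, and restore it afterward, so peak memory stays at roughly one dataframe plus one column. A sketch of the idea outside of fastai, where the `score_fn` callback is a stand-in for building a `test_dl` and calling `learn.validate`:

```python
import numpy as np
import pandas as pd

def score_with_shuffled(df, col, score_fn, seed=0):
    "Shuffle `col` in place, score the frame, then restore the column."
    # Save only the original column (O(rows)), not the whole frame (O(rows * cols)).
    original = df[col].to_numpy(copy=True)
    rng = np.random.default_rng(seed)
    df[col] = rng.permutation(original)
    try:
        return score_fn(df)
    finally:
        df[col] = original  # restore so later columns see unshuffled data
```

Usage: `score_with_shuffled(train_df, 'Store', lambda d: my_validate(d))`, looping over columns; after each call the frame is byte-for-byte back to its original state.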
Should I instead operate with one `TabularPandas` test object or something? (Or is there a way to look into the memory usage as I go?)
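On inspecting memory as you go: for Python-level allocations (dataframe copies, dicts, and so on) the standard library's `tracemalloc` can report current and peak usage between iterations. Note it won't see memory PyTorch allocates on the C++ side, so it's only a partial picture. A sketch, with `report_memory` as a hypothetical helper you'd call inside the loop:

```python
import tracemalloc

def report_memory(label: str):
    "Print and return (current, peak) traced Python allocations in bytes."
    current, peak = tracemalloc.get_traced_memory()
    print(f"{label}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
    return current, peak

tracemalloc.start()
big = list(range(1_000_000))        # stand-in for a temp_df-sized allocation
cur_with, _ = report_memory("after allocation")
del big                             # freeing it shows up in the next report
cur_after, _ = report_memory("after del")
tracemalloc.stop()
```

Calling something like `report_memory(col)` after each `calc_error` would show whether usage creeps up linearly or faster than that.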
Another thing to note: when I generate Rossmann on my machine, my RAM usage goes from 1.5 GB to 3.03 GB. I'm just trying to figure out a solution for these memory issues.