Some useful functions for tabular models

Pak · April 18, 2019, 12:18pm

Added: The complete notebook with all these functions in use is now online

Getting predictions for a model on a new dataset

There can be many reasons for getting predictions on another set of data (a dataframe that is neither the test nor the validation set).
The obvious one is when you just want to predict on a new dataset, i.e. you want the predictions themselves.
The second group of reasons lies in the field of data exploration. If we want to do feature importance or partial dependence analysis, we need a tool to get predictions on a bunch of new (altered) dataframes. Of course we can use learn.predict(row) for each, well… row of each dataframe, but that is a pretty long process (in my setup, learn.predict in a for-loop over 200 rows takes 45+ seconds, while calculating the error for a whole dataframe of 10,000 rows with the standard tools takes less than a second).
So I’ve accepted a challenge © and wrestled with this problem for 1.5 weeks (ok, to be frank, for half a dozen evenings, but that’s still much longer than I expected). Eventually, after discovering the hackery of an editable install and some print-debugging, I managed to understand where the categorification and normalization parameters are stored (the latter was muuuuch harder), as we have to apply exactly these transformations to a new dataframe.
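To make "apply exactly these transformations" concrete, here is a minimal pandas-only sketch of the idea. It is not fastai code: fit_params and apply_params are hypothetical helpers, and the population standard deviation and the 1e-7 epsilon are simplifying assumptions. The point is only that category lists, means and stds are learnt once on the training data and then reused on any new dataframe.

```python
import numpy as np
import pandas as pd

def fit_params(train_df, cat_names, cont_names):
    """Hypothetical helper: record what must be reused later."""
    return {
        "cats": {c: sorted(train_df[c].dropna().unique()) for c in cat_names},
        "means": {c: float(train_df[c].mean()) for c in cont_names},
        "stds": {c: float(train_df[c].std(ddof=0)) for c in cont_names},
    }

def apply_params(new_df, params):
    """Hypothetical helper: transform new data with the *training* parameters."""
    out = new_df.copy()
    for c, cats in params["cats"].items():
        # Unknown categories get code -1, so after +1 they map to 0.
        codes = pd.Categorical(out[c], categories=cats).codes
        out[c] = codes.astype(np.int64) + 1
    for c in params["means"]:
        out[c] = (out[c] - params["means"][c]) / (params["stds"][c] + 1e-7)
    return out

train = pd.DataFrame({"shop": ["a", "b"], "dist": [1.0, 3.0]})
new = pd.DataFrame({"shop": ["b", "c"], "dist": [2.0, 4.0]})
print(apply_params(new, fit_params(train, ["shop"], ["dist"])))
```

The functions further down do the same thing, but take the parameters from the processors that fastai has already fitted.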

So now I present a number of functions that help you predict on a new dataframe almost as quickly as the original prediction on the training set.
The assumption is that you use the standard tabular learning cycle (e.g. procs=[FillMissing, Categorify, Normalize]).

The only trick here is to split standard databunch creation process into two phases:

First you apply all the functions before .databunch e.g.:
data_prep = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
               .split_by_idx(valid_idx)
               .label_from_df(cols=dep_var, label_cls=FloatList, log=True))
data_prep is now a valid LabelLists object, that can be used to get data processing parameters

Then you can apply .databunch as well, to get DataBunch object (that is needed for learning process itself) e.g. data = data_prep.databunch(bs=BS)
data_prep is what we do the split for. It will be used in our function.

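  def get_model_real_input(df:DataFrame, data_prep:LabelLists, bs:int=None)->Tensor:
      # NOTE: uses the global `learn` (learn.data.device) to place tensors on the right device
      df_copy = df.copy()
      fill, catf, norm = None, None, None
      cats, conts = None, None
      is_alone = (len(df) == 1)

      # processors of the training inputs; they hold the fitted fill, category and normalization parameters
      proc = data_prep.get_processors()[0][0]
      if (is_alone):
          # duplicate the lone row (see the workaround below); pd.concat also works where DataFrame.append was removed
          df_copy = pd.concat([df_copy, df_copy.iloc[[0]]])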
      
      for prc in proc.procs:
          if (type(prc) == FillMissing):
              fill = prc
          elif (type(prc) == Categorify):
              catf = prc
          elif (type(prc) == Normalize):
              norm = prc
      if (fill is not None):
          fill.apply_test(df_copy)
      if (catf is not None):
          catf.apply_test(df_copy)
          for c in catf.cat_names:
              df_copy[c] = (df_copy[c].cat.codes).astype(np.int64) + 1
          cats = df_copy[catf.cat_names].to_numpy()
          
      if (norm is not None):
          norm.apply_test(df_copy)
          conts = df_copy[norm.cont_names].to_numpy().astype('float32')
      
      # ugly workaround, as apparently catf.apply_test doesn't work with a lone row
      if (is_alone):
          xs = [torch.tensor([cats[0]], device=learn.data.device), torch.tensor([conts[0]], device=learn.data.device)]
      else:
          if (bs is None):
              xs = [torch.tensor(cats, device=learn.data.device), torch.tensor(conts, device=learn.data.device)]
          elif (bs > 0):
              xs = [list(chunks(l=torch.tensor(cats, device=learn.data.device), n=bs)), 
                    list(chunks(l=torch.tensor(conts, device=learn.data.device), n=bs))]

      return xs    


  def get_cust_preds(df:DataFrame, data_prep:LabelLists, learn:Learner, bs:int=None, parent=None)->Tensor: 
      '''
      Uses an existing model (learn.model) to predict on a whole new dataframe at once (learn.predict does it for
      one row at a time, which is pretty slow).
      data_prep is a LabelLists object which you can get if you split the standard databunch creation
      process into two halves. First you apply all the functions before .databunch e.g.:
      data_prep = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                     .split_by_idx(valid_idx)
                     .label_from_df(cols=dep_var, label_cls=FloatList, log=True))
      data_prep is now a valid LabelLists object
      Then you can apply .databunch as well to get DataBunch object (that is needed for learning process itself)
      e.g.
      data = data_prep.databunch(bs=BS)
      '''    
      xs = get_model_real_input(df=df, data_prep=data_prep, bs=bs) 
      learn.model.eval()  # switch dropout and batchnorm to inference mode
      if (bs is None):
          return to_np(learn.model(x_cat=xs[0], x_cont=xs[1]))
      elif (bs > 0):
          res = []
          for ca, co in zip(xs[0], xs[1]):
              res.append(to_np(learn.model(x_cat=ca, x_cont=co)))
          return np.concatenate(res, axis=0)

  def convert_dep_col(df:DataFrame, dep_col:AnyStr, learn:Learner, data_prep:LabelLists)->Tensor:
      '''
      Converts the dependent-variable column of the dataframe into a tensor that can later be compared with predictions.
      Log will be applied if it was applied to the training dataset
      '''
      actls = df[dep_col].to_numpy()[:, np.newaxis].astype('float32')  # shape (n, 1)
      actls = np.log(actls) if data_prep.log else actls
      return torch.tensor(actls, device=learn.data.device)

  def calc_loss(func:Callable, pred:Tensor, targ:Tensor, device=None)->Rank0Tensor:
      '''
      Calculates error from predictions and actuals with a given metrics function
      '''
      if (device is None):
          return func(pred, targ)
      else:        
          return func(torch.tensor(pred, device=device), targ)


  def calc_error(df:DataFrame, data_prep:LabelLists, learn:Learner, dep_col:AnyStr, 
                 func:Callable, bs:int=None)->float:
      '''
      Wrapping function to calculate error for new dataframe on existing learner (learn.model)
      See following functions' docstrings for details
      '''
      preds = get_cust_preds(df=df, data_prep=data_prep, learn=learn, bs=bs)
      actls = convert_dep_col(df, dep_col, learn, data_prep)
      error = calc_loss(func, pred=preds, targ=actls, device=learn.data.device)
      return float(error)
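get_model_real_input calls a chunks helper that is not defined in this post; it presumably comes from the fastai star import. If it is not available in your environment, a minimal equivalent (an assumption, with the same keyword names l and n that the call above uses) could look like this:

```python
def chunks(l, n):
    """Yield successive slices of `l` of length `n` (the last one may be shorter)."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Works on anything that supports len() and slicing: lists, numpy arrays, tensors.
print(list(chunks(l=list(range(5)), n=2)))
```

Because it only slices, the order of rows is preserved, which is what lets get_cust_preds concatenate the per-batch results back into one array.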

The main function here is get_cust_preds.
We use it to predict on a new dataset. Its parameters are:

df – New dataframe which you want to predict on
data_prep – LabelLists object that can be obtained during databunch creating process (see above)
learn – learner with trained model inside

The function calc_error will help you if your goal is to determine the error rather than the predictions themselves (useful when you want, e.g., to explore your data with feature importance or partial dependence techniques).
The parameters are the same, except:

dep_col – string with the column name of the dependent variable (df, obviously, should contain this column; in fact you can use the same dataframe for get_cust_preds too, as it uses only the categorical and continuous columns from training and ignores the rest)
func – function that is used to calculate the error (our metric). This function (standard or written beforehand) should take 2 parameters, predictions and actuals, and return the error (a float scalar).
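As an illustration of what func may look like, here is a small sketch of a root-mean-squared-error metric. The name rmse is my own choice, not something from the notebook. It only uses arithmetic operators and .mean(), so it should accept numpy arrays as well as torch tensors, provided predictions and actuals have the same shape.

```python
import numpy as np

def rmse(pred, targ):
    """Root mean squared error; pred and targ must have the same shape, e.g. (n, 1)."""
    return ((pred - targ) ** 2).mean() ** 0.5

# Toy check with numpy arrays shaped like the (n, 1) outputs above.
pred = np.array([[1.0], [3.0]])
targ = np.array([[1.0], [1.0]])
print(float(rmse(pred, targ)))

# Hypothetical call, assuming df, data_prep, learn and dep_var from this post:
# calc_error(df=df, data_prep=data_prep, learn=learn, dep_col=dep_var, func=rmse)
```

If the model was trained with log=True, both arguments arrive in log space, so a plain RMSE here is an error on log values.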

Hope you will find this useful for your own experiments.
As for me, I plan to implement some interpretation techniques (partial dependence, feature importance, euclidean distance for embeddings and maybe some dendrograms) for tabular data.

PS: I’ve updated the code to add batch support and do some refactoring.

Pak · April 23, 2019, 11:26am

Use learnt embedding in Random Forest

But I would like to start with something else (not exactly data interpretation).
As I’ve managed to get the inner representation of the dataframe, I immediately wondered whether I could use it as an input for a Random Forest (to use the embedding info in RF).

So I’ve made a couple of functions that do exactly that (remember that they use functions from the previous post)

    def emb_fwrd_sim(model, x_cat:Tensor, x_cont:Tensor)->Tensor:
        '''
        Part that was taken completely from the fastai Tabular model source :)
        Gets the inner representation of the input dataframe (Categorified, Filled and Normalized)
        and processes it with the embeddings 'prelayer'. Continuous variables are also processed with BatchNorm if needed.
        The output is what the model's layers get as input (embeddings are in fact not layers, but come before them)
        '''
        if model.n_emb != 0:
            x = [e(x_cat[:,i]) for i,e in enumerate(model.embeds)]
            x = torch.cat(x, 1)
            x = model.emb_drop(x)
        if model.n_cont != 0:
            x_cont = model.bn_cont(x_cont)
            x = torch.cat([x, x_cont], 1) if model.n_emb != 0 else x_cont
        return x


    def get_inner_repr(df:DataFrame, data_prep:LabelLists, learn:Learner)->Tensor: 
        '''
        Takes a new dataframe that has the categorical and continuous columns the learner was trained with
        (they are taken from the learner automatically)
        and outputs the inner representation of this data -- what the model gets after the embeddings.
        Useful e.g. for using learnt embeddings in a random forest.
        The output can be fed directly to an RF learner (after turning it into numpy if needed)
        '''
        xs = get_model_real_input(df=df, data_prep=data_prep)
        return emb_fwrd_sim(model=learn.model, x_cat=xs[0], x_cont=xs[1])  


    def list_diff(list_1, list_2):
        diff = set(list_1) - set(list_2)
        return [item for item in list_1 if item in diff]

So the full cycle (using Rossmann data as an example) is the following:
(I assume that you’ve already got df, valid_idx, cat_vars, cont_vars, dep_var and BS; train_df below is the dataframe that holds the 'Sales' column)
Embeddings’ training:

data_pre =  (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True))
data = data_pre.databunch(bs=BS)

max_log_y = np.log(np.max(train_df['Sales'])*1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)

np.random.seed(1001)
learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, 
                        y_range=y_range, metrics=exp_rmspe)

Then you train with whatever parameters you wish (learn.fit_one_cycle(6, 1e-2, wd=0.2) was my starting point)

Now we prepare data for RF-learning

from sklearn.ensemble import RandomForestRegressor

all_vars = cat_vars + cont_vars
ln = len(df)
train_idx = list_diff(list_1=range(ln), list_2=valid_idx)  # all row positions except the validation ones
tr_df = df.iloc[train_idx]
val_df = df.iloc[valid_idx]

Now turn data into inner representation (embeddings’ output)

tr_data_inner = to_np(get_inner_repr(df=tr_df[all_vars], data_prep=data_pre, learn=learn))
val_data_inner = to_np(get_inner_repr(df=val_df[all_vars], data_prep=data_pre, learn=learn))

X_train = tr_data_inner
y_train = np.log(tr_df[dep_var].to_numpy())
len(X_train), len(y_train)

X_valid = val_data_inner
y_valid = np.log(val_df[dep_var].to_numpy())
len(X_valid), len(y_valid)

def print_score(m, func):
    print(f'Train error is {func(torch.tensor(m.predict(X_train)), torch.tensor(y_train))}')
    print(f'Validation error is {func(torch.tensor(m.predict(X_valid)), torch.tensor(y_valid))}')          
    # m.score is R^2 for a regressor; sklearn expects array-likes, so pass the numpy arrays directly
    print(f'Train score is {m.score(X_train, y_train)}')
    print(f'Validation score is {m.score(X_valid, y_valid)}')

Learn RF model itself:

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=False)
m.fit(X_train, y_train)
print_score(m=m, func=exp_rmspe)

And that’s all. print_score shows you the score (R² from m.score) and the error computed with the func you choose

By the way, get_inner_repr not only turns categorical data into embedding outputs and fills NAs in continuous data, but also normalizes the continuous columns, which, strictly speaking, is not necessary for RF, but it does no harm and is more consistent with the real inner representation in the NN.

And I would be very interested in the results you achieve with Embeddings+RF, because I had no luck here. A pure NN worked better in the cases I’ve managed to test.
Hope that’s only my case.

Pak · April 24, 2019, 11:46am

Dendrogram and correlations

With these tools we will try to extract correlations from the actual data (not our model), which can potentially help us get rid of some redundancy and simply understand our data better.

import scipy.stats  # used by cramers_corrected_stat below
from scipy.cluster import hierarchy as hc

I’ve used Cramér’s V here because it gives much better results for categorical data. It can also handle NA values without manual preprocessing.
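For reference, the bias-corrected statistic that cramers_corrected_stat computes can be written as follows, where χ² is the chi-squared statistic of the r × k contingency table and n is the total count. The symbols mirror the variable names chi2, phi2, phi2corr, rcorr and kcorr in the code.

```latex
\begin{aligned}
\varphi^2 &= \frac{\chi^2}{n}, \\
\tilde\varphi^2 &= \max\!\left(0,\ \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right), \\
\tilde r &= r - \frac{(r-1)^2}{n-1}, \qquad
\tilde k = k - \frac{(k-1)^2}{n-1}, \\
\tilde V &= \sqrt{\frac{\tilde\varphi^2}{\min(\tilde k - 1,\ \tilde r - 1)}}
\end{aligned}
```

The result is 0 when there is no association; values near 1 (as in the Store pairs shown later in the thread) indicate that one column almost determines the other.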

def cramers_corrected_stat(confusion_matrix):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        Uses the bias correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = scipy.stats.chi2_contingency(confusion_matrix)[0]
    if (chi2 == 0):
        return 0.0
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))

def get_cramer_v_matr(df:DataFrame)->np.ndarray:
    '''
    Calculate Cramers V statistic for every pair in df's columns
    '''
    cols = list(df.columns)
    corrM = np.zeros((len(cols), len(cols)))
    pbar = master_bar(list(itertools.combinations(cols, 2)))
    for col1, col2 in pbar:
        _ = progress_bar(range(1), parent=pbar) #looks like fastprogress doesn't work without 2nd bar :(
        idx1, idx2 = cols.index(col1), cols.index(col2)
        corrM[idx1, idx2] = cramers_corrected_stat(pd.crosstab(df[col1], df[col2]))
        corrM[idx2, idx1] = corrM[idx1, idx2]
    np.fill_diagonal(corrM, 1.0)
    return corrM


def get_top_corr_df(df:DataFrame, corr_thr:float=0.8, corr_matr:array=None)->DataFrame:
    if (corr_matr is not None):
        corr = corr_matr
    else:
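        # build_correlation_matr is not defined in this post; pass corr_matr explicitly to avoid this branch
        corr = build_correlation_matr(df=df)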
    corr = np.where(abs(corr)<corr_thr, 0, corr)
    idxs = []
    for i in range(corr.shape[0]):
        if (corr[i, :].sum() + corr[:, i].sum() > 2):
            idxs.append(i)
    cols = df.columns[idxs]
    return pd.DataFrame(corr[np.ix_(idxs, idxs)], columns=cols, index=cols)

def get_top_corr_dict_corrs(top_corrs:DataFrame)->OrderedDict:
    cols = top_corrs.columns
    top_corrs_np = top_corrs.to_numpy()
    corr_dict = {}
    for i in range(top_corrs_np.shape[0]):
        for j in range(i+1, top_corrs_np.shape[0]):
            if (top_corrs_np[i, j] > 0):
                corr_dict[cols[i]+' vs '+cols[j]] = np.round(top_corrs_np[i, j], 3)
    return collections.OrderedDict(sorted(corr_dict.items(), key=lambda kv: abs(kv[1]), reverse=True))    

def get_top_corr_dict(df:DataFrame, corr_thr:float=0.8, corr_matr:array=None)->OrderedDict:
    '''
    Outputs top pairs of correlation in a given dataframe with a given correlation matrix
    Filters the output with a minimal correlation of corr_thr
    '''
    top_corrs = get_top_corr_df(df, corr_thr, corr_matr)
    return get_top_corr_dict_corrs(top_corrs)

Here we calculate the correlation matrix separately, as it can be a pretty long process.

corr_v = get_cramer_v_matr(df[all_vars])

List of pairs of most correlated features

get_top_corr_dict(df[all_vars], corr_thr=0.9, corr_matr=corr_v)

And now we can make some nice dendrograms

plot_dendrogram_corr(corr_v, df[all_vars].columns)

like that [image: dendrogram of the features]
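plot_dendrogram_corr itself is not shown in this post. As a rough stand-in, here is a sketch of one common way to get a dendrogram from an association matrix with scipy: turn the association into a distance (1 - |corr|) and run hierarchical clustering on it. The function names, the distance definition and the average linkage are all my assumptions, not necessarily what the notebook does.

```python
import numpy as np
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform

def corr_to_linkage(corr, method="average"):
    """Turn a symmetric association matrix into a scipy linkage matrix."""
    corr = np.asarray(corr, dtype=float)
    dist = 1.0 - np.abs(corr)        # strong association -> small distance
    dist = (dist + dist.T) / 2.0     # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)      # a feature is at distance 0 from itself
    return hc.linkage(squareform(dist, checks=False), method=method)

def plot_corr_dendrogram(corr, labels, no_plot=False):
    """Hypothetical stand-in for plot_dendrogram_corr; returns scipy's dendrogram dict."""
    z = corr_to_linkage(corr)
    return hc.dendrogram(z, labels=list(labels), orientation="left", no_plot=no_plot)

# Toy matrix: A and B are strongly associated, C is not.
toy = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
order = plot_corr_dendrogram(toy, ["A", "B", "C"], no_plot=True)["ivl"]
print(order)
```

With no_plot=False (and matplotlib installed) the same call draws the tree; on the data from this thread it would be called with corr_v and df[all_vars].columns.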

jeremyeast · April 24, 2019, 5:18pm

Hi Pavel, thanks for these great code shares, what kind of speed increase did you get before & after ?

Also other question is : did you use the m.export() file as the learner file, or did you use the m.save() file for future predictions ?

For the plot_dendogram_corr, you use the following, but I don’t see the use of get_top_corr_dict() because you do not use its return value ?

get_top_corr_dict(df[all_vars], corr_thr=0.9, corr_matr=corr_v)

Pak · April 24, 2019, 7:53pm

If you are referring to this

Then… predicting with learn.predict(row) is very inefficient (here is another example: “How to predict large unseen tabular datas with trained model?”), so in this case it’s thousands of times faster, but that’s just because learn.predict wasn’t meant for this.
In fact there may be a more convenient way to do that, probably involving something like .add_test and making sure the results aren’t shuffled, but I ended up writing my own functions because I couldn’t make it work.

I have used just
learn.save('xxx')
and
learn = learn.load('xxx');
for saving the learner between sessions

jeremyeast:

For the plot_dendogram_corr, you use the following, but I don’t see the use of get_top_corr_dict() because you do not use its return value ?
get_top_corr_dict(df[all_vars], corr_thr=0.9, corr_matr=corr_v)

That function outputs (or prints in this case) top pairs of correlation in a given dataframe.
I did not include output itself there. But it looks like this:

OrderedDict([('Store vs CompetitionDistance', 1.0),
                 ('Store vs StoreType', 0.999),
                 ('Store vs Assortment', 0.999),
                 ('Store vs PromoInterval', 0.999),
                 ('Store vs CompetitionOpenSinceYear', 0.999),
                 ('Store vs Promo2SinceYear', 0.999),
                 ('Store vs State', 0.999),
                 ('Month vs Week', 0.965),
                 ('StateHoliday vs AfterStateHoliday', 0.962)])

rbunn80130 · April 25, 2019, 5:49pm

Do you have a notebook of some problem that one could run this from start to finish? I’m having trouble trying to integrate your ideas into my current work.

Thanks,

Bob

Pak · April 25, 2019, 9:31pm

I’m working on it; my current notebook is pretty messy, so I will post it as soon as I make a clean one. I hope that will be in a couple of days.
But if you show me yours or tell me the details, maybe I could help you before that.

Pak · April 26, 2019, 6:28pm

After your post I’ve decided to finish my notebook for Rossmann data
So here it is

There I have additionally implemented Feature Importance (2 variants), Partial Dependence and Features closeness

kachun1017 · May 7, 2019, 5:07am

Hi thanks a lot, but I think it’s broke now.
because when I run data.get_processors, it said AttributeError: get_processors

Pak · May 8, 2019, 10:51am

That’s most probably because you’ve used the data object, which doesn’t actually have get_processors, instead of a LabelLists object.
That’s exactly why one should split the data creation process into 2 phases.
As it says in the notebook as well as in the docstring:

data_prep is a LabelLists object which you can get if you split standard databunch creation 
process into two phases. First you apply all the functions  before .databunch e.g.: 
data_prep = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
               .split_by_idx(valid_idx)
               .label_from_df(cols=dep_var, label_cls=FloatList, log=True))
data_prep is now a valid LabelLists object
Then you can apply .databunch as well to get DataBunch object (that is needed for learning process itself)
e.g.
data = data_prep.databunch(bs=BS)

So you should use something like
proc = data_prep.get_processors()[0][0]

kachun1017 · May 13, 2019, 9:45am

thanks a lot! Pak. I will try it later.

Dayane · July 18, 2019, 1:19am

Just to thank Pavel here for this thread. It really is a gem.

Pak · July 18, 2019, 11:36am

Thank you.
By the way, I have updated the notebook, getting rid of all the data_pre hackery, and made the code work again (it stopped doing so after some changes in fastai).