Fastai v2 tabular

I'm trying to make a starter kernel for https://www.kaggle.com/c/liverpool-ion-switching, where the only feature is a single continuous field.

When making the kernel, which is here https://www.kaggle.com/matthewchung/fastai2?scriptVersionId=29967701, I get this error:

/opt/conda/lib/python3.6/site-packages/fastai2/tabular/model.py in <listcomp>(.0)
     23 def get_emb_sz(to, sz_dict=None):
     24     "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
---> 25     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
     26 
     27 # Cell

/opt/conda/lib/python3.6/site-packages/fastcore/foundation.py in __getattr__(self, k)
    221             attr = getattr(self,self._default,None)
    222             if attr is not None: return getattr(attr, k)
--> 223         raise AttributeError(k)
    224     def __dir__(self): return custom_dir(self, self._dir() if self._xtra is None else self._dir())
    225 #     def __getstate__(self): return self.__dict__

AttributeError: classes

This is because it's expecting classes from categorical data. I tried passing in an emb_szs of None as well, which gives a different error. Suggestions?

2 Likes

Most likely FillMissing is creating binary categorical columns (it will do this for continuous variables with missing values). Include Categorify as well, then explore to.cat_names and you should see these columns (I brought this up in the WWF2 lecture too IIRC :wink: )
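For the ion-switching kernel that might look something like the sketch below (untested; the signal / open_channels column names are my guess at that competition's data, and df is assumed to be the raw training dataframe):

procs = [Categorify, FillMissing, Normalize]  # Categorify so the binary `_na` columns become proper categoricals
to = TabularPandas(df, procs, cat_names=[], cont_names=['signal'], y_names='open_channels')
print(to.cat_names)           # any `_na` columns FillMissing added show up here
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls)  # get_emb_sz can now find `.classes` for each categorical column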

3 Likes

ah. thanks!

I posted about this here but I thought I’d do a cross-post as it’s relevant:

Tabular Memory in `fastai2` - Reducing Memory Size

Just submitted (and got merged) a PR which reduces the memory overhead astronomically :slight_smile: We now have a few methods for dealing with memory usage.

  1. Set inplace to True. This makes fastai2 work on your dataframe directly instead of creating a copy in memory.
  2. Set reduce_memory to True (enabled by default). This pre-processes the chosen dataframe (in place or not), setting categorical variables to pd.Categorical and continuous variables to float32.

Both of these options can be set to False if needed, but as an example of the memory reduction, here is Rossmann:
Before inplace: 3.5gb total
After inplace: 2.6gb
After reduce_memory and inplace: 2.15gb

As you can see we reduced it by almost 40% :slight_smile: (25% just with inplace)
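For reference, here's roughly how those flags get passed (the cat/cont lists and the 'Sales' target are placeholders in the spirit of the Rossmann example):

to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=cat, cont_names=cont, y_names='Sales', splits=splits,
                   inplace=True,        # work on df directly rather than copying it
                   reduce_memory=True)  # the default: pd.Categorical cats, float32 conts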

Note: It's not quite linear so I think we're still missing something but I'm not 100% sure. For example (measured in gb from starting dataframe size):
  • 7.6gb to 11.4gb (3.8gb added, 49%)
  • 4.3gb to 9.3gb (5gb added, 116%)
  • 2.4gb to 4.5gb (2.1gb added, 85%)
  • 1.5gb to 2.4gb (.9gb added, 56%)

I have no idea how to explain this behavior, but it seems that fastai2 handles smaller and larger dataframes better than medium-sized ones.

2 Likes

I am not sure if this is useful for your optimization, but I saved this script to reduce the size of pandas dataframes (from https://www.kaggle.com/davidsalazarv95/home-credit-data-processing-for-neural-networks/notebook, but the link seems to be dead):

import numpy as np

def reduce_mem_usage(df):
    '''Iterate through all the columns of a df and modify
    the data type to reduce memory usage.'''
    
    start_mem = df.memory_usage().sum() / 1024**2  # bytes to megabytes
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            
            # downcast integer columns to the smallest int type that fits the data
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            # downcast float columns to float32 when the values fit
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
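
For reference, it would be applied to the raw dataframe before handing it to fastai:

df = reduce_mem_usage(df)  # prints the before/after memory usage and returns the downcast frame
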
2 Likes

Sadly it is not :confused: Reducing it to float32 seems to be the maximum we can do; I looked into exactly that :wink: It works in v1 but not in v2 (so something in the background maybe, spent the last day and a half on this :slight_smile: )

Ah, ok, if it is on the GPU then we need FP32 (or FP16 for mixed precision).

So, yeah, the script above is then only helpful for reducing the needed CPU RAM.
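
If GPU memory is the concern, mixed precision is a one-liner in fastai2 (a minimal sketch, assuming a Learner called learn already exists):

learn = learn.to_fp16()  # adds the mixed-precision callback so activations use FP16 on the GPU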

Thanks @MicPie for making me realize I was actually doing something wrong! (though not explicitly)

Essentially, I forgot to save away a genuinely large dataframe and was instead doing df = df.append(df) (fun fact: your memory usage goes up astronomically by doing so).

Here are my new numbers (still a good reduction):
  • 3 million rows: 6.5gb in memory went to 8.1gb, vs 10.2gb before (2gb saved)
  • 2 million rows: 4.5gb went to 5.7gb, saving 1gb

Much more rational numbers :slight_smile:

1 Like

Question: is it possible to adjust the values inside of a TabularPandas object? I’ve tried doing something like so:

dl.items[col] = x
dl.process
but when I do dl.items.head(), more than just that one column is adjusted
(I included the process call because otherwise dl.items.head() never changed)

Edit: when I do dl.items.iloc[0] it does show the change, so I'm unsure why head() didn't catch it

For those wondering why the heck that matters: this is how simple permutation importance becomes (without copying any dataframes :slight_smile: ):

def measure_col(self, name:str):
    "Measures the change in the metric after shuffling column `name`"
    col = [name]
    # also shuffle the matching `_na` indicator column if FillMissing created one
    if f'{name}_na' in self.na: col.append(f'{name}_na')
    orig = self.dl.items[col].values
    # permute the column(s) in place, re-validate, then restore the original values
    perm = np.random.permutation(len(orig))
    self.dl.items[col] = self.dl.items[col].values[perm]
    metric = learn.validate(dl=self.dl)[1]
    self.dl.items[col] = orig
    return metric
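
Purely to illustrate how measure_col might get wired up, here's a hypothetical minimal wrapper (the class name, the na set, and the 'age' column are placeholders of mine, not from the real implementation; learn is the trained Learner from earlier, which measure_col uses globally):

class PermImp:
    "Holds the DataLoader whose items we shuffle and the set of `_na` columns FillMissing added"
    def __init__(self, dl, na): self.dl, self.na = dl, na
    measure_col = measure_col  # reuse the function above as a method

imp = PermImp(dl=learn.dls.valid, na={'age_na'})
baseline = learn.validate(dl=learn.dls.valid)[1]   # metric before any shuffling
delta = imp.measure_col('age') - baseline          # change in the metric after shuffling 'age'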

From a quick glance at the pandas source, it is odd that head() would return something different from iloc, since head seems to use iloc: https://github.com/pandas-dev/pandas/blob/fd2e002e87eaabff3bd8d05bfaa037df468cd752/pandas/core/generic.py#L4685

It makes sense, but I can't quite explain its behavior, except that perhaps two different memory locations are being changed. Maybe.

(For the record the above code does work; it's not an issue so much as a strange set of interactions :slight_smile: )

You are probably right. The only other thing I could think of is maybe it's a threading thing? Maybe you are in a different thread when you call

dl.items[col] = x

as opposed to

dl.items.head() 

Hi,

  1. While trying to do regression on my own data with tabular_learner, I am getting AssertionError: Could not infer loss function from the data, please pass a loss function (see the detailed error output below).

  2. Following on from the previous problem, when I specify loss_func=mse I get extreme train_loss and valid_loss values, while fastai v1 works fine with the same data.

  3. A TabularPandas object takes ages to create, while the fastai v1 TabularList is quick.

I would appreciate your help. Thank you!

from fastai2.tabular.all import *

path = Path('tutorial_learn_path')

cont,cat = cont_cat_split(df, max_card=700, dep_var='price')
valid_inxs = df.sample(int(len(df)/5)).index
splits = IndexSplitter(list(valid_inxs))(range_of(df))
procs = [Categorify, FillMissing, Normalize]

to = TabularPandas(df, procs, cat, cont, y_names='price', splits=splits)
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, metrics=rmse)
learn.fit_one_cycle(10, 1e-3)

error message:

AssertionError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 learn = tabular_learner(dls, metrics=rmse)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/tabular/learner.py in tabular_learner(dls, layers, emb_szs, config, n_out, y_range, ps, embed_p, use_bn, bn_final, bn_cont, **kwargs)
     35     model = TabularModel(emb_szs, len(dls.cont_names), n_out, layers, ps=ps, embed_p=embed_p,
     36                          y_range=y_range, use_bn=use_bn, bn_final=bn_final, bn_cont=bn_cont, **config)
---> 37     return TabularLearner(dls, model, **kwargs)
     38
     39 # Cell

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py in __init__(self, dls, model, loss_func, opt_func, lr, splitter, cbs, metrics, path, model_dir, wd, wd_bn_bias, train_bn, moms)
     78         if loss_func is None:
     79             loss_func = getattr(dls.train_ds, 'loss_func', None)
---> 80         assert loss_func is not None, "Could not infer loss function from the data, please pass a loss function."
     81         self.loss_func = loss_func
     82         self.path = path if path is not None else getattr(dls, 'path', Path('.'))

AssertionError: Could not infer loss function from the data, please pass a loss function.

1 Like

As you're doing regression, you should use a RegressionBlock like so:

block_y = RegressionBlock() (in your TabularPandas object)

I have tried the following, but got the same 'Could not infer loss function' error:

to = TabularPandas(df, procs, cat, cont, y_names='price', splits=splits, block_y=RegressionBlock())

TabularPandas is different from TabularList as a whole. We’ve done what we can to reduce the memory overhead and time, but this is not going to change much. You can set reduce_memory to False and inplace to True and it should speed things up a bit.

Also, what does dls.show_batch() show?

dls.show_batch() shows 10 rows of data, as expected

For the time being, here is what I would recommend:

Follow the Rossmann example and generate a y_range to help your model narrow down its outputs; this should help with the losses. Also declare your loss function explicitly (not just the metric). I'll look more into this issue on my side and see if I can recreate your bug :slight_smile:
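
A rough sketch of that, in the spirit of the Rossmann notebook (the 'price' column and the 1.2 head-room factor are placeholders of mine):

max_y = float(df['price'].max()) * 1.2  # a little head-room above the largest observed target
learn = tabular_learner(dls, y_range=(0, max_y), loss_func=MSELossFlat(), metrics=rmse)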

Oh, I forgot to normalize my dependent variable, hence the inadequate losses. Now it is reasonable. Thank you!

Actually we shouldn't do this BTW :slight_smile: If your y's are very large, you can take the log of them though!

(Or at the very least it’s not normally done)
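
A minimal sketch of the log approach, assuming the dependent variable is 'price' and applying it before building the TabularPandas object:

df['price'] = np.log(df['price'])  # train on log(price); apply np.exp to the predictions to get back to the original scale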