Fastai v2 tabular

I'm trying to make a starter kernel for https://www.kaggle.com/c/liverpool-ion-switching, where the only feature is a single continuous field.

When making the kernel, which is here https://www.kaggle.com/matthewchung/fastai2?scriptVersionId=29967701, I get this error:

/opt/conda/lib/python3.6/site-packages/fastai2/tabular/model.py in <listcomp>(.0)
     23 def get_emb_sz(to, sz_dict=None):
     24     "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
---> 25     return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
     26 
     27 # Cell

/opt/conda/lib/python3.6/site-packages/fastcore/foundation.py in __getattr__(self, k)
    221             attr = getattr(self,self._default,None)
    222             if attr is not None: return getattr(attr, k)
--> 223         raise AttributeError(k)
    224     def __dir__(self): return custom_dir(self, self._dir() if self._xtra is None else self._dir())
    225 #     def __getstate__(self): return self.__dict__

AttributeError: classes

This is because it's expecting classes from categorical data. I tried passing in an emb_szs of None as well, which gives a different error. Suggestions?

2 Likes

Most likely FillMissing is creating binary categorical columns (it will do this for continuous variables with missing values). Include Categorify as well, then explore to.cat_names and you should see these columns (I brought this up in the WWF2 lecture too IIRC :wink: )
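For the ion-switching kernel that might look something like the sketch below (untested; the signal / open_channels column names are my guess at that competition's data, and df is assumed to be the raw training dataframe):

procs = [Categorify, FillMissing, Normalize]  # Categorify so the binary `_na` columns become proper categoricals
to = TabularPandas(df, procs, cat_names=[], cont_names=['signal'], y_names='open_channels')
print(to.cat_names)           # any `_na` columns FillMissing added show up here
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls)  # get_emb_sz can now find `.classes` for each categorical column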

3 Likes

ah. thanks!

I posted about this here but I thought I’d do a cross-post as it’s relevant:

Tabular Memory in `fastai2` - Reducing Memory Size

Just submitted (and got merged) a PR which reduces the memory overhead astronomically :slight_smile: We now have a few methods for dealing with memory usage.

  1. Set inplace to True. This makes fastai2 work on your dataframe directly instead of creating a copy in memory.
  2. Set reduce_memory to True (enabled by default). This pre-processes the chosen dataframe (in place or not), setting categorical variables to pd.Categorical and continuous variables to float32.

Both of these options can be set to False if needed, but as an example of the memory reduction, here is Rossmann:
Before inplace: 3.5gb total
After inplace: 2.6gb
After reduce_memory and inplace: 2.15gb

As you can see we reduced it by almost 40% :slight_smile: (25% just with inplace)
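For reference, here's roughly how those flags get passed (the cat/cont lists and the 'Sales' target are placeholders in the spirit of the Rossmann example):

to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=cat, cont_names=cont, y_names='Sales', splits=splits,
                   inplace=True,        # work on df directly rather than copying it
                   reduce_memory=True)  # the default: pd.Categorical cats, float32 conts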

Note: It's not quite linear so I think we're still missing something but I'm not 100% sure. For example (measured in gb from starting dataframe size):
  • 7.6gb to 11.4gb (3.8gb added, 49%)
  • 4.3gb to 9.3gb (5gb added, 116%)
  • 2.4gb to 4.5gb (2.1gb added, 85%)
  • 1.5gb to 2.4gb (.9gb added, 56%)

I have no idea how to explain this behavior, but it seems that fastai2 handles smaller and larger dataframes better than medium-sized ones.

2 Likes

I am not sure if this is useful for your optimization, but I saved this script to reduce the size of pandas dataframes (from https://www.kaggle.com/davidsalazarv95/home-credit-data-processing-for-neural-networks/notebook, but the link seems to be dead):

import numpy as np

def reduce_mem_usage(df):
    '''Iterate through all the columns of a df and modify
    the data type to reduce memory usage.'''
    
    start_mem = df.memory_usage().sum() / 1024**2  # bytes to megabytes
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            
            # downcast integer columns to the smallest int type that fits the data
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            # downcast float columns to float32 when the values fit
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
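
For reference, it would be applied to the raw dataframe before handing it to fastai:

df = reduce_mem_usage(df)  # prints the before/after memory usage and returns the downcast frame
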
2 Likes

Sadly it is not :confused: Reducing it to float32 seems to be the maximum we can do; I looked into exactly that :wink: It works in v1 but not in v2 (so something in the background maybe, spent the last day and a half on this :slight_smile: )

Ah, ok, if it is on the GPU then we need FP32 (or FP16 for mixed precision).

So, yeah, the script above is then only helpful for reducing the needed CPU RAM.
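
If GPU memory is the concern, mixed precision is a one-liner in fastai2 (a minimal sketch, assuming a Learner called learn already exists):

learn = learn.to_fp16()  # adds the mixed-precision callback so activations use FP16 on the GPU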

Thanks @MicPie for making me realize I was actually doing something wrong! (though not explicitly)

Essentially, I forgot to save away a genuinely large dataframe and was instead doing df = df.append(df) (fun fact: your memory usage goes up astronomically by doing so).

Here are my new numbers (still a good reduction):
  • 3 million rows: 6.5gb in memory went to 8.1gb, vs 10.2gb before (2gb saved)
  • 2 million rows: 4.5gb went to 5.7gb, saving 1gb

Much more rational numbers :slight_smile:

1 Like

Question: is it possible to adjust the values inside of a TabularPandas object? I’ve tried doing something like so:

dl.items[col] = x
dl.process
but when I do dl.items.head(), more than just that one column is adjusted
(I included the process call because otherwise dl.items.head() never changed)

Edit: when I do dl.items.iloc[0] it does show the change, so I'm unsure why head() didn't catch it

For those wondering why the heck that matters: this is how simple permutation importance becomes (without copying any dataframes :slight_smile: ):

def measure_col(self, name:str):
    "Measures the change in the metric after shuffling column `name`"
    col = [name]
    # also shuffle the matching `_na` indicator column if FillMissing created one
    if f'{name}_na' in self.na: col.append(f'{name}_na')
    orig = self.dl.items[col].values
    # permute the column(s) in place, re-validate, then restore the original values
    perm = np.random.permutation(len(orig))
    self.dl.items[col] = self.dl.items[col].values[perm]
    metric = learn.validate(dl=self.dl)[1]
    self.dl.items[col] = orig
    return metric
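
Purely to illustrate how measure_col might get wired up, here's a hypothetical minimal wrapper (the class name, the na set, and the 'age' column are placeholders of mine, not from the real implementation; learn is the trained Learner from earlier, which measure_col uses globally):

class PermImp:
    "Holds the DataLoader whose items we shuffle and the set of `_na` columns FillMissing added"
    def __init__(self, dl, na): self.dl, self.na = dl, na
    measure_col = measure_col  # reuse the function above as a method

imp = PermImp(dl=learn.dls.valid, na={'age_na'})
baseline = learn.validate(dl=learn.dls.valid)[1]   # metric before any shuffling
delta = imp.measure_col('age') - baseline          # change in the metric after shuffling 'age'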

From a quick glance at the pandas source, it is odd that head() would return something different from iloc, since head seems to use iloc: https://github.com/pandas-dev/pandas/blob/fd2e002e87eaabff3bd8d05bfaa037df468cd752/pandas/core/generic.py#L4685

It makes sense, but I can't quite explain its behavior, except that perhaps two different memory locations are being changed. Maybe.

(For the record the above code does work; it's not an issue so much as a strange set of interactions :slight_smile: )

You are probably right. The only other thing I could think of is maybe it's a threading thing? Maybe you are in a different thread when you call

dl.items[col] = x

as opposed to

dl.items.head() 

Hi,

  1. While trying to do regression on my own data with tabular_learner, I am getting AssertionError: Could not infer loss function from the data, please pass a loss function (see the detailed error output below).

  2. Following on from the previous problem, when I specify loss_func=mse I get extreme train_loss and valid_loss values, while fastai v1 works fine with the same data.

  3. A TabularPandas object takes ages to create, while the fastai v1 TabularList is quick.

I would appreciate your help. Thank you!

from fastai2.tabular.all import *

path = Path('tutorial_learn_path')

cont,cat = cont_cat_split(df, max_card=700, dep_var='price')
valid_inxs = df.sample(int(len(df)/5)).index
splits = IndexSplitter(list(valid_inxs))(range_of(df))
procs = [Categorify, FillMissing, Normalize]

to = TabularPandas(df, procs, cat, cont, y_names='price', splits=splits)
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, metrics=rmse)
learn.fit_one_cycle(10, 1e-3)

error message:

AssertionError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 learn = tabular_learner(dls, metrics=rmse)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/tabular/learner.py in tabular_learner(dls, layers, emb_szs, config, n_out, y_range, ps, embed_p, use_bn, bn_final, bn_cont, **kwargs)
     35     model = TabularModel(emb_szs, len(dls.cont_names), n_out, layers, ps=ps, embed_p=embed_p,
     36                          y_range=y_range, use_bn=use_bn, bn_final=bn_final, bn_cont=bn_cont, **config)
---> 37     return TabularLearner(dls, model, **kwargs)
     38
     39 # Cell

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py in __init__(self, dls, model, loss_func, opt_func, lr, splitter, cbs, metrics, path, model_dir, wd, wd_bn_bias, train_bn, moms)
     78         if loss_func is None:
     79             loss_func = getattr(dls.train_ds, 'loss_func', None)
---> 80         assert loss_func is not None, "Could not infer loss function from the data, please pass a loss function."
     81         self.loss_func = loss_func
     82         self.path = path if path is not None else getattr(dls, 'path', Path('.'))

AssertionError: Could not infer loss function from the data, please pass a loss function.

1 Like

As you're doing regression, you should use a RegressionBlock like so:

block_y = RegressionBlock() (in your TabularPandas object)

I have tried the following, but got the same 'Could not infer loss function' error:

to = TabularPandas(df, procs, cat, cont, y_names='price', splits=splits, block_y=RegressionBlock())

TabularPandas is different from TabularList as a whole. We’ve done what we can to reduce the memory overhead and time, but this is not going to change much. You can set reduce_memory to False and inplace to True and it should speed things up a bit.

Also, what does dls.show_batch() show?

dls.show_batch() shows 10 rows of data, as expected

For the time being, here is what I would recommend:

Follow the Rossmann example and generate a y_range to help your model narrow down its outputs; this should help with the losses. Also declare your loss function explicitly (not just the metric). I'll look more into this issue on my side and see if I can recreate your bug :slight_smile:
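
A rough sketch of that, in the spirit of the Rossmann notebook (the 'price' column and the 1.2 head-room factor are placeholders of mine):

max_y = float(df['price'].max()) * 1.2  # a little head-room above the largest observed target
learn = tabular_learner(dls, y_range=(0, max_y), loss_func=MSELossFlat(), metrics=rmse)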

Oh, I forgot to normalize my dependent variable, hence the inadequate losses. Now it is reasonable. Thank you!

Actually we shouldn't do this BTW :slight_smile: If your y's are very large, you can take the log of them though!

(Or at the very least it’s not normally done)
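
A minimal sketch of the log approach, assuming the dependent variable is 'price' and applying it before building the TabularPandas object:

df['price'] = np.log(df['price'])  # train on log(price); apply np.exp to the predictions to get back to the original scale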