Fastai v2 code walk-thru 8

Notes thanks to @pnvijay

We will be looking at tabular to start with. I will import all the modules required for running the notebook while I attempt to recreate what happens in the walk-thru.

from local.torch_basics import *
from local.test import *
from local.core import *
from local.data.all import *
from local.notebook.showdoc import show_doc
from local.tabular.core import *

We start with the 40_tabular_core.ipynb notebook. Tabular is a cool and fun notebook, says Jeremy. We look at the ADULT dataset: there are 32561 rows and 15 columns.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
len(df),len(df.columns)
(32561, 15)
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

To create a model from this dataset, we need to take the categorical variables and convert them into ints. We also check for missing values and fill them in, normally with the median, and add a column holding a binary value for whether the value was missing or not. Therefore we need to find out which are the categorical variables and which are the continuous variables, so we can apply the appropriate transforms to each. We also need to decide how to split our validation and training sets, and we need to know our dependent or target variable.

## The categorical and continuous variables are listed here
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
## The transforms that are necessary are listed here
procs = [Categorify, FillMissing, Normalize]
## This is how to split the dataframe into train and valid
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary")
%time dsrc = to.datasource(splits=splits)
CPU times: user 211 ms, sys: 11.9 ms, total: 222 ms
Wall time: 229 ms

Please refer to the code in the above cell

We have defined a class called Tabular that takes in the data frame and the procs needed to convert strings to integers, fill missing values, normalize, etc. It also takes the continuous, categorical and dependent variable names, and creates a tabular object. From this tabular object you get a DataSource by passing in the splits. Now that we have a DataSource we can create a DataLoader and show a batch from it.

dl = TabDataLoader(dsrc.valid, bs=16)
dl.show_batch()
age fnlwgt education-num workclass education marital-status occupation relationship race age_na fnlwgt_na education-num_na salary
0 26.000000 247024.998680 10.0 Private Some-college Never-married #na# Not-in-family White False False True <50k
1 26.000000 39212.001122 9.0 Private HS-grad Married-civ-spouse Machine-op-inspct Husband White False False False <50k
2 65.999999 66007.999407 9.0 Private HS-grad Widowed Priv-house-serv Not-in-family White False False False <50k
3 50.000000 305147.004464 10.0 Private Bachelors Married-civ-spouse Craft-repair Husband White False False True <50k
4 33.000000 91811.002559 9.0 Private HS-grad Separated Transport-moving Not-in-family White False False False <50k
5 24.999999 57511.997294 10.0 Private Some-college Never-married Sales Not-in-family White False False False <50k
6 38.000000 205359.000280 7.0 Private 11th Married-civ-spouse Adm-clerical Wife White False False False <50k
7 38.000000 80770.997691 13.0 Private Bachelors Married-civ-spouse Prof-specialty Wife White False False False >=50k
8 17.000000 132635.997488 10.0 Private 11th Never-married #na# Own-child White False False True <50k
9 47.000000 217161.000352 9.0 Private HS-grad Divorced Other-service Not-in-family Black False False False <50k

Now you can take a test set and say that it has the same categorical, continuous and dependent variables. Therefore the same preprocessing that was done for the training set can be done here: to.new creates a new Tabular object, and the processing (the same pre-processing as for the training set) is invoked via the .process() method. As you can see here, we are using a subclass of Tabular called TabularPandas. It is not certain that it will stay this way. We are also working on another subclass called TabularRapids, built on RAPIDS from NVIDIA, which offers GPU-accelerated dataframes.

to_tst = to.new(df_test)
to_tst.process()
to_tst.all_cols.head()
age fnlwgt education-num workclass education marital-status occupation relationship race age_na fnlwgt_na education-num_na salary
10000 0.457314 1.335777 1.157855 5 10 3 2 1 2 1 1 1 1
10001 -0.927409 1.248882 -0.427637 5 12 3 15 1 4 1 1 1 1
10002 1.040355 0.147819 -1.220383 5 2 1 9 2 5 1 1 1 1
10003 0.530194 -0.284650 -0.427637 5 12 7 2 5 5 1 1 1 1
10004 0.748835 1.438161 0.365109 6 9 3 5 1 5 1 1 1 2

There was a question on how to speed up inference on tabular learner predictions. Jeremy mentions that this will be done via RAPIDS. Jeremy mentions a Medium article written by Even Oldridge, who we know from the forums as @Even. Even is now with NVIDIA, and the article describes how, using RAPIDS, PyTorch and fastai, he placed 15th in a competition. The article explains how a deep learning based recommender system was accelerated by over 15x using the combination of RAPIDS, PyTorch and fast.ai. Jeremy mentions that they are working with Even on using RAPIDS with fastai.

Let's look at the Tabular class. The idea behind the class was to build one that has all the information and methods required. For the dataframe we need to know which variables are categorical and continuous, what preprocessing is to be done, and what the dependent variable is. The Tabular class, as you can see, starts with that information in the __init__ method. It also needs to know whether the dependent or target variable is categorical or not, and there could be more than one dependent variable. There was a question about scenarios where we would have more than one dependent variable; Jeremy mentions predicting the destination of a taxi ride, where we need both the x and y co-ordinates, or a multi-label classification problem.
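
A purely hypothetical sketch (not from the notebook) of what such a multi-target call might look like, given that y_names is plural in the signature shown below; df_taxi, dest_x and dest_y are made-up names used only for illustration:

## Hypothetical: two continuous targets (taxi drop-off co-ordinates), assuming y_names accepts a list
to_taxi = TabularPandas(df_taxi, procs, cat_names, cont_names,
                        y_names=['dest_x', 'dest_y'], is_y_cat=False)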

The code of Tabular with just the __init__ method is provided below to follow the notes.

class Tabular(CollBase, GetAttr):
    "A `DataFrame` wrapper that knows which cols are cont/cat/y, and returns rows in `__getitem__`"
    def __init__(self, df, procs=None, cat_names=None, cont_names=None, y_names=None, is_y_cat=True):
        super().__init__(df)
        store_attr(self, 'y_names,is_y_cat')
        self.cat_names,self.cont_names,self.procs = L(cat_names),L(cont_names),Pipeline(procs, as_item=True)
        self.cat_y  = None if not is_y_cat else y_names
        self.cont_y = None if     is_y_cat else y_names

The preprocessing functions, listed as procs in the code, are actually a list of transforms, so we can create a Pipeline with them. The good part is that we are using all the foundations that we learnt in the earlier walkthroughs; these foundations are used throughout fastai v2, which is a good sign. Unlike TfmdDS, TfmdDL and TfmdList, we don't do the transforms lazily in tabular. There are three reasons for that.

  1. Unlike opening an image, it doesn’t take a long time to grab a row of tabular data. So it is fine to read the whole lot of rows unless it is a big dataset.
  2. Most tabular stuff is designed to work on lots of rows quickly at a time.
  3. Most pre-processing here in tabular is not data augmentation, but more like cleaning of labels and things like that.

So all pre-processing is done ahead of time in tabular, rather than lazily, but it is still a Pipeline of transforms. We store cat_y and cont_y depending upon whether the dependent variable is categorical or continuous. Tabular inherits from CollBase, which defines the basic things required in a collection and implements them by composition. CollBase is defined in the 01_core.ipynb notebook.

class CollBase:
    "Base class for composing a list of `items`"
    def __init__(self, items): self.items = items
    def __len__(self): return len(self.items)
    def __getitem__(self, k): return self.items[k]
    def __setitem__(self, k, v): self.items[list(k) if isinstance(k,CollBase) else k] = v
    def __delitem__(self, i): del(self.items[i])
    def __repr__(self): return self.items.__repr__()
    def __iter__(self): return self.items.__iter__()

In the __init__ of Tabular we call super().__init__(df) to initialise CollBase, so that we get attributes and behaviours like items, __len__, __getitem__, __setitem__, __delitem__, etc. This is what is seen in the test: the fact that we can call .items on t and to is because of this inheritance.

df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = TabularPandas(df, cat_names='a')
t = pickle.loads(pickle.dumps(to))
test_eq(t.items,to.items)
test_eq(to.all_cols,to[['a']])
to.show() # only shows 'a' since that's the only col in `TabularPandas`

We have other useful attributes like all categorical names, all continuous names and all column names. In all_cont_names you can see that it is an addition of self.cont_names and self.cont_y. This would not work but for the use of the L class in fastai v2: if you add None to an L, it doesn't change the L, which is what we want most of the time. So L is more convenient than Python's built-in list.

## example of different behaviours
## Let's start with normal list in python
[1,2]+None
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-9-341aae41e0f9> in <module>
      1 ## example of different behaviours
      2 ## Let's start with normal list in python
----> 3 [1,2]+None


TypeError: can only concatenate list (not "NoneType") to list
## Now L in fastai v2
L([1,2])+None
## You can see below that L prints the number of items (#2) and then the items [1,2]
(#2) [1,2]

You will find that we don't have all_cols defined in Tabular, but it is still used in the test

test_eq(to.all_cols,to[['a']])

This is because of the function called _add_prop that adds the necessary attributes and names for us.

def _add_prop(cls, nm):
    prop = property(lambda o: o[list(getattr(o,nm+'_names'))])
    setattr(cls, nm+'s', prop)
    def _f(o,v): o[getattr(o,nm+'_names')] = v
    setattr(cls, nm+'s', prop.setter(_f))
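
To make this concrete, here is a small check in the same style as the tests above (assumed, not in the notebook): after _add_prop(Tabular, 'cat'), reading to.cats runs o[list(o.cat_names)], and assigning to to.cats writes through o[o.cat_names] = v.

df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = TabularPandas(df, cat_names='a')
test_eq(to.cats, to[['a']])   # the generated getter returns the categorical columns, same as to[['a']]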

We then look at TabularProc, which is a subclass of InplaceTransform. An InplaceTransform modifies its input in place: we call it, and it returns the original thing it was passed.

class InplaceTransform(Transform):
    "A `Transform` that modifies in-place and just returns whatever it's passed"
    def _call(self, fn, x, filt=None, **kwargs):
        super()._call(fn,x,filt,**kwargs)
        return x

Unlike other transforms, which return something different from what they were passed, procs in tabular modify the stored data, which is why we return what we started with. So a TabularProc returns its input when you call it, and when you do setup it does the setup but then returns itself. Let's see that with an example using Categorify, which is a subclass of TabularProc. In the setups of Categorify we create a dictionary, or vocab, for each categorical column, mapping the column name to the values contained in that column. In the encodes of Categorify we change the categorical columns to integers using the vocab created in setups. We want these to remain two separate things, because at inference time we don't want to run setups, we only want to run encodes, whereas at training time we want to do both. That's why TabularProc overrides setup so that, after setting up, it calls encodes straight away. So Categorify is similar to the Categorize transform used for dependent variables in image processing, but somewhat different, as it is a tabular proc.
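
Before looking at Categorify itself, here is a rough sketch of the idea behind TabularProc (paraphrased; not the exact library source):

class TabularProc(InplaceTransform):
    "A non-lazy tabular proc: rough sketch of the idea, not the exact source"
    def setup(self, items=None):
        super().setup(getattr(items, 'train', items))  # setups sees the training split if there is one
        return self(items)                             # then encodes is applied straight away and the result returned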

class Categorify(TabularProc):
    "Transform the categorical variables to that type."
    order = 1
    def setups(self, dsrc):
        self.classes = {n:CategoryMap(getattr(dsrc,'train',dsrc).iloc[:,n].items, add_na=True) for n in dsrc.all_cat_names}

    def _apply_cats (self, c): return c.cat.codes+1 if is_categorical_dtype(c) else c.map(self[c.name].o2i)
    def _decode_cats(self, c): return c.map(dict(enumerate(self[c.name].items)))
    def encodes(self, to): to.transform(to.all_cat_names, self._apply_cats)
    def decodes(self, to): to.transform(to.all_cat_names, self._decode_cats)
    def __getitem__(self,k): return self.classes[k]

There were a couple of questions in the meantime. One was whether the patch_property method defined in fastai v2 calls _add_prop, since the person asking couldn't add a setter using patch_property; Jeremy explains that both patch_to and patch_property don't add any setters. Another was whether we are doing object detection in fastai v2; Jeremy mentions that we will touch upon it briefly. We then go on to the tests for Categorify.

df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.setup()

When we call to.setup we call the setups method in Categorify, and then encodes is applied, which calls the _apply_cats method. The _apply_cats method maps the column values using the vocab that was created earlier. map is a handy method in pandas: here we are mapping with a dictionary, though we can also use a function (functions are slower when used in map). Let's look at the vocab created. When the vocab is created it always starts with '#na#', to account for any items not seen before that could turn up later.
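
As a tiny illustration of that dictionary-based mapping (plain pandas, not from the notebook; pd is already imported above):

s = pd.Series([0, 1, 2, 0, 2])
o2i = {'#na#': 0, 0: 1, 1: 2, 2: 3}   # the reverse vocab that Categorify builds for this column
list(s.map(o2i))                      # -> [1, 2, 3, 1, 3], matching the test below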

cat = to.procs.categorify
test_eq(cat['a'], ['#na#',0,1,2])
test_eq(to.a, [1,2,3,1,3])

So in this dataframe, df = pd.DataFrame({'a':[0,1,2,0,2]}), we create a vocab which is ['#na#',0,1,2]. Then 0 in the dataframe maps to 1 (its index in the vocab), 1 maps to 2, and 2 maps to 3. One of the things Jeremy has recently added to Pipeline is __getattr__: if the pipeline comes across an attribute that is not defined on it, it tries to find it in one of the transforms inside the pipeline, which is what we want. So for the below

cat = to.procs.categorify

we know that to.procs is a pipeline. This can be seen in the below code.

df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.setup()
type(to.procs)
local.data.pipeline.Pipeline

So when it comes across to.procs.categorify, it looks for a transform of type Categorify among its transforms. Note the conversion to snake case: Categorify becomes categorify. We use getattr to do this, and this is consistent across v1, the recently concluded part 2 course, and now v2. In the below code

to = TabularPandas(df, Categorify, 'a')

we have passed the type Categorify as a proc rather than an instance; the instantiated transform is what we get back when we call cat = to.procs.categorify. Now that we have the instance, asking for cat['a'] goes through __getitem__ in Categorify, which returns self.classes[k], the vocab for that column. We can see the vocab, find its type using type(), and also look at the reverse mapping using o2i.

cat = to.procs.categorify
cat['a']
(#4) [#na#,0,1,2]
type(cat['a'])
local.data.core.CategoryMap
cat['a'].o2i
defaultdict(int, {'#na#': 0, 0: 1, 1: 2, 2: 3})

There was a question on whether this would take care of the mapping for a test set too, and what happens if it comes across something new. The answer is yes: it takes care of the mapping for the test set too, and it will use '#na#' for anything new that it has not encountered before. Now we come to inference time. We want to use the same metadata but pass in new values. So we use the existing tabular object to and create a new tabular object to1 that uses the same metadata (vocab, categorical/continuous column names, dependent variable etc.) from to.

df1 = pd.DataFrame({'a':[1,0,3,-1,2]})
to1 = to.new(df1)

Then we call to1.process(). The code for process in Tabular is

def process(self): self.procs(self)

This means that it calls all the (preprocessing) transforms defined as procs. In the data passed for inference, df1 = pd.DataFrame({'a':[1,0,3,-1,2]}), the values 3 and -1 are new: they were not in the vocab of the training set, which as we know is (#4) [#na#,0,1,2]. So when we call to1.a it returns [2,1,0,0,3], where 3 and -1 are now mapped to 0 (which is '#na#').

to1.process()
to1['a']
0    2
1    1
2    0
3    0
4    3
Name: a, dtype: int64
list(to1.a)
[2, 1, 0, 0, 3]

So when we call decode on the same, we get back slightly changed data, where 3 and -1 are now represented as '#na#'. So decodes can give a different value from what you started with, depending on the transforms: a Normalize transform would give back the same data, but with Categorify the values can change.

to2=cat.decode(to1)
list(to2['a'])
[1, 0, '#na#', '#na#', 2]

Another way to use the Categorify transform is to instantiate it first and then use it. Here we instantiate it with cat = Categorify() and then pass it in the Tabular call. We can also convert the tabular object to a datasource using the datasource method defined in Tabular; we only need to specify the splits. Since we split the dataframe into two using the indexes, you will find that values that only appear in the validation set are not reflected in the vocab, as the vocab is built from the training set.

Training - [0,1,2]

Validation - [3,2]

vocab - [’#na#’,0,1,2]

to[‘a’] = [1,2,3,0,3] with value of 3 being substituted by index value of ‘#na#’ which is 0.

cat = Categorify()
df = pd.DataFrame({'a':[0,1,2,3,2]})
to = TabularPandas(df, cat, 'a')
dsrc = to.datasource([[0,1,2],[3,4]])
test_eq(cat['a'], ['#na#',0,1,2])
test_eq(to['a'], [1,2,3,0,3])

Here we did not call setup, because datasource calls setup for us. So if we look at the datasource code

def datasource(self, splits=None):
    if splits is None: splits=[range_of(self)]
    self.items = self.items.iloc[sum(splits, [])].copy()
    res = DataSource(self, filts=[range(len(splits[0])), range(len(splits[0]), len(self))], tfms=[None])
    self.procs.setup(res)
    return res

We have all the values we need, like the dataframe and the train and valid splits. So setup gets called either when we build the tabular object or when we build the datasource; we do it this way to prevent multiple copies and other inefficiencies from creeping into the system. Jeremy explains that the code for Tabular is quite compact, with only __init__ and datasource running to more than a line or two. datasource has more lines because in RAPIDS, on the GPU, it takes a long time to index into dataframes with arbitrary indexes; you have to pass a list of contiguous indexes to make RAPIDS fast. So when you pass splits, they are concatenated into a single list and we index into the dataframe with that list. The DataSource is then created with contiguous ranges passed in as the filts. That's why the code is more than one line.
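
To make the index handling concrete, here is the same arithmetic in plain Python (just illustrating the two lines above):

splits = [[0, 1, 2], [3, 4]]
sum(splits, [])                                     # -> [0, 1, 2, 3, 4]: one contiguous list to index the DataFrame with
[range(len(splits[0])), range(len(splits[0]), 5)]   # -> [range(0, 3), range(3, 5)]: the filts passed to DataSource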

In Python we cannot put the @property decorator on the same line as the definition it decorates. For example

@property def process(self): self.procs(self)

The above code is not valid. So in fastai v2 there is an alternative helper called properties which achieves the objective. The code has also been written to make Tabular look a lot like a dataframe; that's why we inherit from GetAttr. This means that any unknown attribute will be passed on to the default property, which is self.items in the code.

def default(self): return self.items
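
A hedged illustration of that delegation (assumed, not in the notebook, and assuming GetAttr delegates any attribute it can find on default):

df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = TabularPandas(df, cat_names='a')
test_eq(list(to.columns), ['a','b'])   # `columns` isn't defined on Tabular, so GetAttr passes it to the DataFrame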

So it behaves a lot like a dataframe, because anything unknown is passed down to the dataframe. Here we also want to index using row numbers and column names together, which is not possible with a plain dataframe, so an iloc property is provided to make that kind of indexing possible. The class that gets this done is _TabIloc. The code for that is

class _TabIloc:
    "Get/set rows by iloc and cols by name"
    def __init__(self,to): self.to = to
    def __getitem__(self, idxs):
        df = self.to.items
        if isinstance(idxs,tuple):
            rows,cols = idxs
            cols = df.columns.isin(cols) if is_listy(cols) else df.columns.get_loc(cols)
        else: rows,cols = idxs,slice(None)
        return self.to.new(df.iloc[rows, cols])
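
A hedged usage sketch (not from the notebook): positional rows combined with column names, something plain DataFrame.iloc does not allow.

df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = TabularPandas(df, cat_names='a')
subset = to.iloc[[0, 2], ['a']]          # rows 0 and 2 by position, column 'a' by name
test_eq(subset.items.shape, (2, 1))      # a new TabularPandas wrapping a 2x1 DataFrame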

It also wraps the result back into a tabular object, as the sketch above shows. Then we look at the encodes code of Categorify. Here we call transform on the categorical columns. We already saw the _apply_cats method which provides the mapping. The mapping is not done if a column is a Pandas Categorical dtype, because in that case Pandas has already done the mapping for us. The way _apply_cats is applied to the all_cat_names columns is via the transform method, which is defined in the TabularPandas class.

class TabularPandas(Tabular):
    def transform(self, cols, f): self[cols] = self[cols].transform(f)

Here self[cols].transform(f) uses Pandas' built-in transform call, which applies f to each column (a Series). Then we look at a test example wherein Pandas Categorical columns are used; here, instead of the vocab mapping, we use Pandas' own category codes, as is visible in the _apply_cats method.

def _apply_cats (self, c): return c.cat.codes+1 if is_categorical_dtype(c) else c.map(self[c.name].o2i)
df = pd.DataFrame({'a':pd.Categorical(['M','H','L','M'], categories=['H','M','L'], ordered=True)})
to = TabularPandas(df, Categorify, 'a')
cat = to.procs.categorify
to.setup()
test_eq(cat['a'], ['#na#','H','M','L'])
test_eq(to.a, [2,1,3,2])
to2 = cat.decode(to)
test_eq(to2.a, ['M','H','L','M'])

Now we look at Normalize. Here in encodes we subtract the means from the values and divide by the standard deviations. We have a line of code (partially shown here)

df = getattr(dsrc,'train',dsrc)

The above code is the same as

df = dsrc.train if hasattr(dsrc,'train') else dsrc

We have this code so that if we have a datasource with 'train' and 'valid' splits, it will take from the training split, otherwise it will take the complete datasource. The complete line is actually

df = getattr(dsrc,'train',dsrc).conts

This means we take the continuous columns and then get their means and standard deviations. We now look at a test case wherein we have initialised Normalize and then created the tabular object; we then call setup on it. We create a similar array so that its mean and standard deviation are the same as those of the dataframe used in the tabular object.

norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
to.setup()
x = np.array([0,1,2,3,4])
m,s = x.mean(),x.std()

We then check that norm.means['a'] is the same as x.mean(). We are able to call norm.means['a'] because of the code in setups

class Normalize(TabularProc):
    "Normalize the continuous variables."
    order = 2
    def setups(self, dsrc):
        df = getattr(dsrc,'train',dsrc).conts
        self.means,self.stds = df.mean(),df.std(ddof=0)+1e-7

    def encodes(self, to): to.conts = (to.conts-self.means) / self.stds
    def decodes(self, to): to.conts = (to.conts*self.stds ) + self.means

wherein we set self.means to df.mean(). A Pandas dataframe returns a Series when we call mean, which can be indexed into using the column names. That is why we are able to use norm.means['a'].
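
A tiny standalone check of that behaviour (plain pandas, assumed):

conts = pd.DataFrame({'a':[0,1,2,3,4]})
means = conts.mean()        # a Series indexed by column name
test_eq(means['a'], 2.0)    # so norm.means['a'] is just a lookup into that Series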

We see the tests for Normalize using tabular object setup and inference using the metadata from the train data. We also see an example where datasource is created and setup is used there.

norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
to.setup()
x = np.array([0,1,2,3,4])
m,s = x.mean(),x.std()
df1 = pd.DataFrame({'a':[5,6,7]})
to1 = to.new(df1)
to1.process()
test_close(to1['a'].values, (np.array([5,6,7])-m)/s)
to2 = norm.decode(to1)
test_close(to2.a.values, [5,6,7])
norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
dsrc = to.datasource([[0,1,2],[3,4]])
x = np.array([0,1,2])
m,s = x.mean(),x.std()
test_eq(norm.means['a'], m)
test_close(norm.stds['a'], s)
test_close(to['a'].values, (np.array([0,1,2,3,4])-m)/s)

We now look at FillMissing and FillStrategy. FillMissing goes through each of the continuous columns and notes down the columns that have null values in them. It then creates a dictionary mapping those column names to the fill value computed by the chosen FillStrategy. We have three options to choose from in FillStrategy: median, constant and mode; by default the median is used. In the encodes of FillMissing we fill the missing values with the result of that computation.

class FillMissing(TabularProc):
    "Fill the missing values in continuous columns."
    def __init__(self, fill_strategy=FillStrategy.median, add_col=True, fill_vals=None):
        if fill_vals is None: fill_vals = defaultdict(int)
        store_attr(self, 'fill_strategy,add_col,fill_vals')

    def setups(self, dsrc):
        df = getattr(dsrc,'train',dsrc).conts
        self.na_dict = {n:self.fill_strategy(df[n], self.fill_vals[n])
                        for n in pd.isnull(df).any().keys()}

    def encodes(self, to):
        missing = pd.isnull(to.conts)
        for n in missing.any().keys():
            assert n in self.na_dict, f"nan values in `{n}` but not in setup training set"
            to[n].fillna(self.na_dict[n], inplace=True)
            if self.add_col:
                to.loc[:,n+'_na'] = missing[n]
                if n+'_na' not in to.cat_names: to.cat_names.append(n+'_na')
                    
class FillStrategy:
    "Namespace containing the various filling strategies."
    def median  (c,fill): return c.median()
    def constant(c,fill): return fill
    def mode    (c,fill): return c.dropna().value_counts().idxmax()

We can also create multiple tabular objects using different FillStrategy options, as this example makes clear. Here we create three FillMissing procs, one for each of the three FillStrategy options, and then create a different tabular object from each.

fill1,fill2,fill3 = (FillMissing(fill_strategy=s) 
                     for s in [FillStrategy.median, FillStrategy.constant, FillStrategy.mode])
df = pd.DataFrame({'a':[0,1,np.nan,1,2,3,4]})
df1 = df.copy(); df2 = df.copy()
tos = TabularPandas(df, fill1, cont_names='a'),TabularPandas(df1, fill2, cont_names='a'),TabularPandas(df2, fill3, cont_names='a')
for t in tos: t.setup()
test_eq(fill1.na_dict, {'a': 1.5})
test_eq(fill2.na_dict, {'a': 0})
test_eq(fill3.na_dict, {'a': 1.0})

There was a question on whether setups should be called by the constructor. The answer is no, as we call setup (via TypeDispatch) only after we have enough information about what we want to set up with. This is usually available when we create the DataSource and know the training and validation sets, and what to do with those sets. That's why it is there in datasource, but it is also available to use before the datasource as well.

We see an example wherein we have included all the procs: Normalize, Categorify, FillMissing and noop. They can be used together in a Pipeline and they will be applied only to the relevant columns; Normalize, for instance, will be used only on the continuous columns.

procs = [Normalize, Categorify, FillMissing, noop]
df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,1,np.nan,1,2,3,4]})
to = TabularPandas(df, procs, cat_names='a', cont_names='b')
to.setup()

#Test setup and apply on df_main
test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to.a, [1,2,3,2,2,3,1])
test_eq(to.b_na, [1,1,2,1,1,1,1])
x = np.array([0,1,1.5,1,2,3,4])
m,s = x.mean(),x.std()
test_close(to.b.values, (x-m)/s)
test_eq(to.procs.classes, {'a': ['#na#',0,1,2], 'b_na': ['#na#',False,True]})

We now get to a dataset that has categorical columns, continuous columns and dependent variables. To process them we need three different tensors, as each has a different data type. This is achieved by the ReadTabBatch transform, which is an ItemTransform. Its encodes takes the tabular object and converts the categorical column values to a long tensor, the continuous column values to a float tensor, and the dependent variable to a long tensor (if categorical) or a float tensor (if continuous). The categorical and continuous values are returned together as a tuple after they are converted to the appropriate tensors.

class ReadTabBatch(ItemTransform):
    def __init__(self, to): self.to = to
    # TODO: use float for cont targ
    def encodes(self, to): return (tensor(to.cats).long(),tensor(to.conts).float()), tensor(to.targ).long()

df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,np.nan,1,1,2,3,4], 'c': ['b','a','b','a','a','b','a']})
to = TabularPandas(df, procs, cat_names='a', cont_names='b', y_names='c')
to.datasource(splits=[[0,1,4,6], [2,3,5]])

test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to.a, [1,2,2,1,0,2,0])
test_eq(df.a.dtype,int)
test_eq(to.b_na, [1,2,1,1,1,1,1])
test_eq(to.c, [2,1,1,1,2,1,2])

We look at the example involving the ADULT dataset, but here we do not use the ReadTabBatch transform explicitly. That is because we are using TabDataLoader, which is a subclass of TfmdDL; in its after_batch transforms, ReadTabBatch is automatically added to the other transforms.

@delegates()
class TabDataLoader(TfmdDL):
    do_item = noops
    def __init__(self, dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, **kwargs):
        after_batch = L(after_batch)+ReadTabBatch(dataset.items)
        super().__init__(dataset, bs=bs, shuffle=shuffle, after_batch=after_batch, num_workers=num_workers, **kwargs)

    def create_batch(self, b): return self.dataset.items.iloc[b]

Here in tabular data, and also for RAPIDS, we want to take data one batch at a time, not individual rows. That's why do_item = noops is set in the code, to prevent single rows being fetched instead of a batch. Then create_batch is replaced to get items from the dataset using iloc on the whole batch of indexes. RAPIDS uses a similar approach. This is also one of the reasons why we replace the PyTorch DataLoader: to be able to use a batch-based data loader in tabular. Jeremy will add an example showing how to do this with a DataBunch as well. The rest of the code on the ADULT example follows the same things explained in this lecture.
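
As a hedged sketch (not the notebook's exact code) of how that might look with the ADULT DataSource created earlier:

## Hedged sketch: train/valid TabDataLoaders built from `dsrc`, wrapped in a DataBunch.
## Extra keyword arguments like shuffle are delegated to TfmdDL via @delegates().
train_dl = TabDataLoader(dsrc.train, bs=64, shuffle=True)
valid_dl = TabDataLoader(dsrc.valid, bs=128)
dbch = DataBunch(train_dl, valid_dl)
dbch.train_dl.show_batch()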

#export 
def _add_prop(cls, nm):
    prop = property(lambda o: o[list(getattr(o,nm+'_names'))])
    setattr(cls, nm+'s', prop)
    def _f(o,v): o[getattr(o,nm+'_names')] = v
    setattr(cls, nm+'s', prop.setter(_f))

_add_prop(Tabular, 'cat')

Trying to understand how this code works, but not making much progress tbh.

I think I understand that essentially it is exposing nm+'_names' as nm+'s'. So cat_names can also be accessed with cats.

However cat_names inside obj returns a list of categorical var names but cats is returning a Series.

df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = TabularPandas(df, cat_names='a')

to.cat_names
>>> (#1) [a]

to.cats
>>> 

    a
0	0
1	1
2	2
3	0
4	2

If you do understand, then help is appreciated :slight_smile:

Thank you!

Try stepping through it in a debugger, and look closely at the values of each intermediate step to see what you find.


_add_prop is a convenience method. Jeremy explains it line by line here: _add_prop. Probably you just missed that.


Thanks for the explanation @fabris. From the code, my understanding is that the attribute Tabular.all_cols is set from the attribute Tabular.all_col_names. I can see this is a property attribute when I call

Tabular.__dict__

I see

mappingproxy({'__module__': 'local.tabular.core',
              '__doc__': 'A `DataFrame` wrapper that knows which cols are cont/cat/y, and returns rows in `__getitem__`',
              '__init__': <function local.tabular.core.Tabular.__init__(self, df, procs=None, cat_names=None, cont_names=None, y_names=None, is_y_cat=True)>,
              'datasource': <function local.tabular.core.Tabular.datasource(self, splits=None)>,
              'copy': <function local.tabular.core.Tabular.copy(self)>,
              'new': <function local.tabular.core.Tabular.new(self, df)>,
              'show': <function local.tabular.core.Tabular.show(self, max_n=10, **kwargs)>,
              'setup': <function local.tabular.core.Tabular.setup(self)>,
              'process': <function local.tabular.core.Tabular.process(self)>,
              'iloc': <property at 0x1253f08f0>,
              'targ': <property at 0x1253f0950>,
              'all_cont_names': <property at 0x1253f09b0>,
              'all_cat_names': <property at 0x1253f0a10>,
              'all_col_names': <property at 0x1253f0a70>,
              'default': <property at 0x1253f0ad0>,
              'cats': <property at 0x1253f0bf0>,
              'all_cats': <property at 0x1253f0c50>,
              'conts': <property at 0x1253f0cb0>,
              'all_conts': <property at 0x1253f0d10>,
              'all_cols': <property at 0x1253f0d70>})

I think you might be missing the o[...] bit in the definition.

Thanks Jeremy for pointing this out. I did some debugging based on this. This is what I found out.

Tabular is a class that we define with some functions. But then we add

properties(Tabular,'iloc','targ','all_cont_names','all_cat_names','all_col_names','default')

to it. As a result the methods/attributes iloc, targ, all_cont_names, all_cat_names, all_col_names and default become properties. We can see this via the properties code in 01_core.ipynb.

def properties(cls, *ps):
    "Change attrs in `cls` with names in `ps` to properties"
    for p in ps: setattr(cls,p,property(getattr(cls,p)))

Now a property has fget, fset, fdel and doc, as can be seen in the signature.

property(fget=None, fset=None, fdel=None, doc=None)

When we create the _add_prop function and then pass Tabular and cat into it, we see that the code initially creates a property with fget.

prop = property(lambda o: o[list(getattr(o,nm+'_names'))])

Then it sets the attribute cls.nm+'s' to prop, whose fget is that lambda. So now whenever we access cls.nm+'s' we get cls[list(getattr(cls, nm+'_names'))]. In the case of Tabular it will be Tabular[list(Tabular.cat_names)], which is the value of all of Tabular's categorical columns.

setattr(cls, nm+'s', prop)

We then define a function _f and set cls.nm+'s' again, this time to prop.setter(_f), so the property gains a setter.

def _f(o,v): o[getattr(o,nm+'_names')] = v
setattr(cls, nm+'s', prop.setter(_f))

Conceptually, it looks like assigning to Tabular.cats will set the values of the Tabular.cat_names columns, but I am not able to understand how o and v are passed in prop.setter(_f).

When I run Tabular.cats?? I see this output.

Type:        property
String form: <property object at 0x12f60dbf0>
Source:     
# Tabular.cats.fget
prop = property(lambda o: o[list(getattr(o,nm+'_names'))])

# Tabular.cats.fset
def _f(o,v): o[getattr(o,nm+'_names')] = v

Hope my understanding is correct. It would be good to understand the prop.setter(_f) better.

Hi @pnvijay & @jeremy, thanks very much for your help! I believe I have understood this piece of code :slight_smile:


@pnvijay the answer to this question lies in the Python data model, here.

From the docs itself,

class Parrot:
    def __init__(self):
        self._voltage = 100000

    @property
    def voltage(self):
        """Get the current voltage."""
        return self._voltage

The @property decorator turns the voltage() method into a “getter” for a read-only attribute with the same name, and it sets the docstring for voltage to “Get the current voltage.”

Read-only attribute prop

So, let's update _add_prop as follows (I have replaced prop with f, sorry I was being lazy!):

def _add_prop(cls, nm):
    @property
    def f(o): return o[list(getattr(o,nm+'_names'))]
    setattr(cls, nm+'s', f)
    def _f(o,v): o[getattr(o,nm+'_names')] = v
    setattr(cls, nm+'s', f.setter(_f))

This does the exact same thing as before; I have just replaced the lambda with the function f, and the decorator @property followed by f is the same as property(f).
Therefore, when we setattr nm+'s' to f, exactly as in the Parrot example, we are converting nm+'s' to a "read-only" attribute.

We can't set its value, and this can be confirmed by doing something like:

def _add_prop(cls, nm):
    @property
    def f(o): return o[list(getattr(o,nm+'_names'))]
    setattr(cls, nm+'s', f)
    #def _f(o,v): o[getattr(o,nm+'_names')] = v
    #setattr(cls, nm+'s', f.setter(_f))

If you do something like

df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = TabularPandas(df, cat_names='a')
to.cats = None

This will raise an AttributeError, because right now cats is only a read-only attribute.

Add setter

To be able to set this attribute's value, we will have to add a setter.

From the docs, this can be done like so:

class C:
    def __init__(self):
        self._x = None

    @property
    def x(self):
        """I'm the 'x' property."""
        return self._x

    @x.setter
    def x(self, value):
        self._x = value

Let's do some refactoring again; our _add_prop is the same as:

def _add_prop(cls, nm):
    @property
    def f(o): return o[list(getattr(o,nm+'_names'))]
    setattr(cls, nm+'s', f)
    
    @f.setter
    def fset(o, v):
        o[getattr(o,nm+'_names')] = v
    setattr(cls, nm+'s', fset)

As you can see, we have now added a getter and a setter for the cats attribute.

Hope this helps! :slight_smile:


Thanks @arora_aman. This explains things well. But I have two queries still.

  1. Why do we need a setter object here unless we want the possibility to set the value of these attributes to some other things later on?

I ran the following test after creating Tabular.

def _add_prop_1(cls, nm):
    prop = property(lambda o: o[list(getattr(o,nm+'_names'))])
    setattr(cls, nm+'s', prop)

_add_prop_1(Tabular, 'cat')

df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = Tabular(df, cat_names='a')

to.cats

The output was

	a
0	0
1	1
2	2
3	0
4	2

This means that to.cats is still able to reference to.cat_names. When I run to.cat_names I get (#1) [a]. So the getter works fine without a setter in place, and I presume we add the setter to allow a possible reassignment if needed.

  2. I still don't understand the setter function _f code. The code is
def _f(o,v): o[getattr(o,nm+'_names')] = v

When we call the below function

setattr(cls,nm+'s',prop.setter(_f))

How are o and v passed? What is o and what is v? It would be great if you could provide some insight into this.

In my humble opinion, because processes like fix_missing need to update the values, you need a setter.

So when you call the setter on cats, it sets the value of df[cat_names], if you look at the code closely. This is what cats represents: you take those df[cat_names] columns and set them to some value v.


Thanks Aman!


setattr(cls,nm+'s',prop.setter(_f)) is doing nothing but setting _f as the setter, which is the same as

    @f.setter
    def fset(o, v):
        o[getattr(o,nm+'_names')] = v
    setattr(cls, nm+'s', fset)
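
To make the calling convention concrete, here is a tiny standalone illustration (plain Python, not fastai code) of how o and v get passed:

class Demo:
    def __init__(self): self.items = {'a': 0}

def _getter(o):    return o.items['a']
def _setter(o, v): o.items['a'] = v

Demo.a = property(_getter, _setter)   # same shape as setattr(cls, nm+'s', prop.setter(_f))

d = Demo()
d.a = 5       # Python calls _setter(d, 5): o is the instance, v is the assigned value
print(d.a)    # Python calls _getter(d) and prints 5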

Very happy to provide more explanation :slight_smile:


Good job @arora_aman - your rewritten version is clearer than what I had, so I’m going to replace my version with yours :slight_smile:


@jeremy
This is a big deal for me! Thank you! :slight_smile:

1 Like