Notes thanks to @pnvijay
We will be looking at tabular to start with. We start with the `40_tabular_core.ipynb` notebook. I will import all the modules required for running it while I attempt to recreate what happens in it.
from local.torch_basics import *
from local.test import *
from local.core import *
from local.data.all import *
from local.notebook.showdoc import show_doc
from local.tabular.core import *
Tabular is a cool and fun notebook, says Jeremy. We look at the ADULT dataset; there are 32,561 rows and 15 columns.
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
len(df),len(df.columns)
(32561, 15)
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
To create a model from this dataset, we need to take the categorical variables and convert them into ints. We also check for missing values and fill them, normally with the median, and we add a column holding a binary value indicating whether the original value was missing. Therefore we need to find out which variables are categorical and which are continuous, so we can apply the appropriate transforms to each. We also need to decide how to split our training and validation sets, and we need to know our dependent or target variable.
## The categorical and continuous variables are listed here
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
## The transforms that are necessary are listed here
procs = [Categorify, FillMissing, Normalize]
## This is how to split the dataframe into train and valid
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary")
%time dsrc = to.datasource(splits=splits)
CPU times: user 211 ms, sys: 11.9 ms, total: 222 ms
Wall time: 229 ms
Please refer to the code in the above cell. We have defined a class called `Tabular` that takes in the dataframe and the procs needed to convert categories to integers, fill missing values, normalize, and so on. It holds the continuous, categorical and dependent variables, and it creates a tabular object. From this tabular object you get a `DataSource` by passing in the splits. Now that we have a `DataSource`, we can create a `DataLoader` and then call `show_batch` on it.
dl = TabDataLoader(dsrc.valid, bs=16)
dl.show_batch()
| | age | fnlwgt | education-num | workclass | education | marital-status | occupation | relationship | race | age_na | fnlwgt_na | education-num_na | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26.000000 | 247024.998680 | 10.0 | Private | Some-college | Never-married | #na# | Not-in-family | White | False | False | True | <50k |
1 | 26.000000 | 39212.001122 | 9.0 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | False | False | <50k |
2 | 65.999999 | 66007.999407 | 9.0 | Private | HS-grad | Widowed | Priv-house-serv | Not-in-family | White | False | False | False | <50k |
3 | 50.000000 | 305147.004464 | 10.0 | Private | Bachelors | Married-civ-spouse | Craft-repair | Husband | White | False | False | True | <50k |
4 | 33.000000 | 91811.002559 | 9.0 | Private | HS-grad | Separated | Transport-moving | Not-in-family | White | False | False | False | <50k |
5 | 24.999999 | 57511.997294 | 10.0 | Private | Some-college | Never-married | Sales | Not-in-family | White | False | False | False | <50k |
6 | 38.000000 | 205359.000280 | 7.0 | Private | 11th | Married-civ-spouse | Adm-clerical | Wife | White | False | False | False | <50k |
7 | 38.000000 | 80770.997691 | 13.0 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | White | False | False | False | >=50k |
8 | 17.000000 | 132635.997488 | 10.0 | Private | 11th | Never-married | #na# | Own-child | White | False | False | True | <50k |
9 | 47.000000 | 217161.000352 | 9.0 | Private | HS-grad | Divorced | Other-service | Not-in-family | Black | False | False | False | <50k |
Now you can take a test set and say that it has the same categorical, continuous and dependent variables, so the same preprocessing that was done for the training set can be applied here. `to.new` creates a new `Tabular` object, and the processing (the same pre-processing as for the training set) is invoked via the `.process()` method. As you can see, we are using a subclass of `Tabular` called `TabularPandas`; it is not certain that this will remain the same. We are also working on another subclass called `TabularRapids`, built on RAPIDS from NVIDIA, which offers GPU-accelerated dataframes.
to_tst = to.new(df_test)
to_tst.process()
to_tst.all_cols.head()
| | age | fnlwgt | education-num | workclass | education | marital-status | occupation | relationship | race | age_na | fnlwgt_na | education-num_na | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10000 | 0.457314 | 1.335777 | 1.157855 | 5 | 10 | 3 | 2 | 1 | 2 | 1 | 1 | 1 | 1 |
10001 | -0.927409 | 1.248882 | -0.427637 | 5 | 12 | 3 | 15 | 1 | 4 | 1 | 1 | 1 | 1 |
10002 | 1.040355 | 0.147819 | -1.220383 | 5 | 2 | 1 | 9 | 2 | 5 | 1 | 1 | 1 | 1 |
10003 | 0.530194 | -0.284650 | -0.427637 | 5 | 12 | 7 | 2 | 5 | 5 | 1 | 1 | 1 | 1 |
10004 | 0.748835 | 1.438161 | 0.365109 | 6 | 9 | 3 | 5 | 1 | 5 | 1 | 1 | 1 | 2 |
There was a question on how to speed up inference on tabular learner predictions. Jeremy mentions that this will be done via RAPIDS. He points to a Medium article written by Even Oldridge, whom we know from the forums as @Even. Even is now with NVIDIA, and the article describes how, using RAPIDS, PyTorch and fastai, he placed 15th in a competition. It explains how a deep-learning-based recommender system was accelerated by over 15x using the combination of RAPIDS, PyTorch and fast.ai. Jeremy mentions that they are working with Even on using RAPIDS with fastai.
Let's look at the `Tabular` class. The idea behind the class was to build one that has all the information and methods required. It needs to know the categorical and continuous variables in the dataframe, what preprocessing is to be done, and what the dependent variable is. The `Tabular` class, as you can see, starts with that information in the `__init__` method. It also needs to know whether the dependent or target variable is categorical or not. There can be more than one dependent variable here: there was a question about such scenarios, and Jeremy mentions the destination of a taxi ride, where we need both the x and y coordinates, or a multi-label classification problem. The code of `Tabular` with just the `__init__` method is provided below to follow the notes.
class Tabular(CollBase, GetAttr):
"A `DataFrame` wrapper that knows which cols are cont/cat/y, and returns rows in `__getitem__`"
def __init__(self, df, procs=None, cat_names=None, cont_names=None, y_names=None, is_y_cat=True):
super().__init__(df)
store_attr(self, 'y_names,is_y_cat')
self.cat_names,self.cont_names,self.procs = L(cat_names),L(cont_names),Pipeline(procs, as_item=True)
self.cat_y = None if not is_y_cat else y_names
self.cont_y = None if is_y_cat else y_names
The preprocessing functions, given as `procs` in the code, are actually a list of transforms, so we can create a `Pipeline` with them. The good part is that we are using all the foundations we learnt before in the walkthroughs; these foundations are used throughout fastai v2, which is a good sign. Unlike `TfmdDS`, `TfmdDL` and `TfmdList`, we don't do transforms lazily in tabular. There are three reasons for that.
- Unlike opening an image, it doesn't take a long time to grab a row of tabular data. So it is fine to read the whole lot of rows unless it is a big dataset.
- Most tabular stuff is designed to work on lots of rows quickly at a time.
- Most pre-processing here in tabular is not data augmentation, but more like cleaning of labels and things like that.
So all pre-processing in tabular is done ahead of time rather than lazily, but it is still a `Pipeline` of transforms. We store `cat_y` or `cont_y` depending on whether the dependent variable is categorical or continuous. `Tabular` inherits from `CollBase`, which defines the basic things required in a collection and implements them by composition. `CollBase` is defined in the `01_core.ipynb` notebook.
class CollBase:
"Base class for composing a list of `items`"
def __init__(self, items): self.items = items
def __len__(self): return len(self.items)
def __getitem__(self, k): return self.items[k]
def __setitem__(self, k, v): self.items[list(k) if isinstance(k,CollBase) else k] = v
def __delitem__(self, i): del(self.items[i])
def __repr__(self): return self.items.__repr__()
def __iter__(self): return self.items.__iter__()
In the `__init__` of `Tabular` we call `super().__init__(df)` to inherit from `CollBase`, so that we get attributes and methods like `items`, `__len__`, `__setitem__`, `__getitem__`, `__delitem__`, and so on. This is what is exercised in the test below: the fact that we can call `.items` on `t` and `to` is because of this inheritance.
df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = TabularPandas(df, cat_names='a')
t = pickle.loads(pickle.dumps(to))
test_eq(t.items,to.items)
test_eq(to.all_cols,to[['a']])
to.show() # only shows 'a' since that's the only col in `TabularPandas`
We have other useful attributes like all categorical names, all continuous names and all column names. For the all-continuous names you can see that it is the addition of `self.cont_names` and `self.cont_y`. This would not work but for the use of the `L` class in fastai v2: if you add `None` to an `L`, it doesn't change the `L`, which is what we want most of the time. So things in `L` are more convenient than Python's `list` class.
## example of different behaviours
## Let's start with normal list in python
[1,2]+None
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-341aae41e0f9> in <module>
1 ## example of different behaviours
2 ## Let's start with normal list in python
----> 3 [1,2]+None
TypeError: can only concatenate list (not "NoneType") to list
## Now L in fastai v2
L([1,2])+None
## You can see below that L prints the number of items (#2) and then the items [1,2]
(#2) [1,2]
You will find that `all_cols` is not defined in `Tabular`, yet it is still used in the test
test_eq(to.all_cols,to[['a']])
This is because of the function called `_add_prop`, which adds the necessary attributes and names for us.
def _add_prop(cls, nm):
prop = property(lambda o: o[list(getattr(o,nm+'_names'))])
setattr(cls, nm+'s', prop)
def _f(o,v): o[getattr(o,nm+'_names')] = v
setattr(cls, nm+'s', prop.setter(_f))
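As a standalone illustration of the trick (a toy `Box` class, not part of fastai): a property object with both a getter and a setter can be built programmatically and attached to the class with `setattr`, which is how names like `conts`, `cats` and `all_cols` can appear on `Tabular` without being written out by hand.
class Box:
    "Toy container whose items is a dict of named lists"
    def __init__(self, items): self.items = items
def add_prop(cls, nm):
    prop = property(lambda o: o.items[nm])     # getter: o.<nm>s -> o.items[nm]
    def _f(o,v): o.items[nm] = v               # setter: o.<nm>s = v
    setattr(cls, nm+'s', prop.setter(_f))
add_prop(Box, 'cat')
b = Box({'cat': [1,2,3]})
b.cats              # [1, 2, 3]
b.cats = [4,5,6]    # goes through the generated setter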
We then look at `TabularProc`, which is a subclass of `InplaceTransform`. An `InplaceTransform` is a transform that modifies its input in place: we call it, and it returns the original thing we passed in.
class InplaceTransform(Transform):
"A `Transform` that modifies in-place and just returns whatever it's passed"
def _call(self, fn, x, filt=None, **kwargs):
super()._call(fn,x,filt,**kwargs)
return x
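`TabularProc` builds on this idea so that, as soon as it is set up, it also applies itself to the data. Here is a toy analogue in plain Python (not the fastai source, which wires this through `Transform`'s machinery) of an in-place processor whose `setup` learns its state and then encodes straight away, returning the object it was given.
class ToyProc:
    def setups(self, xs): self.mean = sum(xs)/len(xs)     # learn state from the data
    def encodes(self, xs):
        for i,v in enumerate(xs): xs[i] = v - self.mean   # modify in place
        return xs                                         # return what we were given
    def setup(self, xs):
        self.setups(xs)
        return self.encodes(xs)                           # encode immediately after setup
xs = [1.,2.,3.]
out = ToyProc().setup(xs)
assert out is xs and xs == [-1.,0.,1.]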
Unlike other transforms, which return something different from what was passed in, procs in tabular do things to the stored data, which is why we return what we started with. So a `TabularProc` returns the object it was called on, and when you do `setup` it does the setup but also returns the processed object. Let's see that with an example using `Categorify`, which is a subclass of `TabularProc`. In its `setups`, `Categorify` creates a dictionary, or vocab, from the categorical column names to the values contained in those columns. In its `encodes`, `Categorify` changes the categorical columns to integers using the vocab created in `setups`. We want these to remain two separate things because at inference time we don't want to run `setups`, only `encodes`; at training time we want to do both. That's why `TabularProc` overrides `setup` so that setting it up calls the encodes straight away. So `Categorify` is similar to the `Categorize` transform used for dependent variables in image processing, but somewhat different since it is a tabular proc.
class Categorify(TabularProc):
"Transform the categorical variables to that type."
order = 1
def setups(self, dsrc):
self.classes = {n:CategoryMap(getattr(dsrc,'train',dsrc).iloc[:,n].items, add_na=True) for n in dsrc.all_cat_names}
def _apply_cats (self, c): return c.cat.codes+1 if is_categorical_dtype(c) else c.map(self[c.name].o2i)
def _decode_cats(self, c): return c.map(dict(enumerate(self[c.name].items)))
def encodes(self, to): to.transform(to.all_cat_names, self._apply_cats)
def decodes(self, to): to.transform(to.all_cat_names, self._decode_cats)
def __getitem__(self,k): return self.classes[k]
There were a couple of questions in the meanwhile. One was whether the `patch_property` method defined in fastai v2 calls `_add_prop`, since the person asking could not add a setter using `patch_property`. Jeremy explains that neither `patch_to` nor `patch_property` adds setters. Another question was whether we are doing object detection in fastai v2; Jeremy mentions that we touch upon it briefly. We then go on to see the tests for `Categorify`.
df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.setup()
When we call `to.setup`, it runs the `setups` method in `Categorify` and then the `encodes` method, which calls the `_apply_cats` method. The `_apply_cats` method maps the column values using the vocab created earlier. `map` is handy here: we are mapping with a dictionary, though a function could be used too (functions are slower when used with `map`). Let's look at the vocab that is created. It always starts with '#na#', to account for items not seen before that could turn up later.
cat = to.procs.categorify
test_eq(cat['a'], ['#na#',0,1,2])
test_eq(to.a, [1,2,3,1,3])
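Under the hood, the mapping that `_apply_cats` performs is essentially a pandas `Series.map` with the vocab's `o2i` dictionary. A plain-pandas illustration with a toy dict (the real one is a `CategoryMap`):
import pandas as pd
o2i = {'#na#':0, 0:1, 1:2, 2:3}     # toy stand-in for cat['a'].o2i
s = pd.Series([0,1,2,0,2])          # column 'a' from the dataframe above
list(s.map(o2i))                    # [1, 2, 3, 1, 3] -- matches to.a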
So for this dataframe, `df = pd.DataFrame({'a':[0,1,2,0,2]})`, we create the vocab `['#na#',0,1,2]`. Then 0 in the dataframe maps to 1 (its index in the vocab), 1 maps to 2, and 2 maps to 3. One of the things Jeremy recently added to `Pipeline` is `getattr`: if the pipeline comes across an attribute that is not defined on it, it tries to find it in one of the transforms inside the pipeline, which is what we want. So for the line below
cat = to.procs.categorify
we know that `to.procs` is a pipeline, as can be seen in the code below.
df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.setup()
type(to.procs)
local.data.pipeline.Pipeline
So when it comes across `to.procs.categorify`, it looks for a transform of type `Categorify` among its transforms. Note the conversion to snake case: `Categorify` becomes `categorify`. We use `getattr` to do this, and the convention is consistent with v1, the recently concluded part 2 course, and now v2. In the code below
to = TabularPandas(df, Categorify, 'a')
we have added the type `Categorify` as a proc but have not instantiated it; we do that when we call `cat = to.procs.categorify`. Now that we have the instance, asking for `cat['a']` goes through `__getitem__` in `Categorify`, which returns `self.classes[k]`, the vocab for that column. We can see the vocab, find its type using `type()`, and also the reverse mapping using `o2i`.
cat = to.procs.categorify
cat['a']
(#4) [#na#,0,1,2]
type(cat['a'])
local.data.core.CategoryMap
cat['a'].o2i
defaultdict(int, {'#na#': 0, 0: 1, 1: 2, 2: 3})
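Because `o2i` is a `defaultdict(int)` (as the output above shows), looking up a value that was never seen during setup simply returns 0, which is the index of '#na#' in the vocab:
cat['a'].o2i[3]     # 0 -> an unseen value maps to '#na#'
cat['a'].o2i[-1]    # 0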
There was a question on whether this would take care of the mapping for the test set too, and what happens if it comes across something new. The answer is yes: it takes care of the mapping for the test set as well, and it uses '#na#' for anything it has not encountered before. Now we come to inference time. We want to use the same metadata but pass in new values, so we use the existing tabular object `to` to create a new tabular object `to1` that uses the same metadata (vocab, categorical/continuous column names, dependent variable, etc.) as `to`.
df1 = pd.DataFrame({'a':[1,0,3,-1,2]})
to1 = to.new(df1)
Then we call `to1.process()`. The code for `process` in `Tabular` is
def process(self): self.procs(self)
This means it calls all the (preprocessing) transforms defined as procs. In the data passed for inference, `df1 = pd.DataFrame({'a':[1,0,3,-1,2]})`, the values 3 and -1 are new; they were not in the training set's vocab, which as we know is `(#4) [#na#,0,1,2]`. So when we call `to1.a` it returns `[2,1,0,0,3]`, where 3 and -1 are now mapped to 0 (which is '#na#').
to1.process()
to1['a']
0 2
1 1
2 0
3 0
4 3
Name: a, dtype: int64
list(to1.a)
[2, 1, 0, 0, 3]
So when we call `decode` on the same object, we get back slightly changed data, where 3 and -1 are now represented as '#na#'. So `decodes` can give values different from what you started with, depending on the transform: with `Normalize` you would get back the same data, but with `Categorify` the values can change.
to2=cat.decode(to1)
list(to2['a'])
[1, 0, '#na#', '#na#', 2]
Another way to use the `Categorify` transform is to instantiate it first, as in `cat = Categorify()`, and then pass the instance into the `TabularPandas` call. We can also convert the tabular object into a datasource using the `datasource` method defined on `Tabular`; we only need to specify the splits. Since we split the dataframe in two by indexes, values that only appear in the validation set are not reflected in the vocab, as it is built from the training set alone:

Training - [0,1,2]
Validation - [3,2]
vocab - ['#na#',0,1,2]
to['a'] = [1,2,3,0,3], with the value 3 being substituted by the index of '#na#', which is 0.
cat = Categorify()
df = pd.DataFrame({'a':[0,1,2,3,2]})
to = TabularPandas(df, cat, 'a')
dsrc = to.datasource([[0,1,2],[3,4]])
test_eq(cat['a'], ['#na#',0,1,2])
test_eq(to['a'], [1,2,3,0,3])
Here we did not call `setup`, because `datasource` calls `setup` for us. Looking at the `datasource` code:
def datasource(self, splits=None):
if splits is None: splits=[range_of(self)]
self.items = self.items.iloc[sum(splits, [])].copy()
res = DataSource(self, filts=[range(len(splits[0])), range(len(splits[0]), len(self))], tfms=[None])
self.procs.setup(res)
return res
We have all the values we need, like the dataframe and the train/validation indexes, so we call setup either on the tabular object or in the datasource; we do this to prevent multiple copies and other inefficiencies creeping into the system. Jeremy notes that the code for `Tabular` is quite compact, with only the `__init__` and `datasource` parts running to more than a line or two. `datasource` has more lines because, with RAPIDS on the GPU, it takes a long time to index into dataframes with arbitrary indexes; you have to pass a list of contiguous indexes to make RAPIDS fast. So when you pass splits, they are concatenated into a single list, we index into the dataframe with that list, and the datasource is then created with contiguous ranges passed in as the filters. That's why the code takes more than one line.
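A small pandas illustration of that reindexing step (a toy frame, not the ADULT data): the splits are flattened with `sum(splits, [])`, the frame is re-ordered with a single `iloc` call, and the filters then become contiguous ranges.
import pandas as pd
tdf = pd.DataFrame({'a':[10,11,12,13,14]})
splits = [[0,2,4],[1,3]]                        # train idxs, valid idxs
tdf = tdf.iloc[sum(splits, [])].copy()          # rows re-ordered: train block, then valid block
filts = [range(len(splits[0])), range(len(splits[0]), len(tdf))]
tdf.a.tolist(), list(filts[0]), list(filts[1])  # ([10, 12, 14, 11, 13], [0, 1, 2], [3, 4])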
In Python we cannot put the `@property` decorator on the same line as the function it decorates. For example,
@property def process(self): self.procs(self)
The above code is not valid, so fastai v2 has an alternative called `properties` which achieves the same objective. The code has also been written to make `Tabular` look a lot like a dataframe; that is why we inherit from `GetAttr`. This means that any unknown attribute will be passed on to the `default` property, which is `self.items` in the code:
def default(self): return self.items
So it behaves a lot like a dataframe, because anything unknown is passed on to the dataframe. We also want to index using row numbers and column names together, which is not possible with a plain dataframe, so an `iloc`-style indexer is written to make that kind of indexing possible. The class that gets this done is `_TabIloc`, whose code is
class _TabIloc:
"Get/set rows by iloc and cols by name"
def __init__(self,to): self.to = to
def __getitem__(self, idxs):
df = self.to.items
if isinstance(idxs,tuple):
rows,cols = idxs
cols = df.columns.isin(cols) if is_listy(cols) else df.columns.get_loc(cols)
else: rows,cols = idxs,slice(None)
return self.to.new(df.iloc[rows, cols])
It also wraps the result back into a tabular object. Then we look at the `encodes` code of `Categorify`. Here we call `transform` on the categorical columns. We already saw the `_apply_cats` method, which provides the mapping; the mapping is skipped when a column already has the pandas Categorical dtype, because in that case pandas has already done the mapping for us. The way `_apply_cats` is applied to the `all_cat_names` columns is via the `transform` method, which is defined in the `TabularPandas` class.
class TabularPandas(Tabular):
def transform(self, cols, f): self[cols] = self[cols].transform(f)
Here `self[cols].transform(f)` uses the transform call that pandas itself provides for dataframes and series. Then we look at a test example where pandas Categorical columns are used; there, instead of the vocab mapping, we use pandas' own category codes, as is visible in the `_apply_cats` method.
def _apply_cats (self, c): return c.cat.codes+1 if is_categorical_dtype(c) else c.map(self[c.name].o2i)
df = pd.DataFrame({'a':pd.Categorical(['M','H','L','M'], categories=['H','M','L'], ordered=True)})
to = TabularPandas(df, Categorify, 'a')
cat = to.procs.categorify
to.setup()
test_eq(cat['a'], ['#na#','H','M','L'])
test_eq(to.a, [2,1,3,2])
to2 = cat.decode(to)
test_eq(to2.a, ['M','H','L','M'])
Now we look at `Normalize`. In its `encodes` we subtract the means from the values and divide by the standard deviations. In its `setups` there is a line of code (partially shown here)
df = getattr(dsrc,'train',dsrc)
The above code is the same as
df = dsrc.train if hasattr(dsrc,'train') else dsrc
We have this code so that, if the datasource has 'train' and 'valid' splits, the statistics are taken from the train split; otherwise they are taken from the complete datasource. The complete line is
df = getattr(dsrc,'train',dsrc).conts
which means we take the continuous columns and compute their means and standard deviations. We now look at a test case where we initialise `Normalize`, create the tabular object and call `setup` on it. We also create an array with the same values as the dataframe used in the tabular object, so its mean and standard deviation should match.
norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
to.setup()
x = np.array([0,1,2,3,4])
m,s = x.mean(),x.std()
We then check that `norm.means['a']` is the same as `x.mean()`. We are able to index `norm.means['a']` because of the code in `setups`:
class Normalize(TabularProc):
"Normalize the continuous variables."
order = 2
def setups(self, dsrc):
df = getattr(dsrc,'train',dsrc).conts
self.means,self.stds = df.mean(),df.std(ddof=0)+1e-7
def encodes(self, to): to.conts = (to.conts-self.means) / self.stds
def decodes(self, to): to.conts = (to.conts*self.stds ) + self.means
where we set `self.means` equal to `df.mean()`. A pandas dataframe returns a Series when we call `mean()`, and that Series can be indexed by column name; that is why we are able to use `norm.means['a']`.
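A quick illustration of that (a toy frame): `DataFrame.mean()` gives back a `Series` indexed by column name.
import pandas as pd
tdf = pd.DataFrame({'a':[0,1,2,3,4]})
means = tdf.mean()      # a Series with index 'a'
means['a']              # 2.0 -- the same pattern as norm.means['a']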
We see the tests for `Normalize` below: first using a tabular object, `setup`, and then inference reusing the metadata from the training data; then an example where a datasource is created and `setup` happens there.
norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
to.setup()
x = np.array([0,1,2,3,4])
m,s = x.mean(),x.std()
df1 = pd.DataFrame({'a':[5,6,7]})
to1 = to.new(df1)
to1.process()
test_close(to1['a'].values, (np.array([5,6,7])-m)/s)
to2 = norm.decode(to1)
test_close(to2.a.values, [5,6,7])
norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
dsrc = to.datasource([[0,1,2],[3,4]])
x = np.array([0,1,2])
m,s = x.mean(),x.std()
test_eq(norm.means['a'], m)
test_close(norm.stds['a'], s)
test_close(to['a'].values, (np.array([0,1,2,3,4])-m)/s)
We now look at `FillMissing` and `FillStrategy`. `FillMissing` goes through the continuous columns, notes which ones contain nulls, and creates a dictionary mapping each of those columns to a fill value computed according to the chosen `FillStrategy`. There are three options for `FillStrategy`: median, constant and mode, with median being the default. In the `encodes` of `FillMissing` we fill the missing values with the results of that setup.
class FillMissing(TabularProc):
"Fill the missing values in continuous columns."
def __init__(self, fill_strategy=FillStrategy.median, add_col=True, fill_vals=None):
if fill_vals is None: fill_vals = defaultdict(int)
store_attr(self, 'fill_strategy,add_col,fill_vals')
def setups(self, dsrc):
df = getattr(dsrc,'train',dsrc).conts
self.na_dict = {n:self.fill_strategy(df[n], self.fill_vals[n])
for n in pd.isnull(df).any().keys()}
def encodes(self, to):
missing = pd.isnull(to.conts)
for n in missing.any().keys():
assert n in self.na_dict, f"nan values in `{n}` but not in setup training set"
to[n].fillna(self.na_dict[n], inplace=True)
if self.add_col:
to.loc[:,n+'_na'] = missing[n]
if n+'_na' not in to.cat_names: to.cat_names.append(n+'_na')
class FillStrategy:
"Namespace containing the various filling strategies."
def median (c,fill): return c.median()
def constant(c,fill): return fill
def mode (c,fill): return c.dropna().value_counts().idxmax()
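As an aside (not a cell from the notebook), each strategy is just a plain function taking the column and the constant fill value, so it can be tried directly on a Series; on the column used in the test below:
import numpy as np, pandas as pd
col = pd.Series([0,1,np.nan,1,2,3,4])
FillStrategy.median(col, 0)     # 1.5 -- what ends up in fill1.na_dict['a'] below
FillStrategy.mode(col, 0)       # 1.0 -- most frequent non-null value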
We can also create multiple tabular objects using different `FillStrategy` options, as this example makes clear. Here we create three fill procs, one for each of the three `FillStrategy` options, and then create a different tabular object from each.
fill1,fill2,fill3 = (FillMissing(fill_strategy=s)
for s in [FillStrategy.median, FillStrategy.constant, FillStrategy.mode])
df = pd.DataFrame({'a':[0,1,np.nan,1,2,3,4]})
df1 = df.copy(); df2 = df.copy()
tos = TabularPandas(df, fill1, cont_names='a'),TabularPandas(df1, fill2, cont_names='a'),TabularPandas(df2, fill3, cont_names='a')
for t in tos: t.setup()
test_eq(fill1.na_dict, {'a': 1.5})
test_eq(fill2.na_dict, {'a': 0})
test_eq(fill3.na_dict, {'a': 1.0})
There was a question on whether `setups` should be called by the constructor. The answer is no: we call `setup` (via the TypeDispatch machinery) only after we have enough information about what we want to set up with. That is usually available when we create the `DataSource` and know the training and validation sets and what to do with them, which is why it happens in `datasource`; but it is also available for use before the datasource. We also see examples where we include all the procs, `Normalize`, `Categorify`, `FillMissing` and `noop`; they can be used together in a `Pipeline` and each will be applied only to the relevant columns, e.g. `Normalize` only to the continuous columns.
procs = [Normalize, Categorify, FillMissing, noop]
df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,1,np.nan,1,2,3,4]})
to = TabularPandas(df, procs, cat_names='a', cont_names='b')
to.setup()
#Test setup and apply on df_main
test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to.a, [1,2,3,2,2,3,1])
test_eq(to.b_na, [1,1,2,1,1,1,1])
x = np.array([0,1,1.5,1,2,3,4])
m,s = x.mean(),x.std()
test_close(to.b.values, (x-m)/s)
test_eq(to.procs.classes, {'a': ['#na#',0,1,2], 'b_na': ['#na#',False,True]})
We now get to a dataset that has categorical columns, continuous columns and dependent variables. To process them we need three different tensors, since they are different data types. This is achieved by the `ReadTabBatch` transform, which is an `ItemTransform`. Its `encodes` takes the tabular object and converts the categorical columns to a long tensor, the continuous columns to a float tensor, and the dependent variable to a long tensor (if categorical) or a float tensor (if continuous). The categorical and continuous values are returned as a tuple of tensors, together with the target.
class ReadTabBatch(ItemTransform):
def __init__(self, to): self.to = to
# TODO: use float for cont targ
def encodes(self, to): return (tensor(to.cats).long(),tensor(to.conts).float()), tensor(to.targ).long()
df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,np.nan,1,1,2,3,4], 'c': ['b','a','b','a','a','b','a']})
to = TabularPandas(df, procs, cat_names='a', cont_names='b', y_names='c')
to.datasource(splits=[[0,1,4,6], [2,3,5]])
test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to.a, [1,2,2,1,0,2,0])
test_eq(df.a.dtype,int)
test_eq(to.b_na, [1,2,1,1,1,1,1])
test_eq(to.c, [2,1,1,1,2,1,2])
We look again at the example involving the ADULT dataset, but here we do not add the `ReadTabBatch` transform ourselves. That is because we are using `TabDataLoader`, a subclass of `TfmdDL`, which automatically adds `ReadTabBatch` to the `after_batch` transforms.
@delegates()
class TabDataLoader(TfmdDL):
do_item = noops
def __init__(self, dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, **kwargs):
after_batch = L(after_batch)+ReadTabBatch(dataset.items)
super().__init__(dataset, bs=bs, shuffle=shuffle, after_batch=after_batch, num_workers=num_workers, **kwargs)
def create_batch(self, b): return self.dataset.items.iloc[b]
With tabular data, and also for RAPIDS, we want to take data one batch at a time rather than one row at a time. That is why `do_item = noops` is set in the code, to prevent single rows being fetched instead of a batch. We then override `create_batch` to grab the items for a batch from the dataset with a single `iloc` call. RAPIDS uses a similar approach. This is also one of the reasons we replace the PyTorch `DataLoader`: to be able to use a batch-based dataloader for tabular. Jeremy will add an example showing how to do this with a databunch as well. The rest of the code in the ADULT example follows what has been explained in this lecture.
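As a rough illustration of why batch-wise indexing matters (a toy timing, not from the lecture): one `iloc` call with a list of row numbers is far cheaper than one `iloc` call per row, which is what `do_item = noops` plus the custom `create_batch` buys us.
import numpy as np, pandas as pd, timeit
big = pd.DataFrame({'a': np.arange(100_000)})
idxs = list(range(1_000))
batched = timeit.timeit(lambda: big.iloc[idxs], number=10)               # one call per batch
per_row = timeit.timeit(lambda: [big.iloc[i] for i in idxs], number=10)  # one call per row
batched, per_row     # per-row indexing is typically orders of magnitude slower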