Notes thanks to @pnvijay
We start with tabular. I import all the modules required to run the notebook while I attempt to recreate what happens in the walkthrough.
from local.torch_basics import *
from local.test import *
from local.core import *
from local.data.all import *
from local.notebook.showdoc import show_doc
from local.tabular.core import *
We start with the 40_tabular_core.ipynb notebook. Jeremy says tabular is a cool and fun notebook. We look at the ADULTS dataset, which has 32561 rows and 15 columns.
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
len(df),len(df.columns)
(32561, 15)
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
To create a model from this dataset, we need to convert the categorical variables into ints. We also check for missing values and fill them, normally with the median, and add a column holding a binary value that records whether the original value was missing. So we need to know which variables are categorical and which are continuous in order to apply the appropriate transforms to each. We also need to decide how to split the data into training and validation sets, and we need to know our dependent (target) variable.
## The categorical and continuous variables are listed here
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
## The transforms that are necessary are listed here
procs = [Categorify, FillMissing, Normalize]
## This is how to split the dataframe into train and valid
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary")
%time dsrc = to.datasource(splits=splits)
CPU times: user 211 ms, sys: 11.9 ms, total: 222 ms
Wall time: 229 ms
Refer to the code in the cell above. We have defined a class called Tabular that takes in the dataframe and the processes needed to convert categories to integers, fill missing values, normalize and so on. It also knows the continuous, categorical and dependent variables. This creates a tabular object. From this tabular object you get a DataSource if you pass in the splits. Now that we have a DataSource, we can create a DataLoader and show a batch from it.
dl = TabDataLoader(dsrc.valid, bs=16)
dl.show_batch()
| | age | fnlwgt | education-num | workclass | education | marital-status | occupation | relationship | race | age_na | fnlwgt_na | education-num_na | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26.000000 | 247024.998680 | 10.0 | Private | Some-college | Never-married | #na# | Not-in-family | White | False | False | True | <50k |
| 1 | 26.000000 | 39212.001122 | 9.0 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | False | False | <50k |
| 2 | 65.999999 | 66007.999407 | 9.0 | Private | HS-grad | Widowed | Priv-house-serv | Not-in-family | White | False | False | False | <50k |
| 3 | 50.000000 | 305147.004464 | 10.0 | Private | Bachelors | Married-civ-spouse | Craft-repair | Husband | White | False | False | True | <50k |
| 4 | 33.000000 | 91811.002559 | 9.0 | Private | HS-grad | Separated | Transport-moving | Not-in-family | White | False | False | False | <50k |
| 5 | 24.999999 | 57511.997294 | 10.0 | Private | Some-college | Never-married | Sales | Not-in-family | White | False | False | False | <50k |
| 6 | 38.000000 | 205359.000280 | 7.0 | Private | 11th | Married-civ-spouse | Adm-clerical | Wife | White | False | False | False | <50k |
| 7 | 38.000000 | 80770.997691 | 13.0 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | White | False | False | False | >=50k |
| 8 | 17.000000 | 132635.997488 | 10.0 | Private | 11th | Never-married | #na# | Own-child | White | False | False | True | <50k |
| 9 | 47.000000 | 217161.000352 | 9.0 | Private | HS-grad | Divorced | Other-service | Not-in-family | Black | False | False | False | <50k |
Now you can take in a test set and say that it has the same categorical, continuous and dependent variables, so the same preprocessing that was done for the training set can be applied to it. to.new creates a new Tabular object, and the processing (the same preprocessing as for the training set) is invoked by calling the .process() method. As you can see here, we are using a subclass of Tabular called TabularPandas. It is not certain that this will stay the same. We are also working on another subclass called TabularRapids, built on RAPIDS from NVIDIA, which offers GPU-accelerated dataframes.
to_tst = to.new(df_test)
to_tst.process()
to_tst.all_cols.head()
| | age | fnlwgt | education-num | workclass | education | marital-status | occupation | relationship | race | age_na | fnlwgt_na | education-num_na | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10000 | 0.457314 | 1.335777 | 1.157855 | 5 | 10 | 3 | 2 | 1 | 2 | 1 | 1 | 1 | 1 |
| 10001 | -0.927409 | 1.248882 | -0.427637 | 5 | 12 | 3 | 15 | 1 | 4 | 1 | 1 | 1 | 1 |
| 10002 | 1.040355 | 0.147819 | -1.220383 | 5 | 2 | 1 | 9 | 2 | 5 | 1 | 1 | 1 | 1 |
| 10003 | 0.530194 | -0.284650 | -0.427637 | 5 | 12 | 7 | 2 | 5 | 5 | 1 | 1 | 1 | 1 |
| 10004 | 0.748835 | 1.438161 | 0.365109 | 6 | 9 | 3 | 5 | 1 | 5 | 1 | 1 | 1 | 2 |
There was a question on how to speed up inference on tabular learner predictions. Jeremy mentions that this will be done via RAPIDS. He points to a Medium article written by Even Oldridge, whom we know from the forums as @Even. Even is now with NVIDIA, and the article describes how, using RAPIDS, PyTorch and fastai, he placed 15th in a competition. It explains how a deep learning based recommender system was accelerated by over 15x using the combination of RAPIDS, PyTorch and fast.ai. Jeremy mentions that they are working with Even on using RAPIDS with fastai.
Let's look at the Tabular class. The idea was to build one class that has all the information and methods required. It needs to know which variables are categorical and which are continuous, what preprocessing is to be done, and what the dependent variable is. The Tabular class starts with that information in its __init__ method. It also needs to know whether the dependent (target) variable is categorical or not. There could be more than one dependent variable. There was a question on scenarios with more than one dependent variable. Jeremy mentions the destination of a taxi ride, where we need to know both the x and y coordinates. Multi-label classification is another such case.
The code of Tabular, with just the __init__ method, is provided below to follow the notes.
class Tabular(CollBase, GetAttr):
    "A `DataFrame` wrapper that knows which cols are cont/cat/y, and returns rows in `__getitem__`"
    def __init__(self, df, procs=None, cat_names=None, cont_names=None, y_names=None, is_y_cat=True):
        super().__init__(df)
        store_attr(self, 'y_names,is_y_cat')
        self.cat_names,self.cont_names,self.procs = L(cat_names),L(cont_names),Pipeline(procs, as_item=True)
        self.cat_y = None if not is_y_cat else y_names
        self.cont_y = None if is_y_cat else y_names
The preprocessing functions passed as procs are actually a list of transforms, so we can create a Pipeline from them. The good part is that we are using all the foundations we learnt in the earlier walkthroughs. These foundations are used throughout fastai v2, which is a good sign. Unlike TfmdDS, TfmdDL and TfmdList, we don't apply transforms lazily in tabular. There are three reasons for that.
- Unlike opening an image, it doesn't take a long time to grab a row of tabular data. So it is fine to read all the rows up front unless it is a big dataset.
- Most tabular stuff is designed to work on lots of rows quickly at a time.
- Most pre-processing here in tabular is not data augmentation, but more like cleaning of labels and things like that.
So all preprocessing in tabular is done ahead of time rather than lazily, but it is still a Pipeline of transforms. We store cat_y or cont_y depending on whether the dependent variable is categorical or continuous. Tabular inherits from CollBase, which defines the basic things required in a collection and implements them by composition. CollBase is defined in the 01_core.ipynb notebook.
class CollBase:
    "Base class for composing a list of `items`"
    def __init__(self, items): self.items = items
    def __len__(self): return len(self.items)
    def __getitem__(self, k): return self.items[k]
    def __setitem__(self, k, v): self.items[list(k) if isinstance(k,CollBase) else k] = v
    def __delitem__(self, i): del(self.items[i])
    def __repr__(self): return self.items.__repr__()
    def __iter__(self): return self.items.__iter__()
In the __init__ of Tabular we call super().__init__(df), which stores the dataframe as items in CollBase. This gives us the items attribute along with length, getitem, setitem, delitem and so on. This is what is seen in the test: the fact that we can access .items on t and to is because of this inheritance.
df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = TabularPandas(df, cat_names='a')
t = pickle.loads(pickle.dumps(to))
test_eq(t.items,to.items)
test_eq(to.all_cols,to[['a']])
to.show() # only shows 'a' since that's the only col in `TabularPandas`
We have other useful attributes such as all categorical names, all continuous names and all column names. All continuous names is the sum of self.cont_names and self.cont_y. This only works because of the L class in fastai v2: if you add None to an L, the L is unchanged, which is what we want most of the time. This makes L more convenient than the Python list class.
## example of different behaviours
## Let's start with normal list in python
[1,2]+None
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-341aae41e0f9> in <module>
1 ## example of different behaviours
2 ## Let's start with normal list in python
----> 3 [1,2]+None
TypeError: can only concatenate list (not "NoneType") to list
## Now L in fastai v2
L([1,2])+None
## You can see below that in L, it prints the numbers of items (#2) and then the items [1,2]
(#2) [1,2]
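The behaviour above can be imitated in plain Python. This is only a minimal sketch: `NoneTolerantList` is a hypothetical stand-in written for these notes, not the fastai `L` class, and it covers nothing beyond the "adding None changes nothing" rule.

```python
class NoneTolerantList(list):
    "Hypothetical stand-in for `L`: adding `None` leaves the list unchanged."
    def __add__(self, other):
        # Treat `None` as "nothing to add"; otherwise concatenate as usual
        if other is None: return NoneTolerantList(self)
        return NoneTolerantList(list(self) + list(other))

cont_names = NoneTolerantList(['age', 'fnlwgt'])
cont_y = None                          # no continuous dependent variable
all_cont_names = cont_names + cont_y   # no TypeError, names are unchanged
```

This is why an expression like self.cont_names + self.cont_y is safe even when one of the two is None.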
You will find that all_cols is not defined in Tabular, but it is still used in the test
test_eq(to.all_cols,to[['a']])
This is because of the function called _add_prop that adds the necessary attributes and names for us.
def _add_prop(cls, nm):
    prop = property(lambda o: o[list(getattr(o,nm+'_names'))])
    setattr(cls, nm+'s', prop)
    def _f(o,v): o[getattr(o,nm+'_names')] = v
    setattr(cls, nm+'s', prop.setter(_f))
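To see the same idea in isolation, here is a minimal sketch in plain Python. The `Table` class and the `add_prop` helper are hypothetical and exist only for illustration; the container is a dict of lists rather than a dataframe.

```python
class Table:
    "Hypothetical mini-container: `items` maps column name -> list of values."
    def __init__(self, items, cat_names, cont_names):
        self.items, self.cat_names, self.cont_names = items, cat_names, cont_names
    def __getitem__(self, cols): return {c: self.items[c] for c in cols}
    def __setitem__(self, cols, vals):
        for c in cols: self.items[c] = vals[c]

def add_prop(cls, nm):
    "Attach a read/write property `<nm>s` that selects the columns listed in `<nm>_names`."
    def _get(o): return o[list(getattr(o, nm + '_names'))]
    def _set(o, v): o[list(getattr(o, nm + '_names'))] = v
    setattr(cls, nm + 's', property(_get, _set))

for name in ('cat', 'cont'): add_prop(Table, name)

t = Table({'a': [1, 2], 'b': [0.5, 1.5]}, cat_names=['a'], cont_names=['b'])
cats_before = t.cats           # reads the columns named in t.cat_names
t.conts = {'b': [0.0, 1.0]}    # goes through the generated setter
```

Generating the properties in a loop avoids writing near-identical getters and setters for each group of columns.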
We then look at TabularProc, which is a subclass of InplaceTransform. An InplaceTransform is a transform that returns whatever it was passed: we call it, and we get the original thing back.
class InplaceTransform(Transform):
    "A `Transform` that modifies in-place and just returns whatever it's passed"
    def _call(self, fn, x, filt=None, **kwargs):
        super()._call(fn,x,filt,**kwargs)
        return x
Unlike other transforms, which return something different from what we passed in, processes in tabular modify the stored data, which is why we return what we started with. So a TabularProc hands back the object it was called on, and the same is true when you call setup. Let's see this with Categorify, which is a subclass of TabularProc. In setups, Categorify creates a vocab for each categorical column, that is, a dictionary from categorical column names to the values found in those columns. In encodes, Categorify changes the categorical columns to integers using the vocab created in setups. We want these to remain two separate things because at inference time we don't want to run setups, only encodes. At training time we want to do both. That's why TabularProc overrides setup so that, once the setup is done, it calls encodes straight away. So Categorify is similar to the Categorize transform used for dependent variables in image processing, but somewhat different because it is a tabular process.
class Categorify(TabularProc):
    "Transform the categorical variables to that type."
    order = 1
    def setups(self, dsrc):
        self.classes = {n:CategoryMap(getattr(dsrc,'train',dsrc).iloc[:,n].items, add_na=True) for n in dsrc.all_cat_names}
    def _apply_cats (self, c): return c.cat.codes+1 if is_categorical_dtype(c) else c.map(self[c.name].o2i)
    def _decode_cats(self, c): return c.map(dict(enumerate(self[c.name].items)))
    def encodes(self, to): to.transform(to.all_cat_names, self._apply_cats)
    def decodes(self, to): to.transform(to.all_cat_names, self._decode_cats)
    def __getitem__(self,k): return self.classes[k]
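The split between learning state (setups) and applying it (encodes) can be sketched without fastai or Pandas. `MiniCategorify` below is a hypothetical, simplified analogue that works on a dict of lists; the method names mirror the ones above, but the implementation is an assumption made for illustration.

```python
class MiniCategorify:
    "Hypothetical, simplified analogue of `Categorify` working on a dict of columns."
    def setups(self, table, cat_names):
        # Build one vocab per categorical column; index 0 is reserved for '#na#'
        self.classes = {n: ['#na#'] + sorted(set(table[n])) for n in cat_names}
    def encodes(self, table, cat_names):
        # Replace each value by its index in the vocab, modifying `table` in place
        for n in cat_names:
            o2i = {v: i for i, v in enumerate(self.classes[n])}
            table[n] = [o2i.get(v, 0) for v in table[n]]
    def setup(self, table, cat_names):
        # Training time: learn the state, then encode straight away
        self.setups(table, cat_names)
        return self(table, cat_names)
    def __call__(self, table, cat_names):
        # Inference time: encode only, reusing the stored state
        self.encodes(table, cat_names)
        return table  # the same object that was passed in

train = {'a': [0, 1, 2, 0, 2]}
proc = MiniCategorify()
out = proc.setup(train, ['a'])
```

After setup, `out` is the very same dict as `train`, now holding integer codes, and `proc` can be called on new data without rebuilding the vocab.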
There were a couple of questions in the meanwhile. One was whether the patch_property method defined in fastai v2 calls _add_prop, since the person asking couldn't add a setter using patch_property. Jeremy explains that neither patch_to nor patch_property adds a setter. Another question was whether we will be doing object detection in fastai v2. Jeremy mentions that we touch upon it briefly. We then look at the tests for Categorify.
df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.setup()
When we call to.setup, the setups method of Categorify runs first and then the encodes method, which calls the _apply_cats method. The _apply_cats method maps the column values using the vocab that was created earlier. map is a handy method here. We are mapping with a dictionary, though a function can be used as well. Functions are slower when used in map, though. Let's look at the vocab created. The vocab always starts with #na# to account for any items not seen during setup that could be presented later.
cat = to.procs.categorify
test_eq(cat['a'], ['#na#',0,1,2])
test_eq(to.a, [1,2,3,1,3])
So for the dataframe df = pd.DataFrame({'a':[0,1,2,0,2]}) we create a vocab which is ['#na#',0,1,2]. Then 0 in the dataframe maps to 1 (its index in the vocab), 1 maps to 2 and 2 maps to 3. One of the things Jeremy has recently added to Pipeline is getattr: if the pipeline comes across an attribute that is not defined on it, it tries to find it in the transforms inside the pipeline, which is what we want. So for the line below
cat = to.procs.categorify
we know that to.procs is a pipeline. This can be seen in the below code.
df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.setup()
type(to.procs)
local.data.pipeline.Pipeline
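The attribute lookup on the pipeline can be sketched as follows, with each transform reachable under the snake-case form of its class name. `MiniPipeline` and `snake_name` are hypothetical names, the two transform classes are empty placeholders, and instantiating the classes in `__init__` is a simplification, so this is an illustration of the idea rather than the fastai implementation.

```python
import re

def snake_name(cls_name):
    "Convert a CamelCase class name to snake_case, e.g. 'FillMissing' -> 'fill_missing'."
    return re.sub(r'(?<!^)(?=[A-Z])', '_', cls_name).lower()

class MiniPipeline:
    "Hypothetical pipeline that exposes its transforms as snake_case attributes."
    def __init__(self, tfms):
        # Instantiate any transform that was passed as a class rather than an instance
        self.tfms = [t() if isinstance(t, type) else t for t in tfms]
    def __getattr__(self, name):
        # Only reached when normal attribute lookup on the pipeline fails
        for t in self.__dict__.get('tfms', []):
            if snake_name(type(t).__name__) == name: return t
        raise AttributeError(name)

class Categorify: pass      # placeholder classes, for illustration only
class FillMissing: pass

procs = MiniPipeline([Categorify, FillMissing()])
cat = procs.categorify      # found by matching the snake-case class name
```

Because `__getattr__` is only consulted after normal lookup fails, attributes defined on the pipeline itself always take precedence over the transforms.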
So when it comes across to.procs.categorify, it looks for a transform of type Categorify among its transforms. Note the conversion to snake case: Categorify becomes categorify. We use getattr to do this, and this is consistent across v1, the recently concluded part 2 course and now v2. In the code below
to = TabularPandas(df, Categorify, 'a')
we have passed the type Categorify as a proc but have not instantiated it. That happens when we call cat = to.procs.categorify. Now that it is instantiated, asking for cat['a'] goes to __getitem__ in Categorify, which returns self.classes[k], the vocab for that column. We can look at the vocab, find its type using type(), and see the reverse mapping using o2i.
cat = to.procs.categorify
cat['a']
(#4) [#na#,0,1,2]
type(cat['a'])
local.data.core.CategoryMap
cat['a'].o2i
defaultdict(int, {'#na#': 0, 0: 1, 1: 2, 2: 3})
There was a question on whether this would take care of the mapping for the test set too, and what happens if it comes across something new. The answer is yes: it takes care of the mapping for the test set too, and it uses '#na#' for anything it has not encountered before. Now we come to inference time. We want to use the same metadata but pass new values. So we use the existing tabular object to and create a new tabular object to1 that reuses the metadata (vocab, categorical/continuous column names, dependent variable etc.) from to.
df1 = pd.DataFrame({'a':[1,0,3,-1,2]})
to1 = to.new(df1)
Then we call to1.process(). The code for process in Tabular is
def process(self): self.procs(self)
This means that it calls all the (preprocessing) transforms defined as procs. In the data passed for inference, df1 = pd.DataFrame({'a':[1,0,3,-1,2]}), the values 3 and -1 are new: they were not in the vocab of the training set. The vocab, as we know, is (#4) [#na#,0,1,2]. So when we call to1.a it returns [2,1,0,0,3], in which 3 and -1 are now mapped to 0 (which is '#na#').
to1.process()
to1['a']
0 2
1 1
2 0
3 0
4 3
Name: a, dtype: int64
list(to1.a)
[2, 1, 0, 0, 3]
So when we call decode on it, we get back slightly changed data in which 3 and -1 are now represented as '#na#'. So decodes can give values different from what you started with, depending on the transform. With a transform like Normalize we would get back the same data, but with Categorify the values can change.
to2=cat.decode(to1)
list(to2['a'])
[1, 0, '#na#', '#na#', 2]
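The round trip above can be reproduced with a plain dictionary. This is a minimal sketch that assumes the reverse mapping is a defaultdict(int), as the o2i output shown earlier suggests; the variable names are made up for the example.

```python
from collections import defaultdict

vocab = ['#na#', 0, 1, 2]    # built from the training column
# Unknown keys fall back to int() == 0, which is the index of '#na#'
o2i = defaultdict(int, {v: i for i, v in enumerate(vocab)})

new_values = [1, 0, 3, -1, 2]              # 3 and -1 were never seen in training
encoded = [o2i[v] for v in new_values]     # [2, 1, 0, 0, 3]
decoded = [vocab[i] for i in encoded]      # [1, 0, '#na#', '#na#', 2]
```

The information about which unseen value was present is lost at encoding time, which is why decoding cannot recover 3 and -1.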
Another way to use the Categorify transform is to instantiate it first and then use it. Here we instantiate it with cat=Categorify() and then pass it in when creating the Tabular object. We can also convert it to a DataSource using the datasource method defined in Tabular; we only need to specify the splits. Since we split the dataframe in two by index, a value that appears only in the validation set is not reflected in the vocab, which is built from the training set.
Training - [0,1,2]
Validation - [3,2]
vocab - ['#na#',0,1,2]
to['a'] = [1,2,3,0,3], with the value 3 being replaced by the index of '#na#', which is 0.
cat = Categorify()
df = pd.DataFrame({'a':[0,1,2,3,2]})
to = TabularPandas(df, cat, 'a')
dsrc = to.datasource([[0,1,2],[3,4]])
test_eq(cat['a'], ['#na#',0,1,2])
test_eq(to['a'], [1,2,3,0,3])
Here we did not call setup ourselves because datasource calls it for us, as we can see in the datasource code
def datasource(self, splits=None):
    if splits is None: splits=[range_of(self)]
    self.items = self.items.iloc[sum(splits, [])].copy()
    res = DataSource(self, filts=[range(len(splits[0])), range(len(splits[0]), len(self))], tfms=[None])
    self.procs.setup(res)
    return res
We have all the values we need, such as the dataframe and the train and validation rows. So setup is called either on the tabular object or when the datasource is created. We do this to avoid multiple copies and other inefficiencies. Jeremy explains that the code for Tabular is quite neat, with only the __init__ and datasource parts running to several lines. The datasource part is longer because in RAPIDS, on the GPU, indexing into a dataframe with arbitrary indexes takes a long time, so you have to pass contiguous indexes to keep RAPIDS fast. So when you pass splits, they are concatenated into a single list and we index into the dataframe with that list once. The DataSource is then created with contiguous ranges as its filters. That's why this code takes more than one line.
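The reordering step can be traced on a small list. This sketch uses a plain Python list in place of the dataframe, and the row labels are invented for the example; only the `sum(splits, [])` concatenation and the two ranges come from the datasource code.

```python
rows = ['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6']
splits = [[0, 1, 4, 6], [2, 3, 5]]     # arbitrary train / valid indexes

order = sum(splits, [])                # concatenate: [0, 1, 4, 6, 2, 3, 5]
reordered = [rows[i] for i in order]   # one indexing pass over the data

# After reordering, each split is a contiguous range
n_train = len(splits[0])
filts = [range(n_train), range(n_train, len(reordered))]
train = [reordered[i] for i in filts[0]]
valid = [reordered[i] for i in filts[1]]
```

The arbitrary indexing happens once, up front; every later access to a split is a contiguous slice.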
In Python we cannot put the @property decorator on the same line as the definition it decorates. For example
@property def process(self): self.procs(self)
The above code is not valid. So fastai v2 provides an alternative called properties, which achieves the same objective. The code has been written to make Tabular look a lot like a dataframe. That's why we inherit from GetAttr: any unknown attribute is passed on to the default property, which is self.items in the code.
def default(self): return self.items
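The delegation pattern behind GetAttr can be sketched in a few lines. `Delegating` is a hypothetical class written for these notes, and it wraps a list instead of a dataframe; it is not the fastai GetAttr implementation.

```python
class Delegating:
    "Hypothetical wrapper: unknown attributes are looked up on `self.default`."
    def __init__(self, items): self.items = items
    @property
    def default(self): return self.items
    def __getattr__(self, name):
        # Reached only when `name` is not found on the wrapper itself.
        # Guard the wrapper's own names and dunders to avoid infinite recursion.
        if name in ('items', 'default') or name.startswith('__'):
            raise AttributeError(name)
        return getattr(self.default, name)

wrapped = Delegating([3, 1, 2])
n_ones = wrapped.count(1)   # `count` is not defined on Delegating, so list.count is used
```

Anything the wrapper defines itself wins; everything else falls through to the wrapped object.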
So it behaves a lot like a dataframe, because anything it does not know about is passed on to the dataframe. We also want to index using row numbers and column names together, which a dataframe does not offer directly. So an iloc property is written to make that kind of indexing possible. The class behind it is _TabIloc. The code for that is
class _TabIloc:
    "Get/set rows by iloc and cols by name"
    def __init__(self,to): self.to = to
    def __getitem__(self, idxs):
        df = self.to.items
        if isinstance(idxs,tuple):
            rows,cols = idxs
            cols = df.columns.isin(cols) if is_listy(cols) else df.columns.get_loc(cols)
        else: rows,cols = idxs,slice(None)
        return self.to.new(df.iloc[rows, cols])
_TabIloc also wraps the result back into a tabular object. Then we look at the encodes code of Categorify. Here we call transform on the categorical columns. We already saw the _apply_cats method, which provides the mapping. The vocab mapping is skipped if a column already has the Pandas Categorical dtype, because in that case Pandas has already done the mapping for us. The _apply_cats method is applied to the all_cat_names columns via the transform method, which is defined in the TabularPandas class.
class TabularPandas(Tabular):
    def transform(self, cols, f): self[cols] = self[cols].transform(f)
Here self[cols].transform(f) is a call to Pandas' own built-in transform method. Then we look at a test example in which Pandas Categorical columns are used. Here, instead of the vocab mapping, we use Pandas' own category codes, as is visible in the _apply_cats method.
def _apply_cats (self, c): return c.cat.codes+1 if is_categorical_dtype(c) else c.map(self[c.name].o2i)
df = pd.DataFrame({'a':pd.Categorical(['M','H','L','M'], categories=['H','M','L'], ordered=True)})
to = TabularPandas(df, Categorify, 'a')
cat = to.procs.categorify
to.setup()
test_eq(cat['a'], ['#na#','H','M','L'])
test_eq(to.a, [2,1,3,2])
to2 = cat.decode(to)
test_eq(to2.a, ['M','H','L','M'])
Now we look at Normalize. In encodes we subtract the means from the values and divide by the standard deviations. Here we have a line of code (partially shown here)
df = getattr(dsrc,'train',dsrc)
The above code is the same as
df = dsrc.train if hasattr(dsrc,'train') else dsrc
We write it this way so that if we have a datasource with 'train' and 'valid' splits, the values are taken from train; otherwise they are taken from the complete datasource. The complete line is
df = getattr(dsrc,'train',dsrc).conts
This means we take the continuous columns and compute their means and standard deviations. We now look at a test case in which we initialise Normalize and then create the tabular object. We then call setup on it. We create a NumPy array with the same values so that its mean and standard deviation match those of the dataframe used in the tabular object.
norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
to.setup()
x = np.array([0,1,2,3,4])
m,s = x.mean(),x.std()
We then check that norm.means['a'] is the same as x.mean(). We are able to index norm.means['a'] because of the code in setups
class Normalize(TabularProc):
    "Normalize the continuous variables."
    order = 2
    def setups(self, dsrc):
        df = getattr(dsrc,'train',dsrc).conts
        self.means,self.stds = df.mean(),df.std(ddof=0)+1e-7
    def encodes(self, to): to.conts = (to.conts-self.means) / self.stds
    def decodes(self, to): to.conts = (to.conts*self.stds ) + self.means
Here we set self.means to df.mean(). Calling mean on a Pandas dataframe returns a series that can be indexed by column name. That is why we are able to use norm.means['a'].
We see the tests for Normalize using tabular object setup and inference using the metadata from the train data. We also see an example where datasource is created and setup is used there.
norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
to.setup()
x = np.array([0,1,2,3,4])
m,s = x.mean(),x.std()
df1 = pd.DataFrame({'a':[5,6,7]})
to1 = to.new(df1)
to1.process()
test_close(to1['a'].values, (np.array([5,6,7])-m)/s)
to2 = norm.decode(to1)
test_close(to2.a.values, [5,6,7])
norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
dsrc = to.datasource([[0,1,2],[3,4]])
x = np.array([0,1,2])
m,s = x.mean(),x.std()
test_eq(norm.means['a'], m)
test_close(norm.stds['a'], s)
test_close(to['a'].values, (np.array([0,1,2,3,4])-m)/s)
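The arithmetic in these tests can be checked with the standard library alone. `MiniNormalize` is a hypothetical, single-column analogue written for illustration; it assumes the population standard deviation (ddof=0) plus the 1e-7 epsilon seen in setups.

```python
from statistics import mean, pstdev

class MiniNormalize:
    "Hypothetical, simplified analogue of `Normalize` for a single list of numbers."
    def setups(self, train_values):
        # Population standard deviation (ddof=0) plus a small epsilon against division by zero
        self.mean, self.std = mean(train_values), pstdev(train_values) + 1e-7
    def encodes(self, values): return [(v - self.mean) / self.std for v in values]
    def decodes(self, values): return [v * self.std + self.mean for v in values]

norm = MiniNormalize()
norm.setups([0, 1, 2])                    # statistics come from the training rows only
encoded = norm.encodes([0, 1, 2, 3, 4])   # validation rows reuse the training statistics
decoded = norm.decodes(encoded)           # recovers the original values
```

Unlike the categorical case, decoding here is an exact inverse up to floating point error.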
We now look at FillMissing and FillStrategy. In setups, FillMissing goes through the continuous columns, checks them for null values, and builds a dictionary (na_dict) that maps each column name to the fill value computed by the chosen FillStrategy. There are three options to choose from in FillStrategy: median, constant and mode. The median is used by default. In the encodes of FillMissing we fill the missing values with the values stored in that dictionary. If add_col is set, encodes also adds a column named after the original with an _na suffix, recording which values were missing, and registers it as a categorical column.
class FillMissing(TabularProc):
    "Fill the missing values in continuous columns."
    def __init__(self, fill_strategy=FillStrategy.median, add_col=True, fill_vals=None):
        if fill_vals is None: fill_vals = defaultdict(int)
        store_attr(self, 'fill_strategy,add_col,fill_vals')
    def setups(self, dsrc):
        df = getattr(dsrc,'train',dsrc).conts
        self.na_dict = {n:self.fill_strategy(df[n], self.fill_vals[n])
                        for n in pd.isnull(df).any().keys()}
    def encodes(self, to):
        missing = pd.isnull(to.conts)
        for n in missing.any().keys():
            assert n in self.na_dict, f"nan values in `{n}` but not in setup training set"
            to[n].fillna(self.na_dict[n], inplace=True)
            if self.add_col:
                to.loc[:,n+'_na'] = missing[n]
                if n+'_na' not in to.cat_names: to.cat_names.append(n+'_na')
class FillStrategy:
    "Namespace containing the various filling strategies."
    def median  (c,fill): return c.median()
    def constant(c,fill): return fill
    def mode    (c,fill): return c.dropna().value_counts().idxmax()
We can also create multiple tabular objects using different FillStrategy options, as this example shows. Here we create three fill transforms, one for each of the three options in FillStrategy. Then we create a different tabular object from each of them.
fill1,fill2,fill3 = (FillMissing(fill_strategy=s)
for s in [FillStrategy.median, FillStrategy.constant, FillStrategy.mode])
df = pd.DataFrame({'a':[0,1,np.nan,1,2,3,4]})
df1 = df.copy(); df2 = df.copy()
tos = TabularPandas(df, fill1, cont_names='a'),TabularPandas(df1, fill2, cont_names='a'),TabularPandas(df2, fill3, cont_names='a')
for t in tos: t.setup()
test_eq(fill1.na_dict, {'a': 1.5})
test_eq(fill2.na_dict, {'a': 0})
test_eq(fill3.na_dict, {'a': 1.0})
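The three expected values can be verified by hand with the standard library. This is a sketch under stated assumptions: missing entries are represented by None instead of NaN, and the `fill_*` and `fill_missing` functions are hypothetical helpers, not part of fastai.

```python
from collections import Counter
from statistics import median

def fill_median(col, fill):   return median(v for v in col if v is not None)
def fill_constant(col, fill): return fill
def fill_mode(col, fill):     return Counter(v for v in col if v is not None).most_common(1)[0][0]

def fill_missing(col, strategy, fill=0):
    "Return the filled column plus a parallel list flagging which entries were missing."
    value = strategy(col, fill)
    was_na = [v is None for v in col]
    return [value if na else v for v, na in zip(col, was_na)], was_na

col = [0, 1, None, 1, 2, 3, 4]
filled, flags = fill_missing(col, fill_median)
# The median of [0, 1, 1, 2, 3, 4] is 1.5, the constant is 0, the mode is 1
```

The `flags` list plays the role of the added _na column: it keeps the fact that a value was missing even after the gap has been filled.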
There was a question on whether setups should be called by the constructor. The answer is no, as we call setup via the TypeDispatch method only after we have enough information about what we want to set up with. This is usually available when we create the DataSource and know the training and validation sets, and what to do with them. That's why it is there in datasource. But it can also be called before the datasource is created.
We see examples in which all the procs (Normalize, Categorify, FillMissing and noop) are included. They can be used in a Pipeline, and each is applied to the relevant columns only. For example, Normalize is applied only to continuous columns.
procs = [Normalize, Categorify, FillMissing, noop]
df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,1,np.nan,1,2,3,4]})
to = TabularPandas(df, procs, cat_names='a', cont_names='b')
to.setup()
#Test setup and apply on df_main
test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to.a, [1,2,3,2,2,3,1])
test_eq(to.b_na, [1,1,2,1,1,1,1])
x = np.array([0,1,1.5,1,2,3,4])
m,s = x.mean(),x.std()
test_close(to.b.values, (x-m)/s)
test_eq(to.procs.classes, {'a': ['#na#',0,1,2], 'b_na': ['#na#',False,True]})
We now get to a dataset that has categorical columns, continuous columns and dependent variables. To process them we need three different tensors, as each of them is a different data type. This is achieved by the ReadTabBatch transform, which is an ItemTransform. In its encodes we take the tabular object and convert the categorical column values to a long tensor and the continuous column values to a float tensor. The dependent variable is converted to a long tensor; the code has a TODO to use a float tensor when the target is continuous. The categorical and continuous tensors are returned together as a tuple, alongside the target tensor.
class ReadTabBatch(ItemTransform):
    def __init__(self, to): self.to = to
    # TODO: use float for cont targ
    def encodes(self, to): return (tensor(to.cats).long(),tensor(to.conts).float()), tensor(to.targ).long()
df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,np.nan,1,1,2,3,4], 'c': ['b','a','b','a','a','b','a']})
to = TabularPandas(df, procs, cat_names='a', cont_names='b', y_names='c')
to.datasource(splits=[[0,1,4,6], [2,3,5]])
test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to.a, [1,2,2,1,0,2,0])
test_eq(df.a.dtype,int)
test_eq(to.b_na, [1,2,1,1,1,1,1])
test_eq(to.c, [2,1,1,1,2,1,2])
We return to the example involving the ADULTS dataset, where we did not use the ReadTabBatch transform explicitly. That is because we are using TabDataLoader, which is a subclass of TfmdDL. In its __init__, ReadTabBatch is automatically added to the other after_batch transforms.
@delegates()
class TabDataLoader(TfmdDL):
    do_item = noops
    def __init__(self, dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, **kwargs):
        after_batch = L(after_batch)+ReadTabBatch(dataset.items)
        super().__init__(dataset, bs=bs, shuffle=shuffle, after_batch=after_batch, num_workers=num_workers, **kwargs)
    def create_batch(self, b): return self.dataset.items.iloc[b]
In tabular data, and also for RAPIDS, we want to take the data one batch at a time and not as individual rows. That's why we set do_item=noops in the code, which prevents single rows from being fetched one by one. Then we replace create_batch so that it gets all the items of a batch from the dataset with a single iloc call. RAPIDS uses a similar approach. This is also one of the reasons why we replaced the PyTorch DataLoader: so that a batch-based dataloader can be used in tabular. Jeremy will add an example showing how to do this using a databunch as well. The rest of the code in the ADULTS example follows what has been explained in this lecture.
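The batch-at-a-time idea can be sketched with a small generator. `BatchwiseLoader` is a hypothetical class operating on a plain list; it only illustrates fetching a whole batch with one indexing call and does not reproduce TfmdDL or TabDataLoader.

```python
class BatchwiseLoader:
    "Hypothetical loader that fetches a whole batch of rows with one indexing call."
    def __init__(self, rows, bs=16): self.rows, self.bs = rows, bs
    def create_batch(self, idxs):
        # One slice per batch instead of one lookup per row
        return self.rows[idxs.start:idxs.stop]
    def __iter__(self):
        n = len(self.rows)
        for start in range(0, n, self.bs):
            yield self.create_batch(range(start, min(start + self.bs, n)))

rows = list(range(10))
batches = list(BatchwiseLoader(rows, bs=4))   # the last batch is shorter
```

With a dataframe in place of the list, the slice would correspond to a single iloc call per batch, which is the access pattern that suits both Pandas and RAPIDS.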
