Fastai v2 tabular

@jeremy fastai v1 gets higher LB score (higher roc_auc_score)

Hi. I’m currently working on a multiclass classification problem with tabular data.

I’m using TabularPandas to prepare the data but I want a specific mapping from target class to integer, so I use: y_block=CategoryBlock(vocab=my_mapping, sort=False) as an argument to TabularPandas. I have noticed that the resulting, transformed target does not match the my_mapping vocab.

If I’m correct, the cause is that when reduce_memory=True in TabularPandas, df_shrink is called and transforms the target variable (if it is of object type) independently of the specified y_block. I’m not sure whether this is intentional and, if not, what the best alternative is, but a note about this in the docs or code might be helpful in the future.

I’m also willing to help where I can (I’m new to v2) if this is something that needs fixing.
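
To illustrate the kind of mismatch described above, here is a minimal pure-Python sketch. It assumes the memory-reduction step encodes an object column as a category whose codes follow the sorted order of the unique values (pandas' default for category conversion); the target values and my_mapping below are made up, and this is not fastai's actual code.

```python
# Hypothetical target values and desired mapping (not from the original post).
targets = ["low", "high", "medium", "low", "high"]
my_mapping = ["low", "medium", "high"]  # desired order: low=0, medium=1, high=2

# Assumed behaviour of a generic object-to-category conversion:
# codes follow the sorted order of the unique values.
auto_vocab = sorted(set(targets))
auto_codes = [auto_vocab.index(t) for t in targets]

# What the explicit vocab was meant to produce.
wanted_codes = [my_mapping.index(t) for t in targets]

print(auto_vocab)    # ['high', 'low', 'medium']
print(auto_codes)    # [1, 0, 2, 1, 0]
print(wanted_codes)  # [0, 2, 1, 0, 2]
```

If the column has already been converted this way before the y_block sees it, the custom ordering can be lost, which would match the behaviour reported with reduce_memory=True.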

I am also working on a similar problem. When I pass y_block= CategoryBlock to either TabularPandas or tabular_learner the model cannot be trained anymore and fails with ValueError: Expected input batch_size (64) to match target batch_size (13184). I also cannot find an example on TabularPandas for classification anywhere.

There are plenty, first the tabular tutorial in the fastai docs:

And in Walk with fastai2:

@Jan did setting reduce_memory to False fix it?

Yes, it seems fastai v2 cannot handle this anymore. It was working with v1.

Can you provide a reproducible example of what’s going on here? I’ve been actively using fastai tabular for months now without issue with both classification and regression problems, and I’m not sure exactly what’s happening to do this.

EDIT: @soerendip very important question: what’s your loss function? I believe you may not be using CrossEntropyLossFlat(), which could be affecting it.

Currently I’ve tried setting it up in a scenario similar to what I believe yours could be:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
y_names = 'salary'
# '>=50k' is assumed to be the other label value in adult.csv
new_dict = {'<50k': 0, '>=50k': 1}
df[y_names] = df[y_names].replace(new_dict)

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock
splits = RandomSplitter()(range_of(df))

to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, y_block=y_block, splits=splits)
dls = to.dataloaders()

learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)

And can train without issue. Can you tell me how your DataFrame setup may differ?

Yes, when reduce_memory=False I didn’t have that issue. I will see if I get to a reproducible example later today.

I am now using BCEWithLogitsLossFlat() or BCELossFlat() in conjunction with y_range=(0,1):

learn = tabular_learner(dls, y_range=(0,1), layers=[300, 200, 200], loss_func=BCELossFlat())

When I switch to CrossEntropyLossFlat the code breaks. Apparently, it cannot be used for targets with multiple elements; it only works for one-dimensional targets, e.g. y_names='target' but not y_names=[1, 2, 3]…

My targets are binary vectors with independent labels e.g. [1, 0,0,1,0,0,0,1,…]
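
The numbers in the earlier ValueError are consistent with this: 13184 is exactly 64 × 206, which is what a flattening loss would produce from a (64, 206) target. The sketch below only reproduces that arithmetic; the figure of 206 target columns is inferred from the error message, not stated in the post.

```python
batch_size = 64
flattened_target = 13184  # from "target batch_size (13184)" in the error

# How many target columns per row would give that flattened length?
n_targets, remainder = divmod(flattened_target, batch_size)
print(n_targets, remainder)  # 206 0

# A toy (batch, n_targets) target flattened row by row, like view(-1):
target = [[0] * n_targets for _ in range(batch_size)]
flat = [v for row in target for v in row]
print(len(flat))  # 13184
```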

Yes, that would be expected. BCE is for multi-label problems (multiple labels showing up at one time); its pairing is MultiCategoryBlock(), and this is noted in the documentation. I haven’t played with that scenario much, so if you still run into issues let me know.
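
For readers unsure about the difference, here is a small pure-Python sketch of the two losses on a single row. It uses the standard textbook formulas, not fastai's or PyTorch's code, and the logits are made-up numbers.

```python
import math

def softmax_cross_entropy(logits, target_index):
    # Single-label case: exactly one correct class, softmax over all outputs.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]

def binary_cross_entropy_with_logits(logits, targets):
    # Multi-label case: each output is an independent yes/no decision.
    losses = []
    for z, t in zip(logits, targets):
        # Numerically stable form of -[t*log(s) + (1-t)*log(1-s)], s = sigmoid(z)
        losses.append(max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z))))
    return sum(losses) / len(losses)

logits = [2.0, -1.0, 0.5]
print(softmax_cross_entropy(logits, 0))                     # target is one class index
print(binary_cross_entropy_with_logits(logits, [1, 0, 1]))  # target is a 0/1 vector
```

The cross-entropy target is a single integer per row, while the BCE target has one 0/1 entry per output, which is why a binary vector target fits the second form.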

It does not work with MultiCategoryBlock in my case because my data is not in that format. For MultiCategoryBlock I would need labels like ‘horse running dawn’ but I have binary vectors [1, 0, 1, 0, 0, 0,…], so a 2D matrix for Y. So, I don’t specify a data block. The performance seems to be comparable with other models though.

I just ran into what appears to be a PyTorch error during autograd, on a model that is a variation of the simple 2-layer Tabular model in Chapter 9 of the book. I’d appreciate any advice on how to track it down. Here are the key lines. If it makes any difference, I’m trying this on a MacBook Pro laptop that doesn’t have a compatible GPU, so it’s using the CPU only.

~/anaconda3/envs/fastai/lib/python3.8/site-packages/torch/autograd/ in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    123         retain_graph = create_graph
--> 125     Variable._execution_engine.run_backward(
    126         tensors, grad_tensors, retain_graph, create_graph,
    127         allow_unreachable=True)  # allow_unreachable flag

RuntimeError: Found dtype Short but expected Float
Exception raised from compute_types at /Users/distiller/project/conda/conda-bld/pytorch_1595629449223/work/aten/src/ATen/native/TensorIterator.cpp:183 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 169 (0x128c4e199 in libc10.dylib)
frame #1: at::TensorIterator::compute_types(at::TensorIteratorConfig const&) + 3842 (0x121193312 in libtorch_cpu.dylib)
frame #2: at::TensorIterator::build(at::TensorIteratorConfig&) + 618 (0x12119c51a in libtorch_cpu.dylib)
frame #3: at::TensorIterator::TensorIterator(at::TensorIteratorConfig&) + 223 (0x12119c1ff in libtorch_cpu.dylib)
frame #4: at::native::mse_loss_backward_out(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long long) + 410 (0x120fe7f7a in libtorch_cpu.dylib)

I have trained a model using XGBoost, but I did the data processing for training and validation sets using TabularPandas (similar to the approach done in the fastai book). I did not use dataloaders or a learner object. Now, I need to use it for monthly inference, but the only way I can get it to process properly is to have a training and validation set. I just want to apply the transforms to the validation set the same way each month. For example, I believe whether a _na column is created is dependent on the data given to it and if it has any null values. For inference, I just want to process the “validation set” and make my predictions.

I would like a way to export the transform logic of a TabularPandas object, then import it whenever I want to process a dataframe into a TabularPandas object that can be used for inference.

My current workaround is to have a static dummy training set that gets processed with any new data so that I have a ‘training’ and validation set. I doctored the training set to ensure that the right columns have null values, etc. Then I create the TabularPandas object and do inference on the validation set. This allows TabularPandas to process it the same every month.

The example in the fastai tabular book where a RandomForest is being trained is a great example. Now, if you need to load the model up a month later to do inference only using the random forest - how would you process the new data?

Ideally, I would like to avoid dataloaders (as it doesn’t give me anything for this problem). I would also like to avoid processing extra ‘training’ data as I only want to do inference and it really shouldn’t be necessary.
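
To make the request concrete, here is a library-agnostic sketch of the "fit once, persist, re-apply" pattern being asked for. It loosely mirrors what Categorify, FillMissing and Normalize would need to remember, but it is not fastai's implementation; the function names, column names and data are all hypothetical.

```python
import pickle
from statistics import mean, median, pstdev

def fit_state(rows, cat_col, cont_col):
    """Learn everything needed to transform future rows (training time only)."""
    present = [r[cont_col] for r in rows if r[cont_col] is not None]
    fill = median(present)
    filled = [fill if r[cont_col] is None else r[cont_col] for r in rows]
    return {
        "vocab": ["#na#"] + sorted({r[cat_col] for r in rows}),
        "fill": fill,
        # Decide once, at training time, whether the *_na indicator exists.
        "add_na_col": any(r[cont_col] is None for r in rows),
        "mean": mean(filled),
        "std": pstdev(filled) or 1.0,
    }

def transform(rows, state, cat_col, cont_col):
    """Apply the stored state; nothing is re-derived from the new data."""
    out = []
    for r in rows:
        missing = r[cont_col] is None
        value = state["fill"] if missing else r[cont_col]
        vocab = state["vocab"]
        new = {
            # Unseen categories fall back to index 0 ("#na#").
            cat_col: vocab.index(r[cat_col]) if r[cat_col] in vocab else 0,
            cont_col: (value - state["mean"]) / state["std"],
        }
        if state["add_na_col"]:
            new[cont_col + "_na"] = missing
        out.append(new)
    return out

# Training time: fit and serialize the state.
train = [{"color": "red", "size": 1.0},
         {"color": "blue", "size": None},
         {"color": "red", "size": 3.0}]
blob = pickle.dumps(fit_state(train, "color", "size"))

# A month later: load the state and transform new rows only.
state = pickle.loads(blob)
new_month = [{"color": "green", "size": 2.0}]
result = transform(new_month, state, "color", "size")
print(result)  # [{'color': 0, 'size': 0.0, 'size_na': False}]
```

The point of the design is that the presence of the _na column, the fill value and the normalization statistics are all fixed at training time, so a new month's data cannot change the output columns.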


Hey Ezno, I had the same situation but wanted to go into production with the ensembled model like Zach showed in his YouTube series. My first attempt was to record the normalization transformation and then apply it manually, but that was difficult.

My solution was to use fastai to make the prediction while returning the ‘inputs’. The inputs parameter gave the transformed data! I passed that to xgboost.

# NN Dataloader & batch, df dataframe passed to fastai
dl = learn.dls.test_dl(df)#,bs=4, val_bs=4)

# NN prediction, 'inputs' are the transformed dataframe we need
inputs, nn_probs, _, nn_preds = learn.get_preds(dl=dl,with_input=True,with_decoded=True)
print("nn Preds:")
#response_json_nn = json.dumps(nn_preds.tolist()) # nn predictions only

# XGBoost predictions, based on the transformed 'inputs' from fastai
xgb_probs = xgb_model.predict_proba(np.hstack((inputs[0].numpy(),inputs[1].numpy()))) #faster than making a new df
xgb_preds = xgb_probs.argmax(axis=1)
#response_json_xgb = json.dumps(xgb_preds.tolist()) # xgb predictions only

# Ensemble results
avg = (nn_probs + xgb_probs) / 2
ensemble_preds = avg.argmax(axis=1)
print("Ensemble preds:")

edit: please note Colab will default to v0.9 of xgboost currently, you’ll need to match the version at time of inference/production! You can run ‘!pip install xgboost==1.2.0 -q’ in Colab to get v1.2 and specify v1.2 in your production environment as well (e.g. ‘xgboost==1.2.0’ in your Docker requirements.txt)!
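
The ensembling step in the code above is a plain average of the two probability matrices followed by an argmax per row. A tiny pure-Python sketch with made-up probabilities shows how the averaged prediction can differ from either model alone:

```python
# Made-up class probabilities for three rows and two classes.
nn_probs = [[0.6, 0.4], [0.3, 0.7], [0.55, 0.45]]
xgb_probs = [[0.2, 0.8], [0.4, 0.6], [0.65, 0.35]]

# Element-wise mean of the two models' probabilities.
avg = [[(a + b) / 2 for a, b in zip(r1, r2)]
       for r1, r2 in zip(nn_probs, xgb_probs)]

# Predicted class = index of the largest averaged probability in each row.
ensemble_preds = [row.index(max(row)) for row in avg]
print(ensemble_preds)  # [1, 1, 0]
```

In the first row the network alone would predict class 0, but the average follows the more confident XGBoost prediction.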

You can also just pass in dl.dataset.xs there.

Thanks! I made a FastAPI version of the Starlette Docker of yours if you’re interested:

You can run/debug it in Colab with a couple tricks. Have your API code in one big block (imports can be separate)… Need to import:

import nest_asyncio #Colab
from pyngrok import ngrok #Colab

Then run it with the following code block, it’ll give you a public facing URL (I appended ‘/docs’ which is useful for FastAPI):

# Re-run this for the page to update
import nest_asyncio
from pyngrok import ngrok
import uvicorn

url = ngrok.connect(port=8000)
print('Public URL:', (url + '/docs'))
nest_asyncio.apply()
# `app` is assumed to be the FastAPI instance defined in your API code block
uvicorn.run(app, port=8000)

When you want to make a change, you’ll need to run a ‘disconnect’ code block. Expect somewhat frequent crashes of Colab…

# Run this to disconnect if you have 'already connected errors'


First of all, I would like to mention this question is duplicated from here. Apologies if you think this is spam, but I guess this question belongs to this thread rather than the other.

I am trying to build a Tabular model for multicategory data using a weighted loss function, since I have a highly imbalanced dataset.

I got class weights as explained here, being class_weights = tensor([11.3539, 1.0000, 5.8010, 5.1732], device='cuda')

Here I have to mention that I have a dataset with 4 single classes, but I would like to train the model to expect merged labels in the future; that’s the reason for the multicategory setup. If you think a better approach should be used, please tell me.

So, for the tabular_learner I have some issues. I one-hot encoded the variables as explained here, so I have a dataset with 4 more columns holding my labels as True/False. If I try to train like:

y_names = ['Label1', 'Label2', 'Label3', 'Label4']
to = TabularPandas(df_multi, procs, cat_names, cont_names,
                   y_names=y_names,
                   y_block=MultiCategoryBlock(encoded=True, vocab=y_names),
                   splits=splits)
tab_dl_m = to.dataloaders(bs=8)
tab_learn_m = tabular_learner(tab_dl_m, metrics=accuracy_multi)
tab_learn_m.loss_func = BCEWithLogitsLossFlat(weight=class_weights)

I got a dimension error:

epoch 	train_loss 	valid_loss 	accuracy_multi 	time
0 	0.000000 	00:00

RuntimeError                              Traceback (most recent call last)
<ipython-input-277-14422a88807c> in <module>
----> 1 tab_learn_m.fit_one_cycle(3)

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastcore/ in _f(*args, **kwargs)
    452         init_args.update(log)
    453         setattr(inst, 'init_args', init_args)
--> 454         return inst if to_return else f(*args, **kwargs)
    455     return _f

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/callback/ in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
    111     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    112               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 113, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
    115 # Cell

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastcore/ in _f(*args, **kwargs)
    452         init_args.update(log)
    453         setattr(inst, 'init_args', init_args)
--> 454         return inst if to_return else f(*args, **kwargs)
    455     return _f

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    202             self.opt.set_hypers( if lr is None else lr)
    203             self.n_epoch,self.loss = n_epoch,tensor(0.)
--> 204             self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
    206     def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in _with_events(self, f, event_type, ex, final)
    154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
    156         except ex: self(f'after_cancel_{event_type}')
    157         finally:   self(f'after_{event_type}')        ;final()

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in _do_fit(self)
    192         for epoch in range(self.n_epoch):
    193             self.epoch=epoch
--> 194             self._with_events(self._do_epoch, 'epoch', CancelEpochException)
    196     @log_args(but='cbs')

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in _with_events(self, f, event_type, ex, final)
    154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
    156         except ex: self(f'after_cancel_{event_type}')
    157         finally:   self(f'after_{event_type}')        ;final()

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in _do_epoch(self)
    187     def _do_epoch(self):
--> 188         self._do_epoch_train()
    189         self._do_epoch_validate()

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in _do_epoch_train(self)
    178     def _do_epoch_train(self):
    179         self.dl = self.dls.train
--> 180         self._with_events(self.all_batches, 'train', CancelTrainException)
    182     def _do_epoch_validate(self, ds_idx=1, dl=None):

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in _with_events(self, f, event_type, ex, final)
    154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
    156         except ex: self(f'after_cancel_{event_type}')
    157         finally:   self(f'after_{event_type}')        ;final()

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in all_batches(self)
    159     def all_batches(self):
    160         self.n_iter = len(self.dl)
--> 161         for o in enumerate(self.dl): self.one_batch(*o)
    163     def _do_one_batch(self):

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in one_batch(self, i, b)
    174         self.iter = i
    175         self._split(b)
--> 176         self._with_events(self._do_one_batch, 'batch', CancelBatchException)
    178     def _do_epoch_train(self):

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in _with_events(self, f, event_type, ex, final)
    154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
    156         except ex: self(f'after_cancel_{event_type}')
    157         finally:   self(f'after_{event_type}')        ;final()

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in _do_one_batch(self)
    164         self.pred = self.model(*self.xb);                self('after_pred')
    165         if len(self.yb) == 0: return
--> 166         self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')
    167         if not return
    168         self('before_backward')

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/fastai/ in __call__(self, inp, targ, **kwargs)
    295         if targ.dtype in [torch.int8, torch.int16, torch.int32]: targ = targ.long()
    296         if self.flatten: inp = inp.view(-1,inp.shape[-1]) if self.is_2d else inp.view(-1)
--> 297         return self.func.__call__(inp, targ.view(-1) if self.flatten else targ, **kwargs)
    299 # Cell

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/torch/nn/modules/ in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/torch/nn/modules/ in forward(self, input, target)
    627     def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 628         return F.binary_cross_entropy_with_logits(input, target,
    629                                                   self.weight,
    630                                                   pos_weight=self.pos_weight,

~/anaconda3/envs/fastai2/lib/python3.8/site-packages/torch/nn/ in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
   2538         raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
-> 2540     return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)

RuntimeError: The size of tensor a (32) must match the size of tensor b (4) at non-singleton dimension 0

Any idea why this is happening?
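
One plausible reading of the traceback, sketched below: with bs=8 and 4 label columns, the flattened loss reshapes predictions and targets to 32 elements (the targ.view(-1) line in the trace), and a 4-element weight cannot broadcast against a 32-element tensor. The helper broadcast_ok is a hypothetical pure-Python check of the usual NumPy/PyTorch broadcasting rule, not a library function.

```python
def broadcast_ok(shape_a, shape_b):
    """True if two shapes are compatible under NumPy/PyTorch broadcasting."""
    # Compare trailing dimensions; each pair must be equal or contain a 1.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

bs, n_labels = 8, 4          # bs=8 and four label columns, as in the post
weight_shape = (n_labels,)   # class_weights has 4 entries

flattened = (bs * n_labels,)   # what a flattening loss sees: (32,)
unflattened = (bs, n_labels)   # (8, 4)

print(broadcast_ok(flattened, weight_shape))    # False
print(broadcast_ok(unflattened, weight_shape))  # True
```

If this reading is right, a loss that keeps the (8, 4) shape, such as torch.nn.BCEWithLogitsLoss(weight=class_weights), would at least avoid the shape error; whether it gives the intended weighting for this model is untested here.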

Hi Elliot,
Did you ever figure this out? I didn’t notice a response in the thread.

What could be happening here might be some decoding magic?

Try comparing the results with:

preds = learn.get_preds(with_decoded=True)

Thanks. I’ll give it a try. Currently retraining BlueBookForBulldozers and it takes a while;-)

I’m having trouble creating a tabular learner when I try to include both categorical and continuous features in the model.
Here’s what I’m using:

to_nn = TabularPandas(chip_df,
    cat_names=cat_nn, cont_names=cont_nn, y_names='height',
    splits=splits, y_block=RegressionBlock())
dls = to_nn.dataloaders(32)
y = to_nn.train.y
loss_func = mse
y_range = (0, 0.5)
learn = tabular_learner(dls, y_range=y_range, 
    loss_func = loss_func,
    n_out = 1,

I get the error:

if layers is None: layers = [200,100]
to = dls.train_ds
emb_szs = get_emb_sz(dls.train_ds, {} if emb_szs is None else emb_szs)
if n_out is None: n_out = get_c(dls)
assert n_out, "n_out is not defined, and could not be inferred from data, set dls.c or pass n_out"

I’m not sure how I should be adding a categorical feature to my model. If I instead use

to_nn = TabularPandas(chip_df,
    cat_names=None, cont_names=cont_nn, y_names='height',
    splits=splits, y_block=RegressionBlock())

Everything works fine (except obviously I’m not including the categorical variable that I want to include). Is there a different method I should be using for combining categorical features? (I’m on fastai v. 2.7.7)