A walk with fastai2 - Tabular - Study Group and Online Lectures Megathread

Yes, I have tried stacking with feature engineering as well, and that was the best solution.

However, since the fastai2 learner runs very fast, I am going to try to make an ensemble with the fastai learner too.

@muellerzr executing the following code to predict over a full dataframe prints a flood of blank lines:

with learn.no_bar(), learn.no_logging():
    for i in range(df_test.shape[0]):
        learn.predict(df_test.iloc[i])

You should use get_preds and a test_dl for anything more than one item; otherwise it's inefficient.

I've never understood dataloaders very well. Coming from the vision module, in that area there were just two dataloaders: one for training and one for testing.

In tabular there are three dataloaders?

There have always been three, no matter what sub-area of the library: your training, validation, and test. Training and validation are used during training; the test set (or test_dl) is used for inference only. It applies everything done to your validation set to some new dataset.

Predict actually generates a test_dl of 1 as well

Could you tell me what you think of the approach that I followed?

The approach that I followed for semantic segmentation used just two dataloaders.

When I was selecting hyperparameters, for example weight decay, I split the training set with a RandomSplitter.

After selecting the hyperparameters I again went with two dataloaders: the full training set and the provided test folder, split via ParentSplitter; this second time I was maximizing Dice.

I did this with several architectures: Unet, DeepLabV3+, Mask-RCNN…
I selected the arch with the best Dice, exported learn.model to TorchScript, and now I am doing inference with plain PyTorch.

Do you provide it as a parameter to get_preds, or do you assign it to the learner? This part of the library looks confusing to me.

You do learn.get_preds(dl=dl)

Where dl is your new test DataLoader

Okay, I think I have understood it now. Thank you very much for your help!

Just to check whether I have understood your course correctly: what do you think about the approach I mention in the comment above?

Z - you cover the standard fastai tabular as well as xgboost and random forests. Do you ever cover exporting trained embeddings to the tree methods? You could always export the embeddings and recreate the full data layer outside fastai, but it seems like there should be a way to hijack the fastai feed into a prepped array for xgboost training.

@ralph No I don't, though it's been done by someone on the forums. Also, you can now represent xgboost and RFs as NNs, so there's potential there too :wink: https://t.co/VyJvexZe2e?amp=1

I'm not sure, as I've never tried before, but I'd assume that would be fine. It's the same concept as text data in a way, where we train an LM on all the data and then fine-tune it.

What is an LM?

I am going to try to explain a little better; I think my previous explanation was not good at all.

The project I did compared several architectures for segmentation. I separated my data into training and validation (this data is representative of the real data).

For each architecture I selected hyperparameters (for example, weight decay). For selecting hyperparameters I used a RandomSplitter applied just to the train data.

After selecting hyperparameters, I trained a model on the full training set and passed the validation folder as the validation loader. Here I trained maximizing Dice.

After all architectures were trained with their best hyperparameters, I chose the one with the highest Dice score.
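For context, fastai's RandomSplitter just shuffles the item indices and splits off a validation fraction; a minimal pure-Python sketch of that idea (a simplification, not the actual fastai implementation):

```python
import random

def random_splitter(valid_pct=0.2, seed=None):
    """Return a function that splits item indices into (train, valid),
    mimicking the idea behind fastai's RandomSplitter."""
    def _inner(items):
        rng = random.Random(seed)
        idxs = list(range(len(items)))
        rng.shuffle(idxs)
        n_valid = int(len(items) * valid_pct)
        return idxs[n_valid:], idxs[:n_valid]
    return _inner

splitter = random_splitter(valid_pct=0.25, seed=42)
train_idx, valid_idx = splitter(list(range(100)))
print(len(train_idx), len(valid_idx))  # 75 25
```

Applying this only to the training data, as described above, keeps the held-out validation folder untouched during hyperparameter selection.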

Yes, I did understand what you meant :slight_smile: LM = Language Model. It follows somewhat the same concept, so I think you'd be fine (in terms of something representative of your dataset).

Thank you very much for the course, your patience in the forums, and your help! It is nice to have you in the forums!

I tried

procs = [Categorify,Normalize]
to_test = TabularPandas(df_test, procs=procs, cat_names=features)
dl_test= to_test.dataloaders(bs=512)

But it didn't work:

ValueError                                Traceback (most recent call last)
~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/fastai2/learner.py in one_batch(self, i, b)
    160             if len(self.yb) == 0: return
--> 161             self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')
    162             if not self.training: return

~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/fastai2/layers.py in __call__(self, inp, targ, **kwargs)
    293         if self.flatten: inp = inp.view(-1,inp.shape[-1]) if self.is_2d else inp.view(-1)
--> 294         return self.func.__call__(inp, targ.view(-1) if self.flatten else targ, **kwargs)

~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():

~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
    931         return F.cross_entropy(input, target, weight=self.weight,
--> 932                                ignore_index=self.ignore_index, reduction=self.reduction)

~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   2316         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2317     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)

~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2112         raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'
-> 2113                          .format(input.size(0), target.size(0)))
   2114     if dim == 2:

ValueError: Expected input batch_size (512) to match target batch_size (0).

During handling of the above exception, another exception occurred:

You pass in to_test directly to get_preds. TabularPandas is a DataLoader itself

@WaterKnight see the 02_Regression notebook, specifically " Inference on a test set"

What's the reason behind restricting learn.predict to just one dataframe row? I tend to prefer predict over get_preds because it decodes the predictions.

That's simply what .predict is meant to do :slight_smile: You can still decode a batch of data (afterwards) instead, but predict has always been for predicting one item.

It's also super inefficient to do so. Let's remove fastai and show this in plain PyTorch. If we were to run a single prediction on each of 100 items, here's what it would end up like:

for batch in test_dl:  # with a batch size of 1, this is one forward pass per item
    with torch.no_grad():  # no gradients needed for inference
        out = learn.model(*batch[:-1])

Its time is ~595ms on average. What about batches? If I do a batch of 32 samples, it's only 23.8ms!
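The speedup comes from amortizing the per-call overhead across the batch: 100 items at batch size 1 means 100 forward passes, while batch size 32 needs only 4. A toy pure-Python sketch (hypothetical names, a counter standing in for the real model) just counting the calls:

```python
CALLS = 0

def fake_model(batch):
    """Stand-in for a forward pass: one call handles a whole batch."""
    global CALLS
    CALLS += 1
    return [x * 2 for x in batch]

items = list(range(100))

# One "prediction" per item, like looping learn.predict: 100 calls
CALLS = 0
for x in items:
    fake_model([x])
per_item_calls = CALLS

# Batched, like get_preds with bs=32: ceil(100/32) = 4 calls
CALLS = 0
for i in range(0, len(items), 32):
    fake_model(items[i:i + 32])
batched_calls = CALLS

print(per_item_calls, batched_calls)  # 100 4
```

With a real network, each call also carries Python, framework, and (on GPU) kernel-launch overhead, which is why the per-item loop is so much slower in practice.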

So then how do I decode? We change it to:

inp,preds,_,dec_preds = learn.get_preds(dl=test_dl, with_decoded=True, with_input=True)
b = (*tuplify(inp),*tuplify(dec_preds))
dec_pandas = learn.dls.decode(b)

This will then decode it for us :slight_smile:

(And if you want the raw, messy version, it looks like this:)

outs = []
cats, conts = [], []
for batch in test_dl:
    with torch.no_grad():
        cats += batch[0]
        conts += batch[1]
        outs.append(learn.model(batch[0], batch[1]))  # collect the raw predictions too

cats = torch.stack(cats)
conts = torch.stack(conts)
outs = torch.cat(outs, dim=0)
b = (*tuplify((cats, conts)), *tuplify(outs))
dec_pandas = learn.dls.decode(b)

(Why use the raw messy version? It saves about half the time: 31.9ms vs 71.5ms.)

Though it will soon be less messy; stay tuned.

Thank you, I am going to look at it!

Wow, I did not know about that tuplify thing. Thank you so much @muellerzr!!!
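For reference, tuplify comes from fastcore, and the core idea is just "wrap anything that isn't already a tuple in one". A simplified sketch of that idea (not fastcore's actual implementation, which goes through its L class):

```python
def tuplify(o):
    """Simplified sketch of fastcore's tuplify: ensure o is a tuple."""
    if isinstance(o, tuple):
        return o
    if isinstance(o, (list, set)):
        return tuple(o)
    return (o,)  # wrap a single object

print(tuplify(3))       # (3,)
print(tuplify((1, 2)))  # (1, 2)
print(tuplify([1, 2]))  # (1, 2)
```

That is why `(*tuplify(inp), *tuplify(dec_preds))` works regardless of whether each piece is a single tensor or already a tuple of tensors.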