A walk with fastai2 - Tabular - Study Group and Online Lectures Megathread

muellerzr · May 17, 2020, 4:38pm

I think I’ve found the issue and it involves the fact that CategoryMap uses sort=True. Let me verify real quick (this is just me scanning source code)

muellerzr · May 17, 2020, 4:51pm

Okay, I’ve solved the issue @vrodriguezf. Let me go put a PR in, but here’s what I had to do:

Categorize now became:

class myCategorize(Transform):
    "Reversible transform of category string to `vocab` id"
    loss_func,order=CrossEntropyLossFlat(),1
    def __init__(self, vocab=None, add_na=False, sort=True):
        self.add_na = add_na
        self.vocab = None if vocab is None else CategoryMap(vocab,  sort=sort, add_na=add_na)

    def setups(self, dsets):
        if self.vocab is None and dsets is not None: self.vocab = CategoryMap(dsets, sort=sort, add_na=self.add_na)
        self.c = len(self.vocab)

    def encodes(self, o): return TensorCategory(self.vocab.o2i[o])
    def decodes(self, o): return Category      (self.vocab    [o])

Specifically, we pass in a sort value that controls whether the vocab gets re-sorted or keeps the order that was passed in. Then we adjusted the CategoryBlock:

def myCategoryBlock(vocab=None, sort=True, add_na=False):
    return TransformBlock(type_tfms=myCategorize(vocab=vocab, sort=sort, add_na=add_na))

This makes sure the sort option is passed through. Finally, we had to adjust the Categorize setups like so:

@myCategorize
def setups(self, to:Tabular):
    if len(to.y_names) > 0:
        if self.vocab is None:
            self.vocab = CategoryMap(getattr(to, 'train', to).iloc[:,to.y_names[0]].items)
        else:
            self.vocab = CategoryMap(self.vocab, sort=False, add_na=self.add_na)
        self.c = len(self.vocab)
    return self(to)

By default it was sorting every time; this is the behavior that needed to be changed.

Then, when you set y_block, make sure to use myCategoryBlock with your parameters.
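To make the effect of the sort flag concrete, here is a minimal sketch in plain Python. It is not fastai's CategoryMap; `build_vocab` is a hypothetical stand-in that only mimics the label-to-id lookup (`o2i`), assuming the vocab is a list of unique labels.

```python
def build_vocab(labels, sort=True):
    """Hypothetical stand-in for a category vocab; not fastai's CategoryMap."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    items = list(dict.fromkeys(labels))
    if sort:
        items = sorted(items)
    # o2i maps each label to its integer id, mirroring the `vocab.o2i` lookup
    o2i = {label: i for i, label in enumerate(items)}
    return items, o2i


classes = ["low", "medium", "high"]  # an ordering the user wants to keep

sorted_items, sorted_o2i = build_vocab(classes, sort=True)
kept_items, kept_o2i = build_vocab(classes, sort=False)

print(sorted_items)  # ['high', 'low', 'medium']
print(kept_items)    # ['low', 'medium', 'high']
```

With sort=True the label "low" ends up with id 1, while with sort=False it keeps id 0, which is why a vocab that gets re-sorted silently changes what each class index means.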

vrodriguezf · May 17, 2020, 5:02pm

Wow that was fast! I’ll use that workaround for now thank you so much, you’re awesome!

muellerzr · May 17, 2020, 5:04pm

If you run into an issue, categorize should be:

# export
class Categorize(Transform):
    "Reversible transform of category string to `vocab` id"
    loss_func,order=CrossEntropyLossFlat(),1
    def __init__(self, vocab=None, sort=True, add_na=False):
        self.add_na = add_na
        self.sort = sort
        self.vocab = None if vocab is None else CategoryMap(vocab, sort=sort, add_na=add_na)

    def setups(self, dsets):
        if self.vocab is None and dsets is not None: self.vocab = CategoryMap(dsets, sort=self.sort, add_na=self.add_na)
        self.c = len(self.vocab)

    def encodes(self, o): return TensorCategory(self.vocab.o2i[o])
    def decodes(self, o): return Category      (self.vocab    [o])

(finding bugs as I actually code this thing)

Edit: All bugs are fixed, you should be good to go @vrodriguezf

muellerzr · May 18, 2020, 12:14pm

FYI this fix is now in the main library. Install fastcore and fastai2 with the dev installs to use it right away.

vrodriguezf · May 20, 2020, 10:36am

Hi, there is a weird behaviour (in my opinion) when calling learn.show_results() with a TabularLearner.

If I run type(learn.dl) after fitting the learner, I get fastai2.tabular.core.TabDataLoader. However, after calling learn.show_results() it gives list. Does it make sense that a call to show_results modifies the type of an attribute of the learner?

I noticed this because fastshap was raising an error, since it expects learn.dl to be a TabDataLoader.

Thanks!

muellerzr · May 20, 2020, 11:16am

That’s an interesting behavior, because this is all show_results is:

@typedispatch
def show_results(x:Tabular, y:Tabular, samples, outs, ctxs=None, max_n=10, **kwargs):
    df = x.all_cols[:max_n]
    for n in x.y_names: df[n+'_pred'] = y[n][:max_n].values
    display_df(df)

Notice we don’t actually modify anything, nor make it a list. I’m wondering if we need .copy()’s here instead? (Maybe you can try that?) If that doesn’t work, I highly recommend filing an issue on GitHub with a reproducer (cc @sgugger in case you can think of why off the top of your head).

sgugger · May 20, 2020, 11:39am

learn.dl is not a reliable attribute: it’s saved during any run of the training loop or inference to represent the dl currently used, but outside of that it’s not a useful attribute. In this case it goes from the validation dataloader (from your previous fit) to a list containing one batch (from the get_preds launched by show_results).

In connection with #350, I’ll set it to None at the end of every training for cleanup (along with learn.xb, learn.yb, learn.preds, learn.loss) so you won’t actually see anything in it.
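A toy illustration of the behaviour described above, in plain Python. `ToyLearner` and `ToyLoader` are hypothetical classes, not fastai code; they only mimic the way a transient `dl` attribute gets overwritten by whichever loop ran last.

```python
class ToyLoader:
    """Minimal iterable of batches, standing in for a real dataloader."""
    def __init__(self, batches):
        self.batches = batches

    def __iter__(self):
        return iter(self.batches)


class ToyLearner:
    """Hypothetical stand-in for a learner; it only mimics how `dl` is overwritten."""
    def __init__(self):
        self.dl = None

    def fit(self, train_dl, valid_dl):
        # the loop ends on validation, so `dl` is left pointing at that loader
        for dl in (train_dl, valid_dl):
            self.dl = dl

    def show_results(self, valid_dl):
        # inference runs on a list holding a single batch, which replaces `dl`
        self.dl = [next(iter(valid_dl))]


train_dl, valid_dl = ToyLoader([[1, 2], [3, 4]]), ToyLoader([[5, 6]])
learn = ToyLearner()

learn.fit(train_dl, valid_dl)
type_after_fit = type(learn.dl).__name__   # 'ToyLoader'

learn.show_results(valid_dl)
type_after_show = type(learn.dl).__name__  # 'list'

# Safer for downstream tools: keep your own reference instead of reading learn.dl
my_valid_dl = valid_dl
```

In fastai itself the validation dataloader can be reached through learn.dls.valid, so code that needs a stable reference is probably better off reading it from there.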

muellerzr · May 20, 2020, 11:45am

Noted! @vrodriguezf I’ll make some adjustments to fastshap by the end of the week with a fix. Thanks Sylvain

vrodriguezf · May 20, 2020, 12:56pm

Understood! It looked like a weird attribute tbh. Thanks for the fix!!!

WaterKnight · June 6, 2020, 7:27pm

In your notebook, @muellerzr, you make use of

tabular_config({'emb_p':float(dp),
                          'wd':float(wd)})

With this I am getting the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/bayes_opt/target_space.py in probe(self, params)
    190         try:
--> 191             target = self._cache[_hashable(x)]
    192         except KeyError:

KeyError: (0.21434078230426126, 158.04867401632373, 100.10293733561039, 744.1986307373115, 0.014684121522803134, 1.1846771895375956, 0.07482958046651729)

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<timed eval> in <module>

~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/bayes_opt/bayesian_optimization.py in maximize(self, init_points, n_iter, acq, kappa, kappa_decay, kappa_decay_delay, xi, **gp_params)
    183                 iteration += 1
    184 
--> 185             self.probe(x_probe, lazy=False)
    186 
    187             if self._bounds_transformer:

~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/bayes_opt/bayesian_optimization.py in probe(self, params, lazy)
    114             self._queue.add(params)
    115         else:
--> 116             self._space.probe(params)
    117             self.dispatch(Events.OPTIMIZATION_STEP)
    118 

~/anaconda3/envs/proyecto5/lib/python3.7/site-packages/bayes_opt/target_space.py in probe(self, params)
    192         except KeyError:
    193             params = dict(zip(self._keys, x))
--> 194             target = self.target_func(**params)
    195             self.register(x, target)
    196         return target

<ipython-input-7-c018850183eb> in fit_with(lr, wd, dp, n_layers, layer_1, layer_2, layer_3)
      8         layers = [int(layer_1)]
      9     config = tabular_config({"emb_p":float(dp),
---> 10                           "wd":float(wd)})
     11     learn = tabular_learner(dls, layers=layers, metrics=accuracy, config = config)
     12 

TypeError: tabular_config() takes 0 positional arguments but 1 was given

muellerzr · June 6, 2020, 7:45pm

@WaterKnight apparently there was a change in the code somewhere along the line. Now the kwargs are passed in as actual parameters, i.e.:

Instead of

kwargs = {'embed_p':0.1}
config = tabular_config(kwargs)

You should do

config = tabular_config(embed_p=0.1)

I’ll show this adjustment in the notebook shortly

Edit: @WaterKnight that notebook has been updated. Thanks for the bug report
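If the hyperparameters already live in a dict (as they do in a Bayesian optimization loop), unpacking it with `**` should give the same result as writing the keywords out. A minimal sketch in plain Python, where `toy_config` is a hypothetical stand-in with a keyword-only signature like the one the traceback implies, not the real tabular_config:

```python
def toy_config(**kwargs):
    """Hypothetical stand-in that, like the traceback suggests, accepts keywords only."""
    return dict(kwargs)


params = {"embed_p": 0.1}

try:
    toy_config(params)         # dict passed positionally, as in the old notebook
    error_message = None
except TypeError as e:
    error_message = str(e)     # "... takes 0 positional arguments but 1 was given"

config = toy_config(**params)  # unpacking turns each key into a keyword argument
print(error_message)
print(config)
```

Note that the keys still have to be names the function actually accepts; unpacking only changes how they are passed.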

WaterKnight · June 6, 2020, 7:48pm

@muellerzr Thank you very much as always!

In addition, I would like to know what type of data the predict method accepts.

muellerzr · June 6, 2020, 7:50pm

It’s a Pandas row, i.e.:

learn.predict(df.iloc[0])

A NumPy array will not work. Also, it only works on one individual row.

WaterKnight · June 6, 2020, 7:51pm

You are welcome. I think that I have seen this in other notebooks, so take a look at it; if you can’t, I will do it for you!

muellerzr · June 6, 2020, 7:52pm

Should be the only notebook that has it. If there’s any more please let me know

WaterKnight · June 6, 2020, 7:57pm

You are right!

I am going to look at your ensembling notebook. In one of my courses we have worked with LightGBM and XGBoost. I am trying to find out if this learner can do better!

muellerzr · June 6, 2020, 8:00pm

Most likely it will not outperform the GBM or XGBoost (it may with a ton of hyperparameter tuning, but even without it, it will still be close). However, ensembling always helps.
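One simple way to ensemble, sketched in plain Python, is to average the class probabilities each model predicts for the same rows. The numbers below are made up purely for illustration, and `average_probs` is a hypothetical helper, not something from fastai or the notebook.

```python
def average_probs(*model_probs):
    """Average per-row class probabilities from several models (unweighted ensemble)."""
    n_models = len(model_probs)
    return [
        [sum(vals) / n_models for vals in zip(*rows)]  # average class by class
        for rows in zip(*model_probs)                  # walk the models row by row
    ]


# Made-up probabilities for two rows and two classes
nn_probs = [[0.6, 0.4], [0.2, 0.8]]
gbm_probs = [[0.8, 0.2], [0.4, 0.6]]

ensemble = average_probs(nn_probs, gbm_probs)
# predicted class = index of the largest averaged probability
labels = [max(range(len(p)), key=p.__getitem__) for p in ensemble]
print(labels)  # [0, 1]
```

Weighted averages or stacking are common variations; which one works best depends on the data.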

WaterKnight · June 6, 2020, 8:03pm

Yes, I have also tried stacking with feature engineering, and this was the best solution.

However, as the fastai2 learner runs very fast, I am going to try to make an ensemble with the fastai learner too.

@muellerzr executing the following code for predicting on a full dataframe is printing blank lines like hell:

with learn.no_bar() and learn.no_logging():
    res=[]
    for i in range(df_test.shape[0]):
        aux=learn.predict(df_test.iloc[i])[2].cpu().numpy()
        res.append(aux)
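A side note on the `with` line above: in Python, `with a and b:` evaluates the expression `a and b` first, so only the second context manager is entered. Entering both requires a comma. A small standard-library sketch, where `make_tracker` and `tag` are hypothetical helpers used only to record which managers were entered:

```python
from contextlib import contextmanager


def make_tracker():
    entered = []

    @contextmanager
    def tag(name):
        entered.append(name)  # runs only when the manager is actually entered
        yield

    return entered, tag


entered_and, tag = make_tracker()
with tag("no_bar") and tag("no_logging"):
    pass
print(entered_and)    # ['no_logging']

entered_comma, tag = make_tracker()
with tag("no_bar"), tag("no_logging"):
    pass
print(entered_comma)  # ['no_bar', 'no_logging']
```

So with `and`, the no_bar context is probably never active in the loop above, which may be part of why the progress output still shows up.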

muellerzr · June 6, 2020, 8:10pm

You should use get_preds and test_dl for anything more than 1 item, otherwise it’s inefficient.