Fastai v2 tabular

Still not working for me. My notebook is set up similarly to the Rossmann notebook, which I can run without any issues.
One thing I find different is that my training and validation losses are really high.


It worked for MAE, but gives this error when I try to run with the rmspe metric.

@muellerzr Can you please take a look at this GitHub gist showing the error with rmspe?
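For reference, rmspe can also be written as a plain tensor function and passed via metrics=[rmspe]; a minimal sketch of my own definition (assuming the targets are never zero):

import torch

def rmspe(preds, targs):
    "Root mean squared percentage error"
    pct_err = (targs - preds) / targs
    return torch.sqrt((pct_err ** 2).mean())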

@sgugger I've managed to incorporate SHAP (model interpretability) into the fastai2 library. What would be required to add this to the library for tabular interpretation? I've got it to a point where you pass in a DataFrame of values to run against; let me know your thoughts. The full implementation is below:

def pred(data):
    "Prediction function for KernelExplainer: takes a numpy array of [cats, conts] columns"
    device = learn.dls.device
    cat_cols = len(learn.dls.train_ds.cat_names)
    cont_cols = len(learn.dls.train_ds.cont_names)
    # split the flat array back into categorical and continuous tensors
    x_cat = torch.from_numpy(data[:, :cat_cols]).to(device, torch.int64)
    x_cont = torch.from_numpy(data[:, -cont_cols:]).to(device, torch.float32)
    pred_proba = learn.model(x_cat, x_cont).detach().to('cpu').numpy()
    return pred_proba

def shap_data(dls:TabularDataLoaders, test_data:pd.DataFrame):
    "Build the background (train) and explanation (test) DataFrames SHAP needs"
    X_traincat, X_traincont, y_train = dls.one_batch()
    dl = dls.test_dl(test_data)
    # only the processed inputs are needed here, not the targets
    X_testcat, X_testcont = tensor(dl.cats).long(), tensor(dl.conts).float()
    X_train = [X_traincat, X_traincont]
    X_test = [X_testcat, X_testcont]
    cols = dls.cat_names + dls.cont_names
    # concatenate cats and conts into single DataFrames with the original column names
    X_train = pd.DataFrame(np.concatenate([v.to('cpu').numpy() for v in X_train], axis=1), columns=cols)
    X_test = pd.DataFrame(np.concatenate([v.to('cpu').numpy() for v in X_test], axis=1), columns=cols)
    return X_train, X_test

Testing with data:

X_train, X_test = shap_data(learn.dls, df.iloc[:100])
e = shap.KernelExplainer(pred, X_train)
shap_values = e.shap_values(X_test, l1_reg=False)

This would probably be added onto Interpretation, with each plot (summary_plot, dependence_plot, etc.) callable as a method, and an optional DataFrame passed in if we wanted to use more test data. A hypothetical sketch of that is below.
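A purely hypothetical sketch of what that could look like (ShapInterpretation and its methods are invented names that reuse the pred/shap_data helpers above, just to illustrate the shape of the API):

class ShapInterpretation:
    "Hypothetical sketch only -- not part of fastai2"
    def __init__(self, learn, test_data:pd.DataFrame):
        # build SHAP inputs and values from the helpers defined above
        X_train, self.X_test = shap_data(learn.dls, test_data)
        self.explainer = shap.KernelExplainer(pred, X_train)
        self.shap_values = self.explainer.shap_values(self.X_test, l1_reg=False)

    def summary_plot(self, **kwargs):
        # each SHAP plot type would get a thin wrapper method like this
        shap.summary_plot(self.shap_values, self.X_test, **kwargs)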

SHAP: https://github.com/slundberg/shap

Or would this be better as a separate implementation that can be pip installed?


Things with an extra dep are best done as pip/conda-installable extension modules. They can still be in the 'fastai' module namespace. This one looks like a great project!


Got it! Will update when it's done :slight_smile:


I've now released it on PyPI, you can do pip install fastshap! The documentation is over at muellerzr.github.io/fastshap

Here is a basic output from a decision_plot:
[image: decision_plot output]

(big help from @nestorDemeure for the documentation and refactoring)


Great library - thanks :+1:

Small typo in the link to the documentation.


Thanks! Yes it was quite late last night when I posted it :sweat_smile:


Hey @muellerzr,
I tried the new fastshap in the Google Colab environment and was not able to display any plots other than force_plot(). I found the cause and a workaround, and have created a GitHub issue summarising both.
GitHub issue for fastshap plots not shown in Google Colab.

I'm not 100% sure why you're getting that issue; I looked into that specifically and the workaround shouldn't be needed. Can you provide a notebook showing what you're doing so I can take a look, @navneetkrch?

(As all I work out of is Colab)

Or can you post here which plot you attempted to use?

I just tried it again myself with the decision, dependence, summary, and waterfall plots, and they all work (I tested them in that order).

The reason is that only force_plot uses JavaScript; the rest do not.
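For reference, that means force_plot needs SHAP's JavaScript initialised in notebook environments; a minimal sketch using the upstream shap API (shap.initjs is the standard shap call, nothing fastshap-specific):

import shap

# force_plot renders via JavaScript, so initialise it once per notebook;
# the matplotlib-based plots (decision, dependence, summary, ...) don't need this
shap.initjs()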

Tested again and everything is working now; I do not understand why I was not getting the other plots earlier.
I've closed the GitHub issue as well.
Thanks a lot :slight_smile:

You may have used an old version from when there were still some bugs. All good, glad it's working now :slight_smile:


Hi @muellerzr,
Have you tried running SHAP on tabular regression models? It was not working for me and gave errors.
I am running this on the most recent notebook that you ran, and tried the Adults dataset itself with a regression task of predicting 'age'.
Please find the GitHub gist with the error from SHAP on tabular regression.

Following the Integration Example from the docs, using NCAA tournament data, I'm getting TypeError: can't convert CUDA tensor to numpy. It seems to result from the loss function (sklearn's log_loss), but I don't know how to fix it. I tried other loss functions, but no luck.

dep_var = 'result'
cat_vars = ['Season', 'TeamId_1', 'TeamId_2', 'Coach_1', 'Coach_2',
            'Top5_1', 'Top5_2', 'Top25_1', 'Top25_2', 'Top50_1', 'Top50_2',
            'ConfAbbrev_1', 'ConfAbbrev_2', 'Is_ConfGm', 'isMajor_1', 'isMajor_2'] 
cont_vars = [c for c in df.columns if c not in cat_vars]
cont_vars.remove('result')

procs=[FillMissing, Categorify, Normalize]
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs, cat_vars, cont_vars, y_names=dep_var, splits=splits)
dls = to.dataloaders()

learn = tabular_learner(dls, layers=[200,100], n_out=1, loss_func=log_loss,
                        metrics=[accuracy])
learn.lr_find()
learn.recorder.plot()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-44-c7a9c29f9dd1> in <module>
----> 1 learn.lr_find()
      2 learn.recorder.plot()

~/git/fastai2/nbs/mine/fastai2/callback/schedule.py in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggestions)
    196     n_epoch = num_it//len(self.dls.train) + 1
    197     cb=LRFinder(start_lr=start_lr, end_lr=end_lr, num_it=num_it, stop_div=stop_div)
--> 198     with self.no_logging(): self.fit(n_epoch, cbs=cb)
    199     if show_plot: self.recorder.plot_lr_find()
    200     if suggestions:

~/git/fastai2/nbs/mine/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    292                     try:
    293                         self.epoch=epoch;          self('begin_epoch')
--> 294                         self._do_epoch_train()
    295                         self._do_epoch_validate()
    296                     except CancelEpochException:   self('after_cancel_epoch')

~/git/fastai2/nbs/mine/fastai2/learner.py in _do_epoch_train(self)
    267         try:
    268             self.dl = self.dls.train;                        self('begin_train')
--> 269             self.all_batches()
    270         except CancelTrainException:                         self('after_cancel_train')
    271         finally:                                             self('after_train')

~/git/fastai2/nbs/mine/fastai2/learner.py in all_batches(self)
    245     def all_batches(self):
    246         self.n_iter = len(self.dl)
--> 247         for o in enumerate(self.dl): self.one_batch(*o)
    248 
    249     def one_batch(self, i, b):

~/git/fastai2/nbs/mine/fastai2/learner.py in one_batch(self, i, b)
    253             self.pred = self.model(*self.xb);                self('after_pred')
    254             if len(self.yb) == 0: return
--> 255             self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')
    256             if not self.training: return
    257             self.loss.backward();                            self('after_backward')

~/anaconda3/lib/python3.7/site-packages/sklearn/metrics/_classification.py in log_loss(y_true, y_pred, eps, normalize, sample_weight, labels)
   2239     The logarithm used is the natural logarithm (base-e).
   2240     """
-> 2241     y_pred = check_array(y_pred, ensure_2d=False)
   2242     check_consistent_length(y_pred, y_true, sample_weight)
   2243 

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    529                     array = array.astype(dtype, casting="unsafe", copy=False)
    530                 else:
--> 531                     array = np.asarray(array, order=order, dtype=dtype)
    532             except ComplexWarning:
    533                 raise ValueError("Complex data not supported\n"

~/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

~/anaconda3/lib/python3.7/site-packages/torch/tensor.py in __array__(self, dtype)
    447     def __array__(self, dtype=None):
    448         if dtype is None:
--> 449             return self.numpy()
    450         else:
    451             return self.numpy().astype(dtype, copy=False)

TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I will debug that this weekend.


I wrote a bugfix for the shap+regression problem; it should soon be integrated into the main repo :slight_smile:


@travis does it run fine with the Adults dataset? Also, does fastai's MSELossFlat work fine?
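In the meantime, a minimal sketch of swapping in a torch-based loss (assuming result is a 0/1 target; BCEWithLogitsLossFlat and accuracy_multi are fastai built-ins, and this is just a guess at a fix, not a tested solution):

from fastai2.tabular.all import *

# fastai's flattened BCE loss operates on tensors and stays on the GPU,
# unlike sklearn's log_loss, which calls np.asarray on a CUDA tensor
learn = tabular_learner(dls, layers=[200,100], n_out=1,
                        loss_func=BCEWithLogitsLossFlat(),
                        metrics=[accuracy_multi])
learn.lr_find()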

Hello,

I would like to use Fastai v2 to build a multivariable regression model.

This is what I did on Colab:

y_names = ['Phytofluene (Extract)','Phytoene (Extract)','Lycopene (Extract)']
cat_names = ['Tomato skin Dry gm','IPA', 'Ethanol','Chloroform', 'Acetone', 'Hexane', 'Mixing time','Wetting time', 'Particle size', 'Pre heating time', 'Pre heat temp',
             'Extraction residence time','Temp.',
       'Concentration temp', 'Concentration time', 'cooling temp',]
cont_names = []
procs = [FillMissing, Categorify, Normalize]
splits = RandomSplitter()(range_of(phytofluene))
to = TabularPandas(phytofluene, procs, cat_names, cont_names, y_names=y_names, splits=splits)
to.procs[-1].means
dls = to.dataloaders()
dls.valid.show_batch()
learn = tabular_learner(dls, n_out=3,layers=[200,100], metrics=accuracy, loss_func=torch.nn.L1Loss)

I get the following error when I try to fit_one_cycle:

/usr/local/lib/python3.6/dist-packages/fastprogress/fastprogress.py:74: UserWarning: Your generator is empty.
  warn("Your generator is empty.")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-56-4dfb24161c57> in <module>()
----> 1 learn.fit_one_cycle(1)

7 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py in legacy_get_string(size_average, reduce, emit_warning)
     34         reduce = True
     35 
---> 36     if size_average and reduce:
     37         ret = 'mean'
     38     elif reduce:

RuntimeError: bool value of Tensor with more than one value is ambiguous

You should call your loss function to create an instance, i.e. nn.L1Loss(), @alvinleong.
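For reference, a minimal sketch of the corrected call (swapping accuracy for mae is my own suggestion, since accuracy is a classification metric and this is a regression task):

import torch.nn as nn

# pass an instance of the loss (nn.L1Loss(), not the nn.L1Loss class),
# and use a regression metric rather than accuracy
learn = tabular_learner(dls, n_out=3, layers=[200,100],
                        loss_func=nn.L1Loss(), metrics=mae)
learn.fit_one_cycle(1)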

Hi, I changed it but it doesn't help.

Another thing is that the validation set somehow loads data with #na# values:

[image: show_batch output with #na# values]

I just tried it for one variable and it works fine.

I guess the multivariable regression is not ready yet:

http://dev.fast.ai/tabular.core#Regression