Fastai v2 tabular

travis · February 17, 2020, 7:19pm

Adults works fine, but MSELossFlat raises RuntimeError: bool value of Tensor with more than one value is ambiguous. I did a quick %debug, but it’s not immediately obvious to me why. It fails on if size_average and reduce: in torch/nn/_reduction.py. I’m not sure what ‘size_average’ refers to, but it’s a tensor with many different values, some of them negative (min value: -0.1936, max value: 0.4885).

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/loss.py in __init__(self, size_average, reduce, reduction)
    426 
    427     def __init__(self, size_average=None, reduce=None, reduction='mean'):
--> 428         super(MSELoss, self).__init__(size_average, reduce, reduction)
    429 
    430     def forward(self, input, target):

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/loss.py in __init__(self, size_average, reduce, reduction)
     10         super(_Loss, self).__init__()
     11         if size_average is not None or reduce is not None:
---> 12             self.reduction = _Reduction.legacy_get_string(size_average, reduce)
     13         else:
     14             self.reduction = reduction

~/anaconda3/lib/python3.7/site-packages/torch/nn/_reduction.py in legacy_get_string(size_average, reduce, emit_warning)
     34         reduce = True
     35 
---> 36     if size_average and reduce:
     37         ret = 'mean'
     38     elif reduce:

RuntimeError: bool value of Tensor with more than one value is ambiguous

muellerzr · February 17, 2020, 7:21pm

First, we’re sure that it’s MSELossFlat() when passing it in? If so the next step would be to manually calculate it with two of your y’s (one from a model standpoint and one from your ground truth)
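A minimal sketch of what the first question is probing, under the assumption (not confirmed in the thread) that the loss was passed uninstantiated. It uses plain nn.MSELoss rather than fastai's MSELossFlat wrapper, and the tensors are made-up values. If the prediction and target tensors reach the constructor instead of the instance, the prediction tensor lands in the size_average argument, which matches the traceback above.

```python
import torch
from torch import nn

pred = torch.tensor([0.1, -0.2, 0.5])   # made-up model outputs
targ = torch.tensor([0.0, 0.0, 1.0])    # made-up targets

# Intended use: build the loss object first, then call it on the tensors.
loss_fn = nn.MSELoss()
loss = loss_fn(pred, targ)

# Mistake: the tensors go to the constructor, so `pred` becomes `size_average`
# and `if size_average and reduce:` tries to take the truth value of a tensor.
try:
    nn.MSELoss(pred, targ)
    err = None
except RuntimeError as e:
    err = str(e)

print(loss.item())
print(err)
```

If this is the cause, passing an instance (with parentheses) rather than the bare callable would avoid the constructor call with tensors.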

travis · February 17, 2020, 7:28pm

Yeah, it’s definitely MSELossFlat.

~/git/fastai2/nbs/mine/fastai2/layers.py in MSELossFlat(axis, floatify, *args, **kwargs)
    313 def MSELossFlat(*args, axis=-1, floatify=True, **kwargs):
    314     "Same as `nn.MSELoss`, but flattens input and target."
--> 315     return BaseLoss(nn.MSELoss, *args, axis=axis, floatify=floatify, is_2d=False, **kwargs)
    316 
    317 # Cell

muellerzr · February 17, 2020, 7:32pm

@travis a few things I’m noticing in your DataBunch creation. You’re not specifying a regression problem, so most likely it’s standardizing to a classification problem. To do so, in your call to TabularPandas you should add block_y = TransformBlock() (or RegressionBlock). Second, you’re also attempting to use accuracy, which is meant for classification problems. Try that and see if it solves your issue

The notebook I’m looking at is the Rossmann notebook here:

github.com

fastai/fastai2/blob/master/nbs/course/lesson6-rossmann.ipynb


travis · February 17, 2020, 7:35pm

I’m actually treating it as a classification problem, as either a win or a loss. The targets are all 0’s & 1’s. And by the way, this same technique worked fine last year in Fastai v1. I’m just trying to convert it over to v2.

muellerzr · February 17, 2020, 7:38pm

In that case you should be using CrossEntropyLoss instead, as MSELossFlat is meant for regression problems. I think that is our issue: the model is actually outputting 2 values (the probabilities of 0 and 1), whereas MSE expects just one. Will it run with CrossEntropy? (And you were using MSE in v1?)
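To make the shape difference concrete, here is a small pure-Python sketch. It is not fastai's or PyTorch's implementation; the function names and the numbers are illustrative only. Cross entropy consumes one score per class plus an integer label, while MSE compares a single predicted number with a single target.

```python
import math

def cross_entropy(logits, target):
    "Cross entropy for one row: -log(softmax(logits)[target])."
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[target] / sum(exps))

def mse(pred, target):
    "Squared error for one row: a single prediction against a single target."
    return (pred - target) ** 2

# Win/loss classifier: two scores per row (class 0, class 1) and an integer label.
ce_uncertain = cross_entropy([0.0, 0.0], 1)      # equal scores
ce_confident = cross_entropy([-2.0, 2.0], 1)     # strongly favours class 1
ce_wrong = cross_entropy([2.0, -2.0], 1)         # strongly favours class 0

# Regression: one number per row.
se = mse(0.75, 1.0)

print(ce_uncertain, ce_confident, ce_wrong, se)
```

With two outputs per row there is no single number for MSE to compare against the 0/1 target, which is the mismatch described above.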

travis · February 17, 2020, 8:00pm

Unfortunately, cross entropy is not working either. That’s what was throwing the error originally. I tried sklearn log_loss, BCELossFlat, torch.nn.functional.binary_cross_entropy. Each one throws TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first..
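The TypeError itself names the missing step: a tensor on the GPU has to be copied to host memory before NumPy-based code can read it. A minimal sketch of that conversion follows; to_numpy is a hypothetical helper and the probabilities are made up.

```python
import torch

def to_numpy(t: torch.Tensor):
    "Detach from autograd, copy to host memory, then convert to a NumPy array."
    return t.detach().cpu().numpy()

# Use the GPU when one is present; on a CPU-only machine .cpu() changes nothing.
device = "cuda" if torch.cuda.is_available() else "cpu"
probs = torch.tensor([0.9, 0.2, 0.7], device=device, requires_grad=True)

# The result can be handed to NumPy-based code such as sklearn metrics.
probs_np = to_numpy(probs)
print(type(probs_np), probs_np.shape)
```

Note that a function such as sklearn's log_loss works on arrays and returns a plain float, so gradients cannot flow through it. That makes it suitable as a metric after conversion, but not as the loss the model trains on.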

travis · February 17, 2020, 8:10pm

No, sklearn log_loss is what I used. I should have known MSE wouldn’t work, but didn’t think it through. I only know enough about all this to be dangerous.

travis · February 17, 2020, 8:40pm

Well, actually, I just looked back at my notebook from last year. I don’t see where I defined a loss function. I guess tabular learner inferred it. It wasn’t able to do that this time, so I supplied log_loss, which threw the error.

Anyway, thanks @muellerzr for your help.

muellerzr · February 19, 2020, 1:11pm

Said fix was pushed the other day btw

navneetkrch · February 20, 2020, 11:18am

Thanks a lot @muellerzr & @nestorDemeure,

SHAP for FASTAI Tabular Regression is working.
Others can find the Colab notebook with the plots in this GitHub Gist.

dangraf · February 22, 2020, 10:59pm

I just installed the latest version of fastai2 and am trying to run the notebooks for tabular data, but they fail.

One of the places where it fails is:

dls = to.dataloaders()
dls.valid.show_batch()

It seems that one_batch() is called and makes a sub-call with a variable "b" of type tuple.
The following lines are then called to extract the rows from the dataframe:

class _TabIloc:
    "Get/set rows by iloc and cols by name"
    def __init__(self,to): self.to = to
    def __getitem__(self, idxs):
        df = self.to.items
        if isinstance(idxs,tuple):
            rows,cols = idxs

So the problem seems to be that idxs, a tuple with one item (row 0), is being unpacked into rows and cols.
Does anyone else have this problem, or did I fail to install fastai2 properly?
If this is a bug, what’s the correct way to fix it? Should the “b” variable be another type, or should the code check whether the tuple has length 1 or 2 and unpack rows and cols accordingly?
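The mechanism behind the question can be reproduced without pandas or fastai. PairIndexer below is a hypothetical stand-in that copies only the branching logic of _TabIloc quoted above. The thread does not say what type the batch indices have in a matching fastai2/fastcore pair, so this sketch only shows how a tuple and a list of row indices take different branches.

```python
class PairIndexer:
    "Hypothetical stand-in for `_TabIloc`: any tuple is treated as (rows, cols)."
    def __getitem__(self, idxs):
        if isinstance(idxs, tuple):
            rows, cols = idxs              # only valid for a tuple of exactly two items
        else:
            rows, cols = idxs, slice(None)
        return rows, cols

ix = PairIndexer()

two_part = ix[:, "age"]        # Python packs this as the tuple (slice(None), "age")
from_list = ix[[3, 1, 2]]      # a list of row indices takes the else branch

try:
    ix[(3, 1, 2)]              # a tuple of row indices hits the unpacking branch
    err = None
except ValueError as e:
    err = str(e)

print(two_part, from_list, err)
```

A batch of row indices passed as a tuple is indistinguishable from a (rows, cols) pair at the isinstance check, so unpacking fails whenever the tuple does not have exactly two items.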

muellerzr · February 22, 2020, 11:02pm

@dangraf can you tell us the error code and what you did to set up your TabularPandas?

dangraf · February 23, 2020, 8:33am

I installed fastai v2 by creating a new environment, installing PyTorch (conda install -c pytorch pytorch) to get version 1.4, cloning the git repo, and then running pip install -e ".[dev]" to install fastai2.

After that, I opened up notebook 40 and started to run the cells from top down.

The error code I got is the following:

Could not do one pass in your dataloader, there is something wrong in it

ValueError                                Traceback (most recent call last)
in
      1 dls = to.dataloaders()
----> 2 dls.valid.show_batch()

c:\gitrepo\fastai2\fastai2\data\core.py in show_batch(self, b, max_n, ctxs, show, **kwargs)
     88
     89     def show_batch(self, b=None, max_n=9, ctxs=None, show=True, **kwargs):
---> 90         if b is None: b = self.one_batch()
     91         if not show: return self._pre_show_batch(b, max_n=max_n)
     92         show_batch(*self._pre_show_batch(b, max_n=max_n), ctxs=ctxs, max_n=max_n, **kwargs)

c:\gitrepo\fastai2\fastai2\data\load.py in one_batch(self)
    128     def one_batch(self):
    129         if self.n is not None and len(self)==0: raise ValueError(f'This DataLoader does not contain any batches')
--> 130         with self.fake_l.no_multiproc(): res = first(self)
    131         if hasattr(self, 'it'): delattr(self, 'it')
    132         return res

C:\ProgramData\Anaconda3\envs\cryptopred\lib\site-packages\fastcore\utils.py in first(x)
    174 def first(x):
    175     "First element of `x`, or None if missing"
--> 176     try: return next(iter(x))
    177     except StopIteration: return None
    178

c:\gitrepo\fastai2\fastai2\data\load.py in __iter__(self)
     95         self.randomize()
     96         self.before_iter()
---> 97         for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
     98             if self.device is not None: b = to_device(b, self.device)
     99             yield self.after_batch(b)

C:\ProgramData\Anaconda3\envs\cryptopred\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    343
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

C:\ProgramData\Anaconda3\envs\cryptopred\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

C:\ProgramData\Anaconda3\envs\cryptopred\lib\site-packages\torch\utils\data\_utils\fetch.py in fetch(self, possibly_batched_index)
     32                 raise StopIteration
     33         else:
---> 34             data = next(self.dataset_iter)
     35         return self.collate_fn(data)
     36

c:\gitrepo\fastai2\fastai2\data\load.py in create_batches(self, samps)
    104         self.it = iter(self.dataset) if self.dataset is not None else None
    105         res = filter(lambda o:o is not None, map(self.do_item, samps))
--> 106         yield from map(self.do_batch, self.chunkify(res))
    107
    108     def new(self, dataset=None, cls=None, **kwargs):

c:\gitrepo\fastai2\fastai2\data\load.py in do_batch(self, b)
    125     def create_item(self, s):  return next(self.it) if s is None else self.dataset[s]
    126     def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
--> 127     def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
    128     def one_batch(self):
    129         if self.n is not None and len(self)==0: raise ValueError(f'This DataLoader does not contain any batches')

in create_batch(self, b)
      7         super().__init__(dataset, bs=bs, shuffle=shuffle, after_batch=after_batch, num_workers=num_workers, **kwargs)
      8
----> 9     def create_batch(self, b): return self.dataset.iloc[b]
     10
     11 TabularPandas._dl_type = TabDataLoader

in __getitem__(self, idxs)
      6         df = self.to.items
      7         if isinstance(idxs,tuple):
----> 8             rows,cols = idxs
      9             cols = df.columns.isin(cols) if is_listy(cols) else df.columns.get_loc(cols)
     10         else: rows,cols = idxs,slice(None)

ValueError: too many values to unpack (expected 2)

kshitijpatil09 · February 24, 2020, 11:48am

Is there any way to use Tabular as a TransformBlock in DataBlock API? Like using it with other types of data (image,mask,etc.)

sgugger · February 24, 2020, 3:08pm

No, it’s independent of the other blocks. Tabular is there to preprocess dataframes and create batches from them. It only supports a y_block for targets. There will be a more modular block, but since multimodal settings were not a priority in development, it’s not ready yet.

dangraf · February 25, 2020, 8:50pm

Does anyone have input on the error above? I just tried installing fastai2 on a Linux machine (previously it was Windows) using the environment.yml file, and I get the same error.

muellerzr · February 25, 2020, 8:56pm

@dangraf I can’t recreate the error in Colab on the regular install (I haven’t tried dev yet).

Edit: Okay, now I can. It’s a bug inside of the dev version.

@sgugger it seems to come from the fact that idxs is a very long tuple when a batch is requested, whereas simply constructing a TabularPandas passes something different. The root of the bug is _TabIloc; I put a print statement on idxs like so:

#export
class _TabIloc:
    "Get/set rows by iloc and cols by name"
    def __init__(self,to): self.to = to
    def __getitem__(self, idxs):
        df = self.to.items
        print(idxs)
        if isinstance(idxs,tuple):
            rows,cols = idxs
            cols = df.columns.isin(cols) if is_listy(cols) else df.columns.get_loc(cols)
        else: rows,cols = idxs,slice(None)
        return self.to.new(df.iloc[rows, cols])

What is expected when just doing a TabularPandas:

to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)
(slice(None, None, None), 'workclass')
(slice(None, None, None), 'education')
(slice(None, None, None), 'marital-status')
(slice(None, None, None), 'occupation')
(slice(None, None, None), 'relationship')
(slice(None, None, None), 'race')
(slice(None, None, None), 'age_na')
(slice(None, None, None), 'fnlwgt_na')
(slice(None, None, None), 'education-num_na')
(slice(None, None, None), 'salary')

Behavior on dls.one_batch():

(1829, 6000, 4754, 3678, 823, 4682, 3525, 3136, 4430, 6376, 3077, 5487, 4382, 1594, 3501, 4306, 258, 7924, 6271, 7174, 5970, 1363, 7407, 4908, 2201, 7369, 3305, 7116, 499, 4439, 5406, 4046, 3743, 6204, 639, 1232, 3675, 256, 5134, 4411, 7563, 6902, 5661, 3314, 1243, 5573, 3327, 750, 6232, 3363, 2840, 5906, 4775, 7995, 4008, 3089, 7674, 4214, 5414, 5955, 7726, 3045, 7570, 3432)

ValueError                                Traceback (most recent call last)
<ipython-input-90-ccb93b9fbe07> in <module>()
----> 1 dls.one_batch()

9 frames
/usr/local/lib/python3.6/dist-packages/fastai2/data/load.py in one_batch(self)
    128     def one_batch(self):
    129         if self.n is not None and len(self)==0: raise ValueError(f'This DataLoader does not contain any batches')
--> 130         with self.fake_l.no_multiproc(): res = first(self)
    131         if hasattr(self, 'it'): delattr(self, 'it')
    132         return res

/usr/local/lib/python3.6/dist-packages/fastcore/utils.py in first(x)
    174 def first(x):
    175     "First element of `x`, or None if missing"
--> 176     try: return next(iter(x))
    177     except StopIteration: return None
    178 

/usr/local/lib/python3.6/dist-packages/fastai2/data/load.py in __iter__(self)
     95         self.randomize()
     96         self.before_iter()
---> 97         for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
     98             if self.device is not None: b = to_device(b, self.device)
     99             yield self.after_batch(b)

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    343 
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     32                 raise StopIteration
     33         else:
---> 34             data = next(self.dataset_iter)
     35         return self.collate_fn(data)
     36 

/usr/local/lib/python3.6/dist-packages/fastai2/data/load.py in create_batches(self, samps)
    104         self.it = iter(self.dataset) if self.dataset is not None else None
    105         res = filter(lambda o:o is not None, map(self.do_item, samps))
--> 106         yield from map(self.do_batch, self.chunkify(res))
    107 
    108     def new(self, dataset=None, cls=None, **kwargs):

/usr/local/lib/python3.6/dist-packages/fastai2/data/load.py in do_batch(self, b)
    125     def create_item(self, s):  return next(self.it) if s is None else self.dataset[s]
    126     def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
--> 127     def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
    128     def one_batch(self):
    129         if self.n is not None and len(self)==0: raise ValueError(f'This DataLoader does not contain any batches')

<ipython-input-46-cad3c12e3ff5> in create_batch(self, b)
      6         super().__init__(dataset, bs=bs, shuffle=shuffle, after_batch=after_batch, num_workers=num_workers, **kwargs)
      7 
----> 8     def create_batch(self, b): return self.dataset.iloc[b]
      9 
     10 TabularPandas._dl_type = TabDataLoader

<ipython-input-52-8847abb6a04f> in __getitem__(self, idxs)
      6         print(idxs)
      7         if isinstance(idxs,tuple):
----> 8             rows,cols = idxs
      9             cols = df.columns.isin(cols) if is_listy(cols) else df.columns.get_loc(cols)
     10         else: rows,cols = idxs,slice(None)

ValueError: too many values to unpack (expected 2)

Hope that helps with debugging (as I’m unsure what to do here)

sgugger · February 25, 2020, 9:05pm

It’s all running fine for me, so I think the problem is not having the dev install of fastcore to go with fastai2.
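A quick way to see which versions of the two packages are actually installed in the active environment is sketched below. installed_version is a hypothetical helper built on the standard library's importlib.metadata (Python 3.8+). It does not show whether an install is an editable dev install, and the thread does not state which versions are compatible.

```python
from importlib import metadata

def installed_version(pkg):
    "Return the installed version string for `pkg`, or None if it is not installed."
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

# Package names are taken from the thread.
for pkg in ("fastai2", "fastcore"):
    print(pkg, installed_version(pkg) or "not installed")
```

If fastai2 comes from a cloned repo while fastcore is a released package, the two may be out of step, which is the situation described above.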

muellerzr · February 25, 2020, 9:07pm

Shoot… that was totally what was going on… my bad!

@dangraf there you go

Considering this is a common issue, I added it to the FAQ as well