Regression on tabular data when target elements are numpy arrays

I have hit a roadblock while converting code from V1 to V2. I have a dataframe where the elements of a column are numpy arrays of size 2000.
When fastai extracts the column, it gets a numpy array whose elements are themselves numpy arrays, something it is not equipped to deal with.
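
Concretely, the data looks something like this (a toy sketch; the column names are made up and the arrays are shortened from 2000 elements):

import numpy as np
import pandas as pd

# Toy version of the setup: each cell of the target column holds a numpy array.
df = pd.DataFrame({
    "feat": np.random.rand(4),
    "target": [np.random.rand(5) for _ in range(4)],
})

col = df["target"].values
print(col.dtype)   # object: an array whose elements are themselves arrays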

In V1 I could get a model to train without problems by defining my own label class (building on FloatList) to do a proper conversion, but I have not been able to get anything similar working in V2.

My first intuition was to use a modified RegressionBlock, but I get errors before its methods are even called…
Another solution would be to provide the outputs separately from the inputs (I would then give an input dataframe and an output tensor with the proper number of rows), but the tabular API does not seem designed for that kind of thing.
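
For reference, this is roughly what I am trying (a sketch with made-up column names, assuming the stock fastai v2 tabular API); the errors show up before the block’s methods are ever reached:

from fastai.tabular.all import *

# `df` as above: `feat` is an ordinary float column, `target` holds arrays.
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs=[Normalize],
                   cont_names=['feat'], y_names='target',
                   y_block=RegressionBlock(), splits=splits)
dls = to.dataloaders(bs=64)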

Here is a Colab notebook with a minimal reproducible example: Example.ipynb

You need to separate them into a column for each item; TabularPandas can’t process a NumPy array as y.
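
Something along these lines (a rough sketch, with made-up column names), so that each element of the array becomes its own float column:

import numpy as np
import pandas as pd

# Expand the array-valued column into 2000 scalar columns.
targets = pd.DataFrame(np.stack(df['target'].values),
                       columns=[f'target_{i}' for i in range(2000)],
                       index=df.index)
df_wide = pd.concat([df.drop(columns='target'), targets], axis=1)

The 2000 new column names can then be passed as y_names to TabularPandas.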

@muellerzr, I don’t know how it may affect other functions, but a small modification (if isinstance(x, ndarray) or isinstance(x, array)) can solve it…

@use_kwargs_dict(dtype=None, device=None, requires_grad=False, pin_memory=False)
def tensor(x, *rest, **kwargs):
    "Like `torch.as_tensor`, but handle lists too, and can pass multiple vector elements directly."
    if len(rest): x = (x,)+rest
    # There was a Pytorch bug in dataloader using num_workers>0. Haven't confirmed if fixed
    # if isinstance(x, (tuple,list)) and len(x)==0: return tensor(0)
    res = (x if isinstance(x, Tensor)
           else torch.tensor(x, **kwargs) if isinstance(x, (tuple,list))
           else _array2tensor(x) if isinstance(x, ndarray)                    # <--- this line
           else as_tensor(x.values, **kwargs) if isinstance(x, (pd.Series, pd.DataFrame))
           else as_tensor(x, **kwargs) if hasattr(x, '__array__') or is_iter(x)
           else _array2tensor(array(x), **kwargs))
    if res.dtype is torch.float64: return res.float()
    return res

I cannot separate each cell into one column per element, as the dataset does not fit in memory in that form.

(There are also practical reasons why further manipulation of the results would be a lot less efficient in that form.)

I see. I would perhaps @patch @s.s.o’s answer onto tensor and see what happens then :slight_smile:
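
If you just want to try it out before any PR, one option (a rough sketch, not a guaranteed recipe) is a plain monkey-patch of the module-level function:

import fastai.torch_core as torch_core

# `patched_tensor` is a hypothetical name for your modified version of `tensor`.
# Caveat: fastai submodules star-import torch_core, so code that already holds
# its own binding of `tensor` may not pick up the change and may need the same
# treatment.
torch_core.tensor = patched_tensor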

In the long term, if it works, would this require either a PR to fastai or using a forked version?

Yes, a PR to fastai would be ideal :slight_smile:

I managed to get it training when I injected the following code into fastai :partying_face:

def _pandas2tensor(x, **kwargs):
    "Convert a pandas DataFrame or Series into a tensor, stacking array-valued cells."
    v = x.values # extract the values as a numpy array
    if (v.dtype == np.object_): # object dtype: the cells are themselves numpy arrays
        nb_rows = v.shape[0]
        if nb_rows == 1: v = v.item() # only one row, nothing to stack
        else: v = np.vstack(v.squeeze()) # stack the per-row arrays into a 2-D array
    return as_tensor(v, **kwargs)

@use_kwargs_dict(dtype=None, device=None, requires_grad=False, pin_memory=False)
def tensor(x, *rest, **kwargs):
    "Like `torch.as_tensor`, but handle lists too, and can pass multiple vector elements directly."
    if len(rest): x = (x,)+rest
    # There was a Pytorch bug in dataloader using num_workers>0. Haven't confirmed if fixed
    # if isinstance(x, (tuple,list)) and len(x)==0: return tensor(0)
    res = (x if isinstance(x, Tensor)
           else torch.tensor(x, **kwargs) if isinstance(x, (tuple,list))
           else _array2tensor(x) if isinstance(x, ndarray)
           else _pandas2tensor(x, **kwargs) if isinstance(x, (pd.Series, pd.DataFrame))
           else as_tensor(x, **kwargs) if hasattr(x, '__array__') or is_iter(x)
           else _array2tensor(array(x), **kwargs))
    if res.dtype is torch.float64: return res.float()
    return res
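
As a quick sanity check on toy data (assuming the redefined tensor above is the one in scope), a dataframe whose cells are arrays now stacks into a 2-D float tensor:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'target': [np.random.rand(2000) for _ in range(8)]})
t = tensor(toy)   # goes through the new `_pandas2tensor` branch
print(t.shape)    # torch.Size([8, 2000])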

However, the show_batch, show_results and, more importantly, predict methods fail when they try to rebuild a dataframe. The problem comes from the decodes function of ReadTabBatch. I will see if I can rewrite it…
