I have hit a roadblock while converting some code from V1 to V2. I have a dataframe where the elements of one column are numpy arrays of size 2000.
When fastai extracts the column, it gets a numpy array whose elements are themselves numpy arrays, something it is not equipped to deal with.
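A minimal sketch of the issue, using only numpy and pandas and assuming a toy dataframe (the column names `feature` and `target` are made up for illustration): extracting the column gives a 1-D array of dtype `object`, and the rows have to be stacked explicitly to get a regular 2-D float array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 3 rows, each target is a length-2000 vector.
df = pd.DataFrame({
    "feature": [0.1, 0.2, 0.3],
    "target": [np.zeros(2000, dtype=np.float32) for _ in range(3)],
})

# Extracting the column yields an array of arrays, not a 2-D array.
raw = df["target"].values
print(raw.dtype, raw.shape)          # object (3,)

# Stacking the rows explicitly recovers a regular numeric array.
stacked = np.stack(raw)
print(stacked.dtype, stacked.shape)  # float32 (3, 2000)
```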
In V1 I could get a model to train without problems by defining my own label class, building on FloatList, to do a proper conversion, but I have not been able to get something similar working in V2.
My first intuition was to use a modified RegressionBlock, but I get errors before its methods are even called…
Another solution would be to provide the outputs separately from the inputs (I would then give an input dataframe and an output tensor with the matching number of rows), but the tabular API does not seem designed for that kind of thing.
Here is a Colab notebook with a minimal reproducible example: Example.ipynb
@muellerzr, I don’t know how it may affect other functions, but a small modification (if isinstance(x, ndarray) or isinstance(x, array)) can solve it…
@use_kwargs_dict(dtype=None, device=None, requires_grad=False, pin_memory=False)
def tensor(x, *rest, **kwargs):
    "Like `torch.as_tensor`, but handle lists too, and can pass multiple vector elements directly."
    if len(rest): x = (x,)+rest
    # There was a Pytorch bug in dataloader using num_workers>0. Haven't confirmed if fixed
    # if isinstance(x, (tuple,list)) and len(x)==0: return tensor(0)
    res = (x if isinstance(x, Tensor)
           else torch.tensor(x, **kwargs) if isinstance(x, (tuple,list))
           else _array2tensor(x) if isinstance(x, ndarray)  # <--- this line
           else as_tensor(x.values, **kwargs) if isinstance(x, (pd.Series, pd.DataFrame))
           else as_tensor(x, **kwargs) if hasattr(x, '__array__') or is_iter(x)
           else _array2tensor(array(x), **kwargs))
    if res.dtype is torch.float64: return res.float()
    return res
I managed to get it to train by injecting the following code into fastai:
def _pandas2tensor(x, **kwargs):
    "Converts a pandas DataFrame or Series into a tensor."
    v = x.values  # extracts the values as a numpy array
    if (v.dtype == np.object_):  # deals with arrays whose item type is itself a numpy array
        nb_rows = v.shape[0]
        if nb_rows == 1: v = v.item()  # only one row, cannot stack
        else: v = np.vstack(v.squeeze())
    return as_tensor(v, **kwargs)
@use_kwargs_dict(dtype=None, device=None, requires_grad=False, pin_memory=False)
def tensor(x, *rest, **kwargs):
    "Like `torch.as_tensor`, but handle lists too, and can pass multiple vector elements directly."
    if len(rest): x = (x,)+rest
    # There was a Pytorch bug in dataloader using num_workers>0. Haven't confirmed if fixed
    # if isinstance(x, (tuple,list)) and len(x)==0: return tensor(0)
    res = (x if isinstance(x, Tensor)
           else torch.tensor(x, **kwargs) if isinstance(x, (tuple,list))
           else _array2tensor(x) if isinstance(x, ndarray)
           else _pandas2tensor(x, **kwargs) if isinstance(x, (pd.Series, pd.DataFrame))
           else as_tensor(x, **kwargs) if hasattr(x, '__array__') or is_iter(x)
           else _array2tensor(array(x), **kwargs))
    if res.dtype is torch.float64: return res.float()
    return res
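The stacking step of `_pandas2tensor` can be checked in isolation, without fastai or PyTorch. The sketch below is a numpy-only stand-in (the name `stack_object_values` and the toy series are hypothetical, not part of fastai) that mirrors the same branches and returns the numpy array instead of a tensor.

```python
import numpy as np
import pandas as pd

def stack_object_values(x):
    "Numpy-only stand-in for the stacking step of `_pandas2tensor` (hypothetical helper)."
    v = x.values
    if v.dtype == np.object_:             # items are themselves numpy arrays
        if v.shape[0] == 1: v = v.item()  # only one row, nothing to stack
        else: v = np.vstack(v.squeeze())  # one row per item
    return v

# Toy series of 3 rows, each holding a length-4 vector.
targets = pd.Series([np.arange(4, dtype=np.float32) for _ in range(3)])

print(stack_object_values(targets).shape)           # (3, 4)
print(stack_object_values(targets.iloc[:1]).shape)  # (4,)
```

Note that with these branches a single row comes back as a 1-D vector rather than a 1-row matrix, which may matter for code that expects a batch dimension.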
However, the show_batch, show_results and, more importantly, predict methods fail when they try to rebuild a dataframe. The problem comes from the decode function of ReadTabBatch. I will see if I can rewrite it…