DataBunch from numpy arrays

If it might be helpful, I’ve created a custom dataset for a Kaggle competition that takes drawings encoded as sequences, converts them to greyscale images, and feeds the images to the network - and it is compatible with the fastai tools. The notebook is below:


While giving up and saving all the arrays into image files would save me the trouble, I think getting images from arrays is very useful in general, so I’m trying to persevere.

I tried the channels trick (it took a serious amount of time :woman_facepalming:).
I think I have everything right, but it still doesn’t work.

I’m pretty sure it’s a problem with getting the batches in PyTorch’s DataLoader. :thinking:

I would appreciate any guidance of course. Here’s the Colab notebook I am working on.


@Zeina I have modified your notebook. The issue was the way you were returning the value of y from your Dataset.

Updated Notebook

@sgugger Why are my train and validation losses going negative? The loss function here is nll_loss and I have not used any data transformations. The loss is negative with resnet18 and a learning rate as low as 1e-9.

@sgugger I have updated the notebook URL.

I had some problems with a custom dataset class. Probably I was doing something wrong though.

I’ve had good success saving numpy arrays as png using this general flow, starting from a numpy array named arr filled with integers (though I think this would also work for floats):

import numpy as np
import imageio

arr = (arr - arr.mean()) / arr.std()         # standardize: zero mean, unit std (not strictly bounded to [-1, 1])
arr = (arr + 1) / 2 * 255                    # shift and scale so most values land in [0, 255]
arr = np.clip(arr, 0, 255).astype(np.uint8)  # clip to [0, 255] and convert to 8-bit int
imageio.imwrite(output_filename, arr)        # output_filename: your target png path

Using these images, I have created a DataBunch through either the data block API or the higher-level API.
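For the higher-level API, a minimal sketch, assuming the pngs were saved into an ImageNet-style folder layout (path/train/<label>/...; path, size, and valid_pct here are illustrative):

from fastai.vision import *

data = ImageDataBunch.from_folder(path, valid_pct=0.2, size=224)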


My interpretation is that we are feeding torch’s nll_loss the output of the model and the true labels. As per my understanding (and stepping through it manually), I observe the following:

  1. The model output of one batch is fed to nll_loss as input.
  2. The target labels (0-indexed) are fed to nll_loss as target.
  3. nll_loss is simply returning -sum(target * input), i.e. -sum(input[target]), which I believe should not be the case, as negative log likelihood is defined as -sum(y * log p). (See the sketch after this list.)
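For reference, PyTorch’s nll_loss expects log-probabilities as its input and deliberately does not apply the log itself, which is why it reduces to picking out -input[target]. A minimal sketch (the tensor values are illustrative):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)          # a batch of 4 samples, 10 classes
target = torch.tensor([2, 7, 0, 9])  # 0-indexed class labels

log_probs = F.log_softmax(logits, dim=1)  # nll_loss assumes this step already happened
loss = F.nll_loss(log_probs, target)      # equals -mean(log_probs[i, target[i]])
same = F.cross_entropy(logits, target)    # cross_entropy = log_softmax + nll_loss
print(loss, same)                         # identical values

If a model’s head does not end in log_softmax, feeding its raw output to nll_loss gives exactly the behavior described above, and the loss can go negative.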

Please see the image below.


Here, in run number 67, nll_loss simply took the negative of the value at index 27 of the input, and in run number 69 it took the negative of the value at index 11.
Why is that the case? Why is it not taking the log? I believe this is also why I was getting the negative loss. Both of the inputs in the above image were captured while debugging and running my notebook above.

@sgugger Please help.
Thanks

CC: @jeremy. Sorry for the @-mention.

Have you checked whether the problem is the order of the array’s dimensions? It is common to have a numpy array where channels is the last dimension, so X.shape returns (5000, 120, 120, 3): 5000 samples of 120h x 120w x 3c. But PyTorch expects channels first: (5000, 3, 120, 120).
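A minimal sketch of that reordering (the array here is illustrative):

import numpy as np

X = np.zeros((5000, 120, 120, 3))  # channels-last: (N, H, W, C)
X = X.transpose(0, 3, 1, 2)        # channels-first: (N, C, H, W), which PyTorch expects
print(X.shape)                     # (5000, 3, 120, 120)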

I am looking for a solution myself; if I manage to solve it, I will post here.

I’d love to see some example custom data classes for this problem, or anything really. Please post if you have done it successfully (or not successfully).

I guess I made it work:

from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalization
mean = X_train.mean()
std = X_train.std()
X_train = (X_train - mean) / std
X_valid = (X_valid - mean) / std

# Numpy to torch tensors
X_train = torch.from_numpy(np.float32(X_train))
y_train = torch.from_numpy(y_train.astype(np.int64))
X_valid = torch.from_numpy(np.float32(X_valid))
y_valid = torch.from_numpy(y_valid.astype(np.int64))

train = torch.utils.data.TensorDataset(X_train, y_train)
valid = torch.utils.data.TensorDataset(X_valid, y_valid)

data = ImageDataBunch.create(train_ds=train, valid_ds=valid)
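Note: fetch_mldata has since been removed from scikit-learn, so on recent versions you would need something like the following instead (a hedged equivalent using fetch_openml):

from sklearn.datasets import fetch_openml

# 'mnist_784' is the OpenML name for the same flattened MNIST dataset
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(np.int64)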

Hmm, I am getting a 'TensorDataset' object has no attribute 'new' error with your MNIST example:

AttributeError                            Traceback (most recent call last)
<ipython-input-56-8bf4e3c38b32> in <module>()
----> 1 testdata = test()

<ipython-input-55-13ce614b2d08> in test()
     19     valid = torch.utils.data.TensorDataset(X_valid, y_valid)
     20 
---> 21     data = ImageDataBunch.create(train_ds = train, valid_ds=valid)
     22     return data

/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py in create(cls, train_ds, valid_ds, test_ds, path, bs, num_workers, tfms, device, collate_fn, no_check)
    112                collate_fn:Callable=data_collate, no_check:bool=False)->'DataBunch':
    113         "Create a `DataBunch` from `train_ds`, `valid_ds` and maybe `test_ds` with a batch size of `bs`."
--> 114         datasets = cls._init_ds(train_ds, valid_ds, test_ds)
    115         val_bs = bs
    116         dls = [DataLoader(d, b, shuffle=s, drop_last=(s and b>1), num_workers=num_workers) for d,b,s in

/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py in _init_ds(train_ds, valid_ds, test_ds)
    102     @staticmethod
    103     def _init_ds(train_ds:Dataset, valid_ds:Dataset, test_ds:Optional[Dataset]=None):
--> 104         fix_ds = valid_ds.new(train_ds.x, train_ds.y) # train_ds, but without training tfms
    105         datasets = [train_ds,valid_ds,fix_ds]
    106         if test_ds is not None: datasets.append(test_ds)

AttributeError: 'TensorDataset' object has no attribute 'new'

I guess something in the lib has changed again?
I don’t suppose you have any idea how I should tweak the code to get it to work?


That should be fixed now.


Any solution on this so far?

Yes and no. I managed to create the ImageDataBunch after Jeremy’s fix. But when I try to run data.show_batch(rows=3, figsize=(7,6)), it errors with 'KmnistDataset' object has no attribute 'x' on the line if self.train_ds.x._square_show: rows = rows ** 2. (KmnistDataset is the custom class that subclasses the Dataset class.)


Except you can’t do anything with this DataBunch object: 'TensorDataset' object has no attribute 'c'.

And your code is missing:

from sklearn.model_selection import train_test_split

:slight_smile:

As @sgugger said, you have to implement your own Dataset class. It appears from this thread that a new method from_array() would be a useful addition to the fastai library, e.g.:

data = (ItemList.from_array(train_ds = train_array, valid_ds=valid_array, test_ds=test_array), ...)

An ItemList takes an array of items, so I’m not sure a new method is required. Note that you want an ItemLists if you’re using several datasets.
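A hedged sketch of combining two ItemLists with the data block API (train_array and valid_array are illustrative names):

from fastai.vision import *

train_il = ItemList(train_array)
valid_il = ItemList(valid_array)
ils = ItemLists('.', train_il, valid_il)  # path, then the train and valid lists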


Here are the docs

Can you repost the link?
It’s not working anymore.


Here is a way to do it. This will create an in-memory image list.

from fastai.vision import *
import numpy as np

p = untar_data(URLs.MNIST_SAMPLE)
train = p/'train'
imagesl = ImageList.from_folder(train)

# Pull every image into memory as a numpy array
images = []
for i in imagesl:
    images.append(i.data.numpy())
images = np.array(images)

class MyImageList(ImageList):
    def open(self, i):
        # Items are already Image objects, so there is nothing to open from disk
        return i

    @staticmethod
    def from_numpy(arr):
        items = []
        for i in arr: items.append(Image(torch.from_numpy(i)))
        return MyImageList(items)

MyImageList.from_numpy(images)
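As a quick sanity check (assuming the class above), you can pull an item back out and display it:

il = MyImageList.from_numpy(images)
il.get(0).show()  # should render the first MNIST digit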

Has anyone had any success creating a DataBunch or Dataset from numpy arrays? I’ve read through a lot of the forum posts, and either the code no longer works or it doesn’t apply. Any code samples showing how to create a DataBunch from

x = np.ndarray((50,224,224))
y = np.ndarray((50))

meaning 50 arrays of 224x224 and 50 labels?

Much appreciated.


Here is the simplest way I can think of doing what you want:

from fastai.vision import *
import numpy as np

class ArrayItemList(ItemList):
    @classmethod
    def from_numpy(cls, numpy_array):
        return cls(items=numpy_array)

    def label_from_array(self, array, label_cls=None, **kwargs):
        # Reuse ItemList's list-labelling machinery for a numpy array of labels
        return self._label_from_list(array, label_cls=label_cls, **kwargs)

x = np.random.rand(50, 224, 224)
y = np.random.rand(50)

data = (ArrayItemList.from_numpy(x)
        .split_none()
        .label_from_array(y)
        .databunch(bs=10))
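A quick smoke test of the resulting DataBunch (shapes follow from the arrays above):

xb, yb = data.one_batch()
print(xb.shape, yb.shape)  # expect a batch of 10 items of shape 224x224, plus 10 labels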