DataBunch from numpy arrays

fredguth · November 19, 2018, 6:46pm

I guess I made it work:

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
# Normalization
mean = X_train.mean()
std = X_train.std()
X_train = (X_train-mean)/std
X_valid = (X_valid-mean)/std

# Numpy to Torch Tensor
X_train = torch.from_numpy(np.float32(X_train))
y_train = torch.from_numpy(y_train.astype(np.long))
X_valid = torch.from_numpy(np.float32(X_valid))
y_valid = torch.from_numpy(y_valid.astype(np.long))

train = torch.utils.data.TensorDataset(X_train, y_train)
valid = torch.utils.data.TensorDataset(X_valid, y_valid)

data = ImageDataBunch.create(train_ds = train, valid_ds=valid)
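
The preprocessing steps above can be checked without downloading MNIST. This is a minimal sketch that assumes synthetic data in place of the real images; `normalize_with_train_stats` is a hypothetical helper name, not part of fastai or scikit-learn. It scales both splits with the training split's mean and std, then casts to float32 inputs and int64 labels. `np.int64` is used because `np.long` is not available in every NumPy release.

```python
import numpy as np

def normalize_with_train_stats(x_train, x_valid):
    """Scale both splits with the mean and std of the training split only."""
    mean, std = x_train.mean(), x_train.std()
    return (x_train - mean) / std, (x_valid - mean) / std

# Synthetic stand-in for MNIST (assumption): 100 flattened 28x28 images, 10 classes.
rng = np.random.RandomState(42)
X = rng.randint(0, 256, size=(100, 784)).astype(np.float64)
y = rng.randint(0, 10, size=100).astype(np.float64)  # float labels, to show the cast

# Simple 80/20 split by position; the post uses train_test_split instead.
X_train, X_valid = X[:80], X[80:]
y_train, y_valid = y[:80], y[80:]

X_train, X_valid = normalize_with_train_stats(X_train, X_valid)

# Cast to float32 inputs and int64 class labels before handing the arrays to torch.
X_train, X_valid = X_train.astype(np.float32), X_valid.astype(np.float32)
y_train, y_valid = y_train.astype(np.int64), y_valid.astype(np.int64)

print(X_train.dtype, y_train.dtype)
```

Only the training statistics are used for the validation split, so the validation data does not leak into the normalization.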

wwymak · December 16, 2018, 9:01pm

Hmm, I am getting a 'TensorDataset' object has no attribute 'new' error with your MNIST example:

AttributeError                            Traceback (most recent call last)
<ipython-input-56-8bf4e3c38b32> in <module>()
----> 1 testdata = test()

<ipython-input-55-13ce614b2d08> in test()
     19     valid = torch.utils.data.TensorDataset(X_valid, y_valid)
     20 
---> 21     data = ImageDataBunch.create(train_ds = train, valid_ds=valid)
     22     return data

/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py in create(cls, train_ds, valid_ds, test_ds, path, bs, num_workers, tfms, device, collate_fn, no_check)
    112                collate_fn:Callable=data_collate, no_check:bool=False)->'DataBunch':
    113         "Create a `DataBunch` from `train_ds`, `valid_ds` and maybe `test_ds` with a batch size of `bs`."
--> 114         datasets = cls._init_ds(train_ds, valid_ds, test_ds)
    115         val_bs = bs
    116         dls = [DataLoader(d, b, shuffle=s, drop_last=(s and b>1), num_workers=num_workers) for d,b,s in

/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py in _init_ds(train_ds, valid_ds, test_ds)
    102     @staticmethod
    103     def _init_ds(train_ds:Dataset, valid_ds:Dataset, test_ds:Optional[Dataset]=None):
--> 104         fix_ds = valid_ds.new(train_ds.x, train_ds.y) # train_ds, but without training tfms
    105         datasets = [train_ds,valid_ds,fix_ds]
    106         if test_ds is not None: datasets.append(test_ds)

AttributeError: 'TensorDataset' object has no attribute 'new'

I guess something in the lib has changed again?
I don’t suppose you have any idea of how I should tweak the code to get it to work?

jeremy · December 19, 2018, 12:31am

That should be fixed now.

shakur · December 19, 2018, 4:25pm

Any solution on this so far?

wwymak · December 19, 2018, 9:40pm

Yes and no. I managed to create the ImageDataBunch after Jeremy’s fix. But when I try to run data.show_batch(rows=3, figsize=(7,6)), it errors with 'KmnistDataset' object has no attribute 'x' on the line if self.train_ds.x._square_show: rows = rows ** 2. (KmnistDataset is my custom class that subclasses the Dataset class.)

stas · December 28, 2018, 5:59am

Except you can’t do anything with this DataBunch object; it fails with 'TensorDataset' object has no attribute 'c'.

And your code is missing:

from sklearn.model_selection import train_test_split

As @sgugger said, you have to implement your own Dataset class. It appears from this thread that a new method from_array() would be a useful addition to the fastai library:

data = (ItemList.from_array(train_ds = train_array, valid_ds=valid_array, test_ds=test_array), ...)
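
The attributes named in the errors in this thread (`new`, `x`, `y`, `c`) suggest the rough shape such a custom Dataset class needs. This is a minimal sketch, assuming plain NumPy arrays. `ArrayDataset` is a hypothetical name, it deliberately does not subclass anything from torch or fastai so that it runs on its own, and it has not been verified against any fastai release. In particular, it would not make show_batch work, since that code reads train_ds.x._square_show, which a plain array lacks.

```python
import numpy as np

class ArrayDataset:
    """Hypothetical in-memory dataset exposing the attributes named in the errors above."""

    def __init__(self, x, y):
        assert len(x) == len(y), "inputs and labels must have the same length"
        self.x, self.y = x, y
        # `c`: number of distinct classes found in the labels
        self.c = len(np.unique(y))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

    def new(self, x, y):
        """Build another dataset of the same type from different arrays."""
        return type(self)(x, y)

train_ds = ArrayDataset(np.zeros((8, 1, 28, 28), dtype=np.float32), np.arange(8) % 2)
valid_ds = ArrayDataset(np.zeros((2, 1, 28, 28), dtype=np.float32), np.array([0, 1]))

# Mirrors the call shown in the traceback: valid_ds.new(train_ds.x, train_ds.y)
fix_ds = valid_ds.new(train_ds.x, train_ds.y)
print(len(fix_ds), fix_ds.c)
```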

sgugger · December 28, 2018, 8:03am

An ItemList takes an array of items, so I’m not sure a new method is required. Note that you want an ItemLists if you’re using several datasets.

baz · March 22, 2019, 12:08am

Here are the docs

shivamchandhok · March 24, 2019, 7:47am

Can you repost the link? It’s not working anymore.

baz · July 3, 2019, 4:57pm

Here is a way to do it. This will create an in-memory image list.

p = untar_data(URLs.MNIST_SAMPLE)
train = p/'train'
imagesl = ImageList.from_folder(train)
images = []
for i in imagesl:
    images.append(i.data.numpy())
images = np.array(images)

class MyImageList(ImageList):
    def open(self, i):
        return i
    
    @staticmethod
    def from_numpy(arr):
        items = []
        for i in arr: items.append(Image(torch.from_numpy(i)))
        return MyImageList(items)

MyImageList.from_numpy(images)

source99 · July 28, 2019, 6:12am

Has anyone had any success creating a databunch or dataset from numpy arrays? I’ve read through a lot of the forum posts, and either the code no longer works or it doesn’t apply. Are there any code samples showing how to create a databunch from

x = np.ndarray((50,224,224))
y = np.ndarray((50))

meaning 50 arrays of size 224x224 and 50 labels?

much appreciated.

noachr · July 28, 2019, 11:54pm

Here is the simplest way I can think of doing what you want:

class ArrayItemList(ItemList):
    @classmethod
    def from_numpy(cls, numpy_array):
        return cls(items=numpy_array)
    
    def label_from_array(self, array, label_cls=None, **kwargs):
        return self._label_from_list(array,label_cls=label_cls,**kwargs)

x = np.random.rand(50,224,224)
y = np.random.rand(50)

data = (ArrayItemList.from_numpy(x)
        .split_none()
        .label_from_array(y)
        .databunch(bs=10))

source99 · July 29, 2019, 12:02am

Did not work:
I’ll dig in but wanted to get this up for now…
Code:

error part 1:

error part 2:

source99 · July 29, 2019, 12:04am

fast.ai version 1.0.54

noachr · July 29, 2019, 12:12am

If you want your arrays to be used as input for a cnn they will need a channel dimension. I assume these are one channel images, so just reshape with x.reshape(50,1,224,224)

However, you’ll still have a problem: the fastai vision models expect 3 input channels. One potential answer is to copy the single channel three times with x.reshape(50,1,224,224).repeat(3, axis=1). How well this works with transfer learning depends on the dataset.
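
The two reshaping steps described above can be checked with NumPy alone. This is a small sketch, assuming `x` is a NumPy array as in the question. Note that `repeat` on a torch tensor has different semantics (it tiles per dimension), so this applies to NumPy arrays only.

```python
import numpy as np

# 50 single-channel 224x224 images, as in the question above.
x = np.random.rand(50, 224, 224)

# Add a channel axis: (50, 224, 224) -> (50, 1, 224, 224)
x1 = x.reshape(50, 1, 224, 224)

# Copy that single channel three times along axis 1: -> (50, 3, 224, 224)
x3 = x1.repeat(3, axis=1)

print(x1.shape, x3.shape)
```

All three channels of `x3` hold the same values, so the model sees a grayscale image presented as RGB.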

I’d also amend my previous custom class to subclass ImageList instead, and implement a custom get method to turn the arrays into fastai Images. This is so methods like show_batch work.

class ArrayImageList(ImageList):
    @classmethod
    def from_numpy(cls, numpy_array):
        return cls(items=numpy_array)
    
    def label_from_array(self, array, label_cls=None, **kwargs):
        return self._label_from_list(array,label_cls=label_cls,**kwargs)
    
    def get(self, i):
        n = self.items[i]
        n = torch.tensor(n)
        return Image(n)

source99 · July 29, 2019, 1:36am

Getting closer but still not working:

source99 · July 29, 2019, 1:37am

I’m working on this error now:
RuntimeError: Input type (torch.cuda.DoubleTensor) and weight type (torch.cuda.FloatTensor) should be the same

noachr · July 29, 2019, 1:54am

I can get it running using the ArrayImageList class in my post above and changing the line

n = torch.tensor(n)

to

n = torch.tensor(n).float()

For clarity, the full code is now:

class ArrayImageList(ImageList):
    @classmethod
    def from_numpy(cls, numpy_array):
        return cls(items=numpy_array)
    
    def label_from_array(self, array, label_cls=None, **kwargs):
        return self._label_from_list(array,label_cls=label_cls,**kwargs)
    
    def get(self, i):
        n = self.items[i]
        n = torch.tensor(n).float()
        return Image(n)

x = np.random.rand(50,3,224,224)
y = np.random.rand(50)

data = (ArrayImageList.from_numpy(x)
        .split_none()
        .label_from_array(y)
        .databunch(bs=10))

learn = cnn_learner(data,models.resnet18)
learn.fit(10)
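
The DoubleTensor in the error traces back to NumPy: np.random.rand returns float64, and torch.tensor keeps the array's dtype, so the resulting tensor is double precision while the model weights are float32. Calling .float() in get converts each item as it is fetched. An alternative, sketched here under the assumption that single precision is acceptable for the data, is to cast the whole array once before building the list.

```python
import numpy as np

# Small placeholder array; np.random.rand always returns float64.
x = np.random.rand(4, 3, 8, 8)
print(x.dtype)  # float64

# Cast once to single precision, the dtype of the model weights in the error message.
x32 = x.astype(np.float32)
print(x32.dtype)  # float32
```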

source99 · July 29, 2019, 1:57am

Awesome…Runs for me now. Thanks…

Now on to trying it with some real data!

algara · August 4, 2019, 11:16pm

Hey @noachr and everyone!

Thanks for the code, it saved me a lot of time.

I have stumbled upon a problem, though (most likely due to my inexperience). I am trying to hold out 20% of the training data for validation and am struggling to get it labeled correctly.

data = (ArrayImageList.from_numpy(training_images)
        .split_subsets(train_size=0.8, valid_size=0.2)
        .label_from_array(training_labels)
        .databunch(bs=10))

Could you (or anyone else) give me a hint on how to correctly split (re-assign?) the labels for the training and validation data?

Thanks
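
This is not an answer in terms of fastai's data block API, but a sketch of one way to keep labels aligned: split the arrays yourself with a single shuffled index, so that each image and its label always land in the same subset. `split_arrays` is a hypothetical helper, and the data below is a placeholder for training_images and training_labels. How the two resulting pairs are then handed to ArrayImageList is left open here.

```python
import numpy as np

def split_arrays(images, labels, valid_pct=0.2, seed=42):
    """Shuffle indices once, then index images and labels with the same indices."""
    assert len(images) == len(labels)
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(images))
    n_valid = int(len(images) * valid_pct)
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    return (images[train_idx], labels[train_idx]), (images[valid_idx], labels[valid_idx])

# Placeholder data (assumption): 50 small 3-channel images, labels equal to their index.
training_images = np.random.rand(50, 3, 32, 32).astype(np.float32)
training_labels = np.arange(50)

(train_x, train_y), (valid_x, valid_y) = split_arrays(training_images, training_labels)
print(train_x.shape, valid_x.shape)
```

Because the placeholder labels are the original indices, `training_images[train_y]` should equal `train_x`, which is a quick way to confirm that the pairing survived the split.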