DataBunch from numpy arrays

yonatan365 · November 1, 2018, 9:14am

There are several useful methods to create a databunch from a folder, csv labels, etc.
However, I have (grayscale) image data that is already in an array, i.e.
X.shape: (n_samples, 100, 100)
y.shape: (n_samples)

Is there a way to create a databunch straight from these (numpy) arrays?

I saw another post asking the same here (currently unanswered):

thanks!

sgugger · November 1, 2018, 1:39pm

You would need to create your own custom dataset for this, but it’s not too complicated: just subclass DatasetBase and define __len__, __getitem__ and c (number of classes).
Once you have your custom Dataset class you can just create a DataBunch from it via DataBunch.create.
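
To make the three required pieces concrete, here is a minimal sketch of an in-memory dataset. It is written as a plain Python class so that it runs without fastai; the name ArrayDataset is hypothetical, and in a real setup you would inherit from DatasetBase as described above and convert the arrays to tensors (for example with torch.from_numpy).

```python
import numpy as np

class ArrayDataset:
    """Hypothetical in-memory dataset (plain Python, no fastai base class)."""
    def __init__(self, X, y):
        self.X = np.asarray(X, dtype=np.float32)
        self.y = np.asarray(y)
        self.classes = np.unique(self.y)   # sorted distinct labels
        self.c = len(self.classes)         # number of classes

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        # Add a leading channel axis, (H, W) -> (1, H, W), and return the label as a plain int
        return self.X[i][None], int(self.y[i])

X = np.random.randn(10, 100, 100)
y = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
ds = ArrayDataset(X, y)
image, label = ds[3]
```

Returning every image with the same shape and every label as a scalar is what lets a data loader stack the samples into a batch.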

Kaspar · November 1, 2018, 6:26pm

There are several issues with reading and viewing grayscale images in the current version. I am, however, working on several PRs to fastai to make grayscale conversion easier to include.

Until then have a look at the class: “GrayImageDataset” here : fastaiNotebooks/mammo/mammo_tiles.ipynb at master · kasparlund/fastaiNotebooks · GitHub

Here it is for convenience:

import numpy as np
import torch
import PIL.Image
# fastai names (ImageClassificationDataset, Image, ImageDataBunch, ...) come from the notebook's fastai v1 imports

def getGrayStats(imagenet_stats=([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])):
    "Collapse per-channel (mean, std) stats into single-channel stats."
    stats = imagenet_stats
    s = np.asarray(stats)
    st = []
    if len(s.shape) >= 2 and s.shape[1] > 1:
        st.append(torch.from_numpy(np.asarray(np.mean(s[0]))).float())
        st.append(torch.from_numpy(np.asarray(np.sqrt(sum(s[1]*s[1]) / s.shape[1]))).float())
    return st

# Class for reading 16-bit grayscale images
class GrayImageDataset(ImageClassificationDataset):
    @staticmethod
    def create(path, dfData):
        # dir_im (the image sub-directory) is defined elsewhere in the notebook
        return GrayImageDataset(fns=[path/dir_im/f for f in dfData.fnImage.values],
                                labels=dfData.classes.values)
    @staticmethod
    def pil2tensor(image)->TensorImage:
        "Convert PIL style image array to torch style image tensor."
        arr = torch.from_numpy(np.asarray(image))
        arr = arr.view(image.size[1], image.size[0], -1)
        return arr.permute(2,0,1)
    @staticmethod
    def open_image(fn:PathOrStr)->Image:
        x = PIL.Image.open(fn).convert('I')
        return Image(GrayImageDataset.pil2tensor(x).float().div_(65536.0))
    # overrides the base-class method, so it is not marked abstract here
    def _get_x(self, i):
        return GrayImageDataset.open_image(self.x[i])

train_ds = GrayImageDataset.create(path, tvData[tvData.purpose=="train"])
valid_ds = GrayImageDataset.create(path, tvData[tvData.purpose=="test"])

data = ImageDataBunch.create(train_ds, valid_ds, path=str(path/"tiles"), ds_tfms=get_transforms(), bs=64, size=224)
data.normalize(getGrayStats())

The notebook also shows how to modify the first layer of a pretrained resnet to take a grayscale image.
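
The notebook's own code for this is not reproduced in the thread. As an illustration only, one common way to adapt a 3-channel first convolution to 1-channel input is to sum its weights over the input-channel axis. The sketch below uses random numbers as a stand-in for pretrained weights (the (64, 3, 7, 7) shape matches the first convolution of torchvision's resnets) and checks that the summed filters respond to a grayscale patch exactly as the original filters respond to that patch copied into three channels.

```python
import numpy as np

rng = np.random.RandomState(0)
w_rgb = rng.randn(64, 3, 7, 7)                 # stand-in weights: (out, in, kernel_h, kernel_w)
w_gray = w_rgb.sum(axis=1, keepdims=True)      # (64, 1, 7, 7): one input channel

patch = rng.rand(7, 7)                         # one grayscale patch
patch_rgb = np.repeat(patch[None], 3, axis=0)  # same patch copied into 3 channels

# Filter responses at a single position (a convolution is this dot product at every position)
resp_rgb = (w_rgb * patch_rgb).sum(axis=(1, 2, 3))
resp_gray = (w_gray * patch[None]).sum(axis=(1, 2, 3))
```

Because the two responses agree, this keeps the pretrained behaviour while avoiding the memory cost of tripling the input.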

yonatan365 · November 1, 2018, 6:51pm

Interesting!

That may explain the errors I was getting. I implemented a dataset as suggested above (thank you sgugger!):

class NumpyDataset(DatasetBase):
    def __init__(self, X, y=None):
        super().__init__(np.unique(y))
        self.classes = np.unique(y)
        self.c = len(np.unique(y))
        self.X = X
        if y is not None: self.y = y
        
    def __getitem__(self, i):
        if self.y is not None: return (Tensor(self.X[i]).to(torch.float), Tensor(self.y[i]).to(torch.float))
        return Tensor(X[i]).to(float)
                
    def __len__(self): return len(self.X)

and got the following error, whose source I haven’t found yet:

RuntimeError: Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/opt/anaconda3/lib/python3.6/site-packages/fastai/torch_core.py", line 91, in data_collate
    return torch.utils.data.dataloader.default_collate(to_data(batch))
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 1 and 8 in dimension 1 at /opt/conda/conda-bld/pytorch-nightly_1540121100527/work/aten/src/TH/generic/THTensorMoreMath.cpp:1317

But in light of what you describe above, maybe a quick (and inefficient) fix for me would be to duplicate the grayscale data to 3 channels, as in a color image, or just add 2 more channels of zeros.

I’m not sure I can use your code as is, because my data is already in numpy array format, not image files, and I don’t want to take the extra step of converting it to many image files and back to a tensor. I’ll keep trying and update… and will be grateful for more ideas.
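
A plausible cause of the traceback above, offered as an assumption rather than a confirmed diagnosis: the capitalised torch.Tensor constructor, when given a single integer n, allocates an uninitialised tensor of length n instead of storing the value n. Labels 1 and 8 then become tensors of length 1 and 8, which matches "Got 1 and 8 in dimension 1" when the batch is stacked. The sketch below reproduces that with plain Python integers and shows the lowercase torch.tensor alternative.

```python
import torch

labels = [1, 8]  # two integer class labels that end up in the same batch

# torch.Tensor(n) with an integer n allocates an uninitialised tensor of length n
as_sizes = [torch.Tensor(n) for n in labels]
shapes = [tuple(t.shape) for t in as_sizes]

try:
    torch.stack(as_sizes, 0)
    stack_failed = False
except RuntimeError:
    stack_failed = True   # tensors of different lengths cannot be stacked

# torch.tensor (lowercase) stores the value itself as a 0-dim tensor
as_values = [torch.tensor(n, dtype=torch.long) for n in labels]
batch = torch.stack(as_values, 0)
```

If this is indeed the problem, returning the label as an integer (or a 0-dim long tensor) from __getitem__ should let the default collate function build the batch.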

Thanks!

Kaspar · November 2, 2018, 6:35am

Converting 16-bit to RGB will lead to a loss in precision. That may or may not be important to you.
The PR I am working on also takes a numpy array (see below).

Better to keep your data in numpy and convert it to a tensor as below. Also remember to convert to float and divide by the appropriate scaling factor.

def pil2tensor(image):
    "Convert PIL.Image or numpy.ndarray to torch tensor."
    a = np.asarray(image)
    if a.ndim == 2: a = np.expand_dims(a, 2)
    a = np.transpose(a, (1, 0, 2))  # transpose width, height to height, width
    a = np.transpose(a, (2, 1, 0))  # move channels to the first position
    if a.dtype == bool: a = a.astype(np.uint8)
    elif a.dtype in [np.dtype(">u2"), np.uint16]: a = a.astype(np.int32)
    return torch.from_numpy(a)
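
To see what the two transposes in pil2tensor do to a numpy array, the sketch below (numpy only, with made-up shapes) checks that together they turn a (height, width, channels) array into (channels, height, width), the same as a single transpose with axes (2, 0, 1). It also repeats the uint16 to int32 step on a 2-D grayscale array.

```python
import numpy as np

h, w, c = 4, 6, 3
a = np.arange(h * w * c).reshape(h, w, c)   # (height, width, channels)

# The two steps used in pil2tensor above
two_step = np.transpose(np.transpose(a, (1, 0, 2)), (2, 1, 0))
# The equivalent single step
one_step = np.transpose(a, (2, 0, 1))

# A 2-D uint16 grayscale array gets a channel axis first, then moves to int32
gray = np.zeros((4, 6), dtype=np.uint16)
gray3d = np.expand_dims(gray, 2)                         # (4, 6, 1)
chw = np.transpose(gray3d, (2, 0, 1)).astype(np.int32)   # (1, 4, 6)
```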

Zeina · November 11, 2018, 8:54pm

I’m in the same situation, I’m using data from arrays and got the same error.

Not sure if this is related to the number of channels in the data. According to https://github.com/fastai/fastai/issues/1028, it might be related to image size.

After creating a custom dataset from DatasetBase, I passed it to DataBunch.create(), and that’s when I got the error:
invalid argument 0: Sizes of tensors must match except in dimension 0

In order to handle sizing easily, I thought I’d try to pass the Dataset to ImageDataBunch.create() instead and call it with the size option. This produced a different error:
‘Tensor’ object has no attribute ‘set_sample’

I gave up for today, but thought I’d see if you have any updates about this.

Kaspar · November 11, 2018, 9:34pm

The new version of pil2tensor has now been implemented: https://docs.fast.ai/vision.image.html#pil2tensor

If the issue is about 16-bit grayscale images, then the following may help.

You can now create your own open_image like this:

def open_image_16bit2rgb(fn):
    # step 1: open the 16-bit grayscale image, convert to int32 and view it as a numpy array
    a = np.asarray(PIL.Image.open(fn).convert("I"))
    # step 2: add a dimension so we have height, width, 1 channel
    a = np.expand_dims(a, axis=2)
    # repeat the channel three times to make it an RGB image
    a = np.repeat(a, 3, axis=2)
    return Image(pil2tensor(a, np.float32).div(65535))

Or, if you already have your data as an np.array, start at step 2 (for a width * height * 1 channel array the channel axis is already there, so go straight to the repeat).
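
For data that is already in memory, the array steps can be tried without fastai. This is a numpy-only sketch on a tiny made-up 16-bit array; the final transpose and division stand in for what pil2tensor(...).div(65535) is described as doing above, which is my reading rather than a check against the fastai source.

```python
import numpy as np

gray16 = np.array([[0, 32768], [65535, 1024]], dtype=np.uint16)  # (height, width)

a = np.expand_dims(gray16, axis=2)   # (2, 2, 1): add the channel axis
a = np.repeat(a, 3, axis=2)          # (2, 2, 3): copy it into three channels

# Channels first, float32, scaled from 0..65535 to 0..1
chw = np.transpose(a, (2, 0, 1)).astype(np.float32) / 65535
```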

By the way, you should set the image_opener in the dataset. This can be done using the data block design and through the ImageDataBunch.

This works for me. I guess I should write a blog post.

Did this help? @Zeina, @yonatan365

yonatan365 · November 12, 2018, 6:40am

Hi Kaspar,

Thanks for your continuous interest and help!
For me at least (I don’t know about Zeina), the data is synthetic and represented as a numpy array. It was never an image in the first place.

This seems like the simpler case: ultimately all images become numpy arrays (or PyTorch tensors) anyhow. But somehow I get errors when I try to change the pipeline to work with this data.

To be explicit, let’s say I create an array of the following shape:

data = np.random.randn(100,50,50)

which in my case represents 100 samples of 50x50 grayscale “images”.

I don’t know exactly how to connect this with your pil2tensor function. Ultimately, it would be nice to have the following command:
data = ImageDataBunch.from_numpy(data, train_idxs, valid_idxs, n_channels)
or something similar that will get a numpy array, and use it for the next modeling stages as image data given the number of channels (which ideally can be anything >=1).
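
The array-handling half of such a command is easy to sketch. The helper below is hypothetical (split_with_channels is not a fastai function): it selects the training and validation samples by index and copies the grayscale plane into n_channels channels, producing (n, channels, height, width) arrays. Wrapping the result in a dataset and a DataBunch is left out.

```python
import numpy as np

def split_with_channels(data, train_idxs, valid_idxs, n_channels=1):
    """Hypothetical helper: split by index and add a (repeated) channel axis."""
    data = np.asarray(data, dtype=np.float32)

    def take(idxs):
        subset = data[idxs]                                            # (n, H, W)
        return np.repeat(subset[:, None, :, :], n_channels, axis=1)    # (n, C, H, W)

    return take(train_idxs), take(valid_idxs)

data = np.random.randn(100, 50, 50)                   # 100 samples of 50x50 "images"
idxs = np.random.RandomState(0).permutation(100)      # shuffled sample indices
train_x, valid_x = split_with_channels(data, idxs[:80], idxs[80:], n_channels=3)
```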

I started building it, but got into some trouble and didn’t yet find the time to solve it… I will probably do it but it will take some more time for me.

If you know the flow of actions required to fit a pre-trained model such as resnet with data of this kind, without the errors, and can make it public here, it will be very helpful for me (and probably others…)

thanks,
Yonatan.

Zeina · November 12, 2018, 8:09am

Thanks for your help Kaspar!

I have a hunch it’s not about RGB channels but will try this and update you asap.

Kaspar · November 12, 2018, 9:38am

fastai basically needs a list of references to input data vs output data. This does not have to be images on a disk, although that would be the most straightforward case. The list should be split into a training and a validation set by you or fastai.

I have not implemented your case but would go about it approximately as follows. @sgugger would certainly be able to propose a much more elegant approach.

class MyMemoryData:
    validIDs = None
    validInput = None
    validOutput = None
    trainIDs = None
    trainInput = None
    trainOutput = None

def memorydata2tensor(id):
    # note: this looks up validInput only; the training set needs the same lookup on trainInput
    a = MyMemoryData.validInput[id]
    # step 2: add a dimension so we have height, width, 1 channel
    a = np.expand_dims(a, axis=2)
    # repeat the channel three times to make it an RGB image
    a = np.repeat(a, 3, axis=2)
    return Image(pil2tensor(a, np.float32).div(65535))

MyMemoryData.validInput = ...   # your input data for validation
MyMemoryData.validOutput = ...  # your output data for validation, i.e. classes
MyMemoryData.validIDs = np.arange(len(MyMemoryData.validInput))
MyMemoryData.trainInput = ...   # your input data for training
MyMemoryData.trainOutput = ...  # your output data for training, i.e. classes
MyMemoryData.trainIDs = np.arange(len(MyMemoryData.trainInput))

dsTrain = ImageClassificationDataset(MyMemoryData.trainInput, MyMemoryData.trainOutput)
dsValid = ImageClassificationDataset(MyMemoryData.validInput, MyMemoryData.validOutput)
dsTrain.image_opener = dsValid.image_opener = memorydata2tensor

tfms = get_transforms()
# the targets are class labels, so they are not transformed
dsTrainTfm = DatasetTfm(dsTrain, tfms[0], tfm_y=False)
dsValidTfm = DatasetTfm(dsValid, tfms[1], tfm_y=False)

data = ImageDataBunch.create(
    dsTrainTfm,
    dsValidTfm,
    bs=8,
    size=224
)
data.normalize(imagenet_stats)  # I guess you are not using a pretrained net, so you would need your own means and stds to normalize
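
Computing your own means and stds from an in-memory training set is a one-liner per statistic. The sketch below uses random stand-in data and numpy only; how the resulting values are passed to data.normalize depends on the fastai version and is not shown.

```python
import numpy as np

rng = np.random.RandomState(0)
train = rng.rand(100, 50, 50)                    # stand-in training data: (n, H, W)
train3 = np.repeat(train[:, None], 3, axis=1)    # (n, 3, H, W) after channel duplication

# Per-channel statistics over samples, height and width
mean = train3.mean(axis=(0, 2, 3))
std = train3.std(axis=(0, 2, 3))

# Broadcasting the statistics back over the batch normalizes each channel
normalized = (train3 - mean[None, :, None, None]) / std[None, :, None, None]
```

Only the training set should be used to compute these statistics; the validation set is then normalized with the same values.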

Kaspar · November 12, 2018, 9:53am

Sorry about the edits, my fingers slipped on the keyboard.

Kaspar · November 12, 2018, 10:28am

I can see that they want us to use the data block API in a coming version, so the above will have to be transformed into the data block API.

adilism · November 12, 2018, 10:53am

In case it is helpful, I’ve created a custom dataset for a Kaggle competition that takes drawings encoded as a sequence, converts them to greyscale images and feeds the images to the network, and it is compatible with fastai tools. The notebook is below:

github.com/adilism/quickdraw/blob/master/fastai-baseline.ipynb

Quick, Draw! on fastai v1

The notebook generates a solution for the Quick, Draw! Kaggle competition (https://www.kaggle.com/c/quickdraw-doodle-recognition) using the fastai library (https://github.com/fastai/fastai#installation):
* Implements a custom dataset that generates images from strokes
* Trains using a pretrained resnet34 and a random sample of the data
* The solution achieves 0.82 on the Public Leaderboard in under 30 minutes (n1-standard-8+P100)

NB. The notebook uses code from a fast.ai MOOC that will be publicly available in Jan 2019

Zeina · November 15, 2018, 9:06pm

While giving up and saving all arrays into image files would save me the trouble, I think getting images from arrays is very useful in general so I’m trying to persevere.

I tried the channels trick (it took a serious amount of time).
I think I have everything right but it still doesn’t work.

I’m pretty sure it’s a problem with getting the batches in Pytorch’s DataLoader.

I would appreciate any guidance of course. Here’s the Colab notebook I am working on.

deepanshu2017 · November 15, 2018, 11:39pm

@Zeina I have modified your notebook. The issue was the way you were returning the value of y from your Dataset.

Updated Notebook

@sgugger Why are my train and validation losses going negative? The loss function here is nll_loss and I have not used any data transformations. The loss is negative with resnet18 and an lr as low as 1e-9.

@sgugger I have updated the notebook URL.

devforfu · November 16, 2018, 9:32am

I had some problems with a custom dataset class. Probably I was doing something wrong though.

source99 · November 16, 2018, 7:26pm

I’ve had good success saving numpy arrays as png using this general flow:

Start with a numpy array named arr filled with integers (though I think this would work for floats also).

arr = (arr - arr.mean()) / arr.std()  # standardizes to zero mean and unit std (values can still fall outside -1..+1)
arr = (arr + 1) / 2 * 255  # maps -1..+1 onto 0..255
arr = np.clip(arr, 0, 255).astype(np.uint8)  # clips to 0..255 and converts to 8-bit int
imageio.imwrite(output_filename, arr)

Using these I have created a databunch through the data block api or the higher level api.
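
One thing to keep in mind with the flow above: standardizing does not bound the values to -1..+1, so the clip step saturates everything further than one standard deviation from the mean. If that loss matters, min-max scaling is an alternative that uses the full 0..255 range without clipping. The sketch below is a swapped-in technique, not the method from the post, and it assumes the array is not constant (otherwise the division is by zero).

```python
import numpy as np

rng = np.random.RandomState(0)
arr = rng.randn(50, 50) * 10 + 3          # stand-in data

# Standardizing: many values still land outside -1..+1 and would be clipped
z = (arr - arr.mean()) / arr.std()
frac_outside = np.mean(np.abs(z) > 1)     # roughly a third for normally distributed data

# Min-max scaling: the smallest value maps to 0, the largest to 255
scaled = (arr - arr.min()) / (arr.max() - arr.min())
img = np.round(scaled * 255).astype(np.uint8)
```

Either way, an 8-bit PNG keeps only 256 levels per pixel, which is coarser than the original float data.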

deepanshu2017 · November 16, 2018, 7:52pm

My interpretation is that we are feeding torch’s nll_loss the output of the model and the true labels. As far as I understand (and from running it manually), I observe the following.

The model output of one batch is fed to nll_loss as input.
Target labels (0-indexed) are fed to nll_loss as target.
nll_loss simply returns -sum(target * input), or -sum(input[target]), which I believe should not be the case, as negative log likelihood is defined as -sum(y*logp).

Please see below image.

Here, in run number 67, nll_loss simply took the negative of the value at index 27 of the input, and in run number 69 it took the negative of the value at index 11 of the input.
Why is that the case? Why is it not taking the log? This is also why I believe I was getting the negative loss. Both of the inputs in the above image were taken while debugging and running my notebook above.

@sgugger Please help please.
Thanks

CC: @jeremy Sorry for @ mention
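
For readers hitting the same negative loss: PyTorch's nll_loss expects its input to already be log-probabilities (for example the output of log_softmax) and does not take the log itself; it picks the target entry of each row, negates it, and by default averages over the batch. Feeding it raw model outputs therefore gives a negative loss whenever the selected scores are positive. A numpy sketch of the same arithmetic, with made-up scores:

```python
import numpy as np

def log_softmax(scores):
    # Subtract the row maximum first for numerical stability
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll(log_probs, target):
    # Pick the target entry of each row, negate, and average over the batch
    return -log_probs[np.arange(len(target)), target].mean()

scores = np.array([[2.0, 5.0, 1.0],
                   [0.5, 0.1, 4.0]])   # raw model outputs (not log-probabilities)
target = np.array([1, 2])

loss_on_raw = nll(scores, target)                 # -(5.0 + 4.0) / 2 = -4.5
loss_on_logp = nll(log_softmax(scores), target)   # a proper, non-negative NLL
```

In PyTorch the usual fixes are to end the model with a log_softmax layer or to use cross_entropy, which combines log_softmax and nll_loss.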

fredguth · November 19, 2018, 5:01pm

Have you checked whether the problem is with the order of the array’s dimensions? It is common to have a numpy array where channels is the last dimension: X.shape will return something like (5000, 120, 120, 3), i.e. 5000 samples of 120h x 120w x 3c. But PyTorch expects channels right after the batch dimension: (5000, 3, 120, 120).
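
Moving the channel axis is a single transpose. The sketch below uses a smaller stand-in array (50 samples instead of 5000) and also shows the grayscale case, where the channel axis has to be added rather than moved.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(50, 120, 120, 3)            # (samples, height, width, channels)

# Channels-last to channels-first: (N, H, W, C) -> (N, C, H, W)
X_nchw = np.transpose(X, (0, 3, 1, 2))

# Grayscale data without a channel axis: (N, H, W) -> (N, 1, H, W)
G = rng.rand(50, 120, 120)
G_nchw = G[:, None, :, :]
```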

I am looking for a solution myself; if I manage to solve it, I will post here.

source99 · November 19, 2018, 5:05pm

I’d love to see some example custom data classes for this problem, or anything really. Please post if you have done it successfully (or not successfully).