If it might be helpful, I’ve created a custom dataset for a Kaggle competition that takes drawings encoded as a sequence, converts them to greyscale images and feeds the images to the network, and it is compatible with fastai tools. The notebook is below:
While giving up and saving all arrays into image files would save me the trouble, I think getting images from arrays is very useful in general so I’m trying to persevere.
Tried the channels trick (it took a serious amount of time).
I think I have everything right but it still doesn’t work.
I’m pretty sure it’s a problem with getting the batches in Pytorch’s DataLoader.
I would appreciate any guidance of course. Here’s the Colab notebook I am working on.
@Zeina I have modified your notebook. The issue was the way you were returning the value of y from your Dataset.
@sgugger Why are my train and validation losses going negative? The loss function here is nll_loss, and I have not used any data transformations. The loss is negative with resnet18 and an lr as low as 1e-9.
@sgugger I have updated the notebook URL.
I’ve had good success saving numpy arrays as png using this general flow:
Start with a numpy array named arr filled with integers (though I think this would work for floats also).
arr = (arr - arr.mean()) / arr.std()  # standardizes to zero mean and unit variance (not a fixed range)
arr = (arr + 1) / 2 * 255  # maps the range [-1, +1] onto [0, 255]
arr = np.clip(arr, 0, 255).astype(np.uint8)  # clips to 0..255 (values beyond one std saturate) and converts to 8-bit int
imageio.imwrite(output_filename, arr)
Using these I have created a databunch through the data block api or the higher level api.
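One caveat with the flow above: standardizing gives zero mean and unit variance rather than a fixed range, so values more than one standard deviation from the mean end up clipped to 0 or 255. If that saturation is not wanted, min-max scaling is an alternative. A minimal sketch, assuming a numeric array; `to_uint8` is a hypothetical helper name, not part of NumPy or imageio:

```python
import numpy as np

def to_uint8(arr):
    """Min-max scale a numeric array to the full 0..255 range as uint8."""
    arr = np.asarray(arr, dtype=np.float64)
    lo, hi = arr.min(), arr.max()
    if hi == lo:
        # Constant image: avoid division by zero and return all zeros.
        return np.zeros(arr.shape, dtype=np.uint8)
    scaled = (arr - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

img = to_uint8(np.arange(12).reshape(3, 4))
# img can then be written with imageio.imwrite(output_filename, img), as in the post above.
```

Min-max scaling keeps the relative ordering of all pixels but is sensitive to outliers, so which of the two approaches suits a dataset is a judgment call.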
My interpretation is that we are feeding torch’s nll_loss the output of the model and the true labels. As per my understanding (and from running it manually), I observe the following:
- The model output for one batch is fed to nll_loss as input.
- The target labels (0-indexed) are fed to nll_loss as target.
- nll_loss simply returns -sum(target * input), or -sum(input[target]), which I believe should not be the case, as negative log likelihood is defined as -sum(y * log p).
Please see the image below.
Here, in run number 67, nll_loss simply took the negative of the index 27 value from input, and in run number 69 it took the negative of the index 11 value from input. Why is that the case? Why is it not taking the log? This is also why I believe I was getting the negative loss. Both of the inputs in the above image were taken while debugging and running my notebook above.
@sgugger Please help please.
Thanks
CC: @jeremy Sorry for @ mention
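On the negative-loss question above: `nll_loss` does not take a log itself. It expects its input to already be log-probabilities (for example the output of `log_softmax`) and, with the default mean reduction, returns the average of `-input[i, target[i]]`. Fed raw scores, it can therefore go negative. The NumPy sketch below mirrors that behaviour rather than calling PyTorch; `log_softmax` and `nll` here are hypothetical helpers written for illustration.

```python
import numpy as np

def log_softmax(scores):
    """Row-wise log-softmax, computed stably by subtracting the row max."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll(inputs, targets):
    """Mean of -inputs[i, targets[i]]: pick the target column, negate, average."""
    picked = inputs[np.arange(len(targets)), targets]
    return -picked.mean()

scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.3]])  # raw model outputs, not log-probabilities
targets = np.array([0, 1])            # 0-indexed class labels

loss_raw = nll(scores, targets)               # negative here: -(2.0 + 3.0) / 2
loss_ok = nll(log_softmax(scores), targets)   # non-negative: log-probabilities are <= 0
```

In PyTorch the usual remedies are to end the model with a log-softmax layer or to use `cross_entropy`, which combines `log_softmax` and `nll_loss`.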
Have you checked whether the problem is the order of the array's dimensions? It is common to have a numpy array where channels is the last dimension, so X.shape returns something like (5000, 120, 120, 3): 5000 samples of 120h x 120w x 3c. But PyTorch expects channels before height and width, i.e. (5000, 3, 120, 120).
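PyTorch's 2D convolution layers take batches laid out as (samples, channels, height, width), so a channels-last array needs its axes reordered, which is a single transpose. A minimal sketch, using a small stand-in array (8 samples instead of 5000) to keep it light:

```python
import numpy as np

# Stand-in for a channels-last batch: (samples, height, width, channels).
X = np.zeros((8, 120, 120, 3), dtype=np.float32)

# Reorder to channels-first: (samples, channels, height, width).
X_chw = np.transpose(X, (0, 3, 1, 2))

# transpose returns a view with non-contiguous strides; make a contiguous copy
# before handing the array to code that needs contiguous memory.
X_chw = np.ascontiguousarray(X_chw)
```

The result can then go through `torch.from_numpy` as in the other posts in this thread.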
I am looking for a solution myself; if I manage to solve it, I will post it here.
I’d love to see some example custom data classes from this problem, or anything really. Please post if you have done it successfully (or not successfully).
I guess I made it work:
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
# Normalization
mean = X_train.mean()
std = X_train.std()
X_train = (X_train-mean)/std
X_valid = (X_valid-mean)/std
# Numpy to Torch Tensor
X_train = torch.from_numpy(np.float32(X_train))
y_train = torch.from_numpy(y_train.astype(np.long))
X_valid = torch.from_numpy(np.float32(X_valid))
y_valid = torch.from_numpy(y_valid.astype(np.long))
train = torch.utils.data.TensorDataset(X_train, y_train)
valid = torch.utils.data.TensorDataset(X_valid, y_valid)
data = ImageDataBunch.create(train_ds = train, valid_ds=valid)
Hmm, I am getting a 'TensorDataset' object has no attribute 'new' error with your MNIST example:
AttributeError Traceback (most recent call last)
<ipython-input-56-8bf4e3c38b32> in <module>()
----> 1 testdata = test()
<ipython-input-55-13ce614b2d08> in test()
19 valid = torch.utils.data.TensorDataset(X_valid, y_valid)
20
---> 21 data = ImageDataBunch.create(train_ds = train, valid_ds=valid)
22 return data
/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py in create(cls, train_ds, valid_ds, test_ds, path, bs, num_workers, tfms, device, collate_fn, no_check)
112 collate_fn:Callable=data_collate, no_check:bool=False)->'DataBunch':
113 "Create a `DataBunch` from `train_ds`, `valid_ds` and maybe `test_ds` with a batch size of `bs`."
--> 114 datasets = cls._init_ds(train_ds, valid_ds, test_ds)
115 val_bs = bs
116 dls = [DataLoader(d, b, shuffle=s, drop_last=(s and b>1), num_workers=num_workers) for d,b,s in
/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py in _init_ds(train_ds, valid_ds, test_ds)
102 @staticmethod
103 def _init_ds(train_ds:Dataset, valid_ds:Dataset, test_ds:Optional[Dataset]=None):
--> 104 fix_ds = valid_ds.new(train_ds.x, train_ds.y) # train_ds, but without training tfms
105 datasets = [train_ds,valid_ds,fix_ds]
106 if test_ds is not None: datasets.append(test_ds)
AttributeError: 'TensorDataset' object has no attribute 'new'
I guess something in the lib has changed again?
I don’t suppose you have any idea of how I should tweak the code to get it to work?
That should be fixed now.
Any solution on this so far?
Yes and no. I managed to create the ImageDataBunch after Jeremy’s fix. But when I try to run data.show_batch(rows=3, figsize=(7,6)) it errors with 'KmnistDataset' object has no attribute 'x' on the line if self.train_ds.x._square_show: rows = rows ** 2 (KmnistDataset is the custom class that subclasses the Dataset class).
Except you can’t do anything with this databunch object: 'TensorDataset' object has no attribute 'c'
And your code is missing:
from sklearn.model_selection import train_test_split
As @sgugger said, you have to implement your own Dataset class. It appears from this thread that a new method from_array() would be a useful addition to the fastai library.
data = (ItemList.from_array(train_ds = train_array, valid_ds=valid_array, test_ds=test_array), ...)
An ItemList takes an array of items, so I’m not sure a new method is required. Note that you want an ItemLists if you’re using several datasets.
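For the "implement your own Dataset class" route mentioned above, a minimal sketch follows. It is written as a plain Python class with `__len__` and `__getitem__`, which is the map-style protocol PyTorch's `DataLoader` consumes; in practice one would subclass `torch.utils.data.Dataset`. `ArrayDataset` is a hypothetical name, and the `c` attribute (number of classes) is included only because this thread reports an error about it being missing; whether that alone satisfies a given fastai version is an assumption.

```python
import numpy as np

class ArrayDataset:
    """Map-style dataset over in-memory arrays (hypothetical helper)."""

    def __init__(self, x, y):
        assert len(x) == len(y), "x and y must have the same number of samples"
        self.x = np.asarray(x, dtype=np.float32)
        self.y = np.asarray(y, dtype=np.int64)
        self.c = len(np.unique(self.y))  # number of distinct classes

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        # Add a leading channel axis so a greyscale image is (1, H, W).
        return self.x[i][None, ...], self.y[i]

ds = ArrayDataset(np.random.rand(50, 224, 224), np.random.randint(0, 10, size=50))
image, label = ds[0]
```

Keeping the arrays in memory and adding the channel axis per item avoids writing image files, at the cost of holding the whole dataset in RAM.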
Can you repost the link? It’s not working anymore.
Here is a way to do it. This will create an in-memory image list.
from fastai.vision import *  # assumes fastai v1

# Load a sample dataset and collect every image as a numpy array
p = untar_data(URLs.MNIST_SAMPLE)
train = p/'train'
imagesl = ImageList.from_folder(train)
images = []
for i in imagesl:
    images.append(i.data.numpy())
images = np.array(images)

class MyImageList(ImageList):
    def open(self, i):
        # Items are already Image objects, so return them unchanged
        return i

    @staticmethod
    def from_numpy(arr):
        items = []
        for i in arr: items.append(Image(torch.from_numpy(i)))
        return MyImageList(items)

MyImageList.from_numpy(images)
Has anyone had any success creating a databunch or dataset from numpy arrays? I’ve read through a lot of the forum posts, and either the code no longer works or it doesn’t apply. Are there any code samples showing how to create a databunch from
x = np.ndarray((50,224,224))
y = np.ndarray((50))
meaning 50 arrays of 224x224 and 50 labels?
Much appreciated.
Here is the simplest way I can think of doing what you want:
class ArrayItemList(ItemList):
    @classmethod
    def from_numpy(cls, numpy_array):
        return cls(items=numpy_array)

    def label_from_array(self, array, label_cls=None, **kwargs):
        return self._label_from_list(array, label_cls=label_cls, **kwargs)

x = np.random.rand(50,224,224)
y = np.random.rand(50)
data = (ArrayItemList.from_numpy(x)
        .split_none()
        .label_from_array(y)
        .databunch(bs=10))