Thanks for that note. Sorry I had not checked in in a while. I am playing with V1 right now and I can do that update. Thanks for the suggestion!
I did the update here. The code gets simpler since you use the PyTorch TensorDataset and then load that into a DataBunch. From there it is very similar, with small changes to the API. Let me know if you have any questions.
hi Bobak,
in your example you seem to be passing numpy arrays as data
train_ds = tdatautils.TensorDataset(X,y)
valid_ds = tdatautils.TensorDataset(X_val,y_val)
test_ds = tdatautils.TensorDataset(X_test,y_test)
With large datasets that would be difficult to do. Do you know how I can pass in file names with a custom function to read the files?
thanks,
Two thoughts on this:
First, I have managed to pass some pretty large files in this way with no issue. If you can load up the Numpy array, you can cast to a tensor (you are not putting on the GPU, you are just wrapping as a PyTorch Tensor.) So, I am curious why the data can’t be loaded in your case.
Second, I am happy to try to get a function working to wrap the files. Can you give me a sample set of data and then I will, probably, see better what is going on with your use case. It does not need to be large, but just representative of the use case. It can be filled with random values just so we can figure out how to handle the files.
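To illustrate the first point, here is a minimal sketch of wrapping NumPy arrays as tensors on the CPU. The arrays are hypothetical stand-ins filled with random values; only `torch.from_numpy` and `TensorDataset` are the actual calls under discussion.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Hypothetical stand-ins for X and y: random values, small shapes.
X = np.random.rand(8, 3, 16, 16).astype(np.float32)
y = np.random.randint(0, 2, size=8)

# torch.from_numpy wraps the existing buffer: no copy is made and
# nothing is moved to the GPU.
X_t = torch.from_numpy(X)
y_t = torch.from_numpy(y)

train_ds = TensorDataset(X_t, y_t)
print(len(train_ds), X_t.device, np.shares_memory(X, X_t.numpy()))
```

This still requires the full array to fit in RAM, so it only helps when loading the NumPy array is possible in the first place.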
I already tried the first approach, but it doesn’t work because the inputs are large images at resolutions of up to 5K. The labels are smaller; they are processed as numpy arrays and saved in compressed format. They are not special files, just a bit large. What I need is an interface where I can pass the list of input image files (a list of strings containing file paths and names) and similarly an interface to pass a list of npz files along with a
input = IMG_1.jpg, IMG_2.jpg, …etc saved in folder: images
labels= IMG_1.npz, IMG_2.npz,…etc saved in folder : npz_files
Here is my current function, which I call in the __getitem__ function of my dataset class, which extends the PyTorch Dataset class.
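import numpy as np
from PIL import Image

def load_data(img_path):
    # The label file mirrors the image path: images/IMG_1.jpg -> npz_files/IMG_1.npz
    gt_path = img_path.replace('.jpg','.npz').replace('images','npz_files')
    img = Image.open(img_path).convert('RGB')
    gt_file = np.load(gt_path)
    target = gt_file['arr_0']
    return img, target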
To make some sample files, put some large images in the images folder, read them with numpy, and save downscaled versions of them in npz format. Here is one example.
img= plt.imread(img_path)
label = cv2.resize(img,(int(img.shape[1]/8),int(img.shape[0]/8)),interpolation = cv2.INTER_CUBIC)*64
np.savez_compressed("npz_files/label1.npz", label)
I just don’t want to load all of them into RAM in advance, because then it runs out of RAM.
Thanks so much… I would be excited to use fastai’s features like lr_find for my project.
Let me know if you need any further assistance from me. I am happy to help.
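For reference, the lazy-loading idea described above (keep only file paths in memory and read one pair per index) can be sketched roughly as follows. This is only an illustration under some assumptions: `LazyPairDataset` and `load_pair_npy` are hypothetical names, the inputs are saved as .npy instead of .jpg so the demo needs no image library, and the label path is derived with os.path rather than str.replace. A PyTorch DataLoader accepts any object with `__len__` and `__getitem__`; in practice you would subclass `torch.utils.data.Dataset`.

```python
import os
import tempfile
import numpy as np

class LazyPairDataset:
    """Keeps only file paths in memory; reads one (input, label) pair per index."""

    def __init__(self, input_paths, load_pair):
        self.input_paths = list(input_paths)
        self.load_pair = load_pair  # callable: path -> (input, label)

    def __len__(self):
        return len(self.input_paths)

    def __getitem__(self, idx):
        # Nothing is read from disk until an item is requested.
        return self.load_pair(self.input_paths[idx])


def load_pair_npy(input_path):
    # Stand-in loader (hypothetical): maps <root>/images/NAME.npy
    # to <root>/npz_files/NAME.npz and loads both.
    images_dir, name = os.path.split(input_path)
    label_dir = os.path.join(os.path.dirname(images_dir), 'npz_files')
    label_path = os.path.join(label_dir, os.path.splitext(name)[0] + '.npz')
    return np.load(input_path), np.load(label_path)['arr_0']


# Build a tiny throwaway layout filled with constant values.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'images'))
os.makedirs(os.path.join(root, 'npz_files'))
paths = []
for i in range(3):
    p = os.path.join(root, 'images', f'IMG_{i}.npy')
    np.save(p, np.full((8, 8, 3), i, dtype=np.uint8))
    np.savez_compressed(os.path.join(root, 'npz_files', f'IMG_{i}.npz'),
                        np.full((1, 1), i))
    paths.append(p)

ds = LazyPairDataset(paths, load_pair_npy)
img, target = ds[2]
print(len(ds), img.shape, target.shape)
```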
Is this thread closer to your use case? I have not tried this with v1 yet, so not sure how much of the code maps over, but the idea of a MatchedDataSet from files seems closer to your use case.
If this is the case, then you will likely benefit from using the built-in ImageDataBunch design, which looks at paths or filenames stored in a pd.DataFrame.
Let me know if that gets you started down a path that is productive or if that is not the right direction.
The first thread is for an older version; it can’t find the ImageData class.
With the newer version I tried using from_name_func. It accepts the function and tries to cache all the labels, but then at the end it throws an error.
dirpathB = "data/train_data/images"
tfms=get_transforms(max_lighting=0.1, max_zoom=1.05, max_warp=0.)
def get_labels(file_path):
    gt_path = file_path.replace('.jpg','.npz').replace('images','label_dir')
    gt_file = np.load(gt_path)
    target = gt_file['arr_0']
    print(gt_path)
    return target

data = ImageDataBunch.from_name_func(dirpathB, train_listB,
                                     label_func=get_labels, ds_tfms=tfms, size=(768,1024))
error:
/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, processor)
66 if processor is not None: self.processor = processor
67 self.processor = listify(self.processor)
---> 68 for p in self.processor: p.process(self)
69 return self
70
/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, ds)
281
282 def process(self, ds):
--> 283 if self.classes is None: self.create_classes(self.generate_classes(ds.items))
284 ds.classes = self.classes
285 ds.c2i = self.c2i
/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in generate_classes(self, items)
325 "Generate classes from `items` by taking the sorted unique values."
326 classes = set()
--> 327 for c in items: classes = classes.union(set(c))
328 classes = list(classes)
329 classes.sort()
TypeError: unhashable type: 'numpy.ndarray'
I haven’t tried with DataFrames, since I suspect that for labels it expects classes, as its example shows.
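The TypeError can be reproduced outside fastai. The sketch below re-creates only the loop shown at lines 326-329 of the traceback (it is not the library's full code): with class-name labels the loop works, but when each label is a 2-D array, `set(c)` iterates over its rows, which are ndarrays and therefore unhashable.

```python
import numpy as np

def generate_classes(items):
    # Re-creation of the loop from the traceback, for illustration only.
    classes = set()
    for c in items:
        classes = classes.union(set(c))
    classes = list(classes)
    classes.sort()
    return classes

# Category-style labels: each item is a list of class names.
print(generate_classes([['cat', 'dog'], ['dog']]))

# Array-style labels: each item is a 2-D map, so set(c) sees ndarray rows.
try:
    generate_classes([np.zeros((2, 3))])
except TypeError as err:
    print('TypeError:', err)
```

So the label function is being treated as if it returned categories, which is why an array-valued target needs a different label type.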
Like so many things, once you “get it” the code is actually quite simple.
It appears that an ImageImageList is in the works for this kind of purpose. In the meantime, I was able to get something working that I think would be close.
What is needed for this is to extend the ImageItemList class to generate your own labels. So far, in the new V1 version of the library, all the label methods expect you to have lists of names/labels as categories. From there, they seem to proceed as expected for a classification problem.
Here is my class, which I think generates what you are looking for:
## Extend the ImageItemList to include a custom_label method
class CustomImageItemList(ImageItemList):
    def custom_label(self,**kwargs)->'LabelList':
        '''custom label from path and npy directory'''
        #self.items is an np array of PosixPath objects with each image path
        target_filenames = [Path(str(x).replace('.tif','.npz')) for x in self.items]
        target_np_array = np.array([np.load(x)['arr_0'] for x in target_filenames],dtype=int) #can't be type='object'
        y = ItemList(items=target_np_array)
        res = self._label_list(x=self,y=y) # like this: https://github.com/fastai/fastai/blob/master/fastai/data_block.py#L221
        return res
One caveat is that you need the latest fastai version, which includes these two lines in data_block.py:
48 self.items,self.x = items,x
49 if not isinstance(self.items,np.ndarray): self.items = array(self.items, dtype=object)
To try and make it all end-to-end for you, I generated this gist. I don’t put the images/npz files into separate directories, but I do manage to generate a single batch from the paths used. I think from here, you should be able to feed the databunch into a learner() in the canonical ways.
Let me know if anything is not clear.
Thanks a lot Bobak, i really appreciate your time.
I am testing with my model.
i think i can make it run now. (edited above posts).
My model takes an image and generates a smaller version of it as a smaller map.
will post soon the result.
cheers
Great to hear. keep me updated!
Well, I can’t really make it work.
2 problems:
- I have batch size 1 and images of different sizes, so it doesn’t like it.
- Even with same-size images it starts lr_find or fit but crashes partway through.
here is an example similar to my model (a simple and clean version of my project).
Any help much appreciated.
I got this to run end-to-end with one key change to line up the input/output of the whole thing. I put a debug in at the forward() so that I could figure out what the x_in and the x_out looked like on that pass and made sure that the conv() lined up with what was expected from the batch.
y = ItemList(items=target_np_array[:,None,:,:])
Full example is here. Let me know if that maps back to your case.
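As a quick illustration of what the `[:,None,:,:]` indexing does: it inserts a channel axis of length 1, so targets shaped (N, H, W) become (N, 1, H, W), matching a conv output with a single channel. The shapes below are made up for the demo.

```python
import numpy as np

# Hypothetical batch of 4 single-channel target maps, each 96 x 128.
target_np_array = np.zeros((4, 96, 128), dtype=int)

# None (i.e. np.newaxis) inserts a new axis of length 1 at that position.
with_channel = target_np_array[:, None, :, :]
print(target_np_array.shape, '->', with_channel.shape)

# np.expand_dims gives the same result.
same = np.expand_dims(target_np_array, axis=1)
print(same.shape == with_channel.shape)
```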
Thanks so much again.
The training works on images of the same size, but on images of different sizes it throws an exception.
currently my training looks like this:
for epoch in range(0, epochs):
    learn.fit(1)
    abs_error = validate(val_list, model, criterion)
    if abs_error < best_abs_error:
        checkpoint(model, epoch%3, model_out_path+"C1Net"+"_"+repr(int(abs_error)))
    else:
        checkpoint(model, epoch%3, model_out_path+"C1Net")
    is_best = abs_error < best_abs_error
    best_abs_error = min(abs_error, best_abs_error)
    print(' * best abs_error {abs_error:.3f} '
          .format(abs_error=best_abs_error))
Also, is there a way to stop it from validating inside fit? I have a different folder for test images and labels, so I want to validate separately.
Also, to save the model after each epoch I have to put the fit inside a loop, so that after each fit(1) completes I can save the model. Or is there a better way to do that which fastai offers?
Your CNN model maps from the same size down to the same size. If you want it to do something different you can change your convolutions or change your padding to change the in/out dimensions. Once you set them up, they are static for the model (hence the need to add padding.) When I am trying to debug these types of things I will put in import pdb;pdb.set_trace() in the forward call and then inspect the input/output to figure out what the model is going to try and use.
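To see how kernel size, stride, and padding change the in/out dimensions, here is a small helper based on the output-size formula given in the PyTorch Conv2d documentation. The function name `conv_out_size` and the example sizes are just for illustration.

```python
def conv_out_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis of a convolution.

    floor((n + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1
    """
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# 3x3 kernel with padding 1 keeps the size.
print(conv_out_size(768, kernel=3, padding=1))            # 768
# Same kernel without padding trims one pixel per side.
print(conv_out_size(768, kernel=3))                       # 766
# Stride 2 halves the size.
print(conv_out_size(768, kernel=3, stride=2, padding=1))  # 384
```

Applying this per layer to the height and width of a batch is a quick way to check what shape the target needs to have.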
It appears that fit checks data.empty_val to decide whether to run validation once per epoch. You could leave the validation set empty, or you can build your validation and test data as needed and feed them into the DataBunch.
Don’t put the fit inside the loop. Go back and look at the fast.ai lectures about fitting, and then use the fit_one_cycle method with the appropriate callbacks so that you save on best or save each cycle. If you do it in a loop, you lose all the history about momentum and gradients that is very important to getting a good fit.
Thanks so much Bobak.
I will modify it as you mentioned.
@bfarzin In the custom dataset it seems that you are dropping the 4th channel. I have a dataset with 7 channels and I need all 7 of them. If I use this, how will data augmentation work, since it is only defined for 3-channel images in fast.ai? Or is there a way to pass a dataset with my own transforms into a fastai learner?
I don’t understand your question. you have a 7 channel image? Maybe if you post an example it would be clear what you are trying to do and I could help out.
I have 7 channels as input to the CNN. The dataset is similar to the one used in the dstl competition. For one output mask there are 7 channels as input. It is basically a segmentation problem.
@bfarzin Basically I have (7 * size * size) as my input and I want to pass it to a UNET to get (n_classes * sz * sz) as output masks. How do I pass this to fastai? When using standard PyTorch I implemented basic transformations like horizontal flip and 90 degree rotations using numpy. Will the transforms of fastai work for 7-channel input? If not, how can I pass my own transforms to fastai?
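For context, the numpy-based flip and 90 degree rotation mentioned above might look roughly like this. This is only a sketch of the transforms themselves, not of how to plug them into fastai; the function name `random_flip_rot90` and the shapes are assumptions. The same operation is applied to the input and the mask so they stay aligned.

```python
import numpy as np

def random_flip_rot90(image, mask, rng):
    """image: (C, H, W), mask: (K, H, W); both get the same transform."""
    if rng.random() < 0.5:
        # Horizontal flip along the width axis.
        image = np.flip(image, axis=-1)
        mask = np.flip(mask, axis=-1)
    # Rotate by k * 90 degrees in the (H, W) plane.
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k, axes=(-2, -1))
    mask = np.rot90(mask, k, axes=(-2, -1))
    # flip/rot90 return views with negative strides, which
    # torch.from_numpy does not accept, so copy to contiguous arrays.
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
image = rng.random((7, 64, 64)).astype(np.float32)  # hypothetical 7-channel input
mask = rng.integers(0, 2, size=(3, 64, 64))         # hypothetical 3-class mask
aug_image, aug_mask = random_flip_rot90(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```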