So, you want to create your custom pipeline with fastai

Thanks for that note. Sorry I had not checked in in a while. I am playing with V1 right now and I can do that update. Thanks for the suggestion!


I did the update here. The code gets simpler since you use the PyTorch TensorDataset and then load that into a DataBunch. From there it is very similar, with small changes to the API. Let me know if you have any questions.
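
For anyone following along, the core of that pattern is roughly this (a minimal sketch; the arrays here are random stand-ins for your own data):

import numpy as np
import torch
import torch.utils.data as tdatautils
from fastai.basic_data import DataBunch

# stand-in arrays; replace with your own data
X, y = np.random.rand(100, 10).astype(np.float32), np.random.randint(0, 2, 100)
X_val, y_val = np.random.rand(20, 10).astype(np.float32), np.random.randint(0, 2, 20)

# wrap the arrays as tensors, pair them in TensorDatasets,
# and hand the datasets to a fastai v1 DataBunch
train_ds = tdatautils.TensorDataset(torch.from_numpy(X), torch.from_numpy(y))
valid_ds = tdatautils.TensorDataset(torch.from_numpy(X_val), torch.from_numpy(y_val))
data = DataBunch.create(train_ds, valid_ds, bs=16)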


Hi Bobak,

In your example you seem to be passing numpy arrays as data:

import torch.utils.data as tdatautils

train_ds = tdatautils.TensorDataset(X, y)
valid_ds = tdatautils.TensorDataset(X_val, y_val)
test_ds = tdatautils.TensorDataset(X_test, y_test)

With large datasets this would be difficult to do. Do you know how I can pass in file names with a custom function to read the files?

Thanks,

Two thoughts on this:

First, I have managed to pass some pretty large files in this way with no issue. If you can load up the numpy array, you can cast it to a tensor (you are not putting it on the GPU, you are just wrapping it as a PyTorch tensor). So, I am curious why the data can't be loaded in your case.
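
A quick illustration of what I mean by wrapping (torch.from_numpy shares the underlying buffer rather than copying it, and everything stays on the CPU):

import numpy as np
import torch

X = np.random.rand(100000, 20).astype(np.float32)  # stand-in for a large array
t = torch.from_numpy(X)  # wraps the same memory on the CPU; no copy, no GPU transfer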

Second, I am happy to try to get a function working to wrap the files. Can you give me a sample set of data? Then I will probably see better what is going on with your use case. It does not need to be large, just representative of the use case. It can be filled with random values, just so we can figure out how to handle the files.

I already tried the first approach, but it doesn't work because the inputs are large images at resolutions of up to 5K. The labels are rather smaller: processed as numpy arrays and saved in compressed format. They are not special files, just a bit large. What I would need is an interface where I can pass the list of input image files (a list of strings containing file paths and names), and similarly an interface to pass the list of npz files, along with a custom function to read them:

input = IMG_1.jpg, IMG_2.jpg, … etc., saved in folder: images
labels = IMG_1.npz, IMG_2.npz, … etc., saved in folder: npz_files

Here is my current function, which I call in the __getitem__ method of my dataset class, extended from the PyTorch Dataset class:

from PIL import Image
import numpy as np

def load_data(img_path):
    # derive the label path from the image path: same name, .npz extension, npz_files folder
    gt_path = img_path.replace('.jpg', '.npz').replace('images', 'npz_files')
    img = Image.open(img_path).convert('RGB')
    gt_file = np.load(gt_path)
    target = gt_file['arr_0']
    return img, target
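
For reference, the dataset class around it is roughly like this (simplified; MyImageDataset and file_list are just placeholder names):

from torch.utils.data import Dataset

class MyImageDataset(Dataset):
    # simplified sketch of the class that calls load_data; names are placeholders
    def __init__(self, file_list):
        self.file_list = file_list  # list of image path strings

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        # files are read lazily, one item at a time, so nothing sits in RAM up front
        return load_data(self.file_list[idx])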

To get some sample files, just put some large images in the images folder, read them with numpy, and save smaller versions of them in numpy format. Here is one example:

import cv2, numpy as np
import matplotlib.pyplot as plt

img = plt.imread(img_path)
label = cv2.resize(img, (int(img.shape[1] / 8), int(img.shape[0] / 8)), interpolation=cv2.INTER_CUBIC) * 64
np.savez_compressed("npz_files/label1.npz", label)

I just don't want to load all of them into RAM in advance, because then it runs out of memory.

Thanks so much… I am excited that I may be able to use fastai's features like lr_find for my project. :slight_smile:

Let me know if you need any further assistance from me. I am happy to help.

Is this thread closer to your use case? I have not tried it with v1 yet, so I am not sure how much of the code maps over, but the idea of a MatchedDataSet built from files seems close to what you need.

If this is the case, then it is likely you will benefit from using the built-in ImageDataBunch design that looks at paths or filenames stored in a pd.DataFrame. Let me know if that gets you started down a productive path, or if that is not the right direction.
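
For example, something along these lines (a sketch; the 'name' and 'label' columns are assumptions about how your DataFrame is laid out):

import pandas as pd
from fastai.vision import ImageDataBunch

# assumed DataFrame layout: one column of filenames, one of labels
df = pd.DataFrame({'name': ['IMG_1.jpg', 'IMG_2.jpg'], 'label': [0, 1]})
data = ImageDataBunch.from_df('data/train_data/images', df,
                              fn_col='name', label_col='label', valid_pct=0.2)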

The first thread is for an older version; it can't find the ImageData class.

With the newer version I tried using from_name_func. It does accept it and tries to cache all the labels, but then at the end it throws an error.

dirpathB = "data/train_data/images"

tfms = get_transforms(max_lighting=0.1, max_zoom=1.05, max_warp=0.)

def get_labels(file_path):
    gt_path = file_path.replace('.jpg', '.npz').replace('images', 'label_dir')
    gt_file = np.load(gt_path)
    target = gt_file['arr_0']
    print(gt_path)
    return target

data = ImageDataBunch.from_name_func(dirpathB, train_listB,
                                     label_func=get_labels, ds_tfms=tfms, size=(768, 1024))

error:

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, processor)
     66         if processor is not None: self.processor = processor
     67         self.processor = listify(self.processor)
---> 68         for p in self.processor: p.process(self)
     69         return self
     70 

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, ds)
    281 
    282     def process(self, ds):
--> 283         if self.classes is None: self.create_classes(self.generate_classes(ds.items))
    284         ds.classes = self.classes
    285         ds.c2i = self.c2i

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in generate_classes(self, items)
    325         "Generate classes from `items` by taking the sorted unique values."
    326         classes = set()
--> 327         for c in items: classes = classes.union(set(c))
    328         classes = list(classes)
    329         classes.sort()

TypeError: unhashable type: 'numpy.ndarray'

I haven't tried with DataFrames, suspecting that for labels it expects classes, as its example shows.

Like so many things, once you “get it” the code is actually quite simple.

It appears that an ImageImageList is in the works for this kind of purpose. In the meantime, I was able to get something working that I think comes close.

What is needed here is to extend the ImageItemList class to generate your own labels. So far, in the new v1 version of the library, all the label methods expect you to have lists of names/labels treated as categories. From there, they proceed as expected for a classification problem.

Here is my class that, I think, generates what you are looking for:

## Extend the ImageItemList to include a custom_label method
class CustomImageItemList(ImageItemList):
    def custom_label(self, **kwargs) -> 'LabelList':
        '''Custom label from path and npz directory.'''
        # self.items is an np array of PosixPath objects with each image path
        target_filenames = [Path(str(x).replace('.tif', '.npz')) for x in self.items]
        target_np_array = np.array([np.load(x)['arr_0'] for x in target_filenames], dtype=int)  # can't be dtype='object'

        y = ItemList(items=target_np_array)
        res = self._label_list(x=self, y=y)  # as here: https://github.com/fastai/fastai/blob/master/fastai/data_block.py#L221

        return res
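
A hypothetical usage sketch (the exact chaining can differ between v1 point releases; the .tif extension matches the snippet above):

from pathlib import Path

# hypothetical: assumes .tif images with matching .npz targets side by side
il = CustomImageItemList.from_folder(Path('data/train_data/images'), extensions=['.tif'])
ll = il.custom_label()  # LabelList pairing each image with its npz target
x0, y0 = ll[0]          # inspect one (image, target) pair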

One caveat is that you need the latest fastai version, which includes these two lines in data_block.py:

48      self.items,self.x = items,x
49      if not isinstance(self.items,np.ndarray): self.items = array(self.items, dtype=object)

To try and make it all end-to-end for you, I generated this gist. I don't put the images/npz into separate directories, but I do manage to generate a single batch from the paths used. I think from here you should be able to feed the databunch into a Learner in the canonical ways.
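
For instance, the hand-off might be as small as this (SimpleNet and the MSE loss are placeholders for your own model and objective):

import torch.nn as nn
from fastai.basic_train import Learner

# sketch of the canonical hand-off from a DataBunch to a Learner;
# SimpleNet is a placeholder for your own nn.Module
learn = Learner(data, SimpleNet(), loss_func=nn.MSELoss())
learn.lr_find()
learn.recorder.plot()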

Let me know if anything is not clear.


Thanks a lot Bobak, I really appreciate your time.

I am testing with my model.

I think I can make it run now (edited the posts above).

My model takes an image and generates a smaller version of it as a smaller map.

Will post the result soon.

Cheers


Great to hear. Keep me updated!

Well, I can't really make it work.

Two problems:

  1. I have batch size 1 and images of different sizes, so it doesn't like that.
  2. Even with same-size images it does start lr_find or fit, but crashes partway through.

Here is an example similar to my model (a simple and clean version of my project).

Any help is much appreciated.

I got this to run end-to-end with one key change to line up the input/output of the whole thing. I put a debug statement in at forward() so that I could figure out what x_in and x_out looked like on that pass, and made sure that the conv() lined up with what was expected from the batch:

y = ItemList(items=target_np_array[:, None, :, :])  # add a channel axis so each target is (1, H, W)

Full example is here. Let me know if that maps back to your case.


Thanks so much again.

The training works on images of the same size, but on images of different sizes it throws an exception.

Currently my training loop looks like this:

for epoch in range(0, epochs):
    learn.fit(1)
    abs_error = validate(val_list, model, criterion)

    if abs_error < best_abs_error:
        checkpoint(model, epoch % 3, model_out_path + "C1Net" + "_" + repr(int(abs_error)))
    else:
        checkpoint(model, epoch % 3, model_out_path + "C1Net")

    is_best = abs_error < best_abs_error
    best_abs_error = min(abs_error, best_abs_error)

    print(' * best abs_error {abs_error:.3f} '
          .format(abs_error=best_abs_error))

Also, is there a way to stop it validating inside fit? I.e., I have a different folder for test images and labels, so I want to validate separately.
Also, to save the model after each epoch I have to put the fit inside a loop where, after each completion of fit(1), I can save the model; or is there a better way to do that which fastai offers?

Your CNN model maps a given input size to a fixed output size. If you want it to do something different, you can change your convolutions or change your padding to alter the in/out dimensions. Once you set them up, they are static for the model (hence the need to add padding). When I am trying to debug these types of things I put import pdb; pdb.set_trace() in the forward call and then inspect the input/output to figure out what the model is going to try to use.
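
Something like this inside the module (SimpleNet here is just a placeholder to show where the trace goes):

import pdb
import torch.nn as nn

class SimpleNet(nn.Module):
    # placeholder module; padding=1 with a 3x3 kernel keeps H and W unchanged
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, x_in):
        pdb.set_trace()  # stop here, inspect x_in.shape, then step and check x_out.shape
        x_out = self.conv(x_in)
        return x_out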

It appears that fit is going to check data.empty_val to decide whether to run validation once per epoch. You could leave the validation set empty, or you can build your validation and test data as needed and feed them into the DataBunch.
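
In other words, when assembling the DataBunch you can pass your own splits explicitly (train_ds/valid_ds/test_ds being the datasets from earlier in the thread):

from fastai.basic_data import DataBunch

# sketch: feed explicit validation and test sets into the DataBunch (fastai v1)
data = DataBunch.create(train_ds, valid_ds, test_ds=test_ds, bs=1)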

Don't put the fit inside the loop. Go back and look at the fast.ai lectures about fitting, and then use the fit_one_cycle method with the appropriate callbacks so that you save on best or save each cycle. If you do it in a loop, you lose all the history about momentum and gradients that is very important to getting a good fit.
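
Roughly like this (SaveModelCallback is in fastai v1's fastai.callbacks; 'best_c1net' is just an assumed name):

from fastai.callbacks import SaveModelCallback

# one one-cycle run that saves whenever validation loss improves;
# use every='epoch' instead to save after each epoch
learn.fit_one_cycle(10, max_lr=1e-3,
                    callbacks=[SaveModelCallback(learn, every='improvement',
                                                 monitor='valid_loss', name='best_c1net')])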


Thanks so much Bobak.

I will modify it as you mentioned.


@bfarzin In the custom dataset it seems that you are dropping the 4th channel. I have a dataset of 7 channels and I need all 7 of them. So when I use this, how will data augmentation work, since it is only defined for 3-channel images in fastai? Or is there a way to pass a dataset with my own transforms into a fastai learner?

I don't understand your question. You have a 7-channel image? Maybe if you post an example it would be clear what you are trying to do, and I could help out.

I have 7 channels as input to the CNN. The dataset is similar to the one used in the DSTL competition. For one output mask there are 7 channels as input. It is basically a segmentation problem.

@bfarzin Basically I have a (7 × size × size) input and I want to pass it to a U-Net to get (n_classes × sz × sz) output masks. How do I pass this to fastai? While using standard PyTorch I implemented basic transformations like horizontal flips and 90-degree rotations using numpy. Will fastai's transforms work for 7-channel input, and if not, how can I pass my own transforms to fastai?