So, you want to create your custom pipeline with fastai

I started writing this post while my very first fastai-based pipeline with a custom Dataset, DataLoader and Learner was training in the background. I went through a lot of pain to get there, and I want to help those of you who are going down the same path for the first time.

If you want to build something with fastai that was not covered in Jeremy’s awesome DL course, you’ll need to understand fastai’s structure first, so my first suggestion is to look at this mindmap created by @shaun1 (link to his post).

This will give you the big picture, but it is not everything.

For a pipeline you need:

Dataset

  • Base fastai class - BaseDataset
  • Knows how to open your data by index
  • Example: feed it a list of image filenames and a list of labels, and create a get_x function that returns an image and a get_y function that returns a label in np.array format

DataLoader

  • Base fastai class - DataLoader
  • Knows how to read data from a Dataset and create batches from it

ModelData

  • Base fastai class - ModelData
  • Contains your DataLoaders, the path to your data, transformations etc.

Model

  • Base class - nn.Module
  • Pure PyTorch model of whatever architecture you want

Loss function

  • The function that calculates the loss between the Model output and the target

Optimizer

  • Your favorite optimizer (Adam, SGD, RMSProp etc.)

Stepper

  • Base fastai class - Stepper
  • Makes optimizer steps during the training process

Sampler

  • Base class - torch Sampler
  • Samples your data in some way during training (e.g. balancing classes)

Learner

  • Base fastai class - Learner
  • Knows how to train your Model with the given ModelData, Loss function, Optimizer, Stepper and Sampler
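For orientation, here is a minimal sketch of how these pieces fit together in plain PyTorch, without any fastai wrappers. The names `ToyDataset` and `TinyNet`, the random data and all sizes are made up for illustration. fastai's ModelData, Stepper and Learner wrap roughly this kind of loop, but their actual APIs differ between versions, so treat this as a mental model rather than fastai code.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical dataset: returns one (x, y) pair by index."""
    def __init__(self, n_items=32, n_features=4, n_classes=3):
        gen = torch.Generator().manual_seed(0)
        self.x = torch.randn(n_items, n_features, generator=gen)
        self.y = torch.randint(0, n_classes, (n_items,), generator=gen)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

class TinyNet(nn.Module):
    """Hypothetical model: a very small MLP."""
    def __init__(self, n_features=4, n_classes=3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 8), nn.ReLU(), nn.Linear(8, n_classes)
        )

    def forward(self, x):
        return self.layers(x)

torch.manual_seed(0)
loader = DataLoader(ToyDataset(), batch_size=8, shuffle=True)  # builds batches
model = TinyNet()
loss_fn = nn.CrossEntropyLoss()                                 # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)         # optimizer

losses = []
for xb, yb in loader:            # one pass over the data
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()             # roughly the per-batch job of a Stepper
    losses.append(loss.item())
```

Checking that a loop like this runs on your own Dataset and Model before handing them to a Learner makes later errors much easier to locate.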

Simple steps to create your custom pipeline

NOTE: You may not have to create a new class; an existing class may be fine for you, so check this first.

  1. Create a Dataset, make sure it returns what you want
  2. Create a DataLoader for your dataset
  3. Create a ModelData
  4. Create a Model and try it on some sample data from your DataLoader
  5. Grab some suitable Optimizer, Loss function and create a Learner with them (you can also create your custom Stepper and Sampler for this)
  6. Try to call
  7. Get some errors and get mad :smiley:
  8. Chill out, put ipdb.set_trace() everywhere you can (in my experience the most useful places are the beginning of your model's forward() function and the beginning of the loss function)
  9. Debug until it works
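As a non-interactive complement to the ipdb.set_trace() approach in step 8, the following sketch records the input and output shape of every layer with PyTorch forward hooks. The model and the input size are invented for the example; only `register_forward_hook` is the real PyTorch mechanism being shown.

```python
import torch
from torch import nn

# Hypothetical model, only used to have something to inspect.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1),
)

shapes = []

def record_shapes(module, inputs, output):
    # inputs is a tuple of positional arguments passed to forward()
    shapes.append((module.__class__.__name__, tuple(inputs[0].shape), tuple(output.shape)))

hooks = [layer.register_forward_hook(record_shapes) for layer in model]
with torch.no_grad():
    model(torch.zeros(2, 3, 16, 16))   # fake batch: 2 RGB images of 16x16
for hook in hooks:
    hook.remove()                       # clean up so training is unaffected

for name, in_shape, out_shape in shapes:
    print(name, in_shape, "->", out_shape)
```

Shape mismatches between the model output and the target are a common source of the errors in step 7, so a printout like this is often enough to spot where the dimensions stop lining up.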

I hope this little guide helps some of you. If you have something useful to add, write to me or post in this topic, and I will add it to this post.


I found it pretty hard to navigate the objects that were necessary to do this. So, I produced this Jupyter Notebook as a gist so that others can navigate more easily when they encounter this problem.

  • I picked a toy problem for multi-classification so that you can see how to get numpy data into a learner and produce results.
  • I had a multi-classification problem, so I used that as my example
  • This is a trivial sized MLP to solve the problem.
  • You need to build two objects to make this work:
  1. A ModelData object to wrap your model data from
  2. What I termed a LearnerModelBuilder() which returns a list of models (in our case 1 model)

Let me know if this helps anyone solve their applied problem!!


Hi Bobak,

thanks for sharing your notebook! Being able to use some fastai tools such as the learning rate finder with custom models/data/loss functions would be very helpful. However, with the current version of fastai (1.x), your code doesn’t work anymore. For instance, I think ArraysIndexDataset doesn’t exist anymore. It would be really cool if you could update your code for the current version of fastai.


Thanks for that note. Sorry I had not checked in in a while. I am playing with V1 right now and I can do that update. Thanks for the suggestion!


I did the update here. The code gets simpler since you use the PyTorch TensorDataset and then load that into a DataBunch. From there it is very similar, with small changes to the API. Let me know if you have any questions.


hi Bobak,

in your example you seem to be passing numpy arrays as data

train_ds = tdatautils.TensorDataset(X,y)
valid_ds = tdatautils.TensorDataset(X_val,y_val)
test_ds = tdatautils.TensorDataset(X_test,y_test)

With large datasets that would be difficult to do. Do you know how I can pass in file names with a custom function to read the files?


Two thoughts on this:

First, I have managed to pass some pretty large files in this way with no issue. If you can load up the Numpy array, you can cast to a tensor (you are not putting on the GPU, you are just wrapping as a PyTorch Tensor.) So, I am curious why the data can’t be loaded in your case.
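To make the "just wrapping" point concrete, here is a small sketch with random stand-in data (the array shapes are arbitrary). `torch.from_numpy` creates a CPU tensor that shares memory with the NumPy array instead of copying it, so wrapping costs almost nothing beyond the array you already hold in RAM.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Stand-in data; in practice X and y are your real arrays.
X = np.random.rand(100, 10).astype(np.float32)
y = np.random.randint(0, 3, size=100).astype(np.int64)

X_t = torch.from_numpy(X)   # no copy: the tensor views the same memory as X
y_t = torch.from_numpy(y)
train_ds = TensorDataset(X_t, y_t)

X[0, 0] = 42.0              # change the NumPy array...
x0, y0 = train_ds[0]        # ...and the dataset item reflects it
print(x0[0].item(), X_t.device)
```

The limit of this approach is therefore the size of the NumPy arrays themselves: if they do not fit in RAM, the data has to be read per item instead.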

Second, I am happy to try to get a function working to wrap the files. Can you give me a sample set of data and then I will, probably, see better what is going on with your use case. It does not need to be large, but just representative of the use case. It can be filled with random values just so we can figure out how to handle the files.

I already tried the first approach, but it doesn't work because the inputs are large images at resolutions of up to 5K. The labels are smaller, processed as numpy arrays and saved in compressed format. They are not special files, just a bit large. What I would need is an interface where I can pass the list of input image files (a list of strings containing file paths and names) and, similarly, an interface to pass the list of npz files along with a

input = IMG_1.jpg, IMG_2.jpg, …etc saved in folder: images
labels= IMG_1.npz, IMG_2.npz,…etc saved in folder : npz_files

Here is my current function, which I call in the __getitem__ function of my dataset class that extends the PyTorch Dataset class.

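import numpy as np
from PIL import Image

def load_data(img_path):
    # the label file mirrors the image path: images/IMG_1.jpg -> npz_files/IMG_1.npz
    gt_path = img_path.replace('.jpg','.npz').replace('images','npz_files')
    # the next two lines are reconstructed (the post was garbled here); PIL is assumed
    img = Image.open(img_path).convert('RGB')
    gt_file = np.load(gt_path)
    target = gt_file['arr_0']
    return img, target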

To get some sample files, just put some large images in the images folder, read them into numpy and save smaller versions of them in numpy format. Here is one example.

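import cv2
import matplotlib.pyplot as plt
import numpy as np

img = plt.imread(img_path)
# downscale by a factor of 8 in each dimension to get a smaller label map
label = cv2.resize(img, (int(img.shape[1]/8), int(img.shape[0]/8)), interpolation=cv2.INTER_CUBIC)*64
np.savez_compressed("npz_files/label1.npz", label)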

I just don't want to load all of them into RAM in advance, because then it runs out of RAM.
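The lazy-loading idea described here can be sketched as a map-style dataset that stores only file paths and reads one pair per index. Everything named below (`LazyPairDataset`, the `load_input` parameter, the temporary demo files) is hypothetical, and the demo stores inputs as .npy files only so that it runs without an image library. PyTorch's DataLoader works with any object that implements `__getitem__` and `__len__`, although in a real project you would normally subclass `torch.utils.data.Dataset`.

```python
import os
import tempfile
import numpy as np

class LazyPairDataset:
    """Hypothetical map-style dataset: keeps paths only, loads one pair per index."""
    def __init__(self, input_paths, label_paths, load_input=np.load):
        assert len(input_paths) == len(label_paths)
        self.input_paths = list(input_paths)
        self.label_paths = list(label_paths)
        self.load_input = load_input   # swap in an image reader for .jpg files

    def __len__(self):
        return len(self.input_paths)

    def __getitem__(self, idx):
        x = self.load_input(self.input_paths[idx])
        with np.load(self.label_paths[idx]) as gt_file:
            y = gt_file["arr_0"]       # default key written by np.savez_compressed
        return x, y

# Demo with small random files standing in for the real images and labels.
tmp_dir = tempfile.mkdtemp()
input_paths, label_paths = [], []
for i in range(3):
    img = np.random.rand(16 * (i + 1), 24, 3).astype(np.float32)  # sizes differ per item
    in_path = os.path.join(tmp_dir, f"IMG_{i}.npy")
    gt_path = os.path.join(tmp_dir, f"IMG_{i}.npz")
    np.save(in_path, img)
    np.savez_compressed(gt_path, img[::8, ::8, 0])                # 8x smaller label map
    input_paths.append(in_path)
    label_paths.append(gt_path)

ds = LazyPairDataset(input_paths, label_paths)
x, y = ds[2]
print(len(ds), x.shape, y.shape)
```

Only the item being requested is in memory at any time, so RAM use depends on the batch size rather than on the size of the whole dataset.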

Thanks so much… I am excited if i can use fastai’s features like lr_find for my project. :slight_smile:

Let me know if you need any further assistance from me. I am happy to help.

Is this thread closer to your use case? I have not tried this with v1 yet, so not sure how much of the code maps over, but the idea of a MatchedDataSet from files seems closer to your use case.

If this is the case, then it is likely you will benefit from using the built-in ImageDataBunch design that looks at paths or filenames stored in a pd.DataFrame. Let me know if that gets you started down a productive path or if that is not the right direction.

The first thread is for an older version; it can't find the ImageData class.

With the newer version I tried using from_name_func. It does accept it and tries to cache all labels, but then at the end it throws an error.

dirpathB = "data/train_data/images"

tfms=get_transforms(max_lighting=0.1, max_zoom=1.05, max_warp=0.)

def get_labels(file_path):
  gt_path = file_path.replace('.jpg','.npz').replace('images','label_dir')
  gt_file = np.load(gt_path)  # reconstructed: this line was lost from the post
  target = gt_file['arr_0']
  print(gt_path)
  return target

data = ImageDataBunch.from_name_func(dirpathB, train_listB,
                                     label_func=get_labels, ds_tfms=tfms, size=(768,1024))


/usr/local/lib/python3.6/dist-packages/fastai/ in process(self, processor)
     66         if processor is not None: self.processor = processor
     67         self.processor = listify(self.processor)
---> 68         for p in self.processor: p.process(self)
     69         return self

/usr/local/lib/python3.6/dist-packages/fastai/ in process(self, ds)
    282     def process(self, ds):
--> 283         if self.classes is None: self.create_classes(self.generate_classes(ds.items))
    284         ds.classes = self.classes
    285         ds.c2i = self.c2i

/usr/local/lib/python3.6/dist-packages/fastai/ in generate_classes(self, items)
    325         "Generate classes from `items` by taking the sorted unique values."
    326         classes = set()
--> 327         for c in items: classes = classes.union(set(c))
    328         classes = list(classes)
    329         classes.sort()

TypeError: unhashable type: 'numpy.ndarray'

I haven’t tried with DataFrames, since I suspect it expects classes as labels, as its example shows.
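The TypeError in the traceback can be reproduced without fastai. The traceback shows `generate_classes` calling `set(c)` on each label; the sketch below (with made-up label values) shows why that works for a 1-D array of class ids but fails when the label is a 2-D map, because iterating a 2-D array yields row arrays, which are not hashable.

```python
import numpy as np

label_1d = np.array([2, 0, 2])   # class ids: iterating yields hashable scalars
label_2d = np.zeros((2, 3))      # map-like target: iterating yields row arrays

classes = set(label_1d.tolist())

try:
    set(label_2d)
    error_message = None
except TypeError as err:
    error_message = str(err)

print(classes, error_message)
```

So the label function itself is fine; the problem is that this code path treats every label as a collection of categories.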

Like so many things, once you “get it” the code is actually quite simple.

It appears that an ImageImageList is in the works for this kind of purpose. In the meantime, I was able to get something working that I think would be close.

What is needed for this is to extend the ImageItemList class to generate your own labels. So far, in the new V1 version of the library, all the label methods expect you to have lists or names/labels as categories or names. From there, they seem to proceed as expected for a classification problem.

Here is my type that, I think, is generating what you are looking for:

## Extend the ImageItemList to include a custom_label method
class CustomImageItemList(ImageItemList):
    def custom_label(self, **kwargs) -> 'LabelList':
        '''custom label from path and npz directory'''
        # self.items is an np array of PosixPath objects, one per image path
        target_filenames = [Path(str(x).replace('.tif','.npz')) for x in self.items]
        # load every target and stack them into one integer array (dtype can't be 'object')
        target_np_array = np.array([np.load(x)['arr_0'] for x in target_filenames], dtype=int)
        y = ItemList(items=target_np_array)
        res = self._label_list(x=self, y=y)
        return res

One caveat is that you need the latest fastai version, which includes these two lines:

self.items,self.x = items,x
if not isinstance(self.items,np.ndarray): self.items = array(self.items, dtype=object)

To try and make it all end-to-end for you, I generated this gist. I don’t put the images/npz files into separate directories, but I do manage to generate a single batch from the paths used. I think from here, you should be able to feed the databunch into a learner() in the canonical ways.

Let me know if anything is not clear.


Thanks a lot Bobak, i really appreciate your time.

I am testing with my model.

i think i can make it run now. (edited above posts).

My model takes an image and generates a smaller version of it as a smaller map.

will post soon the result.



Great to hear. keep me updated!

well, i can’t really make it work.

2 Problems:

  1. I have batch size 1 and images of different sizes, and it doesn't like that.
  2. Even with same-size images it starts lr_find or fit but crashes partway through.

here is an example similar to my model (a simple and clean version of my project).

Any help much appreciated.

I got this to run end-to-end with one key change to line up the input/output of the whole thing. I put a debug in at the forward() so that I could figure out what the x_in and the x_out looked like on that pass and made sure that the conv() lined up with what was expected from the batch.

y = ItemList(items=target_np_array[:,None,:,:])

Full example is here. Let me know if that maps back to your case.
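For readers unfamiliar with the `[:,None,:,:]` indexing used above, here is a small NumPy sketch (the array sizes are arbitrary). Indexing with `None` inserts an axis of length 1, which turns a stack of 2-D targets into the (batch, channel, height, width) layout that a convolutional model with one output channel produces.

```python
import numpy as np

targets = np.zeros((4, 32, 48))          # (batch, height, width)
with_channel = targets[:, None, :, :]    # (batch, 1, height, width)
same_thing = np.expand_dims(targets, axis=1)

print(targets.shape, "->", with_channel.shape)
```

Without the extra axis, the target and the model output differ in number of dimensions, and the loss function either fails or broadcasts in a way you did not intend.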


Thanks so much again.

The training works on images of the same size, but on images of different sizes it throws an exception.

currently my training looks like this:

for epoch in range(0, epochs):
    abs_error = validate(val_list, model, criterion)
    is_best = abs_error < best_abs_error
    if is_best:
        # keep both an error-tagged copy and a "latest best" copy
        checkpoint(model, epoch%3, model_out_path+"C1Net"+"_"+repr(int(abs_error)))
        checkpoint(model, epoch%3, model_out_path+"C1Net")

    best_abs_error = min(abs_error, best_abs_error)
    # the original line was cut off after the format string; the .format call is an assumed completion
    print(' * best abs_error {abs_error:.3f} '.format(abs_error=best_abs_error))

Also, is there a way to stop it from validating inside fit? I have a different folder for test images and labels, so I want to validate separately.
Also, to save the model after each epoch, do I have to put fit inside a loop so that I can save the model after each fit(1) completes, or does fastai offer a better way to do that?

Your CNN model maps from the same size down to the same size. If you want it to do something different you can change your convolutions or change your padding to change the in/out dimensions. Once you set them up, they are static for the model (hence the need to add padding.) When I am trying to debug these types of things I will put in import pdb;pdb.set_trace() in the forward call and then inspect the input/output to figure out what the model is going to try and use.
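The in/out dimensions mentioned here follow the standard convolution arithmetic that PyTorch documents for `nn.Conv2d`. The helper below is a hypothetical convenience function (not part of PyTorch or fastai) that applies that formula along one spatial dimension, so you can check by hand what each layer will produce before stepping through forward() in the debugger.

```python
def conv_out_size(n_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one spatial dimension."""
    return (n_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# A 3x3 convolution with padding 1 keeps the size, without padding it shrinks by 2.
same = conv_out_size(1024, kernel_size=3, padding=1)
shrunk = conv_out_size(1024, kernel_size=3)

# Three stride-2 convolutions reduce a side of 768 by a factor of 8.
size = 768
for _ in range(3):
    size = conv_out_size(size, kernel_size=3, stride=2, padding=1)

print(same, shrunk, size)
```

Comparing these numbers with the shape of your targets shows directly whether the padding or stride needs to change.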

It appears that fit checks data.empty_val to decide whether to run validation once per epoch. You could leave the validation set empty, or you can build your validation and test data as needed and feed them into the DataBunch.

Don’t put the fit inside the loop. Go back and look at the lectures about fitting and then use the fit_one_cycle method with the appropriate callbacks so that you save on best or save each cycle. If you do it in a loop, you lose all the history about momentum and gradients that is very important to getting a good fit.
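To illustrate what "save on best" means as a callback, here is a library-independent sketch. `SaveOnBest` and its `on_epoch_end` method are hypothetical names written for this post, not fastai's API, and the metric values are made up. fastai v1 ships its own `SaveModelCallback` in `fastai.callbacks` for this purpose, so check the documentation of your installed version for its exact arguments.

```python
class SaveOnBest:
    """Hypothetical callback-style helper: tracks the best metric and triggers a save."""
    def __init__(self, save_fn, mode="min"):
        self.save_fn = save_fn
        if mode == "min":
            self.better = lambda new, best: new < best
        else:
            self.better = lambda new, best: new > best
        self.best = None

    def on_epoch_end(self, epoch, metric):
        # Save only when the monitored metric improved.
        if self.best is None or self.better(metric, self.best):
            self.best = metric
            self.save_fn(epoch, metric)
            return True
        return False

saved = []  # stands in for writing checkpoints to disk
tracker = SaveOnBest(save_fn=lambda epoch, metric: saved.append((epoch, metric)))

for epoch, abs_error in enumerate([12.0, 9.5, 10.1, 7.2]):  # made-up validation errors
    tracker.on_epoch_end(epoch, abs_error)

print(saved)
```

The point of the callback design is that the training loop itself is started once and keeps running, while the saving logic is called from inside it at the end of every epoch.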


Thanks so much Bobak.

I will modify it as you mentioned.


@bfarzin In the custom Dataset it seems that you are dropping the 4th channel. I have a dataset with 7 channels and I need all 7 of them. If I use this, how will data augmentation work, since it is only defined for 3-channel images? Or is there a way to pass a dataset with my own transforms into a fastai learner?
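One possible direction, sketched under the assumption that the data is available as (channels, height, width) NumPy arrays with a (height, width) target: apply channel-agnostic transforms yourself inside the dataset's `__getitem__`, before the data reaches the learner. The function `random_flip` below is a hypothetical example of such a transform, not a fastai function, and it works for any number of channels because it only touches the spatial axes.

```python
import numpy as np

def random_flip(image, target, rng, p=0.5):
    """Flip a (channels, height, width) image and its (height, width) target left-right together."""
    if rng.random() < p:
        image = image[:, :, ::-1].copy()   # copy() avoids negative strides downstream
        target = target[:, ::-1].copy()
    return image, target

rng = np.random.default_rng(0)
image = np.arange(7 * 2 * 3, dtype=np.float32).reshape(7, 2, 3)   # 7-channel toy image
target = np.arange(2 * 3).reshape(2, 3)

flipped_image, flipped_target = random_flip(image, target, rng, p=1.0)
print(flipped_image.shape, flipped_target.shape)
```

Applying the same random decision to the image and the target keeps them aligned, which matters whenever the label is itself a spatial map.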