Custom ItemList

jcreinhold · November 14, 2018, 6:25am

I tried to implement a data loader with the new ItemList class. My goal is to take in two 3d volumes and extract corresponding slices from them. I have some initial work here. I can get the ItemList to extract the two source and target images with the same transformations applied to each of them, which is great. However, I don’t know if I am using the ItemList class as desired, since I cannot create an ImageDataBunch nor can I instantiate a Learner class (see the link for the errors that I receive).

I believe this is due to calling ItemList(…) instead of one of the class methods. However, it is not clear to me which class method would create the functionality I need for this specific task.

Thanks for the help and great work!

sgugger · November 14, 2018, 2:21pm

ImageDataBunch.create_from_ll takes a LabelLists object, so you’d need to split your srcd, tgtd into training and validation, then call LabelList(srcd, tgtd, tfms, tfm_y=True) on each of them before creating that LabelLists.

It would be more efficient to fully use the data block API; I’ll update the docs today to include how to customize ItemList.

jcreinhold · November 17, 2018, 5:52pm

I was able to create an ImageDataBunch using the instructions by @sgugger (above). See an implementation here.

FYI, I tried to get this working using only the data block API (e.g., ItemList.from_folder().random_split_by_pct()...), but I got stuck getting label_from_... to work for the use case shown in the link. If anyone has suggestions, they would be appreciated. Otherwise, putting it together like this was pretty easy (after figuring out how all the blocks go together).

For users who want to import NIfTI files for training with fastai, I also wrapped all this up in a small repo called niftidataset. If the package is installed, a user can create a databunch as shown in the ipynb by calling niftidataset.fastai.niidatabunch(...). Some 3D data processing transforms that interface with fastai are also included (see the niftidataset.fastai module).

Thanks for the help

nextM · November 23, 2018, 12:43pm

Hi,

I created a custom DataBunch that loads embeddings, continuous vars and images, into a model which processes them appropriately. The trick was overriding the get function to return a class based on ItemBase with self.data populated as a list of the input lists (I didn’t see this in the docs so got it from the source code)

class QuickLine(ItemBase):
    def __init__(self, cats, conts, imgs):
        self.cats,self.conts,self.imgs = cats,conts,imgs
        self.data = [tensor(cats), tensor(conts), tensor(imgs)]

But then when I override DataLoader.proc_batch like so (where 2 is the index of img part of my ItemBase.data):

def proc_batch(self,b:Tensor)->Tensor:
    "Process batch `b` of `TensorImage`."
    b = to_device(b, self.device)
    for f in listify(self.tfms): b[2] = f(b[2])
    return b

…then all sorts of things break in the transforms. What I am trying to do is make sure that image augmentation is applied in the custom dataloader.

nextM · November 23, 2018, 2:07pm

I get this error:

'Tensor' object has no attribute 'affine'

Please could you help? I would like to eventually wrap this up in a custom data loader & model generator which I will do a PR for

jcreinhold · November 23, 2018, 5:28pm

For image data augmentation, you will need to first cast your image to the fastai.vision.Image class. I believe this will resolve the part of the problem having to do with the tensor object not having the affine attribute.

shaun1 · November 26, 2018, 2:15pm

Hi,

Do you load different types of data, each processed separately, into the same databunch? If so, could you share your code? I will be working on a problem that has different data components, and this would be very helpful to me.

Thanks.

sgugger · November 26, 2018, 4:41pm

As I said earlier, there is now a full tutorial on how to create a custom ItemList in the docs.

marcmuc · November 26, 2018, 6:01pm

Thanks, that is extremely helpful!

jbmaxwell · December 5, 2018, 12:59am

I’ve been following this tutorial to build an image-to-image translation model. My code looks essentially the same (as the ImageTupleList example), but my instantiated ImageTupleList is giving me errors when I try to actually use it, e.g., 'ImageTupleList' object has no attribute 'databunch', or 'ImageTupleList' object has no attribute 'transforms'. I really don’t understand what’s wrong. It almost seems as though it isn’t a proper subclass. Anything obvious come to mind?

sgugger · December 5, 2018, 5:02am

It’s hard to say without any code.
Note that you need to use it in the data block API as usual with a split, label, then transform and databunch call.

jbmaxwell · December 5, 2018, 4:41pm

Ah, okay, I’m not concerned about labels, so I left that out. I’m getting a little further now with label_from_folder() included!

jbmaxwell · December 5, 2018, 6:10pm

Though I’ve got a little farther, I’m hitting an index out of bounds error when calling show_batch() on my DataBunch. I’m guessing something is wrong with the data or folder structure I’m giving it…

But how would the ImageTupleList in that tutorial expect to see the data organized on disk? My problem is a paired data image-to-image translation problem, and I have my data organized as:

data/
├─ sketch2rep/
│ ├── train/
│ │ ├── sketch/ (images…)
│ │ ├── rep/ (images…)
│ ├── valid
│ │ ├── sketch/ (images…)
│ │ ├── rep/ (images…)
│ ├── test
│ │ ├── sketch/ (images…)
│ │ ├── rep/ (images…)

That is, I’ve created my image corpus with a folder structure that reflects the structure of the data. However, ImageTupleList.from_folders() expects two folders as inputs, which I’ve assumed must be my “sketch” and “rep” folders (i.e., the paired images that will be sent to the model). But then that would mean giving it only one of train, or valid… So what, then, does split_by_folder() do?

The code is just:

data = (ImageTupleList.from_folders(PATH, "train/sketch", "train/rep")
    .split_by_folder()
    .label_from_folder()
    .databunch(bs=4))
data.show_batch()

I don’t have any transforms because I have yet to write reasonable ones for my domain (image-based music representations). There are possibilities, but they’re not the normal image transformations.

jbmaxwell · December 6, 2018, 12:27am

Trying to trace python != fun times, but I’ve tracked it down to a kind of weird fit between (how I’m understanding) the ImageTupleList and split_by_folder(), since the latter is explicitly looking for train and valid, whereas I’m only giving it my “sketch” and “rep” from the train folder. So, I’m confused. How is ImageTupleList intended to be used in a cycleGAN-like context?
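To illustrate the mismatch described above, here is a minimal pure-Python sketch of a folder-based split. It is not fastai's implementation: split_by_top_folder is a hypothetical helper, and it assumes the split is decided by the first folder name below the root path. Under that assumption, a list built only from train/sketch leaves the validation side empty, which would be consistent with an out-of-bounds error later on.

```python
from pathlib import Path

def split_by_top_folder(root, paths, train="train", valid="valid"):
    """Partition paths by the first folder name below root.

    Hypothetical helper for illustration only; not part of fastai.
    """
    root = Path(root)
    train_items, valid_items = [], []
    for p in map(Path, paths):
        top = p.relative_to(root).parts[0]  # e.g. "train" or "valid"
        if top == train:
            train_items.append(p)
        elif top == valid:
            valid_items.append(p)
    return train_items, valid_items

root = Path("data/sketch2rep")
# Items collected only from train/sketch, as in the from_folders call above
only_train = [root / "train" / "sketch" / f"{i}.png" for i in range(3)]
tr, va = split_by_top_folder(root, only_train)
print(len(tr), len(va))  # 3 0: nothing lands in the validation set
```

If this assumption holds, the item list has to contain files from both the train and valid folders before the split is applied.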

jbmaxwell · December 6, 2018, 5:28pm

Okay, I rebuilt my data so that my “sketches” and “reps” are basically separate data sets, each with their own train and valid folders. This gets me a little further, but I hit this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-40-0f553cce9c41> in <module>
  5         .label_from_folder()
  6         .databunch(bs=16))
----> 7 data.show_batch()

~/anaconda3/envs/fastai/lib/python3.7/site-packages/fastai/basic_data.py in show_batch(self, rows, ds_type, **kwargs)
149     def show_batch(self, rows:int=5, ds_type:DatasetType=DatasetType.Train, **kwargs)->None:
150         "Show a batch of data in `ds_type` on a few `rows`."
--> 151         x,y = self.one_batch(ds_type, True, True)
152         if self.train_ds.x._square_show: rows = rows ** 2
153         xs = [self.train_ds.x.reconstruct(grab_idx(x, i, self._batch_first)) for i in range(rows)]

~/anaconda3/envs/fastai/lib/python3.7/site-packages/fastai/basic_data.py in one_batch(self, ds_type, detach, denorm)
132         w = self.num_workers
133         self.num_workers = 0
--> 134         try:     x,y = next(iter(dl))
135         finally: self.num_workers = w
136         if detach: x,y = to_detach(x),to_detach(y)

~/anaconda3/envs/fastai/lib/python3.7/site-packages/fastai/basic_data.py in __iter__(self)
 68         for b in self.dl:
 69             y = b[1][0] if is_listy(b[1]) else b[1]
---> 70             if not self.skip_size1 or y.size(0) != 1: yield self.proc_batch(b)
 71 
 72     @classmethod

AttributeError: 'str' object has no attribute 'size'

It seems clear that it’s expecting an image but getting a string. My ImageTupleList is basically identical to the tutorial, except that my get() uses i directly on itemsB, rather than getting a random image:

def get(self, i):
    # get the ith sketch
    img1 = super().get(i)
    # and the corresponding ith rep
    fn = self.itemsB[i]
    return ImageTuple(img1, open_image(fn))

Any thoughts on why it would still have a string (presumably a path) rather than an image, at this stage? My get() is clearly returning an ImageTuple, but I can’t see whether that is actually called by the dataloader (I’m quite new to python, so finding my way through the source is a bit tricky).

UPDATE: It seems like train_dl and valid_dl are somehow not the correct types. But how can that be???

sgugger · December 6, 2018, 9:48pm

At this stage, your xs are fine; it’s the ys that aren’t. You should pull out the first few items of your training set and check what they look like.

jbmaxwell · December 7, 2018, 12:00am

I’m really not sure what I should be looking for, I’m afraid. Isn’t (x,y) basically determined inside the DataBunch? From data.get(3) (for example), I can “show()” image1 and image2, and also the result of the to_one() call, and these all look as I would expect (except that my pairs aren’t correct – I guess I have to sort my lists somewhere?).
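On the pair-ordering point, one possible approach is to build both file lists from the names the two folders share and sort them once, so that index i refers to the same pair in each list. The sketch below uses only the standard library; paired_files is a hypothetical helper, and it assumes that a sketch and its rep have identical file names, which the thread does not confirm.

```python
from pathlib import Path
import tempfile

def paired_files(dir_a, dir_b):
    """Return (a, b) path pairs whose file names match, sorted by name.

    Hypothetical helper; assumes paired images share a file name.
    """
    a = {p.name: p for p in Path(dir_a).iterdir() if p.is_file()}
    b = {p.name: p for p in Path(dir_b).iterdir() if p.is_file()}
    common = sorted(a.keys() & b.keys())  # names present in both folders
    return [(a[n], b[n]) for n in common]

# Small self-contained demo with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    sketch, rep = Path(tmp, "sketch"), Path(tmp, "rep")
    sketch.mkdir()
    rep.mkdir()
    for name in ["b.png", "a.png", "c.png"]:
        (sketch / name).touch()
        (rep / name).touch()
    pairs = paired_files(sketch, rep)
    pair_names = [(s.name, r.name) for s, r in pairs]

print(pair_names)
```

The first and second elements of each pair could then populate the two item lists, so that a lookup like self.itemsB[i] in get() returns the partner of item i.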

What confuses me is that the tutorial suggests that we’re required to implement show_xys() because it is required by show_batch(), but calling show_batch() on my data object never reaches my show_xys() implementation.

Sorry for what must seem like a bunch of stupid questions, but there seems to be a lot hidden by fastai, which makes it difficult to know what’s supposed to be happening.

jcreinhold · December 7, 2018, 2:27am

Not sure if this will help, but I was following this and tried to use the CycleGAN with a custom itemlist which seems to be relevant to the current thread. Here is my work so far.

I am able to get the ImageTupleList working (I call it a TIFFTupleList since I am using it to open a specific type of 1-channel TIFF image). Everything seems to work fine, even when I call learner.find_lr (so perhaps this is helpful as a code example @jbmaxwell). However, I seem to be running into an issue when I use learner.fit or learner.fit_one_cycle (see the bottom of the above link). .fit and .fit_one_cycle only run for most of one epoch before failing (the training part seems to run, but then it fails when it tries to process a validation batch).

I initially tried to follow the original implementation provided by @sgugger here, but when I used .split_by_idx([]) and ran .fit I received the error: IndexError: index 0 is out of bounds for axis 0 with size 0 when validation tried to run.

Any ideas on how to get this up and running? Happy to provide additional details as necessary.

Thanks for the help

sgugger · December 7, 2018, 2:32am

For that bug, you should just set data.valid_dl=None. Note that GANs haven’t been taught in the course yet and this is still in development; there will be an easier way to get this running before the next course.

@jbmaxwell The show_xys method isn’t called because pytorch didn’t manage to create the batch. From your error trace, the problem seems to be in the ys, which is why I suggested you look at them to debug further. You should check data.train_ds.y.items as well as data.train_ds.y[0] (and a few values other than 0).
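A small helper can make that check quicker. This is a generic sketch, not a fastai function: describe_items is a hypothetical name, and the list of strings below only stands in for data.train_ds.y. It reports the type of a few entries so that a plain str shows up immediately.

```python
def describe_items(items, indices=(0, 1, 2)):
    """Return (index, type name, short repr) for a few entries of an indexable.

    Hypothetical debugging helper; works on any object with len() and [].
    """
    report = []
    for i in indices:
        if i < len(items):
            item = items[i]
            report.append((i, type(item).__name__, repr(item)[:40]))
    return report

labels = ["sketch", "sketch", "rep"]  # stand-in for data.train_ds.y
report = describe_items(labels)
for row in report:
    print(row)
```

If the reported type is str rather than something tensor-like, that would match the 'str' object has no attribute 'size' error in the traceback above.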

jcreinhold · December 7, 2018, 2:50am

Manually setting valid_dl=None worked, and I was able to use both .fit and .fit_one_cycle with that one change. I updated the gist, FWIW. Thanks for the help!