How to iterate using Data API without having to reload dataset

tcwalther · January 13, 2020, 1:02pm

I’m trying to understand how to best work with fast.ai’s data API. In my understanding, the purpose of subclassing ItemBase is so that one can, for example, implement a plotting method for said item in that subclass. Similarly, subclassing ItemList allows implementing show_xys() and show_xyzs(). That’s very handy when using functions such as data.show_batch().

Unfortunately, that also means that I always have to recreate my dataset when I iterate on these methods. It would be much more convenient to have my data in one place, and my plotting functions in a separate class. Then I could substitute my plotting functions without having to reload my data.

I vaguely remember Jeremy saying that fast.ai v2 would make a lot more use of delegation. I guess this would be one such case, where I could simply swap out the delegate of a plotting function, for example.

Is my understanding of the limitation of the v1 API correct, or does it sound like I’m using it wrongly? If I have correctly described a limitation of the v1 data API, is this indeed something that you are trying to address in v2?

Edit: I have worked around this limitation with the following coding style:

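import matplotlib.pyplot as plt
from fastai.core import ItemBase  # fastai v1

class MyItemBase(ItemBase):
    #...
    # Delegate to the module-level _plot, which is looked up at call time.
    def plot(self, *args, **kwargs): return _plot(self.data, *args, **kwargs)

def _plot(data, *args, **kwargs):
    # ...
    plt.show()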

That way, I can just re-execute the cell, which redefines _plot, and my existing MyItemBase instances pick up the new version. The same works for show_xys etc. I do wonder, though, whether there is a more elegant way.
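
To show why this works, here is a minimal sketch that does not depend on fast.ai. Item is a hypothetical stand-in for an ItemBase subclass, and _plot returns a string instead of drawing anything so the effect is easy to check. The name _plot is resolved in the module globals each time plot() is called, so redefining it affects instances that already exist.

```python
class Item:
    """Hypothetical stand-in for an ItemBase subclass (not a fast.ai class)."""
    def __init__(self, data):
        self.data = data

    def plot(self, *args, **kwargs):
        # _plot is looked up in the module globals at call time, not at class definition time.
        return _plot(self.data, *args, **kwargs)


def _plot(data, *args, **kwargs):
    # First version of the plotting helper.
    return f"v1: {data}"


item = Item([1, 2, 3])  # imagine this is the expensive dataset load
first = item.plot()

# Simulate re-executing the notebook cell that defines _plot.
def _plot(data, *args, **kwargs):
    return f"v2: {data}"

second = item.plot()  # same instance, new behaviour
print(first)
print(second)
```

The trade-off is that the method body stays fixed: only what lives in the module-level function can be iterated on without rebuilding the items.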

sgugger · January 13, 2020, 2:37pm

You are correct on all counts. With the dispatch system in v2, you implement a version of show_batch or show_results for your new type, so you don’t need to recreate your dataset at each new iteration (as long as it returns batches of that new type).
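
As a rough illustration of the idea only, here is a sketch that uses Python's standard functools.singledispatch as a stand-in. This is not fast.ai v2's actual dispatch mechanism, and SpectrogramBatch is a hypothetical type invented for the example. The display function lives outside the data type and is selected by the type of the batch, so registering a new implementation does not touch the data that is already loaded.

```python
from functools import singledispatch


class SpectrogramBatch(list):
    """Hypothetical batch type; stands in for whatever your dataset returns."""


@singledispatch
def show_batch(batch):
    # Fallback used for any type without a registered implementation.
    return f"generic batch of {len(batch)} items"


@show_batch.register(SpectrogramBatch)
def _show_spectrograms_v1(batch):
    return f"v1: {len(batch)} spectrograms"


batch = SpectrogramBatch([[0.1, 0.2], [0.3, 0.4]])  # loaded once
before = show_batch(batch)

# Iterate on the display code: registering again for the same type
# replaces the earlier implementation, and the batch is left untouched.
@show_batch.register(SpectrogramBatch)
def _show_spectrograms_v2(batch):
    return f"v2: {len(batch)} spectrograms, first row {batch[0]}"

after = show_batch(batch)
fallback = show_batch([1, 2])
print(before, after, fallback, sep="\n")
```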

tcwalther · January 14, 2020, 5:04pm

Thank you very much for clarifying.