SOURCE CODE: Mid-Level API

This is the new meeting URL :slight_smile:

1 Like

Thanks! :slight_smile:

Hello again everyone!
So, for the past 7-8 hours today, @barnacl, @arora_aman, @init_27, @rsomani95, @ganesh.bhat and I have been digging through the source code for the datablocks API and trying to understand it all. Here’s what we’ve covered (not in this order):

  1. Transforms
  2. Pipeline
  3. TfmdLists
  4. Datasets
  5. DataLoader
  6. DataBlock
  7. TransformBlock, etc.
  8. A lot of fastai2 utility functions
  9. FilteredBase
  10. TfmdDL (partially, we were saturated at this point and will continue from here on 29th March, 7:30 AM, IST)

These sessions were meant to be complete, in-depth, line-by-line walkthroughs of all the above items. Make sure you watch them in the order @arora_aman posts them in, because context builds over each video and the amount of knowledge assumed of the viewer keeps increasing. Nonetheless, if you’re already familiar with the topics in particular videos, feel free to skip them.

We recorded all our sessions and @arora_aman will be posting those videos soon enough.

We also noted down a few questions and observations we had, which @arora_aman will again be posting soon enough.

Hopefully, this helps everyone!

10 Likes

Also, a few days ago I went ahead and commented every line of code in DataLoader, and I’m posting it here. This will probably be slightly out of sync with the current version of the library and will keep drifting further as the library evolves, but the ideas should stay pretty close to what they are now. Here it is:
(it’s a Wiki, so anyone can edit this with clearer explanations)
(also, you might have to do a bit of homework before completely understanding this!)

@funcs_kwargs # Make delegation work
class DataLoader(GetAttr):
    _noop_methods = 'wif before_iter after_item before_batch after_batch after_iter'.split()
    for o in _noop_methods:
        exec(f"def {o}(self, x=None, *args, **kwargs): return x")
    # Define each of the _noop_methods as identity transforms
    _methods = _noop_methods + 'create_batches create_item create_batch retain \
        get_idxs sample shuffle_fn do_batch create_batch'.split()
    _default = 'dataset'
    def __init__(self, dataset=None, bs=None, num_workers=0, pin_memory=False, timeout=0, batch_size=None,
                 shuffle=False, drop_last=False, indexed=None, n=None, device=None, **kwargs):
        if batch_size is not None: bs = batch_size # PyTorch compatibility
        assert not (bs is None and drop_last)
        if indexed is None: indexed = dataset is not None and hasattr(dataset,'__getitem__')
        # indexed will be true if the dataset exists and can be indexed into
        if n is None:
            try: n = len(dataset)
            except TypeError: pass
        # n signifies the length of the dataset. This can be set to be smaller than the actual length
        # This is probably to allow conveniently using a subset of the dataset
        store_attr(self, 'dataset,bs,shuffle,drop_last,indexed,n,pin_memory,timeout,device')
        # convenient way to do `self.x = x` for all x in the above string
        self.rng,self.nw,self.offs = random.Random(),1,0
        # The RNG will be used later in the module
        self.fake_l = _FakeLoader(self, pin_memory, num_workers, timeout)
        # The source code for _FakeLoader is a bit complicated. 
        # The useful bit is the __iter__ function:
        # return iter(self.create_batches(self.sample()))

    def __len__(self):
        if self.n is None: raise TypeError
        if self.bs is None: return self.n
        return self.n//self.bs + (0 if self.drop_last or self.n%self.bs==0 else 1)
        # pretty self explanatory
        # the length gets divided by batch_size because that's the length of 
        # the dataloader as opposed to the dataset

    def get_idxs(self):
        idxs = Inf.count if self.indexed else Inf.nones
        # Inf.count = itertools.count(0) : a counter (iterator) starting from 0
        # Inf.nones = itertools.cycle([None]): indefinitely returns None on each next() call
        if self.n is not None: idxs = list(itertools.islice(idxs, self.n))
        # list(itertools.islice(idxs, self.n)) ~ list(range(0, self.n))
        if self.shuffle: idxs = self.shuffle_fn(idxs)
        # basic shuffling of indexes
        return idxs

    def sample(self):
        idxs = self.get_idxs()
        return (b for i,b in enumerate(idxs) if i//(self.bs or 1)%self.nw==self.offs)
        # with the current defaults this sends back a generator over idxs
        # pretty much as it is, which looks puzzling at first
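        # (Likely explanation: nw/offs default to 1/0 in the main process, so this is
        #  a pass-through there; when workers are used, the worker-init function
        #  overwrites nw and offs per worker, so each worker only yields its own
        #  share of the batches.)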

    def __iter__(self):
        self.randomize()
        # Reseed random number generator
        self.before_iter()
        for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
            # _loaders = (_MultiProcessingDataLoaderIter, _SingleProcessDataLoaderIter)
            # These are imported from torch.utils.data.dataloader
            # They have no documentation, but I guess this just creates an iterator
            if self.device is not None: b = to_device(b, self.device)
            yield self.after_batch(b)
            # Above 2 lines are pretty standard:
            # Get the batch, process it, yield
        self.after_iter()
        if hasattr(self, 'it'): delattr(self, 'it')
        # Cleanup

    def create_batches(self, samps):
        # `samps` is a sequence of indexes into the dataset (items, not batches),
        # which may or may not be shuffled
        self.it = iter(self.dataset) if self.dataset is not None else None
        res = filter(lambda o:o is not None, map(self.do_item, samps))
        # res is a generator of training samples in the order described by `samps`
        # res is also careful not to return a None value
        yield from map(self.do_batch, self.chunkify(res))
        # Returns a batch from res with appropriate processing
        # and while trying to retain the original type where it can

    def new(self, dataset=None, cls=None, **kwargs):
        # Create a copy of the Dataloader with possibly a new dataset
        # and return
        if dataset is None: dataset = self.dataset
        if cls is None: cls = type(self)
        cur_kwargs = dict(dataset=dataset, num_workers=self.fake_l.num_workers, pin_memory=self.pin_memory, timeout=self.timeout,
                          bs=self.bs, shuffle=self.shuffle, drop_last=self.drop_last, indexed=self.indexed, device=self.device)
        for n in self._methods: cur_kwargs[n] = getattr(self, n)
        return cls(**merge(cur_kwargs, kwargs))

    @property
    def prebatched(self): return self.bs is None
    # prebatched will probably be true when our dataset returns items in batches
    # in which case we specify bs=None when creating the dataloader
    def do_item(self, s):
        try: return self.after_item(self.create_item(s))
        # Process and return an item indexed at s
        except SkipItemException: return None
    def chunkify(self, b): return b if self.prebatched else chunked(b, self.bs, self.drop_last)
    # return a batch of samples from the iterator b
    # chunked is pretty straightforward
    def shuffle_fn(self, idxs): return self.rng.sample(idxs, len(idxs))
    # Simply shuffle idxs and return
    def randomize(self): self.rng = random.Random(self.rng.randint(0,2**32-1))
    # reseed RNG
    def retain(self, res, b):  return retain_types(res, b[0] if is_listy(b) else b)
    # retain_types tries to retain the type of each element in `res` to match that of `b`
    def create_item(self, s):  return next(self.it) if s is None else self.dataset[s]
    # Index into dataset at `s`
    def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
    # If it's not prebatched then collate, else convert
    # fa_collate tries to use PyTorch's default_collate if b has array-like elements
    # fa_convert simply converts `b` into a Tensor using PyTorch's default_convert
    def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
    # process elements in batch, collate/convert them, retain the type along the way, and return.
    def to(self, device): self.device = device
    # change default device of self
    def one_batch(self):
        # Gets the first batch from `self`.
        if self.n is not None and len(self)==0: raise ValueError(f'This DataLoader does not contain any batches')
        with self.fake_l.no_multiproc(): res = first(self)
        # first just gets the first item from any iterator or `None` if there is no such item
        if hasattr(self, 'it'): delattr(self, 'it')
        return res
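To see these pieces in action, here’s a tiny usage sketch (assuming the fastai2 package layout at the time, i.e. fastai2.data.load; the toy dataset is just a Python list):

from fastai2.data.load import DataLoader

ds = list(range(10))                       # any indexed "dataset" works
dl = DataLoader(ds, bs=4, shuffle=False, drop_last=False)

len(dl)          # 3: __len__ rounds 10/4 up because drop_last=False
dl.one_batch()   # tensor([0, 1, 2, 3]): create_item -> chunkify -> fa_collate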
4 Likes

I also want to point out that I foolishly missed the message and joined super late.

Please don’t hesitate to join any of our calls; everyone who is interested in going through the Mid-Level API (as the title and OP by @arora_aman say) is most welcome to join.

We’ll be meeting again tomorrow at 7:30 AM IST.

1 Like

@arora_aman do you plan on uploading those notebooks as well? That would be super helpful

@rsomani95 I might be wrong, but AFAIK these were the source notebooks, and we were then running the code in separate notebooks to understand how it all works.

So, @aman_arora has all the notebooks we did our experimentation on. I’m guessing he’s asleep right now so he should upload everything in a few hours.

3 Likes

In today’s session, the last question/issue we had was with knowing exactly how splits and split_idx would work inside of Datasets, DataLoader, etc. Specifically, we wanted to figure out how we might go about restricting a Transform to a particular subset of our data. For example, there may be times when you do not want to apply particular augmentations to images in your validation set.
For this, we were looking at the source code for TfmdDL and Datasets and trying to understand how it’s all working. That’s exactly where we stopped and decided to continue the next day.

But I did some digging afterwards and wrote this blog about dealing with that exact issue, and understanding a few more subtleties about spreading your transforms among your Datasets splits.

Here it is: Using separate Transforms for your training and validation sets in fastai2

This blog is a bit technical and goes slightly towards the advanced side of things. Also, it’s kind of rough at this point simply because I wanted to finish it off quickly :sweat_smile:. But I guess it gets the point across.
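To give a flavour of the mechanism (a minimal sketch, assuming fastai2’s convention that split_idx=0 restricts a Transform to the training split and split_idx=1 to the validation split):

from fastai2.data.all import *

class AddOne(Transform):
    split_idx = 0                        # only applied to the training subset
    def encodes(self, x): return x + 1

items = list(range(10))
splits = IndexSplitter([8, 9])(items)    # last two items form the validation set
dsets = Datasets(items, tfms=[[AddOne()]], splits=splits)

dsets.train[0]   # (1,) -> AddOne applied
dsets.valid[0]   # (8,) -> AddOne skipped on the validation split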

Any feedback is very much appreciated!
(I’ll also add it to the wiki)

2 Likes

By the way, since this came up, here is the order in which everything is applied in the API (from the DataBlock); there’s a small sketch after the list showing where each piece plugs in:

  1. get_items
  2. Splits
  3. type_tfms (PILImage/Categorify)
  4. item_tfms
  5. batch_tfms
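For instance, a typical image-classification DataBlock wires these up roughly like so (a sketch only; the labelling function, resize size and augmentations are placeholders):

from fastai2.vision.all import *

dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),      # 3. type_tfms (PILImage.create / Categorize)
    get_items=get_image_files,               # 1. get_items
    splitter=RandomSplitter(valid_pct=0.2),  # 2. splits
    get_y=parent_label,
    item_tfms=Resize(224),                   # 4. item_tfms (per item, on the CPU)
    batch_tfms=aug_transforms(),             # 5. batch_tfms (per batch, usually on the GPU)
)
# dls = dblock.dataloaders(path_to_images)   # `path_to_images` is hypothetical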
4 Likes

We’ll meet again at 10 AM IST.

2 Likes

Could you guys upload yesterday’s recordings?

Meeting URL? Should I set up a Zoom call? (with a 40-minute limit)

New Link: https://zoom.us/j/290640150?pwd=N25BKzJCc2VGMzVMQTk0bFNYV24wUT09

How long before he uploads the videos and the notebooks?

Hi all,

I just finished uploading the videos, here they are:

  1. DataBlock Overview, Categorize block and CategoryMap
  2. Datasets, Dataloaders and TfmdDL source code walkthru
  3. Complete Pipelines and detailed Datasets
  4. TfmdLists complete walkthrough

In these videos we start looking into the source code in Vim and also run our own experiments in Jupyter notebooks. We try to ask questions in the videos and also answer them.

The past two days have been some of the most fruitful in terms of learning about fastai. In particular, I want to thank @akashpalrecha, @init_27 and @barnacl for being on a call with me for pretty much all of yesterday and for getting unstuck on things together.

From today the study group has grown, and we have decided that we’ll be having weekly sessions every Saturday at 7:30 AM IST and for half of Sunday, to suit everyone.

We will spend the other days (Mon-Fri) working on our personal blogs/experiments/projects using the library, and on the weekends we’ll dig deep into the library itself.

As a rough plan, starting next week we will look into the Learner source code and also much of the optimizers/callbacks, etc. It only gets more interesting from here: the above four videos set the base, and we can now move towards implementing new deep learning research papers and loss functions, and customizing the API to meet our needs. The more interesting bits lie ahead of us :slight_smile:

16 Likes

After analyzing split_idx in depth, don’t you feel it’s an awkward bit of design? We are requiring a Transform, which is the most basic building block, to be aware of an implementation detail of Datasets, which is what holds the splits. So it’s a backward reference in our layered API.

I’d rather have the Datasets tell the loaders which transforms to use. Maybe just have the transforms expose, through a field or a decorator, whether they are meant to be used at train time, and then the pipelines can filter them accordingly. Likewise, RandomTransforms could have both randomized and deterministic behaviour, with the decision of which to use happening higher up the chain.
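Purely as a rough sketch of what I mean (hypothetical names, not the actual fastai2 API):

class BrightnessJitter:
    train_only = True                        # the transform declares its intent...
    def __call__(self, x): return x          # ...the augmentation itself would go here

def tfms_for_split(tfms, *, train: bool):
    # ...and the Datasets/pipeline layer decides which transforms to keep,
    # instead of each Transform knowing about split_idx internally.
    return [t for t in tfms if train or not getattr(t, 'train_only', False)]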

I admit I don’t understand all the details here, but it would be cool if we can come up with a refactoring that makes it cleaner and more elegant.

3 Likes

Just spent my Sunday watching your 4 videos! Amazing stuff, I really like you guys’ approach =) Thank you for doing this =)

3 Likes