How to debug your code and ask for help with fastai v2

We understand it can be frustrating to debug your code while trying to figure out whether something is wrong with it or whether there is a bug in the library. So I wanted to highlight in a post the first steps to debug your model training with fastai v2 and how best to ask for help on the forum, so that you get quick answers from experienced users.

What not to do when asking for help

Here is an example post you should absolutely never write:

When I tried learn.fit_one_cycle(1), my model crashed with the error RuntimeError: invalid argument 0. Please help @jeremy @sgugger @rachel

What’s wrong with it? Let’s go over the mistakes one by one.

1. Show us the code

When your code crashes at the model training step, there could be multiple reasons: something wrong with your data, your model, your loss function, your optimizer, a custom callback… That’s why someone else needs to see each of those to try to guess the cause of the error. No one is a magician who can read your thoughts :wink:

2. Show us the whole stack trace

It is really unhelpful to show only part of the error message. Python returns far more than a single sentence when an error occurs, and even if it doesn’t make sense to you (read further if that’s the case!) it will make sense to someone more experienced. Just copy the whole thing and paste it in your post, between two lines of ```. This will make sure it displays nicely on the forum.
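If the error happens in a script rather than in a notebook, a minimal sketch like the one below captures the whole trace as text, ready to be pasted between the two lines of ```. It only uses Python's standard traceback module; broken_step and full_trace are hypothetical names made up for this illustration.

```python
import traceback

def broken_step():
    # Hypothetical stand-in for the call that fails in your own code.
    raise RuntimeError("invalid argument 0")

def full_trace(fn):
    """Run fn and return the complete stack trace as a string, or None if it succeeds."""
    try:
        fn()
    except Exception:
        # format_exc() returns the same text Python would print for the error.
        return traceback.format_exc()
    return None

trace_text = full_trace(broken_step)
print(trace_text)
```

The point is simply that the full text, not just its last line, is what goes in your post.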

3. Be mindful with the at-mentions

In general, only at-mention a person on the forum when the question you ask can only be answered by that person. In the present case, lots of people can help (provided you follow the guidelines in this post :slight_smile: ). In particular, try to avoid at-mentioning the administrators.

Debugging your error

Before rushing to the forum for help, take a deep breath. Let’s first try to make sense of that error.

1. Break it into pieces

If your error comes up in a training loop (whether it’s a call to fit, fit_one_cycle or lr_find for instance), there can be multiple causes, so try to inspect the pieces one by one (data, model, optimizer, callbacks) and see which one is responsible. More generally, try breaking your code into smaller pieces and inspecting them individually.

Here are some tips to help you debug specific parts:

  1. For your data, check that it’s possible to actually load a batch with dls.one_batch(). If you built your data with fastai2, check that the result of dls.show_batch() makes sense. If you built your data with the data block API, type dblock.summary(same_args_as_in_your_call_to_dataloaders), where dblock is your DataBlock object. This will print a verbose summary of what’s actually happening, so you will see where it failed and get a clear error message for most common errors.

  2. For your model, grab a batch of your data with x,y = dls.one_batch() (assuming your data has just one tensor for inputs and one tensor for labels; adapt this code if necessary), then do preds = learn.model(x) and see what’s happening.

  3. For your loss function, use the previous preds and y, type loss = learn.loss_func(preds, y) and see what’s happening. Check that learn.loss_func is the loss function you expected.

  4. For your callbacks, try learn.show_training_loop(). This will show you all the callbacks and events called, in what order.
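The "one piece at a time" idea above can be sketched in plain Python. This is only an illustration under stated assumptions: first_failing_stage and the three stand-in functions are hypothetical, and in a real session the stages would wrap dls.one_batch(), learn.model(x) and learn.loss_func(preds, y).

```python
def first_failing_stage(stages):
    """Run (name, callable) pairs in order, feeding each result to the next stage.

    Returns (name, exception) for the first stage that raises, or (None, None).
    """
    result = None
    for name, step in stages:
        try:
            result = step(result)
        except Exception as exc:
            return name, exc
    return None, None

# Hypothetical stand-ins for the real data, model and loss calls.
def load_batch(_):
    return ([1.0, 2.0], [0, 1])            # pretend (x, y)

def run_model(batch):
    x, y = batch
    return [v * 2 for v in x], y           # pretend (preds, y)

def compute_loss(out):
    preds, y = out
    raise ValueError("pretend the loss function rejects these inputs")

failed, error = first_failing_stage(
    [("data", load_batch), ("model", run_model), ("loss", compute_loss)]
)
print(f"first failing piece: {failed} ({error})")
```

Knowing which piece fails first tells you which part of your code to show when you ask for help.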

2. Read the stack trace

When an error occurs, Python does not just throw two words at you but a whole block of text that may not seem to make any sense. Don’t be afraid to look at it: it’s called the stack trace. The very bottom is where your error actually happened, at the lowest level. Then, as you go back up, each paragraph corresponds to the function that called the block below.

This can be helpful, as you can see which part of the library you were in when the error happened, and it can help you determine its cause. For instance, here is the stack trace that came with the error in the bad post up there:

----------------------------------------------------------------
RuntimeError                   Traceback (most recent call last)
<ipython-input-18-8c0a3d421ca2> in <module>
      4                  splitter=RandomSplitter(seed=42),
      5                  get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'))
----> 6 pets1.summary(path/"images")

~/git/fastai2/fastai2/data/block.py in summary(self, source, bs, **kwargs)
    172         why = _find_fail_collate(s)
    173         print("Make sure all parts of your samples are tensors of the same size" if why is None else why)
--> 174         raise e
    175 
    176     if len([f for f in dls.train.after_batch.fs if f.name != 'noop'])!=0:

~/git/fastai2/fastai2/data/block.py in summary(self, source, bs, **kwargs)
    166     print("\nCollating items in a batch")
    167     try:
--> 168         b = dls.train.create_batch(s)
    169         b = retain_types(b, s[0] if is_listy(s) else s)
    170     except Exception as e:

~/git/fastai2/fastai2/data/load.py in create_batch(self, b)
    124     def retain(self, res, b):  return retain_types(res, b[0] if is_listy(b) else b)
    125     def create_item(self, s):  return next(self.it) if s is None else self.dataset[s]
--> 126     def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
    127     def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
    128     def one_batch(self):

~/git/fastai2/fastai2/data/load.py in fa_collate(t)
     44     b = t[0]
     45     return (default_collate(t) if isinstance(b, _collate_types)
---> 46             else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
     47             else default_collate(t))
     48 

~/git/fastai2/fastai2/data/load.py in <listcomp>(.0)
     44     b = t[0]
     45     return (default_collate(t) if isinstance(b, _collate_types)
---> 46             else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
     47             else default_collate(t))
     48 

~/git/fastai2/fastai2/data/load.py in fa_collate(t)
     43 def fa_collate(t):
     44     b = t[0]
---> 45     return (default_collate(t) if isinstance(b, _collate_types)
     46             else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
     47             else default_collate(t))

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     53             storage = elem.storage()._new_shared(numel)
     54             out = elem.new(storage)
---> 55         return torch.stack(batch, 0, out=out)
     56     elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
     57             and elem_type.__name__ != 'string_':

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 414 and 375 in dimension 2 at /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/TH/generic/THTensor.cpp:612

So, starting at the bottom, the actual error message is more helpful than just RuntimeError: invalid argument 0. It says that the sizes of two tensors should have matched but they did not. Looking at the stack trace, the lowest level was in the function default_collate of torch/utils/data/_utils/collate.py. That means the PyTorch function default_collate was unhappy with the sizes of the tensors.

Just above, we see that this function default_collate was called by fa_collate in fastai2/data/load.py. So we have left PyTorch and arrived in fastai2, module data.load. The function fa_collate was unhappy with the sizes of tensors it got. Remember that you can use doc(any_function) in your notebook to pop up a window with the documentation and a link to the function in the doc website. Doing this will let you know that fa_collate is responsible for grouping together your tensors in a batch (actually it does not right now, but let’s imagine I’ve fixed this :wink: )

So we had a problem when grouping the tensors to put them in a batch because they were not of the same size… this means you forgot to resize your items to the same size!
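One way to confirm that kind of diagnosis is to count the distinct item sizes in a handful of samples. The sketch below is plain Python over shape tuples; shape_report is a hypothetical helper and the shapes are made-up illustrative values (only 414 and 375 are taken from the error message), so in practice you would collect the shapes from your own items.

```python
from collections import Counter

def shape_report(shapes):
    """Count how often each item shape occurs; more than one key means the items differ in size."""
    return Counter(tuple(s) for s in shapes)

# Made-up shapes for illustration only.
shapes = [(3, 500, 414), (3, 500, 375), (3, 500, 414)]

report = shape_report(shapes)
can_stack = len(report) == 1
print(report, "-> same size" if can_stack else "-> resize the items first")
```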

I’m not saying this is easy, but give it a try at your next error, and even if you don’t understand, don’t forget to copy that stack trace when you ask for help. Someone might make sense of it and be able to help you, and even explain to you how to read that particular stack trace.
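To see the "bottom is where it happened, each block above is the caller" structure on a small scale, here is a toy sketch using Python's standard traceback module. The three functions are hypothetical and only borrow their names from the trace above; they are not the fastai2 or PyTorch implementations.

```python
import traceback

def collate(items):
    # Innermost call: this is where the error is actually raised.
    raise RuntimeError("Sizes of tensors must match")

def create_batch(items):
    return collate(items)

def summary(items):
    return create_batch(items)

try:
    summary([1, 2])
except RuntimeError as exc:
    # extract_tb lists the frames from the outermost call to the innermost one,
    # the same top-to-bottom order as the printed stack trace.
    frames = traceback.extract_tb(exc.__traceback__)
    call_chain = [frame.name for frame in frames]

print(" -> ".join(call_chain))
```

The last name in the chain is the function that raised, and each name before it is the caller of the next one.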

3. %debug

When in a notebook, type %debug in the cell just after your error, then press Shift+Enter. You will be put inside the stack trace we just studied, and you can inspect the content of any variable there. Just press u and Enter if you want to go one frame up (i.e. one paragraph up).

In the same way, having a line set_trace() anywhere in the code will pause the execution and let you inspect the content of any variable once that code is executed.
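Both of these are interactive, so they are hard to show in a post. As a rough, non-interactive approximation of what you would see in the bottom frame, the sketch below reads the local variables from the exception's traceback using standard Python attributes (__traceback__, tb_next, tb_frame.f_locals). innermost_locals and stack_items are hypothetical names for this illustration.

```python
def innermost_locals(exc):
    """Return a copy of the local variables of the frame where exc was raised."""
    tb = exc.__traceback__
    while tb.tb_next is not None:      # walk down to the bottom of the stack trace
        tb = tb.tb_next
    return dict(tb.tb_frame.f_locals)

def stack_items(items):
    # Hypothetical stand-in for a function that groups items into a batch.
    sizes = [len(item) for item in items]
    if len(set(sizes)) > 1:
        raise RuntimeError("sizes must match")
    return items

try:
    stack_items([[1, 2, 3], [4, 5]])
except RuntimeError as exc:
    found = innermost_locals(exc)

print(found["sizes"])
```

In the debugger you would get the same information by typing the variable name at the prompt.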

How to write your cry for help on the forum

If you’re still desperate after all of this, or have spent more than half an hour trying to figure out what’s wrong, do go on the forum. First search for your error to see if anyone has already posted something like yours and (hopefully) someone else has come up with a solution.

If not, create a new topic in the relevant category. Show the minimal amount of code necessary for anyone to reproduce your bug (if possible, on one of fastai’s datasets). Don’t show only the line that gave the error!

Copy and paste the full stack trace between two lines of ``` . Explain what you tried and what you think is causing the error (it’s ok if you have no idea). Don’t forget to be nice and courteous: it’s no one’s job to fix your problem. But if you followed all the steps in this post, you will find that plenty of users will help.

Above all else, be mindful of who you at-mention. Don’t post twice. Be patient, and remember that if no one is answering, it probably means that everyone is as lost as you are.

Also note that everything I mentioned here (apart from the bits specific to fastai2) is general good practice when interacting with people on any open-source project. So always try to follow these guidelines :slight_smile:
