We understand you might get frustrated when debugging your code, trying to figure out whether something is wrong with it or whether there is a bug in the library. So I wanted to highlight in a post the first steps to debug your model training with fastai v2, and how best to ask for help on the forum so that you get quick answers from experienced users.
What not to do when asking for help
Here is an example post you should absolutely never write:
When I tried learn.fit_one_cycle(1), my model crashed with the error RuntimeError: invalid argument 0. Please help @jeremy @sgugger @rachel
What's wrong with it? Let's go over the mistakes step by step.
1. Show us the code
When your code crashes at the model training step, there can be multiple reasons: something wrong with your data, your model, your loss function, your optimizer, a custom callback… That's why someone else needs to see each of those to try to guess the cause of the error. No one is a magician who can read your thoughts.
2. Show us the whole stack trace
It is really unhelpful to show only part of the error message. Python returns way more than a single sentence when an error occurs, and even if it doesn't make sense to you (read further if that's the case!) it will make sense to someone more experienced. Just copy the whole thing and paste it in your post, between two lines of ```. This will make sure it displays nicely on the forum.
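In a notebook you can simply copy the output of the failing cell. If the error happens in a script instead, a small helper can capture the whole stack trace for you. The sketch below uses only the standard library; `format_error_for_forum` and `failing_step` are hypothetical names chosen for illustration, not part of fastai.

```python
import traceback

# Three backticks, built this way so the fence around this example stays intact.
FENCE = "`" * 3

def format_error_for_forum(exc):
    """Return the complete stack trace of `exc`, wrapped between two fence lines."""
    lines = traceback.format_exception(type(exc), exc, exc.__traceback__)
    return f"{FENCE}\n{''.join(lines).rstrip()}\n{FENCE}"

def failing_step():
    # Stand-in for whatever call crashed (for example a training call).
    raise RuntimeError("invalid argument 0: Sizes of tensors must match")

try:
    failing_step()
except RuntimeError as e:
    report = format_error_for_forum(e)
    print(report)
```

The printed text can be pasted into a post as is: it contains every frame, not just the last line of the error.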
3. Be mindful with the at-mentions
In general, only at-mention a person on the forum when the question you ask can only be answered by that person. In the present case, lots of people can help (provided you follow the guidelines in this post). In particular, try to avoid at-mentioning the administrators.
Debugging your error
Before rushing to the forum for help, take a deep breath. Let's first try to make sense of that error.
1. Break it into pieces
If your error comes in a training loop (whether it's a call to fit, fit_one_cycle or lr_find for instance), it can have multiple causes, so try to inspect the pieces one by one (data, model, optimizer, callbacks) and see which one is responsible. More generally, try breaking your code into smaller pieces and inspect them individually.
Here are some tips to help you debug specific parts:
- For your data, check it's possible to actually load a batch with dls.one_batch(). If you built your data with fastai2, check that the result of dls.show_batch() makes sense. If you built your data with the data block API, type dls.summary(same_args_as_in_your_call_to_dataloaders): this will print a verbose summary of what's actually happening, show you where it failed, and give a clear error message for most common errors.
- For your model, grab a batch of your data with x,y = dls.one_batch() (assuming your data has just one tensor for inputs and one tensor for labels; adapt this code if necessary), then do preds = learn.model(x) and see what's happening.
- For your loss function, use the previous preds and y, type loss = learn.loss_func(preds, y) and see what's happening. Check that learn.loss_func is the loss function you expected.
- For your callbacks, try learn.show_training_loop(). This will show you all the callbacks and events called, and in what order.
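The idea of checking the pieces one at a time can be sketched as a tiny helper that runs each check in order and reports the first one that raises. This is only an illustration: `first_failing_piece` is a hypothetical helper, and the three stand-in functions replace the real calls from the list above so that the example runs without fastai installed.

```python
def first_failing_piece(checks):
    """Run each (name, function) pair in order.

    Returns (name, exception) for the first check that raises,
    or None if every check runs without error.
    """
    for name, check in checks:
        try:
            check()
        except Exception as exc:
            return name, exc
    return None

# Stand-ins so the sketch runs on its own. With a real Learner these would be
# the calls described above, e.g. `dls.one_batch`, `lambda: learn.model(x)`
# and `lambda: learn.loss_func(preds, y)`.
def load_batch():
    return [1.0, 2.0], [0, 1]

def run_model():
    raise ValueError("stand-in for a shape mismatch inside the model")

def compute_loss():
    return 0.0

result = first_failing_piece([
    ("data", load_batch),
    ("model", run_model),
    ("loss", compute_loss),
])
print(result)
```

Here the data check passes and the model check is reported as the culprit, so the loss function is never reached. Knowing which piece fails tells you which part of your code to show when you ask for help.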
2. Read the stack trace
As you can see, Python did not just throw two words at you for an error but a whole text that may not seem to make any sense. Don't be afraid and look at it: it's called the stack trace. The very bottom is where your error actually happened, at the lowest level; then, as you go back up, each paragraph corresponds to the function that called the block below.
This can be helpful, as you can see which part of the library you were in when the error happened, which can help you determine its cause. For instance, here is the stack trace that came with the error in the bad post up there:
----------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-18-8c0a3d421ca2> in <module>
4 splitter=RandomSplitter(seed=42),
5 get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'))
----> 6 pets1.summary(path/"images")
~/git/fastai2/fastai2/data/block.py in summary(self, source, bs, **kwargs)
172 why = _find_fail_collate(s)
173 print("Make sure all parts of your samples are tensors of the same size" if why is None else why)
--> 174 raise e
175
176 if len([f for f in dls.train.after_batch.fs if f.name != 'noop'])!=0:
~/git/fastai2/fastai2/data/block.py in summary(self, source, bs, **kwargs)
166 print("\nCollating items in a batch")
167 try:
--> 168 b = dls.train.create_batch(s)
169 b = retain_types(b, s[0] if is_listy(s) else s)
170 except Exception as e:
~/git/fastai2/fastai2/data/load.py in create_batch(self, b)
124 def retain(self, res, b): return retain_types(res, b[0] if is_listy(b) else b)
125 def create_item(self, s): return next(self.it) if s is None else self.dataset[s]
--> 126 def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
127 def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
128 def one_batch(self):
~/git/fastai2/fastai2/data/load.py in fa_collate(t)
44 b = t[0]
45 return (default_collate(t) if isinstance(b, _collate_types)
---> 46 else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
47 else default_collate(t))
48
~/git/fastai2/fastai2/data/load.py in <listcomp>(.0)
44 b = t[0]
45 return (default_collate(t) if isinstance(b, _collate_types)
---> 46 else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
47 else default_collate(t))
48
~/git/fastai2/fastai2/data/load.py in fa_collate(t)
43 def fa_collate(t):
44 b = t[0]
---> 45 return (default_collate(t) if isinstance(b, _collate_types)
46 else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
47 else default_collate(t))
~/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
53 storage = elem.storage()._new_shared(numel)
54 out = elem.new(storage)
---> 55 return torch.stack(batch, 0, out=out)
56 elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
57 and elem_type.__name__ != 'string_':
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 414 and 375 in dimension 2 at /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/TH/generic/THTensor.cpp:612
So starting at the bottom, the actual error message is more helpful than just RuntimeError: invalid argument 0. It says that the sizes of two tensors should have matched but did not. Looking at the stack trace, the lowest level was in the function default_collate of torch/utils/data/_utils/collate.py. That means the PyTorch function default_collate was unhappy with the sizes of the tensors.
Just above, we see that this function default_collate was called by fa_collate in fastai2/data/load.py. So we have left PyTorch and arrived in fastai2, module data.load. The function fa_collate was unhappy with the sizes of the tensors it got. Remember that you can use doc(any_function) in your notebook to pop up a window with the documentation and a link to the function on the doc website. Doing this will let you know that fa_collate is responsible for grouping your tensors together in a batch (actually it does not right now, but let's imagine I've fixed this).
So we had a problem when grouping the tensors to put them in a batch because they were not of the same size… this means you forgot to resize your items to the same size!
I'm not saying this is easy, but give it a try at your next error, and even if you don't understand, don't forget to copy that stack trace when you ask for help. Someone might make sense of it and be able to help you, and even explain to you how to read that particular stack trace.
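If reading from the bottom up feels awkward, the same walk can be done in code. The sketch below uses the standard library's traceback module to list the frames starting from where the error was raised and going back up to the original call. The functions `stack_tensors`, `collate` and `create_batch` are hypothetical stand-ins for a library call chain, not the real fastai or PyTorch functions.

```python
import traceback

def frames_innermost_first(exc):
    """Return (filename, function name, line number) for each frame, starting
    where the error was raised and walking back up to the original call."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [(f.filename, f.name, f.lineno) for f in reversed(frames)]

# Hypothetical stand-ins for a chain of library calls.
def stack_tensors(batch):
    raise RuntimeError("Sizes of tensors must match except in dimension 0")

def collate(batch):
    return stack_tensors(batch)

def create_batch(samples):
    return collate(samples)

try:
    create_batch([[1, 2, 3], [1, 2]])
except RuntimeError as e:
    trail = frames_innermost_first(e)
    for filename, func, lineno in trail:
        print(f"{func} ({filename}:{lineno})")
    names = [func for _, func, _ in trail]
```

The first name printed is the function that raised the error, and each following one is its caller, which is the same order as reading the stack trace from the last paragraph upwards.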
3. Use %debug
When in a notebook, type %debug in the cell just after your error, then press Shift+Enter. You will be put inside the stack trace we just studied, and you can inspect the content of any variable there. Just press u and Enter if you want to go one frame up (i.e. one paragraph up).
In the same way, having a line set_trace() anywhere in the code will pause the execution when that line is reached and let you inspect the content of any variable.
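Outside a notebook, or when you want a non-interactive look, roughly the same information can be read from the exception itself. The following is a minimal sketch using only the standard library, not what %debug does internally: it walks down to the frame where the error was raised and returns its local variables. Both `innermost_locals` and the toy `collate` function are hypothetical names for illustration.

```python
def innermost_locals(exc):
    """Return the local variables of the frame where `exc` was raised."""
    tb = exc.__traceback__
    while tb.tb_next is not None:  # walk down to the deepest frame
        tb = tb.tb_next
    return dict(tb.tb_frame.f_locals)

def collate(batch):
    # Toy stand-in: refuses to group items of different lengths.
    sizes = [len(item) for item in batch]
    if len(set(sizes)) != 1:
        raise RuntimeError(f"Sizes must match, got {sizes}")
    return batch

try:
    collate([[1, 2, 3], [1, 2]])
except RuntimeError as e:
    found = innermost_locals(e)
    print(found["sizes"])
```

Looking at `found` shows the values the failing function was working with at the moment of the crash, which is usually the first thing to check in an interactive debugging session as well.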
How to write your cry for help on the forum
If you're still desperate after all of this, or have spent more than half an hour trying to figure out what's wrong, do go on the forum. First search for your error to see if anyone has already posted something like yours and (hopefully) someone else has come up with the solution.
If not, create a new topic in the relevant category. Show the minimal amount of code necessary for anyone to reproduce your bug (if possible, on one of the fastai datasets). Don't show only the line that gave an error!
Copy and paste the full stack trace between two lines of ```. Explain what you tried and what you think is causing the error (it's ok if you have no thoughts). Don't forget to be nice and courteous: it's no one's job to fix your problem. But if you followed all the steps of this post, you will find that plenty of users will help.
Above all else, be mindful of who you at-mention. Don't post twice. Be patient, and remember that if no one is answering, it probably means that everyone is as lost as you are.
Also note that everything I mentioned here (apart from the bits specific to fastai2) is general good practice when interacting with people on any open-source project. So always try to follow these guidelines.