Developer chat

pierreguillou · January 21, 2019, 6:57pm

I agree with you. Thanks if you can fix it.

I think you wanted to write “it will not work”.

Great, thank you.

xnutsive · January 21, 2019, 7:00pm

Yeah, got ahead of myself and sent the reply and then re-read the original message. Not my best Monday /shrug.

Thanks for the detailed investigation. I’ll work on it tomorrow and hopefully get back to you guys with a PR.

sgugger · January 21, 2019, 7:07pm

train_ds is only hard-coded when we are looking at the class of either the inputs or the labels, to call things like reconstruct or show_xys. Those are the same for all the datasets in your DataBunch. The data is actually accessed in the first line, when we call one_batch (and there we pass ds_type).

pierreguillou · January 21, 2019, 8:17pm

Thanks Sylvain. You are right.
I just tried data.show_batch(rows=5, ds_type=DatasetType.Valid) and it worked in the pets notebook (bs=64).

sgugger · January 21, 2019, 11:58pm

Breaking change: sep is now label_delim in the data block API and the ImageDataBunch factory methods. This gives the argument the same name in ImageDataBunch and TextDataBunch, and avoids the confusion where some people thought it was a csv separator.

Docs have been updated accordingly.
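
For anyone unsure how a label delimiter differs from a csv separator, here is a minimal sketch in plain Python. It is not fastai code: read_labels, the file contents, and the column names are all made up for illustration. The csv module handles the column separator, and label_delim only splits the text inside the labels column.

```python
import csv
import io

# Hypothetical labels file: columns are separated by ",", while the tags
# column packs several labels into one field, separated by spaces.
CSV_TEXT = "name,tags\nimg_1.jpg,clear primary\nimg_2.jpg,cloudy\n"

def read_labels(text, label_delim=None):
    "Illustrative helper (not fastai code): parse rows, then split the label field on label_delim"
    rows = csv.DictReader(io.StringIO(text))  # the csv separator is handled here
    out = {}
    for row in rows:
        tags = row["tags"]
        # label_delim only affects the contents of the label field
        out[row["name"]] = tags.split(label_delim) if label_delim else [tags]
    return out

single = read_labels(CSV_TEXT)                   # one label per image
multi = read_labels(CSV_TEXT, label_delim=" ")   # several labels per image
```

Without label_delim the first image gets the single label "clear primary"; with label_delim=" " it gets the two labels "clear" and "primary".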

stas · January 22, 2019, 2:31am

For the last few days I’ve been tracking down the cause of unrecoverable Out of Memory errors and of a memory leak on manual interrupt of a notebook run. I first found a solution to the problem, which I polished for quite a while, only to discard it after digging deeper, finding the cause, and then fixing the cause.

So when you get CUDA OOM and you can’t recover from it w/o restart, or when you get memory leaked when you hit stop during training, the cause is ipython. It stores the traceback of the exception. The traceback ties up the locals() and they don’t get released until… another exception occurs, at which point it frees up the old tb, which allows gc.collect() to do its work. Ouch. It was quite a journey to figure it out and I have learned a lot about python on the way.
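
To see the mechanism in isolation, here is a minimal sketch in plain Python with no GPU involved. It assumes that keeping the traceback in a variable is a fair stand-in for what ipython does when it stores it; Big and demo are made-up names for illustration only.

```python
import gc
import sys
import traceback
import weakref

def demo():
    "Show that a stored traceback keeps a failed function's locals alive"
    class Big:
        "Stand-in for a large GPU tensor (hypothetical)"

    refs = []

    def failing():
        big = Big()                      # local that we would like to be freed
        refs.append(weakref.ref(big))    # weak reference, does not keep it alive
        raise RuntimeError("simulated OOM")

    try:
        failing()
    except RuntimeError:
        saved_tb = sys.exc_info()[2]     # mimics a shell holding on to the traceback

    gc.collect()
    alive_while_tb_stored = refs[0]() is not None

    # Clear the locals of the finished frames referenced by the traceback.
    # The frame that is still executing (demo itself) is skipped.
    traceback.clear_frames(saved_tb)
    gc.collect()
    alive_after_clear = refs[0]() is not None
    return alive_while_tb_stored, alive_after_clear

result = demo()
```

Here result should be (True, False): the object survives gc.collect() for as long as the traceback references its frame, and it is released once the frames are cleared.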

I submitted a fix here https://github.com/ipython/ipython/pull/11572 - it seems some tests that compare the exact tb no longer match, but I trust they will figure it out. Imagine that! A one-line fix and now you can OOM as much as you’d like and continue running your notebook! Amazing!

If you want to understand more about the problem, I explained the nuances of the problem of saving a traceback or an exception object here.

Until ipython sorts it out, if you need a solution today, you can apply a hotfix to your installed version of ipython so you can enjoy the change now:

curl -L https://github.com/ipython/ipython/commit/657cde76ad07ec5b69470758d9bb6adbae88a1da.patch > /tmp/tb-leak-fix.patch
cd $CONDA_PREFIX/lib/python3.7/site-packages/
patch -p1 < /tmp/tb-leak-fix.patch

Adjust the path, of course; this one is for python 3.7.

Alternatively, here is some magic code for you:

import functools
import sys
import traceback
def get_ref_free_exc_info():
    "Free traceback from references to locals/globals to avoid circular reference leading to gc.collect() unable to reclaim memory"
    type, val, tb = sys.exc_info()
    traceback.clear_frames(tb)
    return (type, val, tb)

def gpu_mem_restore(func):
    "Reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except:
            type, val, tb = get_ref_free_exc_info() # must!
            raise type(val).with_traceback(tb) from None
    return wrapper

Now add before any of your functions:

@gpu_mem_restore
def fit(...

and OOM is now recoverable! And interrupts leak no memory!

Regardless of ipython’s fix this is now part of fastai, so you should be able to see the impact by just using the latest git. At the moment only functions that call fit() are positively affected.

Here is a notebook that demonstrates the OOM w/o the leak and that recovers almost 100% of memory w/o restart, using the current fastai git: https://github.com/fastai/fastai_docs/blob/master/dev_nb/mem_leaks/OOM_on_fit_recover.ipynb

And if you want to protect just a few lines of code, here is a context manager that does the same:

class gpu_mem_restore_ctx():
    " context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None

So now you can do:

with gpu_mem_restore_ctx():
    learn.fit_one_cycle(1,1e-2)

with the same results. Since the fit functions are already protected, the context manager is more useful for your custom code.

Both functions are now in https://github.com/fastai/fastai/blob/master/fastai/utils/mem.py so you will just need to from fastai.utils.mem import * before you can use them.

BTW, another workaround is to throw another exception following the OOM exception:

# cell 1 - if this leads to OOM leak
learn.fit_one_cycle(1,1e-2)
# cell 2 - this will release the memory, since it will reset %tb and free its locals()
assert False, "please liberate my GPU!"

If you want a more exact case where it only recovers from OOM, but the problem remains with any other exception it’d be:

import gc
import torch

def gpu_mem_restore(func):
    "Reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        oom_exc = False
        try:
            return func(*args, **kwargs)
        except RuntimeError as e:
            if "CUDA out of memory" in str(e):
                oom_exc = True
                type, val, tb = get_ref_free_exc_info() # must!
                raise type(val).with_traceback(tb) from None
            else: raise # re-raises the exact last exception
        except: raise # any other types of errors
        finally:
            if oom_exc:
                # reclaim memory
                gc.collect()
                if torch.cuda.is_available(): torch.cuda.empty_cache()
    return wrapper

(need to include the KeyboardInterrupt type in there too)
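
Here is one way the KeyboardInterrupt case might be folded in. This is a sketch under assumptions, not the version that lives in fastai: gpu_mem_restore_strict and is_recoverable are hypothetical names, and the cache is only emptied if torch has already been imported.

```python
import functools
import gc
import sys
import traceback

def is_recoverable(exc):
    "Hypothetical helper: True for CUDA OOM errors and manual interrupts"
    if isinstance(exc, KeyboardInterrupt): return True
    return isinstance(exc, RuntimeError) and "CUDA out of memory" in str(exc)

def gpu_mem_restore_strict(func):
    "Sketch: free traceback locals only on CUDA OOM or KeyboardInterrupt"
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except BaseException as e:
            if not is_recoverable(e): raise   # any other error is left untouched
            tb = e.__traceback__
            traceback.clear_frames(tb)        # drop the locals held by the traceback
            gc.collect()
            torch = sys.modules.get("torch")  # assumption: only act if torch is loaded
            if torch is not None and torch.cuda.is_available():
                torch.cuda.empty_cache()
            raise type(e)(*e.args).with_traceback(tb) from None
    return wrapper
```

Catching BaseException is deliberate here, since KeyboardInterrupt does not derive from Exception. Errors that are neither OOM nor an interrupt are re-raised as they are, so their tracebacks keep their locals as before.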

If you encounter any related issues you can discuss those here: A guide to recovering from CUDA Out of Memory and other exceptions

xnutsive · January 22, 2019, 5:35am

@pierreguillou, ugh, show_batch(1) had a weird bug in there, seems like a regression. I think I fixed it here: https://github.com/fastai/fastai/pull/1498

Also fixed trying to show_batch(10) on smaller batch sizes.

jeremy · January 22, 2019, 1:59pm

Amazing work @stas

sermakarevich · January 22, 2019, 3:54pm

Let’s add some to Stas’s pull request!

radek · January 22, 2019, 4:29pm

Read about this on Twitter but just wanted to stop by and say: unbelievable, outstanding, amazing job, Stas. Kudos to you!

PierreO · January 22, 2019, 7:05pm

That’s amazing, Stas, very, very nice work!

Kaspar · January 22, 2019, 7:47pm

That is good work, thx**64

j.laute · January 23, 2019, 12:02am

This is amazing! Many thanks

stas · January 23, 2019, 3:07am

Any idea why I get no follow-up on this jupyter issue? I’m dumbfounded: I broke it down to an easy-to-reproduce minimal notebook, confirmed it with a generic third-party install, bisected to find that the problem started with exactly python 3.6.0, and spent hours trying to bisect on components and custom config to rule them all out, and yet not a peep from the jupyter notebook devs.

Does this problem not bother you at all?

If it does please vote on the issue, perhaps then it’d get some attention.

Or perhaps you’re not using the TOC extension, the magic “follow the execution focus”, and the “jump to currently executing cell” shortcut. If so, you’re missing out on being way more efficient than manually scrolling around notebooks that are at times very long. Except all 3 are problematic when this bug gets triggered. TOC is still useful despite the bug, but the other two can’t work with the bug.

Clearly it’s been around for at least 2 years now (since the 3.6.0 release). The bug’s manifestation seems to depend on what each cell contains, but the reproducible notebook always manifests the bug, so it should be easy to verify.

kcturgutlu · January 23, 2019, 3:35am

Truly thankful

Kaspar · January 23, 2019, 10:43am

It’s a really annoying issue. I run pretty long jobs, and when I use “run all cells” I cannot see the progress until I get to a cell using fastaiprogress, which does work. I have upvoted.

Kaspar · January 23, 2019, 11:31am

We use BOS but not EOS in the language model tokenization. Isn’t this inconsistent?
When we read the tokens going forward, we use BOS to signal that a new sentence begins. Shouldn’t we also use EOS, so that when we read the tokens backwards, EOS signals that a new reversed sentence begins?

bfarzin · January 23, 2019, 3:31pm

I have been thinking of BOS as marking the start of the input, so the RNN can “reset” whatever is needed for the next pass and proceed from there. I don’t think it matters whether the tokens are forward or reversed; what matters is that you have something that says, “This is a new sequence.” So, when I try the backwards tokens, I reverse them all and then have a BOS at the start of the reversed series. Maybe I got that wrong.

I have also been curious about why we don’t call reset_state() when we get a new BOS (or EOS in your case) to be sure we are starting “clean” with the new sentence. That would seem right to me, but I have not tried it out to see if you get a better model.

I have a simple flag added to the spacy tokenizer that would allow you to get reversed tokens. Should I put in a PR for that?

sgugger · January 23, 2019, 3:34pm

It’s hard to do that in practice because you get BOS in one of your batches but not all of them.
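
A small illustration of why, in plain Python. It assumes the usual language-model batching where one long token stream is cut into bs contiguous rows; batchify and the toy documents are made up for the example, and the BOS string is only a placeholder.

```python
BOS = "xxbos"  # placeholder BOS token for this toy example

def batchify(tokens, bs):
    "Cut one long token stream into bs contiguous rows of equal length"
    n = len(tokens) // bs
    return [tokens[i * n:(i + 1) * n] for i in range(bs)]

# Three toy documents, concatenated into one stream with BOS before each
docs = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
stream = [tok for doc in docs for tok in [BOS] + doc]

rows = batchify(stream, bs=3)

# For each time step, which rows of the batch see a BOS at that step?
bos_per_step = [[row[t] == BOS for row in rows] for t in range(len(rows[0]))]
```

At step 0 rows 0 and 1 start a new document but row 2 is in the middle of one, and at step 3 only row 1 sees a BOS. A single reset of the whole hidden state would therefore wipe rows that are mid-sentence, so a reset on BOS would have to be applied per row.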

fl2o · January 23, 2019, 4:49pm

I downloaded fastai on a Windows machine today, tried it for image classification, and noticed a big performance issue due to the use of the torch dataloader. It’s either:

Set num_workers to 0, which doesn’t use the GPU optimally, or
Set num_workers to 8, which uses the GPU at its maximum but adds a few minutes (~5) at the beginning of each epoch for Windows to set up the workers.

Is there any workaround for this?
Refer to this issue for the current state (torch side).