Developer chat

Since there was a lot of confusion, I’ve renamed the tfms argument of DataBunch to dl_tfms (people often mistook it for ds_tfms in computer vision).
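To illustrate the distinction with a rough sketch (MNIST_SAMPLE is just a stand-in dataset): ds_tfms in vision are per-item augmentations applied to the dataset items, while dl_tfms (the renamed tfms) is the list of transforms the DataBunch applies to each batch at the DataLoader level.

from fastai.vision import *

# ds_tfms: per-item image augmentations; dl_tfms (ex-`tfms`): batch-level transforms
# applied by the DataLoader. MNIST_SAMPLE is only used here as a small example dataset.
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=28)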

1 Like

I have built a small version of Beam Search that seems promising. In the process, I looked carefully at the LanguageLearner.predict() method. I am not sure if this is a bug or I am misunderstanding how it works.

When you call predict(), you begin with an initial self.model.reset() that sets the hidden states to zero. Then you pass the sample text through the model and keep appending each new token to your list of generated tokens. However, the text you feed in is now the full set of tokens generated from the start, yet the state has not been reset, so you are predicting on top of the state left over from the last prediction.
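To make the pattern concrete, here is a toy sketch in plain PyTorch (not fastai’s actual predict code) of what I think is happening:

import torch
import torch.nn as nn

# Toy illustration: the hidden state is "reset" once up front, then carried across
# iterations, while the *full* generated sequence is fed back in at every step, so
# earlier tokens are re-processed on top of a state that has already seen them.
emb = nn.Embedding(10, 8)
rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

tokens = [1, 2, 3]   # the seed text
state = None         # reset once, like model.reset()
for _ in range(5):
    inp = emb(torch.tensor([tokens]))   # full sequence so far, shape (1, len, 8)
    out, state = rnn(inp, state)        # state carries over from the previous call
    tokens.append(int(out[0, -1].argmax()) % 10)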

What am I missing here?

I think we should add a cut parameter to unet_learner to be able to use custom models.
The current signature of the function is unet_learner(data:DataBunch, arch:Callable, pretrained:bool=True, blur_final:bool=True, norm_type:Optional[NormType]=NormType, split_on:Union[Callable, Collection[ModuleList], NoneType]=None, blur:bool=False, self_attention:bool=False, y_range:OptRange=None, last_cross:bool=True, bottle:bool=False, **kwargs:Any)

I propose unet_learner(data:DataBunch, arch:Callable, pretrained:bool=True, blur_final:bool=True, norm_type:Optional[NormType]=NormType, split_on:Optional[SplitFuncOrIdxList]=None, blur:bool=False, self_attention:bool=False, y_range:Optional[Tuple[float,float]]=None, last_cross:bool=False, bottle:bool=False, cut:Union[int,Callable]=None, **kwargs:Any)->None and pass the parameter to create_body, as in create_cnn.
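A rough sketch of the idea, assuming create_body accepts (arch, pretrained, cut) as it does on the create_cnn path (resnet34 and cut=-2 are just example values):

from fastai.vision import *
import torchvision.models as models

# Sketch only: inside unet_learner, forward `cut` to create_body the way create_cnn does,
# instead of always cutting the model at its default point.
body = create_body(models.resnet34, pretrained=True, cut=-2)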

I found this setup to debug PyTorch memory leaks on the Pyro forums: https://forum.pyro.ai/t/a-clever-trick-to-debug-tensor-memory/556
Maybe this is interesting for the library development. :slight_smile:

1 Like

That’s a nice version, @MicPie! Except it’s incomplete; it should be merged with this version: https://discuss.pytorch.org/t/how-to-debug-causes-of-gpu-memory-leaks/6741/24
We should put it somewhere in the docs for sure.

If you find other goodies please share!

1 Like

Hi @sgugger, cutout for data augmentation was implemented in the previous fastai (before v1) but not in v1. Do you plan to add it to vision.transform, or is it not a relevant technique that won’t be implemented? Thanks.

We just forgot. Will implement it when I have a bit of time next week, send me a PM if I forget!

1 Like

Thanks Sylvain!

@sgugger @pierreguillou I was procrastinating (instead of training LMs) and implemented it: https://github.com/fastai/fastai/pull/1489

Figured Sylvain’s busy with text stuff and I’ll help out a bit.

@pierreguillou, you can test whether this works for you by using my fork, or wait till it’s merged (or till Sylvain implements it himself if my code sucks). Ping me if you decide to try it out now and have any questions about it!
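For anyone curious what cutout actually does before reading the PR, here is a standalone sketch of the core idea from the paper (plain PyTorch, not the transform API used in the PR):

import torch

def cutout_sketch(img: torch.Tensor, n_holes: int = 1, length: int = 16) -> torch.Tensor:
    "Zero out `n_holes` random `length` x `length` squares of a CHW image tensor."
    c, h, w = img.shape
    for _ in range(n_holes):
        y = int(torch.randint(0, h, (1,)))
        x = int(torch.randint(0, w, (1,)))
        y1, y2 = max(0, y - length // 2), min(h, y + length // 2)
        x1, x2 = max(0, x - length // 2), min(w, x + length // 2)
        img[:, y1:y2, x1:x2] = 0.
    return img

img = cutout_sketch(torch.rand(3, 64, 64))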

2 Likes

Thanks @xnutsive and @sgugger :slight_smile: (Just the letter t is missing at the end of the following phrase in fastai/docs_src/vision.transform.ipynb: “The normalization technique described in this paper: Improved Regularization of Convolutional Neural Networks with Cutou”).

Hello, this message concerns 2 issues with show_batch().

Note: I link it to my previous messages about plot_top_losses(), as the two functions share an issue about the DatasetType they use.

1) DatasetType: only train?

The show_batch() function takes a ds_type argument (i.e. the DatasetType), with DatasetType.Train as the default, right? But in its code (see below), self.train_ds is hard-coded. Does that mean we can’t use show_batch() to display a validation batch (wouldn’t we need self.valid_ds in that case)?

def show_batch(self, rows:int=5, ds_type:DatasetType=DatasetType.Train, **kwargs)->None:
    "Show a batch of data in `ds_type` on a few `rows`."
    x,y = self.one_batch(ds_type, True, True)
    if self.train_ds.x._square_show: rows = rows ** 2
    xs = [self.train_ds.x.reconstruct(grab_idx(x, i)) for i in range(rows)]
    #TODO: get rid of has_arg if possible
    if has_arg(self.train_ds.y.reconstruct, 'x'):
        ys = [self.train_ds.y.reconstruct(grab_idx(y, i), x=x) for i,x in enumerate(xs)]
    else : ys = [self.train_ds.y.reconstruct(grab_idx(y, i)) for i in range(rows)]
    self.train_ds.x.show_xys(xs, ys, **kwargs)

2) When batch size is one (bs=1), show_batch() does not work.

When the batch size is one (bs=1), data.show_batch() (data is an ImageDataBunch) gives the following error, which is expected since the function tries by default to display 5x5=25 images from a train batch:
IndexError: index 1 is out of bounds for dimension 0 with size 1

However, data.show_batch(rows=1), which should display a single image, also gives an error:
TypeError: 'AxesSubplot' object is not iterable

And even when the batch size is > 1, data.show_batch(rows=1) gives the same error.

So the minimum that makes show_batch() work is bs=4 with data.show_batch(rows=2).
How can this issue be solved so that show_batch() works even for bs=1?

Thanks.

Hm, I can look into that tomorrow. I saw that show_batch doesn’t work for small batches (if you do show_batch(1) it’ll work, it just tries to show rows*cols elements, and a batch of 1 doesn’t have enough elements; default rows is 5).

That’s an easy fix, and I could look into valid_ds issue too.

I agree with you. It would be great if you could fix it, thanks.

I think you wanted to write “it will not work”.

Great :slight_smile: Thank you.

1 Like

Yeah, got ahead of myself and sent the reply and then re-read the original message. Not my best Monday /shrug.

Thanks for the detailed investigation. I’ll work on it tomorrow and hopefully get back to you guys with a PR.

1 Like

train_ds is only hard-coded when we are looking at the class of either the inputs or the labels, to call things like reconstruct or show_xys. Those are the same for all the datasets in your DataBunch. The data is actually accessed in the first line, when we call one_batch (and there we pass ds_type).

2 Likes

Thanks Sylvain. You are right.
I just tried data.show_batch(rows=5, ds_type=DatasetType.Valid) and it worked in the pets notebook (bs=64) :slight_smile:

1 Like

Breaking change: to have the same name in ImageDataBunch and TextDataBunch, as well as to avoid the confusion where some people thought it was a CSV separator, sep is now label_delim in the data block API and the ImageDataBunch factory methods.

Docs have been updated accordingly.
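For example (hypothetical path, and a CSV whose label column holds space-delimited multi-labels), what used to be sep=' ' is now:

from fastai.vision import *

path = Path('data/planet')   # placeholder path
data = ImageDataBunch.from_csv(path, label_delim=' ')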

2 Likes

For the last few days I’ve been tracking down the cause of an unrecoverable Out of Memory error and a memory leak on manual interrupt of a notebook run. I first found a workaround for the problem, which I polished for quite a while, only to discard it after digging deeper, finding the cause, and then fixing the cause itself.

So when you get a CUDA OOM and can’t recover from it w/o a restart, or when memory leaks after you hit stop during training, the cause is ipython. It stores the traceback of the exception. The traceback ties up the locals() and they don’t get released until… another exception occurs, at which point the old tb is freed, which allows gc.collect() to do its work. Ouch. It was quite a journey to figure out, and I learned a lot about python along the way.
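If you want to see the mechanism in isolation (nothing fastai- or GPU-specific, just a held traceback keeping a local alive):

import sys
import torch

def alloc_and_fail():
    big = torch.empty(1000, 1000)   # stands in for your GPU tensors
    raise RuntimeError("boom")

try:
    alloc_and_fail()
except RuntimeError:
    tb = sys.exc_info()[2]          # something (e.g. ipython) holds on to the traceback...

# ...and the traceback's frames keep `big` reachable until the tb itself is dropped:
print('big' in tb.tb_next.tb_frame.f_locals)    # True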

I submitted a fix here https://github.com/ipython/ipython/pull/11572 - it seems some tests that compare the exact tb no longer match, but I trust they will figure it out. Imagine that! A one-line fix and now you can OOM as much as you’d like and continue running your notebook! Amazing!

If you want to understand more about the problem, I explained the nuances of the problem of saving a traceback or an exception object here.

Until ipython sorts it out, if you need a solution today, you can either hotfix your installed version of ipython so you can enjoy the change now:

curl https://github.com/ipython/ipython/commit/657cde76ad07ec5b69470758d9bb6adbae88a1da.patch > /tmp/tb-leak-fix.patch
cd $CONDA_PREFIX/lib/python3.7/site-packages/
patch -p1 < /tmp/tb-leak-fix.patch

(adjust the path of course; this is for python 3.7)

Alternatively, here is some magic code for you:

import functools
import sys
import traceback
def get_ref_free_exc_info():
    "Free traceback from references to locals/globals to avoid circular reference leading to gc.collect() unable to reclaim memory"
    type, val, tb = sys.exc_info()
    traceback.clear_frames(tb)
    return (type, val, tb)

def gpu_mem_restore(func):
    "Reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except:
            type, val, tb = get_ref_free_exc_info() # must!
            raise type(val).with_traceback(tb) from None
    return wrapper

Now add before any of your functions:

@gpu_mem_restore
def fit(...

and OOM is now recoverable! And interrupts leak no memory!

Regardless of ipython’s fix this is now part of fastai, so you should be able to see the impact by just using the latest git. At the moment only functions that call fit() are positively affected.

Here is a notebook that demonstrates the OOM w/o the leak and that recovers almost 100% of memory w/o restart, using the current fastai git: https://github.com/fastai/fastai_docs/blob/master/dev_nb/mem_leaks/OOM_on_fit_recover.ipynb

And if you want to protect just a few lines of code, here is a context manager that does the same:

class gpu_mem_restore_ctx():
    " context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None

So now you can do:

with gpu_mem_restore_ctx():
    learn.fit_one_cycle(1,1e-2)

with the same results. Since the fit functions are already protected, this is more useful for your own custom code.

Both functions are now in https://github.com/fastai/fastai/blob/master/fastai/utils/mem.py so you will just need to from fastai.utils.mem import * before you can use them.

BTW, another workaround is to throw another exception following the OOM exception:

# cell1 - if this leads to OOM leak
learn.fit_one_cycle(1,1e-2)
# cell 2 - this will release the memory, since it will reset %tb and free its locals()
assert False, "please liberate my GPU!"

If you want a more exact version that only recovers from OOM (the problem remains for any other exception), it’d be:

import gc
import torch

def gpu_mem_restore(func):
    "Reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        oom_exc = False
        try:
            return func(*args, **kwargs)
        except RuntimeError as e:
            if "CUDA out of memory" in str(e):
                oom_exc = True
                type, val, tb = get_ref_free_exc_info() # must!
                raise type(val).with_traceback(tb) from None
            else: raise # re-raises the exact last exception
        except: raise # any other types of errors
        finally:
            if oom_exc:
                # reclaim memory
                gc.collect()
                if torch.cuda.is_available(): torch.cuda.empty_cache()
    return wrapper

(need to include the KeyboardInterrupt type in there too)

If you encounter any related issues you can discuss those here: A guide to recovering from CUDA Out of Memory and other exceptions

40 Likes

@pierreguillou, ugh, show_batch(1) had a weird bug in there, seems like a regression. I think I fixed it here: https://github.com/fastai/fastai/pull/1498

Also fixed trying to show_batch(10) on smaller batch sizes.

2 Likes