Since there was a lot of confusion, in DataBunch I’ve renamed the tfms argument to dl_tfms (people often mistook it for ds_tfms in computer vision).
I have built a small version of beam search that seems promising. In the process, I looked carefully at the LanguageLearner.predict() method, and I am not sure whether this is a bug or I am misunderstanding how it works.
When you call predict(), you begin with an initial self.model.reset() that sets the hidden states to zero. Then you pass through the sample text and append a new token each time to your list of generated tokens. However, your text is now the full set of tokens you have generated from the start, but you have not reset the state, so you are predicting from the end of the last prediction’s state.
What am I missing here?
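To make the concern concrete, here is a toy, standalone sketch (not fastai code; CountingModel is invented purely for illustration) of what happens when the full prefix is re-fed on every step while the state is never reset in between:

```python
# Toy model whose "hidden state" simply counts tokens it has consumed.
class CountingModel:
    def __init__(self): self.hidden = 0
    def reset(self): self.hidden = 0
    def forward(self, tokens): self.hidden += len(tokens)

m = CountingModel()
text = [1, 2, 3]           # the initial sample text
m.reset()                  # state zeroed once, at the start
for step in range(3):
    m.forward(text)        # full prefix re-fed, state NOT reset in between
    text = text + [0]      # append the newly "predicted" token
print(m.hidden)            # 3 + 4 + 5 = 12: early tokens counted repeatedly
```

If the intent were to condition only on tokens not yet seen, the state would either be reset before each forward pass or only the new token would be fed each step.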
I think we should add the parameter cut to unet_learner to be able to use custom models.
The current signature of the function is:
unet_learner(data:DataBunch, arch:Callable, pretrained:bool=True, blur_final:bool=True, norm_type:Optional[NormType]=NormType, split_on:Union[Callable, Collection[ModuleList], NoneType]=None, blur:bool=False, self_attention:bool=False, y_range:OptRange=None, last_cross:bool=True, bottle:bool=False, **kwargs:Any)
I propose unet_learner(data:DataBunch, arch:Callable, pretrained:bool=True, blur_final:bool=True, norm_type:Optional[NormType]=NormType, split_on:Optional[SplitFuncOrIdxList]=None, blur:bool=False, self_attention:bool=False, y_range:Optional[Tuple[float,float]]=None, last_cross:bool=False, bottle:bool=False, cut:Union[int,Callable]=None, **kwargs:Any)->None:
and pass the parameter to create_body, as in create_cnn.
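As a standalone sketch of the idea (mock functions operating on a plain layer list; this is not fastai’s actual implementation), cut would simply be threaded through to the body-creation step:

```python
def create_body_mock(arch_layers, cut=None):
    "Cut the model's layer list at `cut` (int index or callable), mimicking create_body."
    if callable(cut): return cut(arch_layers)
    if isinstance(cut, int): return arch_layers[:cut]
    return arch_layers

def unet_learner_mock(arch_layers, cut=None, **kwargs):
    "Pass `cut` straight through to the body builder, as create_cnn does."
    return create_body_mock(arch_layers, cut=cut)

layers = ['conv1', 'bn1', 'relu', 'layer1', 'layer2', 'fc']
print(unet_learner_mock(layers, cut=-1))  # drops the final 'fc' layer
```

The point is only the plumbing: unet_learner accepts cut and forwards it unchanged, so a custom architecture can decide where its encoder ends.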
I found this setup to debug PyTorch memory leaks on the Pyro forums: https://forum.pyro.ai/t/a-clever-trick-to-debug-tensor-memory/556
Maybe this is interesting for the library development.
That’s a nice version, @MicPie! Except it’s incomplete, it should be merged with this version: https://discuss.pytorch.org/t/how-to-debug-causes-of-gpu-memory-leaks/6741/24
We should put it somewhere in the docs for sure.
If you find other goodies please share!
Hi @sgugger, cutout for data augmentation was implemented in the previous fastai (before v1) but not in v1. Do you plan to add it in vision.transform, or is this not a relevant technique that will not be implemented? Thanks.
We just forgot. Will implement it when I have a bit of time next week, send me a PM if I forget!
Thanks Sylvain !
@sgugger @pierreguillou I was procrastinating (instead of training LMs) and implemented it: https://github.com/fastai/fastai/pull/1489
Figured Sylvain’s busy with text stuff and I’ll help out a bit.
@pierreguillou, you can test if this works for you by using my fork, or wait till it’s merged (or till Sylvain implements it himself if my code sucks). Ping me if you decide to try it out now and have any questions about it!
Thanks @xnutsive and @sgugger. (Just the letter t is missing at the end of the following phrase in fastai/docs_src/vision.transform.ipynb: “The normalization technique described in this paper: Improved Regularization of Convolutional Neural Networks with Cutou”.)
Hello, this message concerns two issues with show_batch().
Note: I link it to my previous messages about plot_top_losses(), as they share one issue about the DatasetType used by these two functions.
1) DatasetType: only train ?
The function show_batch() takes the argument ds_type (i.e., the DatasetType) and has DatasetType.Train as default, right? But in its code (see below), self.train_ds is hard-coded. Does that mean we can’t use show_batch() to display a validation batch (we would need self.valid_ds in this case, no)?
def show_batch(self, rows:int=5, ds_type:DatasetType=DatasetType.Train, **kwargs)->None:
    "Show a batch of data in `ds_type` on a few `rows`."
    x,y = self.one_batch(ds_type, True, True)
    if self.train_ds.x._square_show: rows = rows ** 2
    xs = [self.train_ds.x.reconstruct(grab_idx(x, i)) for i in range(rows)]
    #TODO: get rid of has_arg if possible
    if has_arg(self.train_ds.y.reconstruct, 'x'):
        ys = [self.train_ds.y.reconstruct(grab_idx(y, i), x=x) for i,x in enumerate(xs)]
    else: ys = [self.train_ds.y.reconstruct(grab_idx(y, i)) for i in range(rows)]
    self.train_ds.x.show_xys(xs, ys, **kwargs)
2) When batch size is one (bs=1), show_batch() does not work.
When batch size is one (bs=1), data.show_batch() (data is an ImageDataBunch) gives the following error, which is expected, as the function tries by default to display 5x5=25 images from a train batch:
IndexError: index 1 is out of bounds for dimension 0 with size 1
However, data.show_batch(rows=1), which should display 1 image, gives an error as well:
TypeError: 'AxesSubplot' object is not iterable
And even if the batch size is > 1, data.show_batch(rows=1) gives the same error.
So the minimum that makes show_batch() work is bs=4 with data.show_batch(rows=2).
How can we solve this issue and make show_batch() work even for bs=1?
Thanks.
Hm, I can look into that tomorrow. I saw that show_batch doesn’t work for small batches (if you do show_batch(1) it’ll work, it just tries to show rows*cols elements, and a batch of 1 doesn’t have enough elements; default rows is 5).
That’s an easy fix, and I could look into valid_ds issue too.
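One way such a fix could look (an assumption on my part, not the actual PR): clamp the number of displayed items to the batch size before reconstructing anything.

```python
def n_items_to_show(rows, batch_size, square_show=True):
    "How many items show_batch could safely display for a given batch size."
    n = rows ** 2 if square_show else rows  # image data shows a rows x rows grid
    return min(n, batch_size)               # never ask for more than the batch holds

print(n_items_to_show(5, 1))   # bs=1  -> show a single image instead of 25
print(n_items_to_show(2, 64))  # bs=64 -> the usual rows**2 = 4 grid
```

With such a clamp, both bs=1 and show_batch(rows=1) would degrade gracefully instead of indexing past the end of the batch.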
I agree with you. Thanks if you can fix it.
I think you wanted to write “it will not work”.
Great Thank you.
Yeah, got ahead of myself and sent the reply and then re-read the original message. Not my best Monday /shrug.
Thanks for the detailed investigation. I’ll work on it tomorrow and get back to you guys with a PR, hopefully.
train_ds is only hard-coded when we are looking at the class of either the inputs or the labels, to call things like reconstruct or show_xys. Those are the same for all the datasets in your DataBunch. The data is actually accessed in the first line, when we call one_batch (and there we pass ds_type).
Thanks Sylvain. You are right.
I just tried data.show_batch(row=5, ds_type=DatasetType.Valid) and it worked in the pets notebook (bs=64).
Breaking change: to have the same name in ImageDataBunch and TextDataBunch, as well as to avoid the confusion where some people thought it was a csv separator, sep is now label_delim in the data block API and the ImageDataBunch factory methods.
Docs have been updated accordingly.
For the last few days I’ve been tracking the cause of an unrecoverable out-of-memory error and a memory leak on manual interrupt of a notebook run. I first found a solution to the problem, which I polished for quite a while, only to discard it after digging deeper, finding the cause, and then fixing the cause.
So when you get CUDA OOM and you can’t recover from it without a restart, or when memory leaks when you hit stop during training, the cause is ipython. It stores the traceback of the exception. The traceback ties up the locals() and they don’t get released until… another exception occurs, at which point it frees up the old traceback, which allows gc.collect() to do its work. Ouch. It was quite a journey to figure it out, and I learned a lot about python on the way.
I submitted a fix here: https://github.com/ipython/ipython/pull/11572 - it seems some tests that compare the exact traceback no longer match, but I trust they will figure it out. Imagine that! A one-line fix, and now you can OOM as much as you’d like and continue running your notebook! Amazing!
If you want to understand more about the problem, I explained the nuances of the problem of saving a traceback or an exception object here.
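Here is a minimal standalone demo of the mechanism (plain python, no GPU needed): a stored traceback keeps the raising frame’s locals reachable until the frames are explicitly cleared.

```python
import sys, traceback

def leaky():
    big = [0] * 1_000_000  # stands in for the GPU tensors held in locals
    raise RuntimeError("boom")

try:
    leaky()
except RuntimeError:
    _, _, tb = sys.exc_info()
    inner = tb.tb_next.tb_frame            # the frame of leaky()
    held_before = 'big' in inner.f_locals  # the big list is still referenced
    traceback.clear_frames(tb)             # the fix: drop those references
    held_after = 'big' in inner.f_locals   # gone: gc can now reclaim it
    print(held_before, held_after)
```

This is exactly what ipython does on a larger scale when it stashes the last traceback: as long as it holds the traceback, everything in the failing frames (including GPU tensors) stays alive.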
Until ipython sorts it out, if you need a solution today, you can hotfix your installed version of ipython so you can enjoy the change now:
curl https://github.com/ipython/ipython/commit/657cde76ad07ec5b69470758d9bb6adbae88a1da.patch > /tmp/tb-leak-fix.patch
cd $CONDA_PREFIX/lib/python3.7/site-packages/
patch -p1 < /tmp/tb-leak-fix.patch
(adjust the path, of course; this is for python 3.7)
Alternatively, here is some magic code for you:
import functools
import sys
import traceback

def get_ref_free_exc_info():
    "Free traceback from references to locals/globals to avoid circular reference leading to gc.collect() unable to reclaim memory"
    type, val, tb = sys.exc_info()
    traceback.clear_frames(tb)
    return (type, val, tb)

def gpu_mem_restore(func):
    "Reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except:
            type, val, tb = get_ref_free_exc_info() # must!
            raise type(val).with_traceback(tb) from None
    return wrapper
Now add this before any of your functions:
@gpu_mem_restore
def fit(...
and OOM is now recoverable! And interrupts leak no memory!
Regardless of ipython’s fix, this is now part of fastai, so you should be able to see the impact just by using the latest git. At the moment, only functions that call fit() are positively affected.
Here is a notebook that demonstrates the OOM w/o the leak and that recovers almost 100% of memory w/o restart, using the current fastai git: https://github.com/fastai/fastai_docs/blob/master/dev_nb/mem_leaks/OOM_on_fit_recover.ipynb
And if you want to protect just a few lines of code, here is a context manager that does the same:
class gpu_mem_restore_ctx():
    "Context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None
So now you can do:
with gpu_mem_restore_ctx():
    learn.fit_one_cycle(1,1e-2)
with the same results. Except that the fit functions are already protected, so this one is more useful for your custom code.
Both functions are now in https://github.com/fastai/fastai/blob/master/fastai/utils/mem.py, so you will just need from fastai.utils.mem import * before you can use them.
BTW, another workaround is to throw another exception following the OOM exception:
# cell1 - if this leads to OOM leak
learn.fit_one_cycle(1,1e-2)
# cell 2 - this will release the memory, since it will reset %tb and free its locals()
assert False, "please liberate my GPU!"
If you want a more exact version that only recovers from OOM, while the problem remains for any other exception, it’d be:
import gc
import torch

def gpu_mem_restore(func):
    "Reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        oom_exc = False
        try:
            return func(*args, **kwargs)
        except RuntimeError as e:
            if "CUDA out of memory" in str(e):
                oom_exc = True
                type, val, tb = get_ref_free_exc_info() # must!
                raise type(val).with_traceback(tb) from None
            else: raise # re-raises the exact last exception
        except: raise # any other types of errors
        finally:
            if oom_exc:
                # reclaim memory
                gc.collect()
                if torch.cuda.is_available(): torch.cuda.empty_cache()
    return wrapper
(need to include the KeyboardInterrupt type in there too)
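A hedged sketch of how that extension could look (my own variation, not the shipped fastai code): catch KeyboardInterrupt alongside the OOM RuntimeError and free the traceback’s frames in both cases.

```python
import functools, sys, traceback

def gpu_mem_restore_sketch(func):
    "Free the traceback's locals on CUDA OOM or a manual interrupt (sketch)."
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except (KeyboardInterrupt, RuntimeError) as e:
            if isinstance(e, RuntimeError) and "CUDA out of memory" not in str(e):
                raise  # unrelated RuntimeError: re-raise untouched
            etype, val, tb = sys.exc_info()
            traceback.clear_frames(tb)  # drop refs to locals held by the traceback
            raise etype(val).with_traceback(tb) from None
    return wrapper

@gpu_mem_restore_sketch
def interrupted_fit():
    raise KeyboardInterrupt("stopped by user")
```

With this, hitting stop in a notebook mid-training would re-raise KeyboardInterrupt as usual, but without the traceback pinning the training loop’s tensors.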
If you encounter any related issues you can discuss those here: A guide to recovering from CUDA Out of Memory and other exceptions
@pierreguillou, ugh, show_batch(1) had a weird bug in there; seems like a regression. I think I fixed it here: https://github.com/fastai/fastai/pull/1498
Also fixed trying to show_batch(10) on smaller batch sizes.