Developer chat


#682

We don’t have any workaround yet, no.


(Andrew Ayres) #683

I’m experiencing the same error.

I followed the pytorch.org instructions to install from source, in order to use CUDA on macOS (with an eGPU-hosted NVIDIA GPU). It seems CUDA is being used OK:

In:
import torch
torch.cuda.set_device(0)
torch.cuda.is_available()
Out: True

but when I try to run the notebook cell:

data.show_batch(rows=3, figsize=(7,6))

in the lesson1-pets.ipynb notebook, I get the following RuntimeError:

fastai/vision/transform.py", line 194, in _find_coeffs
return torch.gesv(B,A)[0][:,0]
RuntimeError: B should have at least 2 dimensions, but has 1 dimensions instead

the exception was raised here:

/torch/utils/data/dataloader.py(541)_process_next_batch()
539 self._put_indices()
540 if isinstance(batch, _utils.ExceptionWrapper):
--> 541 raise batch.exc_type(batch.exc_msg)
542 return batch

I’m running TORCH_VERSION 1.1.0. You mentioned the bug was fixed in the latest version of PyTorch. When I check for the latest PyTorch release at https://github.com/pytorch/pytorch/releases, I see it’s listed as v1.0.0, released on 7 Dec 2018, so I’m not sure how I ended up with v1.1.0 by cloning with:

git clone --recursive https://github.com/pytorch/pytorch

When I begin install, I see

Building wheel torch-1.1.0a0+04b8a2f

which seems to correspond to the 1.1.0a0 version number in https://github.com/pytorch/pytorch/blob/master/setup.py, merged two days ago.

If you have any advice about how I could get this error resolved, I’d appreciate it.


#684

I meant it has been fixed in fastai master. So with v1.0.41 you shouldn’t have that bug.


Lesson 1 throwing error in ImageDataBunch stage in windows
(Andrew Ayres) #685

Thanks Sylvain! All working now :slight_smile: (after updating Spacy to v2.0.18, then fastai to v1.0.41).


(Kerem Turgutlu) #686

Actually I am still getting OOM. I followed the steps above: started a fresh kernel, set an enormous batch size, and got OOM. Am I missing something? Thanks!


(Stas Bekman) #687

In all the excitement it’s easy to miss the point of this discovery. Nobody can eliminate OOM situations until someone comes out with a bottomless card.

So you will still have just as many OOM events as before. The difference is that now you can recover from them without needing to restart the notebook: just reduce the bs (or other parameters) and re-run the same cell.
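That recover-and-retry workflow can also be automated. A minimal sketch, not fastai API: `fit_with_backoff` and `toy_train` are hypothetical names, and the toy function just simulates a card that can only fit batches of 16.

```python
def fit_with_backoff(train, bs, min_bs=1):
    """Retry `train` at progressively smaller batch sizes after OOM.

    `train` is any callable that raises RuntimeError("... out of memory ...")
    when `bs` doesn't fit on the card; this mirrors re-running the cell
    by hand with a smaller bs.
    """
    while bs >= min_bs:
        try:
            return train(bs)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise          # a real bug, not an OOM: don't swallow it
            bs //= 2           # halve the batch size and try again
    raise RuntimeError("could not fit even the minimum batch size")

# toy stand-in for a training loop: pretend only bs <= 16 fits in memory
def toy_train(bs):
    if bs > 16:
        raise RuntimeError("CUDA out of memory (simulated)")
    return bs

print(fit_with_backoff(toy_train, bs=256))  # 256 -> 128 -> 64 -> 32 -> 16
```

Of course this only works if the OOM exception doesn't itself leak the memory it needs, which is exactly what the traceback fix discussed in this thread is about.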

I will be writing proper documentation shortly. I’m just finishing up some improvements to the code we use in fastai.

Besides, if you’re using fastai, you don’t need to patch ipython. Just use the fastai git master version and the workaround is already there (for fit() functions at the moment).


(Kerem Turgutlu) #688

Yep, I intended to use it for the fit methods. I re-ran the cells, but I will update fastai and try again. Thanks a lot!


(Joseph Catanzarite) #689

In my opinion, making code concise is not as important as making it readable.


(Stas Bekman) #690

OK, so here is an update on this story.

It’ll take ipython time to sort it out, because my patch can’t be applied as is (it would break the %debug magic), so they will have to make it configurable. Let’s see how and when it gets resolved. In particular, we need a simple magic to reset %tb.

Meanwhile, fastai (git master) has been instrumented with the following features that will provide you a solution to this problem today:

  1. under non-ipython environment it doesn’t do anything special
  2. under ipython it strips the tb by default only for the “CUDA out of memory” exception, i.e. the %debug magic will work under all circumstances but this one, and memory will leak on those other exceptions until the tb is reset
  3. the env var FASTAI_TB_CLEAR_FRAMES changes this behavior when run under ipython, depending on its value:
  • “0”: never strip tb (makes it possible to always use %debug magic, but with leaks)
  • “1”: always strip tb (never need to worry about leaks, but %debug won’t work)

where ipython == ipython/ipython-notebook/jupyter-notebook
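The leak these options trade off can be illustrated in plain Python: a saved traceback keeps every frame's locals alive, including any GPU tensors, until the traceback itself is dropped. A toy sketch, where a plain object stands in for a tensor and a weakref lets us observe when it is actually freed:

```python
import gc
import weakref

class FakeTensor:
    """Stand-in for a large CUDA tensor held in a local variable."""

def train_step():
    batch = FakeTensor()                   # local, like a batch on the GPU
    train_step.probe = weakref.ref(batch)  # lets us check if `batch` is alive
    raise RuntimeError("CUDA out of memory (simulated)")

saved = None
try:
    train_step()
except RuntimeError as e:
    saved = e              # like ipython keeping the traceback around for %debug

gc.collect()
print(train_step.probe() is not None)  # True: the tb's frame still holds `batch`

saved = None                           # drop the exception and its traceback
gc.collect()
print(train_step.probe() is None)      # True: `batch` has now been freed
```

This is why stripping the tb (or resetting it) releases the memory: the exception object's `__traceback__` chain is what pins the dead frames and everything they reference.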

At the moment we are only doing this for the fit() family of functions. If you find other fastai APIs that need this, please let us know.

You can set os.environ['FASTAI_TB_CLEAR_FRAMES']="0" (or "1") in your code or from the shell when you start jupyter.

Let me know whether I have missed any special cases, so that we get this sorted out before we release 1.0.42.

Of course, all the tricks posted in my original message still apply.

I will end this with an easy-to-remember tip: if everything else fails, or perhaps you’re not using fastai and you can’t recover from OOM in your notebook, just run a cell with this content:

1/0

and you should be back in the game w/o needing to restart the kernel.

This whole subject matter is now documented here: https://docs.fast.ai/troubleshoot.html#memory-leakage-on-exception

If you encounter any related issues you can discuss those here: A guide to recovering from CUDA Out of Memory and other exceptions


(Pierre Guillou) #691

(fastai version: 1.0.42.dev0) Problems with verify_images().

For example, verify_images(path_to_images_folder, delete=True) does not delete non-images (for example files with extensions .php, .axd, …) and prints no warning about them.


(Nate Gadzhibalaev) #692

It should not be the default behavior for verify_images(delete=True) because, well, imagine you stored some .txt files alongside your images and they just vanish; the function name doesn’t make that obvious.

verify_images() only handles image files (by mime types), and ignores the rest.

I’ll make a pull request with another function that deletes all non-images from a folder; it could be invoked via a flag in verify_image or separately.


(Pierre Guillou) #693

About the issue you pointed to: I do not think verify_images() can delete a file which is not an image, thanks to files = get_image_files(path) (code).

However, the point I focused on was that verify_images() does not work well in its current form.


(Nate Gadzhibalaev) #694

Wait, so you don’t want to delete non-images from your dataset dirs; you just want a warning if there are any?

You can always do print([f for f in path.rglob('*') if f.suffix not in images_suffixes]). It’s one line to see what’s in your data; I’m just not 100% sure fastai has to have this internally. :wink:
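Spelled out as a self-contained helper, that check might look like this. The suffix set here is a hand-picked assumption for illustration, not the list fastai itself uses, and `non_images` is my own name:

```python
from pathlib import Path

# assumed set of common image suffixes; fastai derives its list differently
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def non_images(path):
    """Return every regular file under `path` whose suffix is not an image suffix."""
    return [f for f in Path(path).rglob("*")
            if f.is_file() and f.suffix.lower() not in IMAGE_SUFFIXES]

# usage (hypothetical dataset folder):
# for f in non_images("data/pets"):
#     print(f)
```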


(Pierre Guillou) #695

Sorry, you’re right that my last message was strange :-). Let me rephrase: I want to fix the verify_images() code because I think it does not work well. That is, I do want verify_images() to delete non-image files from my dataset folders (like xxx.php, xxx.html, xxx.axd, or any images that cannot be opened), as that is the objective of this function (while keeping all the current arguments such as max_size, dest, etc.). If you think we need to take care not to delete, for example, xxx.txt files, I agree with you.


(Stas Bekman) #696

Maybe the easiest solution, which would also be the safest, is not to delete anything at all, but to move all unfit files into a subdirectory? Then the user can make the final decision using filesystem tools, where they can review what they want to delete.
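A rough sketch of that move-instead-of-delete idea; `quarantine_non_images`, the subfolder name, and the suffix set are all my own choices here, not fastai API:

```python
import shutil
from pathlib import Path

# assumed set of image suffixes, for illustration only
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def quarantine_non_images(path, subdir="unfit"):
    """Move every non-image file directly under `path` into `path/subdir`
    instead of deleting it, so the user can review before removing anything."""
    path = Path(path)
    dest = path / subdir
    moved = []
    for f in sorted(path.iterdir()):      # non-recursive: leaves `dest` alone
        if f.is_file() and f.suffix.lower() not in IMAGE_SUFFIXES:
            dest.mkdir(exist_ok=True)
            shutil.move(str(f), str(dest / f.name))
            moved.append(f.name)
    return moved
```

Nothing is destroyed: a mistyped directory costs you a review of the "unfit" subfolder, not your data.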

If you made a mistake and mistyped a directory, you could cause huge damage if the function is instrumented to delete any non-image files, regardless of how many exceptions you make (.txt files, etc.).


(Nate Gadzhibalaev) #697

I agree: if fastai’s code should move or delete unnecessary files at all, it should move them to a directory.

However, I’m not sure fastai should provide a way to delete anything; extra files don’t hurt training or serving, right?


(Stas Bekman) #698

If I understand correctly, the idea was to delete broken images so that they won’t cause problems at training time. But since we’re now talking about other files too, it’s probably safer not to delete any files: move broken images into another folder along with any other non-image files, so that training is unaffected and no data is lost.

The only “destructive” thing I personally added to the verify_image code is removing invalid EXIF headers, which again may not be the best approach. Perhaps the original file should be moved into the “unfit” subfolder and a copy without the broken EXIF header left in the main folder with the other good images.


(Pierre Guillou) #699

I agree with you, Stas and Nate: better to move non-image files, corrupted image files, and image files that don’t open to a subfolder.


(Stas Bekman) #700

This little tidbit might be of practical interest to you:
https://docs.fast.ai/dev/gpu.html#peak-memory-usage
The pytorch forum thread it links to is full of excellent info.


(Karanbir) #701

Is anyone developing a LARS implementation for distributed learning in PyTorch? Frameworks such as Horovod use all-reduce and LARS to scale out to batch sizes of around 64,000. I would like to start a LARS and all-reduce implementation if the community thinks it is important.
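For reference, the core of LARS (layer-wise adaptive rate scaling) is a per-layer trust ratio, ||w|| / (||g|| + wd·||w||), that rescales the global learning rate for each layer. A minimal pure-Python sketch of one update step for a single layer, ignoring momentum and not tied to any framework; all names here are my own:

```python
import math

def lars_step(weights, grads, lr=0.1, eta=0.001, weight_decay=1e-4):
    """One LARS update for one layer, on plain lists of floats.

    local_lr = lr * eta * ||w|| / (||g|| + weight_decay * ||w||)
    w <- w - local_lr * (g + weight_decay * w)
    """
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm > 0 and g_norm > 0:
        local_lr = lr * eta * w_norm / (g_norm + weight_decay * w_norm)
    else:
        local_lr = lr                       # fall back to the global rate
    return [w - local_lr * (g + weight_decay * w)
            for w, g in zip(weights, grads)]

# toy layer: the trust ratio shrinks the effective step when gradients
# are large relative to the weights, which is what makes huge batches stable
print(lars_step([1.0, -2.0], [100.0, -50.0]))
```

A distributed version would apply this per layer after the all-reduce of the gradients, which is the part Horovod-style frameworks provide.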