We don’t have any workaround yet, no.
I’m experiencing the same error.
I followed the pytorch.org instructions to install from source in order to use CUDA on macOS (with an eGPU-hosted NVIDIA GPU). CUDA itself seems to be working fine, but when I run a cell in the lesson1-pets.ipynb notebook I get the following RuntimeError:
```
  File "fastai/vision/transform.py", line 194, in _find_coeffs
RuntimeError: B should have at least 2 dimensions, but has 1 dimensions instead
```
the exception was raised here:

```
    540 if isinstance(batch, _utils.ExceptionWrapper):
--> 541     raise batch.exc_type(batch.exc_msg)
    542 return batch
```
I’m running TORCH_VERSION 1.1.0. You mentioned the bug was fixed in the latest version of PyTorch. When I check for the latest PyTorch release at https://github.com/pytorch/pytorch/releases, I see it’s listed as v1.0.0, released on 7th Dec 2018, so I’m not sure how I ended up with v1.1.0 by cloning with:

```
git clone --recursive https://github.com/pytorch/pytorch
```
When I begin the install, I see:

```
Building wheel torch-1.1.0a0+04b8a2f
```

which seems to correspond to the 1.1.0a0 version number in https://github.com/pytorch/pytorch/blob/master/setup.py, merged two days ago.
If you have any advice about how I could get this error resolved, I’d appreciate it.
I meant it has been fixed in fastai master. So with v1.0.41 you shouldn’t have that bug.
Lesson 1 throwing an error at the ImageDataBunch stage on Windows
Thanks Sylvain! All working now (after updating Spacy to v2.0.18, then fastai to v1.0.41).
Actually, I am still getting OOM. I followed the steps above: started a fresh kernel, set an enormous batch size, and got OOM. Am I missing something? Thanks!
In all the excitement it’s easy to miss the point of this discovery. Nobody can eliminate the OOM situation until someone comes out with a bottomless card.
So you will still have just as many OOM events as before. The difference is that now you can recover from them without needing to restart the notebook. Now you can just reduce the batch size (or other parameters) and re-run the same cell.
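To make the recovery pattern concrete, here is a minimal sketch. The `fit_with_bs` function is a stand-in I made up to simulate a training call that OOMs above a certain batch size; it is not fastai API. The point is the shape of the loop: catch the "CUDA out of memory" `RuntimeError`, shrink `bs`, and re-run, which only works once the traceback no longer pins GPU memory:

```python
def fit_with_bs(bs):
    # Stand-in for a real training call (e.g. learn.fit()); here we just
    # pretend anything above bs=32 exhausts the card.
    if bs > 32:
        raise RuntimeError("CUDA out of memory")
    return f"trained with bs={bs}"

bs = 256
while True:
    try:
        result = fit_with_bs(bs)
        break
    except RuntimeError as e:
        # Only handle OOM; re-raise anything else.
        if "out of memory" not in str(e):
            raise
        bs //= 2  # halve the batch size and retry in the same session
print(result)
```

In a real notebook you would typically just edit the cell and re-run it by hand rather than loop, but the mechanism is the same.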
I will be writing proper documentation shortly. I’m just finishing up some improvements to the code we use in fastai.
Besides, if you’re using fastai, you don’t need to patch ipython. Just use the fastai git master version and the workaround is already there (for fit() functions at the moment).
Yep, I intended to use it for the fit methods. I re-ran the cells, but I will update fastai and try again. Thanks a lot!
In my opinion, making code concise is not as important as making it readable.
OK, so here is an update on this story.
It’ll take ipython time to sort it out, because my patch can’t be applied as is since it’ll break %debug magic, so they will have to make it configurable - let’s see how and when it gets resolved. In particular we need a simple magic to reset %tb.
Meanwhile, fastai (git master) has been instrumented with the following features that will provide you with a solution to this problem today:
- under a non-ipython environment it doesn’t do anything special
- under ipython it strips tb by default only for the “CUDA out of memory” exception, i.e. %debug magic will work under all circumstances but this one, and it’ll leak memory in all of those until tb is reset
- The env var FASTAI_TB_CLEAR_FRAMES changes this behavior when run under ipython, depending on its value:
  - “0”: never strip tb (makes it possible to always use %debug magic, but with leaks)
  - “1”: always strip tb (never need to worry about leaks, but %debug won’t work)
where ipython == ipython/ipython-notebook/jupyter-notebook
At the moment we are only doing this for the fit() family of functions. If you find other fastai API needing this please let us know.
You can set FASTAI_TB_CLEAR_FRAMES (e.g. to "1") in your code or from the shell when you start jupyter.
Let me know whether I have missed any special cases, so that we have that one sorted out before we release 1.0.42.
Of course, all the tricks posted in my original message still apply.
I will end this with an easy-to-remember tip: if everything else fails, or perhaps you’re not using fastai and you can’t recover from OOM in your notebook, just run a cell with this content:

and you should be back in the game without needing to restart the kernel.
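The content of that cell was left out above; a common recipe for reclaiming leaked GPU memory in a live kernel (my assumption of what such a cell would contain, not necessarily the exact one from the original post) is to force garbage collection and then ask PyTorch to release its cached blocks:

```python
import gc

# Drop any unreachable Python objects that may still pin GPU tensors
# (e.g. tensors referenced only by a stored traceback).
gc.collect()

# If PyTorch is present and a GPU is available, return cached memory
# blocks to the driver so other allocations can succeed.
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
except ImportError:
    pass
```

Note that `empty_cache()` frees only PyTorch's *cached* (unused) blocks; tensors that are still referenced somewhere remain allocated, which is why the `gc.collect()` comes first.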
This whole subject matter is now documented here: https://docs.fast.ai/troubleshoot.html#memory-leakage-on-exception
If you encounter any related issues you can discuss those here: A guide to recovering from CUDA Out of Memory and other exceptions
(fastai version: 1.0.42.dev0) Problems with verify_images().
verify_images(path_to_images_folder, delete=True) does not delete non-images (for example files with extensions php, axd…) and prints no warning about the non-images.
Deleting should not be the default behavior of verify_images(delete=True), because imagine you stored some .txt files along with your images and they just vanish; the function name doesn’t make that obvious.
verify_images() only handles image files (by mime types), and ignores the rest.
I’ll make a pull request with another function to delete all non-images from a folder, which can be invoked via a flag in verify_image or separately.
About the issue you pointed to: I do not think verify_images() can delete a file which is not an image, thanks to files = get_image_files(path) (code).
However, the point I focused on was that verify_images() does not work well with its current code.
Wait, so you don’t want to delete non-images from your dataset dirs, you just want a warning if there are any non-images?
You can always do print([f for f in path.rglob('*') if f.suffix not in images_suffixes]). It’s one line to see what’s in your data; I’m just not 100% sure fastai has to have this internally.
Sorry, you’re right that my last message was strange :-). Let me rephrase: I want to correct the verify_images() code because I think it does not work well. That means I do want to delete non-image files from my dataset folders through verify_images() (like xxx.php, xxx.html, xxx.axd…, or any images that cannot be opened), as that is the objective of this function (while keeping all current arguments such as max_size, dest, etc.). If you think we need to take care not to delete, for example, xxx.txt files, I agree with you.
Maybe the easiest and safest solution is not to delete anything at all, but to move all unfit files into a subdirectory? Then the user can make the final decision using filesystem tools, where they can review what they want to delete.
If you made a mistake and mistyped a directory, you could end up causing huge damage if the function is instrumented to delete any non-image files, regardless of how many exceptions you make (e.g. .txt files).
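The "move, don't delete" idea could be sketched as follows. `move_unfit_files` is a hypothetical helper I made up for illustration, not part of fastai; it classifies files by mime type (guessed from the extension) and relocates anything that isn't an image into an `unfit` subfolder for manual review:

```python
import mimetypes
import shutil
from pathlib import Path

def move_unfit_files(path, subdir="unfit"):
    """Move every non-image file in `path` into `path/subdir`.

    Nothing is deleted; the user reviews the subfolder and decides.
    Returns the list of moved file names.
    """
    path = Path(path)
    dest = path / subdir
    moved = []
    for f in sorted(path.iterdir()):
        if not f.is_file():
            continue  # skip subdirectories, including `dest` itself
        mime, _ = mimetypes.guess_type(f.name)
        if mime is None or not mime.startswith("image/"):
            dest.mkdir(exist_ok=True)
            shutil.move(str(f), str(dest / f.name))
            moved.append(f.name)
    return moved
```

A real version would probably also try to open each image (as verify_images does) and move unreadable ones too; this sketch only covers the extension-based case.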
I agree: if fastai’s code should move or delete unnecessary files at all, it should move them to a directory.
However, I’m not sure fastai should provide a way to delete anything; extra files don’t hurt training or serving, right?
If I understand it correctly, the idea was to delete broken images so that these won’t cause problems at training time. But since you’re now talking about other files too, it’s probably safer not to delete any files, and instead to move broken images (along with any other non-image files) to another folder, so that training is unaffected and no data is lost.
The only “destructive” thing I personally added to the verify_image code is removing invalid EXIF headers, which again may not be the best thing. Perhaps the original file should be moved into the “unfit” sub-folder and a copy without the broken EXIF header left in the main folder with the other good images.
I agree with you, Stas and Nate: better to move non-image files, corrupted image files, and image files that don’t open to a subfolder.
This little tidbit might be of practical interest to you: the pytorch forum thread it links to is full of excellent info.
Is anyone developing a LARS implementation for distributed learning in PyTorch? Frameworks such as Horovod use all-reduce, and with LARS they can scale out to batch sizes of around 64,000. I would like to start a LARS and all-reduce implementation if the community thinks it is important.
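For anyone unfamiliar with LARS (Layer-wise Adaptive Rate Scaling, You et al. 2017): each layer gets a local learning-rate multiplier, the trust ratio λ = η · ‖w‖ / (‖∇w‖ + β‖w‖), where w is that layer's weights, ∇w its gradient, η a trust coefficient, and β the weight decay. Below is a minimal pure-Python sketch of that ratio and a single (simplified) SGD step using it; it is only an illustration of the formula, not a distributed or production implementation:

```python
import math

def lars_local_lr(w, g, eta=0.001, weight_decay=0.0, eps=1e-9):
    """Layer-wise trust ratio: eta * ||w|| / (||g|| + wd * ||w||).

    w, g: flat lists of floats for one layer's weights and gradients.
    eps guards against division by zero for freshly zeroed layers.
    """
    w_norm = math.sqrt(sum(x * x for x in w))
    g_norm = math.sqrt(sum(x * x for x in g))
    return eta * w_norm / (g_norm + weight_decay * w_norm + eps)

def lars_step(w, g, lr=0.1, eta=0.001):
    """One simplified SGD step scaled by the LARS trust ratio.

    (The full algorithm also folds weight decay into the update
    direction and usually adds momentum; omitted here for clarity.)
    """
    local = lars_local_lr(w, g, eta)
    return [wi - lr * local * gi for wi, gi in zip(w, g)]
```

The key property is that layers with small gradients relative to their weight norm get proportionally larger steps, which is what keeps very large batch sizes stable.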