Developer chat

Thanks Sylvain! All working now :slight_smile: (after updating spaCy to v2.0.18, then fastai to v1.0.41).

Actually I am still getting OOM. I followed the steps above, started a fresh kernel, set an enormous batch size, and got OOM. Am I missing something? Thanks!

In all the excitement it’s easy to miss the point of this discovery. Nobody can eliminate the OOM situation until someone comes out with a bottomless card.

So you will still have just as many OOM events as before. The difference is that now you can recover from them without needing to restart the notebook: just reduce bs (or other parameters) and re-run the same cell.
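The reduce-and-retry pattern can be sketched without any framework. Here `train_one_epoch` and `fit_with_retries` are hypothetical stand-ins, and the OOM is simulated with a RuntimeError carrying CUDA's usual message (a real CUDA OOM surfaces the same way in pytorch):

```python
# Sketch of the "reduce bs and re-run" recovery pattern.
# train_one_epoch is a hypothetical stand-in for a real training call.

def train_one_epoch(bs):
    # Pretend any batch size above 64 exhausts GPU memory.
    if bs > 64:
        raise RuntimeError("CUDA out of memory. Tried to allocate ...")
    return f"trained with bs={bs}"

def fit_with_retries(bs):
    while bs >= 1:
        try:
            return train_one_epoch(bs)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # not an OOM, re-raise
            bs //= 2  # halve the batch size and try again

print(fit_with_retries(256))  # -> trained with bs=64
```

Note that this only works once the traceback no longer pins GPU memory; that is exactly what the workaround below takes care of.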

I will be writing proper documentation shortly. I’m just finishing up some improvements to the relevant code in fastai.

Besides, if you’re using fastai, you don’t need to patch ipython. Just use the fastai git master version and the workaround is already there (for the fit() functions at the moment).

Yep, I intended to use it for the fit methods. I re-ran the cells, but I will update fastai and try again. Thanks a lot!

In my opinion, making code concise is not as important as making it readable.

OK, so here is an update on this story.

It’ll take ipython time to sort it out, because my patch can’t be applied as is: it would break the %debug magic, so they will have to make it configurable. Let’s see how and when it gets resolved. In particular, we need a simple magic to reset %tb.

Meanwhile, fastai (git master) has been instrumented with the following features that will provide you a solution to this problem today:

  1. Under a non-ipython environment it doesn’t do anything special.
  2. Under ipython it strips tb by default only for the “CUDA out of memory” exception, i.e. %debug magic will work under all circumstances but this one, and it’ll leak memory in each of those until tb is reset.
  3. The env var FASTAI_TB_CLEAR_FRAMES changes this behavior when run under ipython, depending on its value:
  • “0”: never strip tb (makes it possible to always use %debug magic, but with leaks)
  • “1”: always strip tb (never need to worry about leaks, but %debug won’t work)

where ipython == ipython/ipython-notebook/jupyter-notebook

At the moment we are only doing this for the fit() family of functions. If you find other fastai API needing this please let us know.

You can set os.environ['FASTAI_TB_CLEAR_FRAMES']="0" (or "1") in your code or from the shell when you start jupyter.
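For example, to pick the always-strip behavior from Python (set it before the fastai code runs, so it is seen when the exception handling kicks in):

```python
import os

# "1": always strip tb -> no leaks, but %debug won't work.
# "0": never strip tb  -> %debug always works, but leaks remain.
os.environ['FASTAI_TB_CLEAR_FRAMES'] = "1"
```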

Let me know whether I have missed any special cases, so that we can get those sorted out before we release 1.0.42.

Of course, all the tricks posted in my original message still apply.

I will end this with an easy-to-remember tip: if everything else fails, or perhaps you’re not using fastai and you can’t recover from OOM in your notebook, just run a cell with this content:

1/0

and you should be back in the game w/o needing to restart the kernel.

This whole subject matter is now documented here: https://docs.fast.ai/troubleshoot.html#memory-leakage-on-exception

If you encounter any related issues you can discuss those here: A guide to recovering from CUDA Out of Memory and other exceptions

(fastai version: 1.0.42.dev0) Problems with verify_images().

For example, verify_images(path_to_images_folder, delete=True) does not delete non-images (for example, files with extensions like php or axd) and prints no warning about them.

It should not be the default behavior for verify_images(delete=True) because, well, imagine you stored some .txt files alongside your images and they just vanish; the function name doesn’t make that obvious.

verify_images() only handles image files (by mime types), and ignores the rest.

I’ll make a pull request with another function to delete all non-images from a folder, which can be invoked by a flag in verify_image or separately.

About the issue you pointed to: I do not think verify_images() can delete a file which is not an image, thanks to files = get_image_files(path) (code).

However, the point I focused on was that verify_images() does not work well as currently written.

Wait, so you don’t want to delete non-images from your dataset dirs, you just want a warning if there are any non-images?

You can always do print([f for f in path.rglob('*') if f.suffix not in images_suffixes]). It’s one line to see what’s in your data; I’m just not 100% sure fastai has to have this internally. :wink:

Sorry, you’re right that my last message was strange :-). Let me rephrase: I want to fix the verify_images() code because I think it does not work well. That is, I do want to delete non-image files from my dataset folders through verify_images() (like xxx.php, xxx.html, xxx.axd, or any images that cannot be opened), as that is the objective of this function (while keeping all its current arguments, such as max_size, dest, etc.). If you think we need to take care not to delete, for example, xxx.txt files, I agree with you.

Maybe the easiest and safest solution is not to delete anything at all, but to move all unfit files into a subdirectory? Then the user can make the final decision using filesystem tools, where they can review what they want to delete.
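That move-instead-of-delete idea could look something like this sketch (quarantine_non_images, the "unfit" folder name, and the suffix list are all hypothetical, not fastai API):

```python
import shutil
from pathlib import Path

# Hypothetical sketch: instead of deleting, move anything that is not a
# recognized image into an "unfit" subdirectory for the user to review.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def quarantine_non_images(path, unfit_name="unfit"):
    path = Path(path)
    unfit = path / unfit_name
    moved = []
    # Snapshot the listing first, since we create a subdir while scanning.
    for f in list(path.iterdir()):
        if f.is_file() and f.suffix.lower() not in IMAGE_SUFFIXES:
            unfit.mkdir(exist_ok=True)
            shutil.move(str(f), str(unfit / f.name))
            moved.append(f.name)
    return moved
```

Nothing is destroyed: a wrong directory argument at worst shuffles files into a subfolder you can undo by hand.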

If you made a mistake and mistyped a directory, you could end up causing huge damage if the function is instrumented to delete any non-image files, regardless of how many exceptions you make (.txt files).

I agree: if fastai’s code should move or delete unnecessary files at all, it should move them to a directory.

However, I’m not sure fastai should provide a way to delete anything; extra files don’t hurt training or serving, right?

If I understand it correctly, the idea was to delete broken images so that they won’t cause problems at training time. But since we are now talking about other files too, it’s probably safer not to delete any files, and instead move broken images to another folder along with any other non-image files, to ensure training is unaffected and no data is lost.

The only “destructive” thing I personally added to the verify_image code is removing invalid EXIF headers, which again may not be the best thing. Perhaps the original file should be moved into the “unfit” sub-folder and a copy w/o the broken EXIF header left in the main folder with the other good images.

I agree with you, Stas and Nate: better to move non-image files, corrupted image files, and image files that don’t open to a subfolder.

This little tidbit might be of practical interest to you:
https://docs.fast.ai/dev/gpu.html#peak-memory-usage
The pytorch forum thread it links to is full of excellent info.

Is anyone developing a LARS implementation for distributed learning in pytorch? Frameworks such as Horovod use all-reduce and LARS to scale out to a batch size of around 64000. I would like to start a LARS and all-reduce implementation if the community thinks it is important.
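For context, the core of LARS is a layer-wise trust ratio that rescales each layer's learning rate from the ratio of weight norm to gradient norm. A minimal pure-Python sketch (eta, wd, and global_lr values are illustrative, not recommendations):

```python
import math

# Sketch of LARS layer-wise learning-rate scaling (trust ratio).
# Per layer: local_lr = eta * ||w|| / (||g|| + wd * ||w||),
# update:    w <- w - global_lr * local_lr * (g + wd * w)

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def lars_update(w, g, global_lr=0.1, eta=0.001, wd=0.0005):
    w_norm, g_norm = l2_norm(w), l2_norm(g)
    local_lr = eta * w_norm / (g_norm + wd * w_norm) if w_norm > 0 else 1.0
    return [wi - global_lr * local_lr * (gi + wd * wi)
            for wi, gi in zip(w, g)]
```

In a distributed setting, the gradient g here would be the result of the all-reduce across workers before the per-layer scaling is applied.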

Fellow coders, I am sure someone has given this a thought before (I think there was also a remark from Jeremy on this in answer to a question), so I thought it worthwhile to pick some smart brains.

Could lr_find spit out a smart default recommendation for which lr to choose? While 3e-3 seems a good default choice, one could also try to inspect the graph and find a reasonably long, steep downward slope.

I’m not suggesting complete automation here, but rather that learn.recorder.plot(…), which you typically call after lr_find, would print some advice, like “here are 3 good tips for choosing lr”, instead of one reading it just from the graph.
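One such heuristic could be: suggest the lr where the recorded loss drops fastest. A sketch (suggest_lr is hypothetical; lrs and losses stand in for what the recorder collects during lr_find):

```python
# Sketch of one heuristic an lr_find-style tool could print: pick the
# learning rate where the loss curve has its steepest negative slope.

def suggest_lr(lrs, losses):
    best_i, best_slope = None, 0.0
    for i in range(1, len(losses)):
        slope = losses[i] - losses[i - 1]
        if slope < best_slope:
            best_i, best_slope = i, slope
    return lrs[best_i] if best_i is not None else None

lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [2.0, 1.9, 1.2, 1.0, 3.5]
print(suggest_lr(lrs, losses))  # -> 0.001 (steepest drop ends there)
```

A real version would smooth the losses first, since raw per-batch losses are noisy.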

This?

yes, thx.