Developer chat

I think it would be a useful addition. If you make a PR, please note that the doc string should just take one line that explains what your function does (with arguments between if they are mentioned). Then edit the doc notebook tabular.transform (since I think this function should go there) and document your new function with more length (no need to list the parameters like you do) then you can show actual examples.

Done! I hope it is good enough for the library.

While trying to run lesson 10, I’m facing an issue:
AttributeError: 'numpy.ndarray' object has no attribute 'x'
at this line:
trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)

And my environment is,

=== Software === 
python       : 3.6.6
fastai       : 1.0.38
fastprogress : 0.1.18
torch        : 1.0.0
torch cuda   : 9.0.176 / is **Not available** 

=== Hardware === 
No GPUs available 

=== Environment === 
platform     : Linux-4.4.0-1065-aws-x86_64-with-debian-9.5
distro       : Debian GNU/Linux 9 stretch
conda env    : Unknown
python       : /usr/local/bin/python
sys.path     : 
/usr/local/lib/python36.zip
/usr/local/lib/python3.6
/usr/local/lib/python3.6/lib-dynload
/usr/local/lib/python3.6/site-packages
/usr/src/app
/usr/local/lib/python3.6/site-packages/IPython/extensions
no supported gpus found on this system

Thanks in advance

Hello:

After following the install instructions on my laptop, I got the error “AttributeError: module 'typing' has no attribute '_ClassVar'”
when running “from fastai.vision import *”.

I searched the forum, but couldn’t find any post relating to this error. Any advice would be greatly appreciated.

Why does get_preds default to the validation set (Valid)? Isn’t the main use for get_preds with the test set?

So currently we need to use:

predictions = learn.get_preds(ds_type=DatasetType.Test)

to do that. Yuck. Could we at least have a thin wrapper?

predictions = learn.get_preds_for_test()

And then there is an issue with docs which currently don’t show any default value. It looks like a bug in show_doc, all the methods below it have the same issue, showing: ds_type = ``
https://docs.fast.ai/basic_train.html#Learner.get_preds
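The thin wrapper proposed above could look roughly like the sketch below. To keep it runnable without fastai installed, `DatasetType` and `Learner` here are minimal stand-ins written for this example, not the library's classes; only the names `get_preds`, `ds_type`, `DatasetType.Valid`, and `DatasetType.Test` come from the post, and `get_preds_for_test` is the hypothetical wrapper it suggests.

```python
from enum import Enum


class DatasetType(Enum):
    """Stand-in enum using the member names mentioned in the thread."""
    Train = 1
    Valid = 2
    Test = 3


class Learner:
    """Stub learner: only mimics the default argument discussed above."""

    def get_preds(self, ds_type=DatasetType.Valid):
        # A real learner would run inference; the stub just reports the split.
        return f"predictions for {ds_type.name}"


def get_preds_for_test(learn):
    """Hypothetical convenience wrapper: always predict on the test set."""
    return learn.get_preds(ds_type=DatasetType.Test)


learn = Learner()
default_preds = learn.get_preds()        # uses the Valid default
test_preds = get_preds_for_test(learn)   # explicitly uses Test
```

A wrapper like this leaves the existing default untouched, so code relying on validation-set predictions would keep working.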


Hi guys,
Based on Jeremy’s lesson 6 pets notebook, I wrote a class to simplify plotting Grad-CAM (optionally with guided backprop, as in the Grad-CAM paper). I think it would be a nice complement to ClassificationInterpretation and to deep learning model interpretation in general.
The post is originally here https://forums.fast.ai/t/gradcam-and-guided-backprop-intergration-in-fastai-library/33462 and I am not sure how to move it to the fastai dev topic…
Anyway, I hope this is helpful, and if there’s a way to add this to fastai, let me know. I’d love to contribute this to the code base.

You just saved me hours of debugging :heart:, as we are overwriting the language model loader to create batches for bi-directional training, and everything was working except that the results were random.
Thank you! :slight_smile:

I’m doing distributed training on 4 machines with 8 GPUs each. Validation part of .fit() loop takes more time than anything else combined. I believe that is because there is no distributed inference. Is anybody working on that ATM?

Hi all, I have started working with some medical imaging. I have made some custom ItemLists for 3D volumes and segmentation volumes, and I would like to share the work, but I’m wondering what the common format is for this kind of data. Right now I build bcolz carrays to store the datasets, and my ItemLists work on top of those (with some limitations, such as num_workers=0). What do you recommend?

We make a GitHub repo with the code/notebook and publish a link here: https://forums.fast.ai/t/share-your-work-here/27676

I’m not sure I understand the problem. Validation is done on the full dataset for each GPU, just so the stats printed are correct, but if this is taking that long, maybe reduce your validation set? There is no need to have a huge one.

I used an 80/20 training/validation split, so it wasn’t that huge. Validation takes a large share of the time because fitting is fast in a distributed environment. I’ve reduced the validation set and now it is doing OK. Distributed inference is not a real problem, just a nice thing to have; hopefully the PyTorch devs will implement it some day: https://github.com/pytorch/examples/issues/461

Oh, no. I’ve started training on bigger images and there is a problem with the validation phase: I’m getting “CUDA out of memory” during the validation part of the loop. I can’t fix that by reducing the size of the validation set. I’ve already set the batch size to 2, and it doesn’t help at all. The forward pass works fine even with batch size = 16.

If you use our Learner.distributed it’ll disable distributed validation automatically, FYI.

FYI, tests/test_callbacks_csv_logger.py fails very often on CI - inconsistent behavior:
https://dev.azure.com/fastdotai/fastai/_build/results?buildId=2625

New stuff is being worked on, and your input is welcome:

I started working on GPU memory utils. You can see the initial implementation here: https://github.com/fastai/fastai/blob/master/fastai/utils/mem.py and the test suite is here: https://github.com/fastai/fastai/blob/master/tests/test_utils_mem.py. At the moment all the docs are in the code; I will write proper docs once the API is stable.

Currently, the main need behind this API, is to be able to measure GPU RAM in tests to detect memory leaks (see the last test in the test module linked above). But, of course, many other uses are possible.

We no longer need nvidia-smi; we now use the much faster NVML API instead.

This is all new, so feedback is welcome. The idea is to make the api easy to use w/ and w/o gpu, so less code is needed on the user side and as few try/except as possible.

Thank you.

edit: Added a test utils module: https://github.com/fastai/fastai/blob/master/tests/utils/mem.py
and documented them here: https://docs.fast.ai/dev/test.html#testing-memory-leaks
first leakage test that actually measures GPU RAM leaks: https://github.com/fastai/fastai/blob/master/tests/test_vision_train.py#L87
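To illustrate the "easy to use with and without a GPU" idea, here is a minimal sketch of an NVML-backed query that degrades to zeros instead of raising. It is not the code in fastai/utils/mem.py: the names `GPUMem` and `query_gpu_mem` are hypothetical, and it assumes the `pynvml` bindings are what provides NVML access.

```python
from collections import namedtuple

# Hypothetical container; values are in bytes, as reported by NVML.
GPUMem = namedtuple("GPUMem", ["total", "used", "free"])


def query_gpu_mem(index=0):
    """Return memory stats for GPU `index`, or all zeros if unavailable."""
    try:
        import pynvml  # optional dependency; absent on many CPU-only boxes
    except ImportError:
        return GPUMem(0, 0, 0)
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # No driver or no NVML library: behave as "no GPU".
        return GPUMem(0, 0, 0)
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return GPUMem(info.total, info.used, info.free)
    except pynvml.NVMLError:
        # Invalid index or query failure: also report zeros.
        return GPUMem(0, 0, 0)
    finally:
        pynvml.nvmlShutdown()


mem = query_gpu_mem()
```

Because the function never raises for a missing GPU, callers (tests included) would not need their own try/except around it.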


I’m working on a notebook that will demonstrate where fastai needs to embed gc.collect() calls to minimize GPU RAM fragmentation and allow for running a tighter ship memory-wise.

For example, fastai currently causes fragmentation and temporarily excessive memory usage in learn.load. It must not allocate new GPU RAM until it has freed the memory used by the already loaded model. So learn.load needs to clear the old model first, call gc.collect(), and only then load the new one. That avoids both fragmentation and a temporary memory overhead that the GPU might not be able to accommodate. The notebook will show the problem visually: currently learn.load consumes twice the model’s memory size until gc.collect() eventually runs, which could be too late for a user to be able to continue using the GPU. I added a test demonstrating the problem: https://github.com/fastai/fastai/blob/master/tests/test_vision_train.py#L87
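The ordering issue can be sketched in plain Python, with no GPU involved. Everything below is a stand-in made up for illustration: `FakeModel` plays the role of a loaded model caught in a reference cycle, `Holder` plays the learner, and the two load functions are hypothetical. Counting live instances shows the naive order briefly holding two models, while "clear, gc.collect(), then load" never holds more than one.

```python
import gc


class FakeModel:
    """Stand-in for a model that owns memory; tracks live instance count."""
    live = 0
    peak = 0

    def __init__(self):
        self.me = self  # deliberate cycle: refcounting alone cannot free it
        FakeModel.live += 1
        FakeModel.peak = max(FakeModel.peak, FakeModel.live)

    def __del__(self):
        FakeModel.live -= 1


class Holder:
    """Stand-in for a learner that owns one model."""

    def __init__(self):
        self.model = FakeModel()


def naive_load(holder):
    # New model is built while the old one is still alive.
    holder.model = FakeModel()


def careful_load(holder):
    holder.model = None        # drop the old model first
    gc.collect()               # reclaim the cycle now, not "eventually"
    holder.model = FakeModel() # only then allocate the new one


def peak_during(load_fn):
    gc.collect()  # flush leftovers from earlier runs before resetting counters
    FakeModel.live = FakeModel.peak = 0
    holder = Holder()
    load_fn(holder)
    return FakeModel.peak


naive_peak = peak_during(naive_load)
careful_peak = peak_during(careful_load)
```

Here the instance count stands in for memory use; with real models the same ordering would apply to the allocations they hold.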

Where would be a good place to have such a notebook, so that we could have an ongoing way to visually diagnose things. For identified temp leaks/fragmentation I intend to make these into hard tests of course (that’s why I need all that fastai.utils.mem api).

side note: currently fastai has lots of issues with circular references, which lead to temporary memory leaks. Python 3.4+ untangles circular references via gc.collect(), including those involving problematic __del__ methods, which in the past led to leaked memory that couldn’t be reclaimed. In fastai’s case, however, we can’t wait for gc.collect() to run at some point in the future; we must call it explicitly at strategic points. Of course, untangling the circular references would be an even better approach, but I’m not sure that will happen. Until then we need a practical solution. I don’t think the recent attempt at a weakref implementation made any difference: gc.collect() still reports clearing circular references.
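A small, self-contained illustration of that point, assuming CPython: two objects that reference each other are not reclaimed when the last outside reference goes away, only when the cycle collector runs. The `Node` class is invented for the demo, and automatic collection is disabled temporarily so the result does not depend on when the collector happens to fire.

```python
import gc
import weakref


class Node:
    """Toy object that can point at another Node."""

    def __init__(self):
        self.other = None


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a   # a <-> b reference cycle
    return weakref.ref(a)     # weak reference: does not keep `a` alive


gc.disable()                  # make the demo deterministic
try:
    probe = make_cycle()      # no strong references to a or b remain
    alive_before = probe() is not None
    collected = gc.collect()  # explicit collection breaks the cycle
    alive_after = probe() is not None
finally:
    gc.enable()
```

`alive_before` is true and `alive_after` is false: the objects linger until an explicit (or eventual automatic) collection, which is the window in which the memory they hold stays occupied.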

RAM fragmentation is a big problem, since you can have a ton of free memory, but not be able to use it.


So if I’m reading this correctly, testing for gpu mem leaks should be one of the top priorities for the test suite? (Improving/Expanding Tests).

I’d like to help. On the tests front, I’ll play around and ping you on the dev project thread. If you have a specific scenario or part of work in mind that would be helpful in the next few days, let me know, I’ll focus on that.

I wouldn’t say it’s the top priority, since the current fastai code base doesn’t have too many issues with that. It just doesn’t utilize all the available memory at times, because it doesn’t manage it tightly: (1) due to cyclic references, and (2) due to fragmentation, caused in some situations by GPU memory being allocated before the no-longer-needed memory is freed. Ideally, the code should be free of cyclic references, so that when any object is removed it is instantly reclaimed and, if a GPU is involved, its memory freed. But that’s not the case.

Thank you for the offer, @xnutsive. My plan is to add a few tests for the core functionality (create-learner-train-save-load sequence and its parts), and develop useful utils to make it easy to write them quickly. And then we can start expanding it to other parts. I know a few people are actively working on trying to get the ‘text’ classes to utilize less memory. e.g. LanguageModelLoader.

Have a look at https://github.com/fastai/fastai/blob/master/tests/test_vision_train.py#L87 (test_model_load_mem_leak) for a basic model. It’s now trivial to write leak tests: you just measure used memory before and after. The one thing you need to understand is how to measure the memory that is really in use.

I think the dev_nb folder in fastai_docs is probably the best place to share development notebooks. Thanks for investigating this!
