Nvml.dll loading issue in nvidia-ml-py3-7.352.0-py_0

partho · March 2, 2019, 3:41pm

Hi,

Do you know where the source code for nvidia-ml-py3-7.352.0-py_0 is? I would like to send a PR for an NVML loading bug in that library.

The issue is that on my machine nvml.dll is in $env:WinDir\system32 (installed by the NVIDIA installers), not in the Program Files location the library looks in. nvidia-ml-py3-7.352.0-py_0 should ideally search both locations.
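A minimal sketch of the "search both locations" idea, not the actual patch. The `NVIDIA Corporation\NVSMI` subfolder under Program Files is an assumption (the thread only says "Program Files"), and `nvml_candidates` and `find_nvml` are hypothetical helper names:

```python
import ntpath
import os


def nvml_candidates(env=None):
    """Return candidate nvml.dll paths, Program Files first, then System32."""
    env = os.environ if env is None else env
    program_files = env.get("ProgramFiles", r"C:\Program Files")
    win_dir = env.get("WinDir", r"C:\Windows")
    return [
        # Assumed legacy location; the exact subfolder is not stated in this thread.
        ntpath.join(program_files, "NVIDIA Corporation", "NVSMI", "nvml.dll"),
        # Location reported in this thread for recent NVIDIA installers.
        ntpath.join(win_dir, "System32", "nvml.dll"),
    ]


def find_nvml(candidates, exists=os.path.isfile):
    """Return the first candidate that exists, or None if none does."""
    for path in candidates:
        if exists(path):
            return path
    return None
```

On Windows, the path returned by `find_nvml(nvml_candidates())` could then be handed to `ctypes.CDLL`. The `exists` parameter is only there so the lookup can be exercised without a real DLL on disk.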

Thanks,
Partho

partho · March 3, 2019, 12:43am

Admins (@stas @sgugger @jeremy @rachel) can you please help?

stas · March 3, 2019, 1:06am

please don’t use @ broadcast unless you know the person asked you for it, especially not for large groups of people. Thank you.

Are you having the same issue as posted here:

Does it work better if you remove nvidia-ml-py3 and install this version instead:
GitHub - fbcotter/py3nvml: Python 3 Bindings for NVML library. Get NVIDIA GPU status inside your program. - only pip version is available.

But otherwise the repo is GitHub - nicolargo/nvidia-ml-py3: Python 3 Bindings for the NVIDIA Management Library

partho · March 3, 2019, 1:15am

please don’t use @ broadcast unless you know the person asked you for it, especially not for large groups of people. Thank you.

My apologies. I wasn’t aware of this. Is there an etiquette wiki page I can refer to so I avoid such things in the future?

Are you having the same issue as posted here:

Possibly - cannot confirm as the issue there doesn’t mention the stacktrace.

Does it work better if you remove nvidia-ml-py3 and install this version instead:

I know the fix and I have tested it by modifying the site-packages/pynvml.py. I will send a PR to GitHub - nicolargo/nvidia-ml-py3: Python 3 Bindings for the NVIDIA Management Library.

However the package is from fastai channel. Once the PR is merged - what is the process to publish the package to the fastai channel?

I prefer letting conda install do its thing instead of manually installing and uninstalling packages.

stas · March 3, 2019, 1:38am

The etiquette is to wait until someone replies. And only ping a specific person if you have a prior connection there. Imagine if every user here were to ping everybody they wish every time they have an issue. And there are thousands of users on this forum.

Also, if you find a bug, like in this situation, filing an issue (Issues · fastai/fastai · GitHub) will get it the fastest attention (and if it’s not a bug you will be asked to go back to the forums, so please don’t abuse that feature).

Also you will get the fastest response if you find the correct thread to post in. e.g. in this particular case the install thread I linked to, which I monitor closely. Of course, please don’t post there unless it’s install related.

Does it work better if you remove nvidia-ml-py3 and install this version instead:

I know the fix and I have tested it by modifying the site-packages/pynvml.py. I will send a PR to GitHub - nicolargo/nvidia-ml-py3: Python 3 Bindings for the NVIDIA Management Library.

Excellent. Thank you, and once it’s accepted please ask the maintainer to make a new release.

However the package is from fastai channel. Once the PR is merged - what is the process to publish the package to the fastai channel?

I will make a new build. Just let me know when it’s ready.

Also, since a few users have by now reported this problem, I split off the python workaround in its own module, so that a normal user won’t need to have a working nvml to use fastai. So, please install the git version of fastai and let me know whether this removed the issue.
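The idea of gating the NVML dependency behind its own module can be sketched roughly as below. This is an illustration only, not the code in fastai; `load_optional` is a hypothetical helper name:

```python
import importlib


def load_optional(module_name, init_attr=None):
    """Import a module on demand; return None instead of raising on failure.

    If init_attr is given, that attribute is called once after import, so a
    broken native library surfaces here rather than at top-level import time.
    """
    try:
        module = importlib.import_module(module_name)
        if init_attr is not None:
            getattr(module, init_attr)()
        return module
    except Exception:
        # Deliberately broad: a missing package, a missing DLL, and a failed
        # driver initialisation all mean "feature unavailable" to the caller.
        return None
```

A caller would do something like `pynvml = load_optional("pynvml", init_attr="nvmlInit")` only inside the memory-profiling code, so that merely importing the rest of the library never touches nvml.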

And then, if you’d like to contribute, and it sounds like you have the right know-how, please help me sort out pynvml on win10, so that you could use all the gpu profiling functions there (fastai.utils.mem). We have just sorted it out on OSX, so windows is the only area left to sort out. And perhaps your patch is all that’s needed.

stas · March 3, 2019, 1:41am

Also if you get a chance to try this one on win10 that would be great, so that we’d know whether we have an alternative that works on windows. I only work on linux, so rely on others for feedback on other OSes.

partho · March 3, 2019, 1:16pm

OK so here is the status of the items you asked me to do

#1: Does it work better if you remove nvidia-ml-py3 and install this version instead:
GitHub - fbcotter/py3nvml: Python 3 Bindings for NVML library. Get NVIDIA GPU status inside your program. - only pip version is available.

I did conda uninstall nvidia-ml-py3 --force followed by pip install py3nvml.

I get the following errors

In any case that library has the same issue. So I sent the PR for the fix.

#2: I split off the python workaround in its own module, so that a normal user won’t need to have a working nvml to use fastai. So, please install the git version of fastai and let me know whether this removed the issue.

I did the above and can confirm the following:

Just importing fastai.vision doesn’t give the “unable to load nvml.dll” error
Importing fastai.utils.mem does give the error
With my fix (monkey patched at site-packages), running gpu_mem_get_all() gives the same result as nvidia-smi.exe

In short with my patch, both your objectives are achieved on Windows 10.
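For anyone wanting to repeat the comparison against nvidia-smi, here is a small sketch that reads per-GPU memory from the command line tool. The function names are hypothetical; the `--query-gpu` and `--format` options are standard nvidia-smi flags, and the values it prints with `nounits` are in MiB:

```python
import subprocess

# nvidia-smi must be on PATH for smi_memory() to work.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_smi_memory(text):
    """Turn 'total, used, free' CSV lines (one per GPU) into dicts of ints."""
    rows = []
    for line in text.strip().splitlines():
        total, used, free = (int(field) for field in line.split(","))
        rows.append({"total": total, "used": used, "free": free})
    return rows


def smi_memory():
    """Run nvidia-smi and return one memory dict per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_smi_memory(result.stdout)
```

The output of `smi_memory()` can then be checked by eye against what `gpu_mem_get_all()` reports on the same machine.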

#3: Once nvidia-ml-py3 PR accepted please ask the maintainer to make a new release. I will make a new build. Just let me know when it’s ready.

I sent the PR. Though the maintainer is active on github, nvidia-ml-py3 itself was last updated ~2 years back. So I am not sure when this will get merged.

We can certainly wait a bit. However, since fastai v1 has a hard dependency on nvidia-ml-py3 and we are releasing nvidia-ml-py3 through the fastai conda channel, it would be prudent to pull the above into the fastai org and release it from there.

stas · March 4, 2019, 3:36am

Loving your proficiency and communication style, @partho!

Thank you for submitting the PRs and verifying that the fastai core w/o attempting to use memory profiling functions works just fine on win10. That gives us some breathing space.

The problem is simple - it needs to work on pypi and conda, so if we just release the improved conda package, there will be a mismatch (we can’t upload an alternative on pypi).

So we can wait a bit and see whether (1) the PR gets merged and (2) a new release is made - a hotfix release would be fine.

Here is an alternative solution until things get sorted out upstream - go into fastai/utils/pynvml_gate.py and monkey patch nvml there and PR it. That sounds like it’d be the fastest way to give users the best experience. And once upstream makes a new pypi release, we build a new conda package and remove the monkey patch.
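As a rough illustration of what a removable monkey patch could look like, here is a sketch against a stand-in object rather than the real pynvml module. Every name in it (`patch_loader`, `with_fallback`, `fake_nvml`, `load_library`) is hypothetical, and the real loader in pynvml may be structured differently:

```python
import types


def patch_loader(module, attr, make_replacement):
    """Replace module.attr with a wrapper and return a function that undoes it."""
    original = getattr(module, attr)
    setattr(module, attr, make_replacement(original))

    def undo():
        setattr(module, attr, original)

    return undo


def with_fallback(fallback):
    """Build a wrapper that tries the original loader, then the fallback."""
    def make(original):
        def loader():
            try:
                return original()
            except OSError:
                return fallback()
        return loader
    return make


def failing_loader():
    # Stand-in for a loader that only looks in one directory.
    raise OSError("nvml.dll not found")


# Stand-in for the third-party module being patched.
fake_nvml = types.SimpleNamespace(load_library=failing_loader)

undo = patch_loader(
    fake_nvml, "load_library", with_fallback(lambda: "loaded from System32")
)
```

Keeping the `undo` handle around is what makes the patch easy to drop once upstream ships a fixed release.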

If the original upstream fix isn’t happening after some time, but its py3nvml fork does integrate your fix and makes a new release, we will probably just switch to that version, since that would mean it’s actively maintained and a better version to rely on.

partho · March 4, 2019, 9:31am

On it. For now let’s go with your idea of patching the pynvml shim we have in fastai.

Also feel free to send across all Win10 issues and testing my way. That’s my primary environment. And is highly underrated IMHO for DL purposes.

I’ll also send across a PR for Win10 ‘Server setup’ & ‘Returning to work’ guides - and if you are OK with that you can accept it for the course-v3 documentation.

partho · March 4, 2019, 2:52pm

ok sent you the PR

stas · March 4, 2019, 4:50pm

Also feel free to send across all Win10 issues and testing my way. That’s my primary environment. And is highly underrated IMHO for DL purposes.

That’s very kind of you! This one please, I think I was asking the user who reported it difficult questions, so you will probably know how to help there. Thank you, @partho!
Fastai-nbstripout: stripping notebook outputs and metadata for git storage - #143 by stas (And a few comments up for the initial report).

I’ll also send across a PR for Win10 ‘Server setup’ & ‘Returning to work’ guides - and if you are OK with that you can accept it for the course-v3 documentation.

Oh, I’m sure Jeremy or Sylvain will take care of it - since they both are at home on windows.

Thanks again for your help, @partho!

rother · March 12, 2019, 2:08pm

Hi,
you asked in the other thread where the dll is located. It is indeed in the System32 directory in my case so your proposed PR should fix the issue (Windows 10). My rather simple workaround was copying the DLL to the directory the original code is searching in.

partho · March 12, 2019, 7:09pm

It’s fine as a temporary workaround. But this sort of change can cause gross system instability.

1.0.48 has been released and it should have the fix, so please install that & remove the copy from Program Files. Let me know if the issue persists.