SOLVED - Issues importing QRNN on local Linux setup (Ubuntu 18.04)

wgpubs · October 25, 2019, 2:27am

I’m getting this error:

ImportError: No module named 'forget_mult_cuda'

I’ve seen other folks struggling with this, so I’m hoping someone has some insights on how to resolve it.

Here is the full stack-trace:

ImportError                               Traceback (most recent call last)
<timed exec> in <module>

~/anaconda3/envs/piegu-language-models/lib/python3.7/site-packages/fastai/text/learner.py in language_model_learner(data, arch, config, drop_mult, pretrained, pretrained_fnames, **learn_kwargs)
    202                            pretrained_fnames:OptStrTuple=None, **learn_kwargs) -> 'LanguageLearner':
    203     "Create a `Learner` with a language model from `data` and `arch`."
--> 204     model = get_language_model(arch, len(data.vocab.itos), config=config, drop_mult=drop_mult)
    205     meta = _model_meta[arch]
    206     learn = LanguageLearner(data, model, split_func=meta['split_lm'], **learn_kwargs)

~/anaconda3/envs/piegu-language-models/lib/python3.7/site-packages/fastai/text/learner.py in get_language_model(arch, vocab_sz, config, drop_mult)
    193     tie_weights,output_p,out_bias = map(config.pop, ['tie_weights', 'output_p', 'out_bias'])
    194     init = config.pop('init') if 'init' in config else None
--> 195     encoder = arch(vocab_sz, **config)
    196     enc = encoder.encoder if tie_weights else None
    197     decoder = LinearDecoder(vocab_sz, config[meta['hid_name']], output_p, tie_encoder=enc, bias=out_bias)

~/anaconda3/envs/piegu-language-models/lib/python3.7/site-packages/fastai/core.py in _init(self, *args, **kwargs)
     64         def _init(self,*args,**kwargs):
     65             self.__pre_init__()
---> 66             old_init(self, *args,**kwargs)
     67             self.__post_init__()
     68         x.__init__ = _init

~/anaconda3/envs/piegu-language-models/lib/python3.7/site-packages/fastai/text/models/awd_lstm.py in __init__(self, vocab_sz, emb_sz, n_hid, n_layers, pad_token, hidden_p, input_p, embed_p, weight_p, qrnn, bidir)
     86         if self.qrnn:
     87             #Using QRNN requires an installation of cuda
---> 88             from .qrnn import QRNN
     89             self.rnns = [QRNN(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz)//self.n_dir, 1,
     90                               save_prev_x=True, zoneout=0, window=2 if l == 0 else 1, output_gate=True, bidirectional=bidir) 

~/anaconda3/envs/piegu-language-models/lib/python3.7/site-packages/fastai/text/models/qrnn.py in <module>
      9     fastai_path = Path(fastai.__path__[0])/'text'/'models'
     10     files = ['forget_mult_cuda.cpp', 'forget_mult_cuda_kernel.cu']
---> 11     forget_mult_cuda = load(name='forget_mult_cuda', sources=[fastai_path/f for f in files])
     12     files = ['bwd_forget_mult_cuda.cpp', 'bwd_forget_mult_cuda_kernel.cu']
     13     bwd_forget_mult_cuda = load(name='bwd_forget_mult_cuda', sources=[fastai_path/f for f in files])

~/anaconda3/envs/piegu-language-models/lib/python3.7/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module)
    659         verbose,
    660         with_cuda,
--> 661         is_python_module)
    662 
    663 

~/anaconda3/envs/piegu-language-models/lib/python3.7/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module)
    839     if verbose:
    840         print('Loading extension module {}...'.format(name))
--> 841     return _import_module_from_library(name, build_directory, is_python_module)
    842 
    843 

~/anaconda3/envs/piegu-language-models/lib/python3.7/site-packages/torch/utils/cpp_extension.py in _import_module_from_library(module_name, path, is_python_module)
   1046 def _import_module_from_library(module_name, path, is_python_module):
   1047     # https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path
-> 1048     file, path, description = imp.find_module(module_name, [path])
   1049     # Close the .so file after load.
   1050     with file:

~/anaconda3/envs/piegu-language-models/lib/python3.7/imp.py in find_module(name, path)
    294         break  # Break out of outer loop when breaking out of inner loop.
    295     else:
--> 296         raise ImportError(_ERR_MSG.format(name), name=name)
    297 
    298     encoding = None

ImportError: No module named 'forget_mult_cuda'

wgpubs · October 25, 2019, 2:50am

Actually, the first time I run this code:

config = awd_lstm_lm_config.copy()
config['qrnn'] = True
config['n_hid'] = 1550 #default 1152
config['n_layers'] = 4 #default 3

perplexity = Perplexity()
learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained=False, 
                               metrics=[error_rate, accuracy, perplexity]).to_fp16()

… I get this error:

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

When I try running that code again I get the ImportError: No module named 'forget_mult_cuda' error. I’m thoroughly confused now.

sgugger · October 25, 2019, 12:55pm

Try cleaning up the temp directory where the kernels are built (/tmp/torch_something IIRC). I had to do this when upgrading my PyTorch.
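
A minimal sketch of that cleanup, in case it helps. It assumes the cache lives in `<tempdir>/torch_extensions` (the directory confirmed further down this thread) unless the `TORCH_EXTENSIONS_DIR` environment variable points elsewhere; other PyTorch releases may use a different default. The helper names are made up for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path


def extension_cache_dir() -> Path:
    """Best guess at where PyTorch caches JIT-built C++/CUDA extensions."""
    # TORCH_EXTENSIONS_DIR overrides the default location when it is set.
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    # Assumption: the default is <tempdir>/torch_extensions, as seen in this thread.
    return Path(tempfile.gettempdir()) / "torch_extensions"


def clear_extension(name: str, cache_dir: Path) -> bool:
    """Delete the cached build of one extension; True if something was removed."""
    target = cache_dir / name
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `clear_extension("forget_mult_cuda", extension_cache_dir())` and the same for `"bwd_forget_mult_cuda"` should force a fresh build the next time the QRNN is imported. It only removes a stale build, so it will not help if the compile itself keeps failing.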

wgpubs · October 25, 2019, 4:18pm

Tried that (it’s /tmp/torch_extensions, btw), but I still get the ninja error:

CalledProcessError                        Traceback (most recent call last)
~/anaconda3/envs/piegu-language-models/lib/python3.7/site-packages/torch/utils/cpp_extension.py in _build_extension_module(name, build_directory, verbose)
   1029                 cwd=build_directory,
-> 1030                 check=True)
   1031         else:

~/anaconda3/envs/piegu-language-models/lib/python3.7/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    486             raise CalledProcessError(retcode, process.args,
--> 487                                      output=stdout, stderr=stderr)
    488     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

Not a C++ guy and not familiar with ninja, though that appears to be the problem.

sgugger · October 25, 2019, 4:23pm

The error message isn’t really helpful. I have no other idea apart from checking that your cuda install is compatible with the PyTorch you installed. It compiles normally for me.

wgpubs · October 25, 2019, 4:31pm

This is the solution: https://github.com/mapillary/inplace_abn/issues/106#issuecomment-475460496

I noticed that when I ran nvcc --version it was reporting 9.1 … but when I ran !python -m fastai.utils.show_install I was seeing the following:

python        : 3.7.4
fastai        : 1.0.58
fastprogress  : 0.1.21
torch         : 1.3.0
nvidia driver : 430.26
torch cuda    : 10.1.243 / is available
torch cudnn   : 7603 / is enabled

So I was like, “Why the hell is nvcc reporting 9.1?”

Did some googling and came across this github resolution, which basically requires folks on 18.04 to be real explicit about where everything is.

Maybe there is another way … but this is working
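
For anyone wanting to run the same comparison from Python, here is a rough sketch. It parses the `release X.Y` field out of `nvcc --version` and compares it with `torch.version.cuda`. The function names are hypothetical, and matching on major.minor only is my own simplification.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def major_minor(version: str) -> str:
    """Reduce a version such as '10.1.243' to '10.1'."""
    return ".".join(version.split(".")[:2])


def toolkit_matches_torch(nvcc_output: str, torch_cuda: str) -> bool:
    """True when nvcc and the PyTorch build agree on the CUDA major.minor."""
    release = parse_nvcc_release(nvcc_output)
    return release is not None and release == major_minor(torch_cuda)


def check_local_setup() -> None:
    """Compare the nvcc on PATH with the CUDA version PyTorch was built for."""
    import torch  # imported here so the helpers above work without PyTorch

    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc is not on PATH")
        return
    output = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
    print("nvcc release:", parse_nvcc_release(output))
    print("torch cuda  :", torch.version.cuda)
    print("match       :", toolkit_matches_torch(output, torch.version.cuda or ""))
```

With the numbers above (nvcc release 9.1, torch cuda 10.1.243) `toolkit_matches_torch` returns False, which is the mismatch I was chasing.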

devon.kaberna · March 4, 2020, 4:01am

Can you help me out, based on what you wrote above (thank you, btw)? What version of torch cuda should I use at this point? I really struggle to understand what versions of each outlined below I should use to make sure things work fine. Any best practices to ensure alignment would be greatly appreciated!!!

=== Software ===
python : 3.7.3
fastai : 1.0.60
fastprogress : 0.2.2
torch : 1.4.0
nvidia driver : 410.104
torch cuda : 10.1 / is Not available

=== Hardware ===
nvidia gpus : 1
Have 1 GPU(s), but torch can’t use them (check nvidia driver)

=== Environment ===
platform : Linux-4.15.0-88-generic-x86_64-with-debian-buster-sid
distro : #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020
conda env : base
python : /home/devon/anaconda3/bin/python
sys.path :
/home/devon/anaconda3/lib/python37.zip
/home/devon/anaconda3/lib/python3.7
/home/devon/anaconda3/lib/python3.7/lib-dynload
/home/devon/anaconda3/lib/python3.7/site-packages
/home/devon/anaconda3/lib/python3.7/site-packages/IPython/extensions

19:58 $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

wgpubs · March 4, 2020, 4:12am

Are you running on Ubuntu 18.04?

Assuming that is the case, then yes, 10.1 is what you want right now. This article describes how to clean up your system and get the right cuda/cudnn installed and verify all is right.
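
One thing that may be worth checking first is whether the driver is new enough for the toolkit. The sketch below compares a driver version against a small table of minimum Linux driver versions. The table values are as I recall them from NVIDIA's CUDA release notes, so treat them as an assumption and verify against the current table; the function names are made up.

```python
from typing import Tuple

# Minimum Linux driver per CUDA toolkit, as recalled from NVIDIA's CUDA
# release notes (an assumption to verify against the current table).
MIN_LINUX_DRIVER = {
    "10.0": (410, 48),
    "10.1": (418, 39),
    "10.2": (440, 33),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '410.104' into (410, 104) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver: str, cuda: str) -> bool:
    """True when the driver meets the listed minimum for that CUDA release."""
    key = ".".join(cuda.split(".")[:2])
    return parse_version(driver) >= MIN_LINUX_DRIVER[key]


# Driver and torch cuda from the report above.
print(driver_supports("410.104", "10.1"))   # False
# Driver and torch cuda from the setup earlier in the thread.
print(driver_supports("430.26", "10.1.243"))  # True
```

If the table is right, driver 410.104 is too old for a PyTorch built against cuda 10.1, which would fit the "check nvidia driver" hint in the report.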

Once you get that installed, make sure you got the latest version of PyTorch installed (this assumes you are using anaconda, if not, look at the pytorch home page to figure out the right syntax):

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

My experience with cuda and cudnn is that it’s really easy for things to get F’d up … especially as you upgrade it from version to version … so you’re not alone

Once you got everything looking right, look at the github link I mentioned above. Lmk how it goes.

devon.kaberna · March 4, 2020, 5:16am

First, thank you so much for the very timely and extremely helpful response - much appreciated!!!

But I’m still struggling with this. Yes, I am using 18.04.

Question 1: If it’s telling me " torch cuda : 10.1 / is Not available", why would that be acceptable?

Question 2: What’s the syntax for installing the pytorch wheel for version 1.0.0 with cuda 10.0?

Question 3: When I run this code snippet, I get the error shown further below:
config = awd_lstm_lm_config.copy()
config['qrnn'] = True

drop_mult=0.1
learn_1 = language_model_learner(data_lm_1, AWD_LSTM, config=config, drop_mult=drop_mult,pretrained=False).to_fp16()

RuntimeError: Error building extension 'forget_mult_cuda':
[1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=forget_mult_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/torch/include -isystem /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/torch/include/TH -isystem /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/devon/anaconda3/envs/fai_v1.0.50/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++11 -c /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/fastai/text/models/forget_mult_cuda_kernel.cu -o forget_mult_cuda_kernel.cuda.o
FAILED: forget_mult_cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=forget_mult_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/torch/include -isystem /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/torch/include/TH -isystem /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/devon/anaconda3/envs/fai_v1.0.50/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++11 -c /home/devon/anaconda3/envs/fai_v1.0.50/lib/python3.7/site-packages/fastai/text/models/forget_mult_cuda_kernel.cu -o forget_mult_cuda_kernel.cuda.o
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
ninja: build stopped: subcommand failed.
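
The `/usr/local/cuda/bin/nvcc: not found` part of that output suggests the build is looking for nvcc in a place where none exists. A small sketch for listing where an nvcc could be picked up from follows. The three sources it checks (a `CUDA_HOME` variable, `PATH`, and the `/usr/local/cuda` default) are my assumption about what matters here, and `nvcc_candidates` is a made-up helper, not part of PyTorch.

```python
import os
import shutil
from pathlib import Path
from typing import Dict, Optional


def nvcc_candidates(env: Dict[str, str],
                    default_root: str = "/usr/local/cuda") -> Dict[str, Optional[str]]:
    """Report where nvcc could come from; None means that source has no nvcc."""
    found: Dict[str, Optional[str]] = {}

    # 1. An explicit CUDA_HOME, if one was exported.
    cuda_home = env.get("CUDA_HOME")
    home_nvcc = Path(cuda_home) / "bin" / "nvcc" if cuda_home else None
    found["CUDA_HOME"] = str(home_nvcc) if home_nvcc and home_nvcc.is_file() else None

    # 2. Whatever nvcc comes first on PATH.
    found["PATH"] = shutil.which("nvcc", path=env.get("PATH", ""))

    # 3. The conventional location the failing command was using.
    default_nvcc = Path(default_root) / "bin" / "nvcc"
    found["default"] = str(default_nvcc) if default_nvcc.is_file() else None
    return found


if __name__ == "__main__":
    for source, path in nvcc_candidates(dict(os.environ)).items():
        print(source, "->", path)
```

If `default` comes back as None while `PATH` finds an nvcc somewhere else, that would match the error: the toolkit is installed, just not where the build expects it.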

wgpubs · March 4, 2020, 5:24am

You can find previous version installs here.

What kind of GPU are you working with? Did you try following those instructions to install cuda 10.1?

devon.kaberna · March 4, 2020, 5:37am

Yes, I followed the instructions you outlined in that article. I have the 1080TI GPU. I’ll keep trying things. thanks again!

wgpubs · March 4, 2020, 5:40am

That is exactly what I have.

So not sure what to tell ya … getting things working on the cuda/cudnn front is kinda a pain. You should be able to have 10.1 running on your machine and the latest pytorch as well.

Good luck. Let us know if you get things figured out … sorry I couldn’t help more.

devon.kaberna · March 5, 2020, 3:01am

Would you mind providing me an output of all the various versions you currently have installed? That way, I can compare and troubleshoot. Thanks so much.

wgpubs · March 5, 2020, 4:11am

Here ya go.

But I’m going to recommend that you start the installs from scratch right off the bat. Configuring CUDA/cuDNN is a major PITA for just about everyone. I usually spend several hours f’n around with my machine and then once it’s working … I walk away real quiet like and try not to touch it for as long as I can.

Anyways, hopefully this helps …

nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2019 NVIDIA Corporation
# Built on Sun_Jul_28_19:07:16_PDT_2019
# Cuda compilation tools, release 10.1, V10.1.243

$PATH
# /usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/bin:/home/wgilliam/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin: No such file or directory

$LD_LIBRARY_PATH
# /usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64: No such file or directory
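
If you would rather set the same thing up from inside Python or a notebook, something like this sketch should be close. It builds an environment with one toolkit first on `PATH` and `LD_LIBRARY_PATH` and sets `CUDA_HOME`. The helper name is hypothetical, and whether PyTorch's extension builder honours the change depends on it being applied before `torch.utils.cpp_extension` is first imported, as far as I can tell.

```python
import os
from typing import Dict, Mapping


def with_cuda_toolkit(env: Mapping[str, str], cuda_root: str) -> Dict[str, str]:
    """Return a copy of env with one CUDA toolkit placed first on the search paths."""
    updated = dict(env)
    updated["CUDA_HOME"] = cuda_root

    def prepend(var: str, entry: str) -> None:
        # Drop empty entries and earlier copies of this entry, then put it first.
        parts = [p for p in updated.get(var, "").split(os.pathsep) if p and p != entry]
        updated[var] = os.pathsep.join([entry] + parts)

    prepend("PATH", os.path.join(cuda_root, "bin"))
    prepend("LD_LIBRARY_PATH", os.path.join(cuda_root, "lib64"))
    return updated
```

Usage would be `os.environ.update(with_cuda_toolkit(os.environ, "/usr/local/cuda-10.1"))` at the top of the notebook, so that the ninja/nvcc subprocess inherits it. It also avoids the doubled entries you can see in my `PATH` above.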

You’re going to see something odd in the next bit of info … nvidia-smi will report 10.2 (but it’s really 10.1). Yah, real f’ng annoying. See this SO question here for more info if you’re curious.

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   37C    P8    10W / 180W |   2695MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:42:00.0 Off |                  N/A |
|  0%   24C    P8    12W / 250W |   1184MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:43:00.0 Off |                  N/A |
|  0%   33C    P8    19W / 250W |   3619MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

/usr/local/
# bin  cuda  cuda-10.0  cuda-10.1  cuda-10.2  etc  games  include  lib  man  sbin  share  src
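
With several toolkits side by side like that, it helps to know which one the plain `cuda` entry actually resolves to. A quick sketch, assuming the toolkits sit under `/usr/local` as in my listing (the function name is made up):

```python
from pathlib import Path
from typing import Dict


def cuda_installs(prefix: str = "/usr/local") -> Dict[str, str]:
    """Map each cuda* directory under prefix to the real path it resolves to."""
    root = Path(prefix)
    if not root.is_dir():
        return {}
    return {
        entry.name: str(entry.resolve())
        for entry in sorted(root.glob("cuda*"))
        if entry.is_dir()
    }
```

`cuda_installs()` returns a dict keyed by entry name. If `cuda` is a symlink, its value shows the versioned directory it currently points to.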

devon.kaberna · March 5, 2020, 4:45am

Finally worked through the issues!!!

Thanks again for everything! Couldn’t have figured this out without you!!

wgpubs · March 5, 2020, 5:16am

Congrats! Everyone here knows the struggle (or will). Cheers!