PyTorch + CUDA 7.5 error - python

I have non-sudo access to a machine with NVIDIA GPUs and CUDA 7.5 installed. I installed PyTorch with CUDA 7.5 support, which seems to have worked:
>>> import torch
>>> torch.cuda.is_available()
True
To get some practice, I followed a tutorial for machine translation using RNNs. When I set USE_CUDA = False and the CPUs are used, everything works fine. However, when I want to utilize the GPUs with USE_CUDA = True, I get the following error:
Traceback (most recent call last):
...
File "seq2seq.py", line 229, in train
encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
File "/.../python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "seq2seq.py", line 144, in forward
output, hidden = self.gru(embedded, hidden)
File "/.../python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/.../python2.7/site-packages/torch/nn/modules/rnn.py", line 91, in forward
output, hidden = func(input, self.all_weights, hx)
...
File "/.../python2.7/site-packages/torch/backends/cudnn/rnn.py", line 42, in init_rnn_descriptor
cudnn.DropoutDescriptor(handle, dropout_p, fn.dropout_seed)
File "/usr/lib/python2.7/ctypes/__init__.py", line 383, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: python: undefined symbol: cudnnCreateDropoutDescriptor
Exception AttributeError: 'python: undefined symbol: cudnnDestroyDropoutDescriptor' in <bound method DropoutDescriptor.__del__ of <torch.backends.cudnn.DropoutDescriptor object at 0x7fe540efec10>> ignored
I've tried to use Google to search for that error but got no meaningful results. Since I'm rather a newbie with PyTorch and CUDA, I have no idea how to go on from here. The full setup is Ubuntu 14.04, Python 2.7, CUDA 7.5.

As stated in the comments: your error is caused by an outdated cuDNN and can be resolved by upgrading.
Install current versions of CUDA, cuDNN, and PyTorch, and you'll be fine.
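A quick way to confirm whether cuDNN is the culprit is to ask PyTorch which cuDNN it actually sees. This is just a sanity-check sketch using the standard torch.backends.cudnn flags (available in current PyTorch builds; very old wheels may not expose all of them):
import torch

print(torch.__version__)                    # PyTorch build
print(torch.version.cuda)                   # CUDA version PyTorch was compiled against
print(torch.backends.cudnn.is_available())  # whether PyTorch can find a usable cuDNN
print(torch.backends.cudnn.version())       # cuDNN version it is linked against, if any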

Related

No NVIDIA GPU found error, even though I defined Torch to use CPU

UPDATE: I have edited and changed more code; now I don't get an error, and it either works but takes hours, or it is stuck on step one.
I have tried running Stable Diffusion, the new text2image model. The problem is: I don't have an NVIDIA GPU... After a bit of research, I found out you can "force" PyTorch to run on your CPU instead of the GPU. But up to this point, everything I tried while modifying the existing code did not work. I always get to the point where it starts sampling, and it prints the following error (everything after the command):
Falling back to LAION 400M model...
Global seed set to 42
Loading model from models/ldm/text2img-large/model.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 872.30 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
data: 0%| | 0/1 [00:00<?, ?it/s]
Sampling: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "scripts/txt2img.py", line 279, in <module>
main()
File "scripts/txt2img.py", line 233, in main
uc = model.get_learned_conditioning(batch_size * [""])
File "c:\users\louis\stable-diffusion\ldm\models\diffusion\ddpm.py", line 558, in get_learned_conditioning
c = self.cond_stage_model.encode(c)
File "c:\users\louis\stable-diffusion\ldm\modules\encoders\modules.py", line 111, in encode
return self(text)
File "C:\Users\louis\anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "c:\users\louis\stable-diffusion\ldm\modules\encoders\modules.py", line 103, in forward
tokens = self.tknz_fn(text)#.to(self.device)
File "C:\Users\louis\anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "c:\users\louis\stable-diffusion\ldm\modules\encoders\modules.py", line 74, in forward
tokens = batch_encoding["input_ids"].to(self.device)
File "C:\Users\louis\anaconda3\envs\ldm\lib\site-packages\torch\cuda\__init__.py", line 216, in _lazy_init
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
I already set it to use the CPU in txt2img.py, so an error like "user defined to use cuda, but no cuda device available" is already fixed.
So my question(s):
- Is what I'm trying even possible?
- If yes, how should I edit the code to make it work?
- (Would it even be possible to modify it to work on AMD GPUs using ROCm?)
The Repo: https://github.com/CompVis/stable-diffusion
Using the LAION400m weights, because I currently don't have access to the SD ones.
I got them using:
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt
Guide I followed: https://github.com/lstein/stable-diffusion
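For context, the generic pattern for forcing PyTorch onto the CPU looks roughly like the sketch below; this is an illustration, not the exact patch the repo needs. The traceback above shows the crash comes from a .to(self.device) call in the tokenizer wrapper, so every device attribute and .to(...) target in the pipeline has to end up as "cpu" as well.
import torch

device = torch.device("cpu")

# Load the checkpoint onto the CPU even though it was saved from a GPU run.
state = torch.load("models/ldm/text2img-large/model.ckpt", map_location="cpu")

# Any module built from it, and any tensors fed to it, must live on the same device:
# model.to(device)
# tokens = tokens.to(device)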

Error can't get attribute Net when saving PyTorch model with MLFlow

After installing MLFlow using one-click-mlflow, I save a PyTorch model using the default command that I found in the user guide. You can find the command below:
mlflow.pytorch.log_model(net, artifact_path="model", pickle_module=pickle)
The saved neural network is very simple: it is basically a two-hidden-layer network with Xavier initialization and hyperbolic tangent as the activation function.
class Net(T.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hid1 = T.nn.Linear(n_features, 10)
        self.hid2 = T.nn.Linear(10, 10)
        self.oupt = T.nn.Linear(10, 1)

        T.nn.init.xavier_uniform_(self.hid1.weight)
        T.nn.init.zeros_(self.hid1.bias)
        T.nn.init.xavier_uniform_(self.hid2.weight)
        T.nn.init.zeros_(self.hid2.bias)
        T.nn.init.xavier_uniform_(self.oupt.weight)
        T.nn.init.zeros_(self.oupt.bias)

    def forward(self, x):
        z = T.tanh(self.hid1(x))
        z = T.tanh(self.hid2(z))
        z = self.oupt(z)
        return z
Everything runs fine in the Jupyter Notebook. I can log metrics and other artifacts, but when I save the model I get the following error message:
2021/10/13 09:21:00 WARNING mlflow.utils.requirements_utils: Found torch version (1.9.0+cu111) contains a local version label (+cu111). MLflow logged a pip requirement for this package as 'torch==1.9.0' without the local version label to make it installable from PyPI. To specify pip requirements containing local version labels, please use `conda_env` or `pip_requirements`.
2021/10/13 09:21:00 WARNING mlflow.utils.requirements_utils: Found torchvision version (0.10.0+cu111) contains a local version label (+cu111). MLflow logged a pip requirement for this package as 'torchvision==0.10.0' without the local version label to make it installable from PyPI. To specify pip requirements containing local version labels, please use `conda_env` or `pip_requirements`.
2021/10/13 09:21:01 ERROR mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpnl9dsoye/model/data, flavor: pytorch)
Traceback (most recent call last):
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/environment.py", line 212, in infer_pip_requirements
return _infer_requirements(model_uri, flavor)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/requirements_utils.py", line 263, in _infer_requirements
modules = _capture_imported_modules(model_uri, flavor)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/requirements_utils.py", line 221, in _capture_imported_modules
_run_command(
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/requirements_utils.py", line 173, in _run_command
raise MlflowException(msg)
mlflow.exceptions.MlflowException: Encountered an unexpected error while running ['/home/ucsky/.virtualenv/mymodel/bin/python', '/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/_capture_modules.py', '--model-path', '/tmp/tmpnl9dsoye/model/data', '--flavor', 'pytorch', '--output-file', '/tmp/tmplyj0w2fr/imported_modules.txt', '--sys-path', '["/home/ucsky/project/ofi-ds-research/incubator/ofi-pe-fr/notebook/guillaume-simon/06-modelisation-pytorch", "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/git/ext/gitdb", "/usr/lib/python39.zip", "/usr/lib/python3.9", "/usr/lib/python3.9/lib-dynload", "", "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages", "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/IPython/extensions", "/home/ucsky/.ipython", "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/gitdb/ext/smmap"]']
exit status: 1
stdout:
stderr: Traceback (most recent call last):
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/_capture_modules.py", line 125, in <module>
main()
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/_capture_modules.py", line 118, in main
importlib.import_module(f"mlflow.{flavor}")._load_pyfunc(model_path)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/pytorch/__init__.py", line 723, in _load_pyfunc
return _PyTorchWrapper(_load_model(path, **kwargs))
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/pytorch/__init__.py", line 626, in _load_model
return torch.load(model_path, **kwargs)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/torch/serialization.py", line 607, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/torch/serialization.py", line 882, in _load
result = unpickler.load()
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/torch/serialization.py", line 875, in find_class
return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'Net' on <module '__main__' from '/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/_capture_modules.py'>
Can somebody explain to me what is wrong?
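One likely reading of the traceback: torch.save pickles the Net class by reference, and the helper process MLflow launches (_capture_modules.py) has a different __main__, so the unpickler cannot resolve __main__.Net. A common workaround, sketched below with a hypothetical file name net_def.py placed next to the notebook, is to define the class in an importable module and import it before logging:
# net_def.py -- hypothetical module; the point is that Net no longer lives in __main__
import torch as T

n_features = 10  # placeholder: set to the real input width

class Net(T.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hid1 = T.nn.Linear(n_features, 10)
        self.hid2 = T.nn.Linear(10, 10)
        self.oupt = T.nn.Linear(10, 1)

    def forward(self, x):
        z = T.tanh(self.hid1(x))
        z = T.tanh(self.hid2(z))
        return self.oupt(z)
Then, in the notebook:
from net_def import Net  # pickle now records net_def.Net instead of __main__.Net
net = Net()
mlflow.pytorch.log_model(net, artifact_path="model", pickle_module=pickle)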

How to check torch gpu compatibility without initializing CUDA?

Older GPUs don't seem to be supported by torch, in spite of recent CUDA versions.
In my case the crash has the following error:
/home/maxs/dev/mdb/venv38/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning:
Found GPU%d %s which is of cuda capability %d.%d.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is %d.%d.
warnings.warn(old_gpu_warn.format(d, name, major, minor, min_arch // 10, min_arch % 10))
WARNING:lightwood-16979:Exception: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1. when training model: <lightwood.model.neural.Neural object at 0x7f9c34df1e80>
Process LearnProcess-1:13:
Traceback (most recent call last):
File "/home/maxs/dev/mdb/venv38/sources/lightwood/lightwood/model/helpers/default_net.py", line 59, in forward
output = self.net(input)
File "/home/maxs/dev/mdb/venv38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/maxs/dev/mdb/venv38/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/maxs/dev/mdb/venv38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/maxs/dev/mdb/venv38/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 96, in forward
return F.linear(input, self.weight, self.bias)
File "/home/maxs/dev/mdb/venv38/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
This happens in spite of:
assert torch.cuda.is_available() == True
torch.version.cuda == '10.2'
How can I check for an older GPU that doesn't support torch without actually try/catching a tensor-to-GPU transfer? The transfer initializes CUDA, which wastes around 2GB of memory, something I can't afford since I'd be running this check in dozens of processes, all of which would then waste an extra 2GB of memory due to the initialization.
Based on the code in torch.cuda.__init__ that was actually throwing the error, the following check seems to work:
import torch
from torch.cuda import device_count, get_device_capability

def is_cuda_compatible():
    # Count devices whose compute capability is at least the minimum
    # architecture this torch build was compiled for.
    compatible_device_count = 0
    if torch.version.cuda is not None:
        for d in range(device_count()):
            capability = get_device_capability(d)
            major = capability[0]
            minor = capability[1]
            current_arch = major * 10 + minor
            min_arch = min((int(arch.split("_")[1]) for arch in torch.cuda.get_arch_list()), default=35)
            if (not current_arch < min_arch
                    and not torch._C._cuda_getCompiledVersion() <= 9000):
                compatible_device_count += 1
    if compatible_device_count > 0:
        return True
    return False
Not sure if it's 100% correct but putting it out here for feedback and in case anybody else needs it.
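If that check holds up, a typical way to use it is to gate device selection before anything touches CUDA (a minimal usage sketch):
import torch

device = torch.device("cuda" if is_cuda_compatible() else "cpu")
x = torch.zeros(8, device=device)  # only lands on the GPU when the check passes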

Tensorflow: AttributeError: 'function' object has no attribute 'graph'

I am using Keras == 1.1.0 and tensorflow-gpu == 1.12.0.
The error is raised after:
input_layer = Input(shape=(2, ))
layer = Dense(self._hidden[0], activation='relu')(input_layer)
and this is the Traceback:
Traceback (most recent call last):
File "D:/Documents/PycharmProjects/DDPG-master-2/main.py", line 18, in <module>
main()
File "D:/Documents/PycharmProjects/DDPG-master-2/main.py", line 14, in main
agent = Agent(state_size=world.state_size, action_size=world.action_size)
File "D:\Documents\PycharmProjects\DDPG-master-2\ddpg.py", line 50, in __init__
batch_size=batch_size, tau=tau)
File "D:\Documents\PycharmProjects\DDPG-master-2\networks\actor.py", line 68, in __init__
self._generate_model()
File "D:\Documents\PycharmProjects\DDPG-master-2\networks\actor.py", line 132, in _generate_model
layer = Dense(self._hidden[0], activation='relu')(input_layer)
File "D:\Anaconda3\lib\site-packages\keras\engine\topology.py", line 487, in __call__
self.build(input_shapes[0])
File "D:\Anaconda3\lib\site-packages\keras\layers\core.py", line 695, in build
name='{}_W'.format(self.name))
File "D:\Anaconda3\lib\site-packages\keras\initializations.py", line 59, in glorot_uniform
return uniform(shape, s, name=name)
File "D:\Anaconda3\lib\site-packages\keras\initializations.py", line 32, in uniform
return K.random_uniform_variable(shape, -scale, scale, name=name)
File "D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 282, in random_uniform_variable
return variable(value, dtype=dtype, name=name)
File "D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 152, in variable
if tf.get_default_graph() is get_session().graph:
AttributeError: 'function' object has no attribute 'graph'
Process finished with exit code 1
I previously had tensorflow-gpu == 1.9; I uninstalled it and upgraded to 1.12, as I saw that this was a common solution for similar problems. It did not work, though.
EDIT (adding some relevant code related to the Traceback):
agent = DDPG(state_size=world.state_size, action_size=world.action_size)

self._actor = Actor(tensorflow_session=tensorflow_session,
                    state_size=state_size, action_size=action_size,
                    hidden_units=actor_hidden_units,
                    learning_rate=actor_learning_rate,
                    batch_size=batch_size, tau=tau)

def _generate_model(self):
    """
    Generates the model based on the hyperparameters defined in the
    constructor.

    :return: a tuple containing references to the model, weights,
        and input layer
    """
    input_layer = Input(shape=(self._state_size,))
    layer = Dense(self._hidden[0], activation='relu')(input_layer)
    layer = Dense(self._hidden[1], activation='relu')(layer)
    output_layer = Dense(self._action_size, activation='sigmoid')(layer)
    model = Model(input=input_layer, output=output_layer)
    return model, model.trainable_weights, input_layer
The code is related to three different classes.
I experienced the same issue. Here are the things I did to resolve it:
Make sure that there are no other files in the project named tensorflow.py.
Re-install tensorflow with the --no-cache-dir argument (pip --no-cache-dir install tensorflow) and remove the pip cache files.
For Linux:
rm -rf ~/.cache/pip/*
For Windows, delete the files in this location: %LocalAppData%\pip\Cache
I hope this helps.
Here is how I solved the issue.
1. Uninstall keras with pip uninstall keras
2. Check that no other versions were installed (for example through conda)
3. Delete the cache in %LocalAppData%\pip\Cache
4. Re-install keras
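Either way, it is worth confirming which versions are actually imported at runtime, since a second install (pip vs. conda) or a stale cache can leave a different copy on the path than you expect. A small check using the standard __version__ / __file__ attributes:
import keras
import tensorflow as tf

print(keras.__version__)  # the Keras version actually loaded
print(tf.__version__)     # the tensorflow(-gpu) version actually loaded
print(keras.__file__)     # which installation the import resolved to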

how to setup cuDnn with theano on Windows 7 64 bit

I have installed the Theano framework and enabled CUDA on my machine; however, when I "import theano" in my Python console, I get the following message:
>>> import theano
Using gpu device 0: GeForce GTX 950 (CNMeM is disabled, CuDNN not available)
Now that "CuDNN not available", I downloaded cuDnn from Nvidia website. I also updated 'path' in environment, and added 'optimizer_including=cudnn' in '.theanorc.txt' config file.
Then, I tried again, but failed, with:
>>> import theano
Using gpu device 0: GeForce GTX 950 (CNMeM is disabled, CuDNN not available)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda2\lib\site-packages\theano\__init__.py", line 111, in <module>
theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
File "C:\Anaconda2\lib\site-packages\theano\sandbox\cuda\tests\test_driver.py", line 31, in test_nvidia_driver1
profile=False)
File "C:\Anaconda2\lib\site-packages\theano\compile\function.py", line 320, in function
output_keys=output_keys)
File "C:\Anaconda2\lib\site-packages\theano\compile\pfunc.py", line 479, in pfunc
output_keys=output_keys)
File "C:\Anaconda2\lib\site-packages\theano\compile\function_module.py", line 1776, in orig_function
output_keys=output_keys).create(
File "C:\Anaconda2\lib\site-packages\theano\compile\function_module.py", line 1456, in __init__
optimizer_profile = optimizer(fgraph)
File "C:\Anaconda2\lib\site-packages\theano\gof\opt.py", line 101, in __call__
return self.optimize(fgraph)
File "C:\Anaconda2\lib\site-packages\theano\gof\opt.py", line 89, in optimize
ret = self.apply(fgraph, *args, **kwargs)
File "C:\Anaconda2\lib\site-packages\theano\gof\opt.py", line 230, in apply
sub_prof = optimizer.optimize(fgraph)
File "C:\Anaconda2\lib\site-packages\theano\gof\opt.py", line 89, in optimize
ret = self.apply(fgraph, *args, **kwargs)
File "C:\Anaconda2\lib\site-packages\theano\gof\opt.py", line 230, in apply
sub_prof = optimizer.optimize(fgraph)
File "C:\Anaconda2\lib\site-packages\theano\gof\opt.py", line 89, in optimize
ret = self.apply(fgraph, *args, **kwargs)
File "C:\Anaconda2\lib\site-packages\theano\sandbox\cuda\dnn.py", line 2508, in apply
dnn_available.msg)
AssertionError: cuDNN optimization was enabled, but Theano was not able to use it. We got this error:
Theano can not compile with cuDNN. We got this error:
>>>
Can anyone help me? Thanks.
There should be a way to do it by setting only the Path environment variable, but I could never get that to work. The only thing that worked for me was to manually copy the CuDNN files into the appropriate folders of your CUDA installation.
For example, if your CUDA installation is in C:\CUDA\v7.0 and you extracted CuDNN to C:\CuDNN you would copy as follows:
The contents of C:\CuDNN\lib\x64\ would be copied to C:\CUDA\v7.0\lib\x64\
The contents of C:\CuDNN\include\ would be copied to C:\CUDA\v7.0\include\
The contents of C:\CuDNN\bin\ would be copied to C:\CUDA\v7.0\bin\
After that it should work.
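If you prefer to script the copy, something like the following does the same thing (a sketch only; C:\CUDA\v7.0 and C:\CuDNN are the example paths from above and should be adjusted to your installation):
import glob
import shutil

# Copy the cuDNN libraries, headers, and DLLs into the CUDA toolkit folders.
pairs = [
    (r"C:\CuDNN\lib\x64\*", r"C:\CUDA\v7.0\lib\x64"),
    (r"C:\CuDNN\include\*", r"C:\CUDA\v7.0\include"),
    (r"C:\CuDNN\bin\*", r"C:\CUDA\v7.0\bin"),
]
for pattern, destination in pairs:
    for path in glob.glob(pattern):
        shutil.copy(path, destination)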
In addition to all the steps you did, I updated the following content of .theanorc.txt in my home folder, and it worked after that.
[lib]
#cnmem=1.0
cudnn=1.0
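To verify the result without running a full model, you can ask Theano directly whether it can compile against cuDNN; dnn_available is the same helper referenced in the traceback above (this assumes the old theano.sandbox.cuda backend shown in that traceback):
from theano.sandbox.cuda import dnn

if dnn.dnn_available():
    print("cuDNN is available")
else:
    print(dnn.dnn_available.msg)  # the reason Theano gives when cuDNN cannot be used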
