I am implementing some deep learning algorithms using Theano. After I stop some programs running Theano, this error occasionally appears when I try to import Theano again:
>>> import theano
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
initCnmem: cnmemInit call failed! Reason=CNMEM_STATUS_OUT_OF_MEMORY. numdev=1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/__init__.py", line 118, in <module>
theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/sandbox/cuda/tests/test_driver.py", line 40, in test_nvidia_driver1
if not numpy.allclose(f(), a.sum()):
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 875, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/gof/link.py", line 317, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 862, in __call__
self.fn() if output_subset is None else\
RuntimeError: Cuda error: kernel_reduce_ccontig_node_4894639462a290346189bb38dab7bb7e_0: out of memory. (grid: 1 x 1; block: 256 x 1 x 1)
Apply node that caused the error: GpuCAReduce{add}{1}(<CudaNdarrayType(float32, vector)>)
Toposort index: 0
Inputs types: [CudaNdarrayType(float32, vector)]
Inputs shapes: [(10000,)]
Inputs strides: [(1,)]
Inputs values: ['not shown']
Outputs clients: [[HostFromGpu(GpuCAReduce{add}{1}.0)]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
I have searched for solutions. Someone suggested removing the compilation folder with rm -rf ./theano, and I have also checked that ./theano is not owned by the root user. I also tried setting my ./theanorc as shown below, but neither of these works for me.
[global]
floatX = float32
device = cpu
optimizer=fast_run
[lib]
cnmem = 0.1
[cuda]
root = /usr/local/cuda
The only thing that works is rebooting the machine or logging out, which is very awkward. I don't know what causes this problem. Can anyone suggest a solution?
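Not a fix for the stale GPU memory itself, but a minimal sketch for checking whether the device=cpu setting in the config file is simply being ignored (assuming THEANO_FLAGS is not already set elsewhere): flags set in the environment take precedence over ./theanorc, so forcing the device before the import should at least let the import succeed.
import os
# THEANO_FLAGS set in the environment overrides the .theanorc file, so this
# forces CPU mode for this process only (must be set before importing theano)
os.environ["THEANO_FLAGS"] = "device=cpu,floatX=float32"
import theano
print(theano.config.device)  # expected to print 'cpu'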
UPDATE: I have edited and changed more code. Now I don't get an error; it either works but takes hours, or it gets stuck on step one.
I have tried running Stable Diffusion, the new text2image model. The problem is: I don't have an NVIDIA GPU... After a bit of research, I found out you can "force" PyTorch to run on your CPU instead of the GPU. But so far, everything I tried while modifying the existing code did not work. I always get to the point where it starts sampling, and then it prints the following error (everything after the command):
Falling back to LAION 400M model...
Global seed set to 42
Loading model from models/ldm/text2img-large/model.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 872.30 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
data: 0%| | 0/1 [00:00<?, ?it/s]
Sampling: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "scripts/txt2img.py", line 279, in <module>
main()
File "scripts/txt2img.py", line 233, in main
uc = model.get_learned_conditioning(batch_size * [""])
File "c:\users\louis\stable-diffusion\ldm\models\diffusion\ddpm.py", line 558, in get_learned_conditioning
c = self.cond_stage_model.encode(c)
File "c:\users\louis\stable-diffusion\ldm\modules\encoders\modules.py", line 111, in encode
return self(text)
File "C:\Users\louis\anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "c:\users\louis\stable-diffusion\ldm\modules\encoders\modules.py", line 103, in forward
tokens = self.tknz_fn(text)#.to(self.device)
File "C:\Users\louis\anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "c:\users\louis\stable-diffusion\ldm\modules\encoders\modules.py", line 74, in forward
tokens = batch_encoding["input_ids"].to(self.device)
File "C:\Users\louis\anaconda3\envs\ldm\lib\site-packages\torch\cuda\__init__.py", line 216, in _lazy_init
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
I have already set the script to use the CPU in scripts/txt2img.py, so the earlier error of something like "user defined to use cuda, but no cuda device available" is fixed.
So my question(s):
-Is what I'm trying even possible?
-If yes, how should I edit the code to make it work?
(-Would it even be possible to modify it to work on AMD GPUs using ROCm?)
The Repo: https://github.com/CompVis/stable-diffusion
I am using the LAION-400M weights, because I currently don't have access to the SD ones.
I got them using:
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt
Guide I followed: https://github.com/lstein/stable-diffusion
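For reference, the generic PyTorch pattern for forcing CPU execution looks roughly like the sketch below (the checkpoint path is the one downloaded above; nn.Linear is only a stand-in for the real model, and this is not the actual txt2img.py change). The traceback also shows the crash happening in a .to(self.device) call inside ldm/modules/encoders/modules.py, so any device='cuda' default in those modules would presumably need to be changed to 'cpu' as well.
import torch
import torch.nn as nn

device = torch.device("cpu")

# Checkpoints saved on a GPU machine must be mapped to the CPU at load time,
# otherwise loading tries to initialise CUDA and fails like above
state = torch.load("models/ldm/text2img-large/model.ckpt", map_location="cpu")

# Every module and every input tensor then has to live on the CPU device
model = nn.Linear(4, 4).to(device)    # stand-in for the LatentDiffusion model
x = torch.randn(1, 4, device=device)
y = model(x)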
I'm trying to get graph neural network code to run on a cluster (the code I use always worked perfectly fine there until about half a year ago). I have the following versions:
python 3.6.9
torch 1.6.0
torch-geometric 2.0.3
torch-scatter 2.0.5
torch-sparse 0.6.8
torchvision 0.7.0
If I run the code, I get the following error:
Traceback (most recent call last):
File "GNN_20210503_alltimes_trainsep.py", line 442, in
for train in train_dataloader:
File "/home/alkemade/.conda/envs/ML8/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in next
data = self._next_data()
File "/home/alkemade/.conda/envs/ML8/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 403, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/alkemade/.conda/envs/ML8/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/alkemade/.conda/envs/ML8/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py", line 20, in call
self.exclude_keys)
File "/home/alkemade/.conda/envs/ML8/lib/python3.7/site-packages/torch_geometric/data/batch.py", line 75, in from_data_list
exclude_keys=exclude_keys,
File "/home/alkemade/.conda/envs/ML8/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 109, in collate
out_store.batch = repeat_interleave(repeats, device=device)
File "/home/alkemade/.conda/envs/ML8/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 205, in repeat_interleave
outs = [torch.full((n, ), i, device=device) for i, n in enumerate(repeats)]
File "/home/alkemade/.conda/envs/ML8/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 205, in
outs = [torch.full((n, ), i, device=device) for i, n in enumerate(repeats)]
RuntimeError: Providing a bool or integral fill value without setting the optional dtype or out arguments is currently unsupported. In PyTorch 1.7, when dtype and out are not set a bool fill value will return a tensor of torch.bool dtype, and an integral fill value will return a tensor of torch.long dtype.
Downgrading the package versions, as was suggested to others with this problem, unfortunately does not work. Does anyone know how to fix this? Thank you in advance,
Best, Rinske
Upgrade PyTorch to 1.7.0.
This worked for me.
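If upgrading is not possible, the error message itself points at a workaround: pass an explicit dtype to torch.full. A hedged sketch of what the failing list comprehension from torch_geometric's collate.py looks like with that change (a local patch idea, not an official fix):
import torch

repeats = [3, 2]
# On torch 1.6 an integral fill value without an explicit dtype raises the
# RuntimeError above; spelling out torch.long gives the behaviour that 1.7
# makes the default for integral fill values
outs = [torch.full((n,), i, dtype=torch.long) for i, n in enumerate(repeats)]
print(outs)  # [tensor([0, 0, 0]), tensor([1, 1])]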
I am trying to reproduce the results generated by the code (adapted from https://github.com/alexarnimueller/LSTM_peptide).
This code was written for TF 1.x. I tried to modify it for TF 2.x with my very limited knowledge of TensorFlow, but it never works and raises the error given below. From a lot of searching I learned that the issue could be related to graph vs. eager mode. How can I solve this problem?
Code: due to the character limit I have only included the function that raises the error; the full code can be found here (https://github.com/alexarnimueller/LSTM_peptides/blob/master/LSTM_peptides.py):
def get_num_params(self):
"""Method to get the amount of trainable parameters in the model.
"""
trainable = int(np.sum([K.count_params(p) for p in set(self.model.trainable_weights)]))
non_trainable = int(np.sum([K.count_params(p) for p in set(self.model.non_trainable_weights)]))
print('\nMODEL PARAMETERS')
print('Total parameters: %i' % (trainable + non_trainable))
print('Trainable parameters: %i' % trainable)
print('Non-trainable parameters: %i' % non_trainable)
Error:
2020-05-17 14:20:12.184210: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Traceback (most recent call last):
File "LSTM_peptides_II.py", line 792, in <module>
finetune=args.finetune, references=args.refs)
File "LSTM_peptides_II.py", line 714, in main
dropoutfract=dropout, l2_reg=l2_rate, ask=True, seed=42)
File "LSTM_peptides_II.py", line 443, in __init__
self.initialize_model(seed=self.seed)
File "LSTM_peptides_II.py", line 495, in initialize_model
self.get_num_params()
File "LSTM_peptides_II.py", line 677, in get_num_params
trainable = int(np.sum([K.count_params(p) for p in set(self.model.trainable_weights)]))
File "/home/User1/anaconda3/envs/TF2/lib/python3.7/site-packages/tensorflow_core/python/ops/variables.py", line 1086, in __hash__
raise TypeError("Variable is unhashable if Tensor equality is enabled. "
TypeError: Variable is unhashable if Tensor equality is enabled. Instead, use tensor.experimental_ref() as the key.
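One possible TF 2.x adaptation, following the hint in the error message, is to deduplicate the weights by experimental_ref() instead of putting the Variables directly into a set. A sketch under that assumption (not the repository's official fix):
import numpy as np
from tensorflow.keras import backend as K

def get_num_params(model):
    """Count trainable and non-trainable parameters without hashing Variables."""
    # Variables are unhashable when tensor equality is enabled; keying a dict
    # by experimental_ref() (as the error suggests) removes duplicates instead
    trainable_vars = {w.experimental_ref(): w for w in model.trainable_weights}.values()
    non_trainable_vars = {w.experimental_ref(): w for w in model.non_trainable_weights}.values()
    trainable = int(np.sum([K.count_params(w) for w in trainable_vars]))
    non_trainable = int(np.sum([K.count_params(w) for w in non_trainable_vars]))
    print('\nMODEL PARAMETERS')
    print('Total parameters: %i' % (trainable + non_trainable))
    print('Trainable parameters: %i' % trainable)
    print('Non-trainable parameters: %i' % non_trainable)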
Torch 0.4.1
Python 2.7.12
I was adapting the NMP QC code (with some compatibility issues ironed out) to use multiple GPUs, since my GPU couldn't handle the workload (it crashed after running out of VRAM).
I'm new to PyTorch, but I found a tutorial on using nn.DataParallel(model) to implement multi-GPU use.
I modified main.py to use nn.DataParallel(model). The areas I changed are marked with "#NEW".
The code runs fine even in multi-GPU mode when running on a single GPU, but I get an "arguments are located on different GPUs" error when running on two or more GPUs:
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs3
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/QKSBQ5PAFDDC3OMBEELQQETALQ:/var/lib/docker/overlay2/l/WWYI3IDQPNXGON7AHODBPSTVXL:/var/lib/docker/overlay2/l/Q54I2HYS4TKH4LDJKBTVTGWWO6:/var/lib/docker/overlay2/l/IUV2LFPNMPOS3MREOTT52TKL54:/var/lib/docker/overlay2/l/DB5GBUCI3DCBPX6TJG3O337YVB:/var/lib/docker/overlay2/l/DNYKXCZJH5FMFNJLNGYJJ2ITPI:/var/lib/docker/overlay2/l/7DZCTDVNSTPJISGW65UG7U3F75:/var/lib/docker/overlay2/l/VOEQO652VS63NLDLZZ4TCIJLO6:/var/lib/docker/overlay2/l/4SI6ZCRUIORG5'
Traceback (most recent call last):
File "main.py", line 332, in <module>
main()
File "main.py", line 190, in main
train(train_loader, model, criterion, optimizer, epoch, evaluation, logger)
File "main.py", line 251, in train
output = model(g, h, e)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:236
Since I was sending the inputs one at a time instead of all at once like in the tutorial, I checked with .get_device(), which confirmed that all four arguments being sent (g, h, e, target) were on the same device (device 0).
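For comparison, the usual nn.DataParallel pattern keeps the whole batch on the default device and lets the wrapper scatter it along dimension 0 (a generic sketch assuming two or more visible GPUs, not the NMP QC code itself):
import torch
import torch.nn as nn

# DataParallel replicates the module on each GPU and splits dim 0 of every
# tensor argument, so inputs should be full batches on the default device
model = nn.DataParallel(nn.Linear(16, 4)).cuda()
x = torch.randn(8, 16).cuda()   # whole batch on device 0; DP scatters it
out = model(x)                  # replicas run in parallel, outputs gathered
print(out.shape)                # torch.Size([8, 4])
A common cause of the "arguments are located on different GPUs" error is a module that creates or stores tensors pinned to a fixed device (e.g. built with .cuda() instead of using the device of its inputs), so the replicas running on the other GPUs end up mixing devices.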
I have non-sudo access to a machine with NVIDIA GPUs and CUDA 7.5 installed. I installed PyTorch with CUDA 7.5 support, which seems to have worked:
>>> import torch
>>> torch.cuda.is_available()
True
To get some practice, I followed a tutorial for machine translation using RNNs. When I set USE_CUDA = False and the CPUs are used, everything works fine. However, when I want to utilize the GPUs with USE_CUDA = True, I get the following error:
Traceback (most recent call last):
...
File "seq2seq.py", line 229, in train
encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
File "/.../python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "seq2seq.py", line 144, in forward
output, hidden = self.gru(embedded, hidden)
File "/.../python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/.../python2.7/site-packages/torch/nn/modules/rnn.py", line 91, in forward
output, hidden = func(input, self.all_weights, hx)
...
File "/.../python2.7/site-packages/torch/backends/cudnn/rnn.py", line 42, in init_rnn_descriptor
cudnn.DropoutDescriptor(handle, dropout_p, fn.dropout_seed)
File "/usr/lib/python2.7/ctypes/__init__.py", line 383, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: python: undefined symbol: cudnnCreateDropoutDescriptor
Exception AttributeError: 'python: undefined symbol: cudnnDestroyDropoutDescriptor' in <bound method DropoutDescriptor.__del__ of <torch.backends.cudnn.DropoutDescriptor object at 0x7fe540efec10>> ignored
I've tried searching Google for that error but got no meaningful results. Since I'm rather a newbie with PyTorch and CUDA, I have no idea how to proceed from here. The full setup is Ubuntu 14.04, Python 2.7, CUDA 7.5.
As stated in the comments, your error comes from an outdated cuDNN and can be resolved by upgrading.
Install current versions of CUDA, cuDNN, and PyTorch and you'll be fine.
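After reinstalling, a quick sanity check (on a reasonably recent PyTorch build) that the CUDA and cuDNN libraries are actually visible:
import torch

print(torch.version.cuda)                   # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())       # cuDNN version PyTorch was built with
print(torch.cuda.is_available())            # runtime GPU availability
print(torch.backends.cudnn.is_available())  # runtime cuDNN availability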