Torch 0.4.1
Python 2.7.12
I was adapting the NMP QC code (with some compatibility issues ironed out) to use multiple GPUs, since a single GPU couldn't handle the workload (it crashed after running out of VRAM).
I'm new to PyTorch, but I found a tutorial on using nn.DataParallel(model) for multi-GPU training.
I modified main.py to use nn.DataParallel(model); the areas I changed are marked with "#NEW".
The code runs fine, even in multi-GPU mode, when restricted to a single GPU, but raises an "arguments are located on different GPUs" error when running on 2 or more GPUs:
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs3
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/QKSBQ5PAFDDC3OMBEELQQETALQ:/var/lib/docker/overlay2/l/WWYI3IDQPNXGON7AHODBPSTVXL:/var/lib/docker/overlay2/l/Q54I2HYS4TKH4LDJKBTVTGWWO6:/var/lib/docker/overlay2/l/IUV2LFPNMPOS3MREOTT52TKL54:/var/lib/docker/overlay2/l/DB5GBUCI3DCBPX6TJG3O337YVB:/var/lib/docker/overlay2/l/DNYKXCZJH5FMFNJLNGYJJ2ITPI:/var/lib/docker/overlay2/l/7DZCTDVNSTPJISGW65UG7U3F75:/var/lib/docker/overlay2/l/VOEQO652VS63NLDLZZ4TCIJLO6:/var/lib/docker/overlay2/l/4SI6ZCRUIORG5'
Traceback (most recent call last):
File "main.py", line 332, in <module>
main()
File "main.py", line 190, in main
train(train_loader, model, criterion, optimizer, epoch, evaluation, logger)
File "main.py", line 251, in train
output = model(g, h, e)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:236
Since I was passing the inputs individually instead of all at once as in the tutorial, I checked each of them with .get_device(), which confirmed that all four arguments being sent (g, h, e, target) were on the same device (device 0).
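For reference, the pattern I followed from the tutorial looks roughly like this (a sketch only; MPNN stands in for the actual NMP QC model, and its forward takes g, h, e as in train()):

import torch
import torch.nn as nn

device = torch.device('cuda:0')

model = MPNN(...)                      # hypothetical constructor for the NMP QC model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)     # replicates the model and splits inputs across GPUs
model = model.to(device)

# g, h, e, target come from the data loader, as in train(). The inputs only need
# to live on the default device; DataParallel scatters each tensor argument along
# dim 0 (the batch dimension) and gathers the outputs back on device 0.
g, h, e, target = (t.to(device) for t in (g, h, e, target))
output = model(g, h, e)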
I'm trying to get StableTuner working on Arch Linux. I've gotten fairly far, but I'm now facing a problem when I run the .sh file used for training.
I'm getting this error when trying to run StableTuner:
[campfire#archlinux scripts]$ bash run.sh
Booting Up StableTuner
Please wait a moment as we load up some stuff...
/home/campfire/.local/lib/python3.10/site-packages/accelerate/accelerator.py:231: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
warnings.warn(
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/home/campfire/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('#/tmp/.ICE-unix/582,unix/archlinux'), PosixPath('local/archlinux')}
warn(
/home/campfire/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/org/freedesktop/DisplayManager/Session0')}
warn(
/home/campfire/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/org/freedesktop/DisplayManager/Seat0')}
warn(
/home/campfire/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//debuginfod.archlinux.org '), PosixPath('https')}
warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
/home/campfire/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
warn(
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary /home/campfire/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/campfire/.local/lib/python3.10/site-packages/bitsandbytes/cextension.py:48: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn(
/home/campfire/.local/lib/python3.10/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Creating Auto Bucketing Dataloader
Rounded resolution to: 512
Preloading images...
** Processing /home/campfire/Desktop: 100%|███████████████| 5/5 [00:00<00:00, 3562.34it/s]
** Number of buckets: 2
** Bucket (512, 512) found 1 images, will duplicate 34 images due to batch size 35
** Bucket (640, 384) found 2 images, will duplicate 33 images due to batch size 35
Number of image-caption pairs: 70
** Validation Set: val, steps: 2, repeats: 1
Generating latents cache...
Caching latents: 100%|███████████████| 2/2 [00:08<00:00, 4.17s/it]
Latents are ready.
0%| | 0/2 [00:00<?, ?it/s Starting Training!%| | 0/200 [00:00<?, ?it/s]
Fetching 15 files: 100%|███████████████| 15/15 [00:00<00:00, 28728.11it/s]
/home/campfire/.local/lib/python3.10/site-packages/transformers/models/clip/feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
warnings.warn(s: 0%| | 0/15 [00:00<?, ?it/s]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
Steps To Epoch: 0%| | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/campfire/StableTuner/scripts/scripts/trainer.py", line 2738, in <module>
main()
File "/home/campfire/StableTuner/scripts/scripts/trainer.py", line 2613, in main
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "/home/campfire/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/campfire/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 489, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/campfire/.local/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/campfire/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 424, in forward
sample, res_samples = downsample_block(
File "/home/campfire/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/campfire/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 770, in forward
hidden_states = torch.utils.checkpoint.checkpoint(
File "/home/campfire/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/campfire/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/campfire/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 763, in custom_forward
return module(*inputs, return_dict=return_dict)
File "/home/campfire/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/campfire/.local/lib/python3.10/site-packages/diffusers/models/attention.py", line 216, in forward
hidden_states = block(hidden_states, encoder_hidden_states=encoder_hidden_states, timestep=timestep)
File "/home/campfire/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/campfire/.local/lib/python3.10/site-packages/diffusers/models/attention.py", line 490, in forward
hidden_states = self.attn1(norm_hidden_states, attention_mask=attention_mask) + hidden_states
File "/home/campfire/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/campfire/.local/lib/python3.10/site-packages/diffusers/models/attention.py", line 638, in forward
hidden_states = self._attention(query, key, value, attention_mask)
File "/home/campfire/.local/lib/python3.10/site-packages/diffusers/models/attention.py", line 654, in _attention
attention_scores = torch.baddbmm(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.75 GiB (GPU 0; 23.65 GiB total capacity; 13.25 GiB already allocated; 7.24 GiB free; 13.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps To Epoch: 0%| | 0/2 [00:00<?, ?it/s]
Overall Steps: 0%| | 0/200 [00:02<?, ?it/s]
Overall Epochs: 0%| | 0/100 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/home/campfire/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/campfire/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/campfire/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
simple_launcher(args)
File "/home/campfire/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 552, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'scripts/trainer.py', '--attention=xformers', '--model_variant=base', '--disable_cudnn_benchmark', '--use_text_files_as_captions', '--sample_step_interval=500', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--pretrained_vae_name_or_path=', '--output_dir=output/new_model', '--seed=3434554', '--resolution=512', '--train_batch_size=35', '--num_train_epochs=100', '--mixed_precision=fp16', '--use_bucketing', '--aspect_mode=dynamic', '--aspect_mode_action_preference=add', '--use_8bit_adam', '--gradient_checkpointing', '--gradient_accumulation_steps=1', '--learning_rate=3e-6', '--lr_warmup_steps=0', '--lr_scheduler=constant', '--regenerate_latent_cache', '--train_text_encoder', '--token_limit=75', '--concepts_list=stabletune_concept_list.json', '--num_class_images=200', '--save_every_n_epoch=100', '--n_save_sample=1', '--sample_height=512', '--sample_width=512', '--dataset_repeats=1', '--sample_on_training_start']' returned non-zero exit status 1.
I was told this is due to the CUDA path not being defined, and that I needed to set
LD_LIBRARY_PATH=/opt/cuda/lib64:$LD_LIBRARY_PATH in the .sh file or before I run the program in the terminal; however, there isn't a CUDA folder inside /opt/.
I already have CUDA installed with PyTorch (as it was a requirement) inside the "ST" conda env.
torch 1.13.1+cu116
torchaudio 0.13.1+cu116
torchmetrics 0.11.1
torchvision 0.14.1+cu116
When I run pip show torch I get Location: /home/campfire/.local/lib/python3.10/site-packages. Since the PyTorch version came with cu116, I am assuming that is where I need to point the path to?
How would I solve this issue? Do I need to point the CUDA path to the "ST" conda env instead?
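For reference, a quick way to check where the pip-installed wheel keeps its CUDA runtime (a sketch; the exact contents of torch/lib depend on the wheel):

import os
import torch

print(torch.version.cuda)          # should report 11.6 for a +cu116 wheel
print(torch.cuda.is_available())   # True if the driver and runtime are usable

# pip wheels bundle their CUDA libraries under <site-packages>/torch/lib,
# but bitsandbytes searches LD_LIBRARY_PATH and the usual system locations
# (e.g. /opt/cuda/lib64) for libcudart.so rather than looking inside the wheel.
torch_lib = os.path.join(os.path.dirname(torch.__file__), 'lib')
print(torch_lib)
print(os.listdir(torch_lib))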
You need to install CUDA with sudo pacman -S cuda; then you will have /opt/cuda. This is assuming you are on Arch Linux, given the arch-linux tag on the post. The cuda package provides cuda-toolkit, cuda-sdk, and the other libraries you require.
UPDATE: I have edited and changed more code. Now I don't get an error, and it either works but takes hours, or it gets stuck on step one.
I have tried running Stable Diffusion, the new text-to-image model. The problem is: I don't have an NVIDIA GPU. After a bit of research, I found out you can "force" PyTorch to run on the CPU instead of the GPU, but so far everything I tried while modifying the existing code did not work. I always get to the point where it starts sampling, and it prints the following error (everything after the command):
Falling back to LAION 400M model...
Global seed set to 42
Loading model from models/ldm/text2img-large/model.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 872.30 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
data: 0%| | 0/1 [00:00<?, ?it/s]
Sampling: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "scripts/txt2img.py", line 279, in <module>
main()
File "scripts/txt2img.py", line 233, in main
uc = model.get_learned_conditioning(batch_size * [""])
File "c:\users\louis\stable-diffusion\ldm\models\diffusion\ddpm.py", line 558, in get_learned_conditioning
c = self.cond_stage_model.encode(c)
File "c:\users\louis\stable-diffusion\ldm\modules\encoders\modules.py", line 111, in encode
return self(text)
File "C:\Users\louis\anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "c:\users\louis\stable-diffusion\ldm\modules\encoders\modules.py", line 103, in forward
tokens = self.tknz_fn(text)#.to(self.device)
File "C:\Users\louis\anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "c:\users\louis\stable-diffusion\ldm\modules\encoders\modules.py", line 74, in forward
tokens = batch_encoding["input_ids"].to(self.device)
File "C:\Users\louis\anaconda3\envs\ldm\lib\site-packages\torch\cuda\__init__.py", line 216, in _lazy_init
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
I already set the device to CPU in txt2img.py, so the earlier error along the lines of "user specified cuda, but no CUDA device is available" is fixed.
So my questions:
- Is what I'm trying even possible?
- If yes, how should I edit the code to make it work?
- (Would it even be possible to modify it to work on AMD GPUs using ROCm?)
The Repo: https://github.com/CompVis/stable-diffusion
Using the LAION-400M weights, because I currently don't have access to the SD ones.
I got them using:
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt
Guide I followed: https://github.com/lstein/stable-diffusion
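For reference, the kind of edits I have been attempting look roughly like this (a sketch under the assumption that the hard-coded "cuda" usages are the ones the traceback points at; names and line numbers vary between versions of the repo):

import torch

device = torch.device("cpu")

# scripts/txt2img.py: load the checkpoint weights onto the CPU instead of the GPU...
state = torch.load("models/ldm/text2img-large/model.ckpt", map_location="cpu")

# ...and keep the instantiated model there, i.e. replace model.cuda() with
# model = model.to(device).

# ldm/modules/encoders/modules.py: the text-encoder wrappers store a device
# attribute that defaults to "cuda"; it needs to become "cpu" as well, otherwise
# batch_encoding["input_ids"].to(self.device) (line 74 in the traceback) still
# triggers CUDA initialisation, which is exactly the failure shown above.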
After installing MLflow using one-click-mlflow, I save a PyTorch model using the default command that I found in the user guide. You can find the command below:
mlflow.pytorch.log_model(net, artifact_path="model", pickle_module=pickle)
The saved neural network is very simple: it is basically a two-layer network with Xavier initialization and the hyperbolic tangent as the activation function.
import torch as T

# n_features (the number of input features) is defined earlier in the notebook.

class Net(T.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hid1 = T.nn.Linear(n_features, 10)
        self.hid2 = T.nn.Linear(10, 10)
        self.oupt = T.nn.Linear(10, 1)

        T.nn.init.xavier_uniform_(self.hid1.weight)
        T.nn.init.zeros_(self.hid1.bias)
        T.nn.init.xavier_uniform_(self.hid2.weight)
        T.nn.init.zeros_(self.hid2.bias)
        T.nn.init.xavier_uniform_(self.oupt.weight)
        T.nn.init.zeros_(self.oupt.bias)

    def forward(self, x):
        z = T.tanh(self.hid1(x))
        z = T.tanh(self.hid2(z))
        z = self.oupt(z)
        return z
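For context, the logging call then runs inside an MLflow run, roughly like this (a sketch; the metric name is illustrative and the training loop is omitted):

import pickle
import mlflow
import mlflow.pytorch

net = Net()
# ... training loop ...

with mlflow.start_run():
    mlflow.log_metric("train_loss", 0.0)  # illustrative; metrics log without problems
    mlflow.pytorch.log_model(net, artifact_path="model", pickle_module=pickle)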
Everything runs fine in the Jupyter Notebook. I can log metrics and other artifacts, but when I save the model I get the following error message:
2021/10/13 09:21:00 WARNING mlflow.utils.requirements_utils: Found torch version (1.9.0+cu111) contains a local version label (+cu111). MLflow logged a pip requirement for this package as 'torch==1.9.0' without the local version label to make it installable from PyPI. To specify pip requirements containing local version labels, please use `conda_env` or `pip_requirements`.
2021/10/13 09:21:00 WARNING mlflow.utils.requirements_utils: Found torchvision version (0.10.0+cu111) contains a local version label (+cu111). MLflow logged a pip requirement for this package as 'torchvision==0.10.0' without the local version label to make it installable from PyPI. To specify pip requirements containing local version labels, please use `conda_env` or `pip_requirements`.
2021/10/13 09:21:01 ERROR mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpnl9dsoye/model/data, flavor: pytorch)
Traceback (most recent call last):
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/environment.py", line 212, in infer_pip_requirements
return _infer_requirements(model_uri, flavor)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/requirements_utils.py", line 263, in _infer_requirements
modules = _capture_imported_modules(model_uri, flavor)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/requirements_utils.py", line 221, in _capture_imported_modules
_run_command(
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/requirements_utils.py", line 173, in _run_command
raise MlflowException(msg)
mlflow.exceptions.MlflowException: Encountered an unexpected error while running ['/home/ucsky/.virtualenv/mymodel/bin/python', '/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/_capture_modules.py', '--model-path', '/tmp/tmpnl9dsoye/model/data', '--flavor', 'pytorch', '--output-file', '/tmp/tmplyj0w2fr/imported_modules.txt', '--sys-path', '["/home/ucsky/project/ofi-ds-research/incubator/ofi-pe-fr/notebook/guillaume-simon/06-modelisation-pytorch", "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/git/ext/gitdb", "/usr/lib/python39.zip", "/usr/lib/python3.9", "/usr/lib/python3.9/lib-dynload", "", "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages", "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/IPython/extensions", "/home/ucsky/.ipython", "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/gitdb/ext/smmap"]']
exit status: 1
stdout:
stderr: Traceback (most recent call last):
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/_capture_modules.py", line 125, in <module>
main()
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/_capture_modules.py", line 118, in main
importlib.import_module(f"mlflow.{flavor}")._load_pyfunc(model_path)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/pytorch/__init__.py", line 723, in _load_pyfunc
return _PyTorchWrapper(_load_model(path, **kwargs))
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/pytorch/__init__.py", line 626, in _load_model
return torch.load(model_path, **kwargs)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/torch/serialization.py", line 607, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/torch/serialization.py", line 882, in _load
result = unpickler.load()
File "/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/torch/serialization.py", line 875, in find_class
return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'Net' on <module '__main__' from '/home/ucsky/.virtualenv/mymodel/lib/python3.9/site-packages/mlflow/utils/_capture_modules.py'>
Can somebody explain to me what is wrong?
I am running some TensorFlow code that restores and restarts training from a checkpoint. Whenever I restore from a CPU build it works perfectly fine, but if I try to restore when running my code on the GPU, it does not work. In particular, I get this error:
Traceback (most recent call last):
File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module>
large_main_hp.main_large_hp_ckpt(arg)
File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt
run_hyperparam_search(arg)
File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search
main_hp.main_hp(arg)
File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp
with tf.Session(graph=graph) as sess:
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
next(self.gen)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615
I see that it says I am running out of memory, but when I increase the memory to, say, 10 GB, it doesn't really change anything. This only happens with my GPU build; the CPU one restores perfectly fine.
Does anyone have any ideas, or even starting points, for what might be causing this?
The GPUs are essentially assigned automatically, so I'm not quite sure what is causing it or what the first steps to debug it would even be.
full error:
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615
Traceback (most recent call last):
File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module>
large_main_hp.main_large_hp_ckpt(arg)
File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt
run_hyperparam_search(arg)
File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search
main_hp.main_hp(arg)
File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp
with tf.Session(graph=graph) as sess:
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
next(self.gen)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
TensorFlow on the CPU benefits from both physical and virtual memory, giving you almost unlimited memory to manipulate your models. Your first step in debugging is to build a smaller model, by simply removing some weights/layers, and run it on the GPU to ensure you have no coding errors. Then slowly increase the layers/weights until you run out of memory again; this will confirm that you have a memory issue on the GPU. I would recommend initially building your graph on the GPU, so that you know it will fit later when you train on it. If you need the large graph, then consider allocating parts of the graph to different GPUs if you have them.
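A minimal sketch of that kind of placement (TF 1.x style, matching the versions in the traceback; the variable shapes are illustrative):

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Pin different parts of the graph to different devices.
    with tf.device('/gpu:0'):
        w1 = tf.Variable(tf.random_normal([4096, 4096]), name='w1')
    with tf.device('/gpu:1'):
        w2 = tf.Variable(tf.random_normal([4096, 4096]), name='w2')

# Grow GPU memory as needed instead of grabbing it all up front, and fall back
# to a supported device if a requested placement is impossible.
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True

with tf.Session(graph=graph, config=config) as sess:
    sess.run(tf.global_variables_initializer())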
I am implementing some deep learning algorithms using Theano. After I stop some programs running Theano, occasionally the following error appears when I want to import theano again:
>>> import theano
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
initCnmem: cnmemInit call failed! Reason=CNMEM_STATUS_OUT_OF_MEMORY. numdev=1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/__init__.py", line 118, in <module>
theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/sandbox/cuda/tests/test_driver.py", line 40, in test_nvidia_driver1
if not numpy.allclose(f(), a.sum()):
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 875, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/gof/link.py", line 317, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/home/jjhu/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 862, in __call__
self.fn() if output_subset is None else\
RuntimeError: Cuda error: kernel_reduce_ccontig_node_4894639462a290346189bb38dab7bb7e_0: out of memory. (grid: 1 x 1; block: 256 x 1 x 1)
Apply node that caused the error: GpuCAReduce{add}{1}(<CudaNdarrayType(float32, vector)>)
Toposort index: 0
Inputs types: [CudaNdarrayType(float32, vector)]
Inputs shapes: [(10000,)]
Inputs strides: [(1,)]
Inputs values: ['not shown']
Outputs clients: [[HostFromGpu(GpuCAReduce{add}{1}.0)]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
I searched for several solutions. Someone suggested removing the compilation folder with rm -rf ./theano. I also checked that the owner of ./theano is not the root user. I also tried setting my ./theanorc as follows, but neither works for me.
[global]
floatX = float32
device = cpu
optimizer=fast_run
[lib]
cnmem = 0.1
[cuda]
root = /usr/local/cuda
The only working solution is to reboot or log out of the machine, which is very awkward. I don't know what causes this problem. Can anyone suggest some solutions?
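A diagnostic sketch that may help short of rebooting (assuming nvidia-smi is available on the machine): list the processes that are still holding GPU memory after the Theano program was stopped, then terminate the stale ones by PID.

import subprocess

# Show which processes are still holding GPU memory; a Theano process that did
# not exit cleanly keeps its CNMeM pool allocated until it is killed.
print(subprocess.check_output(['nvidia-smi']).decode())

# After spotting a stale PID in the Processes table, terminate it (e.g. with
# `kill -9 <PID>` from a shell); this frees the memory without rebooting.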