I have a working environment for PyTorch deep learning with GPU, and I ran into a problem when I tried using mmcv.ops.point_sample, which returned:
ModuleNotFoundError: No module named 'mmcv._ext'
I have read that you should actually use mmcv-full to solve it, but I got another error when I tried to install it:
pip install mmcv-full
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
Which seems logical enough, since I never installed CUDA on my Ubuntu machine (I am not the administrator), yet it still ran deep learning training fine on models I built myself; I'm guessing the package came with the minimal code required for running CUDA tensor operations.
So my main question is: where is CUDA installed when used through the PyTorch package, and can I use that path as the CUDA_HOME environment variable?
Additionally, if anyone knows some nice sources for gaining insight into the internals of CUDA with PyTorch/TensorFlow, I'd like to take a look (I have been reading the CUDA Toolkit documentation, which is cool, but it seems targeted more at C++ CUDA developers than at the internal workings between Python and the library).
You can check it, and check the paths, with these commands:
which nvidia-smi
which nvcc
cat /usr/local/cuda/version.txt
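From the Python side, you can also ask PyTorch itself where its CUDA pieces live (a quick sketch; note that pip wheels bundle only the CUDA runtime libraries inside the torch package, not a full toolkit with nvcc, which is why CUDA_HOME can legitimately be unset):
import os
import torch
from torch.utils.cpp_extension import CUDA_HOME

print(torch.version.cuda)   # CUDA version the installed wheel was built against
print(CUDA_HOME)            # toolkit root PyTorch found on the system, or None
print(os.path.join(os.path.dirname(torch.__file__), "lib"))  # bundled runtime libraries live here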
On the PyTorch website there are two blocks of commands for the ROCm version installation. The first one, which installs torch itself, goes well, but when I try to import it I get this message:
ImportError: libtinfo.so.5: cannot open shared object file: No such file or directory
Also, when trying to install the torchvision package with the second block of commands, I get a similar error:
ModuleNotFoundError: No module named 'torch'
This only happens with the ROCm compute platform. Installing with CUDA works just fine, but unfortunately I don't have an NVIDIA GPU.
I believe it is a bug that hasn't been fixed. You can make a local symbolic link named libtinfo.6.so to /usr/lib/libtinfo5.so, in the same folder as libtaichi_core.so.
This should solve it.
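For what it's worth, a minimal sketch of creating such a link with Python's standard library; the usual pattern is to point the soname the loader asks for at the newer library actually on disk. Both paths below are assumptions to adapt to your system:
import os

missing = "libtinfo.so.5"           # the name the ImportError complains about
present = "/usr/lib/libtinfo.so.6"  # assumption: a newer libtinfo installed on the system
if not os.path.exists(missing):
    os.symlink(present, missing)    # the link is created in the current directory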
I am trying to follow this guide to install the TensorFlow Profiler to better understand why my recently installed Keras does run on the GPU but hardly uses any resources (and is really slow). However, I am unable to get any results, since the guide does not provide me with sufficient information and, not being a programmer by trade, I obviously lack the necessary knowledge.
What have I tried so far?
I use Anaconda and have a running installation of Python 3.7. I also installed TensorFlow and the necessary drivers and such, so that TensorFlow is able to access my GPUs. Following the linked guide, I downloaded "install_and_run.py" and tried executing it from the conda prompt. I get asked to specify --envdir and --logdir. Where do I point these? Is the environment directory just the directory of my current conda environment? I tried pointing both envdir and logdir there and ended up with an error saying that the command
"True" is unknown and returned non-zero exit status 1.
I could not come up with any solution for this. It should probably be mentioned that I have very little experience in using the conda prompt to run .py-files and usually only use it to install packages.
I am also unsure what is meant by the subsequent steps that talk about the CUPTI path. The given path is not a complete path as far as I know. Where am I supposed to look for it? Or am I meant to execute some of this
/sbin/ldconfig -N -v $(sed 's/:/ /g' <<< $LD_LIBRARY_PATH) | \
grep libcupti
as a command? I have tried running /sbin/ldconfig -N -v $ but my system could not find the path (potentially because I started looking from the wrong directory?).
Any help is much appreciated. Sorry for the potentially confusing post from a confused person.
Thank you!
The TensorFlow Profiler is no longer bundled with TensorBoard. There is a tutorial on how to install and run it when fitting a Keras model.
The summary is:
Inside your env run pip install tensorboard_plugin_profile
Declare a tensorboard callback as you normally would
import tensorflow as tf

logs = "logs/"  # choose where the profiling data should be written
tboard_callback = tf.keras.callbacks.TensorBoard(log_dir = logs,
                                                 histogram_freq = 1,
                                                 profile_batch = '500,520')
Fit your model (with declared tensorboard callback)
In a separate terminal (with your env activated) run tensorboard --logdir=path/to/logs
The Profiler tab shown in the tutorial may not be visible, but there should be a profile option available in the drop-down menu in the top right corner.
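Putting the steps together, a minimal end-to-end sketch (the model and data are toy placeholders, and profile_batch is set lower than in the snippet above so the profiled range is actually reached by this small dataset):
import tensorflow as tf

# Toy data and model, purely as placeholders.
x = tf.random.normal((1024, 8))
y = tf.random.normal((1024, 1))
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

tboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs",
                                                 histogram_freq=1,
                                                 profile_batch='10,15')

model.fit(x, y, epochs=2, callbacks=[tboard_callback])
# Then, in a separate terminal: tensorboard --logdir=logs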
Whenever I try to import TensorFlow on my Windows machine, it says "The kernel appears to have died. It will restart automatically," and then it doesn't work at all.
Below is the message given by the Jupyter terminal:
Warning! HDF5 library version mismatched error
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
You can, at your own risk, disable this warning by setting the environment
variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
Setting it to 2 or higher will suppress the warning messages totally.
Headers are 1.10.1, library is 1.10.2
What could solve this problem?
My version of Python is 3.6.3, and I have updated the conda packages as well.
I have a Windows 10 machine with 16 GB RAM, so it can't be a memory issue either.
I was working with TensorFlow previously, but now it's not working. This started happening about 2 months back, while I was working on a university assignment: the same code that was working properly crashed when I ran it again in Jupyter Notebook on the very same day, and I have been facing this problem since.
I also tried importing TensorFlow in the command prompt; it still shows the same error.
Has anyone encountered the same problem? What could be the fix?
The message says: Headers are 1.10.1, library is 1.10.2
You need to install the 1.10.1 version:
conda install -c anaconda hdf5=1.10.1
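After reinstalling, you can verify which HDF5 version Python now links against (a quick check, assuming h5py is installed):
import h5py
print(h5py.version.hdf5_version)  # should now report 1.10.1, matching the headers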
I have TF 0.8 installed on Python 3, Mac OS X El Capitan.
When running a simple test code for TF I get this message:
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so, 10):
Library not loaded: #rpath/libcudart.7.5.dylib
Referenced from: /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
Reason: image not found
My .bash_profile is as follows:
export PATH=/usr/local/bin:$PATH
export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-7.5/lib:/usr/local/cuda/lib
In /Developer/NVIDIA/CUDA-7.5/lib I have a file called libcudart.7.5.dylib.
In /usr/local/cuda/lib I have an alias called libcudart.7.5.dylib.
I have tried several permutations of .bash_profile without success. Any idea what may be wrong?
Note that I can successfully use my GPU with Theano so there's no reason to believe the GPU/cuDNN/CUDA install may be faulty.
If you're getting this error, make sure you installed CUDA and cuDNN correctly as described in the TensorFlow install instructions. Note the TF, CUDA, and cuDNN versions you're installing, and which Python version you're using.
Filenames, paths, etc. frequently vary, so small tweaks to filenames and paths may be needed in your case if errors crop up. Sometimes it's hard for others to help you because your system may have a very specific path setup/version that cannot be understood by someone on a forum.
If you're getting exactly the error I'm describing in the OP, take a step back and check:
is this happening in PyCharm?
is this happening in iPython within PyCharm?
is it happening in both?
is this happening in iPython in Terminal?
In my case it was happening only in PyCharm. In iPython outside of PyCharm (that is, using the Mac 'Terminal' app) everything worked fine. But when using iPython in PyCharm, or when running the test file through PyCharm, I would get the error. This means it had something to do with PyCharm, not the TensorFlow install.
Make sure your DYLD_LIBRARY_PATH is correctly pointing to the libcudart.7.5.dylib file. Navigate there with Finder, or do a Spotlight search to find the file or its alias. Then place that path in your .bash_profile. In my case, this is working:
export DYLD_LIBRARY_PATH=/usr/local/cuda/lib
If your problem is PyCharm, a specific configuration is needed. Go to the upper right corner of the GUI and click the gray down arrow.
Choose "Edit Configuration". You will see an Environment option where you need to click on the ... box and enter the DYLD_LIBRARY_PATH that applies to your case.
Note that there's an Environment option for the specific file you're working on (it will be highlighted in the left panel) and for Defaults (put DYLD_... there as well if you want future files you create to have this). Note that you need to save this config, or else it won't stick when you close PyCharm.
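To confirm PyCharm actually picked the variable up, a quick sanity check you can run from the script itself:
import os
print(os.environ.get("DYLD_LIBRARY_PATH"))  # None means PyCharm never received it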
As an extension to pepe's answer, which is the correct one, I don't mind if the following is integrated into the original answer.
I would like to add that if you wish to make this change permanent in PyCharm (it affects only the current project), rather than having to repeat it for every new file, you can do so from the interface shown above by pepe, by going under "Defaults" and setting DYLD_LIBRARY_PATH there.
Keep in mind that this change by itself doesn't alter the run configuration of the current script, which still needs to be changed manually (or deleted and regenerated from the new defaults).
Could you try TF 0.9, which adds Mac OS X GPU support?
https://github.com/tensorflow/tensorflow/blob/r0.9/RELEASE.md
It appears that CUDA is not being found on your system. This could be for a number of reasons, including installing CUDA for one version of Python while running a different version that isn't aware of the other version's installed files.
Please take a look at my answer here.
https://stackoverflow.com/a/41073045/1831325
In my case, the TensorFlow version is 1.1 and the dlopen error happens in both ipython and pycharm.
Environment:
CUDA version: 8.0.62
cuDNN version: 6
The error is a little different in pycharm and ipython. I cannot remember much detail, but ipython says there is no libcudnn.5.dylib, while pycharm just says there is an import error, image not found.
Solution:
Download cuDNN version 5 from
https://developer.nvidia.com/rdp/cudnn-download
Unzip the cuDNN archive, then copy lib/ to /usr/local/cuda/lib and include/ to /usr/local/cuda/include:
unzip cuda.zip
cd cuda
sudo cp -r lib/* /usr/local/cuda/lib
sudo cp include/cudnn.h /usr/local/cuda/include
Add the lib directory path to your DYLD_LIBRARY_PATH, like this in my ~/.bash_profile:
export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-8.0/lib:/usr/local/cuda/lib${DYLD_LIBRARY_PATH:+:${DYLD_LIBRARY_PATH}}
The official TensorFlow installation guide says cuDNN 5.1 is needed, so this was all down to carelessness:
https://www.tensorflow.org/install/install_mac
Requirements to run TensorFlow with GPU support.
If you are installing TensorFlow with GPU support using one of the mechanisms described in this guide, then the following NVIDIA software must be installed on your system:
...
The NVIDIA drivers associated with CUDA Toolkit 8.0.
cuDNN v5.1. For details, see NVIDIA's documentation.
...
I think the problem is SIP (System Integrity Protection). Restricted processes run with cleared environment variables, which is why you get this error.
You need to boot into Recovery Mode, start Terminal, and input
$ csrutil disable
then reboot.
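You can check the current state beforehand with
csrutil status
(run from a normal Terminal), and re-enable protection later with csrutil enable. Keep in mind that disabling SIP has security implications.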
I have recently uninstalled a nicely working copy of Enthought Canopy 32-bit and installed Canopy version 1.1.0 (64 bit). When I try to use sklearn to fit a model my kernel crashes, and I get the following error:
The kernel (user Python environment) has terminated with error code 3. This may be due to a bug in your code or in the kernel itself.
Output captured from the kernel process is shown below.
OMP: Error #15: Initializing libiomp5md.dll, but found mk2iomp5md.dll already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
The same code was running just fine under Canopy's 32-bit version. The code is really just a simple fit of a
linear_model.SGDClassifier(loss='log')
(same error for Logistic Regression, haven't tried other models)
How do I fix this?
I had the same problem, coming from conflicting installations of numpy and from Canopy. I resolved it by writing:
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
Not an elegant solution, but it did the job for me.
You will almost certainly be able to get past this error by setting the environment variable
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
However, setting this variable is not the recommended solution (as the error message itself says). Instead, you may try setting up your conda environment without the Intel Math Kernel Library, by running:
conda install nomkl
You may need to install some packages again after doing this, if the versions you had were built against MKL (though conda should just do this for you). If not, you will need to reinstall the packages that would normally include MKL or depend on packages that include MKL, such as scipy, numpy, and pandas; conda will install the non-MKL versions of these packages together with their dependencies (see here).
For macOS, nomkl is a good option, since the optimization MKL provides is in fact already provided by Apple's Accelerate framework, which already uses OpenMP. It seems this is in fact the reason the error ('...multiple copies of the OpenMP runtime...') is triggered (as stated in this answer).
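To check which BLAS/LAPACK backend numpy ended up with after the switch (MKL should no longer appear in the output):
import numpy
numpy.show_config()  # lists the BLAS/LAPACK libraries numpy was built against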
I tried manually deleting the old libiomp5md.dll file. This file is in your anaconda3/lib directory. Remove the old libiomp5md.dll file and then it should work.
I had the same issue with easyocr, which is also MKL-dependent.
Reinstalling other MKL-dependent modules seemed to work. For example, I did pip uninstall numpy and then pip install numpy, and that made import easyocr work.
I have read the docs on Intel support (http://www.intel.com/software/products/support/), and in this case, including for me, the reason was the numpy library.
I had installed it separately and also as part of the PyTorch install, so it was giving this error.
Basically, you should create a new environment and install all your dependencies there.
Perhaps this solution will help for sklearn as well. Confronted with the same error #15 for tensorflow, none of the solutions to date (5 Feb 2021) fully worked, despite being helpful. However, I did manage to solve it while avoiding: dithering with dylib libraries, installing from source, or setting the environment variable KMP_DUPLICATE_LIB_OK=TRUE with its downsides of being an “unsafe, unsupported, undocumented workaround” and its potential to “cause crashes or silently produce incorrect results”.
The trouble was that conda wasn’t picking up the non-mkl builds of tensorflow (v2.0.0) despite loading the nomkl package. What finally made this solution work was to:
ensure I was loading packages from the defaults channel (i.e. from a channel with a non-MKL version of tensorflow; as of 5 Feb 2021, conda-forge does not have a tensorflow version of 2.0 or greater).
specify the precise build of the tensorflow version I wanted: tensorflow>=2.*=eigen_py37h153756e_0. Without this, conda kept loading the mkl_... version of the package despite the nomkl package also being loaded.
I created a conda environment using the following environment.yml file (as per the conda documentation for managing environments):
name: tf_nomkl
channels:
- conda-forge
- defaults
dependencies:
- nomkl
- python>=3.7
- numpy
- scipy
- pandas
- jupyter
- jupyterlab
- nb_conda
- nb_conda_kernels
- ipykernel
- pathlib
- matplotlib
- seaborn
- tensorflow>=2.*=eigen_py37h153756e_0
You could try to do the same without an environment.yml file, but it’s better to load all the packages you want in an environment in one go if you can.
This solution works on MacOS Big Sur v11.1.
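To use the file, create and activate the environment in the usual way:
conda env create -f environment.yml
conda activate tf_nomkl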
The error:
OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Solving this error can be done by running these two lines
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
instead of
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
I was facing a similar problem in Python. I fixed it by deleting the duplicate libiomp5md.dll file from the Anaconda environment folder C:\Users\aqadir\Anaconda3\envs\your_env_name\Library\bin\libiomp5md.dll.
In my case, the file libiomp5md.dll was already in the base Anaconda bin folder C:\Users\aqadir\Anaconda3\Library\bin.
To solve this and allow the code to continue running, I added the following to the Windows environment variables:
Key: KMP_DUPLICATE_LIB_OK
Value: TRUE
Then I started a new command line and ran the code again; it worked without any issues.