detectron2 - CUDA is not available - python

I am trying out detectron2 and want to train the sample model.
When running the following code I get (<class 'RuntimeError'>, RuntimeError('No CUDA GPUs are available'), <traceback object at 0x7f42b094ebc0>). The code is below:
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()
# import some common libraries
import matplotlib.pyplot as plt
import numpy as np
import cv2
# import some common detectron2 utilities
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog
from detectron2.data.datasets import register_coco_instances
import random
from detectron2.engine import DefaultTrainer
from detectron2.config import get_cfg
import os
# To verify the data loading is correct, let's visualize the annotations of randomly selected samples in the training set:
register_coco_instances("fruits_nuts", {}, "../data/trainval.json", "../data/images")
fruits_nuts_metadata = MetadataCatalog.get("fruits_nuts")
dataset_dicts = DatasetCatalog.get("fruits_nuts")
'''
for d in random.sample(dataset_dicts, 3):
    img = cv2.imread(d["file_name"])
    visualizer = Visualizer(img[:, :, ::-1], metadata=fruits_nuts_metadata, scale=0.5)
    vis = visualizer.draw_dataset_dict(d)
    cv2.imshow('new', vis.get_image()[:, :, ::-1])
    cv2.waitKey(0)
'''
# train model
cfg = get_cfg()
cfg.merge_from_file("../detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("fruits_nuts",)
cfg.DATASETS.TEST = () # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl" # initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.02
cfg.SOLVER.MAX_ITER = 300 # 300 iterations seems good enough, but you can certainly train longer
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128 # faster, and good enough for this toy dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3 # 3 classes (date, fig, hazelnut)
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
I ran the script collect_env.py from torch:
/home/project/.venv/bin/python /home/project/src/collect_env.py
Collecting environment information...
PyTorch version: 1.10.2+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.13.0-27-generic-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.1
[pip3] torch==1.10.2
[pip3] torchvision==0.11.3
[conda] Could not collect
Process finished with exit code 0
I have an RTX 3080 graphics card in the system, but it doesn't seem to be detected.
Any suggestions why?
Is there a way to run the training without CUDA?
I appreciate your replies!

I'm not sure if this will work for you, but let's look at it from a Windows user's perspective.
I'm using Detectron2 on Windows 10 with an RTX 3060 Laptop GPU, CUDA enabled.
The first thing you should check is your CUDA installation. You can check it by using the command:
nvcc -V
It should show a message like this:
C:\Users\User>nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:41:42_Pacific_Daylight_Time_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
To check whether your PyTorch is installed with CUDA enabled, use this (reference from the PyTorch website):
import torch
torch.cuda.is_available()
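A slightly fuller check can help narrow it down; this is just a sketch using standard torch calls (note that torch.version.cuda reports the CUDA version the wheel was built against, not the driver installed on your machine):
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version PyTorch was built with:", torch.version.cuda)
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device name:", torch.cuda.get_device_name(0))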
As shown in the system info you shared in this question, you haven't installed CUDA on your system, and your system doesn't detect any GPU (driver) either.
As far as I know, installing PyTorch with CUDA is the recommended way to run Detectron2 on an (NVIDIA) GPU.
(you can check on Pytorch website and Detectron2 GitHub repo for more details).
Or, you can use this option:
Add this line of code to your Python program (as referenced in issue #300):
cfg.MODEL.DEVICE = "cpu"
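For example, a minimal sketch (reusing the cfg object from your script, which is an assumption on my part) that only falls back to the CPU when no GPU is visible:
import torch
# use the GPU if PyTorch can see one, otherwise fall back to the CPU
cfg.MODEL.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"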
I hope it helps. Cheers.

Related

AttributeError: module transformers has no attribute TFGPTNeoForCausalLM

I cloned this repository/documentation https://huggingface.co/EleutherAI/gpt-neo-125M
I get the error below whether I run it on Google Colab or locally. I also installed transformers using
pip install git+https://github.com/huggingface/transformers
and made sure the configuration file is named config.json.
5 tokenizer = AutoTokenizer.from_pretrained("gpt-neo-125M/",from_tf=True)
----> 6 model = AutoModelForCausalLM.from_pretrained("gpt-neo-125M",from_tf=True)
7
8
3 frames
/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in __getattr__(self, name)
AttributeError: module transformers has no attribute TFGPTNeoForCausalLM
Full code:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M",from_tf=True)
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M",from_tf=True)
transformers-cli env results:
transformers version: 4.10.0.dev0
Platform: Linux-4.4.0-19041-Microsoft-x86_64-with-glibc2.29
Python version: 3.8.5
PyTorch version (GPU?): 1.9.0+cpu (False)
Tensorflow version (GPU?): 2.5.0 (False)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:
Both Colab and my local machine have TensorFlow 2.5.0.
Try it without the from_tf=True flag, like below:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
from_tf expects the pretrained_model_name_or_path (i.e. the first parameter) to be a path to load saved Tensorflow checkpoints from.
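For illustration, a sketch of both usages; the local directory path below is hypothetical and assumes you cloned the repo (including its TensorFlow weights) next to your script:
from transformers import AutoTokenizer, AutoModelForCausalLM
# load PyTorch weights straight from the Hugging Face Hub (no from_tf needed)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
# or load from a local clone that contains TensorFlow checkpoints
# model = AutoModelForCausalLM.from_pretrained("./gpt-neo-125M", from_tf=True)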
My solution was to first edit the source code and remove the line that adds "TF" in front of the class name, since the correct transformers class is GPTNeoForCausalLM, but somewhere in the source code a "TF" was being prepended to it.
Secondly, before cloning the repository it is necessary to run
git lfs install
This link helped me install git lfs properly: https://askubuntu.com/questions/799341/how-to-install-git-lfs-on-ubuntu-16-04

GPU Issue Tensorflow 2.4.1

I would like to train a TensorFlow model using my GPU.
I'm using :
tensorboard 2.4.1
tensorboard-plugin-wit 1.8.0
tensorflow-estimator 2.4.0
tensorflow-gpu 2.4.1
cuda 11.0
cuDNN 8.0.4
gpu RTX 3060 Laptop 6Gb
Nvidia FrameView SDK 1.1.4923.29548709
Nvidia Graphics Drivers 461.72
Nvidia PhysX 9.19.0218
Python 3.8.5
IDE Spyder 4.2.1
OS Windows 10 LTSC-2019 (modified)
What did I do before posting this request for help?
1/ I installed the Nvidia graphics drivers.
2/ I followed this TensorFlow tutorial: https://www.tensorflow.org/install/gpu
So I copied the cuda folder from the cuDNN download archive into C:\tools\
I also added all the required variables to Path.
3/ I tried to train my model (everything works if I use the CPU instead):
with tf.device("/GPU:0"):
    history = model.fit(images, imagesID, epochs=50, validation_split=0.2)
Error:
2021-03-14 15:07:16.145096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-03-14 15:07:16.145335: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] Error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2021-03-14 15:07:16.146411: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-03-14 15:07:16.146595: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] Error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2021-03-14 15:07:16.146845: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_fused_impl.h:697 : Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
So I've found on the Internet this : https://github.com/tensorflow/tensorflow/issues/45779
Thus, I added this code at the top to limit GPU memory usage:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
Error:
Physical devices cannot be modified after being initialized
So I've found this : https://github.com/tensorflow/tensorflow/issues/25138
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
But I still have the same error:
2021-03-14 15:07:16.145096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
...
I'm completely lost because I lack knowledge about TensorFlow GPU errors...
The detail of all the logs is here: https://pastebin.com/Xtsv3mLe
I'm not very good at writing posts; I hope I was clear enough.
Thank you in advance!
You need CUDA 11.0, not 11.1. You can get more information on what you need here: https://www.tensorflow.org/install/gpu. This guide may be a lot more helpful than the install guide alone (although you should read that too): https://alejandro-gc.medium.com/setting-up-your-gpu-for-tensorflow-2-4-2021-d98cac79a686.
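As a quick sanity check, here is a minimal sketch (assuming TF 2.3+, which exposes tf.sysconfig.get_build_info) to see which CUDA/cuDNN versions your installed wheel was built against and whether the GPU is visible at runtime:
import tensorflow as tf
# versions the installed TensorFlow wheel was built against
build = tf.sysconfig.get_build_info()
print("CUDA version expected by TF:", build.get("cuda_version"))
print("cuDNN version expected by TF:", build.get("cudnn_version"))
# GPUs TensorFlow can actually see at runtime
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))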

How to force tensorflow and Keras run on GPU?

I have TensorFlow, an NVIDIA GPU (CUDA)/CPU, Keras, and Python 3.7 on Linux Ubuntu.
I followed all the steps according to this tutorial:
https://www.youtube.com/watch?v=dj-Jntz-74g
When I run the following code:
# What version of Python do you have?
import sys
import tensorflow.keras
import pandas as pd
import sklearn as sk
import tensorflow as tf
print(f"Tensor Flow Version: {tf.__version__}")
print(f"Keras Version: {tensorflow.keras.__version__}")
print()
print(f"Python {sys.version}")
print(f"Pandas {pd.__version__}")
print(f"Scikit-Learn {sk.__version__}")
gpu = len(tf.config.list_physical_devices('GPU'))>0
print("GPU is", "available" if gpu else "NOT AVAILABLE")
I get these results:
Tensor Flow Version: 2.4.1
Keras Version: 2.4.0
Python 3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0]
Pandas 1.2.3
Scikit-Learn 0.24.1
GPU is available
However, I don't know how to run my Keras model on the GPU. When I run my model and watch nvidia-smi -l 1, GPU usage is almost 0% during the run.
from keras import layers
from keras.models import Sequential
from keras.layers import Dense, Conv1D, Flatten
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from keras.callbacks import EarlyStopping
model = Sequential()
model.add(Conv1D(100, 3, activation="relu", input_shape=(32, 1)))
model.add(Flatten())
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="linear"))
model.compile(loss="mse", optimizer="adam", metrics=['mean_squared_error'])
model.summary()
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=70)
history = model.fit(partial_xtrain_CNN, partial_ytrain_CNN, batch_size=100, epochs=1000,
                    verbose=0, validation_data=(xval_CNN, yval_CNN), callbacks=[es])
Do I need to change any part of my code, or add something, to force it to run on the GPU?
For TensorFlow to work on the GPU, there are a few steps to follow, and they are rather involved.
First of all, these frameworks' compatibility with NVIDIA is much better than with other vendors, so you will have fewer problems if the GPU is an NVIDIA one, and it should be in this list.
The second thing is that you need to install all of the requirements, which are:
1- the latest version of your GPU driver,
2- the CUDA installation shown here,
3- then install Anaconda, adding Anaconda to the environment (PATH) while installing.
After completing all the installations, run the following commands in the command prompt.
conda install numba & conda install cudatoolkit
Now, to assess the results, use this code:
from numba import jit, cuda
import numpy as np
# to measure exec time
from timeit import default_timer as timer

# normal function to run on cpu
def func(a):
    for i in range(10000000):
        a[i] += 1

# function optimized to run on gpu
@jit(target="cuda")
def func2(a):
    for i in range(10000000):
        a[i] += 1

if __name__ == "__main__":
    n = 10000000
    a = np.ones(n, dtype=np.float64)
    b = np.ones(n, dtype=np.float32)

    start = timer()
    func(a)
    print("without GPU:", timer() - start)

    start = timer()
    func2(a)
    print("with GPU:", timer() - start)
Parts of this answer are from here, which you can read for more details.
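Separately, if the goal is just to confirm that TensorFlow/Keras itself is placing work on the GPU (rather than benchmarking with Numba), a minimal sketch using standard TensorFlow APIs could look like this:
import tensorflow as tf
# log the device each op is placed on
tf.debugging.set_log_device_placement(True)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
# pin a small computation to the first GPU and check where it ran
with tf.device("/GPU:0"):
    a = tf.random.normal((1000, 1000))
    b = tf.random.normal((1000, 1000))
    c = tf.matmul(a, b)
print(c.device)  # should end with device:GPU:0 when the GPU is used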
I found a solution to my question.
I think the problem was an incompatibility between the NVIDIA driver, cuDNN, and TensorFlow, because I had a new NVIDIA graphics card (RTX 3060) in my laptop with the NVIDIA Ampere architecture, and it was probably not compatible with the other components.
Instead, I referred to these links to download the 21.02 Docker container and then mounted it. In this container, which is provided by NVIDIA, everything is tested and should give good performance.
https://docs.nvidia.com/deeplearning/frameworks/tensorflow-wheel-release-notes/tf-wheel-rel.html
https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/rel_21-02.html#rel_21-02
Also, to install Docker on Linux you can follow the procedure explained here:
https://towardsdatascience.com/deep-learning-with-docker-container-from-ngc-nvidia-gpu-cloud-58d6d302e4b2

How to fix tensorflow error 'GPU sync failed' on NVIDIA Tegra TX2

I am trying to run a prediction with a model built in Keras on my NVIDIA Tegra TX2 using TensorFlow and Python (2.7), and quite randomly TensorFlow throws the following exception:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4504 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-10-04 16:17:50.786531: E tensorflow/stream_executor/cuda/cuda_driver.cc:1032] could not synchronize on CUDA context: CUDA_ERROR_UNKNOWN: unknown error :: *** Begin stack trace ***
stream_executor::gpu::GpuDriver::SynchronizeContext(stream_executor::gpu::GpuContext*)
stream_executor::StreamExecutor::SynchronizeAllActivity()
tensorflow::GPUUtil::SyncAll(tensorflow::Device*)
*** End stack trace ***
...
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
Sometimes, after a few reboots or some waiting, the problem goes away and I can run the prediction again, but 8 out of 10 times this error appears.
I've already tried the following:
Changing the memory fraction and memory growth settings as follows:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.7
session = tf.Session(config=config, ...)
Re-installing the TensorFlow build for the TX2 and JetPack v3.3
I would be really happy for any further suggestions.
Since this problem is intermittent, it may be happening when TensorFlow is not getting the memory it needs. It is a known issue, and you have already tried the basic troubleshooting steps. Since you are still getting the issue, try the steps below as well:
Reinstall libhdf5-dev and python-h5py:
sudo apt-get install libhdf5-dev
sudo apt-get install python-h5py
and then set GPU allow_growth as per https://github.com/keras-team/keras/issues/4161#issuecomment-366031228:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
set_session(sess)
Adding the following lines solved the issue. Note that my TensorFlow version is 2.1 or newer.
import tensorflow as tf
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)

deepwater with tensorflow on RHEL6

I am trying to get Python + deepwater + tensorflow to run on RHEL 6.7. Using conda, I have installed python 3.6.0, tensorflow 1.1.0 and also gcc 4.8.5. TF is working fine.
I have installed the following libraries using pip install: h2o-3.11.0.3904-py2.py3-none-any.whl and h2o-3.11.0-py2.py3-none-any.whl.
I tried to run the following example from the h2o tutorial
import h2o
from h2o.estimators.deepwater import H2ODeepWaterEstimator
h2o.init()
train = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/train.csv.gz")
features = list(range(0,784))
target = 784
train[target] = train[target].asfactor()
model = H2ODeepWaterEstimator(epochs=100, activation="Rectifier", hidden=[200,200], ignore_const_cols=False,
                              mini_batch_size=256, input_dropout_ratio=0.1, hidden_dropout_ratios=[0.5,0.5], stopping_rounds=3,
                              stopping_tolerance=0.05, stopping_metric="misclassification", score_interval=2, score_duty_cycle=0.5,
                              score_training_samples=1000, score_validation_samples=1000, nfolds=5, gpu=False, seed=1234, backend="tensorflow")
model.train(x=features, y=target, training_frame=train)
The following exception is thrown
Exception: Unable to initialize the native Deep Learning backend: Cannot find TensorFlow native library for OS: linux, architecture: x86_64. See https://github.com/tensorflow/tensorflow/tree/master/java/README.md for possible solutions (such as building the library from source).
Is there anything else that I am missing? Would I need to build the bits from scratch for this platform?
