https://github.com/NVIDIA/DeepRecommender
Following the page above, I tried to run NVIDIA's DeepRecommender program. After activating the pytorch environment, I ran the program as shown below, but it failed.
[The command I ran]
$ python run.py --gpu_ids 0 \
--path_to_train_data Netflix/NF_TRAIN \
--path_to_eval_data Netflix/NF_VALID \
--hidden_layers 512,512,1024 \
--non_linearity_type selu \
--batch_size 128 \
--logdir model_save \
--drop_prob 0.8 \
--optimizer momentum \
--lr 0.005 \
--weight_decay 0 \
--aug_step 1 \
--noise_prob 0 \
--num_epochs 12 \
--summary_frequency 1000
[Comment from the guide]
Note that you can run Tensorboard in parallel
$ tensorboard --logdir=model_save
[My Question]
The guide says the above, but I don't know how to run TensorBoard in parallel. Please tell me how. Should I open two terminal windows?
[Environment]
The details of the environment are as follows.
---> Ubuntu 18.04 LTS, Python 3.6, PyTorch 1.2.0, CUDA V10.1.168
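For what it's worth, "in parallel" here most likely just means two processes running at the same time, so a second terminal window would do. As a hedged sketch of the same idea from a single script, the helper below starts a side command in the background, runs the main command to completion, and then stops the side command. The name run_side_by_side is made up for illustration and is not part of DeepRecommender; that tensorboard is on your PATH is also an assumption.

```python
import subprocess


def run_side_by_side(main_cmd, side_cmd):
    """Run side_cmd in the background while main_cmd runs in the foreground.

    Both arguments are argument lists, e.g. ["tensorboard", "--logdir=model_save"].
    Returns the exit code of main_cmd.
    """
    # Popen returns immediately, so the side process keeps running
    # while we block on the main command below.
    side = subprocess.Popen(side_cmd)
    try:
        main_rc = subprocess.call(main_cmd)
    finally:
        # Stop the background process once the main command is done
        # (or has failed), and wait for it to exit.
        side.terminate()
        side.wait()
    return main_rc
```

For example, run_side_by_side(["python", "run.py", "--gpu_ids", "0"], ["tensorboard", "--logdir=model_save"]), with the remaining run.py arguments appended to the first list.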
[The 1st trial]
After I activated the pytorch environment:
$ source activate pytorch
$ python run.py --gpu_ids 0 \ (The long parameters are abbreviated here.)
[Error messages from the 1st trial]
Traceback (most recent call last):
  File "run.py", line 13, in <module>
    from logger import Logger
  File "/home/user/NVIDIA_DeepRecommender/DeepRecommender-mp_branch/logger.py", line 4, in <module>
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
[The 2nd trial]
After I activated the tensorflow-gpu environment:
$ source activate tensorflow-gpu
$ python run.py --gpu_ids 0 \ (The long parameters are abbreviated here.)
[Error messages from the 2nd trial]
Traceback (most recent call last):
  File "run.py", line 2, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
[Expected result]
$ python run.py --gpu_ids 0 \
The program runs without errors and finishes training the model.
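run.py imports torch directly and tensorflow through logger.py, so both packages have to be available in the same environment. Try installing either tensorflow-gpu in your pytorch environment or pytorch in your tensorflow-gpu environment, then use that one environment to run your program.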
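To confirm which packages the currently active environment is missing before launching the training run, something like the following could help. It is a minimal sketch using only the standard library; missing_modules is a hypothetical helper name, and the two module names are the ones from the tracebacks above.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    # find_spec returns None when a top-level module is not installed,
    # without actually importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Shows which interpreter (and therefore which environment) is active.
print("interpreter:", sys.executable)
print("missing:", missing_modules(["torch", "tensorflow"]))
```

Run it once in each environment; the one to use for run.py is the one where the "missing" list comes back empty.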
I'm trying to install PyTorch into my Conda environment using Anaconda Prompt (Miniconda3), but I'm running into an issue similar to the one seen here.
First, I created my Conda environment:
conda create --name token python=3.
Next, I activate my environment and download PyTorch without fail using the following:
activate token
conda install pytorch pytorch-cuda=11.6 -c pytorch -c nvidia
Next, I open the Python console and try to import torch, but get the following error:
python
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\.conda\envs\token\lib\site-packages\torch\__init__.py", line 139, in <module>
    raise err
OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\.conda\envs\token\lib\site-packages\torch\lib\shm.dll" or one of its dependencies.
When running the list command:
(base) C:\Users>conda list -n token pytorch
# packages in environment at C:\Users\.conda\envs\token:
#
# Name                    Version                   Build                       Channel
pytorch                   1.13.0                    py3.10_cuda11.6_cudnn8_0    pytorch
pytorch-cuda              11.6                      h867d48c_0                  pytorch
pytorch-mutex             1.0                       cuda                        pytorch
Is there a way to check which dependencies are missing or know why PyTorch is not being recognised?
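One rough way to narrow this down, offered as a sketch rather than a diagnosis: try to load every DLL in torch\lib with ctypes and list the ones that fail. find_unloadable is a hypothetical helper name, not a PyTorch or conda function. It only reports which files fail to load, not which of their dependencies is missing, and load order may produce false positives.

```python
import ctypes
from pathlib import Path


def find_unloadable(lib_dir, pattern="*.dll"):
    """Try to load each shared library in lib_dir; return {file name: error text} for failures."""
    failures = {}
    for path in sorted(Path(lib_dir).glob(pattern)):
        try:
            # CDLL raises OSError when the file, or something it depends on,
            # cannot be loaded.
            ctypes.CDLL(str(path))
        except OSError as exc:
            failures[path.name] = str(exc)
    return failures
```

With the path from the traceback above, the call would be find_unloadable(r"C:\Users\.conda\envs\token\lib\site-packages\torch\lib"), run inside the token environment.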
I am trying to run a Python script in a virtual environment on a server using oarsub.
First, I run this command on a server named "a":
oarsub -l /host=1/gpu=1,walltime=2:00:00 './training_corpus1.sh'
training_corpus1.sh looks like this at the beginning:
#!/bin/bash
cd /home/ge/ke/anaconda3/envs
source activate env
cd ~/eXP/bert
CUDA_VISIBLE_DEVICES=0,1 python training.py \
--exp_name bert \ ...
At the beginning, the script is supposed to activate my virtual environment and then run the training script, but I always get this error in the OAR.18651.stderr file:
./training_corpus1.sh: line 5: activate: No such file or directory
Traceback (most recent call last):
  File "training.py", line 18, in <module>
    from xlm.slurm import init_signal_handler, init_distributed_mode
  File "/home/ge/ke/eXP/bert/xlm/slurm.py", line 11, in <module>
    import torch
ImportError: No module named torch
Torch is installed in my virtual environment, so it seems that the environment was not activated. When using "conda" in place of "source", I get:
./training_corpus1.sh: line 5: conda: command not found
For a virtual environment under bash, the activation command is
source env/bin/activate
where env is the directory of the virtual environment to activate.
PS. Let me advise you to start any script with set -e so that it fails fast on any error:
#!/usr/bin/env bash
set -e
…
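An alternative that avoids activation altogether is to call the environment's own interpreter by its full path. The sketch below assumes a POSIX layout, where the interpreter lives at bin/python inside the environment directory; env_python and run_in_env are hypothetical helper names used only for illustration.

```python
import subprocess
from pathlib import Path


def env_python(env_dir):
    """Return the interpreter path inside an environment directory (POSIX layout assumed)."""
    return Path(env_dir) / "bin" / "python"


def run_in_env(env_dir, script, *args):
    """Run script with the environment's interpreter and return its exit code."""
    interpreter = env_python(env_dir)
    if not interpreter.exists():
        # Fail early with a clear message instead of an ImportError later on.
        raise FileNotFoundError(f"no interpreter at {interpreter}")
    return subprocess.call([str(interpreter), script, *args])
```

With the paths from the question, env_python("/home/ge/ke/anaconda3/envs/env") would point at /home/ge/ke/anaconda3/envs/env/bin/python, assuming the environment is really named env.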
I was installing TF Serving by compiling it from source.
The commands were:
$ git clone -b r1.10 https://github.com/tensorflow/serving.git
$ cd serving
$ bazel test tensorflow_serving/...
and my Bazel version is
$ bazel --version
bazel 2.1.0
The error information is as follows:
ERROR: An error occurred during the fetch of repository 'local_config_python':
Traceback (most recent call last):
  File "/private/var/tmp/_bazel_limingliang/1c00e8fe288c428416b8275600af1770/external/org_tensorflow/third_party/py/python_configure.bzl", line 345
    _create_local_python_repository(<1 more arguments>)
  File "/private/var/tmp/_bazel_limingliang/1c00e8fe288c428416b8275600af1770/external/org_tensorflow/third_party/py/python_configure.bzl", line 292, in _create_local_python_repository
    _check_python_bin(<2 more arguments>)
  File "/private/var/tmp/_bazel_limingliang/1c00e8fe288c428416b8275600af1770/external/org_tensorflow/third_party/py/python_configure.bzl", line 234, in _check_python_bin
    _fail(<1 more arguments>)
  File "/private/var/tmp/_bazel_limingliang/1c00e8fe288c428416b8275600af1770/external/org_tensorflow/third_party/py/python_configure.bzl", line 27, in _fail
    fail(<1 more arguments>)
Python Configuration Error: --define PYTHON_BIN_PATH='/usr/bin/python3' is not executable. Is it the python binary?
I found that there is no python3 binary in the "/usr/bin/" directory.
$ cd /usr/bin/
$ ls | grep python
python
python-config
python2.7
python2.7-config
pythonw
pythonw2.7
So I changed the value of PYTHON_BIN_PATH to "/usr/bin/python":
$ export PYTHON_BIN_PATH=/usr/bin/python
$ echo $PYTHON_BIN_PATH
/usr/bin/python
However, it still doesn't work.
The error message shows the Python path defined as '/usr/bin/python3', so setting it to '/usr/bin/python' in your shell may not matter. Have you tried making /usr/bin/python3 a symbolic link to /usr/bin/python?
For example, ln -s /usr/bin/python /usr/bin/python3 (ln -s takes the existing target first and the name of the link to create second).
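Before and after creating the link, it may help to check what the build will actually find at each path. This is a small sketch using only the standard library; is_executable_file is a hypothetical helper name, and the two paths are the ones mentioned in the question.

```python
import os
import shutil


def is_executable_file(path):
    """True if path exists, is a regular file (symlinks are followed), and is executable."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


# Paths taken from the question; adjust them for your machine.
for candidate in ("/usr/bin/python3", "/usr/bin/python"):
    print(candidate, "executable:", is_executable_file(candidate))

# shutil.which reports where python3 is found on PATH, or None if it is not.
print("python3 on PATH:", shutil.which("python3"))
```

If python3 turns up somewhere else on PATH, pointing PYTHON_BIN_PATH at that location may be an alternative to creating a link in /usr/bin.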
I am trying to optimize YoloV3 using TensorRT.
I came across a post called Have you Optimized your Deep Learning Model Before Deployment?
Docker is used in the post.
I used Enabling GPUs in the Container Runtime Ecosystem to install nvidia-docker2.
I pulled the latest version of the Docker image from aminehy/tensorrt-opencv-python3 with docker pull aminehy/tensorrt-opencv-python3:version-1.3.
Here are the images:
$ sudo docker images
REPOSITORY                        TAG                             IMAGE ID       CREATED        SIZE
nvcr.io/nvidia/cuda               10.1-cudnn7-devel-ubuntu18.04   b4879c167fc1   2 weeks ago    3.67GB
aminehy/tensorrt-opencv-python3   version-1.3                     0302e477816d   4 months ago   5.36GB
aminehy/tensorrt-opencv-python3   latest                          604502819d12   4 months ago   4.94GB
aminehy/tensorrt-opencv-python3   version-1.1                     d693210c500c   4 months ago   4.94GB
I ran
$ sudo docker run -it --rm -v $(pwd):/workspace --runtime=nvidia -w /workspace -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=unix$DISPLAY aminehy/tensorrt-opencv-python3:version-1.3
=====================
== NVIDIA TensorRT ==
=====================
NVIDIA Release 19.05 (build 6392482)
NVIDIA TensorRT 5.1.5 (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.
Container image (c) 2019, NVIDIA CORPORATION. All rights reserved.
https://developer.nvidia.com/tensorrt
To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh
root@a38b20eeb740:/workspace# cd /opt/tensorrt/python/
root@a38b20eeb740:/opt/tensorrt/python# chmod +x python_setup.sh
root@a38b20eeb740:/opt/tensorrt/python# ./python_setup.sh
Requirement already satisfied: Pillow in /usr/local/lib/python3.5/dist-packages (from -r /opt/tensorrt/samples/sampleSSD/requirements.txt (line 1)) (6.0.0)
WARNING: You are using pip version 19.2.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Ignoring torch: markers 'python_version == "3.7"' don't match your environment
......
......
......
Setting up graphsurgeon-tf (5.1.5-1+cuda10.1) ...
Setting up uff-converter-tf (5.1.5-1+cuda10.1) ...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/uff/__init__.py", line 1, in <module>
    from uff import converters, model # noqa
  File "/usr/lib/python2.7/dist-packages/uff/model/__init__.py", line 1, in <module>
    from . import uff_pb2 as uff_pb # noqa
  File "/usr/lib/python2.7/dist-packages/uff/model/uff_pb2.py", line 6, in <module>
    from google.protobuf.internal import enum_type_wrapper
ImportError: No module named google.protobuf.internal
chmod: cannot access '/bin/convert_to_uff.py': No such file or directory
I can't seem to find any file called convert_to_uff.py inside bin.
What is going on?
Where did I go wrong?
Try reinstalling protobuf to be sure:
pip install protobuf
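Note that the traceback comes from /usr/lib/python2.7, so it is worth checking which interpreter pip installs into. As a hedged sketch, the helper below asks each interpreter in turn whether it can import a module; can_import is a hypothetical name, and the interpreter names in the loop are assumptions about what exists in the container.

```python
import subprocess
import sys


def can_import(interpreter, module):
    """Return True if running `interpreter -c "import module"` succeeds."""
    try:
        result = subprocess.run(
            [interpreter, "-c", "import " + module],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The interpreter itself was not found or could not be started.
        return False
    return result.returncode == 0


# Interpreter names are guesses; sys.executable is the one running this script.
for interp in ("python2.7", "python3", sys.executable):
    print(interp, "can import google.protobuf:", can_import(interp, "google.protobuf"))
```

If only the Python 3 interpreter reports True, protobuf would need to be installed for Python 2.7 as well, since that is the interpreter the failing import runs under.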
I am working on a Python 2.7-based package called starkit. I am installing it using the following commands:
curl -O https://raw.githubusercontent.com/starkit/starkit/master/starkit_env27.yml
# create env
conda env create --file starkit_env27.yml -n starkit
# activate
source activate starkit
# get starkit
git clone https://github.com/starkit/starkit
cd starkit
# install
python setup.py install
When I run the python setup command I get the following error:
Traceback (most recent call last):
  File "setup.py", line 65, in <module>
    get_debug_option(PACKAGENAME))
  File "/Users/97amarnathk/Documents/starkit/astropy_helpers/astropy_helpers/setup_helpers.py", line 125, in get_debug_option
    if any(cmd in dist.commands for cmd in ['build', 'build_ext']):
  File "/Users/97amarnathk/Documents/starkit/astropy_helpers/astropy_helpers/setup_helpers.py", line 125, in <genexpr>
    if any(cmd in dist.commands for cmd in ['build', 'build_ext']):
AttributeError: Distribution instance has no attribute 'commands'
But when I run python setup.py install again, it works perfectly. I cannot figure out why the package fails to install on the first attempt but succeeds on the second.
This happens irrespective of the underlying computer. And when I clone this repository somewhere else, the same error occurs on the first try, but not on the second. Why?