How to run Tensorboard in parallel - python

https://github.com/NVIDIA/DeepRecommender
Following the page above, I tried to run NVIDIA's DeepRecommender program. After activating the pytorch environment, I ran the program as shown below, but it failed.
[The command I ran]
$ python run.py --gpu_ids 0 \
--path_to_train_data Netflix/NF_TRAIN \
--path_to_eval_data Netflix/NF_VALID \
--hidden_layers 512,512,1024 \
--non_linearity_type selu \
--batch_size 128 \
--logdir model_save \
--drop_prob 0.8 \
--optimizer momentum \
--lr 0.005 \
--weight_decay 0 \
--aug_step 1 \
--noise_prob 0 \
--num_epochs 12 \
--summary_frequency 1000
[The note in the guide]
Note that you can run Tensorboard in parallel
$ tensorboard --logdir=model_save
[My Question]
The guide says the above, but I don't know how to run TensorBoard in parallel. Please tell me how. Should I open two terminal windows?
[Environment]
The details of the environment are as follows:
Ubuntu 18.04 LTS, Python 3.6, PyTorch 1.2.0, CUDA V10.1.168
[The 1st trial]
After activating the pytorch environment:
$ source activate pytorch
$ python run.py --gpu_ids 0 \ (The long parameters are abbreviated here.)
[The Error messages of the 1st trial]
Traceback (most recent call last):
File "run.py", line 13, in <module>
from logger import Logger
File "/home/user/NVIDIA_DeepRecommender/DeepRecommender-mp_branch/logger.py", line 4, in <module>
import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
[The 2nd trial]
After activating the tensorflow-gpu environment:
$ source activate tensorflow-gpu
$ python run.py --gpu_ids 0 \ (The long parameters are abbreviated here.)
[The Error messages of the 2nd trial.]
Traceback (most recent call last):
File "run.py", line 2, in <module>
import torch
ModuleNotFoundError: No module named 'torch'
[Expected result]
$ python run.py --gpu_ids 0 \
The program runs without errors and finishes training the model.

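Your two tracebacks show that run.py imports torch and logger.py imports tensorflow, so both packages must be installed in the same environment. Either install tensorflow-gpu in your pytorch environment or install pytorch in your tensorflow-gpu environment, then use that environment to run your program.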
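To confirm that an environment has everything the program needs before starting a long training run, a small check along these lines may help. This is only a sketch: missing_modules is a hypothetical helper name, and the list of required modules is taken from the two tracebacks in the question, not from the project's documentation.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: run.py imports torch and logger.py imports tensorflow,
    # as the tracebacks in the question suggest.
    required = ["torch", "tensorflow"]
    print("Interpreter:", sys.executable)
    missing = missing_modules(required)
    if missing:
        print("Missing in this environment:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the environment activated. If it reports a missing module, install that module into the same environment and run it again.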

Related

Module could not be found error loading dependencies PyTorch Python

I'm trying to install PyTorch into my Conda environment using Anaconda Prompt (Miniconda3), but I am running into an issue similar to the one seen here.
First, I created my Conda environment:
conda create --name token python=3.
Next, I activate my environment and install PyTorch without errors using the following:
activate token
conda install pytorch pytorch-cuda=11.6 -c pytorch -c nvidia
Next, I open the Python console and try to import torch, but I get the following error:
python
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\.conda\envs\token\lib\site-packages\torch\__init__.py", line 139, in <module>
raise err
OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\.conda\envs\token\lib\site-packages\torch\lib\shm.dll" or one of its dependencies.
When running the list command:
(base) C:\Users>conda list -n token pytorch
# packages in environment at C:\Users\.conda\envs\token:
#
# Name          Version   Build                       Channel
pytorch         1.13.0    py3.10_cuda11.6_cudnn8_0    pytorch
pytorch-cuda    11.6      h867d48c_0                  pytorch
pytorch-mutex   1.0       cuda                        pytorch
Is there a way to check which dependencies are missing or know why PyTorch is not being recognised?

How to execute a script in a virtual environment from a bash script on server?

I am trying to run a Python script in a virtual environment on a server using oarsub.
First, I run this command on a server named "a":
oarsub -l /host=1/gpu=1,walltime=2:00:00 './training_corpus1.sh'
The beginning of training_corpus1.sh looks like this:
#!/bin/bash
cd /home/ge/ke/anaconda3/envs
source activate env
cd ~/eXP/bert
CUDA_VISIBLE_DEVICES=0,1 python training.py \
--exp_name bert \ ...
The script is supposed to activate my virtual environment first and then run the training script, but I always get this error in the OAR.18651.stderr file:
./training_corpus1.sh: line 5: activate: No such file or directory
Traceback (most recent call last):
File "training.py", line 18, in <module>
from xlm.slurm import init_signal_handler, init_distributed_mode
File "/home/ge/ke/eXP/bert/xlm/slurm.py", line 11, in <module>
import torch
ImportError: No module named torch
Torch is installed in my virtual environment, so it seems the environment was not activated. When using "conda" in place of "source", I get:
./training_corpus1.sh: line 5: conda: command not found
For a virtual environment and bash the activation command is
source env/bin/activate
where env is the directory of the virtual environment to activate.
PS. Let me advise you to start any script with set -e so that it fails fast on any error:
#!/usr/bin/env bash
set -e
…
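Once the activation line works, a quick check can confirm which interpreter the job actually uses. The following is a minimal sketch, not part of the original script: describe_environment is a hypothetical helper, and it assumes that an activated conda environment sets the CONDA_DEFAULT_ENV variable, while a venv-style environment changes sys.prefix.

```python
import os
import sys


def describe_environment():
    """Return a dict describing which Python environment is active."""
    return {
        # Full path of the interpreter running this code.
        "executable": sys.executable,
        # True inside a venv-style environment (prefix differs from the base install).
        "in_venv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        # Name of the activated conda environment, or None if none is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the printed executable is the system Python rather than the one inside the environment directory, the activation step in the bash script did not take effect.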

Errors occur when compiling tensorflow serving by bazel: Python Configuration Error: --define PYTHON_BIN_PATH='/usr/bin/python3' is not executable

I was installing TensorFlow Serving by compiling the source code.
The command line was:
$ git clone -b r1.10 https://github.com/tensorflow/serving.git
$ cd serving
$ bazel test tensorflow_serving/...
and my bazel version is
$ bazel --version
bazel 2.1.0
The error information is as follows:
ERROR: An error occurred during the fetch of repository 'local_config_python':
Traceback (most recent call last):
File "/private/var/tmp/_bazel_limingliang/1c00e8fe288c428416b8275600af1770/external/org_tensorflow/third_party/py/python_configure.bzl", line 345
_create_local_python_repository(<1 more arguments>)
File "/private/var/tmp/_bazel_limingliang/1c00e8fe288c428416b8275600af1770/external/org_tensorflow/third_party/py/python_configure.bzl", line 292, in _create_local_python_repository
_check_python_bin(<2 more arguments>)
File "/private/var/tmp/_bazel_limingliang/1c00e8fe288c428416b8275600af1770/external/org_tensorflow/third_party/py/python_configure.bzl", line 234, in _check_python_bin
_fail(<1 more arguments>)
File "/private/var/tmp/_bazel_limingliang/1c00e8fe288c428416b8275600af1770/external/org_tensorflow/third_party/py/python_configure.bzl", line 27, in _fail
fail(<1 more arguments>)
Python Configuration Error: --define PYTHON_BIN_PATH='/usr/bin/python3' is not executable. Is it the python binary?
I found that there is no python3 binary in the "/usr/bin/" directory.
$ cd /usr/bin/
$ ls | grep python
python
python-config
python2.7
python2.7-config
pythonw
pythonw2.7
So I changed the value of PYTHON_BIN_PATH to "/usr/bin/python"
$ export PYTHON_BIN_PATH=/usr/bin/python
$ echo $PYTHON_BIN_PATH
/usr/bin/python
However, it still doesn't work.
The error message reports the Python path as '/usr/bin/python3', so pointing PYTHON_BIN_PATH at /usr/bin/python may not matter. Have you tried creating /usr/bin/python3 as a link to /usr/bin/python?
For example, ln -s /usr/bin/python /usr/bin/python3
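Before and after changing anything, it may help to check which candidate paths exist and are executable. This is a rough sketch in Python: is_executable_file is a hypothetical helper, and it is not necessarily the same check that bazel's python_configure.bzl performs.

```python
import os
import sys


def is_executable_file(path):
    """Return True if path is an existing file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # The first two paths come from the question; sys.executable is the
    # interpreter running this script, included for comparison.
    candidates = ["/usr/bin/python3", "/usr/bin/python", sys.executable]
    for path in candidates:
        status = "executable" if is_executable_file(path) else "missing or not executable"
        print(f"{path}: {status}")
```

If /usr/bin/python3 is still reported as missing after creating the link, the link was not created where the build expects it.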

'/bin/convert_to_uff.py': No such file or directory

I am trying to optimize YOLOv3 using TensorRT.
I came across a post called "Have you Optimized your Deep Learning Model Before Deployment?"
Docker is used in the post.
I used "Enabling GPUs in the Container Runtime Ecosystem" to install nvidia-docker2.
I pulled the latest version of the Docker image from aminehy/tensorrt-opencv-python3 with docker pull aminehy/tensorrt-opencv-python3:version-1.3
Here are the images
$ sudo docker images
REPOSITORY                        TAG                             IMAGE ID       CREATED        SIZE
nvcr.io/nvidia/cuda               10.1-cudnn7-devel-ubuntu18.04   b4879c167fc1   2 weeks ago    3.67GB
aminehy/tensorrt-opencv-python3   version-1.3                     0302e477816d   4 months ago   5.36GB
aminehy/tensorrt-opencv-python3   latest                          604502819d12   4 months ago   4.94GB
aminehy/tensorrt-opencv-python3   version-1.1                     d693210c500c   4 months ago   4.94GB
I ran
$ sudo docker run -it --rm -v $(pwd):/workspace --runtime=nvidia -w /workspace -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=unix$DISPLAY aminehy/tensorrt-opencv-python3:version-1.3
=====================
== NVIDIA TensorRT ==
=====================
NVIDIA Release 19.05 (build 6392482)
NVIDIA TensorRT 5.1.5 (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.
Container image (c) 2019, NVIDIA CORPORATION. All rights reserved.
https://developer.nvidia.com/tensorrt
To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh
root@a38b20eeb740:/workspace# cd /opt/tensorrt/python/
root@a38b20eeb740:/opt/tensorrt/python# chmod +x python_setup.sh
root@a38b20eeb740:/opt/tensorrt/python# ./python_setup.sh
Requirement already satisfied: Pillow in /usr/local/lib/python3.5/dist-packages (from -r /opt/tensorrt/samples/sampleSSD/requirements.txt (line 1)) (6.0.0)
WARNING: You are using pip version 19.2.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Ignoring torch: markers 'python_version == "3.7"' don't match your environment
......
......
......
Setting up graphsurgeon-tf (5.1.5-1+cuda10.1) ...
Setting up uff-converter-tf (5.1.5-1+cuda10.1) ...
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/uff/__init__.py", line 1, in <module>
from uff import converters, model # noqa
File "/usr/lib/python2.7/dist-packages/uff/model/__init__.py", line 1, in <module>
from . import uff_pb2 as uff_pb # noqa
File "/usr/lib/python2.7/dist-packages/uff/model/uff_pb2.py", line 6, in <module>
from google.protobuf.internal import enum_type_wrapper
ImportError: No module named google.protobuf.internal
chmod: cannot access '/bin/convert_to_uff.py': No such file or directory
I can't seem to find any file called convert_to_uff.py inside /bin.
What is going on?
Where did I go wrong?
Try reinstalling protobuf to be sure:
pip install protobuf

Python package not installing on the first attempt but installing on the second attempt

Okay. So I am working on a python 2.7 based package called starkit. I am installing it using the following commands:
curl -O https://raw.githubusercontent.com/starkit/starkit/master/starkit_env27.yml
# create env
conda env create --file starkit_env27.yml -n starkit
# activate
source activate starkit
# get starkit
git clone https://github.com/starkit/starkit
cd starkit
# install
python setup.py install
When I run the python setup command I get the following error:
Traceback (most recent call last):
File "setup.py", line 65, in <module>
get_debug_option(PACKAGENAME))
File "/Users/97amarnathk/Documents/starkit/astropy_helpers/astropy_helpers/setup_helpers.py", line 125, in get_debug_option
if any(cmd in dist.commands for cmd in ['build', 'build_ext']):
File "/Users/97amarnathk/Documents/starkit/astropy_helpers/astropy_helpers/setup_helpers.py", line 125, in <genexpr>
if any(cmd in dist.commands for cmd in ['build', 'build_ext']):
AttributeError: Distribution instance has no attribute 'commands'
But when I run python setup.py install again, it works perfectly. I cannot work out why the package fails to install on the first attempt but installs on the second.
This happens irrespective of the underlying computer. And when I clone this repository somewhere else, the same error occurs on the first try, but not on the second. Why?
