I would like to display "hello world" via MPI on different Google Cloud compute instances with the help of the following code:
from mpi4py import MPI
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
print("Hello, World! I am process/rank {} of {} on {}.\n".format(rank, size, name))
.
The problem is that even though I can ssh-connect across all of these instances without problems, I get a permission denied error message when I try to run my script. I use the following command to invoke my script:
mpirun --host localhost,instance_1,instance_2 python hello_world.py
.
And get the following error message:
Permission denied (publickey).
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
.
Additional information:
I installed Open MPI on all of my nodes
I had Google set up all of my ssh keys automatically by using gcloud to log into each instance from each instance
instance-type: n1-standard-1
instance-OS: Linux Debian (default)
.
Thank you for your help :-)
.
New Information:
(thanks @Zulan for pointing out that I should edit my previous post instead of creating a new answer for new information)
So, I tried to do the same with MPICH instead of Open MPI. However, I ran into a similar error message.
Command:
mpirun --host localhost,instance_1,instance_2 python hello_world.py
.
Error message:
Host key verification failed.
.
I can ssh-connect between my two instances without problems, and through the gcloud commands the ssh-keys should automatically be set up properly.
So, does somebody have an idea what the problem could be? I also checked the PATH, the firewall rules, and my ability to write startup files into the temp folder. Can someone please try to recreate this problem? Also, should I raise this question with Google? (I've never done such a thing before, I'm quite unsure :S)
Thanks for helping :)
So I finally found a solution. Wow, this problem was driving me nuts.
It turned out that I needed to generate the ssh keys manually for the script to work. I have no idea why, because the Google services already set up the keys via gcloud compute ssh, but well, it worked :) (a quick connectivity check is sketched after the steps below).
Steps I did:
instance_1 $ ssh-keygen -t rsa
instance_1 $ cd .ssh
instance_1 $ cat id_rsa.pub >> authorized_keys
instance_1 $ gcloud compute copy-files id_rsa.pub
instance_1 $ gcloud compute ssh instance_2
instance_2 $ cd .ssh
instance_2 $ cat id_rsa.pub >> authorized_keys
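As a quick connectivity check (a minimal sketch, not part of the original post): before calling mpirun, it can help to verify that every host in the --host list accepts a non-interactive ssh connection, since both Open MPI and MPICH fail with "Permission denied (publickey)" or "Host key verification failed" if any hop still prompts. The host names below are the ones used in this post.

import subprocess

hosts = ["localhost", "instance_1", "instance_2"]

for host in hosts:
    # BatchMode=yes makes ssh fail immediately instead of prompting for a key or password
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "hostname"],
        capture_output=True,
        text=True,
    )
    print(host, "OK" if result.returncode == 0 else result.stderr.strip())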
.
I will open another topic and ask why I cannot use ssh instance_2, even though gcloud compute ssh instance_2 is working. See: Difference between the commands "gcloud compute ssh" and "ssh"
Related
Trying to provide the minimal amount of information necessary here, so I've left a lot out. Lots of similar questions around, but the most common answer (use chmod +x) isn't working for me.
I have a Python script and a shell script that sit next to each other in a GitHub Enterprise repository.
Next, in Jenkins I check the code in this repository out. The two key steps in my Jenkinsfile are like so:
dir ("$WORK/python")
{
sh "chmod +x test.sh"
sh "python3 foo.py -t '${AUTH}'"
}
Here, $WORK is the location on the Jenkins node that the code checks out to, and python (yes, poorly named) is the folder in the repository that the Python and shell script live in. Now, foo.py calls the shell script in this way:
try:
    cmd = f'test.sh {repo_name}'
    subprocess.Popen(cmd.split())
except Exception as e:
    print(f'Error during repository scan: {e}')
Here, repo_name is just an argument that I define above this snippet and that I'm asking the shell script to do something with. When I run the job in Jenkins, it technically executes without error, but the exception branch above does run:
11:37:24 Error during repository scan - [Errno 13] Permission denied: 'test.sh'
I wanted to be sure that the chmod in the Jenkinsfile was running, so I opened a terminal to the machine that the code checked out to and found that the execute permissions were indeed correctly set:
-rw-r--r-- 1 adm domain users 4106 Feb 6 14:24 foo.py
-rwxr-xr-x 1 adm domain users 619 Feb 6 14:37 test.sh
I've gone around on this most of the day. What the heck is going on?
Doing this fixed it:
import os
from subprocess import Popen

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
cmd = f'{BASE_DIR}/test.sh {repo_name}'
Popen(cmd.split())
Why? In my ignorance associated with picking up someone else's code, I neglected to see, buried further up before the call to the shell script, that we change to a folder one level down:
os.chdir('repo_content')
The shell script does not, of course, live there. So calling it without specifying the path won't work. I'm not sure why this results in [Errno 13] Permission denied: 'test.sh', as that would imply that the shell script was found, but fully qualifying the path as above is doing exactly what I had hoped.
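A small side note, not part of the original answer: passing an argument list instead of cmd.split() keeps repo_name intact even if it ever contains spaces, and combining it with the absolute path keeps the call independent of the earlier os.chdir(). A minimal sketch, with a hypothetical repo_name value:

import os
import subprocess

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
repo_name = "my repo"  # hypothetical value containing a space

# The list form passes repo_name to test.sh as a single argument, and the
# absolute path works regardless of the current working directory.
subprocess.Popen([os.path.join(BASE_DIR, "test.sh"), repo_name])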
I've done a lot of research, and I can't find anything which actually solves my issue.
Since basically no site accepts mitmdump's certificate for HTTPS, I want to ignore those hosts. I can access a specific website with "--ignore-hosts (ip)" like normal, but I need to ignore all HTTPS/SSL hosts.
Is there any way I can do this at all?
Thanks a lot!
There is a script file called tls_passthrough.py on the mitmproxy GitHub which ignores hosts that have previously failed a handshake because the user does not trust the new certificate. However, it does not persist this across sessions.
What this also means is that the first SSL connection from a particular host will always fail. What I suggest you do is write out all the IPs that have failed previously into a text document and ignore all hosts that are in that text file.
tls_passthrough.py
To simply start it, you just add it with the script argument "-s (tls_passthrough.py path)"
Example,
mitmproxy -s tls_passthrough.py
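A rough sketch of the "write failed IPs to a text file and ignore them" idea suggested above, without modifying tls_passthrough.py itself: read the file and turn it into a single --ignore-hosts pattern (the failed_hosts.txt name is made up):

import re
import subprocess

# Read previously failed hosts/IPs, one per line.
with open("failed_hosts.txt") as f:
    hosts = [line.strip() for line in f if line.strip()]

if hosts:
    # Escape each entry and join them into one regex for --ignore-hosts.
    pattern = "|".join(re.escape(host) for host in hosts)
    subprocess.run(["mitmproxy", "--ignore-hosts", pattern])
else:
    subprocess.run(["mitmproxy"])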
You need a simple addon script to ignore all TLS connections.
import mitmproxy


class IgnoreAllTLS:
    def __init__(self) -> None:
        pass

    def tls_clienthello(self, data: mitmproxy.proxy.layers.tls.ClientHelloData):
        '''
        ignore all tls event
        '''
        # LOGC("tls hello from "+str(data.context.server)+" ,ignore_connection="+str(data.ignore_connection))
        data.ignore_connection = True


addons = [
    IgnoreAllTLS()
]
The latest release (7.0.4 at the time of writing) does not support the ignore_connection feature yet, so you need to install the development version from source:
git clone https://github.com/mitmproxy/mitmproxy.git
cd mitmproxy
python3 -m venv venv
Activate the venv before starting the proxy:
source /path/to/mitmproxy/venv/bin/activate
Then start mitmproxy with the addon:
mitmproxy -s ignore_all_tls.py
You can ignore all https/SSL traffic by using a wildcard:
mitmproxy --ignore-hosts '.*'
The entire script runs fine. I will also note that if I copy and paste the cron job into the shell and run it manually it works with no issues.
import os
import subprocess

Base = '/home/user/git/'
GIT_out = Base + ("git_file.txt")
FILE_NAME = Base + 'rules/file.xml'
CD_file = open(Base + "rules/reports/CD.txt", 'r')
os.chdir(Base + 'rules')

# Fetch and pull all remotes, capturing the combined output
gitFetchPull = "git fetch --all ;sleep 3 ; git pull --all"
git1 = subprocess.Popen(gitFetchPull, shell=True, stdout=subprocess.PIPE)
gitOut = git1.stdout.read()
print(gitOut)
When I read the output from cron, it appears that it is not able to authenticate:
Received disconnect from 172.17.3.18: 2: Too many authentication failures for tyoffe4
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
cron job
* * * /usr/bin/python /home/tyoffe4/git/rules/reports/cd_release.py >/home/tyoffe4/git/rules/reports/cd_release.out 2>&1
This is likely an issue of the cron environment not having the environment variables set up by your ssh agent. Therefore when git makes an ssh connection, it can't authenticate, because it can't contact your ssh agent and get keys.
This answer probably has what you're looking for:
ssh-agent and crontab -- is there a good way to get these to meet?
If for some reason it's not ssh-agent related, try printing os.environ at the top of your script to dump the value of all environment variables.
Compare the output from cron and running env in your bash shell. There are likely some differences, and one of them is the source of your error.
If you set up the same environment variables in your shell as you have in cron, the behavior should reproduce.
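For that comparison, a minimal sketch of the dump suggested above: put it at the very top of cd_release.py so the variables show up in the .out file the cron entry already redirects to.

import os

# Print every environment variable the script actually sees, so the cron
# run can be compared line by line with the output of `env` in a shell.
for key in sorted(os.environ):
    print("{0}={1}".format(key, os.environ[key]))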
Is it possible to have a context manager that just keeps the state of the previous run execution? In code:
EDIT: this is not a working solution, just something I expected to work:
with sudo('. myapp'):  # this runs a few things and sets many env variables
    run('echo $ENV1')  # $ENV1 isn't set because the sudo command ran independently
I am trying to run several commands but want to keep state between each command.
I tried using the prefix context manager, but it doesn't work with the shell_env context manager. When running this code:
with shell_env(ENV1="TEST"):
with prefix(". myapp"):
run("echo $ENV2")
I expected my env variable to be set and then my application to run (which should have set ENV2), but the prefix runs before the shell_env?
I don't really understand the question asked here. Could you give a little more detail on what you are trying to accomplish? However, I tried the same thing you did (with sudo('. myapp')), which threw an AttributeError: __exit__ exception.
Finally, I tried to use prefix to source the bash file and execute a sudo command within that context, which works just fine.
from fabric import api as fab  # Fabric 1.x

@fab.task
def trythis():
    with fab.prefix('. testenv'):
        fab.sudo('echo $ENV1')
When executing the task I get the following output.
[host] Executing task 'trythis'
[host] sudo: echo $ENV1
[host] out: sudo password:
[host] out: testing
[host] out:
Done.
Disconnecting from host... done.
with shell_env(ENV1="TEST"):
with prefix(". myapp"):
run("echo $ENV2")
I expected my ENV to be set then run my application which should have set env2 but the prefix runs before the shell_env ?
Given Fabric's documentation, the code you've written will generate:
export ENV1="TEST" && . myapp && echo $ENV2
Given that myapp creates ENV2, your code should work the way you want it to. However, not all shells interpret the dot operator the same way; using source is always a better idea:
with shell_env(ENV1="TEST"):
with prefix("source myapp"):
run("echo $ENV2")
You may consider a bug in myapp though, and/or double-check that all paths and working directories are correctly set.
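For that double check, a small debugging sketch (assuming Fabric 1.x; the task name is made up): dump the environment the remote shell actually sees under both context managers, so a missing ENV1 or ENV2 is easy to spot.

from fabric.api import run, task
from fabric.context_managers import prefix, shell_env

@task
def dump_env():
    with shell_env(ENV1="TEST"):
        with prefix("source myapp"):
            # Prints every variable visible after shell_env and the sourced myapp
            run("env | sort")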
I am getting a Python errno.ESTALE error on Red Hat 5.4, NFSv3 with caching enabled.
I looked it up and found that:
"A filehandle becomes stale whenever the file or directory referenced by the handle is removed by another host, while your client still holds an active reference to the object. A typical example occurs when the current directory of a process, running on your client, is removed on the server (either by a process running on the server or on another client)."
I found that if you chown or listdir, etc., you can flush the cache so the handle won't be stale, but this approach hasn't worked for me.
Anyone have other solutions?
I assume this is NFS and you are running a client on Linux.
You should try to remount your NFS filesystem, like this:
$ mount -o remount [your filesystem]
Also, you could try to flush the caches as you mentioned:
# To free pagecache
$ echo 1 > /proc/sys/vm/drop_caches
# To free dentries and inodes
$ echo 2 > /proc/sys/vm/drop_caches
# To free pagecache, dentries and inodes
$ echo 3 > /proc/sys/vm/drop_caches
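If remounting is not possible, a hedged Python-side sketch (not from the answer above, function name made up): catch the error, check for errno.ESTALE specifically, and retry by reopening the path, since a stale handle is tied to the old inode and re-resolving the name can pick up the replacement object.

import errno

def read_retrying_estale(path, retries=1):
    # Open by path on every attempt; a fresh open re-resolves the name on
    # the server instead of reusing the stale filehandle.
    for attempt in range(retries + 1):
        try:
            with open(path, 'rb') as f:
                return f.read()
        except (IOError, OSError) as e:
            if e.errno != errno.ESTALE or attempt == retries:
                raise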