SSHOperator fails to connect to remote host - python

command="./rmdt_tester -s -S 4G",
command="rmdt_tester -l :2601",
Here are two SSHHooks and two SSHOperators that fail during execution. I see the following errors in the logs:
Failed to connect. Sleeping before retry attempt 1
Failed to connect. Sleeping before retry attempt 2


CyberArk ITATS004E Authentication failure for User in python script

I'm trying to implement a python script that executes local bash scripts or simple commands on remote CyberArk machines. Here is my code:
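if __name__ == '__main__':
    for ip in IP_LIST:
        # Run the local bash script on the remote machine through the PSM proxy
        bash_cmd = f"ssh -o stricthostkeychecking=no {USER}%{LOCAL_USER}%{ip}#{PROXY} 'bash -s' < {BASH_SCRIPT}"
        exit_code = subprocess.call(bash_cmd, shell=True)
        # Copy the generated report back to the local machine
        bash_cmd = f"scp {USER}%{LOCAL_USER}%{ip}#{PROXY}:server_info_PY.txt ."
        exit_code = subprocess.call(bash_cmd, shell=True)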
The main problem is that I get this CyberArk authentication error most of the time, but not always, so it seems random and I don't know why:
PSPSD072E Perform session error occurred. Reason: PSPSD033E Error receiving PSM For SSH server
response (Extra information: [289E [4426b00e-cc44-11ec-bca1-005056b74f99] Failed to impersonate as
user <user>. Error: [ITATS004E Authentication failure for User <user>.
In this case the ssh exit code is 255, but if I check the sshd service logs on the remote machine, there are no errors. I even tried executing the bash commands with the os library, but I got the same result.
I suspected that multiple ssh sessions might be left hanging after running this script many times, but on the remote machine I only find the one I'm using.
Could someone explain what is happening, or does anyone have any ideas?
Note: I don't have any access to the PSM server, whose address is stored in the variable PROXY.
Edit 1: I tried to use the Paramiko library to create the ssh connection, but I get an authentication error raised by Paramiko rather than by CyberArk. I also tried the Fabric library, which is based on Paramiko, and it didn't work either.
If I run the same ssh command manually from my terminal, it works, and I can see that it first connects to the PROXY and then to the IP of the remote machine. From the script, it looks like it can't even connect to the PROXY because of the CyberArk authentication error.
Edit 2: I logged information about all the commands that run when the Python script executes, and I found that the first command launched is /bin/sh -c followed by the ssh string:
/bin/sh -c ssh <user>#<domain>
Could this be the main problem, the prepending of /bin/sh -c? Or is it normal behaviour when using the subprocess library? Is there a way to execute the ssh command without this prefix?
Edit 3: I removed shell=True but got the same authentication error. So if I run the ssh command manually I get no error, but if it is run from the Python script I get the error, and I can't find any difference at process level using ps aux in the two cases.
Since the authentication error is somewhat random, I just added a while loop that resets the known_hosts file and runs the ssh command, for up to n retries.
succeeded_cmd_exec = False
retries = 5
while not succeeded_cmd_exec:
    if retries == 0:
        break
    # Remove the proxy's entry from known_hosts before each attempt
    bash_cmd = f'ssh-keygen -f "{Configs.KNOWN_HOSTS}" -R "{Configs.PROXY}"'
    _, _, exit_code = exec_cmd(bash_cmd)
    if exit_code == 0:
        radius_password = generate_password(Configs.URI, Configs.PASSWORD)
        bash_cmd = f"sshpass -p \"{radius_password}\" ssh -o stricthostkeychecking=no {Configs.USER}%{Configs.LOCAL_USER}%{ip}#{Configs.PROXY} 'ls'"
        stdout, stderr, exit_code = exec_cmd(bash_cmd)
        if exit_code == 0:
            print('Output from SSH command:\n')
            succeeded_cmd_exec = True
        else:
            retries = retries - 1
            print('SSH command failed, retrying ... ')
            print('Sleeping 15 seconds')
            time.sleep(15)
    else:
        print('Reset known hosts files failed, retrying ...')
if retries == 0 and not succeeded_cmd_exec:
    print(f'Failed processing IP {ip}')
The exec_cmd function is defined like this:
import subprocess

def exec_cmd(bash_cmd: str):
    # Run the command through bash and capture both output streams
    process = subprocess.Popen(bash_cmd, shell=True, executable='/bin/bash', stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    return stdout.decode('utf-8'), stderr.decode('utf-8'), process.returncode
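On the Edit 2 question about the /bin/sh -c prefix: that prefix is what subprocess adds when shell=True is used. A minimal sketch of a variant that passes an argument list instead, so no shell is started, is shown here. The name exec_argv and its parameters are hypothetical, and, as Edit 3 already reports, dropping the shell did not make the CyberArk error go away, so this only answers the "how do I run ssh without the prefix" part.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple


def exec_argv(argv: Sequence[str],
              stdin_path: Optional[str] = None,
              timeout: Optional[float] = None) -> Tuple[str, str, int]:
    """Run argv directly (no /bin/sh -c wrapper).

    Returns (stdout, stderr, exit code). If stdin_path is given, that file
    is fed to the command's stdin, replacing the shell's '< file' redirect.
    """
    stdin = open(stdin_path, "rb") if stdin_path else subprocess.DEVNULL
    try:
        completed = subprocess.run(
            argv,
            stdin=stdin,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    finally:
        if stdin_path:
            stdin.close()
    return (completed.stdout.decode("utf-8"),
            completed.stderr.decode("utf-8"),
            completed.returncode)


if __name__ == "__main__":
    # Harmless local demo; a real call would look like
    # exec_argv(["ssh", "-o", "stricthostkeychecking=no", target, "bash -s"],
    #           stdin_path=BASH_SCRIPT)
    # where target is the USER%LOCAL_USER%ip#PROXY string from the question.
    out, err, code = exec_argv([sys.executable, "-c", "print('ok')"])
    print(code, out.strip())
```

Because each argument is passed as its own list element, no quoting of the remote command is needed and nothing is interpreted by a local shell.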

Is this an error that occurred after the spark operation?

I ran the following command:
$ spark-submit --master yarn --deploy-mode cluster
After that, the following log line is printed continuously:
2021-12-23 06:07:50,158 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
2021-12-23 06:07:51,162 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
I checked the result through the web UI on port 8088 (the container logs page), but there was nothing in stdout.
I was disappointed and was about to force the Spark operation to end, when suddenly new log lines were printed, as shown below:
2021-12-23 06:09:06,694 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
2021-12-23 06:09:06,695 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: master
ApplicationMaster RPC port: 40451
queue: default
start time: 1640239668020
final status: UNDEFINED
tracking URL: http://master2:8088/proxy/application_1640239254568_0002/
user: root
2021-12-23 06:09:07,707 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
And after some time, an error log occurred as shown below:
2021-12-23 06:10:25,003 INFO retry.RetryInvocationHandler: End of File Exception between local host is: "master/"; destination host is: "master2":8032; :; For more details see:, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm2. Trying to failover immediately.
2021-12-23 06:10:25,003 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2021-12-23 06:10:25,004 INFO retry.RetryInvocationHandler: Call From master/ to master:8032 failed on connection exception: Connection refused; For more details see:, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm1 after 1 failover attempts. Trying to failover after sleeping for 18340ms.
2021-12-23 06:10:43,347 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
As I understand it, Spark releases its ResourceManager allocation once the work is done, so I assume it is normal for the above error log to appear.
Q1. Is the above behaviour normal?
Q2. After this job finishes, where can I check the results? Can I check them in the container logs web UI?
IMPORTANT addition: I re-ran the command and the final status was SUCCEEDED. Why does the spark-submit operation sometimes succeed and sometimes stop in the middle?
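To make the ACCEPTED to RUNNING progression in the client log easier to follow, the repeated "Application report" lines can be collapsed into state changes only. This is a small sketch that assumes the log format shown above; the names REPORT_RE and state_transitions are hypothetical, and the sample lines are copied from the question.

```python
import re

# Matches lines such as:
# 2021-12-23 06:07:50,158 INFO yarn.Client: Application report for application_..._0002 (state: ACCEPTED)
REPORT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO yarn\.Client: "
    r"Application report for (?P<app>application_\d+_\d+) \(state: (?P<state>\w+)\)"
)


def state_transitions(lines):
    """Return [(timestamp, state), ...], one entry per change of state."""
    transitions = []
    last_state = None
    for line in lines:
        match = REPORT_RE.match(line.strip())
        if match is None:
            continue  # not an application report line
        state = match.group("state")
        if state != last_state:
            transitions.append((match.group("ts"), state))
            last_state = state
    return transitions


SAMPLE = [
    "2021-12-23 06:07:50,158 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)",
    "2021-12-23 06:07:51,162 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)",
    "2021-12-23 06:09:06,694 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)",
    "2021-12-23 06:09:07,707 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)",
]

if __name__ == "__main__":
    for ts, state in state_transitions(SAMPLE):
        print(ts, state)
```

With the sample lines, the application sits in ACCEPTED for a little over a minute before it reaches RUNNING.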

Dask - new cluster creation fails, HDFS files owned by "dask" user

I have set up Dask on my MapR cluster's edge node following the directions here:
Per those directions, I'm testing the install by running the following in a JupyterHub spawned ipython notebook:
from dask_gateway import Gateway
gateway = Gateway("http://sa1x-hadoopedg-np1.hchc.local:9010")
cluster = gateway.new_cluster()
However, when it tries to start the new cluster via YARN, I get the following error in the YARN application's log:
Diagnostics: User a059571(user id 1425180742) does not have access to maprfs:///user/a059571/.skein/application_1605411890003_0222/809B8EAF0CC3524F90366F449C11C97E/tmpv8cbv2ag
Even though dask is supposed to be running as the requesting user (in this case a059571), it appears to be creating directories as the user running the dask-gateway-server (in this case the user mapr):
hdfs dfs -ls -d maprfs:///user/a059571/.skein/application_1605411890003_0222
drwx------ - mapr mapr 7 2021-01-19 17:37 maprfs:///user/a059571/.skein/application_1605411890003_0222
I feel like I'm missing something obvious.
Here are my configs, for full disclosure:
c.DaskGateway.backend_class = (
c.DaskGateway.address= ''
c.Proxy.address = ''
c.Proxy.tcp_address = ''
c.YarnClusterConfig.scheduler_cmd = "/opt/anaconda3/bin/dask-scheduler"
c.YarnClusterConfig.worker_cmd = "/opt/anaconda3/bin/dask-worker"
c.YarnClusterConfig.queue = 'root.default'
c.DaskGateway.log_level= 'DEBUG'
Snippet from inside my core_site.xml
And, some interesting lines from the dask-gateway-server logs:
[DaskGateway] - HTTP routes listening at
[DaskGateway] - Scheduler routes listening at gateway://
[Proxy] Unexpected failure fetching routing table, retrying in 0.5s: Get dial tcp connect: connection refused
[DaskGateway] Removed 0 expired clusters from the database
[Proxy] Unexpected failure fetching routing table, retrying in 1.0s: Get dial tcp connect: connection refused
[Proxy] Unexpected failure fetching routing table, retrying in 2.0s: Get dial tcp connect: connection refused
[Proxy] Unexpected failure fetching routing table, retrying in 4.0s: Get dial tcp connect: connection refused
INFO skein.Driver: Driver started, listening on 44262
[DaskGateway] Backend started, clusters will contact api server at
[DaskGateway] Dask-Gateway server started
[DaskGateway] - Private API server listening at
Note: sa1x-hadoopedg-np1.hchc.local ==, an RHEL 7.x server. MapR cluster is 6.x.

Confluent_kafka Producer does not publish messages into topic

I installed Kafka on my Raspberry Pi and tested it with the 'hello-kafka' topic:
~ $ /usr/local/kafka/bin/ --broker-list localhost:9092 --topic hello-kafka
>Test message 1
>Test message 2
>Test message 3
[4]+ Stopped /usr/local/kafka/bin/ --broker-list localhost:9092 --topic hello-kafka
$ /usr/local/kafka/bin/ --bootstrap-server localhost:9092 --topic hello-kafka --from-beginning
Test message 1
Test message 2
Test message 3
^CProcessed a total of 3 messages
Then I checked that the server is reachable from another machine.
Checking zookeeper:
(venv)$ telnet 2181
Connected to
Escape character is '^]'.
Zookeeper version: 3.6.0--b4c89dc7f6083829e18fae6e446907ae0b1f22d7, built on 02/25/2020 14:38 GMT
Latency min/avg/max: 0/0.8736/59
Received: 10146
Sent: 10149
Connections: 2
Outstanding: 0
Zxid: 0x96
Mode: standalone
Node count: 139
Connection closed by foreign host.
And Kafka:
(venv) $ telnet 9092
Connected to
Escape character is '^]'.
Connection closed by foreign host.
Then I wrote a Python script:
# -*- coding: utf-8 -*-
from confluent_kafka import Producer


def callback(err, msg):
    # Delivery report: called once per message with the outcome
    if err is not None:
        print(f'Failed to deliver message: {str(msg)}: {str(err)}')
    else:
        print(f'Message produced: {str(msg)}')


config = {
    'bootstrap.servers': ''
}
producer = Producer(config)
producer.produce('hello-kafka', value=b"Hello from Python", callback=callback)
Here is the script output (nothing is printed):
(venv) $ python
(venv) $ python
(venv) $
And no new messages in Kafka:
$ /usr/local/kafka/bin/ --bootstrap-server localhost:9092 --topic hello-kafka --from-beginning
Test message 1
Test message 2
Test message 3
^CProcessed a total of 3 messages
$ ^C
Can somebody tell me what I am doing wrong?
The correct fix is to update your broker configuration to set the advertised listener correctly. If your client cannot resolve raspberrypi, then change the advertised listener to something that your client can reach, i.e. the IP address.
Changing the /etc/hosts file on your client is a workaround that for a test project with a Raspberry Pi is fine, but as a general best practice should be discouraged (because the client will break as soon as it's moved to another machine which doesn't have the /etc/hosts hack)
I turned on logging and saw the following message:
WARNING:kafka.conn:DNS lookup failed for raspberrypi:9092, exception was [Errno 8] nodename nor servname provided, or not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?
ERROR:kafka.conn:DNS lookup failed for raspberrypi:9092 (AddressFamily.AF_UNSPEC)
Then I added the following line to /etc/hosts on the client machine: raspberrypi
That completely fixed the problem.
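The DNS failure in the log above can be checked from the client before any producer is started. A minimal sketch using only the standard library follows; the function name can_resolve is hypothetical, and the host to test is whatever name the broker advertises (raspberrypi in this case).

```python
import socket


def can_resolve(host, port):
    """Return True if the client can resolve host:port to at least one address."""
    try:
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Same class of failure as the "DNS lookup failed" warning in the log
        return False
    return True


if __name__ == "__main__":
    for name in ("localhost", "raspberrypi"):
        print(name, can_resolve(name, 9092))
```

If this returns False for the advertised name, the client will be unable to follow the broker's metadata even though a direct connection to the bootstrap address succeeds, which matches the symptoms described above.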

How to check if SSH connection was established with AWS instance

I'm trying to connect to an Amazon EC2 instance via SSH using boto. I know that an SSH connection can only be established some time after the instance has been created. So my questions are:
Can I somehow check whether SSH is up on the instance? (If so, how?)
Or how can I check the output of boto.manage.cmdshell.sshclient_from_instance()? For example, if the output prints Could not establish SSH connection, then try again.
This is what I have tried so far, without luck:
if instance.state == 'running':
    retry = True
    while retry:
        print 'Connecting to ssh'
        key_path = os.path.join(os.path.expanduser('~/.ssh'), 'secret_key.pem')
        cmd = boto.manage.cmdshell.sshclient_from_instance(instance,
        print instance.update()
        if cmd:
            retry = False
        print 'Going to sleep'
SSH Connection refused, will retry in 5 seconds
SSH Connection refused, will retry in 5 seconds
SSH Connection refused, will retry in 5 seconds
SSH Connection refused, will retry in 5 seconds
SSH Connection refused, will retry in 5 seconds
Could not establish SSH connection
Everything else works properly: if I launch the same code a little later, I get a connection and am able to use it.
The message "SSH Connection refused, will retry in 5 seconds" is coming from boto:
Initially, 'running' just implies that the instance has started booting. As long as sshd is not up, connections to port 22 are refused. Hence, what you observe is to be expected if sshd does not come up within the first 25 seconds of the 'running' state.
Since it is not predictable exactly when sshd comes up, and assuming you do not want to waste time on a long fixed waiting period, you could implement your own polling code that checks, at intervals of e.g. 1 to 5 seconds, whether port 22 is reachable. Only once it is, invoke boto.manage.cmdshell.sshclient_from_instance().
A simple way to test if a certain TCP port of a certain host is reachable is via the socket module:
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # connect() raises socket.error if the port is closed or unreachable
    s.connect(('hostname', 22))
    print("Port 22 reachable")
except socket.error as e:
    print("Error on connect: %s" % e)
finally:
    s.close()
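The polling idea described above could be sketched as follows. This is only an illustration: the function name wait_for_port and its timeout and interval parameters are hypothetical, not part of boto, and the defaults are arbitrary.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Poll host:port until a TCP connection succeeds or timeout elapses.

    Returns True as soon as the port accepts a connection, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError on refusal or timeout
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # 'hostname' is a placeholder, as in the snippet above
    if wait_for_port("hostname", 22, timeout=10.0, interval=1.0):
        print("Port 22 reachable")
    else:
        print("Gave up waiting for port 22")
```

Once wait_for_port returns True for the instance's address, sshclient_from_instance() can be invoked without relying on its fixed number of internal retries. A reachable port only shows that something is listening; sshd may still need a moment before it accepts logins.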
I do this in two parts: one checks whether the instance is running, and the other checks whether the instance is reachable:
# Get instance status till it is running
status_output=$(aws ec2 describe-instance-status --instance-ids $instance_id)
instance_status=$(jq -n "$status_output" | jq .InstanceStatuses[0] | jq .InstanceState.Name)
echo $instance_status
# ${var:1:-1} strips the surrounding double quotes from jq's output
while [ "${instance_status:1:-1}" != running ]
do
    sleep 5
    status_output=$(aws ec2 describe-instance-status --instance-ids $instance_id)
    instance_status=$(jq -n "$status_output" | jq .InstanceStatuses[0] | jq .InstanceState.Name)
    echo $instance_status
done
# Get instance reachability till it is ready
status_output=$(aws ec2 describe-instance-status --instance-ids $instance_id)
instance_reachability=$(jq -n "$status_output" | jq .InstanceStatuses[0] | jq .InstanceStatus.Status)
echo $instance_reachability
while [ "${instance_reachability:1:-1}" != ok ]
do
    sleep 5
    status_output=$(aws ec2 describe-instance-status --instance-ids $instance_id)
    instance_reachability=$(jq -n "$status_output" | jq .InstanceStatuses[0] | jq .InstanceStatus.Status)
    echo $instance_reachability
done
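For anyone driving this from Python rather than bash, the same two fields can be read from the JSON that aws ec2 describe-instance-status prints. A minimal sketch, assuming the JSON text has already been captured (for example from the CLI); the function names parse_instance_status and is_ready are hypothetical.

```python
import json


def parse_instance_status(status_json):
    """Return (state, reachability) from describe-instance-status output.

    Both values are None when InstanceStatuses is empty, which is what the
    command returns by default for instances that are not running yet.
    """
    statuses = json.loads(status_json).get("InstanceStatuses", [])
    if not statuses:
        return None, None
    first = statuses[0]
    state = first.get("InstanceState", {}).get("Name")
    reachability = first.get("InstanceStatus", {}).get("Status")
    return state, reachability


def is_ready(status_json):
    """True once the instance is running and its status check reports ok."""
    return parse_instance_status(status_json) == ("running", "ok")


if __name__ == "__main__":
    sample = json.dumps({
        "InstanceStatuses": [{
            "InstanceState": {"Name": "running"},
            "InstanceStatus": {"Status": "initializing"},
        }]
    })
    print(parse_instance_status(sample), is_ready(sample))
```

Handling the empty-list case explicitly avoids the null values that .InstanceStatuses[0] produces in the jq pipeline while the instance is still pending.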
