Dask - new cluster creation fails, HDFS files owned by "dask" user

I have set up Dask on my MapR cluster's edge node following the directions here: https://gateway.dask.org/install-hadoop.html
Per those directions, I'm testing the install by running the following in a JupyterHub-spawned IPython notebook:
from dask_gateway import Gateway
gateway = Gateway("http://sa1x-hadoopedg-np1.hchc.local:9010")
cluster = gateway.new_cluster()
However, when it tries to start the new cluster via YARN, I get the following error in the YARN application's log:
Diagnostics: User a059571(user id 1425180742) does not have access to maprfs:///user/a059571/.skein/application_1605411890003_0222/809B8EAF0CC3524F90366F449C11C97E/tmpv8cbv2ag
Even though dask is supposed to be running as the requesting user (in this case a059571), it appears to be creating directories as the user running the dask-gateway-server (in this case the user mapr):
hdfs dfs -ls -d maprfs:///user/a059571/.skein/application_1605411890003_0222
drwx------ - mapr mapr 7 2021-01-19 17:37 maprfs:///user/a059571/.skein/application_1605411890003_0222
I feel like I'm missing something obvious.
Here are my configs, for full disclosure:
/etc/dask-gateway/dask_gateway_config.py
c.DaskGateway.backend_class = (
    "dask_gateway_server.backends.yarn.YarnBackend"
)
c.DaskGateway.address = '12.190.113.133:9010'
c.Proxy.address = '12.190.113.133:9011'
c.Proxy.tcp_address = '12.190.113.133:9012'
c.YarnClusterConfig.scheduler_cmd = "/opt/anaconda3/bin/dask-scheduler"
c.YarnClusterConfig.worker_cmd = "/opt/anaconda3/bin/dask-worker"
c.YarnClusterConfig.queue = 'root.default'
c.DaskGateway.log_level = 'DEBUG'
Snippet from my core-site.xml:
<property>
  <name>hadoop.proxyuser.mapr.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.mapr.groups</name>
  <value>*</value>
</property>
And, some interesting lines from the dask-gateway-server logs:
[DaskGateway] - HTTP routes listening at http://12.190.113.133:9011
[DaskGateway] - Scheduler routes listening at gateway://12.190.113.133:9012
[Proxy] Unexpected failure fetching routing table, retrying in 0.5s: Get http://12.190.113.133:9010/api/v1/routes: dial tcp 12.190.113.133:9010: connect: connection refused
[DaskGateway] Removed 0 expired clusters from the database
[Proxy] Unexpected failure fetching routing table, retrying in 1.0s: Get http://12.190.113.133:9010/api/v1/routes: dial tcp 12.190.113.133:9010: connect: connection refused
[Proxy] Unexpected failure fetching routing table, retrying in 2.0s: Get http://12.190.113.133:9010/api/v1/routes: dial tcp 12.190.113.133:9010: connect: connection refused
[Proxy] Unexpected failure fetching routing table, retrying in 4.0s: Get http://12.190.113.133:9010/api/v1/routes: dial tcp 12.190.113.133:9010: connect: connection refused
INFO skein.Driver: Driver started, listening on 44262
[DaskGateway] Backend started, clusters will contact api server at http://12.190.113.133:9011/api
[DaskGateway] Dask-Gateway server started
[DaskGateway] - Private API server listening at http://12.190.113.133:9010
Note: sa1x-hadoopedg-np1.hchc.local == 12.190.113.133, an RHEL 7.x server. MapR cluster is 6.x.

Related

Cannot connect to DataStax Enterprise cluster from Python app

I am having some difficulty connecting to DataStax Cassandra 6.8 hosted on a CentOS 7.x server.
I am able to connect locally from the CentOS shell, and nodetool status shows the cluster as Up and Normal.
Things I tried in the cassandra.yaml file:
Changed the listen_address parameter from localhost to the IP address of the server. Result: DSE does not start.
Commented out the listen_address line. Result: DSE does not start.
Left the listen_address parameter blank. Result: DSE does not start.
Environment, as mentioned above:
OS - CentOS 7
DSE Version - 6.8
Install method RPM
Python program -
from cassandra.cluster import Cluster

# cluster = Cluster()
cluster = Cluster(['192.168.1.223'])
# To establish a connection and begin executing queries, we need a session
session = cluster.connect()
row = session.execute("select release_version from system.local;").one()
if row:
    print(row[0])
else:
    print("An error occurred.")
Exception thrown from python ->
NoHostAvailable: ('Unable to connect to any servers', {'192.168.1.223:9042': ConnectionRefusedError(10061, "Tried connecting to [('192.168.1.223', 9042)]. Last error: No connection could be made because the target machine actively refused it")})
Both my PC and my server are on the same network and I am able to ping from each other.
Any help is highly appreciated.
Thanks

The same question was asked on https://community.datastax.com/questions/12174/ so I'm re-posting my answer here.
This error indicates that you are connecting to a node which is not listening for CQL connections on IP 192.168.1.223 and CQL port 9042:
No connection could be made because the target machine actively refused it
The 2 most likely causes are:
DSE is not running
DSE isn't listening for client connections on the right IP
You already indicated that you are not able to start DSE. You'll need to review the logs, located in /var/log/cassandra by default, for clues as to why it's not running.
The other possible issue is that you haven't configured native_transport_address (rpc_address in open-source Cassandra). You need to set this to an IP address that is accessible to clients (your app) otherwise, it will default to localhost (127.0.0.1).
In cassandra.yaml, configure the node with:
listen_address: private_ip
native_transport_address: public_ip
If you are just testing it on a local network, set both properties to the server's IP address. Cheers!
[EDIT] I just saw your conversation with @Alex Ott. I'm posting my response here because it won't fit in a comment.
This startup error means that the node couldn't talk to any seed nodes so it won't be able to join the cluster:
ERROR [DSE main thread] 2021-08-25 06:40:11,413 CassandraDaemon.java:932 - \
Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any peers
If you only have 1 node in the cluster, configure the seeds list in cassandra.yaml with the server's own IP address:
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.1.223"
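Before changing more settings, it can help to confirm from the client machine whether anything is accepting TCP connections on the CQL port at all. The following is a minimal sketch using only the Python standard library; the helper name `cql_port_open` is hypothetical, and 9042 is the port shown in the error message above.

```python
import socket


def cql_port_open(host: str, port: int = 9042, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    This only checks that something is listening; it does not speak CQL.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

Calling `cql_port_open("192.168.1.223")` from the PC should return False while DSE is down or bound only to localhost, and True once the node listens on that address.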

Connecting to Teradata using teradatasql module in Python

I am trying to connect to Teradata using the teradatasql module in Python. The code runs fine on localhost, but once deployed on the server as part of the server code, it throws an error.
The code:
import teradatasql

try:
    host, username, password = 'hostname', 'username', '****'
    session = teradatasql.connect(host=host, user=username, password=password, logmech="LDAP")
except Exception as e:
    print(e)
Error I am getting on server:
[Version 16.20.0.60] [Session 0] [Teradata SQL Driver] Failure receiving Config Response message header
 at gosqldriver/teradatasql.(*teradataConnection).makeDriverError TeradataConnection.go:1101
 at gosqldriver/teradatasql.(*teradataConnection).sendAndReceive TeradataConnection.go:1397
 at gosqldriver/teradatasql.newTeradataConnection TeradataConnection.go:180
 at gosqldriver/teradatasql.(*teradataDriver).Open TeradataDriver.go:32
 at database/sql.dsnConnector.Connect sql.go:600
 at database/sql.(*DB).conn sql.go:1103
 at database/sql.(*DB).Conn sql.go:1619
 at main.goCreateConnection goside.go:275
 at main._cgoexpwrap_212fad278f55_goCreateConnection _cgo_gotypes.go:240
 at runtime.call64 asm_amd64.s:574
 at runtime.cgocallbackg1 cgocall.go:316
 at runtime.cgocallbackg cgocall.go:194
 at runtime.cgocallback_gofunc asm_amd64.s:826
 at runtime.goexit asm_amd64.s:2361
Caused by read tcp IP:PORT->IP:PORT: wsarecv: An existing connection was forcibly closed by the remote host

The root cause of this error is outlined here by tomnolan:
The stack trace indicates that a TCP socket connection was made to the database, then the driver transmitted a Config Request message to the database, then the driver timed out waiting for a Config Response message from the database.
In other words, the driver thought that it had established a TCP socket connection, but the TCP socket connection was probably not fully successful, because a failure occurred on the initial message handshake between the driver and the database.
The most likely cause is that some kind of networking problem prevented the driver from properly connecting to the database.
I had this issue today and resolved it by altering my host. I am also on a VPN and found that the actual host name in DNS didn't work, but the ALIAS available did. For example on Windows:
C:\WINDOWS\system32>nslookup MYDB-TEST # <-- works
Server: abcd.domain.com
Address: <OMITTED>
Name: MYDB.domain.com # <-- doesn't work
Address: <OMITTED>
Aliases: mydb-test.domain.com # <-- works
I recognize this may be a specific solution option that may not work for everyone, but the root of the problem is confirmed to be a TCP connection issue from my experience.
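To compare how the server resolves the canonical name and its alias, a quick check with the standard library is one option. This is only a sketch: the helper name `resolve_candidates` is hypothetical, and it reports DNS resolution only, not whether the database accepts the connection.

```python
import socket


def resolve_candidates(names):
    """Map each candidate host name to its resolved IPv4 addresses, or None."""
    results = {}
    for name in names:
        try:
            _canonical, _aliases, addresses = socket.gethostbyname_ex(name)
            results[name] = addresses
        except OSError:
            # Name could not be resolved from this machine.
            results[name] = None
    return results
```

For instance, `resolve_candidates(["MYDB.domain.com", "mydb-test.domain.com"])` run on the server would show whether both names resolve there, and whether they resolve to the same addresses as on the machine where the connection works.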

HTTPS connection closed after SSL handshake with no exception

I'm using a library (sentry.io observer) that should connect to a remote server via https and upload some data. I do not control the server, but I can see that no data is uploaded. I set the urllib logger level to debug and I see two log messages
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): <server_url>:443
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): <server_url>:443
but no exception is thrown. I used wireshark to sniff packets and I see two SSL handshakes are executed, but the FIN packet is sent right after server finishes the handshake. Packets exchange looks like:
< - client sends message
> - server sends message
< TCP handshake [syn, syn ack, ack]
< Client hello
> Server hello, certificate, server key exchange, server hello done
< Client key exchange, change cipher spec, finished
> New session ticket, change cipher spec, finished
< TCP connection termination [fin ack, fin ack, ack]
This packet exchange is done twice, as urllib tries to connect to the remote server twice. The server certificate is valid, but the connection is cancelled by client. I set the library and urllib loggers to debug, but no error messages or anything that could help me narrow the issue down appears.
The issue only appears when requests are done from docker (based on centos 7), but when launching the app on ubuntu host it works fine, connection is established and data is uploaded. What could be the cause of the issue?

P4 python connection broken SSL error

I'm already using P4V client and everything is fine, no connection error.
Error:
I get SSL errors when I try to execute p4 commands from Python.
The error is random: if I rerun the script, it isn't thrown every time.
From the client, the output is :
SSL receive failed.\nread: Operation succeed : WSAECONNRESET
In the server-side logs, I get:
Connection from 90.XX.XX.93:53929 broken. SSL receive failed. read:
Connection reset by peer: Connection reset by peer
After the P4 connection with p4.connect(), I run a p4.run_trust() command and the result seems OK:
Trust already established
This error is thrown when doing a p4 fetch or p4 edit myfile.
Configuration
I'm starting my python script from the same computer running the P4V client. I'm using the same configuration ( user, workspace, url+port > ssl:p4.our-url.domain:1666 ). The SSL error happened with or without the P4V client started.
The SSL certificate was generated during the Perforce Server installation and configuration.
There is no Apache server behind our subdomain p4.our-domain, so I can't test the SSL certificate using an online SSL checker (my network knowledge reaches its limit there).
When I run p4 info there is a "peer address", basically my IP with a randomly generated port (53929). What is this port? Do I need to set a fixed port and redirect it to the computer running the script?
Do you have any idea where this error comes from? Is it a bad server configuration (odd, because every P4V client in the office works)?
Do I need to create and distribute a new certificate to all users of the P4Python script?
Python 3.5.4
PyOpenssl 18.0.0
P4Python 2017.2.1615960
Thanks a lot for any advice.
ANSWER suggested by Sam Stafford
Sam was right, it seems I hit a timeout. I was opening the P4 connection and connecting to the server at script launch, then running processing to generate files before using p4 fetch/add/submit. Here is a workaround that reconnects in case of disconnection from the server:
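from P4 import P4, P4Exception

# self.myp4 = P4() was created on init, files are added
submitted = False
max_tries = 5
while not submitted and max_tries > 0:
    reslist = None  # stays None if run_submit raises
    try:
        reslist = self.myp4.run_submit(ch)
    except P4Exception:
        # Report the errors, then drop and reopen the connection before retrying
        print(str(self.myp4.errors))
        self.myp4.disconnect()
        max_tries -= 1
        self.myp4.connect()
    submitted = reslist is not None and len(reslist) > 0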
That works if you want to keep the connection open. I guess the best way to avoid the timeout is to call the P4.connect() method just before any P4.run_*method*() call and close the connection afterwards, instead of waiting for a timeout to restart the connection.
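The connect-just-before, disconnect-right-after idea can be sketched as a small context manager. This is an illustration, not part of P4Python: the helper name `short_lived` is hypothetical, and it only assumes an object exposing connect() and disconnect(), as the P4 instance above does.

```python
from contextlib import contextmanager


@contextmanager
def short_lived(conn):
    """Open the connection on entry and always close it on exit."""
    conn.connect()
    try:
        yield conn
    finally:
        # Runs even if the body raises, so no idle connection is left behind.
        conn.disconnect()


# Hypothetical usage with the objects from the snippet above:
# with short_lived(self.myp4) as p4:
#     reslist = p4.run_submit(ch)
```

The long file-generation step then runs with no connection open, so there is nothing for the server or the network to time out in the meantime.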

"Connection reset by peer" is a TCP error.
What does "connection reset by peer" mean?
Maybe your script is holding its connection open longer than P4V does, and a transient network failure during that period causes the connection to be reset? The best fix is probably to have the script catch the error, open a new connection, and pick up where it left off.

Rabbitmq connection issue when using a username and password

I am trying to start some background processing through RabbitMQ, but when I send the request, I get the error below in the RabbitMQ log. I think I am providing the right credentials, as my Celery workers are able to connect to the RabbitMQ server using the same username/password combination.
=ERROR REPORT==== 12-Jun-2012::20:50:29 ===
exception on TCP connection from 127.0.0.1:41708
{channel0_error,starting,
{amqp_error,access_refused,
"AMQPLAIN login refused: user 'guest' - invalid credentials",
'connection.start_ok'}}

To resolve connection problems with RabbitMQ, check the following points:
Connectivity from the client machine to the RabbitMQ server machine (if the client and server run on separate machines), including the port.
Credentials (username and password): the user used for the connection must exist in RabbitMQ.
Permissions: the user must be granted permissions (they are scoped to a vhost, so grant them on the right one).

The best way to debug permissions issues in the amqp protocol is to look at the request:
transport://userid:password@hostname:port/virtual_host
from http://docs.celeryproject.org/en/latest/configuration.html#conf-broker-settings
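To see which user and virtual host a given broker URL actually carries, the URL can be split with the standard library. A minimal sketch follows; the helper name `describe_broker_url` and the example URL are hypothetical, and the password itself is deliberately not returned.

```python
from urllib.parse import unquote, urlsplit


def describe_broker_url(url: str) -> dict:
    """Break a transport://userid:password@hostname:port/virtual_host URL into parts."""
    parts = urlsplit(url)
    return {
        "transport": parts.scheme,
        "userid": unquote(parts.username) if parts.username else None,
        "has_password": parts.password is not None,
        "hostname": parts.hostname,
        "port": parts.port,
        # Everything after the first "/" is treated as the virtual host.
        "virtual_host": unquote(parts.path[1:]) if parts.path else "",
    }
```

For a URL such as `amqp://myuser:secret@localhost:5672/myvhost`, this reports `myuser` as the userid and `myvhost` as the virtual host. If the userid comes back as None, the URL carries no user of its own, which is worth comparing against the 'guest' user named in the log above.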
