How to debug a CommClosedError in Dask Gateway deployed in Kubernetes - python

I have deployed dask_gateway 0.8.0 (with dask==2.25.0 and distributed==2.25.0) in a Kubernetes cluster.
When I create a new cluster with:
cluster = gateway.new_cluster(public_address = gateway._public_address)
I get this error:
Task exception was never retrieved
future: <Task finished coro=<connect.<locals>._() done, defined at /home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py:288> exception=CommClosedError()>
Traceback (most recent call last):
File "/home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 297, in _
handshake = await asyncio.wait_for(comm.read(), 1)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-f74f0/x86_64-centos7-gcc8-opt/lib/python3.6/asyncio/tasks.py", line 351, in wait_for
yield from waiter
concurrent.futures._base.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 304, in _
raise CommClosedError() from e
distributed.comm.core.CommClosedError
However, if I check the pods, the cluster has actually been created: I can scale it up, and everything looks fine in the dashboard (I can even see the workers).
Still, I cannot get the client:
> client = cluster.get_client()
Task exception was never retrieved
future: <Task finished coro=<connect.<locals>._() done, defined at /home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py:288> exception=CommClosedError()>
Traceback (most recent call last):
File "/home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 297, in _
handshake = await asyncio.wait_for(comm.read(), 1)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-f74f0/x86_64-centos7-gcc8-opt/lib/python3.6/asyncio/tasks.py", line 351, in wait_for
yield from waiter
concurrent.futures._base.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 304, in _
raise CommClosedError() from e
distributed.comm.core.CommClosedError
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
321 if not comm:
--> 322 _raise(error)
323 except FatalCommClosedError:
~/.local/lib/python3.6/site-packages/distributed/comm/core.py in _raise(error)
274 )
--> 275 raise IOError(msg)
276
OSError: Timed out trying to connect to 'gateway://traefik-dask-gateway:80/jhub.0373ea68815d47fca6a6c489c8f7263a' after 100 s: connect() didn't finish in time
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
<ipython-input-19-affca45186d3> in <module>
----> 1 client = cluster.get_client()
~/.local/lib/python3.6/site-packages/dask_gateway/client.py in get_client(self, set_as_default)
1066 set_as_default=set_as_default,
1067 asynchronous=self.asynchronous,
-> 1068 loop=self.loop,
1069 )
1070 if not self.asynchronous:
~/.local/lib/python3.6/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, connection_limit, **kwargs)
743 ext(self)
744
--> 745 self.start(timeout=timeout)
746 Client._instances.add(self)
747
~/.local/lib/python3.6/site-packages/distributed/client.py in start(self, **kwargs)
948 self._started = asyncio.ensure_future(self._start(**kwargs))
949 else:
--> 950 sync(self.loop, self._start, **kwargs)
951
952 def __await__(self):
~/.local/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
337 if error[0]:
338 typ, exc, tb = error[0]
--> 339 raise exc.with_traceback(tb)
340 else:
341 return result[0]
~/.local/lib/python3.6/site-packages/distributed/utils.py in f()
321 if callback_timeout is not None:
322 future = asyncio.wait_for(future, callback_timeout)
--> 323 result[0] = yield future
324 except Exception as exc:
325 error[0] = sys.exc_info()
/cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/lib/python3.6/site-packages/tornado/gen.py in run(self)
1131
1132 try:
-> 1133 value = future.result()
1134 except Exception:
1135 self.had_exception = True
~/.local/lib/python3.6/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
1045
1046 try:
-> 1047 await self._ensure_connected(timeout=timeout)
1048 except (OSError, ImportError):
1049 await self._close()
~/.local/lib/python3.6/site-packages/distributed/client.py in _ensure_connected(self, timeout)
1103 try:
1104 comm = await connect(
-> 1105 self.scheduler.address, timeout=timeout, **self.connection_args
1106 )
1107 comm.name = "Client->Scheduler"
~/.local/lib/python3.6/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
332 backoff = min(backoff, 1) # wait at most one second
333 else:
--> 334 _raise(error)
335 else:
336 break
~/.local/lib/python3.6/site-packages/distributed/comm/core.py in _raise(error)
273 error,
274 )
--> 275 raise IOError(msg)
276
277 backoff = 0.01
OSError: Timed out trying to connect to 'gateway://traefik-dask-gateway:80/jhub.0373ea68815d47fca6a6c489c8f7263a' after 100 s: Timed out trying to connect to 'gateway://traefik-dask-gateway:80/jhub.0373ea68815d47fca6a6c489c8f7263a' after 100 s: connect() didn't finish in time
How do I debug this? Any pointer would be greatly appreciated.
I already tried increasing all the timeouts, but nothing changed:
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"]="100s"
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP"]="600s"
os.environ["DASK_DISTRIBUTED__COMM__RETRY__DELAY__MIN"]="1s"
os.environ["DASK_DISTRIBUTED__COMM__RETRY__DELAY__MAX"]="60s"
I wrote a tutorial about the steps I took to deploy Dask Gateway; see https://zonca.dev/2020/08/dask-gateway-jupyterhub.html.
I am quite sure this was working fine a few weeks ago, but I cannot identify what changed...

You need to use compatible versions of dask and distributed everywhere.
I believe this error is related to an upgrade in the communications protocol for distributed; see https://github.com/dask/dask-gateway/issues/316#issuecomment-702947730
These are the pinned versions of the dependencies for the Docker images as of Nov 10, 2020 (in conda environment.yml compatible format):
- python=3.7.7
- dask=2.21.0
- distributed=2.21.0
- cloudpickle=1.5.0
- toolz=0.10.0
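A quick way to confirm a mismatch like this (a sketch, not from the original answer) is to compare the versions installed in the notebook environment against the pins above and, once a client can connect at all, let distributed compare versions across client, scheduler, and workers:
import dask
import distributed

# Versions in the notebook environment; these should match the image pins above.
print("client side:", dask.__version__, distributed.__version__)

# Once cluster.get_client() succeeds, distributed can check the whole deployment;
# check=True raises on a client/scheduler/worker mismatch instead of just reporting it.
client = cluster.get_client()
client.get_versions(check=True)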

Related

Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:32778)

I have a PySpark dataframe like this:
x y
656 78
766 87
677 63
. .
. .
. .
It has around 72 million rows. Now I want to plot a histogram of column y for this PySpark dataframe.
I have tried collect() and toPandas(), but the collect method throws an error.
[val.y for val in df.select('y').collect()]
Out:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:32778)
Traceback (most recent call last):
File "/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 535, in collect
sock_info = self._jdf.collectToPython()
File "/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o180.collectToPython
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1067, in start
self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/pyspark/sql/dataframe.py in collect(self)
534 with SCCallSiteSync(self._sc) as css:
--> 535 sock_info = self._jdf.collectToPython()
536 return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
335 "An error occurred while calling {0}{1}{2}".
--> 336 format(target_id, ".", name))
337 else:
Py4JError: An error occurred while calling o180.collectToPython
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in _get_connection(self)
928 try:
--> 929 connection = self.deque.pop()
930 except IndexError:
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
ConnectionRefusedError Traceback (most recent call last)
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in start(self)
1066 try:
-> 1067 self.socket.connect((self.address, self.port))
1068 self.stream = self.socket.makefile("rb")
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Py4JNetworkError Traceback (most recent call last)
/tmp/ipykernel_29360/1990936990.py in <module>
----> 1 non_routine_saving = [val.savingsPercent for val in non_routine.select('savingsPercent').collect()]
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/pyspark/sql/dataframe.py in collect(self)
533 """
534 with SCCallSiteSync(self._sc) as css:
--> 535 sock_info = self._jdf.collectToPython()
536 return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
537
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/pyspark/traceback_utils.py in __exit__(self, type, value, tb)
76 SCCallSiteSync._spark_stack_depth -= 1
77 if SCCallSiteSync._spark_stack_depth == 0:
---> 78 self._context._jsc.setCallSite(None)
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1253 proto.END_COMMAND_PART
1254
-> 1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
1257 answer, self.gateway_client, self.target_id, self.name)
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in send_command(self, command, retry, binary)
981 if `binary` is `True`.
982 """
--> 983 connection = self._get_connection()
984 try:
985 response = connection.send_command(command)
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in _get_connection(self)
929 connection = self.deque.pop()
930 except IndexError:
--> 931 connection = self._create_connection()
932 return connection
933
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in _create_connection(self)
935 connection = GatewayConnection(
936 self.gateway_parameters, self.gateway_property)
--> 937 connection.start()
938 return connection
939
/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in start(self)
1077 "server ({0}:{1})".format(self.address, self.port)
1078 logger.exception(msg)
-> 1079 raise Py4JNetworkError(msg, e)
1080
1081 def _authenticate_connection(self):
Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:32778)
Spark Config:
from __future__ import print_function
from platform import python_version
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import json
import csv
import pickle as pkl
import seaborn as sns
import plotly.express as px
from datetime import date, datetime, timedelta
import findspark
findspark.init()
from pyspark import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.types as T
from pyspark.sql import functions as f
from pyspark.sql.functions import col, countDistinct
from pyspark.sql.window import Window
pd.set_option('display.max_columns', None)
# Constants for application
APPLICATION_NAME = "p13n_data_introduction"
CHECKPOINT_DIRECTORY = "gs://p13n-storage2/user/s1b0jec"
spark_config = {}
spark_config["spark.executor.memory"] = "32G"
# spark_config["spark.executor.memoryOverhead"] = "4G"
spark_config["spark.executor.cores"] = "32"
spark_config["spark.driver.memory"] = "32G"
# spark_config["spark.shuffle.memoryFraction"] = "0"
# Executor config
spark_config["spark.dyamicAllocation.enable"] = "true"
spark_config["spark.dynamicAllocation.minExecutors"] = "100"
spark_config["spark.dynamicAllocation.maxExecutors"] = "300"
spark_config["spark.submit.deployMode"] = "client"
spark_config["spark.hive.mapred.supports.subdirectories"] = "true"
spark_config["spark.yarn.queue"] = "default"
spark_config["spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive"] = "true"
spark_config["spark.hadoop.hive.exec.dynamic.partition"] = "true"
spark_config["spark.hadoop.hive.exec.dynamic.partition.mode"] = "nonstrict"
spark_config["spark.hadoop.hive.exec.max.dynamic.partitions.pernode"] = "100"
spark_config["spark.yarn.dist.archives"] = "gs://p13n-storage2/user/s1b0jec/envs/spark.zip#mypython"
# spark_config["spark.yarn.appMasterEnv.PYSPARK_PYTHON"] =
# spark_config["spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON"] = "python" # For client mode it's the default `python` executable, whereas for cluster mode we use the distributed python environment
os.environ['PYSPARK_PYTHON'] = "./mypython/spark/bin/python"
os.environ['PYSPARK_DRIVER_PYTHON'] = "python"
spark_config["spark.jars"] = "/opt/lib/bfdms-ien/dp1.5/lib/apache-hive-1.3.0-SNAPSHOT-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.3.0-SNAPSHOT.jar,/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar,/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar,/opt/lib/bfdms-ien/dp1.5/lib/spark-2.4.8-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar"
spark_conf = SparkConf().setAll(spark_config.items())
spark = SparkSession.builder.appName(APPLICATION_NAME) \
.config(conf=spark_conf).enableHiveSupport().getOrCreate()
print("Spark session created: ", spark.sparkContext.applicationId)
spark.sparkContext.setCheckpointDir(CHECKPOINT_DIRECTORY)
import warnings
warnings.filterwarnings("ignore")
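As a side note, here is a sketch (not taken from the thread) of one way to avoid collecting 72 million rows to the driver: compute the histogram bins on the executors with RDD.histogram and bring only the bin edges and counts back for plotting. It assumes df is the dataframe from the question and that matplotlib is available on the driver.
import matplotlib.pyplot as plt

# Bin the values of column y on the executors; only the bin edges and counts
# come back to the driver.
edges, counts = (
    df.select('y')
      .rdd
      .map(lambda row: float(row.y))
      .histogram(50)  # 50 equal-width buckets, computed distributedly
)

widths = [b - a for a, b in zip(edges[:-1], edges[1:])]
plt.bar(edges[:-1], counts, width=widths, align='edge')
plt.xlabel('y')
plt.ylabel('count')
plt.show()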

Parallel loop in python with joblib throws weird error

I am trying to run a very simple parallel loop in Python:
import numpy as np
from joblib import Parallel, delayed

my_array = np.zeros((2, 3))

def foo(array, x):
    for i in [0, 1, 2]:
        array[x][i] = 25
    print(array, id(array), 'arrays in workers')

def main(array):
    print(id(array), 'Original array')
    inputs = [0, 1]
    if __name__ == '__main__':
        Parallel(n_jobs=8, verbose=0)((foo)(array, i) for i in inputs)
        # print(my_array, id(array), 'Original array')

main(my_array)
which does alter the array in the end, but I get the following error:
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/john/.local/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
r = call_item()
File "/home/john/.local/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/john/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/home/john/.local/lib/python3.8/site-packages/joblib/parallel.py", line 252, in __call__
return [func(*args, **kwargs)
File "/home/john/.local/lib/python3.8/site-packages/joblib/parallel.py", line 253, in <listcomp>
for func, args, kwargs in self.items]
TypeError: cannot unpack non-iterable NoneType object
"""
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-74-e1b992b5617f> in <module>
15 # print(my_array, id(array), 'Original array')
16
---> 17 main(my_array)
<ipython-input-74-e1b992b5617f> in main(array)
12 inputs = [0,1]
13 if __name__ == '__main__':
---> 14 Parallel(n_jobs=8, verbose = 0)((foo)(array,i) for i in inputs)
15 # print(my_array, id(array), 'Original array')
16
~/.local/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1040
1041 with self._backend.retrieval_context():
-> 1042 self.retrieve()
1043 # Make sure that we get a last message telling us we are done
1044 elapsed_time = time.time() - self._start_time
~/.local/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
919 try:
920 if getattr(self._backend, 'supports_timeout', False):
--> 921 self._output.extend(job.get(timeout=self.timeout))
922 else:
923 self._output.extend(job.get())
~/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
/usr/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
442 raise CancelledError()
443 elif self._state == FINISHED:
--> 444 return self.__get_result()
445 else:
446 raise TimeoutError()
/usr/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
387 if self._exception:
388 try:
--> 389 raise self._exception
390 finally:
391 # Break a reference cycle with the exception in self._exception
TypeError: cannot unpack non-iterable NoneType object
Now, since the array has been altered, I could just wrap everything in a try/except and pretend it works, but I am curious how to actually make this error go away.
Thank you for your time.
What you are missing is joblib's delayed function: wrapping foo with delayed in the Parallel call executes your code without any error, e.g.
import numpy as np
from joblib import Parallel, delayed

my_array = np.zeros((2, 3))

def foo(array, x):
    for i in [0, 1, 2]:
        array[x][i] = 25
    print(array, id(array), 'arrays in workers')

def main(array):
    print(id(array), 'Original array')
    inputs = [0, 1]
    if __name__ == '__main__':
        Parallel(n_jobs=8, verbose=0, prefer='threads')([delayed(foo)(array, i) for i in inputs])
        # print(my_array, id(array), 'Original array')

main(my_array)
The theoretical and technical details of this function are explained here; read the accepted answer to understand the role of delayed in your code.
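As a side note on the design: with the default loky (process) backend each worker operates on its own copy of the array, so in-place writes would not propagate back to my_array; prefer='threads' avoids that here. A sketch of the alternative pattern, returning values from the workers and assembling them afterwards (fill_row is just an illustrative helper):
import numpy as np
from joblib import Parallel, delayed

def fill_row(x):
    # Build and return the row instead of mutating a shared array in place.
    return np.full(3, 25)

rows = Parallel(n_jobs=8)(delayed(fill_row)(i) for i in [0, 1])
my_array = np.vstack(rows)
print(my_array, 'assembled in the caller')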

How to abbreviate traceback in Jupyter Notebook?

I documented an XML API with a Jupyter Notebook, so documentation and specification cannot drift apart.
This works great.
As the API also has to handle invalid input, Jupyter Notebook correctly shows the traceback.
The traceback is very verbose; I'd like to abbreviate or shorten it so that, ideally, only the last line is shown.
request
server.get_licenses("not-existing-id")
current printout in Jupyter Notebook
---------------------------------------------------------------------------
Fault Traceback (most recent call last)
<ipython-input-5-366cceb6869e> in <module>
----> 1 server.get_licenses("not-existing-id")
/usr/lib/python3.9/xmlrpc/client.py in __call__(self, *args)
1114 return _Method(self.__send, "%s.%s" % (self.__name, name))
1115 def __call__(self, *args):
-> 1116 return self.__send(self.__name, args)
1117
1118 ##
/usr/lib/python3.9/xmlrpc/client.py in __request(self, methodname, params)
1456 allow_none=self.__allow_none).encode(self.__encoding, 'xmlcharrefreplace')
1457
-> 1458 response = self.__transport.request(
1459 self.__host,
1460 self.__handler,
/usr/lib/python3.9/xmlrpc/client.py in request(self, host, handler, request_body, verbose)
1158 for i in (0, 1):
1159 try:
-> 1160 return self.single_request(host, handler, request_body, verbose)
1161 except http.client.RemoteDisconnected:
1162 if i:
/usr/lib/python3.9/xmlrpc/client.py in single_request(self, host, handler, request_body, verbose)
1174 if resp.status == 200:
1175 self.verbose = verbose
-> 1176 return self.parse_response(resp)
1177
1178 except Fault:
/usr/lib/python3.9/xmlrpc/client.py in parse_response(self, response)
1346 p.close()
1347
-> 1348 return u.close()
1349
1350 ##
/usr/lib/python3.9/xmlrpc/client.py in close(self)
660 raise ResponseError()
661 if self._type == "fault":
--> 662 raise Fault(**self._stack[0])
663 return tuple(self._stack)
664
Fault: <Fault 1: 'company id is not valid'>
desired output
Fault: <Fault 1: 'company id is not valid'>
As it turns out, this is built into IPython, so you don't need to install or update anything.
Just put a single cell at the top of your notebook and run %xmode Minimal as its only input. You can also see the documentation with %xmode?, or a lot of other magic-command documentation with %quickref.
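For example, a first cell like the following (a minimal sketch) reduces later tracebacks in the notebook to just the final exception line:
# Run once, e.g. in the first cell of the notebook.
%xmode Minimal

# A failing call such as server.get_licenses("not-existing-id") then prints
# only the last line, e.g.:
# Fault: <Fault 1: 'company id is not valid'>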
The following solution, using sys.excepthook, works in a plain Python REPL...
code
import sys

def my_exc_handler(type, value, traceback):
    print(repr(value), file=sys.stderr)

sys.excepthook = my_exc_handler

1 / 0
bash
❯ python3.9 main.py
ZeroDivisionError('division by zero')
... but unfortunately not in Jupyter Notebook - I still get the full traceback.
When I have a look at Python's documentation for sys.excepthook...
"When an exception is raised and uncaught"
... maybe the "uncaught" part is the problem. If I had to guess, I'd say Jupyter Notebook catches all exceptions and does the formatting and printing itself.
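If only specific exception types should be shortened, IPython also exposes a custom exception handler hook; here is a sketch (assuming it runs inside Jupyter/IPython, where get_ipython() is available, and that only xmlrpc Fault errors should be abbreviated):
from xmlrpc.client import Fault

def short_fault_handler(shell, etype, value, tb, tb_offset=None):
    # Print just the final line instead of the full traceback.
    print(f"{etype.__name__}: {value}")

# Only Fault exceptions are shortened; everything else keeps the normal traceback.
get_ipython().set_custom_exc((Fault,), short_fault_handler)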

Why am I getting an assertion error when creating a Device Quantile Matrix?

I am using the following code to load a CSV file into a dask_cudf dataframe and then create a DaskDeviceQuantileDMatrix for XGBoost, which yields the error below:
cluster = LocalCUDACluster(rmm_pool_size=parse_bytes("9GB"), n_workers=5, threads_per_worker=1)
client = Client(cluster)
ddb = dask_cudf.read_csv('/home/ubuntu/dataset.csv')
xTrain = ddb.iloc[:,20:]
yTrain = ddb.iloc[:,1:2]
dTrain = xgb.dask.DaskDeviceQuantileDMatrix(client=client, data=xTrain, label=yTrain)
error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-16-2cca13ac807f> in <module>
----> 1 dTrain = xgb.dask.DaskDeviceQuantileDMatrix(client=client, data=xTrain, label=yTrain)
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/xgboost/dask.py in __init__(self, client, data, label, missing, weight, base_margin, label_lower_bound, label_upper_bound, feature_names, feature_types, max_bin)
508 label_upper_bound=label_upper_bound,
509 feature_names=feature_names,
--> 510 feature_types=feature_types)
511 self.max_bin = max_bin
512 self.is_quantile = True
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/xgboost/dask.py in __init__(self, client, data, label, missing, weight, base_margin, label_lower_bound, label_upper_bound, feature_names, feature_types)
229 base_margin=base_margin,
230 label_lower_bound=label_lower_bound,
--> 231 label_upper_bound=label_upper_bound)
232
233 def __await__(self):
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
835 else:
836 return sync(
--> 837 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
838 )
839
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
338 if error[0]:
339 typ, exc, tb = error[0]
--> 340 raise exc.with_traceback(tb)
341 else:
342 return result[0]
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/utils.py in f()
322 if callback_timeout is not None:
323 future = asyncio.wait_for(future, callback_timeout)
--> 324 result[0] = yield future
325 except Exception as exc:
326 error[0] = sys.exc_info()
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/xgboost/dask.py in map_local_data(self, client, data, label, weights, base_margin, label_lower_bound, label_upper_bound)
311
312 for part in parts:
--> 313 assert part.status == 'finished'
314
315 # Preserving the partition order for prediction.
AssertionError:
I have no idea what causes this error, since it doesn't say anything other than "assertion error". I have a large dataset that is too big to read onto a single GPU, so I am using dask_cudf to split it up when reading it from disk and then feeding it directly into the data structure required by XGBoost. I'm not sure whether it's a dask_cudf problem or an XGBoost problem.
New error when I use wait while persisting:
distributed.core - ERROR - 2154341415 exceeds max_bin_len(2147483647)
Traceback (most recent call last):
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 563, in handle_stream
handler(**merge(extra, msg))
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 2382, in update_graph_hlg
dsk, dependencies, annotations = highlevelgraph_unpack(hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/highlevelgraph.py", line 161, in highlevelgraph_unpack
hlg = loads_msgpack(*dumped_hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/core.py", line 223, in loads_msgpack
payload, object_hook=msgpack_decode_default, use_list=False, **msgpack_opts
File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
ValueError: 2154341415 exceeds max_bin_len(2147483647)
tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tcp://127.0.0.1:43507' processes=4 threads=4, memory=49.45 GB>>
Traceback (most recent call last):
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/client.py", line 1177, in _heartbeat
self.scheduler_comm.send({"op": "heartbeat-client"})
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/batched.py", line 136, in send
raise CommClosedError
distributed.comm.core.CommClosedError
distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 491, in handle_comm
result = await result
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 3247, in add_client
await self.handle_stream(comm=comm, extra={"client": client})
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 563, in handle_stream
handler(**merge(extra, msg))
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 2382, in update_graph_hlg
dsk, dependencies, annotations = highlevelgraph_unpack(hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/highlevelgraph.py", line 161, in highlevelgraph_unpack
hlg = loads_msgpack(*dumped_hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/core.py", line 223, in loads_msgpack
payload, object_hook=msgpack_decode_default, use_list=False, **msgpack_opts
File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
ValueError: 2154341415 exceeds max_bin_len(2147483647)
tornado.application - ERROR - Exception in callback functools.partial(<function TCPServer._handle_connection.<locals>.<lambda> at 0x7f7058e87f80>, <Task finished coro=<BaseTCPListener._handle_stream() done, defined at /usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/comm/tcp.py:459> exception=ValueError('2154341415 exceeds max_bin_len(2147483647)')>)
Traceback (most recent call last):
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/tcpserver.py", line 331, in <lambda>
gen.convert_yielded(future), lambda f: f.result()
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/comm/tcp.py", line 476, in _handle_stream
await self.comm_handler(comm)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 491, in handle_comm
result = await result
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 3247, in add_client
await self.handle_stream(comm=comm, extra={"client": client})
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 563, in handle_stream
handler(**merge(extra, msg))
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 2382, in update_graph_hlg
dsk, dependencies, annotations = highlevelgraph_unpack(hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/highlevelgraph.py", line 161, in highlevelgraph_unpack
hlg = loads_msgpack(*dumped_hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/core.py", line 223, in loads_msgpack
payload, object_hook=msgpack_decode_default, use_list=False, **msgpack_opts
File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
ValueError: 2154341415 exceeds max_bin_len(2147483647)
---------------------------------------------------------------------------
CancelledError Traceback (most recent call last)
<ipython-input-9-e2b8073da6e7> in <module>
1 from dask.distributed import wait
----> 2 wait([xTrainDC,yTrainDC])
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/client.py in wait(fs, timeout, return_when)
4257 """
4258 client = default_client()
-> 4259 result = client.sync(_wait, fs, timeout=timeout, return_when=return_when)
4260 return result
4261
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
835 else:
836 return sync(
--> 837 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
838 )
839
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
338 if error[0]:
339 typ, exc, tb = error[0]
--> 340 raise exc.with_traceback(tb)
341 else:
342 return result[0]
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/utils.py in f()
322 if callback_timeout is not None:
323 future = asyncio.wait_for(future, callback_timeout)
--> 324 result[0] = yield future
325 except Exception as exc:
326 error[0] = sys.exc_info()
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
CancelledError:
I'm guessing something in dask_cudf.read_csv('/home/ubuntu/dataset.csv') is failing, which causes the underlying futures' status to not be 'finished'. Does the CSV fit in GPU memory across the GPUs you're using? Could you try the following code and report back the error message?
This tells Dask to compute the result of the read_csv and iloc calls and wait for the distributed result to finish before moving on to creating the DMatrix.
from dask.distributed import Client, wait
from dask.utils import parse_bytes
from dask_cuda import LocalCUDACluster
import dask_cudf
import xgboost as xgb

cluster = LocalCUDACluster(rmm_pool_size=parse_bytes("9GB"), n_workers=5, threads_per_worker=1)
client = Client(cluster)
ddb = dask_cudf.read_csv('/home/ubuntu/dataset.csv')
xTrain = ddb.iloc[:,20:].persist()
yTrain = ddb.iloc[:,1:2].persist()
wait([xTrain, yTrain])
dTrain = xgb.dask.DaskDeviceQuantileDMatrix(client=client, data=xTrain, label=yTrain)

pymongo error when writing

I am unable to do any writes to a remote MongoDB database. I am able to connect and do lookups (e.g. find). I connect like this:
conn = pymongo.MongoClient(db_uri,slaveOK=True)
db = conn.test_database
coll = db.test_collection
But when I try to insert,
coll.insert({'a':1})
I run into an error:
---------------------------------------------------------------------------
AutoReconnect Traceback (most recent call last)
<ipython-input-56-d4ffb9e3fa79> in <module>()
----> 1 coll.insert({'a':1})
/usr/lib/python2.7/dist-packages/pymongo/collection.pyc in insert(self, doc_or_docs, manipulate, safe, check_keys, continue_on_error, **kwargs)
410 message._do_batched_insert(self.__full_name, gen(), check_keys,
411 safe, options, continue_on_error,
--> 412 self.uuid_subtype, client)
413
414 if return_one:
/usr/lib/python2.7/dist-packages/pymongo/mongo_client.pyc in _send_message(self, message, with_last_error, command, check_primary)
1126 except (ConnectionFailure, socket.error), e:
1127 self.disconnect()
-> 1128 raise AutoReconnect(str(e))
1129 except:
1130 sock_info.close()
AutoReconnect: not master
If I remove the slaveOK=True (setting it to its default value of False), then I can still connect, but the reads (and writes) fail:
AutoReconnect Traceback (most recent call last)
<ipython-input-70-6671eea24f80> in <module>()
----> 1 coll.find_one()
/usr/lib/python2.7/dist-packages/pymongo/collection.pyc in find_one(self, spec_or_id, *args, **kwargs)
719 *args, **kwargs).max_time_ms(max_time_ms)
720
--> 721 for result in cursor.limit(-1):
722 return result
723 return None
/usr/lib/python2.7/dist-packages/pymongo/cursor.pyc in next(self)
1036 raise StopIteration
1037 db = self.__collection.database
-> 1038 if len(self.__data) or self._refresh():
1039 if self.__manipulate:
1040 return db._fix_outgoing(self.__data.popleft(),
/usr/lib/python2.7/dist-packages/pymongo/cursor.pyc in _refresh(self)
980 self.__skip, ntoreturn,
981 self.__query_spec(), self.__fields,
--> 982 self.__uuid_subtype))
983 if not self.__id:
984 self.__killed = True
/usr/lib/python2.7/dist-packages/pymongo/cursor.pyc in __send_message(self, message)
923 self.__tz_aware,
924 self.__uuid_subtype,
--> 925 self.__compile_re)
926 except CursorNotFound:
927 self.__killed = True
/usr/lib/python2.7/dist-packages/pymongo/helpers.pyc in _unpack_response(response, cursor_id, as_class, tz_aware, uuid_subtype, compile_re)
99 error_object = bson.BSON(response[20:]).decode()
100 if error_object["$err"].startswith("not master"):
--> 101 raise AutoReconnect(error_object["$err"])
102 elif error_object.get("code") == 50:
103 raise ExecutionTimeout(error_object.get("$err"),
AutoReconnect: not master and slaveOk=false
Am I connecting incorrectly? Is there a way to specify connecting to the primary replica?
AutoReconnect: not master means that your operation is failing because the node on which you are attempting to issue the command is not the primary of a replica set, where the command (e.g., a write operation) requires that node to be a primary. Setting slaveOK=True just enables you to read from a secondary node, where by default you would only be able to read from the primary.
MongoClient is automatically able to discover and connect to the primary if the replica set name is provided to the constructor with replicaSet=<replica set name>. See "Connecting to a Replica Set" in the PyMongo docs.
As an aside, slaveOK is deprecated, replaced by ReadPreference. You can specify a ReadPreference when creating the client or when issuing queries, if you want to target a node other than the primary.
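For example (a sketch; the host names and the replica set name rs0 are placeholders for your own deployment):
from pymongo import MongoClient

# Listing several members lets the driver discover the current primary.
client = MongoClient(
    "mongodb://host1:27017,host2:27017",
    replicaSet="rs0",
)

db = client.test_database
coll = db.test_collection

coll.insert({'a': 1})   # writes are routed to the discovered primary
coll.find_one()         # reads go to the primary unless a read preference is set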
I don't know whether this is related to the topic or not, but when I searched for the exception below, Google led me to this question, so maybe it'll be helpful:
pymongo.errors.NotMasterError: not master
In my case, my hard drive was full.
You can figure this out with the df -h command.
