Parallel loop in python with joblib throws weird error

I am trying to run a very simple parallel loop in Python:
from joblib import Parallel, delayed
my_array = np.zeros((2,3))
def foo(array,x):
    for i in [0,1,2]:
        array[x][i]=25
    print(array, id(array), 'arrays in workers')
def main(array):
    print(id(array), 'Original array')
    inputs = [0,1]
    if __name__ == '__main__':
        Parallel(n_jobs=8, verbose = 0)((foo)(array,i) for i in inputs)
        # print(my_array, id(array), 'Original array')
main(my_array)
This does alter the array in the end, but I get the following error:
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/john/.local/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
r = call_item()
File "/home/john/.local/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/john/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/home/john/.local/lib/python3.8/site-packages/joblib/parallel.py", line 252, in __call__
return [func(*args, **kwargs)
File "/home/john/.local/lib/python3.8/site-packages/joblib/parallel.py", line 253, in <listcomp>
for func, args, kwargs in self.items]
TypeError: cannot unpack non-iterable NoneType object
"""
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-74-e1b992b5617f> in <module>
15 # print(my_array, id(array), 'Original array')
16
---> 17 main(my_array)
<ipython-input-74-e1b992b5617f> in main(array)
12 inputs = [0,1]
13 if __name__ == '__main__':
---> 14 Parallel(n_jobs=8, verbose = 0)((foo)(array,i) for i in inputs)
15 # print(my_array, id(array), 'Original array')
16
~/.local/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1040
1041 with self._backend.retrieval_context():
-> 1042 self.retrieve()
1043 # Make sure that we get a last message telling us we are done
1044 elapsed_time = time.time() - self._start_time
~/.local/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
919 try:
920 if getattr(self._backend, 'supports_timeout', False):
--> 921 self._output.extend(job.get(timeout=self.timeout))
922 else:
923 self._output.extend(job.get())
~/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
/usr/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
442 raise CancelledError()
443 elif self._state == FINISHED:
--> 444 return self.__get_result()
445 else:
446 raise TimeoutError()
/usr/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
387 if self._exception:
388 try:
--> 389 raise self._exception
390 finally:
391 # Break a reference cycle with the exception in self._exception
TypeError: cannot unpack non-iterable NoneType object
Since the array has been altered, I could just wrap everything in a try/except and pretend it works, but I am curious how to actually make this error go away.
Thank you for your time.
Best

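What you are missing is joblib's delayed function. Without it, (foo)(array, i) calls foo immediately in the main process and hands its return value, None, to Parallel, which expects each item to be a (function, args, kwargs) tuple. That is where "cannot unpack non-iterable NoneType object" comes from, and also why the array still ends up altered. Putting delayed in the Parallel call executes your code without any error. With prefer='threads', as used below, the workers share memory with the main program, so the in-place changes to the array remain visible. For example: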
import numpy as np
from joblib import Parallel, delayed
my_array = np.zeros((2,3))
def foo(array, x):
    for i in [0,1,2]:
        array[x][i]=25
    print(array, id(array), 'arrays in workers')
def main(array):
    print(id(array), 'Original array')
    inputs = [0, 1]
    if __name__ == '__main__':
        Parallel(n_jobs=8, verbose = 0, prefer='threads')([delayed(foo)(array, i) for i in inputs])
        # print(my_array, id(array), 'Original array')
main(my_array)
The technical details of this function are explained here; read the accepted answer to understand the role of delayed in your code.
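To make the mechanism concrete, here is a rough standard-library sketch of the idea. The names toy_delayed, run_tasks and fill_row are hypothetical stand-ins written for this illustration; they are not joblib's actual implementation, only a simplified model of "capture the call now, run it later".

```python
import functools
from concurrent.futures import ThreadPoolExecutor


def toy_delayed(function):
    """Hypothetical stand-in for joblib.delayed: capture a call without running it."""
    @functools.wraps(function)
    def capture(*args, **kwargs):
        return function, args, kwargs
    return capture


def run_tasks(tasks, n_workers=2):
    """Hypothetical stand-in for Parallel(...)(tasks): unpack each task, then run it."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(func, *args, **kwargs) for func, args, kwargs in tasks]
        return [future.result() for future in futures]


def fill_row(rows, x):
    # In-place update, analogous to foo() in the question.
    for i in range(len(rows[x])):
        rows[x][i] = 25


rows = [[0, 0, 0], [0, 0, 0]]

# Correct: every item handed to run_tasks is a (function, args, kwargs) tuple.
run_tasks(toy_delayed(fill_row)(rows, x) for x in [0, 1])
print(rows)

# Incorrect: fill_row runs immediately and returns None, so there is nothing to unpack.
try:
    run_tasks(fill_row(rows, x) for x in [0, 1])
    unpack_error = None
except TypeError as exc:
    unpack_error = str(exc)
print(unpack_error)
```

The second call fails while unpacking the task, which mirrors the `for func, args, kwargs in self.items` line in the traceback from the question.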

Related

GridSearchCV & BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable

I'm running a GridSearchCV for NLP data, this is the code I'm using:
%%time
# Next we can specify the hyperparameters for each model
param_grid = [
    {
        'transformer': list_of_vecs,
        'scaler': [StandardScaler()],
        'model': [LogisticRegression()],
        'model__penalty': ['l1', 'l2'],
        'model__C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
    },
    {
        'transformer': list_of_vecs,
        'scaler': [StandardScaler()],
        'model': [DecisionTreeClassifier()],
        'model__max_depth': [2, 3, 4, 5, 6]
    }
]
# Train the GridSearch
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
fitted_grid = grid.fit(X_train, y_train)
I've already run the GridSearch successfully once, with fewer hyperparameters, just to make sure it would run. After I added a few more model__ parameters I suddenly started getting this error, and it only appears after the code has been running for about an hour. Any idea how I can fix this?
exception calling callback for <Future at 0x1da7efdba60 state=finished
raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback: """
Traceback (most recent call last): File
"C:\Users\Alfredo\anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py",
line 407, in _process_worker File
"C:\Users\Alfredo\anaconda3\lib\multiprocessing\queues.py", line 117,
in get
res = self._recv_bytes() File "C:\Users\Alfredo\anaconda3\lib\multiprocessing\connection.py", line
221, in recv_bytes File
"C:\Users\Alfredo\anaconda3\lib\multiprocessing\connection.py", line
323, in _recv_bytes File
"C:\Users\Alfredo\anaconda3\lib\multiprocessing\connection.py", line
345, in _get_more_data MemoryError """
The above exception was the direct cause of the following exception:
Traceback (most recent call last): File
"C:\Users\Alfredo\anaconda3\lib\site-packages\joblib\externals\loky_base.py",
line 625, in _invoke_callbacks
callback(self) File "C:\Users\Alfredo\anaconda3\lib\site-packages\joblib\parallel.py",
line 359, in call
self.parallel.dispatch_next() File "C:\Users\Alfredo\anaconda3\lib\site-packages\joblib\parallel.py",
line 794, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator): File "C:\Users\Alfredo\anaconda3\lib\site-packages\joblib\parallel.py",
line 861, in dispatch_one_batch
self._dispatch(tasks) File "C:\Users\Alfredo\anaconda3\lib\site-packages\joblib\parallel.py",
line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb) File "C:\Users\Alfredo\anaconda3\lib\site-packages\joblib_parallel_backends.py",
line 531, in apply_async
future = self._workers.submit(SafeFunction(func)) File "C:\Users\Alfredo\anaconda3\lib\site-packages\joblib\externals\loky\reusable_executor.py",
line 177, in submit
return super(_ReusablePoolExecutor, self).submit( File "C:\Users\Alfredo\anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py",
line 1115, in submit
raise self._flags.broken joblib.externals.loky.process_executor.BrokenProcessPool: A task has
failed to un-serialize. Please ensure that the arguments of the
function are all picklable.
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback: """ Traceback (most recent call last): File "C:\Users\Alfredo\anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py",
line 407, in _process_worker File
"C:\Users\Alfredo\anaconda3\lib\multiprocessing\queues.py", line 117,
in get
res = self._recv_bytes() File "C:\Users\Alfredo\anaconda3\lib\multiprocessing\connection.py", line
221, in recv_bytes File
"C:\Users\Alfredo\anaconda3\lib\multiprocessing\connection.py", line
323, in _recv_bytes File
"C:\Users\Alfredo\anaconda3\lib\multiprocessing\connection.py", line
345, in _get_more_data MemoryError """
The above exception was the direct cause of the following exception:
BrokenProcessPool Traceback (most recent call
last) in
~\anaconda3\lib\site-packages\sklearn\model_selection_search.py in
fit(self, X, y, groups, **fit_params)
889 return results
890
--> 891 self._run_search(evaluate_candidates)
892
893 # multimetric is determined here because in the case of a callable
~\anaconda3\lib\site-packages\sklearn\model_selection_search.py in
_run_search(self, evaluate_candidates) 1390 def _run_search(self, evaluate_candidates): 1391 """Search all candidates in param_grid"""
-> 1392 evaluate_candidates(ParameterGrid(self.param_grid)) 1393 1394
~\anaconda3\lib\site-packages\sklearn\model_selection_search.py in
evaluate_candidates(candidate_params, cv, more_results)
836 )
837
--> 838 out = parallel(
839 delayed(_fit_and_score)(
840 clone(base_estimator),
~\anaconda3\lib\site-packages\joblib\parallel.py in call(self,
iterable) 1054 1055 with
self._backend.retrieval_context():
-> 1056 self.retrieve() 1057 # Make sure that we get a last message telling us we are done 1058
elapsed_time = time.time() - self._start_time
~\anaconda3\lib\site-packages\joblib\parallel.py in retrieve(self)
933 try:
934 if getattr(self._backend, 'supports_timeout', False):
--> 935 self._output.extend(job.get(timeout=self.timeout))
936 else:
937 self._output.extend(job.get())
~\anaconda3\lib\site-packages\joblib_parallel_backends.py in
wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
~\anaconda3\lib\concurrent\futures_base.py in result(self, timeout)
443 raise CancelledError()
444 elif self._state == FINISHED:
--> 445 return self.__get_result()
446 else:
447 raise TimeoutError()
~\anaconda3\lib\concurrent\futures_base.py in __get_result(self)
388 if self._exception:
389 try:
--> 390 raise self._exception
391 finally:
392 # Break a reference cycle with the exception in self._exception
~\anaconda3\lib\site-packages\joblib\externals\loky_base.py in
_invoke_callbacks(self)
623 for callback in self._done_callbacks:
624 try:
--> 625 callback(self)
626 except BaseException:
627 LOGGER.exception('exception calling callback for %r', self)
~\anaconda3\lib\site-packages\joblib\parallel.py in call(self,
out)
357 with self.parallel._lock:
358 if self.parallel._original_iterator is not None:
--> 359 self.parallel.dispatch_next()
360
361
~\anaconda3\lib\site-packages\joblib\parallel.py in
dispatch_next(self)
792
793 """
--> 794 if not self.dispatch_one_batch(self._original_iterator):
795 self._iterating = False
796 self._original_iterator = None
~\anaconda3\lib\site-packages\joblib\parallel.py in
dispatch_one_batch(self, iterator)
859 return False
860 else:
--> 861 self._dispatch(tasks)
862 return True
863
~\anaconda3\lib\site-packages\joblib\parallel.py in _dispatch(self,
batch)
777 with self._lock:
778 job_idx = len(self._jobs)
--> 779 job = self._backend.apply_async(batch, callback=cb)
780 # A job can complete so quickly than its callback is
781 # called before we get here, causing self._jobs to
~\anaconda3\lib\site-packages\joblib_parallel_backends.py in
apply_async(self, func, callback)
529 def apply_async(self, func, callback=None):
530 """Schedule a func to be run"""
--> 531 future = self._workers.submit(SafeFunction(func))
532 future.get = functools.partial(self.wrap_future_result, future)
533 if callback is not None:
~\anaconda3\lib\site-packages\joblib\externals\loky\reusable_executor.py
in submit(self, fn, *args, **kwargs)
175 def submit(self, fn, *args, **kwargs):
176 with self._submit_resize_lock:
--> 177 return super(_ReusablePoolExecutor, self).submit(
178 fn, *args, **kwargs)
179
~\anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py
in submit(self, fn, *args, **kwargs) 1113 with
self._flags.shutdown_lock: 1114 if self._flags.broken
is not None:
-> 1115 raise self._flags.broken 1116 if self._flags.shutdown: 1117 raise
ShutdownExecutorError(
BrokenProcessPool: A task has failed to un-serialize. Please ensure
that the arguments of the function are all picklable.

You can fix it by removing n_jobs=-1, although that also gives up parallel processing. To keep parallelism, another thing you could try is setting pre_dispatch, which controls the number of jobs that get dispatched during parallel execution. The default value is 2 times n_jobs, so it could be overloading your processing queue (the worker traceback above ends in a MemoryError). I had a case like yours: I set n_jobs=-1 and pre_dispatch='1*n_jobs', and this worked for me.
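As a minimal sketch of that setting, the snippet below passes pre_dispatch to GridSearchCV. The synthetic dataset from make_classification, the two-step pipeline, the small parameter grid and n_jobs=2 are all assumptions chosen so the example runs quickly; they stand in for the asker's NLP pipeline and are not taken from it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (assumption: stands in for X_train, y_train).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical two-step pipeline; the original pipeline also had a 'transformer' step.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
param_grid = {"model__C": [0.01, 0.1, 1, 10]}

# pre_dispatch='1*n_jobs' keeps at most one queued task per worker,
# instead of the default '2*n_jobs'.
grid = GridSearchCV(pipe, param_grid, cv=3, n_jobs=2, pre_dispatch="1*n_jobs")
grid.fit(X, y)
print(grid.best_params_)
```

Fewer pre-dispatched tasks means fewer copies of the data queued at once, at the cost of workers occasionally waiting for the next task.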

Why am I getting an assertion error when creating a Device Quantile Matrix?

I am using the following code to load a CSV file into a dask_cudf DataFrame and then create a DaskDeviceQuantileDMatrix for XGBoost, which yields the error below:
cluster = LocalCUDACluster(rmm_pool_size=parse_bytes("9GB"), n_workers=5, threads_per_worker=1)
client = Client(cluster)
ddb = dask_cudf.read_csv('/home/ubuntu/dataset.csv')
xTrain = ddb.iloc[:,20:]
yTrain = ddb.iloc[:,1:2]
dTrain = xgb.dask.DaskDeviceQuantileDMatrix(client=client, data=xTrain, label=yTrain)
error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-16-2cca13ac807f> in <module>
----> 1 dTrain = xgb.dask.DaskDeviceQuantileDMatrix(client=client, data=xTrain, label=yTrain)
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/xgboost/dask.py in __init__(self, client, data, label, missing, weight, base_margin, label_lower_bound, label_upper_bound, feature_names, feature_types, max_bin)
508 label_upper_bound=label_upper_bound,
509 feature_names=feature_names,
--> 510 feature_types=feature_types)
511 self.max_bin = max_bin
512 self.is_quantile = True
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/xgboost/dask.py in __init__(self, client, data, label, missing, weight, base_margin, label_lower_bound, label_upper_bound, feature_names, feature_types)
229 base_margin=base_margin,
230 label_lower_bound=label_lower_bound,
--> 231 label_upper_bound=label_upper_bound)
232
233 def __await__(self):
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
835 else:
836 return sync(
--> 837 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
838 )
839
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
338 if error[0]:
339 typ, exc, tb = error[0]
--> 340 raise exc.with_traceback(tb)
341 else:
342 return result[0]
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/utils.py in f()
322 if callback_timeout is not None:
323 future = asyncio.wait_for(future, callback_timeout)
--> 324 result[0] = yield future
325 except Exception as exc:
326 error[0] = sys.exc_info()
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/xgboost/dask.py in map_local_data(self, client, data, label, weights, base_margin, label_lower_bound, label_upper_bound)
311
312 for part in parts:
--> 313 assert part.status == 'finished'
314
315 # Preserving the partition order for prediction.
AssertionError:
I have no idea what causes this error, since it doesn't say anything other than "AssertionError". I have a large dataset that is too big to read into a single GPU, so I am using dask_cudf to split it up when I read it from disk and then feeding it directly into the data structure required for XGBoost. I'm not sure whether it's a dask_cudf problem or an XGBoost problem.
New error when I use the "wait" while persisting:
distributed.core - ERROR - 2154341415 exceeds max_bin_len(2147483647)
Traceback (most recent call last):
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 563, in handle_stream
handler(**merge(extra, msg))
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 2382, in update_graph_hlg
dsk, dependencies, annotations = highlevelgraph_unpack(hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/highlevelgraph.py", line 161, in highlevelgraph_unpack
hlg = loads_msgpack(*dumped_hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/core.py", line 223, in loads_msgpack
payload, object_hook=msgpack_decode_default, use_list=False, **msgpack_opts
File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
ValueError: 2154341415 exceeds max_bin_len(2147483647)
tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tcp://127.0.0.1:43507' processes=4 threads=4, memory=49.45 GB>>
Traceback (most recent call last):
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/client.py", line 1177, in _heartbeat
self.scheduler_comm.send({"op": "heartbeat-client"})
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/batched.py", line 136, in send
raise CommClosedError
distributed.comm.core.CommClosedError
distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 491, in handle_comm
result = await result
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 3247, in add_client
await self.handle_stream(comm=comm, extra={"client": client})
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 563, in handle_stream
handler(**merge(extra, msg))
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 2382, in update_graph_hlg
dsk, dependencies, annotations = highlevelgraph_unpack(hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/highlevelgraph.py", line 161, in highlevelgraph_unpack
hlg = loads_msgpack(*dumped_hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/core.py", line 223, in loads_msgpack
payload, object_hook=msgpack_decode_default, use_list=False, **msgpack_opts
File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
ValueError: 2154341415 exceeds max_bin_len(2147483647)
tornado.application - ERROR - Exception in callback functools.partial(<function TCPServer._handle_connection.<locals>.<lambda> at 0x7f7058e87f80>, <Task finished coro=<BaseTCPListener._handle_stream() done, defined at /usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/comm/tcp.py:459> exception=ValueError('2154341415 exceeds max_bin_len(2147483647)')>)
Traceback (most recent call last):
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/tcpserver.py", line 331, in <lambda>
gen.convert_yielded(future), lambda f: f.result()
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/comm/tcp.py", line 476, in _handle_stream
await self.comm_handler(comm)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 491, in handle_comm
result = await result
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 3247, in add_client
await self.handle_stream(comm=comm, extra={"client": client})
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/core.py", line 563, in handle_stream
handler(**merge(extra, msg))
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/scheduler.py", line 2382, in update_graph_hlg
dsk, dependencies, annotations = highlevelgraph_unpack(hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/highlevelgraph.py", line 161, in highlevelgraph_unpack
hlg = loads_msgpack(*dumped_hlg)
File "/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/protocol/core.py", line 223, in loads_msgpack
payload, object_hook=msgpack_decode_default, use_list=False, **msgpack_opts
File "msgpack/_unpacker.pyx", line 195, in msgpack._cmsgpack.unpackb
ValueError: 2154341415 exceeds max_bin_len(2147483647)
---------------------------------------------------------------------------
CancelledError Traceback (most recent call last)
<ipython-input-9-e2b8073da6e7> in <module>
1 from dask.distributed import wait
----> 2 wait([xTrainDC,yTrainDC])
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/client.py in wait(fs, timeout, return_when)
4257 """
4258 client = default_client()
-> 4259 result = client.sync(_wait, fs, timeout=timeout, return_when=return_when)
4260 return result
4261
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
835 else:
836 return sync(
--> 837 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
838 )
839
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
338 if error[0]:
339 typ, exc, tb = error[0]
--> 340 raise exc.with_traceback(tb)
341 else:
342 return result[0]
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/distributed/utils.py in f()
322 if callback_timeout is not None:
323 future = asyncio.wait_for(future, callback_timeout)
--> 324 result[0] = yield future
325 except Exception as exc:
326 error[0] = sys.exc_info()
/usr/local/share/anaconda3/envs/rapidsai/lib/python3.7/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
CancelledError:

My guess is that something in dask_cudf.read_csv('/home/ubuntu/dataset.csv') is failing, which leaves the underlying futures in a state other than 'finished'. Does the CSV fit in GPU memory across the GPUs you're using? Could you try the following code and report back the error message?
It tells Dask to compute the results of the read_csv and iloc calls and to wait for the distributed results to finish before moving on to creating the DMatrix.
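# Imports assumed from the usual RAPIDS/Dask setup; the original snippet omitted most of them.
import dask_cudf
import xgboost as xgb
from dask.distributed import Client, wait
from dask.utils import parse_bytes
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster(rmm_pool_size=parse_bytes("9GB"), n_workers=5, threads_per_worker=1)
client = Client(cluster)
ddb = dask_cudf.read_csv('/home/ubuntu/dataset.csv')
# Persist both slices on the workers and block until they have been computed
xTrain = ddb.iloc[:,20:].persist()
yTrain = ddb.iloc[:,1:2].persist()
wait([xTrain, yTrain])
dTrain = xgb.dask.DaskDeviceQuantileDMatrix(client=client, data=xTrain, label=yTrain)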

How to debug a CommClosedError in Dask Gateway deployed in Kubernetes

I have deployed dask_gateway 0.8.0 (with dask==2.25.0 and distributed==2.25.0) in a Kubernetes cluster.
When I create a new cluster with:
cluster = gateway.new_cluster(public_address = gateway._public_address)
I get this error:
Task exception was never retrieved
future: <Task finished coro=<connect.<locals>._() done, defined at /home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py:288> exception=CommClosedError()>
Traceback (most recent call last):
File "/home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 297, in _
handshake = await asyncio.wait_for(comm.read(), 1)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-f74f0/x86_64-centos7-gcc8-opt/lib/python3.6/asyncio/tasks.py", line 351, in wait_for
yield from waiter
concurrent.futures._base.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 304, in _
raise CommClosedError() from e
distributed.comm.core.CommClosedError
However, if I check the pods, the cluster has actually been created, and I can scale it up, and everything seems fine in the dashboard (I can even see the workers).
However, I cannot get the client:
> client = cluster.get_client()
Task exception was never retrieved
future: <Task finished coro=<connect.<locals>._() done, defined at /home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py:288> exception=CommClosedError()>
Traceback (most recent call last):
File "/home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 297, in _
handshake = await asyncio.wait_for(comm.read(), 1)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-f74f0/x86_64-centos7-gcc8-opt/lib/python3.6/asyncio/tasks.py", line 351, in wait_for
yield from waiter
concurrent.futures._base.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jovyan/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 304, in _
raise CommClosedError() from e
distributed.comm.core.CommClosedError
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
321 if not comm:
--> 322 _raise(error)
323 except FatalCommClosedError:
~/.local/lib/python3.6/site-packages/distributed/comm/core.py in _raise(error)
274 )
--> 275 raise IOError(msg)
276
OSError: Timed out trying to connect to 'gateway://traefik-dask-gateway:80/jhub.0373ea68815d47fca6a6c489c8f7263a' after 100 s: connect() didn't finish in time
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
<ipython-input-19-affca45186d3> in <module>
----> 1 client = cluster.get_client()
~/.local/lib/python3.6/site-packages/dask_gateway/client.py in get_client(self, set_as_default)
1066 set_as_default=set_as_default,
1067 asynchronous=self.asynchronous,
-> 1068 loop=self.loop,
1069 )
1070 if not self.asynchronous:
~/.local/lib/python3.6/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, connection_limit, **kwargs)
743 ext(self)
744
--> 745 self.start(timeout=timeout)
746 Client._instances.add(self)
747
~/.local/lib/python3.6/site-packages/distributed/client.py in start(self, **kwargs)
948 self._started = asyncio.ensure_future(self._start(**kwargs))
949 else:
--> 950 sync(self.loop, self._start, **kwargs)
951
952 def __await__(self):
~/.local/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
337 if error[0]:
338 typ, exc, tb = error[0]
--> 339 raise exc.with_traceback(tb)
340 else:
341 return result[0]
~/.local/lib/python3.6/site-packages/distributed/utils.py in f()
321 if callback_timeout is not None:
322 future = asyncio.wait_for(future, callback_timeout)
--> 323 result[0] = yield future
324 except Exception as exc:
325 error[0] = sys.exc_info()
/cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/lib/python3.6/site-packages/tornado/gen.py in run(self)
1131
1132 try:
-> 1133 value = future.result()
1134 except Exception:
1135 self.had_exception = True
~/.local/lib/python3.6/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
1045
1046 try:
-> 1047 await self._ensure_connected(timeout=timeout)
1048 except (OSError, ImportError):
1049 await self._close()
~/.local/lib/python3.6/site-packages/distributed/client.py in _ensure_connected(self, timeout)
1103 try:
1104 comm = await connect(
-> 1105 self.scheduler.address, timeout=timeout, **self.connection_args
1106 )
1107 comm.name = "Client->Scheduler"
~/.local/lib/python3.6/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
332 backoff = min(backoff, 1) # wait at most one second
333 else:
--> 334 _raise(error)
335 else:
336 break
~/.local/lib/python3.6/site-packages/distributed/comm/core.py in _raise(error)
273 error,
274 )
--> 275 raise IOError(msg)
276
277 backoff = 0.01
OSError: Timed out trying to connect to 'gateway://traefik-dask-gateway:80/jhub.0373ea68815d47fca6a6c489c8f7263a' after 100 s: Timed out trying to connect to 'gateway://traefik-dask-gateway:80/jhub.0373ea68815d47fca6a6c489c8f7263a' after 100 s: connect() didn't finish in time
How do I debug this? Any pointer would be greatly appreciated.
I already tried increasing all the timeouts, but nothing changed:
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"]="100s"
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP"]="600s"
os.environ["DASK_DISTRIBUTED__COMM__RETRY__DELAY__MIN"]="1s"
os.environ["DASK_DISTRIBUTED__COMM__RETRY__DELAY__MAX"]="60s"
I wrote a tutorial about the steps I took to deploy dask gateway, see https://zonca.dev/2020/08/dask-gateway-jupyterhub.html.
I am quite sure this was working fine a few weeks ago, but I cannot identify what changed...

You need to use compatible versions of dask and distributed everywhere (client, scheduler, and workers).
I believe this is an error related to an upgrade in the communications protocol for distributed. See https://github.com/dask/dask-gateway/issues/316#issuecomment-702947730
These are the pinned versions of the dependencies for the docker images as of Nov 10, 2020 (in conda environment.yml compatible format):
- python=3.7.7
- dask=2.21.0
- distributed=2.21.0
- cloudpickle=1.5.0
- toolz=0.10.0

Simple code for phi(k) correlation matrix in Python

I am looking for a simple way (2 or 3 lines of code) to generate a Phi(k) correlation matrix in Python.
That should be possible since pandas_profiling is doing it, and it works fine.
But I want to be able to do it without pandas_profiling which is too heavy and computes things I don't need.
pandas_profiling uses the phik library.
I tried the phik library (I didn't find anything else),
but I don't understand the error I got:
TypeError: sequence item 0: expected str instance, int found
I have no int in my dataframe.
It seems like a bug in phik, but then how does pandas_profiling manage, since it uses phik too?
What's happening here?
Many thanks.
I have this code:
import numpy as np
import pandas as pd
import phik
NB_SAMPLES = 200
NB_VARIABLES = 3
rand_mat = np.random.uniform(low=0.5, high=15, size=(NB_SAMPLES,NB_VARIABLES))
df = pd.DataFrame(rand_mat)
df['cat_column'] = pd.cut(df[0], bins=5, labels=['F1','F2','F3','F4','F5'])
print(df)
df.phik_matrix()
Result :
             0          1          2 cat_column
0     0.911098   8.549206   9.270484         F1
1    13.591250   9.161498   5.614470         F5
2     3.308305   1.589402   5.394675         F1
3    12.031064   9.968686   7.519628         F5
4    14.427813   1.533533   2.352659         F5
..         ...        ...        ...        ...
195  10.556285   3.541869   4.804826         F4
196   5.721784  11.783908  13.104844         F2
197   7.336637  14.512256  14.993096         F3
198   4.375895  11.881784   1.129816         F2
199   0.519900   6.624423   9.239070         F1
[200 rows x 4 columns]
interval_cols not set, guessing: [0, 1, 2]
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
r = call_item()
File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
return self.fn(*self.args, **self.kwargs)
File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 608, in __call__
return self.func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 256, in __call__
for func, args, kwargs in self.items]
File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 256, in <listcomp>
for func, args, kwargs in self.items]
File "/opt/conda/lib/python3.7/site-packages/phik/phik.py", line 162, in _calc_phik
combi = ':'.join(comb)
TypeError: sequence item 0: expected str instance, int found
"""
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-31-398c72b34799> in <module>
11 df['cat_column'] = pd.cut(df[0], bins=5, labels=['F1','F2','F3','F4','F5'])
12 print(df)
---> 13 df.phik_matrix()
/opt/conda/lib/python3.7/site-packages/phik/phik.py in phik_matrix(df, interval_cols, bins, quantile, noise_correction, dropna, drop_underflow, drop_overflow)
215 data_binned, binning_dict = bin_data(df_clean, cols=interval_cols_clean, bins=bins, quantile=quantile, retbins=True)
216 return phik_from_rebinned_df(data_binned, noise_correction, dropna=dropna, drop_underflow=drop_underflow,
--> 217 drop_overflow=drop_overflow)
218
219
/opt/conda/lib/python3.7/site-packages/phik/phik.py in phik_from_rebinned_df(data_binned, noise_correction, dropna, drop_underflow, drop_overflow)
145
146 phik_list = Parallel(n_jobs=NCORES)(delayed(_calc_phik)(co, data_binned[list(co)], noise_correction)
--> 147 for co in itertools.combinations_with_replacement(data_binned.columns.values, 2))
148
149 phik_overview = create_correlation_overview_table(dict(phik_list))
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1015
1016 with self._backend.retrieval_context():
-> 1017 self.retrieve()
1018 # Make sure that we get a last message telling us we are done
1019 elapsed_time = time.time() - self._start_time
/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
907 try:
908 if getattr(self._backend, 'supports_timeout', False):
--> 909 self._output.extend(job.get(timeout=self.timeout))
910 else:
911 self._output.extend(job.get())
/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
560 AsyncResults.get from multiprocessing."""
561 try:
--> 562 return future.result(timeout=timeout)
563 except LokyTimeoutError:
564 raise TimeoutError()
/opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
/opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
TypeError: sequence item 0: expected str instance, int found

Try reinstalling the phik module, pinned to this version:
pip install phik==0.10.0
With that version, your code together with sns.heatmap produces a heatmap of the phi(k) matrix (the image is not reproduced here).
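As for why the error appears even though the data holds no ints: the traceback fails at `':'.join(comb)`, and the "guessing: [0, 1, 2]" message shows that the column labels created by `pd.DataFrame(rand_mat)` are integers. Below is a minimal sketch of that failure in plain Python. The helper `pair_keys` is hypothetical and only mimics the join shown in the traceback; it is not phik's actual code.

```python
import itertools

# Column labels as in the question's DataFrame: three integer labels
# (from pd.DataFrame(rand_mat)) plus one string label.
labels = [0, 1, 2, "cat_column"]


def pair_keys(columns):
    """Build 'a:b' keys for every column pair.

    Hypothetical helper that imitates the ':'.join(comb) call seen in the
    traceback; it is an illustration, not phik's implementation.
    """
    pairs = itertools.combinations_with_replacement(columns, 2)
    return [":".join(pair) for pair in pairs]


try:
    pair_keys(labels)
    failed = False
except TypeError as exc:
    # str.join rejects the integer labels, as in the question.
    failed = True
    print("integer labels fail:", exc)

# Converting every label to str first avoids the error.
str_labels = [str(label) for label in labels]
keys = pair_keys(str_labels)
print(keys[:3])
```

On the DataFrame from the question, the matching step would presumably be `df.columns = df.columns.astype(str)` before calling `df.phik_matrix()`. That should avoid this particular TypeError on the phik version that raised it, though I have not checked it against every phik release.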

pymc3: Disaster example with deterministic switchpoint function

I'm trying to reproduce the coal mining disaster example with a deterministic function for the switchpoint instead of theano's switch function. Code:
%matplotlib inline
import matplotlib.pyplot as plt
import pymc3
import numpy as np
import theano.tensor as t
import theano
data = np.hstack((np.random.poisson(15,1000),np.random.poisson(2,100)))
plt.plot(data)
@theano.compile.ops.as_op(itypes=[t.lscalar, t.dscalar, t.dscalar], otypes=[t.dvector])
def rate1(sw, mu1, mu2):
    n = len(data)
    out = np.empty(n)
    out[:sw] = mu1
    out[sw:] = mu2
    return out

with pymc3.Model() as dis:
    switchpoint = pymc3.DiscreteUniform('switchpoint', lower=0, upper=len(data)-1)
    mu1 = pymc3.Exponential('mu1', lam=1.)
    mu2 = pymc3.Exponential('mu2', lam=1.)
    disasters = pymc3.Poisson('disasters', mu=rate1, observed=data)
But this code raises an error:
--------------------------------------------------------------------------- KeyError Traceback (most recent call
last) c:\program files\git\theano\theano\tensor\type.py in
dtype_specs(self)
266 'complex64': (complex, 'theano_complex64', 'NPY_COMPLEX64')
--> 267 }[self.dtype]
268 except KeyError:
KeyError: 'object'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call
last) c:\program files\git\theano\theano\tensor\basic.py in
constant_or_value(x, rtype, name, ndim, dtype)
407 rval = rtype(
--> 408 TensorType(dtype=x_.dtype, broadcastable=bcastable),
409 x_.copy(),
c:\program files\git\theano\theano\tensor\type.py in init(self,
dtype, broadcastable, name, sparse_grad)
49 self.broadcastable = tuple(bool(b) for b in broadcastable)
---> 50 self.dtype_specs() # error checking is done there
51 self.name = name
c:\program files\git\theano\theano\tensor\type.py in dtype_specs(self)
269 raise TypeError("Unsupported dtype for %s: %s"
--> 270 % (self.class.name, self.dtype))
271
TypeError: Unsupported dtype for TensorType: object
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call
last) c:\program files\git\theano\theano\tensor\basic.py in
as_tensor_variable(x, name, ndim)
201 try:
--> 202 return constant(x, name=name, ndim=ndim)
203 except TypeError:
c:\program files\git\theano\theano\tensor\basic.py in constant(x,
name, ndim, dtype)
421 ret = constant_or_value(x, rtype=TensorConstant, name=name, ndim=ndim,
--> 422 dtype=dtype)
423
c:\program files\git\theano\theano\tensor\basic.py in
constant_or_value(x, rtype, name, ndim, dtype)
416 except Exception:
--> 417 raise TypeError("Could not convert %s to TensorType" % x, type(x))
418
TypeError: ('Could not convert FromFunctionOp{rate1} to TensorType',
)
During handling of the above exception, another exception occurred:
AsTensorError Traceback (most recent call
last) in ()
14 mu2 = pymc3.Exponential('mu2',lam=1.)
15 #rate1 = pymc3.switch(switchpoint >= np.arange(len(data)), mu1,mu2)
---> 16 disasters=pymc3.Poisson('disasters', mu=rate1, observed = data)
C:\Users\User\Anaconda3\lib\site-packages\pymc3\distributions\distribution.py
in new(cls, name, *args, **kwargs)
19 if isinstance(name, str):
20 data = kwargs.pop('observed', None)
---> 21 dist = cls.dist(*args, **kwargs)
22 return model.Var(name, dist, data)
23 elif name is None:
C:\Users\User\Anaconda3\lib\site-packages\pymc3\distributions\distribution.py
in dist(cls, *args, **kwargs)
32 def dist(cls, *args, **kwargs):
33 dist = object.new(cls)
---> 34 dist.init(*args, **kwargs)
35 return dist
36
C:\Users\User\Anaconda3\lib\site-packages\pymc3\distributions\discrete.py
in init(self, mu, *args, **kwargs)
185 super(Poisson, self).init(*args, **kwargs)
186 self.mu = mu
--> 187 self.mode = floor(mu).astype('int32')
188
189 def random(self, point=None, size=None, repeat=None):
c:\program files\git\theano\theano\gof\op.py in call(self,
*inputs, **kwargs)
598 """
599 return_list = kwargs.pop('return_list', False)
--> 600 node = self.make_node(*inputs, **kwargs)
601
602 if config.compute_test_value != 'off':
c:\program files\git\theano\theano\tensor\elemwise.py in
make_node(self, *inputs)
540 using DimShuffle.
541 """
--> 542 inputs = list(map(as_tensor_variable, inputs))
543 shadow = self.scalar_op.make_node(
544 *[get_scalar_type(dtype=i.type.dtype).make_variable()
c:\program files\git\theano\theano\tensor\basic.py in
as_tensor_variable(x, name, ndim)
206 except Exception:
207 str_x = repr(x)
--> 208 raise AsTensorError("Cannot convert %s to TensorType" % str_x, type(x))
209
210 # this has a different name, because _as_tensor_variable is the
AsTensorError: ('Cannot convert FromFunctionOp{rate1} to TensorType',
)
How do I handle this?
The second thing: when I use the pymc3.switch function like this:
with pymc3.Model() as dis:
    switchpoint = pymc3.DiscreteUniform('switchpoint', lower=0, upper=len(data)-1)
    mu1 = pymc3.Exponential('mu1', lam=1.)
    mu2 = pymc3.Exponential('mu2', lam=1.)
    rate1 = pymc3.switch(switchpoint >= np.arange(len(data)), mu1, mu2)
    disasters = pymc3.Poisson('disasters', mu=rate1, observed=data)
And then try to sample:
with dis:
    step1 = pymc3.NUTS([mu1, mu2])
    step2 = pymc3.Metropolis([switchpoint])
    trace = pymc3.sample(10000, step=[step1, step2])
I get an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
c:\program files\git\theano\theano\compile\function_module.py in __call__(self, *args, **kwargs)
858 try:
--> 859 outputs = self.fn()
860 except Exception:
TypeError: expected type_num 9 (NPY_INT64) got 7
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-4-3247d908f897> in <module>()
2 step1 = pymc3.NUTS([mu1, mu2])
3 step2 = pymc3.Metropolis([switchpoint])
----> 4 trace = pymc3.sample(10000, step = [step1,step2])
C:\Users\User\Anaconda3\lib\site-packages\pymc3\sampling.py in sample(draws, step, start, trace, chain, njobs, tune, progressbar, model, random_seed)
153 sample_args = [draws, step, start, trace, chain,
154 tune, progressbar, model, random_seed]
--> 155 return sample_func(*sample_args)
156
157
C:\Users\User\Anaconda3\lib\site-packages\pymc3\sampling.py in _sample(draws, step, start, trace, chain, tune, progressbar, model, random_seed)
162 progress = progress_bar(draws)
163 try:
--> 164 for i, strace in enumerate(sampling):
165 if progressbar:
166 progress.update(i)
C:\Users\User\Anaconda3\lib\site-packages\pymc3\sampling.py in _iter_sample(draws, step, start, trace, chain, tune, model, random_seed)
244 if i == tune:
245 step = stop_tuning(step)
--> 246 point = step.step(point)
247 strace.record(point)
248 yield strace
C:\Users\User\Anaconda3\lib\site-packages\pymc3\step_methods\compound.py in step(self, point)
11 def step(self, point):
12 for method in self.methods:
---> 13 point = method.step(point)
14 return point
C:\Users\User\Anaconda3\lib\site-packages\pymc3\step_methods\arraystep.py in step(self, point)
116 bij = DictToArrayBijection(self.ordering, point)
117
--> 118 apoint = self.astep(bij.map(point))
119 return bij.rmap(apoint)
120
C:\Users\User\Anaconda3\lib\site-packages\pymc3\step_methods\metropolis.py in astep(self, q0)
123
124
--> 125 q_new = metrop_select(self.delta_logp(q,q0), q, q0)
126
127 if q_new is q:
c:\program files\git\theano\theano\compile\function_module.py in __call__(self, *args, **kwargs)
869 node=self.fn.nodes[self.fn.position_of_error],
870 thunk=thunk,
--> 871 storage_map=getattr(self.fn, 'storage_map', None))
872 else:
873 # old-style linkers raise their own exceptions
c:\program files\git\theano\theano\gof\link.py in raise_with_op(node, thunk, exc_info, storage_map)
312 # extra long error message in that case.
313 pass
--> 314 reraise(exc_type, exc_value, exc_trace)
315
316
C:\Users\User\Anaconda3\lib\site-packages\six.py in reraise(tp, value, tb)
656 value = tp()
657 if value.__traceback__ is not tb:
--> 658 raise value.with_traceback(tb)
659 raise value
660
c:\program files\git\theano\theano\compile\function_module.py in __call__(self, *args, **kwargs)
857 t0_fn = time.time()
858 try:
--> 859 outputs = self.fn()
860 except Exception:
861 if hasattr(self.fn, 'position_of_error'):
TypeError: expected type_num 9 (NPY_INT64) got 7
Apply node that caused the error: Elemwise{Composite{Switch(GE(i0, i1), i2, i3)}}(InplaceDimShuffle{x}.0, TensorConstant{[ 0 1..1098 1099]}, InplaceDimShuffle{x}.0, InplaceDimShuffle{x}.0)
Toposort index: 11
Inputs types: [TensorType(int64, (True,)), TensorType(int32, vector), TensorType(float64, (True,)), TensorType(float64, (True,))]
Inputs shapes: [(1,), (1100,), (1,), (1,)]
Inputs strides: [(4,), (4,), (8,), (8,)]
Inputs values: [array([549]), 'not shown', array([ 1.07762995]), array([ 1.01502801])]
Outputs clients: [[Elemwise{eq,no_inplace}(Elemwise{Composite{Switch(GE(i0, i1), i2, i3)}}.0, TensorConstant{(1,) of 0}), Elemwise{Composite{Switch(GE(i0, i1), ((Switch(i2, i3, (i4 * log(i0))) - i5) - i0), i3)}}[(0, 0)](Elemwise{Composite{Switch(GE(i0, i1), i2, i3)}}.0, TensorConstant{(1,) of 0}, InplaceDimShuffle{x}.0, TensorConstant{(1,) of -inf}, TensorConstant{[ 13. 13... 0. 1.]}, TensorConstant{[ 22.55216... ]})]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
As a simple analyst, do I need to learn all this theano machinery to be able to work on my statistical problems? Is a new MCMC sampler with gradient support the only thing that should motivate me to switch from pymc2 to pymc3?

For your first question, it looks like you're passing the theano function itself as the mu argument. You need to call the function with the other variables as arguments; the call returns a theano variable. Try changing your line to
disasters=pymc3.Poisson('disasters', mu=rate1(switchpoint, mu1, mu2), observed = data)
I couldn't reproduce the error in your second part; the sampling worked just fine for me.
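One thing to watch when moving between the two formulations in the question: the slicing in rate1 and the condition `switchpoint >= np.arange(len(data))` do not assign the same rate at the switchpoint index itself. Here is a rough sketch in plain Python, with no theano or pymc3 involved. The function names `rate_by_slicing` and `rate_by_switch` are made up for illustration and only imitate the two expressions elementwise.

```python
def rate_by_slicing(sw, mu1, mu2, n):
    """Imitate rate1 from the question: indices below sw get mu1, the rest mu2."""
    out = [0.0] * n
    out[:sw] = [mu1] * sw
    out[sw:] = [mu2] * (n - sw)
    return out


def rate_by_switch(sw, mu1, mu2, n):
    """Imitate switch(switchpoint >= arange(n), mu1, mu2) element by element."""
    return [mu1 if sw >= i else mu2 for i in range(n)]


sliced = rate_by_slicing(2, 15.0, 2.0, 5)
switched = rate_by_switch(2, 15.0, 2.0, 5)
print(sliced)    # [15.0, 15.0, 2.0, 2.0, 2.0]
print(switched)  # [15.0, 15.0, 15.0, 2.0, 2.0]
```

The two agree everywhere except at index sw, so if the results of both models are to be compared, the condition would have to be `>` rather than `>=` (or the slice shifted by one).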
