Dask Distributed: Getting some errors after computations - python

I am running Dask Distributed on Linux CentOS 7 with Python 3.6.2. My computation seems to be going fine (I am still improving my code, but I am already able to get some results), but I keep getting Python errors apparently linked to the tornado module. I am only launching a one-node standalone Dask distributed cluster.
Here is the most common example:
Exception in thread Client loop:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/site-packages/tornado/ioloop.py", line 832, in start
self._run_callback(self._callbacks.popleft())
AttributeError: 'NoneType' object has no attribute 'popleft'
And here is another one:
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f752ce6d6a0>>
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
return self.callback()
File "/usr/local/lib/python3.6/site-packages/distributed/stealing.py", line 248, in balance
sat = s.rprocessing[key]
KeyError: 'read-block-9024000000-e3fefd2110094168cc0505db69b326e0'
Do you have any idea why? Should I close some connections or stop the standalone cluster?

Yes, if you don't close down the Tornado IOLoop before exiting the process then it can die in an unpleasant way. Fortunately this shouldn't affect your application, except by looking unpleasant.
You might submit a bug report about this; it's still something that we should fix.
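For reference, here is a minimal sketch of shutting a standalone cluster down explicitly before the process exits, which avoids the IOLoop being torn down abruptly during interpreter shutdown. The cluster size and the toy computation are illustrative, and the context-manager form assumes a reasonably recent version of distributed:
from dask.distributed import Client, LocalCluster

# Start a one-node standalone cluster and make sure both the client and
# the cluster are closed before the process exits, so the Tornado IOLoop
# is stopped cleanly instead of dying at interpreter exit.
with LocalCluster(n_workers=4, threads_per_worker=1) as cluster:
    with Client(cluster) as client:
        result = client.submit(sum, [1, 2, 3]).result()
        print(result)

# Equivalently, without context managers:
#   client = Client(cluster)
#   ...do work...
#   client.close()
#   cluster.close()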

Related

GUnicorn and shared dictionary on REST API: "Ran out of input" Error on high load

I am using a manager.dict to synchronize some data between multiple workers of an API served with GUnicorn (with Meinheld workers). While this works fine for a few concurrent queries, it breaks when I fire about 100 queries simultaneously at the API, and the following stack trace is displayed:
2020-07-16 12:35:38,972-app.api.my_resource-ERROR-140298393573184-on_post-175-Ran out of input
Traceback (most recent call last):
File "/app/api/my_resource.py", line 163, in on_post
results = self.do_something(a, b, c, **d)
File "/app/user_data/data_lookup.py", line 39, in lookup_something
return (a in self._shared_dict
File "<string>", line 2, in __contains__
File "/usr/local/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
kind, result = conn.recv()
File "/usr/local/lib/python3.6/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
EOFError: Ran out of input
2020-07-16 12:35:38,972-app.api.my_resource-ERROR-140298393573184-on_post-175-unpickling stack underflow
Traceback (most recent call last):
File "/app/api/my_resource.py", line 163, in on_post
results = self.do_something(a, b, c, **d)
File "/app/user_data/data_lookup.py", line 39, in lookup_something
return (a in self._shared_dict
File "<string>", line 2, in __contains__
File "/usr/local/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
kind, result = conn.recv()
File "/usr/local/lib/python3.6/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
_pickle.UnpicklingError: unpickling stack underflow
My API framework is falcon. I have a dictionary containing user data that can be updated via POST requests. The architecture should stay simple, so I chose Manager.dict() from the multiprocessing package to store the data. When handling other queries, some input is checked against the contents of this dictionary (if a in self._shared_dict: ...). This is where the above-mentioned errors occur.
Why is this problem happening? It seems to be tied to the manager.dict. Besides, when I debug in PyCharm, the debugger sometimes does not evaluate any variables and just hangs indefinitely somewhere in the multiprocessing code, waiting for data.
It seems to have something to do with the Meinheld workers. When I configure GUnicorn to use the default sync worker class, this error does not occur anymore. Hence, Python multiprocessing and the Meinheld package seem not to work well in my setting.
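A minimal, self-contained sketch of the setup described above, assuming a Manager-backed dict shared between workers; the names (shared_dict, update_user_data, lookup_something) are illustrative, not the asker's actual code:
from multiprocessing import Manager

# A Manager-backed dict shared across worker processes. Every access goes
# through a proxy object that talks to the manager process over a
# connection; the EOFError above is raised when that connection returns
# no data during conn.recv().
manager = Manager()
shared_dict = manager.dict()

def update_user_data(key, value):
    # In the real application this would be called from a POST handler.
    shared_dict[key] = value

def lookup_something(a):
    # The membership test goes through the proxy's __contains__, i.e. a
    # round trip to the manager process.
    return a in shared_dict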

"Dictionary size changed during iteration" from Pebble ProcessPool

We have some parallel processing code built around Pebble; it's been working robustly for quite some time, but we seem to have run into an odd edge case.
Based on the exception trace (and the rock-simple code feeding it), I suspect it's actually a bug in Pebble, but who knows.
The code feeding the process pool is pretty trivial:
pool = ProcessPool(max_workers=10, max_tasks=10)
for path in filepaths:
    try:
        future = pool.schedule(function=self.analyse_file, args=(path,), timeout=30)
        future.add_done_callback(self.process_result)
    except Exception as e:
        print("Exception fired: " + str(e))  # NOT where the exception is firing
pool.close()
pool.join()
So in essence, we schedule a bunch of stuff to run, close out the pool, then wait for the pool to complete the scheduled tasks. NOTE: the exception is not being thrown in the schedule loop; it gets fired AFTER we call join().
This is the exception stack trace:
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 150, in task_scheduler_loop
pool_manager.schedule(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 198, in schedule
self.worker_manager.dispatch(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 327, in dispatch
self.pool_channel.send(WorkerTask(task.id, task.payload))
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/channel.py", line 66, in send
return self.writer.send(obj)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
RuntimeError: dictionary changed size during iteration
I think it's got to be some weird race condition, as the code will work flawlessly on some data sets but fail at what appears to be a random point on another dataset.
We were using pebble 4.3.1 when we first ran into the issue (the same version we'd had since the beginning); upgrading to 4.5.0 made no difference.
Has anybody run into similar issues with Pebble in the past? If so what was your fix?
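One possibility, offered here as an assumption rather than something the trace proves: the scheduler thread pickles each task's payload, and because analyse_file is a bound method, pickling it also pickles self; if the done-callback (or anything else) mutates attributes on that same object while a later task is being serialized, the pickler can see a dictionary change size mid-iteration. A hedged sketch of a workaround along those lines, using a module-level function and plain arguments (the function bodies and file names are placeholders):
from pebble import ProcessPool

def analyse_file(path):
    # Module-level function: pickling the scheduled task only has to
    # serialize `path`, not a live object whose attribute dict might be
    # mutated concurrently by a callback.
    return path, len(path)

def process_result(future):
    try:
        path, result = future.result()
        print(path, result)
    except Exception as e:
        print("Task failed:", e)

if __name__ == "__main__":
    filepaths = ["a.txt", "b.txt"]  # placeholder input
    pool = ProcessPool(max_workers=10, max_tasks=10)
    for path in filepaths:
        future = pool.schedule(analyse_file, args=(path,), timeout=30)
        future.add_done_callback(process_result)
    pool.close()
    pool.join()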

Python shelve key error from threading?

**Importantly, this error does not occur all the time. It pops up every now and then, so I don't believe my code is at fault here.**
Hello, I am using threading along with Python 3.7 shelves. I have a dictionary object inside my shelf. Is this an error caused by threading/simultaneous access? (Using python's Threading library).
Relevant code is merely:
requestLibrary[requestID]
I am certain that this works; the key has been placed inside the library. But in some instances I get this key error. I suspect it has to do with the threading.
Exception in thread Thread-14:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/shelve.py", line 111, in __getitem__
value = self.cache[key]
KeyError: '4fGdb'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/lme/Workflow/Source/busDriver.py", line 189, in busDriver
while 'workflowEndingPoint'.lower() not in str(requestLibrary[requestID]).lower():
File "/usr/local/lib/python3.6/shelve.py", line 113, in __getitem__
f = BytesIO(self.dict[key.encode(self.keyencoding)])
KeyError: b'4fGdb'
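Given the trace above, a hedged sketch of one common mitigation: serialize all shelf access through a single threading.Lock so a reader never races a writer that is mid-update (shelve itself makes no thread-safety guarantees). The file name and helper names here are illustrative:
import shelve
import threading

shelf_lock = threading.Lock()
requestLibrary = shelve.open("requests.db")  # placeholder path

def set_request(request_id, value):
    # Writers take the same lock as the readers below.
    with shelf_lock:
        requestLibrary[request_id] = value

def get_request(request_id):
    # Readers take the lock too, so lookups never see a half-written entry.
    with shelf_lock:
        return requestLibrary.get(request_id)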

Python: paramiko SCPException connection timeout error

From one Windows server I have a Python script that connects to a remote Linux host and transfers some data via SSH/SCP. The script is scheduled to execute every morning via the Windows Task Scheduler of the local server.
The problem I have is that sometimes (not always, strangely enough, and in the last few days this happens more often) the execution never completes, as I get a connection timeout error. From the log of the script:
Traceback (most recent call last):
File "D:\App\Anaconda3\lib\site-packages\paramiko\channel.py", line 665, in recv
out = self.in_buffer.read(nbytes, self.timeout)
File "D:\App\Anaconda3\lib\site-packages\paramiko\buffered_pipe.py", line 160, in read
raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\App\Anaconda3\lib\site-packages\scp.py", line 314, in _recv_confirm
msg = self.channel.recv(512)
File "D:\App\Anaconda3\lib\site-packages\paramiko\channel.py", line 667, in recv
raise socket.timeout()
socket.timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "script.py", line 235, in <module>
copy_file_to_remote(LOCAL_FOLDER, file_path, DESTINATION_FOLDER, ssh)
File "script.py", line 184, in copy_file_to_remote
scp.put(win_path, linux_path)
File "D:\App\Anaconda3\lib\site-packages\scp.py", line 154, in put
self._send_files(files)
File "D:\App\Anaconda3\lib\site-packages\scp.py", line 255, in _send_files
self._recv_confirm()
File "D:\App\Anaconda3\lib\site-packages\scp.py", line 316, in _recv_confirm
raise SCPException('Timout waiting for scp response')
scp.SCPException: Timout waiting for scp response
My question is whether it is possible to increase the connection timeout limit in the ssh/scp functions used in the script, or in general how I can make my script reestablish the connection and keep it open with keepalives or something similar.
It would also be nice if there were a way to know on which side of the connection the problem lies, the local server or the remote machine. This could help a lot with troubleshooting. Any ideas/help very much appreciated!
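A hedged sketch of where timeouts and keepalives can usually be configured with paramiko and the scp module; the host, credentials, paths and numeric values are placeholders, and the exact keyword arguments depend on your paramiko/scp versions:
import paramiko
from scp import SCPClient

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# Timeouts for establishing the session (values are illustrative).
ssh.connect("linux-host", username="user", password="secret",
            timeout=30, banner_timeout=30, auth_timeout=30)

# Ask the transport to send keepalive packets so long idle periods do not
# let a firewall or NAT device silently drop the session.
ssh.get_transport().set_keepalive(30)

# socket_timeout controls how long the SCP channel waits for a response
# before raising the SCPException seen in the log above.
win_path, linux_path = "D:\\data\\report.csv", "/data/report.csv"  # placeholders
scp = SCPClient(ssh.get_transport(), socket_timeout=60.0)
scp.put(win_path, linux_path)
scp.close()
ssh.close()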

Finding exception in python multiprocessing

I have a bit of python code that looks like this:
from multiprocessing import Pool, cpu_count

procs = cpu_count() - 1
if serial or procs == 1:
    results = map(do_experiment, experiments)
else:
    pool = Pool(processes=procs)
    results = pool.map(do_experiment, experiments)
It runs fine when I set the serial flag, but it gives the following error when the Pool is used. When I try to print something from do_experiment nothing shows up, so I can't try/catch there and print a stack trace.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 530, in __bootstrap_inner
self.run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 483, in run
self.__target(*self.__args, **self.__kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 285, in _handle_tasks
put(task)
TypeError: 'NoneType' object is not callable
What is a good way to proceed with debugging this?
I went back in my git history until I found a commit where things were still working.
I had added a class to my code that extends dict so that keys can be accessed with a dot (so dict.foo instead of dict["foo"]). Multiprocessing did not take kindly to this; using an ordinary dict solved the problem.
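Since Pool.map has to pickle both the callable and every item it sends to the workers, one quick way to surface this kind of problem (suggested here as a general diagnostic, not something from the original post) is to round-trip them through pickle up front, reusing the do_experiment and experiments names from the snippet above:
import pickle

def check_picklable(obj, label):
    # Round-trip through pickle so a problematic object fails here with a
    # readable error instead of inside the pool's handler thread.
    try:
        pickle.loads(pickle.dumps(obj))
    except Exception as e:
        print("Not picklable (%s): %r" % (label, e))

check_picklable(do_experiment, "do_experiment")
for i, exp in enumerate(experiments):
    check_picklable(exp, "experiments[%d]" % i)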
