Running long blocking calculations in parallel in twisted - python

I am trying to learn the Twisted framework, but I am not able to get a handle on it.
Say I have this function:
import time

def long_blocking_call(arg1, arg2):
    # do something
    time.sleep(5)  # simulate blocking call
    return result

results = []
for k, v in args.iteritems():
    r = long_blocking_call(k, v)
    results.append(r)
But I was wondering how I can leverage deferToThread (or something else in the Twisted world) to run long_blocking_call in "parallel".
I found this example: Periodically call deferToThread
But I am not exactly sure whether that actually runs things in parallel.

deferToThread uses Python's built-in threading support to run the function passed to it in a separate thread (from a thread pool).
So deferToThread has all of the same properties as the built-in threading module when it comes to parallelism. On CPython, the Global Interpreter Lock means that at most one thread can be executing Python bytecode at any moment, so threads only run in parallel while all but one of them have released the GIL (for example, while blocked on I/O or inside C code that releases it).
Since there is no universal cause of "blocking", there is also no universal solution to "blocking", so there's no way to say in general whether deferToThread will result in parallel execution. A general rule of thumb, however, is that if the blocking comes from I/O it probably will, and if it comes from computation it probably won't.
Of course, if it comes from I/O, you might be better off using some other feature from Twisted instead of multithreading.
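For concreteness, here is a minimal sketch (not from the question) of how the loop above could be driven through deferToThread and gatherResults; the sample_args dict and the printing callback are just placeholders. Note that time.sleep releases the GIL, so these calls really do overlap, whereas a pure-Python computation generally would not.
import time
from twisted.internet import defer, reactor
from twisted.internet.threads import deferToThread

def long_blocking_call(arg1, arg2):
    time.sleep(5)  # simulate blocking call
    return (arg1, arg2)

def main():
    sample_args = {"a": 1, "b": 2}  # placeholder for the asker's args dict
    # One Deferred per call; each call runs on a thread from the reactor's pool.
    deferreds = [deferToThread(long_blocking_call, k, v) for k, v in sample_args.items()]
    d = defer.gatherResults(deferreds)
    d.addCallback(lambda results: print(results))
    d.addBoth(lambda _: reactor.stop())

reactor.callWhenRunning(main)
reactor.run()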

Related

What is the difference between using Python's threading versus async/await

I'm trying to get more familiar with the usage of asyncio in Python 3, but I don't see when I should use async/await versus threading. Is there a difference, or is one just easier to use than the other?
So, for example, between these two snippets, is one better than the other?
Just generic code:
import threading

def func1(): ...
def func2(): ...

threads = [threading.Thread(target=func1), threading.Thread(target=func2)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
versus
import asyncio

async def func1(): ...
async def func2(): ...

async def main():
    await asyncio.gather(func1(), func2())

asyncio.run(main())
My advice is to use threads in all cases, unless you're writing a high-performance concurrent application and benchmarking shows that threading alone is too slow.
In principle there are a lot of things to like about coroutines. They're easier to reason about since every sequence of nonblocking operations can be treated as atomic (at least if you don't combine them with threads); and because they're cheap, you can spin them up willy-nilly for better separation of concerns, without worrying about killing your performance.
However, the actual implementation of coroutines in Python is a mess. The biggest problem is that every function that might block, or that might call any code that might block, or that might call any code that ... might call any code that might block, has to be rewritten with the async and await keywords. This means that if you want to use a library that blocks, or that calls back into your code and you want to block in the callbacks, then you just can't use that library at all. There are duplicate copies of a bunch of libraries in the CPython distribution now for this reason, and even duplicate copies of built-in syntax (async for, etc.), but you can't expect most of the modules available through pip to be maintained in two versions.
Threading doesn't have this problem. If a library wasn't designed to be thread safe, you can still use it in a multithreaded program, either by confining its use to one thread or by protecting all uses with a lock.
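As a small illustration of the lock approach, here is a sketch using sqlite3 (whose connections are not meant to be shared across threads by default); the table and function names are just for the example:
import sqlite3
import threading

# One shared connection, with every use serialized behind a single lock.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE hits (n INTEGER)")
db_lock = threading.Lock()

def record_hit(n):
    with db_lock:  # only one thread touches the connection at a time
        conn.execute("INSERT INTO hits VALUES (?)", (n,))

threads = [threading.Thread(target=record_hit, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with db_lock:
    print(conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0])  # 10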
So threading is overall far easier to use in Python despite its problems.
There are some third party coroutine solutions that avoid the "async infection" problem, such as greenlet, but you still won't be able to use them with libraries that block internally unless they're specially designed to work with coroutines.
With asyncio, a piece of code gives back control explicitly using await; with threads, switching is handled by the operating system's scheduler. Multithreaded code has to use locks to prevent issues with shared memory. Another claimed advantage of multithreading is that it can use several cores, although in CPython the GIL limits this for pure-Python code.
Here is a great article if you want to read more:
https://medium.com/@nhumrich/asynchronous-python-45df84b82434

Twisted threading and Thread Pool Difference

What is the difference between using
from twisted.internet import reactor, threads
and just using
import thread
with a thread pool?
What is the twisted thing actually doing? Also, is it safe to use twisted threads?
What is the difference
With twisted.internet.threads, Twisted manages the threads and a thread pool for you. This puts less of a burden on developers and lets them focus on the business logic instead of the idiosyncrasies of threaded code. If you import thread yourself, then you have to manage the threads, get the results out of them, ensure the results are synchronized, make sure too many threads don't start up at once, fire a callback once a thread completes, and so on.
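A rough sketch of two of those helpers (reactor.callInThread to push work onto Twisted's thread pool, reactor.callFromThread to get safely back onto the reactor thread); the blocking_work function is just a stand-in:
import time
from twisted.internet import reactor

def report(message):
    # Runs in the reactor thread; scheduled safely from the worker thread.
    print(message)
    reactor.stop()

def blocking_work():
    time.sleep(1)  # stand-in for some blocking call
    reactor.callFromThread(report, "worker finished")

# The reactor owns the thread pool; you never create or join threads yourself.
reactor.callWhenRunning(reactor.callInThread, blocking_work)
reactor.run()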
What is the twisted thing actually doing?
It depends on what "thing" you're talking about. Can you be more specific? Twisted has various thread functions you can leverage and each may function slightly different from each other.
And is it safe to use Twisted threads?
It's absolutely safe! I'd say it's safer than managing threads yourself. Take a look at all the functionality that Twisted's threading support provides, then think about what it would take to write that code yourself. If you've ever worked with threads, you'll know that it starts off simply enough, but as your application grows, if you didn't make good decisions about threading, your application can become very complicated and messy. In general, Twisted handles threads in a uniform way, and in the way developers would expect a well-behaved threaded app to behave.
References
https://twistedmatrix.com/documents/current/core/howto/threading.html

How do I handle multiple requests to python program using zeromq and threading or async?

I have a little program which does some calculations in the background when I call it through the zerorpc module in Python 2.7.
Here is my code:
import zerorpc

is_busy = False

class Server(object):

    def calculateSomeStuff(self):
        global is_busy
        if is_busy:
            return 'I am busy!'
        is_busy = True
        # calculate some stuff
        is_busy = False
        print 'Done!'
        return

    def getIsBusy(self):
        return is_busy

s = zerorpc.Server(Server())
s.bind("tcp://0.0.0.0:66666")
s.run()
What should I change to make this program return is_busy when I call the .getIsBusy() method after .calculateSomeStuff() has started doing its job?
As far as I know, there is no way to make it asynchronous in Python 2.
You need multi-threading for real concurrency and to exploit more than one CPU core, if that is what you are after. See the Python threading module, the details of the GIL and its possible workarounds, and the literature on the topic.
If you want a cooperative solution, read on.
zerorpc uses gevent for asynchronous input/output. With gevent you write coroutines (also called greenlets or userland threads) which all run cooperatively on a single thread: the thread in which the gevent input/output loop is running. The gevent ioloop takes care of resuming coroutines that are waiting for some I/O event.
The key here is the word cooperative. Compare that to threads running on a single-CPU/single-core machine: effectively nothing runs in parallel, but the operating system preempts the running thread to execute the next one, and so on, so that every thread gets a fair chance of moving forward. This happens fast enough that it feels like all threads are running at the same time.
If you write your code to cooperate with the gevent input/output loop, you can achieve the same effect by being careful to call gevent.sleep(0) often enough to give the gevent ioloop a chance to run other coroutines.
It's literally cooperative multithreading. I've heard it was like that in Windows 2 or something.
So, in your example, in the heavy computation part you likely have some loop going on. Make sure to call gevent.sleep(0) a couple of times per second and you will have the illusion of multi-threading.
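For example, a sketch of what the calculation method might look like with that change (do_one_chunk is only a stand-in for a slice of the real computation):
import gevent

is_busy = False

def do_one_chunk(i):
    # Stand-in for a slice of the real calculation.
    sum(x * x for x in range(10000))

class Server(object):
    def calculateSomeStuff(self):
        global is_busy
        if is_busy:
            return 'I am busy!'
        is_busy = True
        for chunk in range(1000):
            do_one_chunk(chunk)
            gevent.sleep(0)  # yield so the gevent loop can serve getIsBusy()
        is_busy = False
        print('Done!')
        return

    def getIsBusy(self):
        return is_busy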
I hope my answer wasn't too confusing.

How to limit function execution time in python multithreading environment?

I have a script which runs quite a lot of concurrent threads (at least 200). Every thread does some quite complex evaluations which can take an unpredictably long time. The evaluation method is implemented in C and I can't change it. I want to limit the method execution time for every thread. Please advise.
From what I understand of your problem, it might be a good case for using multiprocessing instead of multithreading. Multiprocessing will allow you to make use of all the available resources on the system - and then some, if you're not careful.
Threads don't actually run in parallel, so unless you're doing a lot of waiting for I/O or something like that, it would make more sense to call it from a separate process. You could use the Python multiprocessing library to call it from a Python script, or you could use a wrapper written in C and use some form of interprocess communication. The second option will avoid the overhead of launching another Python instance just to run some C code.
You could call time.sleep (or perform other tasks and check the system clock for elapsed time), and then check for results after the desired interval, permitting any processes that haven't finished to continue running while you make use of the results. Or, if you don't care at that point, you can send a signal to kill the process.
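A hedged sketch of that idea with the multiprocessing library; evaluate here is only a stand-in for the real C evaluation, wrapped so it can run in a child process:
import multiprocessing

def evaluate(x, out):
    # Stand-in for the C evaluation; put the result on the queue when done.
    out.put(sum(i * i for i in range(x)))

if __name__ == "__main__":
    out = multiprocessing.Queue()
    p = multiprocessing.Process(target=evaluate, args=(10 ** 6, out))
    p.start()
    p.join(timeout=5)      # wait at most 5 seconds
    if p.is_alive():
        p.terminate()      # give up on this evaluation
        p.join()
        result = None
    else:
        result = out.get()
    print(result)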

Python: time a method call and stop it if time is exceeded

I need to dynamically load code (comes as source), run it and get the results. The code that I load always includes a run method, which returns the needed results. Everything looks ridiculously easy, as usual in Python, since I can do
exec(source) #source includes run() definition
result = run(params)
#do stuff with result
The only problem is that the run() method in the dynamically generated code can potentially not terminate, so I need to run it for at most x seconds. I could spawn a new thread for this and specify a timeout for the .join() method, but then I cannot easily get the result out of it (or can I?). Performance is also a consideration, since all of this is happening in a long while loop.
Any suggestions on how to proceed?
Edit: to clear things up per dcrosta's request: the loaded code is not untrusted, but generated automatically on the machine. The purpose for this is genetic programming.
The only "really good" solutions -- imposing essentially no overhead -- are going to be based on SIGALRM, either directly or through a nice abstraction layer; but as already remarked Windows does not support this. Threads are no use, not because it's hard to get results out (that would be trivial, with a Queue!), but because forcibly terminating a runaway thread in a nice cross-platform way is unfeasible.
This leaves high-overhead multiprocessing as the only viable cross-platform solution. You'll want a process pool to reduce process-spawning overhead (since presumably the need to kill a runaway function is only occasional, most of the time you'll be able to reuse an existing process by sending it new functions to execute). Again, Queue (the multiprocessing kind) makes getting results back easy (albeit with a modicum more caution than for the threading case, since in the multiprocessing case deadlocks are possible).
If you don't need to strictly serialize the executions of your functions, but rather can arrange your architecture to try two or more of them in parallel, AND are running on a multi-core machine (or multiple machines on a fast LAN), then suddenly multiprocessing becomes a high-performance solution, easily paying back for the spawning and IPC overhead and more, exactly because you can exploit as many processors (or nodes in a cluster) as you can use.
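A simplified sketch of that approach (it creates and throws away a one-worker pool per call, whereas the paragraph above suggests reusing the pool where possible):
import multiprocessing

def run_with_timeout(func, args, timeout):
    # Run func(*args) in a single worker process; give up after `timeout` seconds.
    pool = multiprocessing.Pool(processes=1)
    async_result = pool.apply_async(func, args)
    try:
        return async_result.get(timeout=timeout)
    except multiprocessing.TimeoutError:
        return None            # caller decides what "no result" means
    finally:
        pool.terminate()       # also kills a runaway worker
        pool.join()

if __name__ == "__main__":
    import time
    print(run_with_timeout(time.sleep, (10,), timeout=2))  # None after about 2 seconds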
You could use the multiprocessing library to run the code in a separate process, and call .join() on the process to wait for it to finish, with the timeout parameter set to whatever you want. The library provides several ways of getting data back from another process - using a Value object (seen in the Shared Memory example on that page) is probably sufficient. You can use the terminate() call on the process if you really need to, though it's not recommended.
You could also use Stackless Python, as it allows for cooperative scheduling of microthreads. Here you can specify a maximum number of instructions to execute before returning. Setting up the routines and getting the return value out is a little more tricky though.
I could spawn a new thread for this, and specify a time for .join() method, but then I cannot easily get the result out of it
If the timeout expires, that means the method didn't finish, so there's no result to get. If you have incremental results, you can store them somewhere and read them out however you like (keeping threadsafety in mind).
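For instance, a small sketch of the incremental-results idea (compute_step is a stand-in for one unit of the real work):
import queue
import threading
import time

results = queue.Queue()

def compute_step(i):
    time.sleep(0.5)  # stand-in for one unit of the real work
    return i * i

def worker():
    for i in range(20):
        results.put(compute_step(i))

t = threading.Thread(target=worker, daemon=True)
t.start()
t.join(timeout=3)  # stop waiting after 3 seconds

partial = []
while not results.empty():  # collect whatever finished in time
    partial.append(results.get_nowait())
print(partial)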
Using SIGALRM-based systems is dicey, because it can deliver async signals at any time, even during an except or finally handler where you're not expecting one. (Other languages deal with this better, unfortunately.) For example:
try:
    # code
finally:
    cleanup1()
    cleanup2()
    cleanup3()
A signal passed up via SIGALRM might happen during cleanup2(), which would cause cleanup3() to never be executed. Python simply does not have a way to terminate a running thread in a way that's both uncooperative and safe.
You should just have the code check the timeout on its own.
import threading
from datetime import datetime, timedelta

local = threading.local()

class ExecutionTimeout(Exception):
    pass

def start(max_duration=timedelta(seconds=1)):
    local.start_time = datetime.now()
    local.max_duration = max_duration

def check():
    if datetime.now() - local.start_time > local.max_duration:
        raise ExecutionTimeout()

def do_work():
    start()
    while True:
        check()
        # do stuff here
        return 10

try:
    print do_work()
except ExecutionTimeout:
    print "Timed out"
(Of course, this belongs in a module, so the code would actually look like "timeout.start()"; "timeout.check()".)
If you're generating code dynamically, then generate a timeout.check() call at the start of each loop.
Consider using the stopit package, which can be useful in cases where you need timeout control. Its documentation emphasizes the limitations.
https://pypi.python.org/pypi/stopit
A quick Google search for "python timeout" reveals a TimeoutFunction class.
Executing untrusted code is dangerous, and should usually be avoided unless it's impossible to do so. I think you're right to be worried about the time of the run() method, but the run() method could do other things as well: delete all your files, open sockets and make network connections, begin cracking your password and email the result back to an attacker, etc.
Perhaps if you can give some more detail on what the dynamically loaded code does, the SO community can help suggest alternatives.
