Compare methods to terminate a running function after a time period - python

I have a program that opens a lot of URLs and downloads pictures.
One function of the program manages opening the links and downloading the pictures; it contains a for loop and performs some operations on a priority queue. I want to run this function, but for no longer than a set time period. For example, if the function has been running for more than 1 hour, I want to terminate it and run the rest of the program (the other functions).
I was looking for solutions and found two questions here on Stack Overflow.
The first solution uses only the time module: First solution
The second also uses the multiprocessing module: Second solution
Can someone suggest which one is more appropriate for my program? Here is pseudocode for my function:
def fun():
    for link in linkList:
        if link not in queue:
            queue.push(link)
        else:
            queue.updatePriority(link)
    if queue:
        top = queue.pop()
        fun(top)
This function is called from another function:
def run(startLink):
    fun(startLink)
And the run() function is called from another module.
Which method is better to use with a program which contains a lot of modules and performs a lot of

The asyncio module is ideal for this task.
You can create a future, then use asyncio.wait which supports a timeout parameter.
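A minimal sketch of that idea, assuming the work can be written as a coroutine that awaits between steps. The names crawl and run_with_timeout are hypothetical, and asyncio.sleep stands in for an awaitable download. Note that a timeout only takes effect at an await, so a blocking download call inside the loop would not be interrupted.

```python
import asyncio


async def crawl(links, seen):
    """Hypothetical stand-in for fun(): handles one link per loop step."""
    for link in links:
        await asyncio.sleep(0.01)   # placeholder for an awaitable download
        seen.append(link)


async def run_with_timeout(links, timeout):
    """Run crawl() for at most `timeout` seconds; return (seen, timed_out)."""
    seen = []
    task = asyncio.create_task(crawl(links, seen))
    done, pending = await asyncio.wait([task], timeout=timeout)
    for unfinished in pending:
        unfinished.cancel()         # asyncio.wait does not cancel on timeout
    if pending:
        # Let the cancellation finish before returning.
        await asyncio.gather(*pending, return_exceptions=True)
    return seen, bool(pending)
```

Calling asyncio.run(run_with_timeout(links, 3600)) would then give the caller whatever was processed within the hour, and the rest of the program can continue from there.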

Using multiprocessing here would be a little tricky, because fun is consuming a priority queue (I'm assuming a Queue.PriorityQueue) that comes from some other part of the program. That queue cannot easily be passed between processes. You would need to create a custom multiprocessing.managers.BaseManager subclass, register the Queue.PriorityQueue class with it, start the manager server, instantiate a PriorityQueue on the server, and use a proxy to that instance everywhere you interact with the queue. That is a lot of overhead, and it also hurts performance a bit.
Since it appears you don't actually want any concurrency here - you want the rest of the program to stop while fun is running - I don't think there's a compelling reason to use multiprocessing. Instead, I think using the time-based solution makes more sense.
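A sketch of what the time-based approach could look like, written as a loop instead of recursion. This is an illustration under assumptions, not the asker's code: crawl and get_links are hypothetical names, and heapq stands in for whatever priority queue the program really uses.

```python
import heapq
import time


def crawl(start_link, get_links, max_seconds):
    """Process links in priority order until the queue empties or time runs out.

    get_links is a hypothetical callable returning the links found at a URL.
    Returns (visited, timed_out).
    """
    deadline = time.monotonic() + max_seconds
    queue = [(0, start_link)]       # (priority, link); lowest value first
    queued = {start_link}
    visited = []
    while queue:
        if time.monotonic() >= deadline:
            return visited, True    # give up and let the caller move on
        priority, link = heapq.heappop(queue)
        visited.append(link)
        for found in get_links(link):
            if found not in queued:
                queued.add(found)
                heapq.heappush(queue, (priority + 1, found))
    return visited, False
```

The deadline is only checked between links, so a single slow download can still overrun the limit; giving each request its own network timeout keeps that overrun bounded.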

Related

Use twisted with custom main loop [duplicate]

I have an existing program that has its own main loop, and does computations based on input it receives - let's say from the user, to make it simple. I want to now do the computations remotely instead of locally, and I decided to implement the RPCs in Twisted.
Ideally I just want to change one of my functions, say doComputation(), to make a call to twisted to perform the RPC, get the results, and return. The rest of the program should stay the same. How can I accomplish this, though? Twisted hijacks the main loop when I call reactor.run(). I also read that you don't really have threads in twisted, that all the tasks run in sequence, so it seems I can't just create a LoopingCall and run my main loop that way.
You have a couple of different options, depending on what sort of main loop your existing program has.
If it's a mainloop from a GUI library, Twisted may already have support for it. In that case, you can just go ahead and use it.
You could also write your own reactor. There isn't a lot of great documentation for this, but you can look at the way that qtreactor implements a reactor plugin externally to Twisted.
You can also write a minimal reactor using threadedselectreactor. The documentation for this is also sparse, but the wxpython reactor is implemented using it. Personally I wouldn't recommend this approach as it is difficult to test and may result in confusing race conditions, but it does have the advantage of letting you leverage almost all of Twisted's default networking code with only a thin layer of wrapping.
If you are really sure that you don't want your doComputation to be asynchronous, and you want your program to block while waiting for Twisted to answer, do the following:
Start Twisted in another thread before your main loop starts up, with something like twistedThread = Thread(target=reactor.run); twistedThread.start().
Instantiate an object to do your RPC communication (let's say, RPCDoer) in your own main loop's thread, so that you have a reference to it. Make sure to actually kick off its Twisted logic with reactor.callFromThread so you don't need to wrap all of its Twisted API calls.
Implement RPCDoer.doRPC to return a Deferred, using only Twisted API calls (i.e. don't call into your existing application code, so you don't need to worry about thread safety for your application objects; pass doRPC all the information that it needs as arguments).
You can now implement doComputation like this:
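from twisted.internet import reactor
from twisted.internet.threads import blockingCallFromThread
def doComputation(self):
    # Blocks this (non-reactor) thread until the Deferred from doRPC fires.
    rpcResult = blockingCallFromThread(reactor, self.myRPCDoer.doRPC)
    return self.computeSomethingFrom(rpcResult)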
Remember to call reactor.callFromThread(reactor.stop); twistedThread.join() from your main-loop's shutdown procedure, otherwise you may see some confusing tracebacks or log messages on exit.
Finally, one option that you should really consider, especially in the long term: dump your existing main loop, and figure out a way to just use Twisted's. In my experience this is the right answer for 9 out of 10 askers of questions like this. I'm not saying that this is always the way to go - there are plenty of cases where you really need to keep your own main loop, or where it's just way too much effort to get rid of the existing loop. But, maintaining your own loop is work too. Keep in mind that the Twisted loop has been extensively tested by millions of users and used in a huge variety of environments. If your loop is also extremely mature, that may not be a big deal, but if you're writing a small, new program, the difference in reliability may be significant.
It seems like the correct and very simple answer here is a LoopingCall:
http://www.saltycrane.com/blog/2008/10/running-functions-periodically-using-twisteds-loopingcall/
from datetime import datetime
from twisted.internet.task import LoopingCall
from twisted.internet import reactor
def doComputation():
    print("Custom fn run at", datetime.now())
lc = LoopingCall(doComputation)
lc.start(0.1)  # run your own loop 10 times a second
# put your other twisted here
reactor.run()

using multiple threads in Python

I'm trying to solve a problem, where I have many (on the order of ten thousand) URLs, and need to download the content from all of them. I've been doing this in a "for link in links:" loop up till now, but the amount of time it's taking is now too long. I think it's time to implement a multithreaded or multiprocessing approach. My question is, what is the best approach to take?
I know about the Global Interpreter Lock, but since my problem is network-bound, not CPU-bound, I don't think that will be an issue. I need to pass data back from each thread/process to the main thread/process. I don't need help implementing whatever approach (Terminate multiple threads when any thread completes a task covers that), I need advice on which approach to take. My current approach:
data_list = get_data(...)
output = []
for datum in data_list:
    output.append(get_URL_data(datum))
return output
There's no other shared state.
I think the best approach would be to have a queue with all the data in it, and have several worker threads pop from the input queue, get the URL data, then push onto an output queue.
Am I right? Is there anything I'm missing? This is my first time implementing multithreaded code in any language, and I know it's generally a Hard Problem.
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default) as well as a function you want to run on each unit of work. Then you put every unit of work (in your case, the URLs) in a list and give it to the worker pool.
Your output will be a list of the return values of your worker function, one for every item of work in your original list. All the cool multi-processing goodness will happen in the background. There are of course other ways of working with the worker pool as well, but this is my favourite one.
Happy multi-processing!
The best approach I can think of for your use case is to use a thread pool and maintain a work queue. The threads in the thread pool get work from the work queue, do the work, and then go get some more work. This way you can finely control the number of threads working on your URLs.
So, create a WorkQueue, which in your case is basically a list containing the URLs that need to be downloaded.
Create a thread pool, which creates the number of threads you specify, fetches work from the WorkQueue, and assigns it to a thread. Each time a thread finishes and returns, you check whether the work queue has more work and, if so, assign work to that thread again. You may also want to add a hook so that every time work is added to the work queue, the pool assigns it to a free thread if one is available.
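The work-queue idea above can be sketched with the standard library alone. This is one possible shape, not the only one: fetch_all is a hypothetical name, and fetch is assumed to be whatever function downloads a single URL.

```python
import queue
import threading


def fetch_all(items, fetch, num_workers=8):
    """Run fetch(item) for every item on a small pool of worker threads.

    fetch is a hypothetical callable (for example one that downloads a URL).
    Results come back in the same order as items.
    """
    items = list(items)
    work = queue.Queue()
    for index, item in enumerate(items):
        work.put((index, item))
    results = [None] * len(items)

    def worker():
        while True:
            try:
                index, item = work.get_nowait()
            except queue.Empty:
                return                  # no work left; the thread exits
            try:
                results[index] = fetch(item)
            except Exception as exc:    # one bad URL should not kill the worker
                results[index] = exc

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Each worker writes only to its own slots in results, so no extra locking is needed here. Failed items are returned as exception objects so the caller can decide whether to retry them.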
The fastest and most efficient method of doing IO-bound tasks like this is an asynchronous event loop. libcurl can do this, and there is a Python wrapper for it called pycurl. Using its "multi" interface you can do high-performance client activities. I have done over 1000 simultaneous fetches as fast as one.
However, the API is quite low-level and difficult to use. There is a simplifying wrapper here, which you can use as an example.

Python : run multiple queries in parallel and get the first finished

I am trying to create a Python script that performs queries against multiple sites. The script works well (I use urllib2), but only for one link. For multiple sites, I make the requests one after the other, which is not very efficient.
What is the ideal solution (threads, I guess) to run multiple queries in parallel and stop the others when one query returns a specific string?
I found this question, but I have not found how to change it to stop the remaining threads:
Python urllib2.urlopen() is slow, need a better way to read several urls
Thank you in advance!
(Sorry if I made mistakes in English, I'm French ^^)
You can use Twisted to deal with multiple requests concurrently. Internally it will use epoll (or IOCP or kqueue, depending on the platform) to get notified of TCP availability efficiently, which is cheaper than using threads. Once one request matches, you cancel the others.
Here is the Twisted http agent tutorial.
Usually this is implemented with the following pattern (sorry, my Python skills are not so good).
You have a class named Runner. This class has a long-running method, which gets the information you need. It also has a Cancel method, which interrupts the long-running method in some way (you can make the URL request object a member field, so that the Cancel method calls the equivalent of request.terminate()).
The long-running method needs to accept a callback function, which it calls to signal that it is done.
Then, before you start your threads, you create all the instances of that class and keep them in a list. In the same loop you can start the long-running methods, passing each one a callback method of your main program.
In the callback method, you just go through the list of all the threaded objects and call their Cancel method.
Please edit my answer with any Python-specific implementation :)
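One possible Python rendering of this pattern, offered as a sketch only. Runner follows the description above, while first_match and the query(cancelled) signature are assumptions made for the example. Cancellation here is cooperative: each query is expected to check the cancelled event, because Python cannot forcibly stop a thread.

```python
import threading


class Runner(threading.Thread):
    """Runs one query in a thread and reports a match through a callback."""

    def __init__(self, query, on_match):
        super().__init__()
        # Hypothetical contract: query(cancelled) returns a result or None
        # and should stop early once cancelled.is_set() becomes true.
        self.query = query
        self.on_match = on_match
        self.cancelled = threading.Event()

    def run(self):
        result = self.query(self.cancelled)
        if result is not None and not self.cancelled.is_set():
            self.on_match(self, result)

    def cancel(self):
        self.cancelled.set()


def first_match(queries):
    """Start every query, return the first result, and cancel the rest."""
    winner = []
    lock = threading.Lock()
    runners = []

    def on_match(runner, result):
        with lock:
            if winner:
                return              # another runner already won
            winner.append(result)
        for other in runners:
            other.cancel()

    runners.extend(Runner(query, on_match) for query in queries)
    for runner in runners:
        runner.start()
    for runner in runners:
        runner.join()
    return winner[0] if winner else None
```

With real HTTP requests, a query would typically combine a short socket timeout with a check of the cancelled event between reads, since a call blocked inside the network stack cannot see the event.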
You can run your queries with the multiprocessing library, poll for results, and shut down the queries you no longer need. The documentation for the module includes information on the Process class, which has a terminate() method. If you wish to limit the number of requests sent out, check out the options for pooling.

Python: time a method call and stop it if time is exceeded

I need to dynamically load code (comes as source), run it and get the results. The code that I load always includes a run method, which returns the needed results. Everything looks ridiculously easy, as usual in Python, since I can do
exec(source)  # source includes run() definition
result = run(params)
# do stuff with result
The only problem is that the run() method in the dynamically generated code may never terminate, so I need to run it for at most x seconds. I could spawn a new thread for this and specify a timeout for the .join() method, but then I cannot easily get the result out of it (or can I?). Performance is also an issue to consider, since all of this happens inside a long while loop.
Any suggestions on how to proceed?
Edit: to clear things up per dcrosta's request: the loaded code is not untrusted, but generated automatically on the machine. The purpose for this is genetic programming.
The only "really good" solutions -- imposing essentially no overhead -- are going to be based on SIGALRM, either directly or through a nice abstraction layer; but as already remarked Windows does not support this. Threads are no use, not because it's hard to get results out (that would be trivial, with a Queue!), but because forcibly terminating a runaway thread in a nice cross-platform way is unfeasible.
This leaves high-overhead multiprocessing as the only viable cross-platform solution. You'll want a process pool to reduce process-spawning overhead (since presumably the need to kill a runaway function is only occasional, most of the time you'll be able to reuse an existing process by sending it new functions to execute). Again, Queue (the multiprocessing kind) makes getting results back easy (albeit with a modicum more caution than for the threading case, since in the multiprocessing case deadlocks are possible).
If you don't need to strictly serialize the executions of your functions, but rather can arrange your architecture to try two or more of them in parallel, AND are running on a multi-core machine (or multiple machines on a fast LAN), then suddenly multiprocessing becomes a high-performance solution, easily paying back for the spawning and IPC overhead and more, exactly because you can exploit as many processors (or nodes in a cluster) as you can use.
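For the SIGALRM route mentioned at the start of this answer, a minimal sketch might look like the following. It assumes a Unix system and that the call is made from the main thread, since Python only allows signal handlers to be installed there. The names time_limit and run_with_limit are hypothetical.

```python
import signal
from contextlib import contextmanager


class ExecutionTimeout(Exception):
    """Raised when the guarded block runs longer than allowed."""


@contextmanager
def time_limit(seconds):
    """Raise ExecutionTimeout in the main thread after `seconds` (Unix only)."""
    def on_alarm(signum, frame):
        raise ExecutionTimeout()

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)   # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)   # restore the old handler


def run_with_limit(func, seconds, default=None):
    """Call func(); return `default` if it exceeds the time limit."""
    try:
        with time_limit(seconds):
            return func()
    except ExecutionTimeout:
        return default
```

The exception is raised wherever the code happens to be when the alarm fires, so this fits best when the guarded function holds no resources that need careful cleanup.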
You could use the multiprocessing library to run the code in a separate process, and call .join() on the process to wait for it to finish, with the timeout parameter set to whatever you want. The library provides several ways of getting data back from another process - using a Value object (seen in the Shared Memory example on that page) is probably sufficient. You can use the terminate() call on the process if you really need to, though it's not recommended.
You could also use Stackless Python, as it allows for cooperative scheduling of microthreads. Here you can specify a maximum number of instructions to execute before returning. Setting up the routines and getting the return value out is a little more tricky though.
"I could spawn a new thread for this, and specify a time for .join() method, but then I cannot easily get the result out of it"
If the timeout expires, that means the method didn't finish, so there's no result to get. If you have incremental results, you can store them somewhere and read them out however you like (keeping threadsafety in mind).
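To make the thread-plus-join idea concrete, here is a small sketch that passes the result back through a queue. call_with_timeout is a hypothetical helper, not part of any library. On timeout the worker thread is abandoned rather than stopped, so the runaway function keeps consuming CPU until it returns or the program exits.

```python
import queue
import threading


def call_with_timeout(func, args=(), timeout=1.0):
    """Run func(*args) in a daemon thread; return (finished, result)."""
    results = queue.Queue(maxsize=1)

    def target():
        try:
            results.put((True, func(*args)))
        except Exception as exc:
            results.put((False, exc))   # hand the error back to the caller

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return False, None      # still running; abandoned, not killed
    ok, value = results.get_nowait()
    if not ok:
        raise value             # re-raise the function's own exception
    return True, value
```

Marking the thread as a daemon lets the interpreter exit even if an abandoned call never finishes.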
Using SIGALRM-based systems is dicey, because it can deliver async signals at any time, even during an except or finally handler where you're not expecting one. (Other languages deal with this better, unfortunately.) For example:
try:
    pass  # code
finally:
    cleanup1()
    cleanup2()
    cleanup3()
A signal passed up via SIGALRM might happen during cleanup2(), which would cause cleanup3() to never be executed. Python simply does not have a way to terminate a running thread in a way that's both uncooperative and safe.
You should just have the code check the timeout on its own.
import threading
from datetime import datetime, timedelta
local = threading.local()  # per-thread storage, so each thread has its own deadline
class ExecutionTimeout(Exception):
    pass
def start(max_duration=timedelta(seconds=1)):
    local.start_time = datetime.now()
    local.max_duration = max_duration
def check():
    if datetime.now() - local.start_time > local.max_duration:
        raise ExecutionTimeout()
def do_work():
    start()
    while True:
        check()
        # do stuff here
    return 10
try:
    print(do_work())
except ExecutionTimeout:
    print("Timed out")
(Of course, this belongs in a module, so the code would actually look like "timeout.start()"; "timeout.check()".)
If you're generating code dynamically, then generate a timeout.check() call at the start of each loop.
Consider using the stopit package, which can be useful in some cases where you need timeout control. Its documentation emphasizes the limitations.
https://pypi.python.org/pypi/stopit
A quick Google search for "python timeout" reveals a TimeoutFunction class.
Executing untrusted code is dangerous, and should usually be avoided unless it's impossible to do so. I think you're right to be worried about the time of the run() method, but the run() method could do other things as well: delete all your files, open sockets and make network connections, begin cracking your password and email the result back to an attacker, etc.
Perhaps if you can give some more detail on what the dynamically loaded code does, the SO community can help suggest alternatives.
