Advice on backgrounding a task with variables?

I have a python webapp which accepts some data via POST. The method which is called can take a while to complete (30-60s), so I would like to "background" the method so I can respond to the user with a "processing" message.
The data is quite sensitive, so I'd prefer not to use any queue-based solutions. I also want to ensure that the backgrounded method doesn't get interrupted should the webapp fail in any way.
My first thought is to fork a process, however I'm unsure how I can pass variables to a process.
I've used Gevent before, which has a handy method: gevent.spawn(function, *args, **kwargs). Is there anything like this that I could use at the process-level?
Any other advice?

The simplest approach would be to use a thread. Pass data to and from a thread with a Queue.
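As a minimal sketch of that suggestion: the helper below is hypothetical (the name spawn is borrowed from the gevent call mentioned in the question, and slow_task is a stand-in for the real method). It starts a thread with the given arguments and hands the outcome back through a Queue.

```python
import queue
import threading


def spawn(function, *args, **kwargs):
    """Run function(*args, **kwargs) in a background thread.

    Hypothetical helper shaped like gevent.spawn. Returns the thread and a
    queue that will receive ("ok", result) or ("error", exception).
    """
    results = queue.Queue(maxsize=1)

    def runner():
        try:
            results.put(("ok", function(*args, **kwargs)))
        except Exception as exc:  # report the failure instead of losing it
            results.put(("error", exc))

    thread = threading.Thread(target=runner)
    thread.start()
    return thread, results


def slow_task(payload, repeat=1):
    # Stand-in for the 30-60 s method from the question.
    return payload * repeat
```

multiprocessing.Process accepts the same target, args and kwargs arguments as threading.Thread, so Process(target=function, args=args, kwargs=kwargs) is the process-level equivalent. Note that a thread lives inside the webapp process, so it will not survive that process crashing.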

Related

threading.local from a different thread

I'm trying to make a threaded CGI webserver similar to this; however, I'm stuck on how to set local data in the handler for a different thread. Is it possible to set threading.local data, such as a dict, for a thread other than the handler? To be more specific, I want to have the request parameters, headers, etc. available from a CGI file that was started with subprocess.run. The bottom of the do_GET in this file on GitHub is what I use now, but that can only serve one client at a time. I want to replace this part because I want multiple connections/threads at once, and I need different data in each connection/thread.
Is there a way to edit/set threading.local data from a different thread? Or, if there is a better way to achieve what I am trying to do, please let me know. If you know that this is definitely impossible, say so.
Thanks in advance!

Without seeing what test code you have, and knowing what you've tried so far, I can't tell you exactly what you need to succeed. That said, I can tell you that trying to edit information in a threading.local() object from another thread is not the cleanest path to take.
Generally, the best way to send calls to other threads is through threading.Event() objects. Usually, a thread listens to an Event() object and does an action based on that. In this case, I could see having a handler set an event in the case of a GET request.
Then, in the thread that is writing the cgi file, have a function that, when the Event() object is set, records the data you need and unsets the Event() object.
So, in pseudo-code:
import threading

evt = threading.Event()

def noteTaker(evt):
    while True:
        # wait() blocks until the handler calls evt.set()
        if evt.wait():
            modifyDataYouNeed()
            f.open()
            f.write()
            f.close()
            evt.clear()

def do_GET(evt):
    print("so, a query hit your webserver")
    evt.set()
    print("and noteTaker was just called")
So, while I couldn't answer your question directly, I hope this helps some on how threads communicate and will help you infer what you need :)
threading information (as I'm sure you've read already, but for the sake of diligence) is here
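A runnable variant of the Event idea might look like the sketch below. It is an illustration under assumptions, not the answerer's code: note_taker, do_GET and demo are hypothetical stand-ins, a list replaces the file writes, and a second Event is added only so the demo can stop the worker thread.

```python
import threading
import time


def note_taker(request_arrived, stop, notes):
    """Record one note each time the handler signals a request."""
    while not stop.is_set():
        # Wait with a timeout so the loop can notice the stop flag.
        if request_arrived.wait(timeout=0.05):
            notes.append("request seen")
            request_arrived.clear()


def do_GET(request_arrived):
    """Stand-in for the HTTP handler: it only signals the other thread."""
    request_arrived.set()


def demo():
    request_arrived = threading.Event()
    stop = threading.Event()
    notes = []
    worker = threading.Thread(
        target=note_taker, args=(request_arrived, stop, notes)
    )
    worker.start()
    do_GET(request_arrived)
    # Give the worker up to two seconds to react, then shut it down.
    deadline = time.monotonic() + 2.0
    while not notes and time.monotonic() < deadline:
        time.sleep(0.01)
    stop.set()
    worker.join()
    return notes
```

An Event carries no payload and repeated set() calls before a clear() collapse into one, so per-request data (parameters, headers) still needs to travel by some other route, for example a queue.Queue.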

Python multiprocessing - function-like communication between two processes

I've got the following problem:
I have two different classes; let's call them the interface and worker. The interface is supposed to accept requests from outside, and multiplexes them to several workers.
Contrary to almost every example I have found, I have several peculiarities:
The workers are not supposed to be recreated for every request.
The workers are different; a request for workers[0] cannot be answered by workers[1]. This multiplexing is done in interface.
I have a number of function-like calls which are difficult to model via events or simple queues.
There are a few different requests, which would make one queue per request difficult.
For example, assume that each worker is storing a single integer number (let's say the number of calls this worker received). In non-parallel processing, I'd use something like this:
class interface(object):
    workers = None  # set somewhere else.

    def get_worker_calls(self, worker_id):
        return self.workers[worker_id].get_calls()

class worker(object):
    calls = 0

    def get_calls(self):
        self.calls += 1
        return self.calls
This, obviously, doesn't work. What does?
Or, maybe more relevantly, I don't have experience with multiprocessing. Is there a design paradigm I'm missing that would easily solve the above?
Thanks!
For reference, I have considered several approaches, and I was unable to find a good one:
Use one request and answer queue. I've discarded this idea since it would either block the interface for the answer time of the current worker (making it scale badly), or would require me to send around extra information.
Use of one request queue. Each message contains a pipe to return the answer to that request. After fixing the issue with being unable to send pipes via pipes, I've run into problems with pipes closing unless both ends are sent over the connection.
Use of one request queue. Each message contains a queue to return the answer to that request. This fails since I cannot send queues via queues, and the reduction trick doesn't work.
The above also applies to the respective Manager-generated objects.

Multiprocessing means you have two or more separate processes running. Unlike with multithreading, there is no way to access one process's memory directly from another.
Your best shot is to use some kind of external queue mechanism; you can start with Celery or RQ. RQ is simpler, but Celery has built-in monitoring.
But you have to know that multiprocessing will work only if Celery/RQ are able to "pack" the needed functions/classes and send them to the other process. Therefore you have to use __main__-level functions (defined at the top level of a file, not belonging to any class).
You can always implement it yourself: Redis is very simple, and ZeroMQ and RabbitMQ are also good.
The Beaver library is a good example of how to deal with multiprocessing in Python using a ZeroMQ queue.
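For comparison, a standard-library-only sketch of function-like calls is shown below. It swaps in a different technique from the one recommended above: each long-lived worker gets its own multiprocessing.Pipe, and a request is a (method name, arguments) tuple. The names Worker, worker_loop, Interface and call are hypothetical, and process_cls is an assumed hook that exists only so the same code can also be exercised with threads.

```python
import multiprocessing as mp


class Worker:
    """Per-worker state; it lives entirely inside one worker process."""

    def __init__(self):
        self.calls = 0

    def get_calls(self):
        self.calls += 1
        return self.calls


def worker_loop(conn):
    """Serve (method_name, args) requests on conn until None arrives."""
    worker = Worker()
    while True:
        request = conn.recv()
        if request is None:  # shutdown sentinel
            break
        method_name, args = request
        conn.send(getattr(worker, method_name)(*args))


class Interface:
    """Owns one long-lived worker and one private pipe per worker id."""

    def __init__(self, n_workers, process_cls=mp.Process):
        self.conns = []
        self.procs = []
        for _ in range(n_workers):
            parent_conn, child_conn = mp.Pipe()
            proc = process_cls(target=worker_loop, args=(child_conn,))
            proc.start()
            self.conns.append(parent_conn)
            self.procs.append(proc)

    def call(self, worker_id, method_name, *args):
        # Send the request to exactly one worker and wait for its reply.
        conn = self.conns[worker_id]
        conn.send((method_name, args))
        return conn.recv()

    def get_worker_calls(self, worker_id):
        return self.call(worker_id, "get_calls")

    def shutdown(self):
        for conn in self.conns:
            conn.send(None)
        for proc in self.procs:
            proc.join()
```

When real processes are used, Interface should be created under an if __name__ == "__main__": guard. In this sketch call() blocks until the addressed worker replies, so serving several workers at once would need, for instance, one calling thread per worker.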

Compare methods to terminate running a function after time period

I have a program which opens a lot of URLs and downloads pictures.
One function of the program manages opening the links and downloading the pictures; it contains a for loop and performs some operations on the priority queue. I want to run this function, but for no longer than a set time period. For example, if this function has been running for longer than 1 hour, I want to terminate it and run the rest of the program (other functions).
I was trying to find some solutions, and I found two questions here on Stack Overflow.
The first solution uses only the time module: First solution
The second also uses the multiprocessing module: Second solution
Can someone suggest which one would be more appropriate to use in my program? I will write pseudocode of my function:
def fun(startLink):
    for link in linkList:
        if link not in queue:
            queue.push(link)
        else:
            queue.updatePriority(link)
    if queue:
        top = queue.pop()
        fun(top)
This function is called in other function:
def run(startLink):
    fun(startLink)
And the run() function is called in other module.
Which method is better to use with a program which contains a lot of modules and performs a lot of

The asyncio module is ideal for this task.
You can create a future, then use asyncio.wait which supports a timeout parameter.
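A minimal sketch of that approach, assuming the work can be written as a coroutine: crawl is a hypothetical stand-in that just sleeps, and run_with_timeout is an illustrative helper, not part of asyncio.

```python
import asyncio


async def crawl(delay):
    # Hypothetical stand-in for the link-processing work.
    await asyncio.sleep(delay)
    return "finished"


async def run_with_timeout(coro, timeout):
    """Return the coroutine's result, or None if it exceeds the timeout."""
    task = asyncio.ensure_future(coro)
    done, pending = await asyncio.wait([task], timeout=timeout)
    if task in pending:
        # asyncio.wait does not cancel on timeout, so do it explicitly.
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
        return None
    return task.result()
```

This only interrupts code that yields control to the event loop (at an await); a blocking download call inside the coroutine would not be cut short.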

Using multiprocessing here would be a little bit tricky, because fun is consuming a priority queue (I'm assuming a Queue.PriorityQueue) that is coming from some other part of the program. That queue cannot easily be passed between processes - you would need to create a custom multiprocessing.BaseManager subclass, register the Queue.PriorityQueue class with it, and start up the Manager server, instantiate a PriorityQueue on the server, and use a Proxy to that instance everywhere you interact with the queue. That's a lot of overhead, and also hurts performance a bit.
Since it appears you don't actually want any concurrency here - you want the rest of the program to stop while fun is running - I don't think there's a compelling reason to use multiprocessing. Instead, I think using the time-based solution makes more sense.
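The time-based idea can be sketched as a deadline check inside the loop. This is an illustration under assumptions: process_links and handle are hypothetical names, and a plain list stands in for the priority queue from the question.

```python
import time


def process_links(links, time_limit, handle=lambda link: None):
    """Handle links in order until done or until time_limit seconds pass."""
    deadline = time.monotonic() + time_limit
    processed = []
    for link in links:
        # Check the clock before starting each unit of work.
        if time.monotonic() >= deadline:
            break
        handle(link)
        processed.append(link)
    return processed
```

The check runs only between items, so a single slow download can still overrun the limit; giving each request its own network timeout would bound that.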

Python : run multiple queries in parallel and get the first finished

I am trying to create a Python script that performs queries to multiple sites. The script works well (I use urllib2), but only for one link. For multiple sites, I make the requests one after the other, which is not very efficient.
What is the ideal solution (threads, I guess) to run multiple queries in parallel and stop the others when one query returns a specific string?
I found this question but I have not found how to change it to stop the remaining threads... :
Python urllib2.urlopen() is slow, need a better way to read several urls
Thank you in advance !
(sorry if I made mistakes in English, I'm French ^^)

You can use Twisted to deal with multiple requests concurrently. Internally it will use epoll (or IOCP or kqueue, depending on the platform) to get notified of TCP availability efficiently, which is cheaper than using threads. Once one request matches, you cancel the others.
Here is the Twisted http agent tutorial.

Usually this is implemented with the following pattern (sorry, my Python skills are not so good).
You have a class named Runner. This class has a long-running method, which gets the information you need. It also has a Cancel method, which interrupts the long-running method in some way (you can make the URL request object a class member field, so the Cancel method calls the equivalent of request.terminate()).
The long-running method needs to accept a callback function, which it calls to signal when it is done.
Then, before you start your many threads, you create instances of this class and keep them in a list. In the same loop you can start the long-running methods, passing a callback method of your main program.
And, in the callback method, you just go through the list of all the threaded instances and call their Cancel method.
Please, edit my answer with any Python specific implementation :)
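One possible Python rendering of this pattern is sketched below. It is an assumption-laden illustration: the HTTP request is simulated with a delay and a canned response text, and first_match, cancel_requested and the tuple layout of queries are hypothetical names chosen for the example.

```python
import threading


class Runner(threading.Thread):
    """One simulated query; delay and text stand in for a real request."""

    def __init__(self, name, delay, text, on_done):
        super().__init__(name=name)
        self.delay = delay
        self.text = text
        self.on_done = on_done
        self.cancel_requested = threading.Event()

    def cancel(self):
        self.cancel_requested.set()

    def run(self):
        # wait() returns True if cancel() was called before the delay ended.
        if self.cancel_requested.wait(self.delay):
            return
        self.on_done(self, self.text)


def first_match(queries, wanted):
    """Run all queries in parallel; return the name of the first match."""
    runners = []
    result = {}
    lock = threading.Lock()

    def on_done(runner, text):
        if wanted not in text:
            return
        with lock:
            if "winner" in result:
                return
            result["winner"] = runner.name
        # A match was found: tell every other runner to stop.
        for other in runners:
            if other is not runner:
                other.cancel()

    for name, delay, text in queries:
        runners.append(Runner(name, delay, text, on_done))
    for runner in runners:
        runner.start()
    for runner in runners:
        runner.join()
    return result.get("winner")
```

With a real blocking request, cancel() would instead have to close the connection or rely on a request timeout, since a flag alone cannot interrupt a blocked socket read.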

You can run your queries with the multiprocessing library, poll for results, and shutdown queries you no longer need. Documentation for the module includes information on the Process class which has a terminate() method. If you wish to limit the number of requests sent out, check out options for pooling.

Is there any way to make an asynchronous function call from Python [Django]?

I am creating a Django application that does various long computations with uploaded files. I don't want to make the user wait for the file to be handled - I just want to show the user a page reading something like 'file is being parsed'.
How can I make an asynchronous function call from a view?
Something that may look like that:
def view(request):
    ...
    if form.is_valid():
        form.save()
        async_call(handle_file)
    return render_to_response(...)

Rather than trying to manage this via subprocesses or threads, I recommend you separate it out completely. There are two approaches: the first is to set a flag in a database table somewhere, and have a cron job running regularly that checks the flag and performs the required operation.
The second option is to use a message queue. Your file upload process sends a message on the queue, and a separate listener receives the message and does what's needed. I've used RabbitMQ for this sort of thing, but others are available.
Either way, your user doesn't have to wait for the process to finish, and you don't have to worry about managing subprocesses.

I tried to do the same and failed after multiple attempts, due to the nature of Django and other asynchronous calls.
The solution I came up with, which could be a bit over the top for you, is to have another asynchronous server in the background processing message queues from the web request and emitting chunked JavaScript, which gets parsed directly by the browser in an asynchronous way (i.e. AJAX).
Everything is made transparent to the end user via a mod_proxy setting.

Unless you specifically need to use a separate process, which seems to be the gist of the other questions S.Lott is indicating as duplicate of yours, the threading module from the Python standard library (documented here) may offer the simplest solution. Just make sure that handle_file is not accessing any globals that might get modified, nor especially modifying any globals itself; ideally it should communicate with the rest of your process only through Queue instances; etc, etc, all the usual recommendations about threading;-).

threading will break runserver if I'm not mistaken. I've had good luck with multiprocessing in request handlers with mod_wsgi and runserver. Maybe someone can enlighten me as to why this is bad:
def _bulk_action(action, objs):
    # mean ponies here
    pass

def bulk_action(request, t):
    ...
    objs = model.objects.filter(pk__in=pks)
    if request.method == 'POST':
        objs.update(is_processing=True)
        from multiprocessing import Process
        p = Process(target=_bulk_action, args=(action, objs))
        p.start()
        return HttpResponseRedirect(next_url)
    context = {'t': t, 'action': action, 'objs': objs, 'model': model}
    return render_to_response(...)
http://docs.python.org/library/multiprocessing.html
New in 2.6
