Share a resource between two processes - python

I want to know the best practices followed to share a queue (resource) between two processes in Python. Here is a what each process is doing:
Process_1: continuously gets data (in json format) from a streaming api
Process_2: is a daemon (similar to Sander Marechal's code) which commits data (one at a time) into a database
So, Process_1 (or Producer) puts a unit of data onto this shared resource, and Process_2 (or Consumer) will poll this shared resource for any new units of data, and store them in a DB, if any.
There are some options which came to my mind:
Using pickle (drawback: extra overhead of pickling and de-pickling)
Passing data via stdout of Process_1
to stdin of Process_2 (drawback: none, but not sure how to implement this with a daemon)
Using the pool object in the multiprocessing library (drawback: not sure how to code this as one process is a daemon)
I would like an optimal solution practiced in this regard, with some code :). Thanks.

multiprocessing.pool isn't what you want in this case - it is useful for having multiple units of work done 'in the background' (concurrently), not so much for managing a shared resource. Since you appear to have the format of the communications worked out, and they communicate in only one direction, what you want is a multiprocessing.Queue - the documentation has a good example of how to use it - you will want your Process_1 putting data into the Queue as needed, and Process_2 calling q.get() in an infinite loop. This will cause the Consumer to block when there is nothing to do, rather than busy-waiting as you suggest (which can waste processor cycles). The issue that this leaves is closing the daemon - possibly the best way is to have the Producer put a sentinel value at the end of the queue, to ensure that the Consumer deals with all requests. Other alternatives include things like trying to forcibly kill the process when the child exits, but this is error-prone.
Note that this assumes that the Producer spawns the Consumer (or vice versa) - if the Consumer is a long-running daemon that can deal with multiple relatively short-lived Producers, the situation becomes quite a bit harder - there isn't, to my knowledge, any cross-platform high-level IPC module; the most portable (and generally easiest) way to handle this may be to use the filesystem as a queue - have a spool folder somewhere that the Producers write a text file to for each request; the Consumer can then process these at its leisure - however, this isn't without its own issues: you would need to ensure that the Consumer doesn't try to open a half-written instruction file, that the Producers aren't stepping on each other's toes, and that the Producers and the Consumer agree on the ordering of requests.

Related

Python multiprocessing - function-like communication between two processes

I've got the following problem:
I have two different classes; let's call them the interface and worker. The interface is supposed to accept requests from outside, and multiplexes them to several workers.
Contrary to almost every example I have found, I have several peculiarities:
The workers are not supposed to be recreated for every request.
The workers are different; a request for workers[0] cannot be answered by workers[1]. This multiplexing is done in interface.
I have a number of function-like calls which are difficult to model via events or simple queues.
There are a few different requests, which would make one queue per request difficult.
For example, assume that each worker is storing a single integer number (let's say the number of calls this worker received). In non-parallel processing, I'd use something like this:
class interface(object):
workers = None #set somewhere else.
def get_worker_calls(self, worker_id):
return self.workers[worker_id].get_calls()
class worker(object)
calls = 0
def get_calls(self):
self.calls += 1
return self.calls
This, obviously, doesn't work. What does?
Or, maybe more relevantly, I don't have experience with multiprocessing. Is there a design paradigm I'm missing that would easily solve the above?
Thanks!
For reference, I have considered several approaches, and I was unable to find a good one:
Use one request and answer queue. I've discarded this idea since that'd either block interface'for the answer-time of the current worker (making it badly scalable), or would require me sending around extra information.
Use of one request queue. Each message contains a pipe to return the answer to that request. After fixing the issue with being unable to send pipes via pipes, I've run into problems with pipe closing unless sending both ends over the connection.
Use of one request queue. Each message contains a queue to return the answer to that request. Fails since I cannot send queues via queues, but the reduction trick doesn't work.
The above also applies to the respective Manager-generated objects.
Multiprocessing means you have 2+ separated processes running. There is no way to access memory from one process to another directly (as with multithreading).
Your best shot is to use some kind of external Queue mechanism, you can start with Celery or RQ. RQ is simpler but celery has built-in monitoring.
But you have to know that Multiprocessing will work only if Celery/RQ are able to "pack" the needed functions/classes and send them to other process. Therefore you have to use __main__ level functions (that are in top of file, not belongs to any class).
You can always implement it yourself, Redis is very simple, ZeroMQ and RabbitMQ are also good.
Beaver library is good example of how to deal with multiprocessing in python using ZeroMQ queue.

Worker pool where certain tasks can only be done by certain workers

I have a lot of tasks that I'd like to execute a few at a time. The normal solution for this is a thread pool. However, my tasks need resources that only certain threads have. So I can't just farm a task out to any old thread; the thread has to have the resource the task needs.
It seems like there should be a concurrency pattern for this, but I can't seem to find it. I'm implementing this in Python 2 with multiprocessing, so answers in those terms would be great, but a generic solution is fine. In my case the "threads" are actually separate OS processes and the resources are network connections (and no, it's not a server, so (e)poll/select is not going to help). In general, a thread/process can hold several resources.
Here is a naive solution: put the tasks in a work queue and turn my thread pool loose on it. Have each thread check, "Can I do this task?" If yes, do it; if no, put it back in the queue. However, if each task can only be done by one of N threads, then I'm doing ~2N expensive, wasted accesses to a shared queue just to get one unit of work.
Here is my current thought: have a shared work queue for each resource. Farm out tasks to the matching queue. Each thread checks the queue(s) it can handle.
Ideas?
A common approach to this is to not allocate resources to threads and queue the appropriate resource in with the data, though I appreciate that this is not always possible if a resource is bound to a particular thread.
The idea of using a queue per resource with threads only popping objects from the queues containing objects it can handle may work.
It may be possible to use a semaphore+concurrentQueue array, indexed by resource, for signaling such threads and also providing a priority system, so eliminating most of the polling and wasteful requeueing. I will have to think a bit more about that - it kinda depends on how the resources map to the threads.

Python multiple processes instead of threads?

I am working on a web backend that frequently grabs realtime market data from the web, and puts the data in a MySQL database.
Currently I have my main thread push tasks into a Queue object. I then have about 20 threads that read from that queue, and if a task is available, they execute it.
Unfortunately, I am running into performance issues, and after doing a lot of research, I can't make up my mind.
As I see it, I have 3 options:
Should I take a distributed task approach with something like Celery?
Should I switch to JPython or IronPython to avoid the GIL issues?
Or should I simply spawn different processes instead of threads using processing?
If I go for the latter, how many processes is a good amount? What is a good multi process producer / consumer design?
Thanks!
Maybe you should use an event-driven approach, and use an event-driven oriented frameworks like twisted(python) or node.js(javascript), for example this frameworks make use of the UNIX domain sockets, so your consumer listens at some port, and your event generator object pushes all the info to the consumer, so your consumer don't have to check every time to see if there's something in the queue.
First, profile your code to determine what is bottlenecking your performance.
If each of your threads are frequently writing to your MySQL database, the problem may be disk I/O, in which case you should consider using an in-memory database and periodically write it to disk.
If you discover that CPU performance is the limiting factor, then consider using the multiprocessing module instead of the threading module. Use a multiprocessing.Queue object to push your tasks. Also make sure that your tasks are big enough to keep each core busy for a while, so that the granularity of communication doesn't kill performance. If you are currently using threading, then switching to multiprocessing would be the easiest way forward for now.

What is the optimal way to organize infinitely looped work queue?

I have about 1000-10000 jobs which I need to run on a constant basis each minute or so. Sometimes new job comes in or other needs to be cancelled but it's rare event. Jobs are tagged and must be disturbed among workers each of them processes only jobs of specific kind.
For now I want to use cron and load whole database of jobs in some broker -- RabbitMQ or beanstalkd (haven't decided which one to use though).
But this approach seems ugly to me (using timer to simulate infinity, loading the whole database, etc) and has the disadvantage: for example if some kind of jobs are processed slower than added into the queue it may be overwhelmed and message broker will eat all ram, swap and then just halt.
Is there any other possibilities? Am I not using right patterns for a job? (May be I don't need queue or something..?)
p.s. I'm using python if this is important.
You create your initial batch of jobs and add them to the queue.
You have n-consumers of the queue each running the jobs. Adding consumers to the queue simply round-robins the distribution of jobs to each listening consumer, giving you arbitrary horizontal scalability.
Each job can, upon completion, be responsible for resubmitting itself back to the queue. This means that your job queue won't grow beyond the length that it was when you initialised it.
The master job can, if need be, spawn sub-jobs and add them to the queue.
For different types of jobs it is probably a good idea to use different queues. That way you can balance the load more effectively by having different quantities/horsepower of workers running the jobs from the different queues.
The fact that you are running Python isn't important here, it's the pattern, not the language that you need to nail first.
You can use asynchronous framework, e.g. Twisted
I don't think either it's a good idea to run script by cron daemon each minute (and you mentioned reasons), so I offer you Twisted. It doesn't give you benefit with scheduling, but you get flexibility in process management and memory sharing

Should I use epoll or just blocking recv in threads?

I'm trying to write a scalable custom web server.
Here's what I have so far:
The main loop and request interpreter are in Cython. The main loop accepts connections and assigns the sockets to one of the processes in the pool (has to be processes, threads won't get any benefit from multi-core hardware because of the GIL).
Each process has a thread pool. The process assigns the socket to a thread.
The thread calls recv (blocking) on the socket and waits for data. When some shows up, it gets piped into the request interpreter, and then sent via WSGI to the application running in that thread.
Now I've heard about epoll and am a little confused. Is there any benefit to using epoll to get socket data and then pass that directly to the processes? Or should I just go the usual route of having each thread wait on recv?
PS: What is epoll actually used for? It seems like multithreading and blocking fd calls would accomplish the same thing.
If you're already using multiple threads, epoll doesn't offer you much additional benefit.
The point of epoll is that a single thread can listen for activity on many file selectors simultaneously (and respond to events on each as they occur), and thus provide event-driven multitasking without requiring the spawning of additional threads. Threads are relatively cheap (compared to spawning processes), but each one does require some overhead (after all, they each have to maintain a call stack).
If you wanted to, you could rewrite your pool processes to be single-threaded using epoll, which would reduce your overall thread usage count, but of course you'd have to consider whether that's something you care about or not - in general, for low numbers of simultaneous requests on each worker, the overhead of spawning threads wouldn't matter, but if you want each worker to be able to handle 1000s of open connections, that overhead can become significant (and that's where epoll shines).
But...
What you're describing sounds suspiciously like you're basically reinventing the wheel - your:
main loop and request interpreter
pool of processes
sounds almost exactly like:
nginx (or any other load balancer/reverse proxy)
A pre-forking tornado app
Tornado is a single-threaded web server python module using epoll, and it has the capability built-in for pre-forking (meaning that it spawns multiple copies of itself as separate processes, effectively creating a process pool). Tornado is based on the tech created to power Friendfeed - they needed a way to handle huge numbers of open connections for long-polling clients looking for new real-time updates.
If you're doing this as a learning process, then by all means, reinvent away! It's a great way to learn. But if you're actually trying to build an application on top of these kinds of things, I'd highly recommend considering using the existing, stable, communally-developed projects - it'll save you a lot of time, false starts, and potential gotchas.
(P.S. I approve of your avatar. <3)
The epoll function (and the other functions in the same family poll and select) allow you to write single threading networking code that manage multiple networking connection. Since there is no threading, there is no need fot synchronisation as would be required in a multi-threaded program (this can be difficult to get right).
On the other hand, you'll need to have an explicit state machine for each connection. In a threaded program, this state machine is implicit.
Those function just offer another way to multiplex multiple connexion in a process. Sometimes it is easier not to use threads, other times you're already using threads, and thus it is easier just to use blocking sockets (which release the GIL in Python).

Categories