How to refactor Python code dependant on fork() copying state

How to refactor Python code dependant on fork() copying state - python

I'm working on a large-ish Python code base that's been around for over a decade now. The application in question makes use of forking for it's parallelism.
The basic premise is that the user asks the program to build a particular target, we figure out a dependancy graph for the target, then from topological partitions in the build graph figure out some tasks we can perform in parallel. We then fork some processes to perform those tasks (from the partition) in parallel.
It all kind of works. However I'd like to refactor it NOT to depend on fork(). In particular, it's the dependancy on state from the master process being available in the child processes that's a problem.
There are a couple of motivating factors for the refactoring:
I'd like to have the code as similar as possible between Linux
and Windows (currently on Windows we perform a non forking build,
thus no parallelism)
The forking is a little ugly with respect to
other refactoring I want to do (basically, I'd like to have more
centralised control and monitoring of building). Instead of forking,
I'd like to go through the Python Multiprocessing module (which I've
used in the past with good results).
The problem is quite a lot of data structures that are currently used by the forked processes (which were set up by the master process) cannot easily be serialized (nor can they be inferred for construction by the child process). Open file descriptors is one such example, dependancy on object identity (build graph) is another.
Basically, I'm looking for advice on how to best approach this problem holistically.

I propose following paradigm
Master is a single process and does all the dependency resolution, graph partitioning, etc, down to single, individual job. Thus there is only one copy of system state.
These leaf jobs are offloaded using subprocess or multiprocessing or os.system.
The simpler offload mechanism, the more platform independence :)
Leaves are of course asynchronous, thus you need a framework for handling asynchronous notifications -- you can use gevent or some library that implements futures. If you are truly hardcore, twisted. Python 3.x also brings in asyncio that may be useful.
You can also use a resource / executor pool with ad-hoc notifications, e.g. post-order tranversal which I think can be implemented relatively simply using a recursive function, or in your case, recursive generator.

Related

Best practice for communication between different modules in python

I revisited the development of an application in python I wrote 5 years ago (It was my first interesting project/application in python 2.7, and as I revisit it now I am struggling to make heads or tails). The application had the following units/modules:
got data from a camera (Data Aquisition module - DAQ)
processed the data and determined whether motor should be moved (Control)
Moved an actuator (Actuator)
displayed the data and activated switches on several tk windows. (Display)
Back then (in python 2.7) I developed separate modules for each unit mentioned above. Different threads were spawned for each unit I used a Queue to pass the image data, and the control commands between the Control and the Actuator module.
As the options (and the buzzwords) for Interprocess Communication have multiplied, I was hoping to get an idea of which concepts should I look into. e.g. in 3.10 there is the asyncio with concepts like Coroutines and Tasks, Synchronization Primitives, Subprocesses, Queues.
My main goal is to be able to write separate units that run independently, so that (in theory) it is easier to debug the unit (or even write unit tests).
UPDATE:
As it was mentioned in the comment, what I am describing is communication between threads of the same process, therefore the IPC tag might not be appropriate. I will elaborate below why I chose the IPC tag.
Although I previously developed the software with a single process and communication between the different threads, I recall that this was not optimal because of GIL. GIL imposed sleeping on the Data Acquisition thread at random intervals which made the data collection random at high sampling rates (the application could work adequately because of the low requirements).
Ideally, I would like to investigate a scheme where I could have a separate process collecting data. To my understanding different processes run on different processors and therefore should not be affected by GIL.

Interprocess Communication specifically refers to multiple separate processes communicating. It sounds like you have multiple modules communicating within a single process. (asyncio and threads are intra-process, they do not span processes.) You can use queues if you need for that.
If you need something to be a separate process (e.g. to sidestep the GIL), you can use the multiprocessing module and its objects and primitives for communication. You should howver be aware that there's somewhat significant overhead in serialization there that you wouldn't have in a single process.

Multithreading With Very Large Number of Threads

I'm working on simulating a mesh network with a large number of nodes. The nodes pass data between different master nodes throughout the network.
Each master comes live once a second to receive the information, but the slave nodes don't know when the master is up or not, so when they have information to send, they try and do so every 5 ms for 1 second to make sure they can find the master.
Running this on a regular computer with 1600 nodes results in 1600 threads and the performance is extremely bad.
What is a good approach to handling the threading so each node acts as if it is running on its own thread?
In case it matters, I'm building the simulation in python 2.7, but I'm open to changing to something else if that makes sense.

For one, are you really using regular, default Python threads available in the default Python 2.7 interpreter (CPython), and is all of your code in Python? If so, you are probably not actually using multiple CPU cores because of the global interpreter lock CPython has (see https://wiki.python.org/moin/GlobalInterpreterLock). You could maybe try running your code under Jython, just to check if performance will be better.
You should probably rethink your application architecture and switch to manually scheduling events instead of using threads, or maybe try using something like greenlets (https://stackoverflow.com/a/15596277/1488821), but that would probably mean less precise timings because of lack of parallelism.

To me, 1600 threads sounds like a lot but not excessive given that it's a simulation. If this were a production application it would probably not be production-worthy.
A standard machine should have no trouble handling 1600 threads. As to the OS this article could provide you with some insights.
When it comes to your code a Python script is not a native application but an interpreted script and as such will require more CPU resources to execute.
I suggest you try implementing the simulation in C or C++ instead which will produce a native application which should execute more efficiently.

Do not use threading for that. If sticking to Python, let the nodes perform their actions one by one. If the performance you get doing so is OK, you will not have to use C/C++. If the actions each node perform are simple, that may work. Anyway, there is no reason to use threads in Python at all. Python threads are usable mostly for making blocking I/O not to block your program, not for multiple CPU kernels utilization.
If you want to really use parallel processing and to write your nodes as if they were really separated and exchanging only using messages, you may use Erlang (http://www.erlang.org/). It is a functional language very well suited for executing parallel processes and making them exchange messages. Erlang processes do not map to OS threads, and you may create thousands of them. However, Erlang is a purely functional language and may seem extremely strange if you have never used such languages. And it also is not very fast, so, like Python, it is unlikely to handle 1600 actions every 5ms unless the actions are rather simple.
Finally, if you do not get desired performance using Python or Erlang, you may move to C or C++. However, still do not use 1600 threads. In fact, using threads to gain performance is reasonable only if the number of threads does not dramatically exceed number of CPU kernels. A reactor pattern (with several reactor threads) is what you may need in that case (http://en.wikipedia.org/wiki/Reactor_pattern). There is an excellent implementation of the reactor pattern in boost.asio library. It is explained here: http://www.gamedev.net/blog/950/entry-2249317-a-guide-to-getting-started-with-boostasio/

Some random thoughts here:
I did rather well with several hundred threads working like this in Java; it can be done with the right language. (But I haven't tried this in Python.)
In any language, you could run the master node code in one thread; just have it loop continuously, running the code for each master in each cycle. You'll lose the benefits of multiple cores that way, though. On the other hand, you'll lose the problems of multithreading, too. (You could have, say, 4 such threads, utilizing the cores but getting the multithreading headaches back. It'll keep the thread-overhead down, too, but then there's blocking...)
One big problem I had was threads blocking each other. Enabling 100 threads to call the same method on the same object at the same time without waiting for each other requires a bit of thought and even research. I found my multithreading program at first often used only 25% of a 4-core CPU even when running flat out. This might be one reason you're running slow.
Don't have your slave nodes repeat sending data. The master nodes should come alive in response to data coming in, or have some way of storing it until they do come alive, or some combination.
It does pay to have more threads than cores. Once you have two threads, they can block each other (and will if they share any data). If you have code to run that won't block, you want to run it in its own thread so it won't be waiting for code that does block to unblock and finish. I found once I had a few threads, they started to multiply like crazy--hence my hundreds-of-threads program. Even when 100 threads block at one spot despite all my brilliance, there's plenty of other threads to keep the cores busy!

Maximally simple django timed/scheduled tasks (e.g.: reminders)?

Question is relevant to this and this;
the difference is, I'd prefer something with possibly more precision and low load (per-minute cron job isn't preferable for those) and with minimal overhead (i.e. installing celery with rabbitmq seems like a big overkill).
An example task for such is personal reminders server (with reminders that could be edited over web and sent out through e-mail or XMPP).
I'm probably looking for something more like node.js's setTimeout but for django (and though I might prefer to implement reminders in node.js anyway, it's still a possibly interesting question).
For example, it's possible to start new threads in django app (with functions consisting of sleep() and send()); in what ways this can be bad?

The problem with using threads for this solution are the typical problems with Python threads that always drive people towards multi-process solutions instead. The problem is compounded here by the fact your thread isn't driven by the normal request-response cycle. This is summarized nicely by Malcolm Tredinnick here:
Have to disagree. Threads are not a good solution to this problem. The
issue is process management. As written, your threads will never be
rejoined. Webserver processes have a lifecycle uncontrollable by you
(the MaxRequestsPerChild Apache parameter and similar things in other
servers) and you are messing with that by using threads.
If you need a process with a lifecycle that is not matched by the
request-response path — something long running and independent of the
response — a completely separate process is definitely the right model
to use. Using a thread is tying it to the response lifecycle, which
wil have unintended side-effects.
A possible solution for you might be to have a long running process performing your tasks which gets a wake-up signal from a light cron process.
Another possibility would be build something using 0mq, which is much lighter than AMQP style queues (at the cost of some features of course). Tarek Ziade is working on a Mozilla project called powerhose that uses 0mq, looks super simple, and has a heartbeat capability with resolution to the second.

Twisted Threading + MapReduce on a single node/server?

I'm confused about Twisted threading.
I've heard and read more than a few articles, books, and sat through a few presentations on the subject of threading vs processes in Python. It just seems to me that unless one is doing lots of IO or wanting to utilize shared memory across jobs, then the right choice is to use multiprocessing.
However, from what I've seen so far, it seems like Twisted uses Threads (pThreads from the python threading module). And Twisted seems to perform really really well in processing lots of data.
I've got a fairly large number of processes that I'd like to distribute processing to using the MapReduce pattern in Python on a single node/server. They don't do any IO really, they just do a lot of processing.
Is the Twisted reactor the right tool for this job?

The short answer to your question: no, twisted threading is not the right solution for heavy processing.
If you have a lot of processing to do, twisted's threading will still be subject to the GIL (Global Interpreter Lock). Without going into a long in depth explanation, the GIL is what allows only one thread at a time to execute python code. What this means in effect is you will not be able to take advantage of multiple cores with a single multi-threaded twisted process. That said, some C modules (such as bits of SciPy) can release the GIL and run multi-threaded, though the python code associated is still effectively single-threaded.
What twisted's threading is mainly useful for is using it along with blocking I/O based modules. A prime example of this is database API's, because the db-api spec doesn't account for asynchronous use cases, and most database modules adhere to the spec. Thusly, to use PostgreSQL for example from a twisted app, one has to either block or use something like twisted.enterprise.adbapi which is a wrapper that uses twisted.internet.threads.deferToThread to allow a SQL query to execute while other stuff is going on. This can allow other python code to run because the socket module (among most others involving operating system I/O) will release the GIL while in a system call.
That said, you can use twisted to write a network application talking to many twisted (or non-twisted, if you'd like) workers. Each worker could then work on little bits of work, and you would not be restricted by the GIL, because each worker would be its own completely isolated process. The master process can then make use of many of twisted's asynchronous primitives. For example you could use a DeferredList to wait on a number of results coming from any number of workers, and then run a response handler when all of the Deferred's complete. (thus allowing you to do your map call) If you want to go down this route, I recommend looking at twisted.protocols.amp, which is their Asynchronous Message Protocol, and can be used very trivially to implement a network-based RPC or map-reduce.
The downside of running many disparate processes versus something like multiprocessing is that
you lose out on simple process management, and
the subprocesses can't share memory as if they would if they were forked on a unix system.
Though for modern systems, 2) is rarely a problem unless you are running hundreds of subprocesses. And problem 1) can be solved by using a process management system like supervisord
Edit For more on python and the GIL, you should watch Dave Beazley's talks on the subject ( website , video, slides )

Threading in Python [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
What are the modules used to write multi-threaded applications in Python? I'm aware of the basic concurrency mechanisms provided by the language and also of Stackless Python, but what are their respective strengths and weaknesses?

In order of increasing complexity:
Use the threading module
Pros:
It's really easy to run any function (any callable in fact) in its
own thread.
Sharing data is if not easy (locks are never easy :), at
least simple.
Cons:
As mentioned by Juergen Python threads cannot actually concurrently access state in the interpreter (there's one big lock, the infamous Global Interpreter Lock.) What that means in practice is that threads are useful for I/O bound tasks (networking, writing to disk, and so on), but not at all useful for doing concurrent computation.
Use the multiprocessing module
In the simple use case this looks exactly like using threading except each task is run in its own process not its own thread. (Almost literally: If you take Eli's example, and replace threading with multiprocessing, Thread, with Process, and Queue (the module) with multiprocessing.Queue, it should run just fine.)
Pros:
Actual concurrency for all tasks (no Global Interpreter Lock).
Scales to multiple processors, can even scale to multiple machines.
Cons:
Processes are slower than threads.
Data sharing between processes is trickier than with threads.
Memory is not implicitly shared. You either have to explicitly share it or you have to pickle variables and send them back and forth. This is safer, but harder. (If it matters increasingly the Python developers seem to be pushing people in this direction.)
Use an event model, such as Twisted
Pros:
You get extremely fine control over priority, over what executes when.
Cons:
Even with a good library, asynchronous programming is usually harder than threaded programming, hard both in terms of understanding what's supposed to happen and in terms of debugging what actually is happening.
In all cases I'm assuming you already understand many of the issues involved with multitasking, specifically the tricky issue of how to share data between tasks. If for some reason you don't know when and how to use locks and conditions you have to start with those. Multitasking code is full of subtleties and gotchas, and it's really best to have a good understanding of concepts before you start.

You've already gotten a fair variety of answers, from "fake threads" all the way to external frameworks, but I've seen nobody mention Queue.Queue -- the "secret sauce" of CPython threading.
To expand: as long as you don't need to overlap pure-Python CPU-heavy processing (in which case you need multiprocessing -- but it comes with its own Queue implementation, too, so you can with some needed cautions apply the general advice I'm giving;-), Python's built-in threading will do... but it will do it much better if you use it advisedly, e.g., as follows.
"Forget" shared memory, supposedly the main plus of threading vs multiprocessing -- it doesn't work well, it doesn't scale well, never has, never will. Use shared memory only for data structures that are set up once before you spawn sub-threads and never changed afterwards -- for everything else, make a single thread responsible for that resource, and communicate with that thread via Queue.
Devote a specialized thread to every resource you'd normally think to protect by locks: a mutable data structure or cohesive group thereof, a connection to an external process (a DB, an XMLRPC server, etc), an external file, etc, etc. Get a small thread pool going for general purpose tasks that don't have or need a dedicated resource of that kind -- don't spawn threads as and when needed, or the thread-switching overhead will overwhelm you.
Communication between two threads is always via Queue.Queue -- a form of message passing, the only sane foundation for multiprocessing (besides transactional-memory, which is promising but for which I know of no production-worthy implementations except In Haskell).
Each dedicated thread managing a single resource (or small cohesive set of resources) listens for requests on a specific Queue.Queue instance. Threads in a pool wait on a single shared Queue.Queue (Queue is solidly threadsafe and won't fail you in this).
Threads that just need to queue up a request on some queue (shared or dedicated) do so without waiting for results, and move on. Threads that eventually DO need a result or confirmation for a request queue a pair (request, receivingqueue) with an instance of Queue.Queue they just made, and eventually, when the response or confirmation is indispensable in order to proceed, they get (waiting) from their receivingqueue. Be sure you're ready to get error-responses as well as real responses or confirmations (Twisted's deferreds are great at organizing this kind of structured response, BTW!).
You can also use Queue to "park" instances of resources which can be used by any one thread but never be shared among multiple threads at one time (DB connections with some DBAPI compoents, cursors with others, etc) -- this lets you relax the dedicated-thread requirement in favor of more pooling (a pool thread that gets from the shared queue a request needing a queueable resource will get that resource from the apppropriate queue, waiting if necessary, etc etc).
Twisted is actually a good way to organize this minuet (or square dance as the case may be), not just thanks to deferreds but because of its sound, solid, highly scalable base architecture: you may arrange things to use threads or subprocesses only when truly warranted, while doing most things normally considered thread-worthy in a single event-driven thread.
But, I realize Twisted is not for everybody -- the "dedicate or pool resources, use Queue up the wazoo, never do anything needing a Lock or, Guido forbid, any synchronization procedure even more advanced, such as semaphore or condition" approach can still be used even if you just can't wrap your head around async event-driven methodologies, and will still deliver more reliability and performance than any other widely-applicable threading approach I've ever stumbled upon.

It depends on what you're trying to do, but I'm partial to just using the threading module in the standard library because it makes it really easy to take any function and just run it in a separate thread.
from threading import Thread
def f():
...
def g(arg1, arg2, arg3=None):
....
Thread(target=f).start()
Thread(target=g, args=[5, 6], kwargs={"arg3": 12}).start()
And so on. I often have a producer/consumer setup using a synchronized queue provided by the Queue module
from Queue import Queue
from threading import Thread
q = Queue()
def consumer():
while True:
print sum(q.get())
def producer(data_source):
for line in data_source:
q.put( map(int, line.split()) )
Thread(target=producer, args=[SOME_INPUT_FILE_OR_SOMETHING]).start()
for i in range(10):
Thread(target=consumer).start()

Kamaelia is a python framework for building applications with lots of communicating processes.
(source: kamaelia.org) Kamaelia - Concurrency made useful, fun
In Kamaelia you build systems from simple components that talk to each other. This speeds development, massively aids maintenance and also means you build naturally concurrent software. It's intended to be accessible by any developer, including novices. It also makes it fun :)
What sort of systems? Network servers, clients, desktop applications, pygame based games, transcode systems and pipelines, digital TV systems, spam eradicators, teaching tools, and a fair amount more :)
Here's a video from Pycon 2009. It starts by comparing Kamaelia to Twisted and Parallel Python and then gives a hands on demonstration of Kamaelia.
Easy Concurrency with Kamaelia - Part 1 (59:08)
Easy Concurrency with Kamaelia - Part 2 (18:15)

Regarding Kamaelia, the answer above doesn't really cover the benefit here. Kamaelia's approach provides a unified interface, which is pragmatic not perfect, for dealing with threads, generators & processes in a single system for concurrency.
Fundamentally it provides a metaphor of a running thing which has inboxes, and outboxes. You send messages to outboxes, and when wired together, messages flow from outboxes to inboxes. This metaphor/API remains the same whether you're using generators, threads or processes, or speaking to other systems.
The "not perfect" part is due to syntactic sugar not being added as yet for inboxes and outboxes (though this is under discussion) - there is a focus on safety/usability in the system.
Taking the producer consumer example using bare threading above, this becomes this in Kamaelia:
Pipeline(Producer(), Consumer() )
In this example it doesn't matter if these are threaded components or otherwise, the only difference is between them from a usage perspective is the baseclass for the component. Generator components communicate using lists, threaded components using Queue.Queues and process based using os.pipes.
The reason behind this approach though is to make it harder to make hard to debug bugs. In threading - or any shared memory concurrency you have, the number one problem you face is accidentally broken shared data updates. By using message passing you eliminate one class of bugs.
If you use bare threading and locks everywhere you're generally working on the assumption that when you write code that you won't make any mistakes. Whilst we all aspire to that, it's very rare that will happen. By wrapping up the locking behaviour in one place you simplify where things can go wrong. (Context handlers help, but don't help with accidental updates outside the context handler)
Obviously not every piece of code can be written as message passing and shared style which is why Kamaelia also has a simple software transactional memory (STM), which is a really neat idea with a nasty name - it's more like version control for variables - ie check out some variables, update them and commit back. If you get a clash you rinse and repeat.
Relevant links:
Europython 09 tutorial
Monthly releases
Mailing list
Examples
Example Apps
Reusable components (generator & thread)
Anyway, I hope that's a useful answer. FWIW, the core reason behind Kamaelia's setup is to make concurrency safer & easier to use in python systems, without the tail wagging the dog. (ie the big bucket of components
I can understand why the other Kamaelia answer was modded down, since even to me it looks more like an ad than an answer. As the author of Kamaelia it's nice to see enthusiasm though I hope this contains a bit more relevant content :-)
And that's my way of saying, please take the caveat that this answer is by definition biased, but for me, Kamaelia's aim is to try and wrap what is IMO best practice. I'd suggest trying a few systems out, and seeing which works for you. (also if this is inappropriate for stack overflow, sorry - I'm new to this forum :-)

I would use the Microthreads (Tasklets) of Stackless Python, if I had to use threads at all.
A whole online game (massivly multiplayer) is build around Stackless and its multithreading principle -- since the original is just to slow for the massivly multiplayer property of the game.
Threads in CPython are widely discouraged. One reason is the GIL -- a global interpreter lock -- that serializes threading for many parts of the execution. My experiance is, that it is really difficult to create fast applications this way. My example codings where all slower with threading -- with one core (but many waits for input should have made some performance boosts possible).
With CPython, rather use seperate processes if possible.

If you really want to get your hands dirty, you can try using generators to fake coroutines. It probably isn't the most efficient in terms of work involved, but coroutines do offer you very fine control of co-operative multitasking rather than pre-emptive multitasking you'll find elsewhere.
One advantage you'll find is that by and large, you will not need locks or mutexes when using co-operative multitasking, but the more important advantage for me was the nearly-zero switching speed between "threads". Of course, Stackless Python is said to be very good for that as well; and then there's Erlang, if it doesn't have to be Python.
Probably the biggest disadvantage in co-operative multitasking is the general lack of workaround for blocking I/O. And in the faked coroutines, you'll also encounter the issue that you can't switch "threads" from anything but the top level of the stack within a thread.
After you've made an even slightly complex application with fake coroutines, you'll really begin to appreciate the work that goes into process scheduling at the OS level.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.